Resilient Data: Exploring Replication And Recovery In Apache Ozone

The ASF
7 Dec 2023 · 29:37

Summary

TL;DR: This talk explores Apache Ozone, a scalable, distributed object store designed for fault-tolerant data storage and disaster recovery. The speaker details Ozone’s architecture, including Ozone Manager, Storage Container Manager, and Recon for observability, highlighting how metadata and data blocks are managed separately. Key features such as three-way replication, erasure coding, and high availability mechanisms ensure consistency and durability. The session also covers failure recovery, backup strategies like Trash, cross-cluster replication, and the snapshot feature leveraging RocksDB checkpoints for incremental replication. Overall, the talk emphasizes Ozone’s ability to provide resilient, highly available data infrastructure for critical, large-scale applications.

Takeaways

  • 😀 Apache Ozone is a distributed object store designed to handle billions of objects and address HDFS's small file problem by separating namespace and block management.
  • 😀 Ozone uses containers (default size 5 GB) as larger storage units to reduce metadata overhead and improve scalability.
  • 😀 The architecture includes Ozone Manager (OM) for metadata, Storage Container Manager (SCM) for block management, and DataNodes for actual data storage.
  • 😀 Users can access Ozone via o3fs, ofs, the S3 API, or shell clients, with load balancing for S3 gateway requests (a minimal client write sketch follows this list).
  • 😀 Replication is fundamental for fault tolerance, with three-way replication by default and optional erasure coding to reduce storage overhead.
  • 😀 Distributed consensus and consistency are maintained using the Raft protocol via Apache Ratis, ensuring reliable replication across nodes.
  • 😀 High availability is implemented for master nodes (OM and SCM), storing metadata in RocksDB and replicating across multiple nodes with leader election for write propagation.
  • 😀 Failure recovery involves monitoring DataNodes via heartbeats and container reports, closing pipelines on node failure, and using the Replication Manager to maintain desired replica counts.
  • 😀 Backup and recovery strategies include Trash (bucket-level deletion protection), cross-cluster replication with DistCp, support for different bucket types, and file system checksums to maintain data integrity.
  • 😀 Snapshots provide immutable, point-in-time copies of the object store using RocksDB checkpointing and are essential for incremental replication, backup, and disaster recovery.
  • 😀 Data in Ozone follows a hierarchy of storage units: Chunk (4 MB) → Block → Container (5 GB), optimizing write operations and replication management.
  • 😀 Snapshots and DistCp allow efficient incremental replication by copying only the changes between snapshots and applying renames or deletions correctly on the target system.
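
To make the access paths and the chunk → block → container hierarchy above concrete, here is a minimal sketch using Ozone's native Java client; the volume, bucket, and key names are placeholders, and the exact client API can differ slightly between Ozone releases.

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.ObjectStore;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.OzoneVolume;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

import java.nio.charset.StandardCharsets;

public class OzoneWriteExample {
  public static void main(String[] args) throws Exception {
    // Client configuration; ozone-site.xml on the classpath supplies the OM address.
    OzoneConfiguration conf = new OzoneConfiguration();
    try (OzoneClient client = OzoneClientFactory.getRpcClient(conf)) {
      ObjectStore store = client.getObjectStore();

      // Namespace hierarchy: volume -> bucket -> key (names here are placeholders).
      store.createVolume("vol1");
      OzoneVolume volume = store.getVolume("vol1");
      volume.createBucket("bucket1");
      OzoneBucket bucket = volume.getBucket("bucket1");

      // The key's data is split into 4 MB chunks grouped into blocks,
      // which in turn live inside 5 GB containers on the DataNodes.
      byte[] data = "hello ozone".getBytes(StandardCharsets.UTF_8);
      try (OzoneOutputStream out = bucket.createKey("key1", data.length)) {
        out.write(data);
      }
    }
  }
}
```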

Q & A

  • What is Apache Ozone and why was it developed?

    -Apache Ozone is a distributed object store capable of handling billions of objects. It was developed to address scalability issues in HDFS, particularly the small file problem, by separating namespace management (handled by Ozone Manager) and block storage (handled by Storage Container Manager).

  • How does Ozone differ from traditional HDFS in terms of architecture?

    -Unlike HDFS, which uses a single NameNode for both namespace and block management, Ozone separates these roles into Ozone Manager (OM) for metadata and namespace and Storage Container Manager (SCM) for managing storage containers. This reduces load and improves scalability.

  • What are containers in Ozone and how do they compare to blocks?

    -Containers are large storage units in Ozone, typically 5GB by default, that encapsulate multiple blocks. They reduce the number of reports sent to SCM, unlike smaller blocks in HDFS, improving performance and manageability.
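
As a small illustration, the 5 GB default is a cluster-wide setting rather than a value baked into the code. A hedged sketch of reading it through the Hadoop configuration API follows; the property name matches the Ozone configuration reference, but treat the snippet as illustrative.

```java
import org.apache.hadoop.conf.StorageUnit;
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

public class ContainerSizeExample {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();
    // "ozone.scm.container.size" controls how large a container may grow before
    // it is closed; 5GB is the shipped default, so each container aggregates many
    // blocks and DataNodes report containers (not individual blocks) to SCM.
    double bytes = conf.getStorageSize("ozone.scm.container.size", "5GB", StorageUnit.BYTES);
    System.out.printf("Container size: %.0f bytes%n", bytes);
  }
}
```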

  • How can Ozone be accessed by users or applications?

    -Ozone can be accessed as a Hadoop-compatible file system using o3fs or ofs (the global file system view), as an object store via the S3 API, or through its built-in shell client. It supports integration with Hadoop ecosystem tools like Hive and Spark.
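
For the file system route, here is a minimal sketch using the standard Hadoop FileSystem API with the ofs scheme, assuming the Ozone filesystem client jar is on the classpath; the host and paths are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OfsAccessExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // ofs exposes the whole object store (all volumes and buckets) as one tree;
    // "om-host" is a placeholder for the Ozone Manager service address.
    FileSystem fs = FileSystem.get(URI.create("ofs://om-host/"), conf);

    Path file = new Path("/vol1/bucket1/dir/report.csv");
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("a,b,c\n");
    }
    // Hive, Spark, and other Hadoop-ecosystem tools reach Ozone through this
    // same FileSystem interface, which is why they work without code changes.
    System.out.println("exists: " + fs.exists(file));
  }
}
```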

  • What role does the Recon server play in Ozone?

    -The Recon server provides observability and analytics for the Ozone system, offering a comprehensive view of keys, volumes, buckets, and the state of distributed components such as Ozone Manager and Storage Container Manager.

  • How does Ozone ensure data replication and fault tolerance?

    -Ozone maintains three-way replication of containers by default and can also use erasure coding to reduce storage overhead. Distributed consensus for consistency is achieved via the Raft protocol, which ensures replicated data remains consistent across nodes.
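
Below is a hedged sketch of requesting three-way Ratis (Raft) replication explicitly when writing a key through the Java client; the createKey overload used here follows the documented client API, though newer releases favor passing a ReplicationConfig object instead.

```java
import java.util.HashMap;
import org.apache.hadoop.hdds.client.ReplicationFactor;
import org.apache.hadoop.hdds.client.ReplicationType;
import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

public class ReplicatedWriteExample {
  public static void main(String[] args) throws Exception {
    OzoneConfiguration conf = new OzoneConfiguration();
    try (OzoneClient client = OzoneClientFactory.getRpcClient(conf)) {
      OzoneBucket bucket = client.getObjectStore()
          .getVolume("vol1").getBucket("bucket1");   // placeholder names

      byte[] data = new byte[]{1, 2, 3};
      // Request Ratis pipelines with three replicas: a write is acknowledged
      // only after the Raft leader has replicated it to the follower DataNodes,
      // which is what keeps the copies consistent.
      try (OzoneOutputStream out = bucket.createKey(
          "key1", data.length, ReplicationType.RATIS, ReplicationFactor.THREE,
          new HashMap<>())) {
        out.write(data);
      }
      // Erasure-coded buckets or keys are selected the same way via a
      // replication configuration instead of a factor, trading 3x storage
      // overhead for a smaller parity overhead at the cost of decoding on read.
    }
  }
}
```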

  • What mechanisms does Ozone use to achieve high availability for master nodes?

    -Both Ozone Manager and Storage Container Manager are highly available using three-way replication of their metadata in RocksDB. Leader election via the Raft protocol ensures continuous availability and consistency of metadata.

  • How does Ozone recover from failures of data nodes or pipelines?

    -When a data node fails or a pipeline is detected as unresponsive, SCM closes the pipeline, marking containers as closed. The Replication Manager ensures the desired number of replicas are maintained by copying or removing container replicas as needed.

  • What backup and disaster recovery strategies are available in Ozone?

    -Ozone offers several strategies, including Trash for protection against accidental deletion, cross-cluster replication using DistCp for distributed backups, and snapshots for point-in-time, immutable copies of the object store that support incremental replication and disaster recovery.
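
For the Trash piece specifically, here is a minimal sketch using the standard Hadoop Trash API against a bucket mounted through ofs; the paths are placeholders, and trash must be enabled via fs.trash.interval.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.trash.interval", "1440");  // keep deleted files for 24 hours

    FileSystem fs = FileSystem.get(URI.create("ofs://om-host/"), conf);
    Path victim = new Path("/vol1/bucket1/important.csv");

    // Instead of fs.delete(victim, false), move the file into the bucket's
    // .Trash directory so an accidental deletion can be undone until the
    // trash interval expires.
    boolean moved = Trash.moveToAppropriateTrash(fs, victim, conf);
    System.out.println("moved to trash: " + moved);
  }
}
```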

  • How do snapshots work in Ozone and what are their applications?

    -Snapshots are read-only, point-in-time copies of the object store created using RocksDB checkpoints. They are immutable and used for incremental replication, backup, and disaster recovery, allowing tools like DistCp to copy only the differences between snapshots to target clusters.

  • What is the role of RocksDB in Ozone’s metadata management?

    -RocksDB is an LSM-tree based embedded key-value store used by Ozone Manager and SCM to store metadata such as keys, buckets, and volumes. It batches write operations in memory and periodically flushes them to disk, creating SST files, which enable checkpointing and snapshots.
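
To make the checkpoint mechanism concrete, here is a minimal, self-contained sketch that drives the RocksDB Java API directly; Ozone Manager wraps this machinery internally, so the paths and key format below are placeholders rather than Ozone's actual layout.

```java
import org.rocksdb.Checkpoint;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;

public class CheckpointExample {
  public static void main(String[] args) throws Exception {
    RocksDB.loadLibrary();
    try (Options options = new Options().setCreateIfMissing(true);
         RocksDB db = RocksDB.open(options, "/tmp/om-metadata-db")) {

      // Metadata writes (keys, buckets, volumes) land in the memtable and are
      // periodically flushed to immutable SST files on disk.
      db.put("volume/bucket/key1".getBytes(), "block-locations...".getBytes());

      // A checkpoint hard-links the current SST files into a new directory:
      // a cheap, point-in-time, read-only view -- the basis of Ozone snapshots.
      try (Checkpoint checkpoint = Checkpoint.create(db)) {
        checkpoint.createCheckpoint("/tmp/om-metadata-db-snapshot-1");
      }
    }
  }
}
```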

  • How does Ozone support incremental replication using snapshots?

    -Ozone uses snapshots to identify changes between two points in time. Tools like DistCp calculate the differences (snap diff) between snapshots and copy only modified or new files to the target cluster, applying rename and delete operations as needed to maintain consistency.
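
The sketch below is purely illustrative of that ordering logic; the SnapDiffEntry type and applyDiff helper are hypothetical stand-ins for what a snapshot diff report conveys, not Ozone's or DistCp's actual API.

```java
import java.util.List;

public class IncrementalReplicationSketch {

  // Hypothetical model of one line in a snapshot diff report:
  // what changed between the "from" snapshot and the "to" snapshot.
  record SnapDiffEntry(Type type, String path, String newPath) {
    enum Type { CREATE, MODIFY, DELETE, RENAME }
  }

  // Apply a diff to the target cluster in an order that keeps it consistent:
  // renames and deletes first (so paths do not collide), then copy only the
  // created or modified keys -- never the whole bucket.
  static void applyDiff(List<SnapDiffEntry> diff) {
    for (SnapDiffEntry e : diff) {
      switch (e.type()) {
        case RENAME -> System.out.println("rename on target: " + e.path() + " -> " + e.newPath());
        case DELETE -> System.out.println("delete on target: " + e.path());
        default -> { /* handled in the copy pass below */ }
      }
    }
    for (SnapDiffEntry e : diff) {
      if (e.type() == SnapDiffEntry.Type.CREATE || e.type() == SnapDiffEntry.Type.MODIFY) {
        System.out.println("copy from source snapshot: " + e.path());
      }
    }
  }

  public static void main(String[] args) {
    applyDiff(List.of(
        new SnapDiffEntry(SnapDiffEntry.Type.RENAME, "/vol1/bucket1/a.log", "/vol1/bucket1/archive/a.log"),
        new SnapDiffEntry(SnapDiffEntry.Type.MODIFY, "/vol1/bucket1/data.parquet", null),
        new SnapDiffEntry(SnapDiffEntry.Type.DELETE, "/vol1/bucket1/tmp.txt", null)));
  }
}
```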
