Snapshot Diff Lifecycle
The lifecycle of a Snapshot Diff in Ozone is an asynchronous, job-based process designed to handle potentially millions of key changes without blocking the Ozone Manager.
Based on SnapshotDiffManager.java, here is the step-by-step lifecycle:
1. Initiation & Job Queuing (getSnapshotDiffReport)
- Request: A user requests a diff between
fromSnapshotandtoSnapshot. - Job Key: A unique
snapDiffJobKeyis generated using the UUIDs of the two snapshots. - Check/Create Job:
- If a job already exists for this pair, its current status is returned.
- If no job exists, a new
SnapshotDiffJobis created with statusQUEUEDand persisted in thesnapDiffJobTable(RocksDB).
- Submission: The job is submitted to the
snapDiffExecutor(a thread pool). The status transitions fromQUEUEDtoIN_PROGRESS.
2. Preparation & Snapshots Opening (generateSnapshotDiffReport)
- Metrics: Increments the
numSnapshotDiffJobsmetric. - Snapshot Handles: The service acquires "Active Snapshot" handles (
rcFromSnapshot,rcToSnapshot) which are reference-counted to prevent the snapshots from being deleted while the diff is running. - Temporary Tables: Three temporary RocksDB Column Families (CF) are created specifically for this Job ID:
fromSnapshotColumnFamily: Maps Object IDs to Key Names in the source.toSnapshotColumnFamily: Maps Object IDs to Key Names in the target.objectIDsColumnFamily: Tracks the union of unique Object IDs and whether they are directories.
3. Delta Discovery (getDeltaFilesAndDiffKeysToObjectIdToKeyMap)
- Efficient Diff (Native): If enabled, it uses the
RocksDBCheckpointDifferto identify the exact SST files that changed between the two snapshot checkpoints. - Full Diff (Fallback): If native libs are missing or a full diff is forced, it performs a complete scan of the key tables.
- SST Parsing: It streams keys from the delta SST files.
- Object Mapping: It populates the temporary "Object ID to Key Name" maps. This is critical for detecting Renames (same Object ID, different Key Name).
4. Path Resolution (FSO specific)
- FSO Layout: If the bucket is File System Optimized (FSO), keys are stored as
ParentID/FileName. - Path Resolver: The
FSODirectoryPathResolveris used to recursively resolve theParentIDinto a full human-readable path (e.g.,vol1/bucket1/dir1/file1).
5. Report Generation (generateDiffReport)
- Comparison: The service iterates through the union of all Object IDs found in the delta.
- Classification: It classifies changes into four types:
- CREATE: Present in
toSnapshotmap but notfromSnapshot. - DELETE: Present in
fromSnapshotmap but nottoSnapshot. - RENAME: Present in both maps with the same Object ID but different paths/names.
- MODIFY: Present in both with the same path, but metadata (like ACLs or Block Versions) has changed.
- CREATE: Present in
- Persistence: The resulting
DiffReportEntryobjects are serialized and stored in thesnapDiffReportTable.
6. Finalization
- Status Update: On success, the job status in
snapDiffJobTableis updated toDONE, and thetotalDiffEntriescount is saved. - Cleanup: The temporary Column Families used for Object ID mapping are dropped and closed. The snapshot reference counts are decremented.
7. Consumption & Pagination (createPageResponse)
- Polling: The client polls the OM for the report.
- Paged Results: Since reports can be massive, the OM returns them in pages.
- Token: Each response includes a token (the index of the next entry). The client provides this token in the next call to fetch the subsequent page.
8. Garbage Collection (Background)
- Cleanup Service: (Referenced as
DiffCleanupService) After a configurable interval, a background task deletes the job fromsnapDiffJobTableand the associated report entries fromsnapDiffReportTableto reclaim space.
Summary of Job Statuses
- QUEUED: Job registered, waiting for a thread.
- IN_PROGRESS: Currently scanning SSTs or generating the report.
- DONE: Report is ready for paging.
- FAILED: An error occurred (e.g., I/O error); includes a reason.
- REJECTED: Too many concurrent jobs; the client should retry later.
- CANCELLED: The job was manually stopped via a cancel request.
