
Introduce snapshots repository: integrate snapshot writing with S3 via object_store #2274

Draft · wants to merge 10 commits into base: feat/automatic-snapshots-every-n-records

Conversation

@pcholakov (Contributor) commented on Nov 12, 2024:

This change introduces the SnapshotRepository, which encapsulates publishing the exported RocksDB column family data plus custom metadata to a (potentially remote) destination. We add a dependency on object_store, which supports all major cloud providers' blob stores as well as the local filesystem.

By default, the snapshot repository is the local restate-data/pp-snapshots directory, but a new optional config key allows it to be set to an S3 URL. We can trivially support other destinations by enabling additional features of the object_store crate.

Sample snapshot publishing (triggered with restatectl snap create-snapshot -p 1):

metadata-node	| 2024-11-12T12:11:46.203371Z  INFO restate_admin::cluster_controller::service: Create snapshot command received partition_id=PartitionId(0)
node-1	| 2024-11-12T12:11:46.204191Z DEBUG network-reactor: restate_worker::partition_processor_manager: Received 'CreateSnapshotRequest { partition_id: PartitionId(0) }' from N0:6 peer_node_id=N0:6 protocol_version=1 task_id=28
node-1	| 2024-11-12T12:11:46.204970Z TRACE run:create-snapshot: restate_worker::partition::snapshot_producer: Creating partition snapshot export directory: "/Users/pavel/restate/restate/restate-data/node-1/db-snapshots/0/snap_15YNkGdyO9YatUZwRidwr7j" snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:46.211044Z DEBUG run:create-snapshot: restate_worker::partition::snapshot_producer: Publishing partition snapshot to: s3://pavel-restate-snapshots-test/test-cluster-snapshots/ snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j partition_id=PartitionId(0) lsn=6 partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:46.211445Z DEBUG package-snapshot: restate_worker::partition::snapshot_producer: Creating snapshot tarball of "/Users/pavel/restate/restate/restate-data" in: NamedTempFile("/Users/pavel/restate/restate/restate-data/.tmpFUqwbe")... snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j
node-1	| 2024-11-12T12:11:48.007104Z DEBUG run:create-snapshot: restate_worker::partition::snapshot_producer: Successfully published snapshot to repository as: 0/fffffffffffffff9/snap_15YNkGdyO9YatUZwRidwr7j_6.tar snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j etag="\"6a1cd48ee82fcce1e1ec437a9ff380c5\"" partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:48.016576Z TRACE run:create-snapshot: restate_worker::partition::snapshot_producer: Updated persisted archived snapshot LSN snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j previous_archived_lsn=Some(Lsn(5)) updated_archived_lsn=Lsn(6) partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:48.018866Z DEBUG run:create-snapshot: restate_worker::partition::snapshot_producer: Cleaned up snapshot export directory: "/Users/pavel/restate/restate/restate-data/node-1/db-snapshots/0/snap_15YNkGdyO9YatUZwRidwr7j" snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:48.019551Z  INFO run:create-snapshot: restate_worker::partition::snapshot_producer: Successfully published partition snapshot snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j partition_id=PartitionId(0) snapshot_lsn=Lsn(6) partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:48.020410Z DEBUG restate_worker::partition_processor_manager: Create snapshot completed partition_id=PartitionId(0) result=Ok(SnapshotId(Ulid(2093150492079866388857063190669787332)))

The relevant configuration for this is:

[worker.snapshots]
destination = "s3://pavel-restate-snapshots-test/test-cluster-snapshots/"
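
For illustration, here is one way such a destination URL could be turned into an object_store handle. This is only a sketch, not necessarily how the PR wires it up (the PR appears to build the S3 client explicitly so it can plug in AWS SDK credentials, see further down), and the helper name is hypothetical:

```rust
use object_store::{parse_url, path::Path, ObjectStore};
use url::Url;

// Hypothetical helper: resolve the configured destination URL to a store + prefix.
// With the "aws" feature enabled, object_store handles s3:// URLs; file:// URLs
// map to the local filesystem backend used for the default pp-snapshots directory.
fn open_snapshot_repository(destination: &str) -> anyhow::Result<(Box<dyn ObjectStore>, Path)> {
    let url = Url::parse(destination)?;
    let (store, prefix) = parse_url(&url)?;
    Ok((store, prefix))
}
```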

The store layout is currently:

restate-data/pp-snapshots
└── 0
    └── fffffffffffffffa
        └── snap_14xkL7UxFzvDGHlWhvl4Wzf_5.tar

The key structure is: [<prefix>/]<partition_id>/<sort_key>/<snapshot_id>_<lsn>.tar.

  1. An optional prefix is provided by the operator as part of the destination URL, e.g. s3://my-bucket/custom/cluster/prefix
  2. The partition id
  3. An inverse-sort key to ensure that the latest entry always appears first when listing a bucket prefix; it seems all major object stores copied S3 and offer only ascending lexicographic sorting (a sketch after this list shows how such a key could be derived)
  4. The unique snapshot id ensures that even if we somehow take multiple snapshots at the same LSN, they will end up as distinct objects in the store. This might be useful to operators if, for example, a specific RocksDB instance has a corrupt local DB and they're trying to export the same LSN from a different PP
  5. The LSN is there purely to make things easy for the operator (snapshot ids, being ULIDs, should naturally sort in ascending timestamp order)
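
To make the key scheme concrete, here is an illustrative sketch (not the PR's actual code) of how such a key could be derived. The inverse-sort component appears to be u64::MAX - lsn rendered as fixed-width hex, which matches the fffffffffffffffa / fffffffffffffff9 values seen above for LSNs 5 and 6:

```rust
/// Illustrative only: derive the snapshot object key described above.
fn snapshot_key(prefix: Option<&str>, partition_id: u64, lsn: u64, snapshot_id: &str) -> String {
    // Object stores list keys in ascending lexicographic order, so an inverted,
    // zero-padded hex LSN makes the newest snapshot sort first under its prefix.
    let sort_key = format!("{:016x}", u64::MAX - lsn);
    let key = format!("{partition_id}/{sort_key}/{snapshot_id}_{lsn}.tar");
    match prefix {
        Some(p) => format!("{}/{}", p.trim_end_matches('/'), key),
        None => key,
    }
}

#[test]
fn matches_sample_output() {
    assert_eq!(
        snapshot_key(None, 0, 6, "snap_15YNkGdyO9YatUZwRidwr7j"),
        "0/fffffffffffffff9/snap_15YNkGdyO9YatUZwRidwr7j_6.tar"
    );
}
```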

Open questions

  • how does the snapshot layout look to you? keep in mind it'll probably change a bit to accommodate standalone metadata objects (see the outstanding tasks below)
  • should we keep persisting the FSM variable in the partition store? it bugs me a bit that this will cause drift across the different PPs, since these updates are local-only
  • are we okay with creating a tarball on disk? I didn't see a good way to stream the archive directly to the object store, and we have to have the SSTs on disk first anyway (though, being hard links to RocksDB files, they shouldn't cause much overhead); I don't want to over-invest here and would rather accept the disk overhead and focus on implementing incremental snapshots sooner (a sketch of the current approach follows this list)
  • any suggestions for how best to create and then upload the local archive?
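
For context on the tarball question, this is roughly the shape of the current approach as I understand it; crate choices and names are illustrative rather than the PR's exact code:

```rust
use std::path::Path;

use tar::Builder;
use tempfile::NamedTempFile;

// Package the exported column-family directory into a tarball in a staging temp
// file, which is then uploaded and deleted. The export directory is cheap (hard
// links to RocksDB SSTs), but the tarball is a full second copy on disk, which is
// the overhead discussed above.
fn package_snapshot(export_dir: &Path, staging_dir: &Path) -> std::io::Result<NamedTempFile> {
    let tarball = NamedTempFile::new_in(staging_dir)?;
    let mut builder = Builder::new(tarball.reopen()?);
    builder.append_dir_all(".", export_dir)?;
    builder.finish()?;
    Ok(tarball)
}
```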

Outstanding tasks

  • upload a standalone metadata.json object to the object store, separate from the tar file; I think this is a must, as it will allow us to determine e.g. the version of a snapshot before we download the data, which gives us an easy migration path to support different data formats in the future
  • stream the snapshot to the object store in chunks, preferably as a parallel multi-part upload (see the sketch below)
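
One possible shape for the streaming upload, assuming object_store's buffered BufWriter (which transparently switches to a multipart upload for large payloads); this is a sketch with hypothetical names, and a truly parallel upload would likely drive the multipart API directly:

```rust
use std::sync::Arc;

use object_store::buffered::BufWriter;
use object_store::path::Path;
use object_store::ObjectStore;
use tokio::io::AsyncWriteExt;

// Stream the local tarball to the repository in chunks rather than buffering the
// whole archive in memory; BufWriter issues a multipart upload for large objects.
async fn upload_snapshot(
    store: Arc<dyn ObjectStore>,
    key: Path,
    tarball: &std::path::Path,
) -> anyhow::Result<()> {
    let mut reader = tokio::fs::File::open(tarball).await?;
    let mut writer = BufWriter::new(store, key);
    tokio::io::copy(&mut reader, &mut writer).await?;
    writer.shutdown().await?; // finalizes the upload
    Ok(())
}
```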

Introduce a new configurable number-of-records property after which a snapshot of the partition store is automatically taken.
Comment on lines +65 to +77
.with_credentials(Arc::new(AwsSdkCredentialsProvider {
    credentials_provider: DefaultCredentialsChain::builder().build().await,
}))
pcholakov (Contributor, Author):

This makes object_store much nicer to use with S3: it natively integrates with IMDS out of the box, but it doesn't understand things like AWS_PROFILE or the config files commonly used to set up the AWS CLI/SDKs. It also ensures credentials are refreshed dynamically when, for example, an SSO session expires and the user renews it.
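
For readers unfamiliar with this hook, here is roughly how the pieces around the diff excerpt fit together. This is only a sketch with simplified bucket wiring and a hypothetical function name; AwsSdkCredentialsProvider is the type introduced in this PR, sketched further down:

```rust
use std::sync::Arc;

use aws_config::default_provider::credentials::DefaultCredentialsChain;
use object_store::aws::AmazonS3Builder;
use object_store::ObjectStore;

// Sketch: back object_store's S3 client with the AWS SDK's default credentials
// chain so that profiles, SSO, env vars and IMDS all work and refresh as expected.
async fn s3_store(bucket: &str) -> object_store::Result<impl ObjectStore> {
    AmazonS3Builder::from_env()
        .with_bucket_name(bucket)
        .with_credentials(Arc::new(AwsSdkCredentialsProvider {
            credentials_provider: DefaultCredentialsChain::builder().build().await,
        }))
        .build()
}
```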

    staging_path: PathBuf,
}

impl SnapshotRepository {
pcholakov (Contributor, Author):

In a follow-up PR, I will introduce the ability to list recent snapshots available from the repository.
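
As an aside on what that listing could look like: because of the inverted sort key, the first object listed under a partition prefix is the latest snapshot, at least on stores like S3 that list keys in lexicographic order. A hypothetical sketch:

```rust
use futures::TryStreamExt;
use object_store::path::Path;
use object_store::{ObjectMeta, ObjectStore};

// Hypothetical: find the most recent snapshot for a partition by listing its
// prefix and taking the first entry (newest-first thanks to the inverted sort key).
async fn latest_snapshot(
    store: &dyn ObjectStore,
    partition_id: u64,
) -> object_store::Result<Option<ObjectMeta>> {
    let prefix = Path::from(partition_id.to_string());
    store.list(Some(&prefix)).try_next().await
}
```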

}

#[async_trait]
impl object_store::CredentialProvider for AwsSdkCredentialsProvider {
pcholakov (Contributor, Author):

This may be something useful to contribute back upstream to object_store.
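
For reference, a minimal sketch of what such a bridge can look like against object_store's CredentialProvider trait; the error mapping and derive details are illustrative and may differ from the PR:

```rust
use std::sync::Arc;

use async_trait::async_trait;
use aws_config::default_provider::credentials::DefaultCredentialsChain;
use aws_credential_types::provider::ProvideCredentials;
use object_store::aws::AwsCredential;
use object_store::CredentialProvider;

#[derive(Debug)]
struct AwsSdkCredentialsProvider {
    credentials_provider: DefaultCredentialsChain,
}

#[async_trait]
impl CredentialProvider for AwsSdkCredentialsProvider {
    type Credential = AwsCredential;

    async fn get_credential(&self) -> object_store::Result<Arc<AwsCredential>> {
        // Delegate to the SDK chain on every call; the chain handles caching and
        // refresh, e.g. after an expired SSO session is renewed by the user.
        let creds = self
            .credentials_provider
            .provide_credentials()
            .await
            .map_err(|e| object_store::Error::Generic {
                store: "S3",
                source: Box::new(e),
            })?;

        Ok(Arc::new(AwsCredential {
            key_id: creds.access_key_id().to_string(),
            secret_key: creds.secret_access_key().to_string(),
            token: creds.session_token().map(str::to_string),
        }))
    }
}
```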

@pcholakov force-pushed the feat/automatic-snapshots-every-n-records branch 2 times, most recently from 1d97374 to 17e437f on November 14, 2024 at 09:13
@AhmedSoliman (Contributor) commented:

Is this ready for review? (I noticed that I'm a reviewer and it's still a draft.)

@pcholakov (Contributor, Author) commented:

@AhmedSoliman - early feedback is welcome, but only if you want to! I still need to rebase this on the latest #2253, so maybe hold off until that's done. I'll remove the reviewers until it is.
