
Introduce snapshots repository: integrate snapshot writing with S3 via object_store #2274

Draft · wants to merge 10 commits into base: feat/automatic-snapshots-every-n-records

Conversation

@pcholakov (Contributor) commented on Nov 12, 2024:

This change introduces the SnapshotRepository, which encapsulates publishing the exported RocksDB column family data plus custom metadata to a (potentially remote) destination. We add a dependency on object_store, which supports all major cloud providers' blob stores as well as the local filesystem.

By default, the snapshot repository is the local restate-data/pp-snapshots directory, but a new optional config key allows it to be set to an S3 URL. We can trivially support other destinations by enabling additional features of the object_store crate.

Sample snapshot publishing (triggered with restatectl snap create-snapshot -p 1):

metadata-node	| 2024-11-12T12:11:46.203371Z  INFO restate_admin::cluster_controller::service: Create snapshot command received partition_id=PartitionId(0)
node-1	| 2024-11-12T12:11:46.204191Z DEBUG network-reactor: restate_worker::partition_processor_manager: Received 'CreateSnapshotRequest { partition_id: PartitionId(0) }' from N0:6 peer_node_id=N0:6 protocol_version=1 task_id=28
node-1	| 2024-11-12T12:11:46.204970Z TRACE run:create-snapshot: restate_worker::partition::snapshot_producer: Creating partition snapshot export directory: "/Users/pavel/restate/restate/restate-data/node-1/db-snapshots/0/snap_15YNkGdyO9YatUZwRidwr7j" snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:46.211044Z DEBUG run:create-snapshot: restate_worker::partition::snapshot_producer: Publishing partition snapshot to: s3://pavel-restate-snapshots-test/test-cluster-snapshots/ snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j partition_id=PartitionId(0) lsn=6 partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:46.211445Z DEBUG package-snapshot: restate_worker::partition::snapshot_producer: Creating snapshot tarball of "/Users/pavel/restate/restate/restate-data" in: NamedTempFile("/Users/pavel/restate/restate/restate-data/.tmpFUqwbe")... snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j
node-1	| 2024-11-12T12:11:48.007104Z DEBUG run:create-snapshot: restate_worker::partition::snapshot_producer: Successfully published snapshot to repository as: 0/fffffffffffffff9/snap_15YNkGdyO9YatUZwRidwr7j_6.tar snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j etag="\"6a1cd48ee82fcce1e1ec437a9ff380c5\"" partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:48.016576Z TRACE run:create-snapshot: restate_worker::partition::snapshot_producer: Updated persisted archived snapshot LSN snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j previous_archived_lsn=Some(Lsn(5)) updated_archived_lsn=Lsn(6) partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:48.018866Z DEBUG run:create-snapshot: restate_worker::partition::snapshot_producer: Cleaned up snapshot export directory: "/Users/pavel/restate/restate/restate-data/node-1/db-snapshots/0/snap_15YNkGdyO9YatUZwRidwr7j" snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:48.019551Z  INFO run:create-snapshot: restate_worker::partition::snapshot_producer: Successfully published partition snapshot snapshot_id=snap_15YNkGdyO9YatUZwRidwr7j partition_id=PartitionId(0) snapshot_lsn=Lsn(6) partition_id=0 is_leader=true
node-1	| 2024-11-12T12:11:48.020410Z DEBUG restate_worker::partition_processor_manager: Create snapshot completed partition_id=PartitionId(0) result=Ok(SnapshotId(Ulid(2093150492079866388857063190669787332)))

The relevant configuration for this is:

[worker.snapshots]
destination = "s3://pavel-restate-snapshots-test/test-cluster-snapshots/"
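
For illustration, here is one way such a destination URL could be turned into an object_store handle. This is only a sketch, not necessarily how the PR wires it up (the PR appears to build the S3 client explicitly so it can plug in AWS SDK credentials, see further down), and the helper name is hypothetical:

```rust
use object_store::{parse_url, path::Path, ObjectStore};
use url::Url;

// Hypothetical helper: resolve the configured destination URL to a store + prefix.
// With the "aws" feature enabled, object_store handles s3:// URLs; file:// URLs
// map to the local filesystem backend used for the default pp-snapshots directory.
fn open_snapshot_repository(destination: &str) -> anyhow::Result<(Box<dyn ObjectStore>, Path)> {
    let url = Url::parse(destination)?;
    let (store, prefix) = parse_url(&url)?;
    Ok((store, prefix))
}
```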

The store layout is currently:

restate-data/pp-snapshots
└── 0
    └── fffffffffffffffa
        └── snap_14xkL7UxFzvDGHlWhvl4Wzf_5.tar

The key structure is: [<prefix>/]<partition_id>/<sort_key>/<snapshot_id>_<lsn>.tar.

  1. An optional prefix is provided by the operator as part of the destination URL, e.g. s3://my-bucket/custom/cluster/prefix
  2. The partition id
  3. An inverse-sort key to ensure that the latest entry always appears first when listing a bucket prefix; it seems all major object stores copied S3 and offer only ascending lexicographic sorting (a sketch after this list shows how such a key could be derived)
  4. The unique snapshot id ensures that even if we somehow take multiple snapshots at the same LSN, they will end up as distinct objects in the store. This might be useful to operators if, for example, a specific RocksDB instance has a corrupt local DB and they're trying to export the same LSN from a different PP
  5. The LSN is there purely to make things easy for the operator (snapshot ids, being ULIDs, should naturally sort in ascending timestamp order)
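
To make the key scheme concrete, here is an illustrative sketch (not the PR's actual code) of how such a key could be derived. The inverse-sort component appears to be u64::MAX - lsn rendered as fixed-width hex, which matches the fffffffffffffffa / fffffffffffffff9 values seen above for LSNs 5 and 6:

```rust
/// Illustrative only: derive the snapshot object key described above.
fn snapshot_key(prefix: Option<&str>, partition_id: u64, lsn: u64, snapshot_id: &str) -> String {
    // Object stores list keys in ascending lexicographic order, so an inverted,
    // zero-padded hex LSN makes the newest snapshot sort first under its prefix.
    let sort_key = format!("{:016x}", u64::MAX - lsn);
    let key = format!("{partition_id}/{sort_key}/{snapshot_id}_{lsn}.tar");
    match prefix {
        Some(p) => format!("{}/{}", p.trim_end_matches('/'), key),
        None => key,
    }
}

#[test]
fn matches_sample_output() {
    assert_eq!(
        snapshot_key(None, 0, 6, "snap_15YNkGdyO9YatUZwRidwr7j"),
        "0/fffffffffffffff9/snap_15YNkGdyO9YatUZwRidwr7j_6.tar"
    );
}
```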

Open questions

  • how does the snapshot layout look to you? keep in mind it'll probably change a bit to accommodate standalone metadata objects (see the outstanding tasks below)
  • should we keep persisting the FSM variable in the partition store? it bugs me a bit that this will cause drift across the different PPs, since these updates are local-only
  • are we okay with creating a tarball on disk? I didn't see a good way to stream the archive directly to the object store, and we have to have the SSTs on disk first anyway (though, being hard links to RocksDB files, they shouldn't cause much overhead); I don't want to over-invest here and would rather accept the disk overhead and focus on implementing incremental snapshots sooner (a sketch of the current approach follows this list)
  • any suggestions for how best to create and then upload the local archive?
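
For context on the tarball question, this is roughly the shape of the current approach as I understand it; crate choices and names are illustrative rather than the PR's exact code:

```rust
use std::path::Path;

use tar::Builder;
use tempfile::NamedTempFile;

// Package the exported column-family directory into a tarball in a staging temp
// file, which is then uploaded and deleted. The export directory is cheap (hard
// links to RocksDB SSTs), but the tarball is a full second copy on disk, which is
// the overhead discussed above.
fn package_snapshot(export_dir: &Path, staging_dir: &Path) -> std::io::Result<NamedTempFile> {
    let tarball = NamedTempFile::new_in(staging_dir)?;
    let mut builder = Builder::new(tarball.reopen()?);
    builder.append_dir_all(".", export_dir)?;
    builder.finish()?;
    Ok(tarball)
}
```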

Outstanding tasks

  • upload a standalone metadata.json object to the object store, separate from the tar file; I think this is a must, as it will allow us to determine e.g. the version of a snapshot before we download the data, which gives us an easy migration path to support different data formats in the future
  • stream the snapshot to the object store in chunks, preferably as a parallel multi-part upload (see the sketch below)
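
One possible shape for the streaming upload, assuming object_store's buffered BufWriter (which transparently switches to a multipart upload for large payloads); this is a sketch with hypothetical names, and a truly parallel upload would likely drive the multipart API directly:

```rust
use std::sync::Arc;

use object_store::buffered::BufWriter;
use object_store::path::Path;
use object_store::ObjectStore;
use tokio::io::AsyncWriteExt;

// Stream the local tarball to the repository in chunks rather than buffering the
// whole archive in memory; BufWriter issues a multipart upload for large objects.
async fn upload_snapshot(
    store: Arc<dyn ObjectStore>,
    key: Path,
    tarball: &std::path::Path,
) -> anyhow::Result<()> {
    let mut reader = tokio::fs::File::open(tarball).await?;
    let mut writer = BufWriter::new(store, key);
    tokio::io::copy(&mut reader, &mut writer).await?;
    writer.shutdown().await?; // finalizes the upload
    Ok(())
}
```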

Introduce a new configurable number-of-records property after which a snapshot of the partition store is automatically taken.
Comment on lines +65 to +77
.with_credentials(Arc::new(AwsSdkCredentialsProvider {
    credentials_provider: DefaultCredentialsChain::builder().build().await,
}))
pcholakov (Contributor, Author):

This makes object_store much nicer to use with S3: it natively integrates with IMDS out of the box, but it doesn't understand things like AWS_PROFILE or the config files commonly used to set up the AWS CLI/SDKs. It also ensures credentials are refreshed dynamically when, for example, an SSO session expires and the user renews it.
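
For readers unfamiliar with this hook, here is roughly how the pieces around the diff excerpt fit together. This is only a sketch with simplified bucket wiring and a hypothetical function name; AwsSdkCredentialsProvider is the type introduced in this PR, sketched further down:

```rust
use std::sync::Arc;

use aws_config::default_provider::credentials::DefaultCredentialsChain;
use object_store::aws::AmazonS3Builder;
use object_store::ObjectStore;

// Sketch: back object_store's S3 client with the AWS SDK's default credentials
// chain so that profiles, SSO, env vars and IMDS all work and refresh as expected.
async fn s3_store(bucket: &str) -> object_store::Result<impl ObjectStore> {
    AmazonS3Builder::from_env()
        .with_bucket_name(bucket)
        .with_credentials(Arc::new(AwsSdkCredentialsProvider {
            credentials_provider: DefaultCredentialsChain::builder().build().await,
        }))
        .build()
}
```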

    staging_path: PathBuf,
}

impl SnapshotRepository {
pcholakov (Contributor, Author):

In a follow-up PR, I will introduce the ability to list recent snapshots available from the repository.
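
As an aside on what that listing could look like: because of the inverted sort key, the first object listed under a partition prefix is the latest snapshot, at least on stores like S3 that list keys in lexicographic order. A hypothetical sketch:

```rust
use futures::TryStreamExt;
use object_store::path::Path;
use object_store::{ObjectMeta, ObjectStore};

// Hypothetical: find the most recent snapshot for a partition by listing its
// prefix and taking the first entry (newest-first thanks to the inverted sort key).
async fn latest_snapshot(
    store: &dyn ObjectStore,
    partition_id: u64,
) -> object_store::Result<Option<ObjectMeta>> {
    let prefix = Path::from(partition_id.to_string());
    store.list(Some(&prefix)).try_next().await
}
```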

}

#[async_trait]
impl object_store::CredentialProvider for AwsSdkCredentialsProvider {
pcholakov (Contributor, Author):

This may be something useful to contribute back upstream to object_store.
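
For reference, a minimal sketch of what such a bridge can look like against object_store's CredentialProvider trait; the error mapping and derive details are illustrative and may differ from the PR:

```rust
use std::sync::Arc;

use async_trait::async_trait;
use aws_config::default_provider::credentials::DefaultCredentialsChain;
use aws_credential_types::provider::ProvideCredentials;
use object_store::aws::AwsCredential;
use object_store::CredentialProvider;

#[derive(Debug)]
struct AwsSdkCredentialsProvider {
    credentials_provider: DefaultCredentialsChain,
}

#[async_trait]
impl CredentialProvider for AwsSdkCredentialsProvider {
    type Credential = AwsCredential;

    async fn get_credential(&self) -> object_store::Result<Arc<AwsCredential>> {
        // Delegate to the SDK chain on every call; the chain handles caching and
        // refresh, e.g. after an expired SSO session is renewed by the user.
        let creds = self
            .credentials_provider
            .provide_credentials()
            .await
            .map_err(|e| object_store::Error::Generic {
                store: "S3",
                source: Box::new(e),
            })?;

        Ok(Arc::new(AwsCredential {
            key_id: creds.access_key_id().to_string(),
            secret_key: creds.secret_access_key().to_string(),
            token: creds.session_token().map(str::to_string),
        }))
    }
}
```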

@pcholakov force-pushed the feat/automatic-snapshots-every-n-records branch 2 times, most recently from 1d97374 to 17e437f on November 14, 2024 at 09:13
@AhmedSoliman (Contributor) commented:

Is this ready for review? (I noticed that I'm a reviewer and it's still a draft.)

@pcholakov (Contributor, Author) commented:

@AhmedSoliman - early feedback is welcome, but only if you want to! I still need to rebase this on the latest #2253, so maybe hold off until that's done. I'll remove the reviewers until it is.
