Create snapshots on leading Partition Processors #2253

Open
wants to merge 1 commit into main from feat/automatic-snapshots-every-n-records

Conversation

Contributor

@pcholakov commented Nov 8, 2024

Introduce a new configurable property specifying the number of records after which a snapshot of the partition store is automatically taken.

This change introduces the following new config section (unset by default; the value below is a sample for illustration):

[worker.snapshots]
snapshot_interval_num_records = 100000

Tested on a local cluster with 4 partitions across 3 worker nodes:

Alive partition processors (nodes config v43, partition table v1)
 P-ID  NODE   MODE      STATUS  LEADER  EPOCH  SEQUENCER  APPLIED-LSN  PERSISTED-LSN  SKIPPED-RECORDS  ARCHIVED-LSN  LAST-UPDATE
 0     N1:10  Follower  Active  N3:10   e10               758          -              0                -             1 second and 873 ms ago
 0     N2:10  Follower  Active  N3:10   e10               758          -              0                -             1 second and 316 ms ago
 0     N3:10  Leader    Active  N3:10   e10    N3:10      758          -              0                757           1 second and 882 ms ago
 1     N1:10  Leader    Active  N1:10   e10    N1:10      722          -              0                721           1 second and 972 ms ago
 1     N2:10  Follower  Active  N1:10   e10               722          -              0                -             1 second and 503 ms ago
 1     N3:10  Follower  Active  N1:10   e10               722          -              0                -             1 second and 594 ms ago
 2     N1:10  Follower  Active  N3:10   e10               806          -              0                -             1 second and 324 ms ago
 2     N2:10  Follower  Active  N3:10   e10               806          -              0                -             1 second and 270 ms ago
 2     N3:10  Leader    Active  N3:10   e10    N3:10      806          -              0                805           1 second and 899 ms ago
 3     N1:10  Follower  Active  N2:10   e10               822          -              0                -             1 second and 724 ms ago
 3     N2:10  Leader    Active  N2:10   e10    N2:10      822          -              0                821           1 second and 551 ms ago
 3     N3:10  Follower  Active  N2:10   e10               822          -              0                -             1 second and 772 ms ago



Closes #2246

crates/types/protobuf/restate/cluster.proto (resolved)
@@ -255,6 +256,7 @@ pub struct PartitionProcessor<Codec, InvokerSender> {
max_command_batch_size: usize,
partition_store: PartitionStore,
inflight_create_snapshot_task: Option<TaskHandle<anyhow::Result<()>>>,
last_snapshot_lsn_watch: (watch::Sender<Option<Lsn>>, watch::Receiver<Option<Lsn>>),
Contributor Author

Is there a better way to do this? I want to update the LSN from a background task spawned to manage the snapshot creation (and soon, upload to blob store).
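
For context, here is a minimal sketch of the watch-channel pattern in question, with u64 standing in for Lsn; this is illustrative only, not the actual Restate code:

```rust
// A background task publishes the latest archived LSN through a tokio watch
// channel; the event loop reads it without blocking. `u64` stands in for `Lsn`.
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel::<Option<u64>>(None);

    // Spawned snapshot task reports the LSN it archived (hypothetical value).
    let snapshot_task = tokio::spawn(async move {
        let archived_lsn = 42u64;
        let _ = tx.send(Some(archived_lsn));
    });
    snapshot_task.await.unwrap();

    // The processor (or its status reporter) observes the latest value.
    assert_eq!(*rx.borrow(), Some(42));
}
```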

@pcholakov marked this pull request as ready for review on November 8, 2024 at 13:18

github-actions bot commented Nov 8, 2024

Test Results

  7 files  ±0    7 suites  ±0   4m 22s ⏱️ -10s
 47 tests ±0   46 ✅ ±0  1 💤 ±0  0 ❌ ±0 
182 runs  ±0  179 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 17e437f. Comparison against base commit 484a0df.

♻️ This comment has been updated with latest results.

Contributor

@AhmedSoliman left a comment

It feels to me that the entire snapshotting mechanism should be external to the partition processor. The only two things that PP needs to know:
1- On start, which archived LSN (if known) it was started from. The caller (PPM) would decide on this value based on whether it has downloaded a snapshot or not (albeit, PP will need to update its FSM in this case, and it's subject to signal loss if node crashed before rocksdb flush)
2- After a snapshot is uploaded, PPM would inform PP that the last archived is updated to a newer lsn.
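
To make this proposed division of responsibilities concrete, a hypothetical sketch follows; the type and field names are illustrative assumptions, not the actual Restate API, and u64 stands in for Lsn:

```rust
// Hypothetical sketch: the PPM tells the PP which archived LSN it started
// from, and later notifies it when a newer snapshot has been uploaded.
// All names here are assumptions for illustration.
type Lsn = u64;

/// Passed by the PartitionProcessorManager when starting a PartitionProcessor.
pub struct ProcessorStartInfo {
    /// Archived LSN of the snapshot the partition store was restored from, if any.
    pub starting_archived_lsn: Option<Lsn>,
}

/// Notification sent by the manager after a snapshot upload completes.
pub enum ManagerNotification {
    ArchivedLsnUpdated(Lsn),
}
```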

Comment on lines 66 to 67
/// When using the replicated loglet in a distributed cluster, snapshots provide a mechanism for
/// safely trimming the log and for bootstrapping new worker nodes.
Contributor

It might be confusing to mention the replicated loglet in this context. This feature is independent from replicated-loglet in principle. We'd be able to trim the log if we can take snapshots in a cluster setup.

Contributor Author

Great point, updated!

/// snapshots for a given partition.
///
/// Default: `None`
pub records_per_snapshot: Option<u64>,
Contributor

This reads as if the snapshot "contains records". I would prefer if we described this in terms of a trigger for a new snapshot every N log records.

Contributor Author

Gotcha, will come up with something better!

Contributor Author

Updated to snapshot_interval_num_records.
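
For reference, a sketch of what the renamed option could look like; the struct name and doc wording are assumptions, while the field type matches the Option<u64> shown in the diff:

```rust
// Illustrative sketch of the config option backing the [worker.snapshots]
// section; struct name and documentation are assumptions.
#[derive(Debug, Clone, Default)]
pub struct SnapshotsOptions {
    /// Trigger a new snapshot once the partition has applied at least this
    /// many log records since the last snapshot. Disabled when `None`.
    ///
    /// Default: `None`
    pub snapshot_interval_num_records: Option<u64>,
}
```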

crates/types/protobuf/restate/cluster.proto (resolved)
Contributor

@tillrohrmann left a comment

Thanks for creating this PR @pcholakov. I am wondering whether we can move the snapshot creation logic completely to the PPM to free the PP from the burden of thinking about snapshots?

crates/types/src/config/worker.rs (two outdated review threads, resolved)
@@ -69,6 +71,16 @@ impl SnapshotProducer {
"Partition snapshot written"
);

let previous_archived_snapshot_lsn = partition_store.get_archived_lsn().await?;
let mut tx = partition_store.transaction();
tx.put_archived_lsn(snapshot.min_applied_lsn).await;
Contributor

I assume that you are relying on the PartitionProcessor to ensure there is no concurrency between SnapshotProducers. Otherwise there is a risk that an older snapshot overwrites a newer one here, because there can now be multiple writers.

Contributor

Note that partition_store.transaction() only gives you transaction-like guarantees if there is a single writer. Concurrent writers can overwrite each other's changes without the system noticing.

Contributor Author

Yes, indeed, that is the case. I'll have to think about this a bit more carefully when we move this responsibility up into the PartitionProcessorManager.
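
One possible guard, sketched below with assumed names and u64 in place of Lsn, is to only ever advance the archived LSN so that a stale snapshot result cannot regress it; note that this still relies on the read and write happening under a single writer, as pointed out above:

```rust
// Sketch of a monotonic-advance check; the name and signature are assumptions,
// not the actual PartitionStore API.
fn should_advance_archived_lsn(previous: Option<u64>, new_snapshot_lsn: u64) -> bool {
    // Only move the archived LSN forward; results from older snapshots are dropped.
    previous.map_or(true, |prev| new_snapshot_lsn > prev)
}

fn main() {
    assert!(should_advance_archived_lsn(None, 100));
    assert!(should_advance_archived_lsn(Some(50), 100));
    assert!(!should_advance_archived_lsn(Some(200), 100)); // stale snapshot, skip the write
}
```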

Comment on lines 779 to 786
// ignore errors and don't request further snapshots if internal queue is full; we will try again later
if self
.tx
.try_send(ProcessorsManagerCommand::CreateSnapshot(*partition_id, tx))
.is_err()
{
break;
}
Contributor

Why are we going through the PartitionProcessor to create the snapshot? Could this be handled completely outside of the PartitionProcessor? That way, it would happen completely asynchronously to the partition processor event loop.

Contributor Author

No strong reason to do so immediately, though two weak intuitions lead me to think that PartitionProcessor is the better owner of this responsibility than the PartitionProcessorManager:

  • PP is the logical owner of the partition's PartitionStore once running. Sure, we can clone the handle and pass references around, but I feel that I want to keep interactions with it mostly clustered in the PartitionProcessor. To my eyes this is different to restoring a snapshot on bootstrap, which happens before we start the PartitionProcessor (hence why that responsibility currently resides in PartitionStoreManager::restore_partition_store_snapshot)
  • at some point we might want to create snapshots from a deterministic LSN by sending a CreateSnapshot command through the log, and PartitionProcessor is the place to handle that

To be sure, neither of these is a very strong argument - and I'm not attached to it remaining here! It's a pretty easy change to make given that the snapshot logic mostly lives in SnapshotProducer.

Contributor

Regarding creating snapshots for deterministic LSNs, I think we need a few more things: (a) being able to create a snapshot for a given LSN, and (b) probably the globally total-ordered log to be able to give an LSN a global meaning. Given that it's not very clear when these things are going to happen, I wouldn't over-index on it too much right now. Moreover, it could also be solved differently by a combination of snapshots and a log segment that brings all nodes to a specific LSN.

Contributor Author

After seeing how the snapshot restore/bootstrap process works, I'm now much more convinced that ownership of snapshotting overall resides with the PartitionProcessorManager. Let's keep the PP state machine as pure and focused as possible.

.add(Lsn::from(records_per_snapshot))
{
debug!(%partition_id, "Triggering snapshot on partition leader after records_per_snapshot exceeded (last snapshot: {}, current LSN: {})", status.last_archived_log_lsn.unwrap_or(Lsn::OLDEST), status.last_applied_log_lsn.unwrap_or(Lsn::INVALID));
let (tx, _) = oneshot::channel();
Contributor

I guess we are not interested in the result of the CreateSnapshot call here?

Contributor Author

No - and I possibly should have left a comment explaining why :-) (which is essentially that we can't do much other than log it, and that already happens in the SnapshotProducer)
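
A minimal illustration of this fire-and-forget pattern: binding the oneshot receiver to _ drops it immediately, so the later send simply fails and the outcome is only reported via logging on the producing side.

```rust
// Minimal illustration (not the PR's code): the receiver half is dropped at
// creation, so reporting the result later fails silently.
use tokio::sync::oneshot;

fn main() {
    let (tx, _) = oneshot::channel::<()>();
    // When the snapshot task eventually reports completion, nobody is listening.
    assert!(tx.send(()).is_err());
}
```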

Comment on lines 529 to 531
_ = latest_snapshot_check_interval.tick() => {
self.request_partition_snapshots();
}
Contributor

nit: I assume we could also react to status updates to trigger the check instead of a time based policy?

Contributor Author

That would be nice, I'll consider this when I move owning this into the PartitionProcessorManager.
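
For context, a rough sketch of the time-based check being discussed: on every tick, compare the applied LSN against the last archived LSN and request a snapshot once the configured number of records has been exceeded. The names below are simplified stand-ins, not the PR's exact code.

```rust
// Rough sketch of the periodic snapshot trigger; field and function names
// are simplified stand-ins for the PR's actual code.
use std::time::Duration;

struct PartitionStatus {
    last_applied_lsn: u64,
    last_archived_lsn: Option<u64>,
}

fn snapshot_due(status: &PartitionStatus, snapshot_interval_num_records: u64) -> bool {
    let baseline = status.last_archived_lsn.unwrap_or(0);
    status.last_applied_lsn >= baseline + snapshot_interval_num_records
}

#[tokio::main]
async fn main() {
    let mut check_interval = tokio::time::interval(Duration::from_secs(1));
    let status = PartitionStatus {
        last_applied_lsn: 150_000,
        last_archived_lsn: None,
    };

    check_interval.tick().await; // the first tick completes immediately
    if snapshot_due(&status, 100_000) {
        println!("requesting snapshot for this partition");
    }
}
```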

@pcholakov
Contributor Author

> It feels to me that the entire snapshotting mechanism should be external to the partition processor.

Interesting, both you and @tillrohrmann picked up on this. My intuition was exactly the opposite: #2253 (comment). I'm happy to go with the majority view but LMK if you're swayed by those arguments :-)

> The only two things that PP needs to know: 1- On start, which archived LSN (if known) it was started from. The caller (PPM) would decide on this value based on whether it has downloaded a snapshot or not (albeit, PP will need to update its FSM in this case, and it's subject to signal loss if node crashed before rocksdb flush)

I'll revisit this when I start work on bootstrap again; right now the FSM variable is implicitly loaded from the snapshot and left untouched, so we have exactly the behavior you describe. Explicit >> implicit though, so I'll make sure it's clear what's happening.

> 2- After a snapshot is uploaded, PPM would inform PP that the last archived is updated to a newer lsn.

That makes sense if we move the snapshotting responsibility to PPM. (Or more accurately, have the PPM own the SnapshotProducer reference and the responsibility to trigger snapshots.)

Thanks for the feedback, @AhmedSoliman!

@pcholakov force-pushed the feat/automatic-snapshots-every-n-records branch from 3889982 to 1d97374 on November 14, 2024 at 09:00
@pcholakov
Contributor Author

Apologies for the force-push, it was easier to rebase on main than merge.

> I am wondering whether we can move the snapshot creation logic completely to the PPM to free the PP from the burden of thinking about snapshots?

I'm convinced, but let me do that as a separate follow-up PR to keep this focused on just the config change, please.

A new configurable snapshot interval is added, expressed as a minimum number of records after which a snapshot of the partition store will be automatically taken.
Development

Successfully merging this pull request may close these issues.

Trigger partition processor snapshots based on number of records
3 participants