Add a PartitionRouting partition-id-to-node mapping holder and background refresher #2166
Conversation
```diff
@@ -45,9 +45,6 @@ use crate::task_center;
 pub(super) type CommandSender = mpsc::UnboundedSender<Command>;
 pub(super) type CommandReceiver = mpsc::UnboundedReceiver<Command>;
-
-#[derive(Debug, thiserror::Error)]
-pub enum SyncError {}
```
Noticed this is unused, so removed it.
```rust
let update_interval = Configuration::pinned()
    .common
    .metadata_update_interval
    .into();
```
Wasn't sure what to pick for this. It seems reasonable to start with; it might be a nice optimisation to add a `get_if_newer_version` to the metadata store to make it cheaper to poll. Eventually we should gossip/push this info to nodes without needing to poll all the time.
Wondering whether the periodic update is really required if users of this struct refresh explicitly on outdated leader information.
I don't believe we have any explicit mechanism today to provide outdated feedback; I exposed the refresh method just in case, but let's keep this for now please. Happy to reduce the refresh interval to something like 3-5x the heartbeat interval initially.
Let's see whether the load on the metadata store will be too high. Something to keep an eye on.
Yeah! Not entirely certain what signals to monitor that would tell us this is a problem before it becomes an actual problem, if that makes sense. Probably some metrics around the metadata store service?
I'm pretty keen on the `get_if_version_newer` optimisation - I think basically any backing store we would choose for metadata should easily support this feature, and worst case, it can just fall back on always doing a full read. At least it saves some bytes over the wire.
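A conditional-read API along these lines could look like the sketch below, written against a hypothetical in-memory store (neither this store type nor `get_if_version_newer` exist in the codebase yet), including the full-read fallback for backends that can't answer version-conditional gets:

```rust
use std::collections::HashMap;

// Hypothetical in-memory metadata store, for illustration only.
struct MetadataStore {
    entries: HashMap<String, (u64, Vec<u8>)>, // key -> (version, value)
}

impl MetadataStore {
    fn get(&self, key: &str) -> Option<(u64, Vec<u8>)> {
        self.entries.get(key).cloned()
    }

    // Returns the value only if its version is strictly newer than `known`,
    // saving the payload bytes on the wire when the caller is up to date.
    fn get_if_version_newer(&self, key: &str, known: u64) -> Option<(u64, Vec<u8>)> {
        match self.entries.get(key) {
            Some((version, value)) if *version > known => Some((*version, value.clone())),
            _ => None,
        }
    }
}

// A store that cannot answer conditional reads falls back to a full read.
fn fetch_if_newer(
    store: &MetadataStore,
    key: &str,
    known: u64,
    supports_conditional: bool,
) -> Option<(u64, Vec<u8>)> {
    if supports_conditional {
        store.get_if_version_newer(key, known)
    } else {
        // Worst case: full read, then discard if not actually newer.
        store.get(key).filter(|(version, _)| *version > known)
    }
}
```

Either path preserves the same caller-visible contract; the conditional variant only changes how many bytes cross the wire.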
```rust
let partition_to_node_mappings = self.inner.clone();
let metadata_store_client = self.metadata_store_client.clone();

let task = task_center().spawn_unmanaged(
```
I use `spawn_unmanaged` to get a task handle, which allows me to check if the previous one is finished.
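The pattern can be sketched with a plain std thread standing in for `task_center().spawn_unmanaged` (the `Refresher` type and its fields are illustrative, not from the PR):

```rust
use std::thread::{self, JoinHandle};
use std::time::Duration;

// Keep a handle to the previous refresh; only spawn a new one when the
// previous run has finished, so refreshes never pile up.
struct Refresher {
    previous: Option<JoinHandle<()>>,
}

impl Refresher {
    // Returns true if a new refresh was started, false if one is in flight.
    fn trigger_refresh(&mut self) -> bool {
        if let Some(handle) = &self.previous {
            if !handle.is_finished() {
                // Previous refresh still running; skip this request.
                return false;
            }
        }
        self.previous = Some(thread::spawn(|| {
            // Placeholder for the actual metadata fetch.
            thread::sleep(Duration::from_millis(100));
        }));
        true
    }
}
```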
Good idea 👍
```rust
/// requests to for a given partition. Compared to the partition table, this view is more dynamic as
/// it changes based on cluster nodes' operational status. This handle can be cheaply cloned.
#[derive(Clone)]
pub struct PartitionRouting {
```
I'm not in love with this name but hate everything else I've come up with more. Any suggestions? `PartitionNodeRoutes` is a more verbose version I also like.
Thanks for creating this PR @pcholakov. I think this goes exactly in the direction I had in mind for the simple solution to unblock Igal's and Francesco's work. I left a few comments and suggestions we could do before merging.
crates/core/src/routing_info/mod.rs (outdated)
```rust
/// Request a refresh without waiting for it to complete. This is useful when the caller
/// discovers via some other mechanism that the local view might be outdated - for example, when
/// a request to a node previously returned by `get_node_by_partition` fails with a response
/// that indicates that this routing information is no longer valid.
```
How will a user of `PartitionRouting` use it if they encounter invalid routing information?
Just call it? :-) It's super dumb right now, just a fire-and-forget to trigger the refresh. We don't yet have any errors that would signify that one is working with outdated routing info, but I imagine that instead of `TargetVersion::Latest`, the caller might specify something they got from a peer, such as a monotonic partition mapping version/epoch number. But the expectation is that whoever calls this has already encountered an unrecoverable error, and calling this is just a nice-to-have signal. Let me see if I can work this into a doc comment.
What do I do after I call this method? Right now, I cannot wait until the update has been processed, so I can only continue and hope that by the time I retry, the information has been updated. If not, then I'll talk to the outdated leader again.
My thinking here was that the caller has likely already given up, and this is a pure courtesy notification. There's no guarantee whatsoever that waiting for a change will fix anything - and conversely, the same node might later respond successfully even if nothing has changed from the routing table's point of view. Happy to add a blocking option here that returns when the version changes; I just didn't see it as a very useful thing to do right now. What do you think we could do differently?
I was imagining the caller having a number of retries. Whenever it receives a response indicating that it is talking to the wrong node, it would like to refresh the routing information, hoping that it had been operating on stale data. Once new information is obtained, it could continue, hoping that this time the information is up to date.
If the caller stopped retrying, then only notifying the `RoutingInformation` that the data is outdated is probably enough.
So I guess I am wondering whether an API that allows us to await for a refresh to complete is missing or not.
@tillrohrmann I think this is the main outstanding concern with this PR - would you like to see some specific improvements before we can merge this? I thought about this a fair amount yesterday and don't believe we can meaningfully improve on "just back off and retry" at the moment, though we'll definitely be able to do better in the future once we have a better view of cluster node status.
We can take this as a follow-up. One idea could be to have a watch that gets updated after the refresh task completes. A caller could then await this update, for example. I think this might be relevant for the work that @slinkydeveloper is doing (depending a bit on how he builds the ingress retries on wrong leadership information).
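The follow-up idea could be sketched with std primitives as a version counter that the refresh task bumps on completion; `tokio::sync::watch` would play this role in the real async code, and all names here are hypothetical:

```rust
use std::sync::{Arc, Condvar, Mutex};

// A refresh-completion signal: the background task bumps the version after it
// has installed fresh mappings; retrying callers can block until that happens.
#[derive(Clone)]
struct RefreshSignal {
    inner: Arc<(Mutex<u64>, Condvar)>,
}

impl RefreshSignal {
    fn new() -> Self {
        RefreshSignal { inner: Arc::new((Mutex::new(0), Condvar::new())) }
    }

    fn current_version(&self) -> u64 {
        *self.inner.0.lock().unwrap()
    }

    // Called by the background task after it has swapped in fresh mappings.
    fn notify_refreshed(&self) {
        let (lock, cvar) = &*self.inner;
        *lock.lock().unwrap() += 1;
        cvar.notify_all();
    }

    // A retrying caller blocks here until at least one refresh has completed
    // since `seen`, then retries against the (hopefully) updated table.
    fn wait_newer_than(&self, seen: u64) -> u64 {
        let (lock, cvar) = &*self.inner;
        let mut version = lock.lock().unwrap();
        while *version <= seen {
            version = cvar.wait(version).unwrap();
        }
        *version
    }
}
```

Because the caller captures `current_version()` before requesting a refresh, the wait cannot miss a completion that races ahead of it.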
Sounds good - I'd love to do that follow-up!
```rust
// Implementation note: the sole authority for node-level routing is currently the metadata store,
// and in particular, the cluster scheduling plan. This will change in the future so avoid leaking
// implementation details or assumptions about the source of truth.
```
👍 thanks for stating it here.
crates/core/src/routing_info/mod.rs (outdated)
```rust
let scheduling_plan: SchedulingPlan = metadata_store_client
    .get(SCHEDULING_PLAN_KEY.clone())
    .await?
    .context("Scheduling plan not found")?;
```
What if there is no scheduler yet that has created a `SchedulingPlan`?
I figure we'll just fail some requests until we have one - if a caller is trying to look up routing info for a partition, and we don't yet know how to address any of its nodes, it should return an appropriate error. E.g. the ingress layer might just return a 500 server-internal retryable error indicating to the upstream caller that the condition should be transient.
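Surfacing the missing plan as a retryable error rather than a panic might look like this sketch (the error type and lookup function are hypothetical, not from the PR):

```rust
use std::collections::HashMap;

// "Not ready" is distinct from "unknown partition": the former is transient
// (no scheduling plan published yet) and maps naturally to a retryable 500.
#[derive(Debug, PartialEq)]
enum RoutingError {
    // Routing info not (yet) available; callers should back off and retry.
    NotReady,
    UnknownPartition(u32),
}

fn node_for_partition(
    mappings: Option<&HashMap<u32, String>>,
    partition_id: u32,
) -> Result<String, RoutingError> {
    let mappings = mappings.ok_or(RoutingError::NotReady)?;
    mappings
        .get(&partition_id)
        .cloned()
        .ok_or(RoutingError::UnknownPartition(partition_id))
}
```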
I will add a debug-level log message about this at least.
In this PR I can't find where I would be able to access this …
I guess it will be passed into the …
@slinkydeveloper see how I've wired it up in this PR: #2172
Thanks for implementing the `PartitionRouting` struct @pcholakov. The changes look really good. +1 for merging.
crates/core/src/routing_info/mod.rs (outdated)
```rust
) -> anyhow::Result<()> {
    let scheduling_plan: Option<SchedulingPlan> = metadata_store_client
        .get(SCHEDULING_PLAN_KEY.clone())
        .await?;
```
Maybe we should also log at debug level if an error occurs here. That way, things become easier to debug if the routing does not update.
```rust
let _ = partition_to_node_mappings.compare_and_swap(
    current_mappings,
    Arc::new(PartitionToNodesRoutingTable {
        version: scheduling_plan.version(),
        inner: partition_nodes,
    }),
);
```
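The version-guarded swap can be illustrated with a std-only stand-in; the real code uses an atomic `Arc` swap via `compare_and_swap`, whereas a `Mutex` keeps this sketch self-contained, and all names besides the pattern are illustrative:

```rust
use std::sync::{Arc, Mutex};

// Only install a snapshot whose version is newer than the current one, so a
// slow concurrent refresh can never overwrite fresher mappings.
struct RoutingTable {
    version: u64,
    // partition id -> node name; contents are illustrative.
    mappings: Vec<(u32, String)>,
}

struct Holder {
    inner: Mutex<Arc<RoutingTable>>,
}

impl Holder {
    // Returns true if the candidate replaced the current table.
    fn update_if_newer(&self, candidate: RoutingTable) -> bool {
        let mut current = self.inner.lock().unwrap();
        if candidate.version > current.version {
            *current = Arc::new(candidate);
            true
        } else {
            false
        }
    }

    // Readers take a cheap Arc clone and never block writers for long.
    fn snapshot(&self) -> Arc<RoutingTable> {
        self.inner.lock().unwrap().clone()
    }
}
```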
Nice :-)
```rust
self.sender
    .send(Command::SyncRoutingInformation)
    .await
    .expect("Failed to send refresh request");
```
Nit: Since the `Command` no longer carries any specific value, one could use a watch to signal the refresh request. That way, the `request_refresh` call could become synchronous, and users of this method wouldn't block on the availability of send permits.
That's a neat use of watch! I do think the sync command will grow to carry some information again in the future - perhaps a node id, partition id, or some other context from the caller - as we learn more about the conditions that should trigger a refresh. So, I'd rather keep it as is for the time being, even though it's a totally valid nit!
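For reference, the watch-style alternative can be approximated with a coalescing dirty flag: callers set it synchronously and never block on channel capacity, while the background loop drains it on each tick (all names here are illustrative, not from the PR):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Edge-triggered refresh requests: any number of concurrent requests between
// two background-loop ticks coalesce into a single refresh.
#[derive(Clone)]
struct RefreshRequest {
    dirty: Arc<AtomicBool>,
}

impl RefreshRequest {
    fn new() -> Self {
        RefreshRequest { dirty: Arc::new(AtomicBool::new(false)) }
    }

    // Synchronous and lock-free; never blocks the caller.
    fn request_refresh(&self) {
        self.dirty.store(true, Ordering::Release);
    }

    // Called by the background task each tick; returns true if a refresh was
    // requested since the last check, and resets the flag.
    fn take_request(&self) -> bool {
        self.dirty.swap(false, Ordering::AcqRel)
    }
}
```

The trade-off versus the mpsc command channel is exactly the one noted above: a flag cannot carry per-request context such as a node or partition id.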
New approach as an alternative to #2149, which exposes a dedicated `PartitionRouting` lookup interface. Its implementation is initially backed by the metadata store, but it is not expected to stay that way for long.