TL;DR: TieredMergePolicy can create massive snapshots if you configure it with an aggressive deletesPctAllowed, which can hurt searchers (cause page fault storms) in a near-real-time replication world. Maybe we could add an optional (off by default) "rate limit" on how many amortized bytes/sec TMP is merging? This is just an idea / brainstorming / design discussion so far ... no PR.
Full context:
At Amazon (Product Search team) we use near-real-time segment replication to efficiently distribute index updates to all searchers/replicas.
Since we have many searchers per indexer shard, to scale to very high QPS, we intentionally tune TieredMergePolicy (TMP) to very aggressively reclaim deletions. Burning extra CPU / bandwidth during indexing to save even a little bit of CPU during searching is a good tradeoff for us (and in general, for Lucene users with high QPS requirements).
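For concreteness, "aggressively reclaim deletions" means something like the sketch below. The values are illustrative placeholders, not our actual production settings:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

class AggressiveTmpConfig {
  static IndexWriterConfig newConfig() {
    // Illustrative values only, not our production settings.
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setDeletesPctAllowed(20.0);      // lower values force TMP to merge harder to purge deletes
    tmp.setMaxMergedSegmentMB(5 * 1024); // the ~5 GB max-sized segments mentioned below

    Analyzer analyzer = new StandardAnalyzer();
    return new IndexWriterConfig(analyzer).setMergePolicy(tmp);
  }
}
```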
But we have a "fun" problem occasionally: sometimes we have an update storm (an upstream team reindexes large-ish portions of Amazon's catalog through the real-time indexing stream), and this leads to lots and lots of merging and many large (max-sized 5 GB) segments being replicated out to searchers in short order, sometimes over links (e.g. cross-region) that are not as crazy-fast as within-region networking fabric, and our searchers fall behind a bit.
Falling behind is not the end of the world: the searchers simply skip some point-in-time snapshots and jump to the latest one, effectively sub-sampling checkpoints as best they can given the bandwidth constraints. Index-to-search latency is hurt a bit, but recovers once the indexer catches up on the update storm.
The bigger problem for us is that we size our shards, roughly, so that the important parts of the index (the parts hit often by query traffic) are fully "hot", i.e. so the OS has enough RAM to hold the hot parts of the index. But when it takes too long to copy and light (switch over to the new segments for searching) a given snapshot, and we skip the next one or two snapshots, the follow-on snapshot that we finally do load may have a sizable part of the index rewritten. That snapshot's size may be a big percentage of the overall index, and copying/lighting it will stress the OS into a paging storm, hurting our long-pole latencies.
So one possible solution we thought of is to add an optional (off by default) setMaxBandwidth to TMP so that "on average" (amortized over some time window) TMP would not produce so many merges that it exceeds that bandwidth cap. With such a cap, during an update storm (war time), the index delete percentage would necessarily increase beyond what we ideally want / configured with setDeletesPctAllowed, but then during peace time, TMP could again catch up and push the deletes back below the target.
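Roughly what I have in mind, as a hand-wavy sketch (none of this exists in Lucene today; RateLimitedTieredMergePolicy, setMaxBandwidth and the simplistic cumulative budget below are all made up to illustrate the idea, and a real implementation would presumably live inside TieredMergePolicy itself and use a sliding window):

```java
import java.io.IOException;

import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.TieredMergePolicy;

// Sketch only: cap how many bytes of merging TMP may request, amortized over
// elapsed time.  Counts *requested* merge bytes, not completed ones.
public class RateLimitedTieredMergePolicy extends TieredMergePolicy {

  private double maxBytesPerSec = Double.POSITIVE_INFINITY; // off by default
  private final long startNS = System.nanoTime();
  private long bytesRequested;

  /** Hypothetical knob: amortized merge bandwidth cap, in MB/sec. */
  public void setMaxBandwidth(double mbPerSec) {
    this.maxBytesPerSec = mbPerSec * 1024 * 1024;
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
                                       MergeContext context) throws IOException {
    double elapsedSec = (System.nanoTime() - startNS) / 1e9;
    double budgetBytes = elapsedSec * maxBytesPerSec;
    if (bytesRequested >= budgetBytes) {
      // Over the amortized cap: defer merging.  The delete percentage may
      // temporarily drift above deletesPctAllowed ("war time"); TMP catches
      // up and pushes it back down later ("peace time").
      return null;
    }
    MergeSpecification spec = super.findMerges(trigger, infos, context);
    if (spec != null) {
      for (OneMerge merge : spec.merges) {
        bytesRequested += merge.totalBytesSize();
      }
    }
    return spec;
  }
}
```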