
[persist] Thread chunks of data from consolidation to the batch builder #29577

Merged: 9 commits merged into MaterializeInc:main from the batch-many branch on Sep 19, 2024

Conversation

@bkirwi (Contributor) commented Sep 16, 2024

  • Pass around structured data in the batch builder, instead of just the flat columnar records.
  • Allow passing batches of data into the batch builder (which unlike individual updates can contain structured data).
  • Estimate the size of individual rows in compaction and use that to influence the generated batch size.

The upshot is that it should now be possible to go from input to output in compaction without converting from codec-encoded data to structured data or vice-versa; the process can work in terms of structured data from top to bottom.
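As a rough sketch of the shape of this change (the types and method names below are hypothetical and do not match the real persist-client API), the batch builder effectively gains a chunk-oriented entry point alongside the existing per-update one:

```rust
// Hypothetical sketch only: `StructuredChunk`, `BatchBuilderSketch`, and the
// method names are illustrative and not the real persist-client API.

/// A chunk of already-structured updates, as produced by consolidation.
#[allow(dead_code)] // several fields exist only to make the shape concrete
struct StructuredChunk {
    keys: Vec<String>,
    vals: Vec<String>,
    timestamps: Vec<u64>,
    diffs: Vec<i64>,
}

impl StructuredChunk {
    fn len(&self) -> usize {
        self.keys.len()
    }
}

struct BatchBuilderSketch {
    chunks: Vec<StructuredChunk>,
    total_updates: usize,
}

impl BatchBuilderSketch {
    fn new() -> Self {
        Self { chunks: Vec::new(), total_updates: 0 }
    }

    /// Existing-style path: add one update at a time.
    fn push_update(&mut self, key: String, val: String, ts: u64, diff: i64) {
        self.push_chunk(StructuredChunk {
            keys: vec![key],
            vals: vec![val],
            timestamps: vec![ts],
            diffs: vec![diff],
        });
    }

    /// New-style path: accept a whole chunk from consolidation, avoiding any
    /// round trip through codec-encoded data.
    fn push_chunk(&mut self, chunk: StructuredChunk) {
        self.total_updates += chunk.len();
        self.chunks.push(chunk);
    }
}

fn main() {
    let mut builder = BatchBuilderSketch::new();
    builder.push_update("k1".into(), "v1".into(), 1, 1);
    builder.push_chunk(StructuredChunk {
        keys: vec!["k2".into(), "k3".into()],
        vals: vec!["v2".into(), "v3".into()],
        timestamps: vec![2, 2],
        diffs: vec![1, -1],
    });
    println!(
        "buffered {} updates in {} chunks",
        builder.total_updates,
        builder.chunks.len()
    );
}
```

In the real change the chunks are Arrow-backed structured columns rather than plain vectors, but the shape is the same: whole chunks flow from consolidation into the builder without being decoded and re-encoded.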

Motivation

Subtask for #24830.

Tips for reviewer

Chunking up compaction output is a little tricky, and I will probably want to make some changes in a followup. It's all behind a flag for now of course!

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@bkirwi force-pushed the batch-many branch 2 times, most recently from 0ec24a0 to 1045da9 on September 17, 2024 16:04
@bkirwi changed the title from "[wip] [persist] Pass sets of updates to the batch builder" to "[wip] [persist] Thread chunks of data from consolidation to the batch builder" on Sep 17, 2024
@bkirwi changed the title from "[wip] [persist] Thread chunks of data from consolidation to the batch builder" to "[persist] Thread chunks of data from consolidation to the batch builder" on Sep 17, 2024
@bkirwi marked this pull request as ready for review on September 17, 2024 21:52
@bkirwi requested a review from a team as a code owner on September 17, 2024 21:52
@bkirwi (Contributor, Author) commented Sep 18, 2024

In case it's of interest, here is my thinking behind the batch-building estimates:

  • It estimates arrow-size and not parquet-size, because arrow is easier to estimate, and because in the cases where they're radically different parquet is likely to be smaller thanks to things like compression.
  • It estimates size based on ArrowOrd because it was straightforward to implement and ought to produce results similar to the old method.
  • It makes compaction responsible for sizing parts instead of the batch builder - compaction needs to chunk up data anyway, and making it try to generate the right size of chunks up front avoids copies.

The remaining issue is that consolidation may in some circumstances produce chunks that are too small (if, e.g., we don't have enough data downloaded yet). This is the sort of tuning I plan to tackle in a followup. The latest version of the PR will concat multiple chunks together to get one of the right size (see the sketch below). I think normally we'll have just one chunk and not need to do any extra work, but I've added a metric so I can monitor how much copying is happening and see whether it's worth tuning consolidation to output larger batches.
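For illustration only, here is a minimal sketch of that concatenation strategy, using plain byte vectors in place of the real Arrow-backed chunks (the function name and types are hypothetical):

```rust
// Hypothetical sketch of target-size chunking; the real code works over Arrow
// data rather than `Vec<u8>`, and this helper is made up for illustration.

/// Concatenate undersized chunks until the running batch reaches `target_bytes`,
/// then emit it. Every input byte is forwarded exactly once; a batch can
/// overshoot the target by at most one chunk's worth of bytes.
fn chunk_to_target(chunks: Vec<Vec<u8>>, target_bytes: usize) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut current: Vec<u8> = Vec::new();
    for chunk in chunks {
        // This copy is the cost the new metric would track: if most chunks
        // already arrive at the right size, `current` usually holds one chunk.
        current.extend_from_slice(&chunk);
        if current.len() >= target_bytes {
            out.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        out.push(current);
    }
    out
}

fn main() {
    let chunks = vec![vec![0u8; 300], vec![0u8; 300], vec![0u8; 900]];
    let sizes: Vec<usize> = chunk_to_target(chunks, 512).iter().map(|b| b.len()).collect();
    println!("batch sizes: {sizes:?}"); // [600, 900]
}
```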

@ParkMyCar (Member) left a comment

Overall LGTM! Just a couple of thoughts.

Resolved review threads:

  • src/persist-client/src/batch.rs (outdated)
  • src/persist-client/src/internal/compact.rs
  • src/persist-client/src/iter.rs (two threads)
Comment on lines 705 to 707
// Keep a running estimate of the size left in the budget, returning None after it's
// exhausted. Note that we can't use take_while here... it's important that we pass
// on every value we receive from the iterator.
@ParkMyCar (Member):

I might misunderstand what this comment means, but I'm pretty sure we stop returning elements once the budget == 0? Here's an example from the Rust Playground.

@bkirwi (Contributor, Author):

Yes, that's correct! The idea of the second half of this comment is that it would be bad for us to pull an element from the consolidating iter and not forward it along, so we can't use take_while (which drops the first non-matching element). I'll rephrase.
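For a standalone illustration of why take_while doesn't fit here (plain standard-library iterators, not the persist consolidating iterator): take_while consumes the first element that fails its predicate and then drops it, whereas checking the budget before pulling the next element forwards every value that is actually received:

```rust
fn main() {
    let sizes = vec![40u64, 40, 40, 40];
    let budget: i64 = 100;

    // `take_while` pulls the first element that fails the predicate and then
    // drops it: the third `40` is consumed from the iterator but never yielded.
    let mut remaining = budget;
    let with_take_while: Vec<u64> = sizes
        .iter()
        .copied()
        .take_while(|size| {
            remaining -= *size as i64;
            remaining >= 0
        })
        .collect();
    assert_eq!(with_take_while, vec![40, 40]);

    // Checking the budget *before* pulling the next element never discards
    // anything: the element that exhausts the budget is still yielded, and
    // nothing is taken from the source that isn't passed on.
    let mut remaining = budget;
    let mut source = sizes.iter().copied();
    let budgeted: Vec<u64> = std::iter::from_fn(|| {
        if remaining <= 0 {
            return None;
        }
        let size = source.next()?;
        remaining -= size as i64;
        Some(size)
    })
    .collect();
    assert_eq!(budgeted, vec![40, 40, 40]);

    println!("take_while: {with_take_while:?}, budgeted: {budgeted:?}");
}
```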

Resolved review thread: src/persist/src/indexed/encoding.rs
// This is a noop if there are no such updates.
// TODO: consider moving the individual updates to BatchBuilder?
let previous = self.buffer.drain();
self.flush_part(previous).await;
@ParkMyCar (Member):

What's the rationale for always flushing the previous parts? It seems like it might be nice to flush the current updates with the previous into a single blob?

@bkirwi (Contributor, Author):

> What's the rationale for always flushing the previous parts?

Simplest thing to do, basically... we only hit this case if someone was mixing the two types of calls, which never happens in practice.

> It seems like it might be nice to flush the current updates with the previous into a single blob?

There's some risk that it outputs a blob that's larger than our target size. (So we could make it conditional, but for the above reason I'm not inclined to put much effort in!)
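For what it's worth, a conditional version of that flush could look roughly like the following hypothetical sketch (plain byte vectors and made-up names, not the real batch builder types):

```rust
// Hypothetical sketch of the "make it conditional" idea discussed above.

/// Decide whether previously buffered single updates can ride along with the
/// incoming chunk, or must be flushed as their own part to respect the target.
fn parts_to_write(buffered: Vec<u8>, mut chunk: Vec<u8>, target_bytes: usize) -> Vec<Vec<u8>> {
    if buffered.is_empty() {
        return vec![chunk];
    }
    if buffered.len() + chunk.len() <= target_bytes {
        // Combined blob stays under the target: merge into a single part.
        let mut merged = buffered;
        merged.append(&mut chunk);
        vec![merged]
    } else {
        // Merging would overshoot the target blob size; flushing the buffer
        // separately matches what the unconditional version always does.
        vec![buffered, chunk]
    }
}

fn main() {
    assert_eq!(parts_to_write(vec![1u8, 2, 3], vec![4, 5], 8), vec![vec![1u8, 2, 3, 4, 5]]);
    assert_eq!(parts_to_write(vec![0u8; 6], vec![0u8; 6], 8).len(), 2);
}
```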

Resolved review threads:

  • src/persist-client/src/batch.rs (outdated)
  • src/persist-types/src/arrow.rs (outdated, two threads)
  • src/persist-client/src/internal/compact.rs

shepherdlybot bot commented Sep 19, 2024

Risk Score: 82/100 · Bug Hotspots: 0 · Resilience Coverage: 66%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Feature Flag
  • (Required) Integration Test
  • (Required) Observability
  • (Required) QA Review
  • (Required) Run Nightly Tests
  • Unit Test
Risk Summary:

This pull request carries a high risk score of 82, driven by predictors such as the average age of files, cognitive complexity within files, and the delta of executable lines. Historically, PRs with these predictors are 102% more likely to cause a bug than the repository baseline. Notably, the observed bug trend in the repository is decreasing.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

@bkirwi (Contributor, Author) commented Sep 19, 2024

Thanks all! Think I've addressed all the blockers, but if anything doesn't look right I'm happy to take it in a followup.

Commit messages pushed to the branch:

  • This is a vestige of back when we had two sets of schemas in compaction, which was weird and is long gone. Enforce that all the data in a batch has the same schema.
  • Following the similar method on columnar records.
  • If this is more than a small fraction of bytes, it's worth tuning consolidation to return larger chunks of output.
@bkirwi requested a review from a team as a code owner on September 19, 2024 18:34
@bkirwi (Contributor, Author) commented Sep 19, 2024

Ran into a test failure after rebasing on main, and I think I've figured it out.

The test_builtin_connection_alterations_are_preserved_across_restarts test is failing with a SIGSEGV, apparently a stack overflow. This is extra likely to be a stack overflow because @def- just found and fixed it in bf8dcd9.

However, the fix was reverted by @jkosh44 in #29593. I don't see a rationale in the PR description - Joe, was that intentional, and/or should the test change be un-reverted?

I've re-applied Dennis's fix on this branch, and it does seem to get the test passing again.

@jkosh44 (Contributor) commented Sep 19, 2024

> However, the fix was reverted by @jkosh44 in #29593. I don't see a rationale in the PR description - Joe, was that intentional, and/or should the test change be un-reverted?

I think I understand what happened.

  1. PR #29433 (Revert "catalog: Combine epoch and deploy generation") did two things: it reverted fa2c417 and added the new commit bf8dcd9.
  2. I then reverted #29433, which un-reverted fa2c417 but also reverted bf8dcd9.

So, no, it was not intentional. Please unrevert bf8dcd9.

@bkirwi merged commit 0cebb4a into MaterializeInc:main on Sep 19, 2024
84 checks passed
@bkirwi (Contributor, Author) commented Sep 19, 2024

Ah, that'll do it! Thanks for chasing that down.

@github-actions bot locked and limited conversation to collaborators Sep 19, 2024
@def- (Contributor) commented Sep 19, 2024

Sorry about that! I had to mix that in to get cargo test green

@bkirwi (Contributor, Author) commented Sep 19, 2024

No need to apologize -- far from it! I was in fact very happy to find out that someone had already figured out a fix for the weird test failure I was hitting...
