
storage: Design doc for Source Versioning / Tables from Sources [ENG-TASK-15] #27907

Merged

merged 11 commits into MaterializeInc:main from subsource-deprecation-blue-green on Sep 9, 2024

Conversation

@rjobanp
Contributor

@rjobanp rjobanp commented Jun 26, 2024


Member

@benesch benesch left a comment

Thank you for writing this up, @rjobanp! The first five sections (problem, solution, success criteria, out of scope, solution proposal) are really clear and well done.

I'm not the expert on the source ingestion pipeline, so I only lightly skimmed the implementation plan and don't have useful feedback to offer there. My review of that section focused on the UX considerations (i.e., the specific SQL syntax and the migration path for the _progress relations).

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from c257333 to ee12ab6 on June 27, 2024 14:13
@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from bfc8a46 to 9f301de on July 1, 2024 18:06
@rjobanp rjobanp marked this pull request as ready for review July 1, 2024 18:07
@rjobanp rjobanp requested review from bosconi and a team July 1, 2024 18:08
Member

@bosconi bosconi left a comment

Looks great! The phased approach seems well worth the extra steps to reduce risk.

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from 9f301de to a820cec on July 16, 2024 19:45
@rjobanp rjobanp changed the title storage: Design doc for subsource deprecation & blue-green source schema changes storage: Design doc for Source Versioning / Tables from Sources Jul 17, 2024
@morsapaes
Contributor

Migrating a private discussion from Slack around backfilling behavior for source tables:

Something I was mulling over after that user thread:
I expect users to ask us for a solution that allows them to not discard all state of the v0 source table in v1. For example, if they have a 7-day retention period in Kafka and create a source table v0, then after 7 days need to evolve the schema and create a source table v1, the blue/green process would cause all the green downstream objects to hydrate based on the new data only.
Naively, I'd say this looks like a UNION (?) between v0 and v1, but... we might need to think about how to cleanly approach this (probably via dbt).
We've discussed keeping historical state around for in-place schema changes (i.e. adding new columns, but possibly keeping around the data for old columns that were previously ingested even if they're dropped), but here we're just assuming users are okay bootstrapping their source and its dependencies from scratch.

cc @sdht0, who separately also brought this up in conversation with @rjobanp.
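For concreteness, a minimal sketch of the UNION idea, with hypothetical table and column names: an `events_v0`/`events_v1` pair where v1 added a `bar` column, and an `event_ts` column used as the cutover predicate. This is an illustration of the idea, not a committed design.

```sql
-- Hypothetical: keep pre-cutover history from the v0 table and take
-- everything from the cutover point onward from the v1 table. The cast
-- NULL pads historical rows for the column that only exists in v1.
CREATE VIEW events AS
SELECT key, payload, NULL::text AS bar
FROM events_v0
WHERE event_ts < TIMESTAMP '2024-07-01'
UNION ALL
SELECT key, payload, bar
FROM events_v1
WHERE event_ts >= TIMESTAMP '2024-07-01';
```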

@rjobanp
Contributor Author

rjobanp commented Jul 18, 2024

I expect users to ask us for a solution that allows them to not discard all state of the v0 source table in v1.

Thanks for including that here, @morsapaes! It's a great question and will certainly be important for making this feature usable in real-world scenarios.

My initial thought is that we should try to backfill the new v1 source table using the usual snapshot functionality present in our sources, though this would only be able to make data available at timestamps that are still preserved in the upstream system's replication log (e.g. Postgres WAL, MySQL binlog, Kafka topic). Hopefully this will be acceptable for most use cases, but it's worth thinking about whether there are better options (such as the union idea you've proposed).

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from a820cec to 46fd1e1 on July 22, 2024 15:15
@rjobanp rjobanp self-assigned this Jul 23, 2024
@bosconi bosconi changed the title storage: Design doc for Source Versioning / Tables from Sources storage: Design doc for Source Versioning / Tables from Sources [ENG-TASK-15] Jul 24, 2024
github-actions bot commented Aug 6, 2024

All contributors have signed the CLA. (Posted by the CLA Assistant Lite bot.)

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from dfa419f to a26c97a on August 8, 2024 16:21
Contributor

@petrosagg petrosagg left a comment

Thanks for writing this up!

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch 2 times, most recently from ec798db to 66a76ac on August 13, 2024 19:59
accomplished during the process of replacing _subsources_ with _tables_ (see
solution proposal below).

- Dealing with Webhook sources. These are fundamentally different than the other
Contributor

We'll move this to within scope and get help from Adapter with porting webhook sources into the new model. Added a placeholder task to #28653.

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from d945203 to 308f107 on August 19, 2024 17:59
from 'normal' tables that are written to directly by users. The value of this will
be one of `'direct'` or `'source'`.

`data_source_id`: a nullable column that contains the `id` of the data source,
Member

"data source" isn't a term we've used before. My preference would be to stick with a plain ol' "source ID." Are you worried that that would be too vague?

Contributor Author

My thought was that we may also have tables whose data comes from other 'data sources' in the future -- e.g. if we supported CREATE TABLE .. FROM WEBHOOK or CREATE TABLE .. FROM FIVETRAN, etc. -- such that I didn't want to tie this column directly to our existing concept of sources. If we feel that other stuff isn't likely to happen, I can switch it to just source_id.

Member

Ah! Ok. Will respond in the other thread above (#27907 (comment)) to avoid forking the discussion, since they've converged.
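For illustration, a hedged sketch of how the proposed columns might be used once they exist; the `data_source_id` name (vs. plain `source_id`) is exactly what is being debated in this thread, so treat everything here as hypothetical:

```sql
-- Hypothetical: list tables whose data is ingested from a source,
-- joining the proposed nullable column back to mz_sources.
SELECT t.name AS table_name, s.name AS source_name
FROM mz_catalog.mz_tables AS t
JOIN mz_catalog.mz_sources AS s ON t.data_source_id = s.id
WHERE t.type = 'source';
```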

- `namespace`: Any namespace that this reference object exists in, if applicable. For Postgres
this would be `<database>.<schema>`, for MySQL this would be `<schema>`, and for Kafka this
would be NULL.
- `last_updated_at`: The timestamp of when this reference was fetched from the upstream system.
Member

In other tables we simply use updated_at (see e.g. mz_cluster_replica_statuses) for the "last update" time.

Contributor

+1


The proposal for the 2nd option (a new system table):

Introduce a `mz_catalog.mz_available_source_references` with the following columns:
Member

Not at all married to this, but in mz_source_statistics we use "known" (as in offset_known) to represent the idea of things that are known to Materialize but may not be perfectly up to date.

Suggested change:

```diff
-Introduce a `mz_catalog.mz_available_source_references` with the following columns:
+Introduce a `mz_catalog.mz_known_source_references` with the following columns:
```

Contributor

Also for consistency, all our source-related tables start with mz_source, so we could roll with mz_source_references. Do we need an additional qualifier?

Member

We do have precedent for a word before mz_source, like mz_kafka_sources, mz_postgres_sources, etc.

Member

(I don't mind mz_source_references at all though.)
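For illustration, a hedged sketch of querying the proposed system table, using the `mz_source_references` name floated above, the columns from the design excerpt (with `updated_at` per the naming feedback), and an assumed `source_id` column; none of these names are final:

```sql
-- Hypothetical: show which upstream objects a given source could ingest.
SELECT name, namespace, updated_at
FROM mz_catalog.mz_source_references
WHERE source_id = 'u42'  -- hypothetical source ID
ORDER BY namespace, name;
```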


@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from 6bc6d06 to b268029 on August 21, 2024 20:55
Comment on lines 392 to 410
Webhook sources will be migrated to the new model at the SQL level, to allow for
'schema changes' and multiple outputs using the same `CREATE TABLE .. FROM SOURCE`
statement syntax as other sources.
Since the `encoding` and `envelope` options will be moved to that table statement,
this will also allow a webhook request to be decoded differently for each table.

Under the hood, we will continue to have `environmentd` receive webhook requests
and write directly to a persist collection. However, this code will be modified
to do just request validation and then write the _raw_ request body as bytes,
along with a map of all headers, to the persist collection without doing any decoding.

We will then create a new `Persist Source` that operates exactly like all other source
types such that the storage layer renders a dataflow and the source operator reads
from an existing persist collection -- in this case the 'raw' webhook collection being
written to by `environmentd`. This `Persist Source` can have multiple `SourceExports`
like other sources, and each export can define its own `encoding` and `envelope` processing
to convert the raw bytes of each request into the appropriate relation schema.
The progress collection for the webhook source will essentially be the frontier of
the underlying 'raw' persist collection that each request is written to.
Member

@benesch benesch Aug 28, 2024

I think I understand the commonality we're trying to extract from webhook sources here, but I worry it won't be that valuable in practice. It's hard for me to imagine a real-world use case for changing the body format, for example.

Retroactively adding new headers seems more valuable, but I'm worried it's actually quite surprising if we store the full raw HTTP requests for all time, even though that raw collection isn't visible to users. For example we document that you can exclude headers to omit sensitive information: https://materialize.com/docs/sql/create-source/webhook/#excluding-header-fields

But with the proposed design, that sensitive information will sit secretly inside Materialize until the source is dropped—unless someone reveals it with an ALTER SOURCE command.

The migration here for existing webhook sources seems unviable, too, since we don't have the raw HTTP bytes saved for existing webhook sources.

Is there something lighter weight we could do here instead? Something that moves the ball forward without fully aligning with the new subsources model, while also not painting ourselves into a corner? I think the most pressing concern is separating out the webhook source object from the webhook table object. But I think you could reasonably say that, as a special case, 1) envelopes and encodings are a property of the webhook source, not the subsource, and 2) webhook subsources cannot be dropped/added.

Backing up a level, I think what folks actually want with webhook sources is the ability to "write at" webhook tables. Much more so than with other sources. Usually with webhooks you're totally happy to tear down the existing webhook source and make a new one with the new configuration (new headers, new CHECK, new body format, whatever) ... if only you could backfill the new webhook source with the fixed up data from the old webhook source!

```sql
CREATE SOURCE og_webhook FROM WEBHOOK ... INCLUDE HEADER 'foo' AS foo;

-- Shoot, I need another header going forward!

CREATE SOURCE webhook2 FROM WEBHOOK ... INCLUDE HEADER 'foo' AS foo, INCLUDE HEADER 'bar' AS bar;

-- One-time backfill

INSERT INTO webhook2 SELECT body, foo, 'unknown' AS bar FROM og_webhook;
```

Member

I'm short on time, so that might not be entirely coherent. Happy to chat more about this live next time we're both in the office, @rjobanp, or during this week's S&S sync.

Contributor Author

Those are all valid concerns and I am definitely worried about the migration of existing webhook sources to the model proposed here.
While we could do something simpler, I worry that we will continue to have special-casing for webhook sources, and will need additional special-casing for CREATE TABLE .. FROM SOURCE statements that refer to webhook sources, unless we do the work now to align them to the new model. This isn't just confusing for us in the codebase, but also for users and our documentation.

Following up on our live discussion -- if the main concern is that there might be sensitive data in the raw request bytes we write to the first persist collection, I wonder if we can instead have the 'Persist Source' ask persist to truncate the 'raw' collection as it ingests data into the downstream 'tables', so that we remove data just after it's been decoded. This would still fit the new model, but would mean that any new 'table' added to an existing webhook source couldn't re-ingest previous requests; all future requests would still get decoded individually by each table from that point forward.
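For concreteness, a hedged sketch of what the model proposed here might look like at the SQL level; the exact option syntax is hypothetical and was not pinned down in this thread:

```sql
-- Hypothetical: the webhook source only validates requests and stores the
-- raw body and headers; each table applies its own decoding independently.
CREATE SOURCE my_webhook FROM WEBHOOK;

CREATE TABLE events_json FROM SOURCE my_webhook (FORMAT JSON);
CREATE TABLE events_text FROM SOURCE my_webhook (FORMAT TEXT);
```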

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from cd2345e to e8cb2ae on August 30, 2024 16:58
@benesch
Member

benesch commented Aug 30, 2024

New webhook design LGTM! 🙇🏽

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from e8cb2ae to 3705aef on September 6, 2024 20:32
@rjobanp
Contributor Author

rjobanp commented Sep 9, 2024

This PR has gotten fairly big, and since we've agreed on all the open discussion points, I'm going to merge and apply any future updates to the doc in separate PRs.

@rjobanp rjobanp merged commit e2c7b1c into MaterializeInc:main Sep 9, 2024
11 checks passed
@rjobanp rjobanp deleted the subsource-deprecation-blue-green branch September 9, 2024 12:16
@github-actions github-actions bot locked and limited conversation to collaborators Sep 9, 2024