storage: Design doc for Source Versioning / Tables from Sources [ENG-TASK-15] #27907
Conversation
Thank you for writing this up, @rjobanp! The first five sections (problem, solution, success criteria, out of scope, solution proposal) are really clear and well done.

I'm not the expert on the source ingestion pipeline, so I only lightly skimmed the implementation plan and don't have useful feedback to offer there. In that section, I focused my review on the UX considerations (i.e., the specific SQL syntax and the migration path for the `_progress` relations).
Looks great! The phased approach seems well worth the extra steps to reduce risk.
Migrating a private discussion from Slack around backfilling behavior for source tables:

cc @sdht0, who separately also brought this up in conversation with @rjobanp.

Thanks for including that here, @morsapaes! It's a great question and will certainly be important to make this feature usable in real-world scenarios. My initial thought is that we should try to backfill the new …
thanks for writing this up!
> … accomplished during the process of replacing _subsources_ with _tables_ (see
> solution proposal below).
>
> - Dealing with Webhook sources. These are fundamentally different than the other …
We'll move this to within scope and get help from Adapter with porting webhook sources into the new model. Added a placeholder task to #28653.
> … from 'normal' tables that are written to directly by users. The value of this will
> be one of `'direct'` or `'source'`.
>
> `data_source_id`: a nullable column that contains the `id` of the data source, …
"data source" isn't a term we've used before. My preference would be to stick with a plain ol' "source ID." Are you worried that that would be too vague?
My thought was that we may also have tables whose data comes from other 'data sources' in the future -- e.g. if we supported `CREATE TABLE .. FROM WEBHOOK` or `CREATE TABLE .. FROM FIVETRAN`, etc. -- such that I didn't want to tie this column directly to our existing concept of sources. If we feel that other stuff isn't likely to happen, I can switch it to just `source_id`.
Ah! Ok. Will respond in the other thread above (#27907 (comment)) to avoid forking the discussion, since they've converged.
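For illustration, a hedged sketch of how these proposed columns might be queried. The name of the column holding the `'direct'`/`'source'` value is cut off in the excerpt above, so `data_source_type` is a hypothetical stand-in, and `mz_tables` as the host relation is an assumption:

```sql
-- Sketch only: list tables whose data is ingested from a source rather than
-- written directly by users. `data_source_type` is a hypothetical column
-- name; `data_source_id` is taken from the design draft above.
SELECT name, data_source_id
FROM mz_tables
WHERE data_source_type = 'source';
```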
> - `namespace`: Any namespace that this reference object exists in, if applicable. For Postgres
>   this would be `<database>.<schema>`, for MySQL this would be `<schema>`, and for Kafka this
>   would be NULL.
> - `last_updated_at`: The timestamp of when this reference was fetched from the upstream system.
In other tables we use simply `updated_at` (see e.g. `mz_cluster_replica_statuses`) for the "last update" time.
+1
> The proposal for the 2nd option (a new system table):
>
> Introduce a `mz_catalog.mz_available_source_references` with the following columns:
Not at all married to this, but in `mz_source_statistics` we use "known" (as in `offset_known`) to represent the idea of things that are known to Materialize but may not be perfectly up to date.
```suggestion
Introduce a `mz_catalog.mz_known_source_references` with the following columns:
```
Also for consistency, all our source-related tables start with `mz_source`, so we could roll with `mz_source_references`. Do we need an additional qualifier?
We do have precedent for a word before `mz_source`, like `mz_kafka_sources`, `mz_postgres_sources`, etc.
(I don't mind `mz_source_references` at all, though.)
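To make the proposal concrete, an illustrative query against the proposed table. The table name follows the draft (modulo the renaming discussed above), `namespace` and `last_updated_at` come from the excerpt, and the other column names here (`source_id`, `name`) are assumptions:

```sql
-- Sketch only: discover upstream relations known for a given source, as
-- candidates for CREATE TABLE .. FROM SOURCE. Column names other than
-- `namespace` and `last_updated_at` are hypothetical.
SELECT name, namespace, last_updated_at
FROM mz_catalog.mz_available_source_references
WHERE source_id = 'u123';
```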
> Webhook sources will be migrated to the new model at the SQL level, to allow for
> 'schema changes' and multiple outputs using the same `CREATE TABLE .. FROM SOURCE`
> statement syntax as other sources.
> Since the `encoding` and `envelope` options will be moved to that table statement,
> this will also allow a webhook request to be decoded differently for each table.
>
> Under the hood, we will continue to have `environmentd` receive Webhook requests
> and write directly to a persist collection. However, this code will be modified
> to just do request validation and then write the _raw_ request body as bytes
> and a map of all headers to the persist collection without doing any decoding.
>
> We will then create a new `Persist Source` that operates exactly like all other source
> types, such that the storage layer renders a dataflow and the source operator reads
> from an existing persist collection -- in this case the 'raw' webhook collection being
> written to by `environmentd`. This `Persist Source` can have multiple `SourceExports`
> like other sources, and each export can define its own `encoding` and `envelope` processing
> to convert the raw bytes of each request into the appropriate relation schema.
> The progress collection for the webhook source will essentially be the frontier of
> the underlying 'raw' persist collection that each request is written to.
I think I understand the commonality we're trying to extract from webhook sources here, but I worry it won't be that valuable in practice. It's hard for me to imagine a real-world use case for changing the body format, for example.

Retroactively adding new headers seems more valuable, but I'm worried it's actually quite surprising if we store the full raw HTTP requests for all time, even though that raw collection isn't visible to users. For example, we document that you can exclude headers to omit sensitive information: https://materialize.com/docs/sql/create-source/webhook/#excluding-header-fields

But with the proposed design, that sensitive information will sit secretly inside Materialize until the source is dropped, unless someone reveals it with an `ALTER SOURCE` command.

The migration here for existing webhook sources seems inviable, too, since we don't have the raw HTTP bytes saved for existing webhook sources.

Is there something lighter weight we could do here instead? Something that moves the ball forward without fully aligning with the new subsources model, while also not painting ourselves into a corner? I think the most pressing concern is separating out the webhook source object from the webhook table object. But I think you could reasonably say that, as a special case, 1) envelopes and encodings are a property of the webhook source, not the subsource, and 2) webhook subsources cannot be dropped/added.

Backing up a level, I think what folks actually want with webhook sources is the ability to "write at" webhook tables, much more so than with other sources. Usually with webhooks you're totally happy to tear down the existing webhook source and make a new one with the new configuration (new headers, new `CHECK`, new body format, whatever) ... if only you could backfill the new webhook source with the fixed-up data from the old webhook source!
```sql
CREATE SOURCE FROM WEBHOOK og_webhook ... HEADER 'foo';
-- Shoot, I need another header going forward!
CREATE SOURCE FROM WEBHOOK webhook2 ... HEADER 'foo', HEADER 'bar';
-- One time backfill
INSERT INTO webhook2 SELECT body, foo, 'unknown' AS bar FROM og_webhook;
```
I'm short on time, so that might not be entirely coherent. Happy to chat more about this live next time we're both in the office, @rjobanp, or during this week's S&S sync.
Those are all valid concerns, and I am definitely worried about the migration of existing webhook sources to the model proposed here.

While we could do something simpler, I worry that we will continue to have special-casing for webhook sources, and will need additional special-casing for `CREATE TABLE .. FROM SOURCE` statements that refer to webhook sources, unless we do the work now to align them to the new model. This isn't just confusing for us in the codebase, but also for users and our documentation.

Following up on our live discussion -- if the main concern is that there might be sensitive data in the raw request bytes we write to the first persist collection, I wonder if we can instead just have the 'Persist Source' ask persist to truncate the 'raw' collection as it ingests data into the downstream 'tables', such that we remove data just after it's been decoded. This would still operate with the new model, but would just mean that any new 'table' added to an existing webhook source wouldn't be able to re-ingest previous requests; all future requests would get decoded individually by all tables from that point forward.
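For a sense of what the aligned model could look like at the SQL level, a rough sketch. The `CREATE TABLE .. FROM SOURCE` shape is from the design draft, but the option names and clause placement (`FORMAT JSON`, `FORMAT TEXT`) are assumptions rather than final syntax:

```sql
-- Hypothetical sketch of webhook sources under the proposed model: one raw
-- source collection, with per-table decoding.
CREATE SOURCE my_webhook FROM WEBHOOK;  -- environmentd writes raw bytes + headers

-- Two tables over the same raw collection, each with its own decoding:
CREATE TABLE webhook_json FROM SOURCE my_webhook (FORMAT JSON);
CREATE TABLE webhook_text FROM SOURCE my_webhook (FORMAT TEXT);
```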
New webhook design LGTM! 🙇🏽
This PR has gotten fairly big, and since we've agreed on all the open discussion points, I'm going to merge and apply any future updates to the doc in separate PRs.
Motivation
Tips for reviewer
Rendered
Checklist
If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.