
storage: Design doc for Source Versioning / Tables from Sources [ENG-TASK-15] #27907

Merged

merged 11 commits into MaterializeInc:main from subsource-deprecation-blue-green on Sep 9, 2024

Conversation

@rjobanp
Contributor

@rjobanp rjobanp commented Jun 26, 2024


Member

@benesch benesch left a comment

Thank you for writing this up, @rjobanp! The first five sections (problem, solution, success criteria, out of scope, solution proposal) are really clear and well done.

I'm not the expert on the source ingestion pipeline, so I only lightly skimmed the implementation plan and don't have useful feedback to offer there. My review of that section focused on the UX considerations (i.e., the specific SQL syntax and the migration path for the _progress relations).

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from c257333 to ee12ab6 on June 27, 2024 14:13
@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from bfc8a46 to 9f301de on July 1, 2024 18:06
@rjobanp rjobanp marked this pull request as ready for review July 1, 2024 18:07
@rjobanp rjobanp requested review from bosconi and a team July 1, 2024 18:08
Member

@bosconi bosconi left a comment

Looks great! The phased approach seems well worth the extra steps to reduce risk.

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from 9f301de to a820cec on July 16, 2024 19:45
@rjobanp rjobanp changed the title storage: Design doc for subsource deprecation & blue-green source schema changes storage: Design doc for Source Versioning / Tables from Sources Jul 17, 2024
@morsapaes
Contributor

Migrating a private discussion from Slack around backfilling behavior for source tables:

Something I was mulling over after that user thread:
I expect users to ask us for a solution that allows them to not discard all state of the v0 source table in v1. For example, if they have a 7-day retention period in Kafka and create a source table v0, then after 7 days need to evolve the schema and create a source table v1, the blue/green process would cause all the green downstream objects to hydrate based on the new data only.
Naively, I'd say this looks like a UNION (?) between v0 and v1, but... we might need to think about how to cleanly approach this (probably via dbt).
We've discussed keeping historical state around for in-place schema changes (i.e. adding new columns, but possibly keeping around the data for old columns that were previously ingested even if they're dropped), but here we're just assuming users are okay bootstrapping their source and its dependencies from scratch.

cc @sdht0, who separately also brought this up in conversation with @rjobanp.
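For concreteness, a minimal sketch of the UNION idea, with hypothetical table and column names: an `events_v0`/`events_v1` pair where v1 added a `bar` column, and an `event_ts` column used as the cutover predicate. This is an illustration of the idea, not a committed design.

```sql
-- Hypothetical: keep pre-cutover history from the v0 table and take
-- everything from the cutover point onward from the v1 table. The cast
-- NULL pads historical rows for the column that only exists in v1.
CREATE VIEW events AS
SELECT key, payload, NULL::text AS bar
FROM events_v0
WHERE event_ts < TIMESTAMP '2024-07-01'
UNION ALL
SELECT key, payload, bar
FROM events_v1
WHERE event_ts >= TIMESTAMP '2024-07-01';
```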

@rjobanp
Contributor Author

rjobanp commented Jul 18, 2024

I expect users to ask us for a solution that allows them to not discard all state of the v0 source table in v1.

Thanks for including that here, @morsapaes! It's a great question and will certainly be important for making this feature usable in real-world scenarios.

My initial thought is that we should try to backfill the new v1 source table using the usual snapshot functionality present in our sources, though this would only be able to make data available at timestamps that are still preserved in the upstream system's replication log (e.g. Postgres WAL, MySQL binlog, Kafka topic). Hopefully this will be acceptable for most use cases, but it's worth thinking about whether there are better options (such as the union idea you've proposed).

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from a820cec to 46fd1e1 on July 22, 2024 15:15
@rjobanp rjobanp self-assigned this Jul 23, 2024
@bosconi bosconi changed the title storage: Design doc for Source Versioning / Tables from Sources storage: Design doc for Source Versioning / Tables from Sources [ENG-TASK-15] Jul 24, 2024
github-actions bot commented Aug 6, 2024

All contributors have signed the CLA. (Posted by the CLA Assistant Lite bot.)

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from dfa419f to a26c97a on August 8, 2024 16:21
Contributor

@petrosagg petrosagg left a comment

Thanks for writing this up!

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch 2 times, most recently from ec798db to 66a76ac on August 13, 2024 19:59
accomplished during the process of replacing _subsources_ with _tables_ (see
solution proposal below).

- Dealing with Webhook sources. These are fundamentally different than the other
Contributor

We'll move this to within scope and get help from Adapter with porting webhook sources into the new model. Added a placeholder task to #28653.

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from d945203 to 308f107 on August 19, 2024 17:59
from 'normal' tables that are written to directly by users. The value of this will
be one of `'direct'` or `'source'`.

`data_source_id`: a nullable column that contains the `id` of the data source,
Member

"data source" isn't a term we've used before. My preference would be to stick with a plain ol' "source ID." Are you worried that that would be too vague?

Contributor Author

My thought was that we may also have tables whose data comes from other 'data sources' in the future -- e.g. if we supported CREATE TABLE .. FROM WEBHOOK or CREATE TABLE .. FROM FIVETRAN, etc. -- such that I didn't want to tie this column directly to our existing concept of sources. If we feel that other stuff isn't likely to happen, I can switch it to just source_id.

Member

Ah! Ok. Will respond in the other thread above (#27907 (comment)) to avoid forking the discussion, since they've converged.
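For illustration, a hedged sketch of how the proposed columns might be used once they exist; the `data_source_id` name (vs. plain `source_id`) is exactly what is being debated in this thread, so treat everything here as hypothetical:

```sql
-- Hypothetical: list tables whose data is ingested from a source,
-- joining the proposed nullable column back to mz_sources.
SELECT t.name AS table_name, s.name AS source_name
FROM mz_catalog.mz_tables AS t
JOIN mz_catalog.mz_sources AS s ON t.data_source_id = s.id
WHERE t.type = 'source';
```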

- `namespace`: Any namespace that this reference object exists in, if applicable. For Postgres
this would be `<database>.<schema>`, for MySQL this would be `<schema>`, and for Kafka this
would be NULL.
- `last_updated_at`: The timestamp of when this reference was fetched from the upstream system.
Member

In other tables we simply use updated_at (see e.g. mz_cluster_replica_statuses) for the "last update" time.

Contributor

+1


The proposal for the 2nd option (a new system table):

Introduce a `mz_catalog.mz_available_source_references` with the following columns:
Member

Not at all married to this, but in mz_source_statistics we use "known" (as in offset_known) to represent the idea of things that are known to Materialize but may not be perfectly up to date.

Suggested change:

```diff
-Introduce a `mz_catalog.mz_available_source_references` with the following columns:
+Introduce a `mz_catalog.mz_known_source_references` with the following columns:
```

Contributor

Also for consistency, all our source-related tables start with mz_source, so we could roll with mz_source_references. Do we need an additional qualifier?

Member

We do have precedent for a word before mz_source, like mz_kafka_sources, mz_postgres_sources, etc.

Member

(I don't mind mz_source_references at all though.)
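For illustration, a hedged sketch of querying the proposed system table, using the `mz_source_references` name floated above, the columns from the design excerpt (with `updated_at` per the naming feedback), and an assumed `source_id` column; none of these names are final:

```sql
-- Hypothetical: show which upstream objects a given source could ingest.
SELECT name, namespace, updated_at
FROM mz_catalog.mz_source_references
WHERE source_id = 'u42'  -- hypothetical source ID
ORDER BY namespace, name;
```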


@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from 6bc6d06 to b268029 on August 21, 2024 20:55
Comment on lines 392 to 410
Webhook sources will be migrated to the new model at the SQL level, to allow for
'schema changes' and multiple outputs using the same `CREATE TABLE .. FROM SOURCE`
statement syntax as other sources.
Since the `encoding` and `envelope` options will be moved to that table statement,
this will also allow a webhook request to be decoded differently for each table.

Under the hood, we will continue to have `environmentd` receive webhook requests
and write directly to a persist collection. However, this code will be modified
to do just request validation and then write the _raw_ request body as bytes,
along with a map of all headers, to the persist collection without doing any decoding.

We will then create a new `Persist Source` that operates exactly like all other source
types such that the storage layer renders a dataflow and the source operator reads
from an existing persist collection -- in this case the 'raw' webhook collection being
written to by `environmentd`. This `Persist Source` can have multiple `SourceExports`
like other sources, and each export can define its own `encoding` and `envelope` processing
to convert the raw bytes of each request into the appropriate relation schema.
The progress collection for the webhook source will essentially be the frontier of
the underlying 'raw' persist collection that each request is written to.
Member

@benesch benesch Aug 28, 2024

I think I understand the commonality we're trying to extract from webhook sources here, but I worry it won't be that valuable in practice. It's hard for me to imagine a real-world use case for changing the body format, for example.

Retroactively adding new headers seems more valuable, but I'm worried it's actually quite surprising if we store the full raw HTTP requests for all time, even though that raw collection isn't visible to users. For example we document that you can exclude headers to omit sensitive information: https://materialize.com/docs/sql/create-source/webhook/#excluding-header-fields

But with the proposed design, that sensitive information will sit secretly inside Materialize until the source is dropped—unless someone reveals it with an ALTER SOURCE command.

The migration here for existing webhook sources seems unviable, too, since we don't have the raw HTTP bytes saved for existing webhook sources.

Is there something lighter weight we could do here instead? Something that moves the ball forward without fully aligning with the new subsources model, while also not painting ourselves into a corner? I think the most pressing concern is separating out the webhook source object from the webhook table object. But I think you could reasonably say that, as a special case, 1) envelopes and encodings are a property of the webhook source, not the subsource, and 2) webhook subsources cannot be dropped/added.

Backing up a level, I think what folks actually want with webhook sources is the ability to "write at" webhook tables. Much more so than with other sources. Usually with webhooks you're totally happy to tear down the existing webhook source and make a new one with the new configuration (new headers, new CHECK, new body format, whatever) ... if only you could backfill the new webhook source with the fixed up data from the old webhook source!

```sql
CREATE SOURCE og_webhook FROM WEBHOOK ... INCLUDE HEADER 'foo' AS foo;

-- Shoot, I need another header going forward!

CREATE SOURCE webhook2 FROM WEBHOOK ... INCLUDE HEADER 'foo' AS foo, INCLUDE HEADER 'bar' AS bar;

-- One-time backfill

INSERT INTO webhook2 SELECT body, foo, 'unknown' AS bar FROM og_webhook;
```

Member

I'm short on time, so that might not be entirely coherent. Happy to chat more about this live next time we're both in the office, @rjobanp, or during this week's S&S sync.

Contributor Author

Those are all valid concerns and I am definitely worried about the migration of existing webhook sources to the model proposed here.
While we could do something simpler, I worry that we will continue to have special-casing for webhook sources, and will need additional special-casing for CREATE TABLE .. FROM SOURCE statements that refer to webhook sources, unless we do the work now to align them to the new model. This isn't just confusing for us in the codebase, but also for users and our documentation.

Following up on our live discussion -- if the main concern is that there might be sensitive data in the raw request bytes we write to the first persist collection, I wonder if we can instead have the 'Persist Source' ask persist to truncate the 'raw' collection as it ingests data into the downstream 'tables', so that we remove data just after it's been decoded. This would still fit the new model, but would mean that any new 'table' added to an existing webhook source couldn't re-ingest previous requests; all future requests would still get decoded individually by each table from that point forward.
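For concreteness, a hedged sketch of what the model proposed here might look like at the SQL level; the exact option syntax is hypothetical and was not pinned down in this thread:

```sql
-- Hypothetical: the webhook source only validates requests and stores the
-- raw body and headers; each table applies its own decoding independently.
CREATE SOURCE my_webhook FROM WEBHOOK;

CREATE TABLE events_json FROM SOURCE my_webhook (FORMAT JSON);
CREATE TABLE events_text FROM SOURCE my_webhook (FORMAT TEXT);
```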

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from cd2345e to e8cb2ae on August 30, 2024 16:58
@benesch
Member

benesch commented Aug 30, 2024

New webhook design LGTM! 🙇🏽

@rjobanp rjobanp force-pushed the subsource-deprecation-blue-green branch from e8cb2ae to 3705aef on September 6, 2024 20:32
@rjobanp
Contributor Author

rjobanp commented Sep 9, 2024

This PR has gotten fairly big, and since we've agreed on all the open discussion points, I'm going to merge and apply any future updates to the doc in separate PRs.

@rjobanp rjobanp merged commit e2c7b1c into MaterializeInc:main Sep 9, 2024
11 checks passed
@rjobanp rjobanp deleted the subsource-deprecation-blue-green branch September 9, 2024 12:16
@github-actions github-actions bot locked and limited conversation to collaborators Sep 9, 2024