Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clickhouse - Allow schema management for multiple databases within the same cluster #515

Closed
wants to merge 3 commits into from

Conversation

AndresElizondo
Copy link

@AndresElizondo AndresElizondo commented Jan 19, 2024

This PR refers to the idea proposed in #514.
This is a quick implementation, testing is missing, implementation can be cleaner.
Looking to get some feedback before investing more time.

Does this feature seem reasonable? How should it be different?

To use this feature:
.env

DATABASE_URL="clickhouse://username:password@hostname:port/database_name?schemas=bronze,silver,gold"

On the terminal:

go build
./dbmate --env-file ./.env --env DATABASE_URL dump

Andrés Elizondo added 3 commits January 18, 2024 17:27
@AndresElizondo
Copy link
Author

Known issues:

  • The schemas parameter is within the code structure for clusterParameters. This should change.
  • The way identifiers are quoted.
  • Testing needs to be added.

@amacneil
Copy link
Owner

The feature/issue seems reasonable to solve. I'm not an expert at clickhouse, but if this is a reasonably common setup then we should support it. Ideally try to align the schemas parameter name with any other clickhouse libraries (is there any prior art you leaned on here).

Will let @dossy do a code review.

@dossy
Copy link
Collaborator

dossy commented Jan 19, 2024

@AndresElizondo Thanks for starting the discussion and drafting this PR.

It doesn't appear that ClickHouse has the concept of separate schemas within a database, but rather multiple databases, from what I can tell.

This is different than other DBMS's that have databases that contain schemas that contain database objects (tables, views, stored procedures, etc.).

If you're looking to dump multiple databases, I would recommend something like the following approach:

$ for db in bronze silver gold; do
    dbmate --url clickhouse://username:password@hostname:port/$db --schema-file db/${db}.sql dump
  done

I suppose being able to specify a list of database names and having dbmate iterate through them and dump everything into one single output file offers a little convenience, but you could just concatenate the output files after the for loop if you wanted a single file, e.g., cat db/bronze.sql db/silver.sql db/gold.sql > db/schema.sql.

I'd like to understand whether the for-loop approach I outlined would satisfy your requirements, and if not, why not. And if not, and we proceed with this PR, then at a minimum we should get tests in place, both for single-instance ClickHouse and a ClickHouse cluster, as there are differences in the dumped schema for cluster configurations.

@AndresElizondo
Copy link
Author

AndresElizondo commented Jan 19, 2024

Hey @dossy, thanks for the quick response!

In ClickHouse, the terminology changes, I do not understand the reasoning as it's not explained in the docs.
A ClickHouse server/instance (analogous to a Postgres database) contains any number of instance-wide configs, users, credentials and databases (analogous to Postgres schemas) which in turn contain database objects (tables, views, etc).
Even in terms of SQL syntaxis they're equal:
PostgreSQL:

SELECT * FROM <schema>.<table_name>;

ClickHouse SQL:

SELECT * FROM <database>.<table_name>;

The current behavior in dbmate is like having it work on Postgres for a single schema.

The main issue with having multiple folders and running dbmate up is having migrations split into folders. There can be dependencies between these folders and it would mean we'd need to run part of bronze until it hits a dependency on silver then try to run silver and see if it runs into any dependency issues. Then finish running bronze and so on. It is a messy process and it's actually what we're trying to move away from. Our intention is to consolidate all relevant migrations into a single source of truth that can be run easily without dependency issues.

An alternative would be to have all the migrations in a single folder and add a bash script as you mentioned exclusively for the dump process, however, when doing dbmate up we'd need to concatenate a ./dbmate_dump.sh or similar process to get the correct schema.sql file.

Because of this, I'd argue that even though using the bash script would work, it would be less than ideal. I believe other users will see as much value in this as they do for the search_path parameter in Postgres.

Do you agree with my reasoning?
If so, we can start working on testing, fixing known issues as well as changing the wording use in the code to make it intuitive.

@dossy
Copy link
Collaborator

dossy commented Jan 19, 2024

Because of this, I'd argue that even though using the bash script would work, it would be less than ideal. I believe other users will see as much value in this as they do for the search_path parameter in Postgres.

Do you agree with my reasoning? If so, we can start working on testing, fixing known issues as well as changing the wording use in the code to make it intuitive.

My understanding of your reply is that, while you could accomplish everything you need without a change to dbmate, you'd like to implement the change as a quality-of-life improvement?

At a minimum, if you could contribute an update to the documentation and the tests to cover this new functionality, I'm not opposed to the change. But, if my understanding is correct, that it doesn't implement any functionality that introduces a capability not previously present, I'm not sure how useful this will be to anyone else.

As long as it doesn't break backwards-compatibility so anyone who doesn't need this isn't affected, I don't see the harm in implementing it. But, I'd like to see documentation and tests updated to cover the change, at least.

@AndresElizondo
Copy link
Author

Hey guys, after further testing we've come across a blocker. We need multi-query support and my understanding is that dbmate is not interested in adding this in any way.
For this reason, we will move away from using dbmate and will close this PR.
Apologies for taking up your time

@amacneil
Copy link
Owner

amacneil commented Jan 22, 2024

No problem.

For the record on multi-query: We're not against adding it, but I don't want to maintain our own SQL lexer/parser, and I'm not aware of any well-maintained go libraries for that (although I have not researched it recently). If there is an existing library which can appropriately handle all the dialects of SQL we use, it would be much easier to implement in dbmate.

@AndresElizondo
Copy link
Author

AndresElizondo commented Jan 22, 2024

I can't find anyone doing any complex SQL parsing other than Vitess.io for MySQL. But even then it seems they're splitting by ; in a more elaborate way (a ; within a quotes would not be considered for a split).
We're likely going to use golang-migrate. They have an optional flag that will split on ; and they tell the user about the limitations.

They mention:

The Clickhouse driver does not natively support executing multipe statements in a single query. To allow for multiple statements in a single migration, you can use the x-multi-statement param. There are two important caveats:
- This mode splits the migration text into separately-executed statements by a semi-colon ;. Thus x-multi-statement cannot be used when a statement in the migration contains a string with a semi-colon.
- The queries are not executed in any sort of transaction/batch, meaning you are responsible for fixing partial migrations.

I believe this behavior is a good balance for both maintainers and users
If this sounds like something you'd be interested in doing, I can help out with the PR.

@amacneil
Copy link
Owner

Appreciate the research - sounds like golang-migrate is a good option.

I would rather not introduce a naive splitting on semicolon (even if optional). I agree that it could be a useful tradeoff in some situations, but I feel like inevitably someone is going to put a semicolon in the middle of a string and it will break, leading to more issues/bugs being filed. We have pretty minimal maintainers available for dbmate so are trying to keep the potential bug surface area low.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants