Improving upserts #79
base: master
Conversation
Adding a paragraph on schemas in PostgreSQL.
First, a query is executed to check whether the file already exists. If not, it is inserted. If it does exist, the file is updated. The current code ran into issues because the transaction was closed after the failure. In the future it might be better to use a SQLAlchemy Session, which handles upserts under the hood.
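For illustration, a minimal sketch of that flow, reusing the names from save_file (illustrative only, not the exact diff in this PR):

if not file_exists(db, user_id, path):
    # No row yet for (user_id, parent_name, name): insert a new record.
    db.execute(
        files.insert().values(
            name=name,
            user_id=user_id,
            parent_name=directory,
            content=content,
        )
    )
else:
    # A row already exists: overwrite its content instead.
    db.execute(
        files.update()
        .where(files.c.user_id == user_id)
        .where(files.c.parent_name == directory)
        .where(files.c.name == name)
        .values(content=content)
    )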
hey @jorisgillis. Thanks for the PR! I'm not sure that the updated strategy here is the right way to handle conflicts on updates. Do you have any more info on the bad state that your setup got into?
pgcontents/query.py
Outdated
@@ -510,7 +510,7 @@ def save_file(db, user_id, path, content, encrypt_func, max_size_bytes):
     )
     directory, name = split_api_filepath(path)
     with db.begin_nested() as savepoint:
-        try:
+        if not file_exists(db, user_id, path):
I'm nervous about handling this by checking file_exists before executing the insert, since that allows for a possible race condition where another transaction inserts a file in between the exists check and the insert. That worry was the reason that this operation is currently implemented as "try to insert, but handle unique violations if they occur". Do you have any more info on your transaction that ended up in a weird state?
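Roughly, that strategy looks like this (a sketch of the pre-PR approach; the exception-handling details here are assumptions rather than the exact query.py code):

from sqlalchemy.exc import IntegrityError

with db.begin_nested() as savepoint:
    try:
        db.execute(
            files.insert().values(
                name=name,
                user_id=user_id,
                parent_name=directory,
                content=content,
            )
        )
    except IntegrityError:
        # Another transaction inserted the same path first: roll back to the
        # savepoint so the connection is usable again, then update instead.
        savepoint.rollback()
        db.execute(
            files.update()
            .where(files.c.user_id == user_id)
            .where(files.c.parent_name == directory)
            .where(files.c.name == name)
            .values(content=content)
        )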
Also, since this code was written, postgresql and sqlalchemy have added support for proper upserts directly in sql (see https://docs.sqlalchemy.org/en/14/dialects/postgresql.html#insert-on-conflict-upsert, for instance). Using that would probably be the "right" way to fix this if there's a problem with the current error handling strategy.
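For example, something along these lines (a sketch based on the linked docs; the constraint name is taken from the traceback quoted later in this thread):

from sqlalchemy.dialects.postgresql import insert

stmt = insert(files).values(
    name=name,
    user_id=user_id,
    parent_name=directory,
    content=content,
)
db.execute(
    stmt.on_conflict_do_update(
        constraint="uix_filepath_username",
        # On conflict, keep the existing row but take the new content from
        # the proposed insert (PostgreSQL's EXCLUDED pseudo-row).
        set_={"content": stmt.excluded.content},
    )
)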
I'm new to SQLAlchemy, so forgive my ignorance.
I was under the assumption that the select and insert/update would happen in a single transaction, because of the db.begin_nested() in the with statement. Hence, if two transactions are trying to save (a new notebook) simultaneously, one would win, and the other would need to retry the save and get updated in the second attempt.
I get the following exceptions when using pgcontents version 0.6.0:
First, the insert fails
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "uix_filepath_username"
DETAIL: Key (user_id, parent_name, name)=(my_awesome_username, /, Untitled.ipynb) already exists.
Then
sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager. Please complete the context manager before emitting further commands.
If I understand it correctly, that means that the transaction started by the db.begin_nested() has failed in some way and a new transaction needs to be started?
On the upsert: very interesting. My initial attempt at using on_conflict_do_update fails with "Insert does not have an attribute on_conflict_do_update". This is the code fragment:
res = db.execute(
files.insert()
.values(
name=name,
user_id=user_id,
parent_name=directory,
content=content,
)
.on_conflict_do_update(constraint="uix_filepath_username", set_={
"name": name,
"user_id": user_id,
"parent_name": directory,
"content": content
})
)
This is true, these will both happen in a single transaction. However, neither that nor having them happen in separate transactions prevents the race I'm concerned about. Suppose that there are two writers, A and B, that are each trying to write a new notebook to the same location. The following sequence of events is possible:

1. A checks whether the file exists; it does not.
2. B checks whether the file exists; it does not.
3. A inserts the file.
4. B inserts the file and fails with a unique constraint violation, which is exactly the error the check was meant to avoid.

This is a pretty common race for insert-or-update operations, and the usual way to fix it (without having dedicated "upsert" operations) is to try the operation and handle the error if it occurs, which is what this code was trying to do previously. I'm not sure what's going wrong, though. The error suggests that we might need to exit the begin_nested() context manager (i.e. roll the savepoint back) before emitting the update.
I think the issue here is that files.insert() builds SQLAlchemy's generic Insert, which doesn't have on_conflict_do_update; you need the PostgreSQL-specific construct instead:

# This creates an `Insert` object that supports postgres-specific functionality.
from sqlalchemy.dialects.postgresql import insert

db.execute(insert(files).values(...).on_conflict_do_update(...))
Using the upsert feature (on_conflict_do_update) of SQLAlchemy and PostgreSQL, instead of relying on exception handling around an insert statement. Upsert statements are supported since PostgreSQL 9.5: https://www.postgresql.org/about/press/presskit95/en/ https://www.postgresql.org/docs/current/sql-insert.html
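For reference, the underlying SQL that PostgreSQL 9.5 added looks roughly like this for the files table (a sketch; the column list is illustrative and the constraint name comes from the traceback above):

from sqlalchemy import text

db.execute(
    text(
        "INSERT INTO files (user_id, parent_name, name, content) "
        "VALUES (:user_id, :parent_name, :name, :content) "
        "ON CONFLICT ON CONSTRAINT uix_filepath_username "
        "DO UPDATE SET content = EXCLUDED.content"
    ),
    {"user_id": user_id, "parent_name": directory, "name": name, "content": content},
)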
Thanks for the suggestion. I got the upsert working with SQLAlchemy. What do you think?
@jorisgillis these changes look good to me! Unfortunately, it appears that I no longer have permission to merge PRs to this repo. Quantopian, the company for whom I originally wrote this plugin, no longer exists, so unfortunately many of their open source repos are in a bit of a state of flux. I'll see if I can contact someone about getting permissions back on this repo.
In my setup I ran into problems with updates of notebooks. Because the insert failed, the transaction ended up in a weird state; as a consequence, the update was not executed.
This PR changes the code to first check whether a file exists or not. If it does not exist, an insert is performed. If the file does exist, an update query is executed.
Also added a paragraph to the README to explain why some tables can become "invisible" when inspecting the database, due to the table being in a different schema (namespace).
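For example, with SQLAlchemy the tables of a non-default schema can be listed explicitly (the connection string and schema name below are placeholders, not values from this repo):

from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:password@localhost/notebooks")  # placeholder DSN
inspector = inspect(engine)

# Tables on the default search_path (usually the "public" schema)...
print(inspector.get_table_names())
# ...versus tables that live in another schema and are otherwise "invisible".
print(inspector.get_table_names(schema="some_other_schema"))  # placeholder schema name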