S3 MultipartWriter capability to resume interrupted upload #829

Open

villasv opened this issue Jul 29, 2024 · 2 comments

villasv commented Jul 29, 2024

Problem description

I'm using smart_open to write a big file directly into S3, and it has been working great for that. Thank you for the great package!

The issue is that sometimes the Python process gets killed and I'm left with an incomplete multi-part upload. At the moment I have an S3 bucket lifecycle rule that cleans these up after a day, but ideally I would like to resume the upload instead of starting from scratch, making the most of multi-part upload and saving some compute and storage.

After a quick glance at MultipartWriter.__init__, this looks impossible, as self._client.create_multipart_upload is always invoked. I wanted to gauge how big an effort it would be to support such a scenario if I could somehow supply an UploadId to be used.
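
For context, the server-side half of the state does survive the crash. A rough sketch with plain boto3 (bucket and key names are placeholders) showing how the orphaned upload and its already-finished parts can be listed:

import boto3

# List the interrupted multi-part uploads left behind in a bucket, together
# with the parts that did make it to S3. "my-bucket" is a placeholder.
s3 = boto3.client("s3")

response = s3.list_multipart_uploads(Bucket="my-bucket")
for upload in response.get("Uploads", []):
    print(upload["Key"], upload["UploadId"])
    parts = s3.list_parts(
        Bucket="my-bucket", Key=upload["Key"], UploadId=upload["UploadId"]
    )
    for part in parts.get("Parts", []):
        print("  part", part["PartNumber"], part["Size"], part["ETag"])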

Versions

Please provide the output of:

macOS-14.5-arm64-arm-64bit
Python 3.10.12 (main, Sep 16 2023, 13:51:00) [Clang 15.0.0 (clang-1500.0.40.1)]
smart_open 7.0.4

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly (I tried!)
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software
ddelange (Contributor) commented Sep 16, 2024

The issue is that sometimes the Python process gets killed and I'm left with an incomplete multi-part upload.

FWIW, if the process receives SIGTERM, the with-statement will exit with an exception and thus the terminate() method will be called to abort the multipart upload. So I guess we're only talking about SIGKILL here (e.g. Kubernetes/Docker killing a subprocess to prevent container PID 1 from running out of memory).
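
For completeness, a minimal sketch of that setup, assuming a SIGTERM handler that raises so the with-statement actually unwinds (produce_chunks is a stand-in for the real data source):

import signal

import smart_open


def on_sigterm(signum, frame):
    # Turn SIGTERM into an exception so the with-statement below exits and
    # smart_open gets the chance to abort the multi-part upload.
    raise SystemExit(128 + signum)


signal.signal(signal.SIGTERM, on_sigterm)

with smart_open.open("s3://my-bucket/big-file.bin", "wb") as fout:
    for chunk in produce_chunks():  # stand-in for the real data source
        fout.write(chunk)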

mpenkov (Collaborator) commented Sep 17, 2024

From my experience, recovering failed uploads is too much of a pain in the arse. It is far simpler to restart the upload from scratch. Of course, it's a waste of CPU time and bandwidth, but it's better than a waste of developer and maintainer effort.

That said, if you did want to recover a failed upload, you'd need to reconstruct the following:

  1. The UploadId
  2. The IDs of successfully uploaded parts
  3. The position in the input byte stream from which to resume (so what got uploaded successfully, minus what was in the internal buffers at the time of the crash)

Of these, 3 is the trickiest, because it's not explicitly exposed anywhere. You can't infer 3 from 2, because the uploaded parts may be compressed differently from the input stream. Theoretically, you could have smart_open dump this information periodically to e.g. /tmp/smart_open/uploadid.json as it uploads each part. If you're interested in a PR to do this, I'm open to that.
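
Such a checkpoint might look roughly like this (the path and field names are made up for the sake of discussion):

import json
import os

# Illustrative only: written/updated after each successful part upload.
checkpoint = {
    "bucket": "my-bucket",
    "key": "big-file.bin",
    "upload_id": "ExampleUploadId",                    # 1. the UploadId
    "parts": [{"PartNumber": 1, "ETag": '"abc123"'}],  # 2. successfully uploaded parts
    "input_offset": 5368709120,                        # 3. bytes of input consumed so far
}
os.makedirs("/tmp/smart_open", exist_ok=True)
with open("/tmp/smart_open/%s.json" % checkpoint["upload_id"], "w") as f:
    json.dump(checkpoint, f)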

The somewhat ugly part would be implementing resume functionality as part of the API. How would this look in Python code?
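
Purely as a strawman, and with the caveat that none of these transport_params exist in smart_open today, resuming could look something like this:

import json

import smart_open

# Read the checkpoint sketched above; every parameter name below is
# hypothetical and only meant to illustrate the shape of a resume API.
with open("/tmp/smart_open/ExampleUploadId.json") as f:
    checkpoint = json.load(f)

transport_params = {
    "multipart_upload_id": checkpoint["upload_id"],  # hypothetical
    "multipart_parts": checkpoint["parts"],          # hypothetical
}

with open("big-file.bin", "rb") as fin, smart_open.open(
    "s3://my-bucket/big-file.bin", "wb", transport_params=transport_params
) as fout:
    fin.seek(checkpoint["input_offset"])  # item 3: where to resume in the input
    while True:
        chunk = fin.read(64 * 1024 * 1024)
        if not chunk:
            break
        fout.write(chunk)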
