
Save partial progress while running pipeline #25

Open
vincerubinetti opened this issue Sep 25, 2024 · 1 comment
Assignees
vincerubinetti
Labels
enhancement New feature or request

Comments

@vincerubinetti
Collaborator

Sean brought this up in our last meeting.

Currently, the pipeline waits until the very end of the process to save its outputs and commit them to the repo. On one hand, this can be a good thing because if anything goes wrong during the process, it won't overwrite existing data (with potentially corrupted/incorrect data).

On the other hand, at some point in the future the pipeline may take an extremely long time to run. We may want to get Google Analytics data chunked by week or even day, for example, which would require an order of magnitude more queries and time due to rate limiting. If that happens, we'll want to save progress along the way somehow, in case the workflow times out, gets canceled, or fails for some other reason.

Not sure of the best way to implement this yet.

First thought was to modify the queryMulti function to save every time a new query finishes. The API calls and other network requests are what take up 99% of the running time here, and all of that happens through that util func. This could get hairy due to the async nature of it all and race conditions. When restarting the pipeline with partial previous results, the order and data of the new queries might not match the old ones, so how would we know which ones we've already done? We'd need to make some modifications so that each query has a unique ID (I think most already have one, but not all).
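
A rough sketch of what that incremental saving could look like. This is not the actual queryMulti code; the query shape, `id` field, `savePartial` helper, and output path are all hypothetical:

```ts
import { writeFile } from "fs/promises";

/** hypothetical query shape; assumes each query carries a unique id */
type Query<Result> = { id: string; run: () => Promise<Result> };

/** in-memory record of results that have finished so far */
const partial: Record<string, unknown> = {};

/** persist partial results (output path is a made-up placeholder) */
const savePartial = () =>
  writeFile("partial-results.json", JSON.stringify(partial, null, 2));

/** run all queries, saving progress after each one settles */
export const queryMulti = <Result>(queries: Query<Result>[]) =>
  Promise.all(
    queries.map(async (query) => {
      /** skip queries already completed by a previous, interrupted run */
      if (query.id in partial) return partial[query.id] as Result;
      const result = await query.run();
      partial[query.id] = result;
      /**
       * naive: concurrent writes to the same file can interleave, which is
       * exactly the race-condition hairiness mentioned above; a single
       * write queue would be safer
       */
      await savePartial();
      return result;
    }),
  );
```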

A simpler way might be to use the TTL cache util func I had previously written. I deleted it, but it still exists in the commit history. That might be easier to integrate into the queryMulti function. The nice thing about this is the caching would essentially be invisible to the user: if a recent run stopped prematurely halfway through, a new run would very quickly speed through the first half with cached results. Another nice thing is it automatically handles identifying entries by hashing the arguments passed to the function and the function body.
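
For illustration, a minimal sketch of that kind of TTL cache, using Node's built-in crypto and fs modules. This is not the original deleted util; the `.cache` directory, TTL value, and names are assumptions:

```ts
import { createHash } from "crypto";
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "fs";

/** how long cache entries stay valid (one day, as an example) */
const TTL_MS = 24 * 60 * 60 * 1000;

/** wrap an async function so its results are cached on disk with a TTL */
export const cached = <Args extends unknown[], Result>(
  func: (...args: Args) => Promise<Result>,
) =>
  async (...args: Args): Promise<Result> => {
    /** identify entry by hashing the function body plus its arguments */
    const key = createHash("md5")
      .update(func.toString() + JSON.stringify(args))
      .digest("hex");
    mkdirSync(".cache", { recursive: true });
    const file = `.cache/${key}.json`;
    if (existsSync(file)) {
      const { timestamp, result } = JSON.parse(readFileSync(file, "utf8"));
      /** reuse result from a recent (possibly interrupted) run */
      if (Date.now() - timestamp < TTL_MS) return result as Result;
    }
    const result = await func(...args);
    writeFileSync(file, JSON.stringify({ timestamp: Date.now(), result }));
    return result;
  };
```

Wrapping a fetcher like `const getDataCached = cached(getData)` would then transparently skip work already done in a recent run.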

@vincerubinetti vincerubinetti self-assigned this Sep 25, 2024
@vincerubinetti vincerubinetti added the enhancement New feature or request label Sep 25, 2024
@vincerubinetti
Collaborator Author

https://temporal.io/
