
Save partial progress while running pipeline #25

Open
vincerubinetti opened this issue Sep 25, 2024 · 1 comment
Assignees
vincerubinetti
Labels
enhancement New feature or request

Comments

@vincerubinetti
Collaborator

Sean brought this up in our last meeting.

Currently, the pipeline waits until the very end of the process to save its outputs and commit them to the repo. On one hand, this can be a good thing because if anything goes wrong during the process, it won't overwrite existing data (with potentially corrupted/incorrect data).

On the other hand, at some point in the future the pipeline may take an extremely long time to run. We may want to get Google Analytics data chunked by week or even day, for example, which would require an order of magnitude more queries and time due to rate limiting. If that happens, we'll want to save progress along the way somehow, in case the workflow times out, gets canceled, or fails for some other reason.

Not sure of the best way to implement this yet.

First thought was to modify the queryMulti function to save every time a new query finishes. The API calls and other network requests are what take up 99% of the running time here, and all of that happens through that util func. This could get hairy due to the async nature of it all and race conditions. When restarting the pipeline with partial previous results, the order and data of the new queries might not match the old ones, so how would we know which ones we've already done? We'd need to make some modifications so that each query has a unique ID (I think most already have one, but not all).
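
A rough sketch of what that incremental saving could look like. This is not the actual queryMulti code; the query shape, `id` field, `savePartial` helper, and output path are all hypothetical:

```ts
import { writeFile } from "fs/promises";

/** hypothetical query shape; assumes each query carries a unique id */
type Query<Result> = { id: string; run: () => Promise<Result> };

/** in-memory record of results that have finished so far */
const partial: Record<string, unknown> = {};

/** persist partial results (output path is a made-up placeholder) */
const savePartial = () =>
  writeFile("partial-results.json", JSON.stringify(partial, null, 2));

/** run all queries, saving progress after each one settles */
export const queryMulti = <Result>(queries: Query<Result>[]) =>
  Promise.all(
    queries.map(async (query) => {
      /** skip queries already completed by a previous, interrupted run */
      if (query.id in partial) return partial[query.id] as Result;
      const result = await query.run();
      partial[query.id] = result;
      /**
       * naive: concurrent writes to the same file can interleave, which is
       * exactly the race-condition hairiness mentioned above; a single
       * write queue would be safer
       */
      await savePartial();
      return result;
    }),
  );
```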

A simpler way might be to use the TTL cache util func I had previously written. I deleted it, but it still exists in the commit history. That might be easier to integrate into the queryMulti function. The nice thing about this is the caching would essentially be invisible to the user: if a recent run stopped prematurely halfway through, a new run would very quickly speed through the first half with cached results. Another nice thing is it automatically handles identifying entries by hashing the arguments passed to the function and the function body.
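
For illustration, a minimal sketch of that kind of TTL cache, using Node's built-in crypto and fs modules. This is not the original deleted util; the `.cache` directory, TTL value, and names are assumptions:

```ts
import { createHash } from "crypto";
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "fs";

/** how long cache entries stay valid (one day, as an example) */
const TTL_MS = 24 * 60 * 60 * 1000;

/** wrap an async function so its results are cached on disk with a TTL */
export const cached = <Args extends unknown[], Result>(
  func: (...args: Args) => Promise<Result>,
) =>
  async (...args: Args): Promise<Result> => {
    /** identify entry by hashing the function body plus its arguments */
    const key = createHash("md5")
      .update(func.toString() + JSON.stringify(args))
      .digest("hex");
    mkdirSync(".cache", { recursive: true });
    const file = `.cache/${key}.json`;
    if (existsSync(file)) {
      const { timestamp, result } = JSON.parse(readFileSync(file, "utf8"));
      /** reuse result from a recent (possibly interrupted) run */
      if (Date.now() - timestamp < TTL_MS) return result as Result;
    }
    const result = await func(...args);
    writeFileSync(file, JSON.stringify({ timestamp: Date.now(), result }));
    return result;
  };
```

Wrapping a fetcher like `const getDataCached = cached(getData)` would then transparently skip work already done in a recent run.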

@vincerubinetti vincerubinetti self-assigned this Sep 25, 2024
@vincerubinetti vincerubinetti added the enhancement New feature or request label Sep 25, 2024
@vincerubinetti
Collaborator Author

https://temporal.io/
