Currently, the pipeline waits until the very end of the process to save its outputs and commit them to the repo. On one hand, this can be a good thing because if anything goes wrong during the process, it won't overwrite existing data (with potentially corrupted/incorrect data).
On the other hand, at some point in the future the pipeline may take an extremely long time to run. For example, we may want to get Google Analytics data chunked by week or even day, which would require an order of magnitude more queries and, because of rate limiting, far more time to run. If that happens, we'll want to save progress along the way somehow, in case the workflow times out, gets cancelled, or fails partway through for some other reason.
Not sure of the best way to implement this yet.
First thought was to modify the `queryMulti` function to save results every time a query finishes. The API calls and other network requests take up 99% of the running time here, and all of that happens through that util func. This could potentially get hairy due to the async nature of the queries and race conditions. When restarting the pipeline with partial previous results, the order and data of the new queries might not match the old ones, so how do we know which ones we've already done? We'd need to make some modifications so that each query has a unique ID (I think most already have one, but not all).
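Roughly what I'm picturing (just a sketch, not the actual `queryMulti` code; names like `queryMultiWithProgress` and `partial-results.json` are made up):

```ts
import { readFile, writeFile } from "fs/promises";

type Query<T> = { id: string; run: () => Promise<T> };

// Hypothetical partial-results file, loaded on startup so a restarted run
// can skip queries that already finished.
const PROGRESS_FILE = "partial-results.json";

async function loadProgress(): Promise<Record<string, unknown>> {
  try {
    return JSON.parse(await readFile(PROGRESS_FILE, "utf8"));
  } catch {
    return {}; // no previous partial run
  }
}

export async function queryMultiWithProgress<T>(queries: Query<T>[]) {
  const results = await loadProgress();

  for (const query of queries) {
    // skip anything a previous (partial) run already completed
    if (query.id in results) continue;
    results[query.id] = await query.run();
    // save after every finished query; doing the writes sequentially here
    // (rather than from concurrent callbacks) avoids racing on the file
    await writeFile(PROGRESS_FILE, JSON.stringify(results, null, 2));
  }

  return results;
}
```

The sketch runs queries one at a time, which sidesteps the race conditions but gives up whatever concurrency `queryMulti` currently has; a real version would probably need to debounce or lock the writes instead.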
A simpler way might be to use the TTL cache util func I had previously written. I deleted it, but it still exists in the commit history. That might be simpler to integrate into the `queryMulti` function. The nice thing about this is that the caching would essentially be invisible to the user: if a recent run stopped prematurely halfway through, a new run would very quickly speed through the first half with cached results. Another nice thing is that it automatically handles identifying entries by hashing the arguments passed to the function along with the function body.
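Something like this captures the idea (a sketch only; the actual deleted util is in the commit history and may differ in the details):

```ts
import { createHash } from "crypto";
import { mkdir, readFile, writeFile } from "fs/promises";

const CACHE_DIR = ".query-cache"; // hypothetical location
const TTL_MS = 24 * 60 * 60 * 1000; // e.g. keep entries for a day

// Wrap an async function so repeat calls with the same args (and the same
// function body) within the TTL return the cached result from disk.
export function withTtlCache<A extends unknown[], R>(
  func: (...args: A) => Promise<R>
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    // key = hash of the function source plus its arguments, so editing the
    // function or passing different inputs automatically misses the cache
    const key = createHash("sha256")
      .update(func.toString())
      .update(JSON.stringify(args))
      .digest("hex");
    const file = `${CACHE_DIR}/${key}.json`;

    try {
      const { timestamp, value } = JSON.parse(await readFile(file, "utf8"));
      if (Date.now() - timestamp < TTL_MS) return value as R;
    } catch {
      // cache miss or unreadable entry; fall through and recompute
    }

    const value = await func(...args);
    await mkdir(CACHE_DIR, { recursive: true });
    await writeFile(file, JSON.stringify({ timestamp: Date.now(), value }));
    return value;
  };
}
```

Wrapping the network-request helpers that `queryMulti` calls in something like this would let a rerun after a failed or cancelled workflow fly through the parts that already succeeded.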
Sean brought this up in our last meeting.