ci: only *write ccache in "push to master" jobs #11661

Draft: wants to merge 2 commits into master


ochafik
Collaborator

@ochafik ochafik commented Feb 4, 2025

According to https://github.com/ggerganov/llama.cpp/actions/caches, we're "Approaching total cache storage limit" (88.08 GB of 10 GB used).

With this PR, instead of letting every branch write its branch-specific outputs to ccache (and probably overwrite each other through race conditions), we restrict cache writes to pushes to master (hopefully with less concurrency). I'm also proposing to expire the cache after 12h, but I'm not sure that's needed (the risk is that if there's no push for over 12h, nobody gets any ccache to read from).
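For illustration, here is a minimal sketch of what "read everywhere, write only on pushes to master" could look like for one build step. It assumes the workflows use hendrikmuhs/ccache-action and that its `save` input controls whether the cache is written at the end of the job; the `key` value is a placeholder, not the name used in the actual workflows.

```yaml
# Sketch of a single build step; the key below is a hypothetical per-job name.
- name: ccache
  uses: hendrikmuhs/ccache-action@v1.2
  with:
    key: ubuntu-latest-cmake
    # The cache is still restored on every run (PRs included),
    # but it is only saved when the run is a push to master.
    save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
```

Pull request runs can restore caches created on the default branch, so PR jobs would still read whatever the last master push saved.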

cc/ @slaren (follow up to #11516)

@ochafik ochafik changed the title ci: only write ccache in release jobs (but keep reading from them) ci: only *write ccache when pushing to master Feb 4, 2025
@ochafik ochafik changed the title ci: only *write ccache when pushing to master ci: only *write ccache in "push to master" jobs Feb 4, 2025
@slaren
Collaborator

slaren commented Feb 4, 2025

I don't think exceeding the cache size is necessarily a problem, that's expected, since caches are immutable and every commit adds a new set of caches. As long as the size of all the caches created in a single commit is a few times lower than the max total cache size, so that the cache for the latest master commit and the caches of open PRs are kept, it should be fine. Creating caches for PRs is desirable since it improves the build times of subsequent commits to the PR.

@github-actions github-actions bot added the devops improvements to build systems and github actions label Feb 4, 2025
@ochafik
Collaborator Author

ochafik commented Feb 4, 2025

> I don't think exceeding the cache size is necessarily a problem, that's expected, since caches are immutable and every commit adds a new set of caches. As long as the size of all the caches created in a single commit is a few times lower than the max total cache size, so that the cache for the latest master commit and the caches of open PRs are kept, it should be fine. Creating caches for PRs is desirable since it improves the build times of subsequent commits to the PR.

I'm wary of the following problems:

  • Cache eviction is currently random and likely drops an entire job's cache at a time, causing that job to randomly take 5+ more minutes for whoever runs it next. We'll likely never get to the fine-grained, per-file expiration setup in the other PR.
  • Different PRs whose jobs run in parallel may overwrite each other's updates to the shared cache (not sure what parallelism we have; I'd assume there are clusters of people with similar active hours).
  • We currently can't realistically cache the various heavy SDK install files that now contribute to some of the longest runs.

If we only cached the main branch, we could cache said SDK downloads (reducing the long tail), and PRs would get a cache hit rate that falls in proportion to the number of files they modified, with a predictable pattern. PRs with lots of header changes would pay a higher compilation price but would benefit from the long-tail SDK-installing jobs being much faster, and a possible majority of PRs (TBC) would still have a high, predictable cache hit rate.
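A rough sketch of how the SDK downloads could follow the same pattern, using the split `actions/cache/restore` and `actions/cache/save` actions so that every run can read the cache but only pushes to master write it. The `sdk-download` path and the key are hypothetical placeholders; the real jobs would need their own installer path and a key tied to the SDK version.

```yaml
# Every run (PRs included) tries to restore the SDK installer.
- name: Restore SDK download
  id: sdk-cache
  uses: actions/cache/restore@v4
  with:
    path: sdk-download              # hypothetical directory holding the installer
    key: sdk-${{ runner.os }}-v1    # hypothetical key; bump the suffix when the SDK version changes

# ... download the installer on a cache miss, then install and build ...

# Only pushes to master write the cache, and only after a miss.
- name: Save SDK download
  if: github.event_name == 'push' && github.ref == 'refs/heads/master' && steps.sdk-cache.outputs.cache-hit != 'true'
  uses: actions/cache/save@v4
  with:
    path: sdk-download
    key: ${{ steps.sdk-cache.outputs.cache-primary-key }}
```

Since existing cache entries are immutable, skipping the save step on a hit avoids a pointless attempt to re-save the same key.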

(I'm wondering how to interpret the https://github.com/ggerganov/llama.cpp/actions/metrics/performance metrics, but job queue time is on the rise, and avg time hasn't budged)

@ochafik
Collaborator Author

ochafik commented Feb 4, 2025

@slaren Anyway, if you're willing to experiment, we could push something like this (+ maybe cache some sdk downloads) and see in which direction performance metrics budge after a week / revert if it's worse.

@slaren
Collaborator

slaren commented Feb 4, 2025

On a related note, evict-old-files probably does not work with sccache (the action only has an implementation for ccache), which may be why the Windows CUDA cache files are so big.

https://github.com/hendrikmuhs/ccache-action/blob/a1209f81afb8c005c13b4296c32e363431bffea5/src/save.ts#L58
