smartOS builds are taking more than 6 hours #4011
Comments
When @ryanaslett and I were setting up these new instances, we noticed that total build time was significantly affected by caching. When the cache is empty or invalidated, builds take significantly longer (around 3+ hours); builds with a valid cache take about 30-40 minutes (IIRC). I don't really understand what the caching situation is; Ryan may be able to share more details, but to my knowledge this isn't an OS-controlled behavior. If someone who is more familiar with the build system and the caching model is available and up to it, we can try to help debug this live to figure out what's going on.
Merging nodejs/node#55014 completely invalidated the cache.
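Assuming the compiler cache here is ccache on the workers (the thread does not name it explicitly), a quick way to check whether builds are actually getting cache hits is to look at the statistics directly on a machine:

```sh
# Print ccache hit/miss counters accumulated on this worker.
ccache -s

# Rough size of the cache on disk; ~/.ccache is the traditional default
# location (an assumption here -- check CCACHE_DIR on the worker).
du -sh "${CCACHE_DIR:-$HOME/.ccache}"
```

A near-zero hit rate after a large merge like nodejs/node#55014 would match the "completely invalidated" description above.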
There are even cases where it takes 6 hours: https://ci.nodejs.org/job/node-test-commit-smartos/58762/
There are examples where the builds take 6 hours. With or without ccache, I think there is something wrong.
Do you have a specific build log where it took 6h with a known valid cache that I could look at? I would definitely want to start by looking at that log before attempting to replicate it.
@anonrig The number on the top-right of these pages includes the time that the job was waiting for available machines. The jobs themselves (https://ci.nodejs.org/job/node-test-commit-smartos/58762/nodes=smartos22-x64/ and https://ci.nodejs.org/job/node-test-commit-smartos/58762/nodes=smartos23-x64/) didn't take that long.
This is useful information, and is definitely a significant contributor to what's going on. How does this cache affect other builds?
The closest I can find right now is the URLPattern PR. It has been building for almost 4 hours and hasn't started running the tests yet...
But I agree that more than 4 hours to build without cache doesn't seem reasonable. Are those executors on particularly old or undersized hardware?
Yes, but that shows that our smartOS machine count is not sufficient to handle our CI runs, which impacts the time it takes PRs to land. The URLPattern PR is an example where it takes more than 4 hours...
They have 8GB of RAM and 4 CPUs. If it needs to be increased, I can do that.
How many runners do you suggest? We can create more.
https://ci.nodejs.org/view/All/job/node-test-commit-smartos/nodes=smartos22-x64/58957/consoleFull
It looks like only one CPU is used??? I don't know which setting will take precedence. Also, it seems that ...
https://ci.nodejs.org/view/All/job/node-test-commit-smartos/configure contains:
JOBS is 1 on the executors, according to Jenkins, e.g. https://ci.nodejs.org/computer/test%2Dmnx%2Dsmartos22%2Dx64%2D1/systemInfo
Ansible-wise this is a combination of build/ansible/roles/jenkins-worker/tasks/main.yml (lines 7 to 25 in 7a568fd) and build/ansible/roles/jenkins-worker/vars/main.yml (lines 87 to 94 in 7a568fd), and set in ...
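A quick way to see the mismatch on a worker is to compare what Jenkins exports with what the zone actually has. This is a generic sketch, not part of the Ansible role; psrinfo is the illumos processor listing tool:

```sh
# On a SmartOS worker: compare the Jenkins-provided JOBS value with the
# number of CPUs actually visible inside the zone.
echo "JOBS=${JOBS:-unset}"
psrinfo | wc -l              # CPUs visible to the zone
getconf NPROCESSORS_ONLN     # the same count via POSIX getconf
```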
I'm not sure. Anything that doesn't make smartOS the bottleneck.
Having the correct number of parallel jobs should eliminate the bottleneck (at least I wouldn't try anything else before validating that).
Well, what's the average number of runners for other platforms? We can make sure there's at least something comparable.
Yeah, this definitely sounds like it's going to be a problem. This should be configured in Jenkins and/or the node source, and has nothing to do with the OS specifically. Do you know where this change can be made?
I'm not sure. I don't see where the JOBS=1 is coming from. @richardlau What do you think about changing:
by removing the ...
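One way to track down where JOBS ends up set to 1 is simply to search the Ansible role cited above. This sketch assumes a local checkout of the nodejs/build repository in ./build; the two files are the ones referenced earlier in the thread:

```sh
# Search the jenkins-worker role for anything that sets or templates JOBS.
grep -rn "JOBS" \
  build/ansible/roles/jenkins-worker/tasks/main.yml \
  build/ansible/roles/jenkins-worker/vars/main.yml
```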
I tried to connect to one of the machines to see how it's going live, but it doesn't work:
/cc @ryanaslett, is there a trick?
I honestly don't remember if ... Our ...
We could also hardcode ...
For testing that would be okay... ideally we'd fix however JOBS is being set to 1 in the first place (but that's likely to take longer to figure out).
I changed the config to:
I think what you want is ...
I have not used make in ages (I just use ninja), but one of the reasons I do not use make is that I need to remember how to compute the number of available cores and pass it in every time I switch to a different computer. So I think it doesn't parallelize when you don't give it a number.
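For context (a general note about GNU make, not anything specific to this Jenkins job): make runs serially when no -j is given, -j with no argument allows unlimited parallel jobs, and ninja defaults to roughly the number of CPUs. A portable way to pass an explicit core count on a SmartOS zone might look like:

```sh
# Derive a job count from the online CPUs and pass it explicitly;
# fall back to 1 if getconf is unavailable.
NCPU="$(getconf NPROCESSORS_ONLN 2>/dev/null || echo 1)"
make -j "$NCPU"   # explicit parallelism
# make            # no -j at all: strictly serial
# make -j         # -j without a number: unlimited jobs (can exhaust memory)
```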
Awesome. Thanks for doing that. Will a build be kicked off automatically, or can we start one to see how it goes?
Here's a build against the ...
Perfect. And do you know if that has a primed cache or not?
Doesn't work. It's asking me for the root password of ...
I don't know if you can.
I think the parallelism is working on https://ci.nodejs.org/job/node-test-commit-smartos/58964/nodes=smartos23-x64/ (by looking at the timestamps of log lines).
The build went fine (in less than 15 minutes!), so the cache was probably hot.
We may need a larger RSS or swap cap on the containers. Is there a document that describes the resource usage of the build? How are the sizes and resource limits chosen for CI runners on other platforms?
For follow-up later: I think JOBS was being set through Ansible when it configures the Jenkins agent on a machine. In terms of resources, I don't know if that is written down, but 8GB and 4 CPUs is in the general range, if not a bit higher. I do know that @richardlau has added additional swap to some machines. On some machines we see that with higher parallelism we can run out of memory, with V8 using more as more threads are available. (Edit: maybe that is what happened with the doc generation as well.) @jclulow, @bahamat if we can just up the available memory to get it running with -J4 and then investigate how to tune it better, that would be great.
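As a rough illustration of the trade-off being described (a rule of thumb, not project policy): budgeting on the order of 2 GB of RAM per parallel compile job and capping -j accordingly keeps V8-heavy builds from exhausting memory. MEM_GB is supplied by hand here because a zone's memory cap is not always visible to the usual tools:

```sh
# Cap parallelism by both CPU count and an assumed ~2 GB per compile job.
MEM_GB="${MEM_GB:-8}"                                     # assumed zone memory cap
NCPU="$(getconf NPROCESSORS_ONLN 2>/dev/null || echo 1)"
JOBS="$NCPU"
if [ $(( MEM_GB / 2 )) -lt "$JOBS" ]; then
  JOBS=$(( MEM_GB / 2 ))
fi
[ "$JOBS" -lt 1 ] && JOBS=1
echo "JOBS=$JOBS"
```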
That's for building V8. I'm very surprised to see an error at the doc building step, which should not be done in parallel with anything else.
Not sure then; worth re-running to see if it was a one-off?
The other build didn't fail and is now running tests: https://ci.nodejs.org/job/node-test-commit-smartos/58964/nodes=smartos22-x64/console
These instances have 7GB (a consequence of running a zone in a VM) and 4 CPUs. So we're in the neighborhood, although not quite where we'd like to be.
I'm working on resizing these to give them more resources. @jclulow and I were thinking around 12GB for the instances should do it.
And a significantly higher swap allowance as well, to account for differences in memory overcommit between Linux and illumos, and for the fact that ...
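For anyone verifying the effect of the new caps from inside a zone, the standard illumos tools show the virtual-memory allowance directly (illumos reserves swap at allocation time rather than overcommitting the way Linux does):

```sh
# Summary of allocated, reserved, and available virtual memory in the zone.
swap -s
# Configured swap devices/files and their sizes.
swap -l
```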
@targos @mhdawson @ryanaslett I need to reboot the SmartOS Jenkins instances in order to apply the new memory settings. Is there a protocol for requesting downtime?
I don't think there's more protocol than using a GitHub issue. I suggest we do it when the CI is relatively quiet. You just need someone from @nodejs/jenkins-admins to put the workers offline during the operation. |
Hi,
smartOS builds are taking more than 3.5 hours at the moment, blocking the landing of the URLPattern PR (nodejs/node#56452):
3.5 hours: https://ci.nodejs.org/job/node-test-commit-smartos/58956/nodes=smartos22-x64/console
6 hours: https://ci.nodejs.org/job/node-test-commit-smartos/58762/
7 hours and 41 minutes: https://ci.nodejs.org/job/node-test-commit-smartos/58760/
cc @jasnell @mcollina @nodejs/platform-smartos @nodejs/tsc @nodejs/build