
smartOS builds are taking more than 6 hours #4011

Open
anonrig opened this issue Jan 31, 2025 · 44 comments

@anonrig (Member) commented Jan 31, 2025

Hi,

smartOS builds are taking more than 3.5 hours at the moment, which is blocking the URLPattern PR (nodejs/node#56452) from landing.

3.5 hours: https://ci.nodejs.org/job/node-test-commit-smartos/58956/nodes=smartos22-x64/console
6 hours: https://ci.nodejs.org/job/node-test-commit-smartos/58762/
7 hours and 41 minutes: https://ci.nodejs.org/job/node-test-commit-smartos/58760/

cc @jasnell @mcollina @nodejs/platform-smartos @nodejs/tsc @nodejs/build

@bahamat commented Jan 31, 2025

When @ryanaslett and I were setting up these new instances, we noticed that total build time was significantly affected by caching. When the cache is empty or invalidated, builds take significantly longer (around 3+ hours). Builds with a valid cache take about 30-40 minutes (IIRC).

I don’t really understand what the caching situation is; Ryan may be able to share more details. But to my knowledge this isn’t OS-controlled behavior.

If someone who is more familiar with the build system and the caching model is available and up to it, we can try to help debug this live to figure out what’s going on.

@targos (Member) commented Jan 31, 2025

Merging nodejs/node#55014 completely invalidated the cache.

@anonrig (Member, Author) commented Jan 31, 2025

There are even cases where it takes 6 hours: https://ci.nodejs.org/job/node-test-commit-smartos/58762/

@anonrig (Member, Author) commented Jan 31, 2025

When the cache is empty or invalidated, builds take significantly longer (around 3+ hours). Builds with a valid cache take about 30-40 minutes (IIRC).

There are examples where the builds take 6 hours. With or without ccache, I think something is wrong.

@anonrig changed the title from "smartOS builds are taking more than 3.5 hours" to "smartOS builds are taking more than 6 hours" on Jan 31, 2025
@bahamat commented Jan 31, 2025

Do you have a specific build log where it took 6h with a known valid cache that I could look at? I would definitely want to start by looking at that log before attempting to replicate it.

@targos (Member) commented Jan 31, 2025

@anonrig The number on the top-right of these pages includes the time that the job was waiting for available machines. The jobs themselves (https://ci.nodejs.org/job/node-test-commit-smartos/58762/nodes=smartos22-x64/ and https://ci.nodejs.org/job/node-test-commit-smartos/58762/nodes=smartos23-x64/) didn't take that long.

@bahamat commented Jan 31, 2025

Merging nodejs/node#55014 completely invalidated the cache.

This is useful information, and is definitely a significant contributor to what’s going on.

How does this cache affect other builds?

@anonrig (Member, Author) commented Jan 31, 2025

Do you have a specific build log where it took 6h with a known valid cache that I could look at? I would definitely want to start by looking at that log before attempting to replicate it.

The closest I can find right now is the URLPattern PR. It has been building for almost 4 hours and hasn't started running the tests...

@targos (Member) commented Jan 31, 2025

But I agree that more than 4 hours to build without cache doesn't seem reasonable. Are those executors on particularly old or undersized hardware?

@anonrig (Member, Author) commented Jan 31, 2025

@anonrig The number on the top-right of these pages includes the time that the job was waiting for available machines. The jobs themselves (https://ci.nodejs.org/job/node-test-commit-smartos/58762/nodes=smartos22-x64/ and https://ci.nodejs.org/job/node-test-commit-smartos/58762/nodes=smartos23-x64/) didn't take that long.

Yes, but that shows that our smartOS machine count is not sufficient to handle our CI runs, which impacts the time it takes PRs to land. The URLPattern PR is an example where it takes more than 4 hours...

@bahamat commented Jan 31, 2025

But I agree that more than 4 hours to build without cache doesn't seem reasonable. Are those executors on particularly old or undersized hardware?

They have 8GB of RAM and 4 CPUs. If it needs to be increased, I can do that.

@bahamat commented Jan 31, 2025

Yes, but that shows that our smartOS machine count is not sufficient to handle our CI runs, which impacts the time it takes PRs to land. The URLPattern PR is an example where it takes more than 4 hours...

How many runners do you suggest? We can create more.

@targos (Member) commented Jan 31, 2025

https://ci.nodejs.org/view/All/job/node-test-commit-smartos/nodes=smartos22-x64/58957/consoleFull

14:53:31 + ./configure
14:53:32 Node.js configure: Found Python 3.11.1...
14:53:34 INFO: configure completed successfully
14:53:34 + make run-ci -j 1
14:53:34 + JOBS=4
14:53:34 + NODE_TEST_DIR=/home/iojs/node-tmp
14:53:34 + NODE_COMMON_PORT=15000
14:53:34 + FLAKY_TESTS=dontcare
14:53:34 + TEST_CI_ARGS='-t 300'
14:53:34 /opt/local/bin/python3.11 ./configure --verbose 
14:53:34 Node.js configure: Found Python 3.11.1...
14:53:34 Detected C++ compiler (CXX=ccache g++) version: 12.2.0
14:53:34 Detected C compiler (CC=ccache gcc) version: 12.2.0

It looks like only one CPU is used??? I don't know which will take precedence between -j 1 and JOBS=4

Also, it seems that ./configure is called once for nothing before make run-ci (which calls it again)
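For reference (a self-contained sketch, not taken from the CI config): make's own parallelism is controlled only by its -j flag, and an environment variable named JOBS means nothing to make unless the Makefile explicitly reads it, so the -j 1 above would cap the compile at one job regardless of JOBS=4.

printf 'a b c d:\n\tsleep 2\n' > /tmp/jobs-demo.mk    # throwaway makefile: four targets, 2-second recipe each
time JOBS=4 make -f /tmp/jobs-demo.mk -j 1 a b c d    # ~8s: -j 1 wins, the JOBS variable is ignored by make
time JOBS=1 make -f /tmp/jobs-demo.mk -j 4 a b c d    # ~2s: -j 4 runs the recipes in parallel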

@targos (Member) commented Jan 31, 2025

https://ci.nodejs.org/view/All/job/node-test-commit-smartos/configure contains:

#/bin/bash -xe

curl https://raw.githubusercontent.com/nodejs/build/main/jenkins/scripts/node-test-commit-pre.sh | bash -ex -s before

export CCACHE_BASEDIR=$(pwd)
export CC="ccache gcc"
export CXX="ccache g++"
# Need to add this to ENV var to make behaviors match between smart os platforms: See: https://github.com/nodejs/build/issues/3731#issuecomment-2546193074
export CXXFLAGS="${CXXFLAGS} -fno-zero-initialized-in-bss"
curl -sLO https://raw.githubusercontent.com/nodejs/build/main/jenkins/scripts/select-compiler.sh
. ./select-compiler.sh
# configure will pick the correct version of Python to use on all currently supported Node.js versions. 
# Starting from SmartOS 20 the unversioned `python` is Python 3 which doesn't work with Node.js 12.
./configure
JOBS=4 NODE_TEST_DIR=${HOME}/node-tmp NODE_COMMON_PORT=15000 FLAKY_TESTS=dontcare TEST_CI_ARGS="-t 300" make run-ci -j $JOBS

curl https://raw.githubusercontent.com/nodejs/build/main/jenkins/scripts/node-test-commit-diagnostics.sh | bash -ex -s after

@richardlau (Member) commented:

JOBS is 1 on the executors, according to Jenkins, e.g. https://ci.nodejs.org/computer/test%2Dmnx%2Dsmartos22%2Dx64%2D1/systemInfo

Ansible-wise this is a combination of

- name: set jobs_env from jobs_variants.{{ arch }}
  set_fact:
    jobs_env: "{{ jobs_variants[arch] }}"
  when: jobs_env is undefined and server_jobs is undefined and arch in jobs_variants
- name: set jobs_env from jobs_variants.{{ os|stripversion }}
  set_fact:
    jobs_env: "{{ jobs_variants[os|stripversion] }}"
  when: jobs_env is undefined and server_jobs is undefined and os|stripversion in jobs_variants
- name: set jobs_env from server_jobs or ansible_processor_vcpus or ansible_processor_cores
  set_fact:
    jobs_env: "{{ server_jobs|default(ansible_processor_vcpus)|default(ansible_processor_cores) }}"
  when: jobs_env is undefined and (server_jobs is defined or ansible_processor_vcpus is defined or ansible_processor_cores is defined)
- name: set jobs_env to fall-back value of 2
  set_fact:
    jobs_env: "2"
  when: jobs_env is undefined

and
# primarily for raspberry pi workers, some arm64 workers may need `server_jobs` defined in inventory
# to override these low numbers
jobs_variants: {
  armv6l: '1',
  armv7l: '2',
  arm64: '3',
  smartos: '2'
}

and set in
<envvar name='JOBS' value='{{ jobs_env|default("2") }}' />
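A hedged way to cross-check this on one of the workers (run from a shell on the executor or a throwaway Jenkins shell step; none of this output is from the thread itself):

uname -s                    # SunOS: this worker falls under the smartos entry in jobs_variants above
getconf NPROCESSORS_ONLN    # online CPUs in the zone (4 here), roughly what ansible_processor_vcpus reports
echo "JOBS=$JOBS"           # inside a build step this currently prints JOBS=1, matching the systemInfo page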

@anonrig (Member, Author) commented Jan 31, 2025

How many runners do you suggest? We can create more.

I'm not sure. Anything that doesn't make smartOS the bottleneck.

@targos (Member) commented Jan 31, 2025

Having the correct number of parallel jobs should eliminate the bottleneck (at least I wouldn't try something else before validating that)

@bahamat commented Jan 31, 2025

How many runners do you suggest? We can create more.

I'm not sure. Anything that doesn't make smartOS the bottleneck.

Well what's the average number of runners for other platforms? We can make sure there's at least something comparable.

@bahamat commented Jan 31, 2025

Having the correct number of parallel jobs should eliminate the bottleneck (at least I wouldn't try something else before validating that)

Yeah, this definitely sounds like it's going to be a problem. This should be configured in Jenkins and/or the node source, and has nothing to do with the OS specifically.

Do you know where this change can be made?

@targos (Member) commented Jan 31, 2025

I'm not sure. I don't see where the JOBS=1 is coming from.

@richardlau What do you think about changing:

JOBS=4 NODE_TEST_DIR=${HOME}/node-tmp NODE_COMMON_PORT=15000 FLAKY_TESTS=dontcare TEST_CI_ARGS="-t 300" make run-ci -j $JOBS

by removing the -j $JOBS and relying on the JOBS=4 from the start of the line?

@targos (Member) commented Jan 31, 2025

I tried to connect to one of the machines to see how it's going live, but it doesn't work:

$ ssh test-mnx-smartos22-x64-2
zsh:1: no such file or directory: ssh -i ~/.ssh/nodejs_build_test -W 172.16.9.3:22 [email protected]
Connection closed by UNKNOWN port 65535

/cc @ryanaslett is there a trick?

@richardlau (Member) commented:

I honestly don't remember if make defaults to "run as many parallel threads as possible" or no parallelism if -j is omitted.

Our Makefile uses JOBS for running tests but doesn't use it for building (i.e. the build-ci target).
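For the record: GNU make runs recipes serially when -j is omitted, and a bare -j (no number) means "as many as possible". A throwaway check (paths illustrative):

printf 'a b c:\n\tsleep 2\n' > /tmp/j.mk
time make -f /tmp/j.mk a b c       # ~6s: no -j, one recipe at a time
time make -f /tmp/j.mk -j a b c    # ~2s: bare -j, unlimited parallel jobs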

@targos (Member) commented Jan 31, 2025

We could also hardcode -j 4

@richardlau (Member) commented:

We could also hardcode -j 4

For testing that would be okay... ideally we'd fix whatever is setting JOBS to 1 in the first place (but that's likely to take longer to figure out).

@targos (Member) commented Jan 31, 2025

I changed the config to:

# TODO: figure out why JOBS is 1 at the executor level.
# Refs: https://github.com/nodejs/build/issues/4011
export JOBS=4
NODE_TEST_DIR=${HOME}/node-tmp NODE_COMMON_PORT=15000 FLAKY_TESTS=dontcare TEST_CI_ARGS="-t 300" make run-ci -j $JOBS

@bahamat commented Jan 31, 2025

I tried to connect to one of the machines to see how it's going live, but it doesn't work:

$ ssh test-mnx-smartos22-x64-2
zsh:1: no such file or directory: ssh -i ~/.ssh/nodejs_build_test -W 172.16.9.3:22 [email protected]
Connection closed by UNKNOWN port 65535

/cc @ryanaslett is there a trick?

I think what you want is ssh -o [email protected] 172.16.9.3 (this works for me).
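If the redacted option there is ProxyJump, the spelled-out equivalent would look roughly like this (the jump user and host below are placeholders, not the real values, which are hidden by the email-protection mangling above):

# Hypothetical example only: substitute the real jump user/host from the build inventory.
ssh -o ProxyJump=jumpuser@jumphost.example.com -i ~/.ssh/nodejs_build_test 172.16.9.3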

@joyeecheung (Member) commented:

I honestly don't remember if make defaults to "run as many parallel threads as possible" or no parallelism if -j is omitted.

I have not used make in ages (I just use ninja), but one of the reasons I don't use make is that I have to remember how to compute the number of available cores and pass it in every time I switch to a different computer. So I think it doesn't parallelize when you don't give it a number.
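For what it's worth, the core count can be computed rather than remembered; a hedged one-liner that should work on both Linux and illumos/SmartOS (the getconf variable name differs between them, hence the fallbacks):

JOBS=$(getconf NPROCESSORS_ONLN 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || nproc)
make -j "$JOBS"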

@bahamat commented Jan 31, 2025

I changed the config to:

# TODO: figure out why JOBS is 1 at the executor level.
# Refs: https://github.com/nodejs/build/issues/4011
export JOBS=4
NODE_TEST_DIR=${HOME}/node-tmp NODE_COMMON_PORT=15000 FLAKY_TESTS=dontcare TEST_CI_ARGS="-t 300" make run-ci -j $JOBS

Awesome. Thanks for doing that. Will a build be kicked off automatically, or can we start one to see how it goes?

@targos (Member) commented Jan 31, 2025

Here's a build against the main branch: https://ci.nodejs.org/job/node-test-commit-smartos/58964/

@bahamat commented Jan 31, 2025

Here's a build against the main branch: https://ci.nodejs.org/job/node-test-commit-smartos/58964/

Perfect. And do you know if that has a primed cache or not?

@targos (Member) commented Jan 31, 2025

I tried to connect to one of the machines to see how it's going live, but it doesn't work:

$ ssh test-mnx-smartos22-x64-2
zsh:1: no such file or directory: ssh -i ~/.ssh/nodejs_build_test -W 172.16.9.3:22 [email protected]
Connection closed by UNKNOWN port 65535

/cc @ryanaslett is there a trick?

I think what you want is ssh -o [email protected] 172.16.9.3 (this works for me).

Doesn't work. It's asking me for the root password of 67.158.54.56. Same if I add -i ~/.ssh/nodejs_build_test

@targos (Member) commented Jan 31, 2025

Here's a build against the main branch: ci.nodejs.org/job/node-test-commit-smartos/58964

Perfect. And do you know if that has a primed cache or not?

I don't know if you can tell.

@targos (Member) commented Jan 31, 2025

I think the parallelism is working on https://ci.nodejs.org/job/node-test-commit-smartos/58964/nodes=smartos23-x64/ (by looking at the timestamps of log lines)

@targos (Member) commented Jan 31, 2025

The build went fine (in less than 15 minutes!), so the cache was probably hot.
But then the documentation generation failed with an out of memory error from V8.

@jclulow commented Jan 31, 2025

We may need a larger RSS or swap cap on the containers. Is there a document that describes the resource usage of the build? How are the sizes and resource limits chosen for CI runners on other platforms?
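A hedged sketch for observing that empirically from inside one of the zones (standard SmartOS/illumos tooling assumed; nothing below is taken from the thread):

kstat -m memory_cap                              # this zone's rss/physcap and swap/swapcap counters
prctl -n zone.max-swap -i zone "$(zonename)"     # the swap resource control as the zone sees it
swap -s                                          # overall swap reservation and availability
df -h /tmp                                       # /tmp is tmpfs (swap-backed), so it draws from the same cap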

@mhdawson (Member) commented Jan 31, 2025

For follow-up later on: I think JOBS was being set through Ansible when it configures the Jenkins agent on a machine.

In terms of resources, I don't know if that is written down, but 8 GB and 4 CPUs is in the general range, if not a bit higher. I do know that @richardlau has added additional swap to some machines.

On some machines we do see that with higher parallelism we can run out of memory, since V8 uses more memory as more threads are available. (Edit: maybe that is what happened with the doc generation as well.)

@jclulow, @bahamat, if we can just increase the available memory to get it running with -j 4 and then investigate how to tune it better, that would be great.

@targos (Member) commented Jan 31, 2025

On some machines we do see that with higher parallelism we can run out of memory, since V8 uses more memory as more threads are available.

That's for building V8. I'm very surprised to see an error at the doc building step, which should not be done in parallel with anything else.

@mhdawson (Member) commented:

That's for building V8. I'm very surprised to see an error at the doc building step, which should not be done in parallel with anything else.

Not sure then. Worth re-running to see if it was a one-off?

@targos (Member) commented Jan 31, 2025

The other build didn't fail and is now running tests: https://ci.nodejs.org/job/node-test-commit-smartos/58964/nodes=smartos22-x64/console

@bahamat commented Feb 1, 2025

In terms of resources, I don't know if that is written down, but 8 GB and 4 CPUs is in the general range, if not a bit higher. I do know that @richardlau has added additional swap to some machines.

These instances have 7 GB (a consequence of running a zone in a VM) and 4 CPUs, so we're in the neighborhood, although not quite where we'd like to be.

@jclulow, @bahamat, if we can just increase the available memory to get it running with -j 4 and then investigate how to tune it better, that would be great.

I'm working on resizing these to give them more resources. @jclulow and I were thinking around 12 GB per instance should do it.

@jclulow commented Feb 1, 2025

And a significantly higher swap allowance as well, to account for differences in memory overcommit between Linux and illumos, and for the fact that /tmp is tmpfs/swap-backed on illumos, etc.

@bahamat commented Feb 1, 2025

@targos @mhdawson @ryanaslett I need to reboot the SmartOS Jenkins instances in order to apply the new memory settings.

Is there a protocol for requesting downtime?

@targos (Member) commented Feb 1, 2025

I don't think there's more protocol than using a GitHub issue. I suggest we do it when the CI is relatively quiet. You just need someone from @nodejs/jenkins-admins to put the workers offline during the operation.

@jasnell (Member) commented Feb 1, 2025

Thank you @bahamat and @jclulow for the effort on this
