Apparently creating too many threads for tracing-subscriber #1288
Hmmm, is this specific to …

We do 10x because of an edge case that can happen when limiting max open files; in some cases the open-file limit can need more than one descriptor.
Hm, maybe max_open_files is set too high in my config. What happens if you try to schedule more work than fits in the max open files limit? Does it fail in a similar way to hitting a ulimit, or does the scheduler avoid it and things just queue up? If it's the latter, I can fix this by dropping max_open_files back to a reasonable number.
Yeah, that fixed it. I think this means max_open_files should absolutely never be more than 800 with the current 10x thread-limit behaviour. Not that I actually read the docs when I set it way too high, but it may be worth adding to them.
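For reference, a quick way to compare the running process's thread and descriptor counts against that rule of thumb (plain procfs and coreutils; the `nativelink` process name and the 10x multiplier are taken from this thread, not verified against the code):

```sh
# Assumes a single process named "nativelink"; the 10x figure is the rule of
# thumb discussed above, not a value read from the source.
PID=$(pgrep -x nativelink)
MAX_OPEN_FILES=512
echo "worst-case threads (10 x max_open_files): $((MAX_OPEN_FILES * 10))"
echo "current threads:   $(ls /proc/$PID/task   | wc -l)"
echo "open descriptors:  $(ls /proc/$PID/fdinfo | wc -l)"
```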
Sounds like adding this info to the docs is a good first issue.
Hmmm... it did fix this problem, but now nativelink seems to be running only 1-2 actions at a time. This started happening pretty late in a big build; initially it was fine, but it seems to have run out of open files. max_open_files is set to 512, and the process has about 600 threads running (not 5000). Sounds to me like something is keeping files open, or failing to decrement the open-files count, or something. The build graph was about 2000 nodes.

; ls /proc/$(pgrep nativelink)/fdinfo/ | wc -l
543

Edit: basically this is filesystem_cas.json with a local worker also defined. You guys probably don't test this configuration that often. Maybe the filesystem store and the worker need to be split into two separate nativelink processes (i.e. containers), one for the worker, so the filesystem code does not eat into the open-file limit that the worker needs.
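A rough sketch of that split, purely illustrative: the image name, entrypoint behaviour, and config file names below are placeholders for whatever image and configs you already run, not values from the nativelink docs.

```sh
# Hypothetical two-container split: one process serves the filesystem CAS and
# scheduler, the other runs only the worker, so each gets its own open-file budget.
# "nativelink:0.5.1" and the config paths are placeholders.
docker run -d --name nativelink-cas \
  -v "$PWD/cas_and_scheduler.json:/config.json" \
  nativelink:0.5.1 /config.json

docker run -d --name nativelink-worker \
  -v "$PWD/worker.json:/config.json" \
  nativelink:0.5.1 /config.json
```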
Actually, late in the build graph you have things with many dependencies, and those dependency lists are just long lists of object files, especially actions that link binaries and pull in the full link graph. So in this weird way, it may make sense that we hit open-file limits more often with more dependencies. Still, it would be a bit weird to consume an open file handle for that while the action is executing.
I am getting a lot of these panics shortly after startup when running a clean build, which seems to create failures all over the place.
The implication here is that nativelink is creating 8000+ threads. It can apparently recover if you restart the build, which is nice. The only crate in the dependency graph that depends on sharded-slab is tracing-subscriber, so I assume that is the code using the default limits. I think it's weird that nativelink would be creating 8000+ threads; 8000 sounds like a perfectly sane limit.
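To check the dependency path described above, an inverted `cargo tree` from a nativelink checkout shows what pulls in sharded-slab:

```sh
# From a checkout of the nativelink repo: list everything that (transitively)
# depends on sharded-slab; per this report the only path is via tracing-subscriber.
cargo tree --invert sharded-slab
```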
Nativelink version 0.5.1 from GitHub, running in Docker.