
Apparently creating too many threads for tracing-subscriber #1288

Open · cormacrelf opened this issue Aug 28, 2024 · 6 comments
Labels: documentation (Improvements or additions to documentation), good first issue (Good for newcomers)

Comments

cormacrelf (Contributor) commented Aug 28, 2024

I'm getting a lot of these panics shortly after startup when running a clean build, and they seem to cause failures all over the place.

thread 'tokio-runtime-worker' panicked at /nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-vendor-cargo-deps/c19b7c6f923b580ac259164a89f2577984ad5ab09ee9d583b888f934adbbe8d0/sharded-slab-0.1.7/src/tid.rs:163:21:
creating a new thread ID (8374) would exceed the maximum number of thread ID bits specified in sharded_slab::cfg::DefaultConfig (8191)
  2024-08-28T08:11:17.344013Z ERROR nativelink_store::filesystem_store: Failed to delete file, file_path: "/nativelink/data/tmp_path-worker_cas/...", err: Error { code: Internal, ... it panicked, basically }
    at nativelink-store/src/filesystem_store.rs:124
    in nativelink_store::filesystem_store::filesystem_delete_file
    in nativelink_store::filesystem_store::filesystem_store_emplace_file
    in nativelink_worker::local_worker::worker_start_action
    in nativelink::worker with name: "worker_0"

The implication here is that nativelink is creating 8000+ threads. It can apparently recover if you restart the build, which is nice. The only crate in the dependency graph that depends on sharded-slab is tracing-subscriber, so I assume that's the code using the default limits. I find it odd that nativelink would be creating 8000+ threads; 8000 sounds like a perfectly sane limit.

Nativelink version 0.5.1 from GitHub, running in Docker.

allada (Member) commented Aug 29, 2024

Hmmm, is this specific to tracing-subscriber? From what I see, in nativelink we limit blocking threads to ~5k by default (the math is 10 * config.global_cfg.max_open_files).

We do 10x because of an edge case that can happen when limiting max open files: in some cases a single open-file slot can need more than one descriptor.
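
For reference, a minimal sketch of how a "10 × max_open_files" cap could surface as OS threads, assuming the blocking-thread limit maps onto Tokio's `max_blocking_threads`. This is illustrative only, not nativelink's actual code; `max_open_files` here just stands in for `config.global_cfg.max_open_files`.

```rust
// Illustrative only -- not nativelink's actual code. Shows how a
// "10 x max_open_files" cap could translate into Tokio's blocking-thread
// limit, which is where a thread count in the thousands (and the
// sharded-slab thread-ID exhaustion above) could come from.
use tokio::runtime::{Builder, Runtime};

fn build_runtime(max_open_files: usize) -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .enable_all()
        // 10x headroom because a single open-file slot can need more than
        // one descriptor in some edge cases (see the comment above).
        .max_blocking_threads(10 * max_open_files)
        .build()
}
```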

cormacrelf (Contributor, Author) commented:

Hm, maybe max open files is set too high in my config. What happens if you try to schedule more work than fits within the max open files limit? Does it fail in a way similar to hitting a ulimit? Or does the scheduler avoid it / do things just queue up? If it's the latter, I can fix this by dropping max open files back to a reasonable number.

cormacrelf (Contributor, Author) commented:

Yeah, that fixed it. I think this means max_open_files should absolutely never be more than 800 with the current 10x thread-limit behaviour. Not that I actually read the docs when I set it way too high, but it may be worth adding to them.
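
To spell out the arithmetic behind that ~800 figure (a back-of-the-envelope sketch, not an official limit): sharded-slab's DefaultConfig allows 8191 distinct thread IDs, and the blocking-thread cap is 10 × max_open_files, so the safe ceiling works out to about 819.

```rust
// Back-of-the-envelope check, assuming the 10x multiplier described above.
const SHARDED_SLAB_MAX_THREAD_IDS: usize = 8191; // from the panic message
const THREADS_PER_OPEN_FILE: usize = 10;         // per allada's comment

fn main() {
    // Largest max_open_files that keeps the blocking-thread cap under the
    // sharded-slab thread-ID limit: 8191 / 10 = 819, hence "never more
    // than ~800" as a safe round number.
    let max_safe_open_files = SHARDED_SLAB_MAX_THREAD_IDS / THREADS_PER_OPEN_FILE;
    assert_eq!(max_safe_open_files, 819);
}
```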

aaronmondal added the documentation and good first issue labels Aug 29, 2024
aaronmondal (Member) commented Aug 29, 2024

Sounds like adding this info to the docs is a good first issue ☺️

aaronmondal reopened this Aug 29, 2024
cormacrelf (Contributor, Author) commented Aug 29, 2024

Hmmm... it did fix this problem, but now nativelink seems to be running only 1-2 actions at a time. This started happening pretty late in a big build; initially it was fine, but it seems to have run out of open files. I set max_open_files to 512, and it has about 600 threads running (not 5000). Sounds to me like something is keeping files open, or failing to decrement the open-files count, or something. The build graph was about 2000 nodes.

; ls /proc/$(pgrep nativelink)/fdinfo/ | wc -l
543
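
For reference, the same count can be taken from inside the process; a hypothetical Linux-only helper (not something nativelink exposes), shown as a sketch:

```rust
// Hypothetical diagnostic helper (not part of nativelink): counts the
// process's open file descriptors via /proc/self/fd, mirroring the
// `ls .../fdinfo | wc -l` check above. Linux only.
use std::fs;

fn open_fd_count() -> std::io::Result<usize> {
    // read_dir itself holds one fd while iterating, so the count may be
    // off by one relative to the shell check.
    Ok(fs::read_dir("/proc/self/fd")?.count())
}
```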

Edit: basically this is filesystem_cas.json with a local worker also defined. You guys probably don't test this configuration that often. Maybe I need to split the filesystem CAS and the worker into two separate nativelink processes (i.e. containers), one for the worker, so the filesystem code does not eat into the open-file limit that the worker needs.

cormacrelf (Contributor, Author) commented Aug 29, 2024

Actually, late in the build graph you have actions with many dependencies, and those dependency lists are just long lists of object files, especially for actions that link binaries and pull in the full link graph. So in this weird way it may make sense that we hit open-file limits more as dependency counts grow. It would still be a bit odd to consume an open file handle for each of those while the action is executed.
