
Apparently creating too many threads for tracing-subscriber #1288

Open · cormacrelf opened this issue Aug 28, 2024 · 6 comments
Labels: documentation (Improvements or additions to documentation), good first issue (Good for newcomers)

Comments

cormacrelf (Contributor) commented Aug 28, 2024

I'm getting a lot of these panics shortly after startup when running a clean build, and they seem to cause failures all over the place.

thread 'tokio-runtime-worker' panicked at /nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-vendor-cargo-deps/c19b7c6f923b580ac259164a89f2577984ad5ab09ee9d583b888f934adbbe8d0/sharded-slab-0.1.7/src/tid.rs:163:21:
creating a new thread ID (8374) would exceed the maximum number of thread ID bits specified in sharded_slab::cfg::DefaultConfig (8191)
  2024-08-28T08:11:17.344013Z ERROR nativelink_store::filesystem_store: Failed to delete file, file_path: "/nativelink/data/tmp_path-worker_cas/...", err: Error { code: Internal, ... it panicked, basically }
    at nativelink-store/src/filesystem_store.rs:124
    in nativelink_store::filesystem_store::filesystem_delete_file
    in nativelink_store::filesystem_store::filesystem_store_emplace_file
    in nativelink_worker::local_worker::worker_start_action
    in nativelink::worker with name: "worker_0"

The implication here is that nativelink is creating 8000+ threads. It can apparently recover if you restart the build, which is nice. The only crate in the dependency graph that depends on sharded-slab is tracing-subscriber, so I assume that's the code using the default limits. I find it odd that nativelink would be creating 8000+ threads; 8000 sounds like a perfectly sane limit.

Nativelink version 0.5.1 from GitHub, running in Docker.

allada (Member) commented Aug 29, 2024

Hmmm, is this specific to tracing-subscriber? From what I see, in nativelink we limit blocking threads to ~5k by default (the math is 10 * config.global_cfg.max_open_files).

We do 10x because of an edge case that can happen when limiting max open files: in some cases a single open-file slot can need more than one descriptor.
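
For reference, a minimal sketch of how a "10 × max_open_files" cap could surface as OS threads, assuming the blocking-thread limit maps onto Tokio's `max_blocking_threads`. This is illustrative only, not nativelink's actual code; `max_open_files` here just stands in for `config.global_cfg.max_open_files`.

```rust
// Illustrative only -- not nativelink's actual code. Shows how a
// "10 x max_open_files" cap could translate into Tokio's blocking-thread
// limit, which is where a thread count in the thousands (and the
// sharded-slab thread-ID exhaustion above) could come from.
use tokio::runtime::{Builder, Runtime};

fn build_runtime(max_open_files: usize) -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .enable_all()
        // 10x headroom because a single open-file slot can need more than
        // one descriptor in some edge cases (see the comment above).
        .max_blocking_threads(10 * max_open_files)
        .build()
}
```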

cormacrelf (Contributor, Author) commented:

Hm, maybe max open files is set too high in my config. What happens if you try to schedule more work than fits within the max open files limit? Does it fail in a way similar to hitting a ulimit? Or does the scheduler avoid it / do things just queue up? If it's the latter, I can fix this by dropping max open files back to a reasonable number.

cormacrelf (Contributor, Author) commented:

Yeah, that fixed it. I think this means max_open_files should absolutely never be more than 800 with the current 10x thread-limit behaviour. Not that I actually read the docs when I set it way too high, but it may be worth adding to them.
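
To spell out the arithmetic behind that ~800 figure (a back-of-the-envelope sketch, not an official limit): sharded-slab's DefaultConfig allows 8191 distinct thread IDs, and the blocking-thread cap is 10 × max_open_files, so the safe ceiling works out to about 819.

```rust
// Back-of-the-envelope check, assuming the 10x multiplier described above.
const SHARDED_SLAB_MAX_THREAD_IDS: usize = 8191; // from the panic message
const THREADS_PER_OPEN_FILE: usize = 10;         // per allada's comment

fn main() {
    // Largest max_open_files that keeps the blocking-thread cap under the
    // sharded-slab thread-ID limit: 8191 / 10 = 819, hence "never more
    // than ~800" as a safe round number.
    let max_safe_open_files = SHARDED_SLAB_MAX_THREAD_IDS / THREADS_PER_OPEN_FILE;
    assert_eq!(max_safe_open_files, 819);
}
```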

aaronmondal added the documentation and good first issue labels Aug 29, 2024
aaronmondal (Member) commented Aug 29, 2024

Sounds like adding this info to the docs is a good first issue ☺️

aaronmondal reopened this Aug 29, 2024
cormacrelf (Contributor, Author) commented Aug 29, 2024

Hmmm... it did fix this problem, but now nativelink seems to be running only 1-2 actions at a time. This started happening pretty late in a big build; initially it was fine, but it seems to have run out of open files. I set max_open_files to 512, and it has about 600 threads running (not 5000). Sounds to me like something is keeping files open, or failing to decrement the open-files count, or something. The build graph was about 2000 nodes.

; ls /proc/$(pgrep nativelink)/fdinfo/ | wc -l
543
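
For reference, the same count can be taken from inside the process; a hypothetical Linux-only helper (not something nativelink exposes), shown as a sketch:

```rust
// Hypothetical diagnostic helper (not part of nativelink): counts the
// process's open file descriptors via /proc/self/fd, mirroring the
// `ls .../fdinfo | wc -l` check above. Linux only.
use std::fs;

fn open_fd_count() -> std::io::Result<usize> {
    // read_dir itself holds one fd while iterating, so the count may be
    // off by one relative to the shell check.
    Ok(fs::read_dir("/proc/self/fd")?.count())
}
```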

Edit: basically this is filesystem_cas.json with a local worker also defined. You guys probably don't test this configuration that often. Maybe I need to split the filesystem CAS and the worker into two separate nativelink processes (i.e. containers), one for the worker, so the filesystem code does not eat into the open-file limit that the worker needs.

cormacrelf (Contributor, Author) commented Aug 29, 2024

Actually, late in the build graph you have actions with many dependencies, and those dependency lists are just long lists of object files, especially for actions that link binaries and pull in the full link graph. So in this weird way it may make sense that we hit open-file limits more as dependency counts grow. It would still be a bit odd to consume an open file handle for each of those while the action is executed.
