
[ntuple] RClusterPool can crash on non-existing cluster #16936

Open · 1 task done
jblomer opened this issue Nov 14, 2024 · 1 comment · May be fixed by #16931

jblomer (Contributor) commented Nov 14, 2024

Check duplicate issues.

  • Checked for duplicates

Description

As discovered by the CI, RClusterPool::WaitFor() can throw an R__ASSERT upon receiving a null pointer from the cluster future. This can happen under the following sequence of events:

  • The main thread schedules cluster $k$ for background loading.
  • A subsequent call to GetCluster() removes cluster $k$ from the provides set. If the I/O thread has not yet finished loading $k$, this sets the fIsExpired flag on the cluster.
  • The I/O thread, once it has loaded cluster $k$, sees the fIsExpired flag (under the work queue lock). Consequently, it skips decompressing the cluster and sets the cluster promise to null (not under the lock).
  • Another call to GetCluster() can be scheduled right between the test for fIsExpired and the setting of the cluster promise to null in the I/O thread. In this case, GetCluster() mistakenly assumes that the requested cluster will be provided by the I/O thread, while in fact the I/O thread returns null.

The fix seems to be to perform both steps under the lock: the test for fIsExpired and the setting of the cluster promise to null. A sketch of the racy pattern and of the fix follows below.
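
A minimal, hedged sketch of the race and of the proposed fix, using hypothetical names modeled on the description above (RWorkItem, FinishLoadBuggy/FinishLoadFixed, gLockWorkQueue, a std::promise delivering the loaded cluster); this is not the actual RClusterPool code, just an illustration of the locking pattern:

```cpp
#include <future>
#include <memory>
#include <mutex>

struct RCluster {};

struct RWorkItem {
   bool fIsExpired = false;                          // set by GetCluster() when the cluster is no longer wanted
   std::promise<std::unique_ptr<RCluster>> fPromise; // fulfilled by the I/O thread
};

std::mutex gLockWorkQueue; // guards the work queue and fIsExpired

// Buggy version: the expiry test is under the lock, but the promise is
// set outside of it.
void FinishLoadBuggy(RWorkItem &item, std::unique_ptr<RCluster> cluster)
{
   bool isExpired;
   {
      std::lock_guard<std::mutex> guard(gLockWorkQueue);
      isExpired = item.fIsExpired;
   }
   // A GetCluster() call scheduled at this point still sees the item as
   // in flight and will mistakenly wait on a promise that is about to
   // be set to null.
   if (isExpired)
      item.fPromise.set_value(nullptr);
   else
      item.fPromise.set_value(std::move(cluster));
}

// Proposed fix: test fIsExpired and set the promise under the same
// lock, so GetCluster() cannot interleave between the two steps.
void FinishLoadFixed(RWorkItem &item, std::unique_ptr<RCluster> cluster)
{
   std::lock_guard<std::mutex> guard(gLockWorkQueue);
   if (item.fIsExpired)
      item.fPromise.set_value(nullptr);
   else
      item.fPromise.set_value(std::move(cluster));
}
```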

Reproducer

The RandomAccess unit test sometimes triggers the race.

ROOT version

master

Installation method

n/a

Operating system

n/a

Additional context

No response

jblomer self-assigned this Nov 14, 2024
jblomer linked a pull request Nov 14, 2024 that will close this issue
jblomer (Contributor, Author) commented Nov 14, 2024

We can consider dropping the "discard/expire" signal entirely. Since decompression no longer takes place in the I/O thread, it is questionable whether the optimization gains much.
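
For illustration, continuing the hypothetical sketch from the issue description: if the discard/expire signal were dropped, the I/O thread's finish path would simplify to unconditionally fulfilling the promise, and the race would disappear, at the cost of occasionally loading (but no longer decompressing) a cluster that nobody waits for anymore.

```cpp
// Hedged sketch, reusing the hypothetical RWorkItem/RCluster types above.
void FinishLoadWithoutExpiry(RWorkItem &item, std::unique_ptr<RCluster> cluster)
{
   item.fPromise.set_value(std::move(cluster));
}
```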
