
[ntuple] RClusterPool can crash on non-existing cluster #16936

Open · 1 task done
jblomer opened this issue Nov 14, 2024 · 1 comment · May be fixed by #16931

jblomer (Contributor) commented Nov 14, 2024

Check duplicate issues.

  • Checked for duplicates

Description

As discovered by the CI, RClusterPool::WaitFor() can throw an R__ASSERT upon receiving a null pointer from the cluster future. This can happen under the following sequence of events:

  • The main thread schedules cluster $k$ for background loading.
  • A subsequent call to GetCluster() removes cluster $k$ from the provides set. If the I/O thread has not yet finished loading $k$, this sets the fIsExpired flag on the cluster.
  • The I/O thread, once it has loaded cluster $k$, sees the fIsExpired flag (under the work queue lock). Consequently, it skips decompressing the cluster and sets the cluster promise to null (not under the lock).
  • Another call to GetCluster() can be scheduled right between the test for fIsExpired and the setting of the cluster promise to null in the I/O thread. In this case, GetCluster() mistakenly assumes that the requested cluster will be provided by the I/O thread, while in fact the I/O thread returns null.

The fix seems to be to perform both steps under the lock: the test for fIsExpired and the setting of the cluster promise to null. A sketch of the racy pattern and of the fix follows below.
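
A minimal, hedged sketch of the race and of the proposed fix, using hypothetical names modeled on the description above (RWorkItem, FinishLoadBuggy/FinishLoadFixed, gLockWorkQueue, a std::promise delivering the loaded cluster); this is not the actual RClusterPool code, just an illustration of the locking pattern:

```cpp
#include <future>
#include <memory>
#include <mutex>

struct RCluster {};

struct RWorkItem {
   bool fIsExpired = false;                          // set by GetCluster() when the cluster is no longer wanted
   std::promise<std::unique_ptr<RCluster>> fPromise; // fulfilled by the I/O thread
};

std::mutex gLockWorkQueue; // guards the work queue and fIsExpired

// Buggy version: the expiry test is under the lock, but the promise is
// set outside of it.
void FinishLoadBuggy(RWorkItem &item, std::unique_ptr<RCluster> cluster)
{
   bool isExpired;
   {
      std::lock_guard<std::mutex> guard(gLockWorkQueue);
      isExpired = item.fIsExpired;
   }
   // A GetCluster() call scheduled at this point still sees the item as
   // in flight and will mistakenly wait on a promise that is about to
   // be set to null.
   if (isExpired)
      item.fPromise.set_value(nullptr);
   else
      item.fPromise.set_value(std::move(cluster));
}

// Proposed fix: test fIsExpired and set the promise under the same
// lock, so GetCluster() cannot interleave between the two steps.
void FinishLoadFixed(RWorkItem &item, std::unique_ptr<RCluster> cluster)
{
   std::lock_guard<std::mutex> guard(gLockWorkQueue);
   if (item.fIsExpired)
      item.fPromise.set_value(nullptr);
   else
      item.fPromise.set_value(std::move(cluster));
}
```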

Reproducer

The RandomAccess unit test sometimes triggers the race.

ROOT version

master

Installation method

n/a

Operating system

n/a

Additional context

No response

jblomer self-assigned this Nov 14, 2024
jblomer linked a pull request Nov 14, 2024 that will close this issue
jblomer (Contributor, Author) commented Nov 14, 2024

We can consider dropping the "discard/expire" signal entirely. Since decompression no longer takes place in the I/O thread, it is questionable whether the optimization gains much.
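
For illustration, continuing the hypothetical sketch from the issue description: if the discard/expire signal were dropped, the I/O thread's finish path would simplify to unconditionally fulfilling the promise, and the race would disappear, at the cost of occasionally loading (but no longer decompressing) a cluster that nobody waits for anymore.

```cpp
// Hedged sketch, reusing the hypothetical RWorkItem/RCluster types above.
void FinishLoadWithoutExpiry(RWorkItem &item, std::unique_ptr<RCluster> cluster)
{
   item.fPromise.set_value(std::move(cluster));
}
```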
