[Discussion] Design goal for v1.0 #361

Moelf · 2024-10-25T23:52:04Z

This is a bigger scope discussion that includes #319 and #314

There are two questions:

what features / behavior we want for TTree/RNTuple reading
what the default behavior should be

I think there are essentially three kinds of behavior when it comes to accessing the data (let's put aside the discussion on application-level behavior mapping, such as what EDM4hep.jl needs, and only focus on data access aspect of those applications):

one-row at a time iteration over the entire table
columnar style

1.a

for evt in df
    evt.Muon.pt
end

2.a

df.Muon.pt

But you can also have batched variations of 1 and 2:

1.b

for sub_df in Tables.partitions(df), evt in sub_df
    evt.Muon.pt
end

2.b

for sub_df in Tables.partitions(df)
    sub_df.Muon.pt
end

Currently, we have been developing this package by assuming users want 1.a (non-bach, for loop), and all the internal designs are geared towards that, for example:

iterate(LazyTree) is type stable, this causes big latency when making the tree
each TBranch/Field carries N caches, this can cause high memory use
each TBranch/Field carries N locks, N = number of threads, this causes over-head when indexing, see benchmark

Proposal

I think it might be useful to reduce latency and memory problem by dropping both iteration stability by default, and the per-thread buffer and lock. If I can only choose to drop one, I'd drop per-thread buffer and lock, and recommend 1.b as default pattern.

The text was updated successfully, but these errors were encountered:

oschulz · 2024-10-26T07:44:54Z

I'd mostly want column-mode reading and writing, along the lines of

using Tables, PropertyFunctions

# Select columns of interest, e.g. using `PropertyFunctions.PropSelFunction` (we can support multiple ways to do this via package extensions):
col_selected_rntuple = @pf((;$a, $b, $c)).(rntuple)

# Iterate over fully instantiated tables (e.g. based on StructArrays, ArraysOfArrays and so on):
for tbl in Tables.partitions(col_selected_rntuple)
   # ... user code that accesses columns directly, or uses broadcasts, or even `adapt(CuArray, tbl)` and so on ...
end

Moelf · 2024-10-26T08:11:40Z

right, Tables.partitions should just iterate clusters in RNTuple era

sounds like you want 2.b, which is not a bad default recommendation, since that's basically uproot

oschulz · 2024-10-26T10:36:44Z

sounds like you want 2.b

Yes, that would be my main way of using it. I assume for-loops with smart buffering will work anyway as well.

And we can support lazy broadcast over whole rntuples, with some specializations for PropertyFunctions and Accessors in package extensions. That way, something like (@pf (; $a, $d)).(rntuple) will be a branch selection, (@pf (x = $a + $d; y = $c * $h)).(rntuple) will be a "mapped view" (but the engine knows up front with branches to access) and (row -> (x = row.a + row.d; y = row.c * row.h)).(rntuple) would be a mapped view that relies on smart branch reading. (With Accessors some similar things can be done as with PropertyFunctions, we should support that as well).

Moelf · 2024-10-26T11:35:16Z

I assume for-loops with smart buffering will work anyway as well.

the question there is: 1) if it should be thread-safe by default, that has the "lock" overhead as well as potentially use too much RAM and 2) if it should be type-stable out of the gate, this has the high latency cost for user.

tamasgal · 2024-10-26T17:48:25Z

We use 99% of the time 1a 😅

Moelf · 2024-10-26T19:29:08Z

Question for that, can you switch to 1.b easily?

1.b has thread-safety by construction, as for type stability, we can always do the TypedTable thing as an option/different API

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Discussion] Design goal for v1.0 #361

[Discussion] Design goal for v1.0 #361

Moelf commented Oct 25, 2024 •

edited

Loading

oschulz commented Oct 26, 2024

Moelf commented Oct 26, 2024 •

edited

Loading

oschulz commented Oct 26, 2024 •

edited

Loading

Moelf commented Oct 26, 2024

tamasgal commented Oct 26, 2024

Moelf commented Oct 26, 2024

[Discussion] Design goal for v1.0 #361

[Discussion] Design goal for v1.0 #361

Comments

Moelf commented Oct 25, 2024 • edited Loading

1.a

2.a

1.b

2.b

Proposal

oschulz commented Oct 26, 2024

Moelf commented Oct 26, 2024 • edited Loading

oschulz commented Oct 26, 2024 • edited Loading

Moelf commented Oct 26, 2024

tamasgal commented Oct 26, 2024

Moelf commented Oct 26, 2024

Moelf commented Oct 25, 2024 •

edited

Loading

Moelf commented Oct 26, 2024 •

edited

Loading

oschulz commented Oct 26, 2024 •

edited

Loading