-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Discussion] Design goal for v1.0 #361
Comments
I'd mostly want column-mode reading and writing, along the lines of using Tables, PropertyFunctions
# Select columns of interest, e.g. using `PropertyFunctions.PropSelFunction` (we can support multiple ways to do this via package extensions):
col_selected_rntuple = @pf((;$a, $b, $c)).(rntuple)
# Iterate over fully instantiated tables (e.g. based on StructArrays, ArraysOfArrays and so on):
for tbl in Tables.partitions(col_selected_rntuple)
# ... user code that accesses columns directly, or uses broadcasts, or even `adapt(CuArray, tbl)` and so on ...
end |
right, sounds like you want 2.b, which is not a bad default recommendation, since that's basically uproot |
Yes, that would be my main way of using it. I assume for-loops with smart buffering will work anyway as well. And we can support lazy broadcast over whole rntuples, with some specializations for PropertyFunctions and Accessors in package extensions. That way, something like |
the question there is: 1) if it should be thread-safe by default, that has the "lock" overhead as well as potentially use too much RAM and 2) if it should be type-stable out of the gate, this has the high latency cost for user. |
We use 99% of the time 1a 😅 |
Question for that, can you switch to 1.b easily? 1.b has thread-safety by construction, as for type stability, we can always do the TypedTable thing as an option/different API |
This is a bigger scope discussion that includes #319 and #314
There are two questions:
I think there are essentially three kinds of behavior when it comes to accessing the data (let's put aside the discussion on application-level behavior mapping, such as what EDM4hep.jl needs, and only focus on data access aspect of those applications):
1.a
2.a
But you can also have batched variations of 1 and 2:
1.b
2.b
Currently, we have been developing this package by assuming users want 1.a (non-bach, for loop), and all the internal designs are geared towards that, for example:
iterate(LazyTree)
is type stable, this causes big latency when making the treeN
caches, this can cause high memory useN
locks,N
= number of threads, this causes over-head when indexing, see benchmarkProposal
I think it might be useful to reduce latency and memory problem by dropping both iteration stability by default, and the per-thread buffer and lock. If I can only choose to drop one, I'd drop per-thread buffer and lock, and recommend 1.b as default pattern.
The text was updated successfully, but these errors were encountered: