This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
More control over lazy data creation (chunking) #3333
Comments
Hi @TomekTrzeciak. Thanks for the suggestion. In which operations do you want to control chunking? You can already pass a pre-created dask array to the cube constructor, or assign it into cube.data. |
Constructing the cube directly is not very convenient. I guess reassigning cube.data could be an option, but still rather awkward to write something like this:
instead of just:
|
Regarding load, options will depend on the source format. The "field-based" file formats (FF, PP, GRIB) deal only in 2D fields, and they have no efficient access to subregions of a 2D field (i.e. the format code can only load a whole field and then extract from it). Thus, the natural chunk size is the whole field, and I don't think there will ever be any practical use for chunking differently in those cases. But I guess you are talking about netCDF?
There is certainly scope for controlling that: for instance, the chunk reduction assumes C-order contiguity, so it will be worst-case if earlier dimensions vary faster in the file. So I think we are talking about adding a chunk-control keyword to the netCDF loader. |
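To make the C-order assumption concrete, here is a minimal sketch of a chunk-shape reduction that shrinks the leading (slowest-varying in C-order) dimensions first. It is an illustration only and not the Iris implementation; the function name `c_order_chunk_shape` and its parameters are assumptions made for this example.

```python
from math import prod


def c_order_chunk_shape(shape, itemsize, limit_bytes):
    """Return a chunk shape of at most limit_bytes, cutting leading axes first.

    Hypothetical helper: it assumes the data is stored in C order, so the
    trailing dimensions are contiguous and are kept whole where possible.
    """
    chunk = list(shape)
    for axis in range(len(chunk)):
        # Stop as soon as the current chunk fits within the limit.
        if prod(chunk) * itemsize <= limit_bytes:
            break
        # Bytes in one slice along this axis (all trailing dimensions whole).
        slice_bytes = prod(chunk[axis + 1:]) * itemsize
        # Keep as many whole slices as fit, but never fewer than one.
        chunk[axis] = max(1, limit_bytes // slice_bytes)
    return tuple(chunk)


print(c_order_chunk_shape((100, 200, 300), itemsize=8, limit_bytes=1_000_000))
# (2, 200, 300)
```

Here only the first dimension is cut. If the file actually stored that first dimension as the fastest-varying one, each such chunk would be scattered across the file, which is the worst case mentioned above.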
Yes, I think extra keyword passed through from load api to netcdf loader would be all that's needed. |
I had a quick look. Unfortunately, there is no support for additional args/kwargs in the generic load functions,
In the netcdf-specific loader, we currently have
Will this work for your purposes? |
@pp-mo, exposing chunks in I think An alternative could be to use a context manager to set/pass backend options without bloating top level APIs. I've noticed that there already exists
|
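The context-manager idea can be sketched in plain Python. This is a hedged illustration of the pattern only: `load_options`, `current_load_options` and `hypothetical_netcdf_loader` are invented names and not part of the Iris API.

```python
import contextlib
import threading

# Thread-local storage, so options set in one thread do not leak into another.
_LOAD_OPTIONS = threading.local()


def current_load_options():
    """Return a copy of the options active in the current context."""
    return dict(getattr(_LOAD_OPTIONS, "values", {}))


@contextlib.contextmanager
def load_options(**options):
    """Temporarily set backend load options (hypothetical, for illustration)."""
    previous = getattr(_LOAD_OPTIONS, "values", {})
    _LOAD_OPTIONS.values = {**previous, **options}
    try:
        yield
    finally:
        # Restore whatever was active before, even if loading raised.
        _LOAD_OPTIONS.values = previous


def hypothetical_netcdf_loader(filename):
    """Stand-in for a format-specific loader that consults the options."""
    chunks = current_load_options().get("chunks", "auto")
    return f"{filename} loaded with chunks={chunks!r}"


with load_options(chunks={"time": 1}):
    print(hypothetical_netcdf_loader("a.nc"))
```

The design point is that the top-level load signature stays unchanged: only the backend that understands a given option needs to look it up.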
Hi @TomekTrzeciak, thanks for sticking with this.
I don't think there is any serious reason to oppose additional load controls. I just thought it sounded like more trouble to get such a change agreed. My concern is that, to be useful, I think we need to be able to specify chunking of individual file variables (see why below...). This means that the controls can't be expressed in terms of core Iris concepts such as cube identity, which then looks rather different to the 'save' case.
I don't see any sensible way around this, as you can't easily predict what a given load will produce, or which Iris objects relate to which parts of a source file, because Iris itself doesn't make any simple guarantees about those behaviours: if the data changes, you can't reliably know beforehand how it will merge, how many cubes are returned, or in what order (see for example #3314).
The reason I think we need a flexible control solution is that we do need to cope with large AuxCoords, which are often larger than the data variables. That is exactly why we implemented lazy loading for AuxCoords, so we will want to control chunking of those variables too. In the near future, we also expect to be dealing with large unstructured grids, which will present the same problem.
I think it could be fine, if we can design a default behaviour that enables us to simplify the simple cases. I'm just a bit wary, as it isn't immediately obvious to me how that can work. |
Hints of progress ? |
Cross-copied from #3357
|
Updated in understanding (mine, anyway) ...
This is true, but I think quite rare, as stated.
Though that may be true for the abstract 'as_lazy_data' call, I now think that is probably not so for netcdf data. Frustratingly, I can't find a clear statement of this anywhere. |
Hi @pp-mo. One possible alternative solution, or part-solution, would be to let the user specify a chunking hint. There would be a default setting, of course: the canonical one (whatever that might be). I adopted this approach in a Python utility I developed for writing multiple compressed variables to netCDF files with different chunking strategies. Looking at the code, I can see that my utility supported the following chunking hints:
Without digging around in the low-level code, I can't remember off the top of my head what each of these hints led to in terms of chunking policy. But that doesn't matter here; you'd obviously choose a variety of hints suitable for chunking dask arrays in different ways. Your solution might want to fall back on a default chunking hint/policy for those cases when the user (or calling program) doesn't specify chunk sizes explicitly. It sounds like Iris is already implementing a default policy, even if it is just the dask default. Anyways, just thought I'd throw this into the mix, although it might not be suitable for the current use-case. (PS: If you did want to snoop around the code, I can sort that out - it's in a private MetO repo.) |
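As a rough illustration of the hint-with-default idea, here is a minimal sketch. The hint names ("whole_array", "per_leading_index") and the function `chunks_from_hint` are invented for this example; they are not the hints from the utility described above, and not an Iris or dask API.

```python
def chunks_from_hint(shape, hint=None, default_hint="whole_array"):
    """Map a named chunking hint to a chunk shape (hypothetical policies)."""
    policies = {
        # One chunk covering the entire array.
        "whole_array": lambda s: tuple(s),
        # One chunk per index of the first dimension, trailing dimensions whole.
        "per_leading_index": lambda s: (1,) + tuple(s[1:]),
    }
    # Fall back on the default policy when the caller gives no hint.
    name = default_hint if hint is None else hint
    if name not in policies:
        raise ValueError(f"unknown chunking hint: {name!r}")
    return policies[name](shape)


print(chunks_from_hint((10, 20, 30)))
print(chunks_from_hint((10, 20, 30), hint="per_leading_index"))
```

The caller only names a policy; the mapping from name to chunk shape stays in one place and can grow without changing call sites.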
Interested in this too, @cpelley? |
Following some more recent experiences, I'm changing my mind on this. My key motivating example:
So, I now believe we really do need to enable user chunking control in such cases,
|
I believe that #4448 is also a very similar problem, with possibly a similar solution |
Hot news! I wrote a draft of something that I'm hoping may be usable for this: #4572 |
Currently, it is not possible to control how the lazy data gets chunked. It is also not possible to change that afterwards (the dask rechunk function does not change the original chunking; it only adds additional split/merge operations on top of it). While the default choice of chunking might be OK in some cases, in others it might be unsuitable, and it would be useful to allow for user choice in this respect.