Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read-write Xarray with nc-dataset adapter. #4835

Closed

Conversation

pp-mo
Copy link
Member

@pp-mo pp-mo commented Jun 28, 2022

Targets feature branch, just to unify testing.
This introduces a "netcdf dataset adapter" for xarray datasets, enabling to read/write xarray data as if it were a netcdf file.
ALSO combined here (and not really separate for now):

If this idea flies, it might be worth separating out the core-iris changes into a separate PR rather than doing it all at once?

Hence "WIP"

Also, various TO-DOs noted on this old "private PR" : pp-mo#72
(update: all done)

@pp-mo pp-mo force-pushed the nc_dataset_load branch from 5d21804 to c07ad44 Compare June 28, 2022 12:52
@pp-mo
Copy link
Member Author

pp-mo commented Jun 28, 2022

Status update:
Almost there with CI.
Tests are functioning.
Now just some docs-build errors.

@pp-mo pp-mo force-pushed the nc_dataset_load branch from b7eecb8 to 09f07ff Compare June 28, 2022 17:38
raise IOError(msg)
else:
raise
if hasattr(filename, "createVariable"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth considering a runtime protocol to define, in code, what you mean by "supporting the netCF4.Dataset api"?

https://docs.python.org/3/library/typing.html#typing.runtime_checkable. Added in Python 3.8.

@@ -1061,6 +1064,7 @@ def __init__(self, filename, netcdf_format):

* filename (string):
Name of the netCDF file to save the cube.
OR a writeable object supporting the netCF4.Dataset api.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
OR a writeable object supporting the netCF4.Dataset api.
OR a writeable object supporting the netCDF4.Dataset api.

Copy link
Member

@jamesp jamesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a complete review, just some questions that popped out at me as I read the code.

This is SUPER cool :D

class _XrMimic:
"""
An netcdf object "mimic" wrapped around an xarray object, which will be
either a dim, var or dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By dim do you mean DimCoord, in iris parlance?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be good to be explicit here: var I assume is xr.Variable, dataset is xr.Dataset. dim is an object that doesn't exist in xarray, but you need it for netcdf compat?


def __init__(self, xr, name=None, group=None):
"""
Create a mimic object wrapping a :class:`nco.Ncobj` component.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is nco.Ncobj? Old init string text?


def __getitem__(self, keys):
if self.ndim == 0:
return self._xr.data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keys is ununsed in this branch of the if. Is that right? Should keys have default value? Document with a comment either way I think.

Comment on lines +207 to +209
Only a limited subset of the :mod:`netCDF4` APIs are currently
supported : just enough to allow Iris to read and write xarray datasets
in place of netcdf files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this might be where a runtime protocol could help document which parts of the api are implemented so far.

@pp-mo pp-mo mentioned this pull request Jun 29, 2022
@pp-mo
Copy link
Member Author

pp-mo commented Jun 29, 2022

Lazy xarray output

Here's an interesting follow-on possibility ... pp-mo#73
(keeping this feature separate for now, just to simplify the discussion).

without this, iris "streams" all data into the netcdf variables, as it would to a file,
so everything gets realised and stored in memory.

@pp-mo
Copy link
Member Author

pp-mo commented Jul 1, 2022

Important Context : things this can be used for ?

In addition to just "fixing data a bit" on either load or save,
I think this might possibly help with a number of outstanding "desired features" (all relating to netcdf) :

Generally, all these issues may be better handled in Xarray, since they are close to the file representation -- so a solution there can work with specific variables+dimensions, rather than Iris CF-representing objects.

@bjlittle
Copy link
Member

bjlittle commented Jul 25, 2022

Ping @dcherian

Just thought that I'd make you aware of this PR.

It's early days, but you may be able to offer some perspective on this...

@pp-mo
Copy link
Member Author

pp-mo commented Jul 25, 2022

Ping @dcherian

Thanks for making the link @bjlittle

FWIW some other comments on this that you might also like to consider ..

The current approach here is rather untidy, and grew out of an attempt to create a simple 'wrapper' object to an existing xarray dataset. But that basic approach actually fails once write support is added, since it is not possible to modify the dimensions of an existing xarray dataset, and the result is now a rather clunky "mixed" approach.

The principal limitation of this as it is, is that it expects an xarray-dataset which does not intpret cf/coords/time-data
- which is why those are all turned off in the test example.
Likewise, the xarray that you get out of the 'to_array' method will also have shortcomings in those respects + not match a "normal" load from file (e.g. in terms of time data).
[ update, 2022-08-18 :

  • in fact, we maybe should also be using "mask_and_scale=False" in the loading too
  • as I think @TomekTrzeciak also pointed out, this particular xarray encoding aspect is (perhaps uniquely?) non-reversible,
    • .. in that, it loses information by (potentially) mixing up NaN values and masked points (i.e. any occurrence of fill-value)
  • however, the other data transformations (which this disables) definitely all can in principle be "undone" by design (the decode+encode methods)
    So there is a valid route to removing those interpretations from a 'regular' dataset, as is done on saving.

]

If this "load/save xarray as a netcdf dataset" approach has some general utility, then we should put it into xarray (not Iris!),
and work on those problems, which I believe is entirely feasibble.

@dcherian
Copy link
Contributor

Thanks @bjlittle

Is there an issue I can read about the motivation for this?

If it is to take advantage of Xarray's read/write backend options, would it be easier to use the to_iris method?

@pp-mo
Copy link
Member Author

pp-mo commented Jul 26, 2022

Thanks @bjlittle

Is there an issue I can read about the motivation for this?

You're quite right, there should probably be an issue.
I will work towards that ...
For now, I can basically say that the ideal goal is ~lossless conversion between iris, xarray and netcdf files.


If it is to take advantage of Xarray's read/write backend options, would it be easier to use the [to_iris method]

It seems to me that a big "problem" with xarray to_iris/from_iris is that it requires embedding knowledge of Iris into xarray. And also, effectively, knowledge of CF, and what parts of CF Iris supports -- all of which is really Iris' business.

Meanwhile, there is a reasonable argument that an iris.to_array/from_xarray would be a better approach, since what Iris does is mostly "extend" xarray capabilities by adding CF handling.
But that then causes essentially the same problems of injecting xarray-specifics into Iris.

And with both those solutions, there is a particular problem regarding dependencies and testing, especially for C.I. :

  • in order to test an xarray to/from-iris, the xarray CI needs to test against Iris, which is a costly dependency
  • and exactly the same is true for the "converse solution" (iris to/from_xarray)

However, this kind of "dataset adapter" approach only depends on knowledge of xarray and netCDF4.

  • so, as it has no Iris-specific knowledge, it really should not be here (**)
    • it could properly belong in xarray,
    • or in a separate package
  • and it could then be used by any other packages wanting to establish file-less linkage to xarray data
    • does that seem feasible ?

(**) but we still need certain changes in Iris :

  1. support reading/writing a netcdf-dataset (or similar) instead of a filepath
  2. pass through lazy data on write, instead of realise+store

Hence, code here is really just a PoC, showing how those solutions can work together

@pp-mo
Copy link
Member Author

pp-mo commented Oct 12, 2022

I have now broken this by advancing the F-B with a mergeback.
But that's ok, I am now rationalising this work to separate concerns a bit.

So the "let iris read+write nc datasets" part is now up, in #5024,
other PRs to follow, against the updated feature-branch.

@pp-mo pp-mo closed this Oct 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

4 participants