Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance coordinate handling and add curvilinear dataset support #736

Merged
merged 10 commits into from
Mar 31, 2025

Conversation

tomvothecoder
Copy link
Collaborator

@tomvothecoder tomvothecoder commented Feb 10, 2025

Description

This pull request focuses on improving coordinate handling and adding support for curvilinear datasets, specifically to ensure horizontal regridding capabilities with correct grid generation. The most important changes include the addition of the get_coords_by_name function, updates to the axis module to use type hinting, and new test cases for curvilinear datasets.

Enhancements to coordinate handling:

  • Added the get_coords_by_name function to retrieve coordinate variables based on their names, which is useful for non-CF-compliant coordinates and axes with multiple sets of coordinates. (xcdat/axis.py)
  • Updated the axis module to use modern type hinting and simplified type definitions. (xcdat/axis.py) [1] [2] [3] [4] [5]

Support for curvilinear datasets:

  • Update grid() property in accessor.py to handle curvilinear grids with correct bound generation.
  • Added a new function generate_curvilinear_dataset to create synthetic curvilinear datasets with CF-compliant metadata. (tests/fixtures.py)
  • Added test cases for curvilinear datasets to ensure proper handling and grid generation. (tests/test_regrid.py)

Documentation and imports:

  • Updated the API documentation to include the new get_coords_by_name function. (docs/api.rst)
  • Added import statements for get_coords_by_name in various files. (xcdat/__init__.py, tests/test_axis.py, tests/fixtures.py) [1] [2] [3]
  • Corrected relative paths in the Jupyter notebook documentation to ensure proper linking. (docs/demos/1-25-23-cwss-seminar/introduction-to-xcdat.ipynb)
  • Fix xcdat.tutorial.open_dataset() API docstring formatting

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder tomvothecoder added the type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Feb 10, 2025
@tomvothecoder tomvothecoder self-assigned this Feb 10, 2025
@tomvothecoder tomvothecoder changed the title Update grid property to ensure cf compliant coordinates Update regridder.grid property to ensure cf compliant coordinates Feb 10, 2025
Copy link
Collaborator Author

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @pochedls and @jasonb5, this PR fixes #718. Can you test and review?

I'd like to get this merged in the next day or two.

@tomvothecoder tomvothecoder added the type: enhancement New enhancement request label Feb 10, 2025
Copy link

codecov bot commented Feb 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (a282117) to head (5aea067).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #736   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           16        16           
  Lines         1658      1680   +22     
=========================================
+ Hits          1658      1680   +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jasonb5
Copy link
Collaborator

jasonb5 commented Feb 12, 2025

Hey @pochedls and @jasonb5, this PR fixes #718. Can you test and review?

I'd like to get this merged in the next day or two.

I will review this PR today.

@pochedls
Copy link
Collaborator

pochedls commented Feb 12, 2025

Things are working now, but I think there is still an issue. Take a look at the map from xcdat (this PR):

Screenshot 2025-02-12 at 11 40 16 AM

compared to calling xesmf directly:

Screenshot 2025-02-12 at 11 41 10 AM

A vague idea/possibility: In curvilinear grids lat/lon are often represented as an index (e.g., [0, 1, 2, ...]) for latitude["lat", "lon"] and longitude["lat", "lon"]. I'm wondering if the regridder is interpreting lat (the index) as the source grid points instead of the actual latitude values that are a function of lat/lon. [I might have lat/latitude reversed in this example, but something like this may be going on].

@tomvothecoder
Copy link
Collaborator Author

@pochedls Thanks Steve. It looks like we need more work on this. I'll push back the release of v0.8.0 depending on when the next E3SM Unified is released (which is probably being pushed back from 2/15)

A vague idea/possibility: In curvilinear grids lat/lon are often represented as an index (e.g., [0, 1, 2, ...]) for latitude["lat", "lon"] and longitude["lat", "lon"]. I'm wondering if the regridder is interpreting lat (the index) as the source grid points instead of the actual latitude values that are a function of lat/lon. [I might have lat/latitude reversed in this example, but something like this may be going on].

Hmm this is a good question. I'll step through the code to see which values are being used for latitude.

@tomvothecoder
Copy link
Collaborator Author

A vague idea/possibility: In curvilinear grids lat/lon are often represented as an index (e.g., [0, 1, 2, ...]) for latitude["lat", "lon"] and longitude["lat", "lon"]. I'm wondering if the regridder is interpreting lat (the index) as the source grid points instead of the actual latitude values that are a function of lat/lon. [I might have lat/latitude reversed in this example, but something like this may be going on].

I explain the issue here. You're correct, nlat and nlon are being used to represent the grid because they are considered the index/dimension coordinates. However, these coordinates contain integers representing the indices of the coordinates. lat and lon are ancillary coordinates and contain the actual values that are needed for regridding with xESMF.

This is mainly a metadata/coordinates issue rather than an issue in xCDAT's logic. We'll need to think more carefully about how we can retrieve lat and lon since they are ancillary coordinates. The regridder.grid property calls _get_axis_data() which then calls xc.get_dim_coords() (only gets dimension coordinates).

Copy link
Collaborator

@jasonb5 jasonb5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomvothecoder
Copy link
Collaborator Author

More work needs to be done based on @jasonb5's comment here: #718 (comment)

@tomvothecoder tomvothecoder changed the title Update regridder.grid property to ensure cf compliant coordinates Update regridder.grid property to return desired coordinates Mar 27, 2025
Copy link
Collaborator Author

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @pochedls and @jasonb5, this PR is ready for review. Unit tests have been updated.

I verified this PR now works on the test example in #718.

# %%
# imports
import xcdat as xc
import numpy as np

# target grid
nlat = xc.create_axis("lat", np.arange(-88.5, 90, 2.5))
nlon = xc.create_axis("lon", np.arange(1.25, 360, 2.5))
ngrid = xc.create_grid(x=nlon, y=nlat)

# open / regrid
p = "/p/css03/esgf_publish/CMIP6/ScenarioMIP/NCAR/CESM2-WACCM/ssp585/r1i1p1f1/Omon/fgco2/gn/v20200702/"
ds = xc.open_mfdataset(p)

# %%
ds = ds.regridder.horizontal(
    "fgco2", ngrid, tool="xesmf", method="conservative_normed", periodic=True
)

Output

/home/vo13/miniforge3/envs/xcdat_dev_531/lib/python3.13/site-packages/xarray/conventions.py:204: SerializationWarning: variable 'fgco2' has multiple fill values {np.float32(1e+20), np.float64(1e+20)} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
/home/vo13/miniforge3/envs/xcdat_dev_531/lib/python3.13/site-packages/xarray/conventions.py:204: SerializationWarning: variable 'fgco2' has multiple fill values {np.float32(1e+20), np.float64(1e+20)} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
/home/vo13/miniforge3/envs/xcdat_dev_531/lib/python3.13/site-packages/xarray/conventions.py:204: SerializationWarning: variable 'fgco2' has multiple fill values {np.float32(1e+20), np.float64(1e+20)} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
/home/vo13/miniforge3/envs/xcdat_dev_531/lib/python3.13/site-packages/xarray/conventions.py:204: SerializationWarning: variable 'fgco2' has multiple fill values {np.float32(1e+20), np.float64(1e+20)} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
/home/vo13/miniforge3/envs/xcdat_dev_531/lib/python3.13/site-packages/xarray/conventions.py:204: SerializationWarning: variable 'fgco2' has multiple fill values {np.float32(1e+20), np.float64(1e+20)} defined, decoding all values to NaN.
  var = coder.decode(var, name=name)
2025-03-28 14:58:48,413 [WARNING]: bounds.py(add_missing_bounds:194) >> The nlat coord variable has a 'units' attribute that is not in degrees.
2025-03-28 14:58:48,413 [WARNING]: bounds.py(add_missing_bounds:194) >> The nlat coord variable has a 'units' attribute that is not in degrees.

I think we can ignore the below warnings since the grid object does not include nlat_bnds and nlon_bnds (they aren't necessary). xESMF is still able to retrieve the correct bounds (lat_bnds and lon_bnds).

/home/vo13/miniforge3/envs/xcdat_dev_531/lib/python3.13/site-packages/xesmf/frontend.py:95: UserWarning: Variables {'nlon_bnds'} not found in object but are referred to in the CF attributes.
  lon_bnds = ds.cf.get_bounds('longitude')
/home/vo13/miniforge3/envs/xcdat_dev_531/lib/python3.13/site-packages/xesmf/frontend.py:96: UserWarning: Variables {'nlon_bnds'} not found in object but are referred to in the CF attributes.
  lat_bnds = ds.cf.get_bounds('latitude')

Comment on lines +606 to +611
def generate_curvilinear_dataset() -> xr.Dataset:
"""Generate a curvilinear Dataset with CF-compliant metadata.

The dataset includes variables for time, latitude, longitude, and their
respective bounds. It also contains a synthetic data variable (``test_var``)
and additional coordinate information.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function generates a dummy curvilinear dataset, similar to the one used in the example in #718.
p = '/p/css03/esgf_publish/CMIP6/ScenarioMIP/NCAR/CESM2-WACCM/ssp585/r1i1p1f1/Omon/fgco2/gn/v20200702/'

Comment on lines +1264 to +1281
def test_grid_curvilinear(self):
ds = fixtures.generate_curvilinear_dataset()

result = ds.regridder.grid
expected = xr.Dataset(
coords={"lat": ds.lat, "lon": ds.lon, "nlon": ds.nlon, "nlat": ds.nlat},
data_vars={
"lat_bnds": ds.lat_bnds,
"lon_bnds": ds.lon_bnds,
},
)

xr.testing.assert_identical(result, expected)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New unit test to cover curvilinear case

Comment on lines 146 to 169
def _get_coord_by_name(self, name: CFAxisKey) -> xr.DataArray:
"""Get the coordinate variable based on its name.

This method is useful for returning the desired coordinates for an
axis that has multiple sets of coordinates. For example, curvilinear
grids can have multiple coordinates for the same axis (e.g., (nlat,lat)
for X and (nlon,lon)) for Y.

This method will find the desired coordinates based on its name while
ensuring that the coordinate variable is 1D by dropping any extra
dimensions (e.g., "nlat" or "nlon").

Parameters
----------
name : CFAxisKey
Name of the coordinate.

Returns
-------
xr.DataArray
The coordinate variable.

Raises
------
KeyError
If the coordinate variable is not found in the dataset.
"""
coord_names = VAR_NAME_MAP[name]

for coord_name in coord_names:
if coord_name in self._ds.coords:
coord = self._ds.coords[coord_name]

return coord

raise KeyError(f"Coordinate with name '{name}' not found in the dataset.")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new function retrieves X and Y axis coordinates using the xCDAT VAR_NAME_MAP table (e.g., "lat" for latitude coordinates).

xcdat/xcdat/axis.py

Lines 41 to 49 in a282117

# A dictionary that maps common variable names to coordinate variables. This
# map is used as fall-back when coordinate variables don't have CF attributes
# set for ``cf_xarray`` to interpret using `CF_ATTR_MAP`.
VAR_NAME_MAP: Dict[CFAxisKey, List[str]] = {
"X": ["longitude", "lon"],
"Y": ["latitude", "lat"],
"T": ["time"],
"Z": ["vertical", "height", "pressure", "lev", "plev"],
}

It is more dynamic than the xESMF hard-coded retrieval of just "lat" and "lon" here.

Comment on lines -101 to +120
ds = ds.bounds.add_missing_bounds(axes=["X", "Y", "Z"])
# Add bounds only for axes that do not already have them. This
# prevents multiple sets of bounds being added for the same axis.
# For example, curvilinear grids can have multiple coordinates for the
# same axis (e.g., (nlat, lat) for X and (nlon, lon) for Y). We only
# need lat_bnds and lon_bnds for the X and Y axes, respectively, and not
# nlat_bnds and nlon_bnds.
for axis, has_bounds in axis_has_bounds.items():
if not has_bounds:
ds = ds.bounds.add_bounds(axis=axis)
Copy link
Collaborator Author

@tomvothecoder tomvothecoder Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new logic prevents unnecessary attempts to add bounds for axes if they already exist. Otherwise, I found that ds.bounds.add_missing_bounds() will unnecessarily add nlon_bnds while failing to add bounds for nlat (raises warning about units not in degrees).

@pochedls
Copy link
Collaborator

Thanks for putting this PR together so quickly, @tomvothecoder (and for the work you did in identifying/explaining the issue @jasonb5)! Have a great weekend!

- Update `accessor.py` to use `get_coords_by_name` and remove `_get_coord_by_name`
@github-actions github-actions bot added the type: docs Updates to documentation label Mar 31, 2025
Comment on lines +157 to +200
def get_coords_by_name(obj: xr.Dataset | xr.DataArray, axis: CFAxisKey) -> xr.DataArray:
"""Retrieve the coordinate variable based on its name.

This method is useful for returning the desired coordinate in the following
cases:

- Coordinates that are not CF-compliant (e.g., missing CF attributes like
"axis", "standard_name", or "long_name") but use common names.
- Axes with multiple sets of coordinates. For example, curvilinear grids may
have multiple coordinates for the same axis (e.g., (nlat, lat) for X and
(nlon, lon) for Y). In most cases, "lat" and "lon" are the desired
coordinates, which this function will return.

Common variable names for each axis (from ``VAR_NAME_MAP``):

- "X" axis: ["longitude", "lon"]
- "Y" axis: ["latitude", "lat"]
- "T" axis: ["time"]
- "Z" axis: ["vertical", "height", "pressure", "lev", "plev"]

Parameters
----------
axis : CFAxisKey
The CF axis key ("X", "Y", "T", or "Z").

Returns
-------
xr.DataArray
The coordinate variable.

Raises
------
KeyError
If the coordinate variable is not found in the dataset.
"""
coord_names = VAR_NAME_MAP[axis]

for coord_name in coord_names:
if coord_name in obj.coords:
coord = obj.coords[coord_name]

return coord

raise KeyError(f"Coordinate with name '{axis}' not found in the dataset.")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pochedls I made this function a public API (formerly _get_coord_by_name() in accessor.py) because I found it was a general function and it might be useful for users who deal with the cases mentioned in the docstring.

Comment on lines +122 to +130
def _get_axis_coord_and_bounds(
self, axis: CFAxisKey
) -> Tuple[xr.DataArray | None, xr.DataArray | None]:
try:
bounds_var = self._ds.bounds.get_bounds(name, coord_var.name)
except KeyError:
bounds_var = None
coord_var = get_coords_by_name(self._ds, axis)
if coord_var.size == 1:
raise ValueError(
f"Coordinate '{coord_var}' is a singleton and cannot be used."
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore singleton coordinates because we cannot generate bounds for a singleton coordinates yet (e.g., "height" with CF-compliant metadata mapping it to a Z axis).

@tomvothecoder tomvothecoder changed the title Update regridder.grid property to return desired coordinates Enhance coordinate handling and add curvilinear dataset support Mar 31, 2025
@tomvothecoder tomvothecoder merged commit 595845f into main Mar 31, 2025
10 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in xCDAT Development Mar 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. type: docs Updates to documentation type: enhancement New enhancement request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug]: CF compliance issues when regridding [Bug]: Getting data CF-compliant for regridding (with xesmf)
3 participants