Regridding: open RiOMar/GAMAR via Kerchunk → write a HEALPix Zarr

This notebook creates a HEALPix Zarr for a region of interest (ROI).

It opens the dataset from a Kerchunk catalog generated on HPC. The same workflow can read the data:

  • directly on the HPC filesystem, or

  • over HTTPS via the DATARMOR export service (https://data-fair2adapt.ifremer.fr/).

Goal

Produce a temporary Zarr (e.g. small.zarr) that is then used as input to:

  • regrid_apply_ufunc.ipynb

Outputs

  • OUT_ZARR: a lightweight subset written to Zarr (local or HPC scratch)

Steps

  1. Open the dataset from a Kerchunk catalog (portable: HPC filesystem or HTTPS export)

  2. Select variables and ensure lon/lat are explicit coordinates

  3. Build a polygon mask helper for the model grid

  4. Define the ROI and apply the mask (load ROI polygon, mask, and subset)

  5. Regrid to HEALPix

  6. Write the result to Zarr

Kerchunk catalog options

A Kerchunk catalog is a JSON mapping that lets Xarray open a collection of NetCDF (or similar) files as a virtual Zarr dataset.
Depending on where you run this notebook, you can point to:

  • HPC filesystem path (fast, when you have direct access to /scale/...)

  • HTTPS export (portable, when you access the same data through https://data-fair2adapt.ifremer.fr/)

Example catalog paths created previously are kept in the code below for reference (see CATALOG_PATH).
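
For orientation, a Kerchunk v1 catalog is plain JSON: metadata entries are JSON strings, and each chunk entry is a [url, byte_offset, byte_length] triple pointing into a source file. A minimal, hypothetical example (file name and byte ranges invented for illustration):

# Hypothetical Kerchunk v1 reference mapping; names and byte ranges
# are invented for illustration only.
catalog = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "temp/.zarray": (
            '{"shape": [8760, 500, 500], "chunks": [24, 500, 500],'
            ' "dtype": "<f4", "compressor": null, "fill_value": null,'
            ' "filters": null, "order": "C", "zarr_format": 2}'
        ),
        # One chunk of "temp", read as a byte range from the original NetCDF:
        "temp/0.0.0": ["/scale/project/lops-oh-fair2adapt/riomar/2023.nc", 8192, 2400000],
    },
}

Rewriting the URL element of each chunk entry is what makes the same catalog usable over HTTPS.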

1. Open the dataset from a Kerchunk catalog (portable: HPC filesystem or HTTPS export)

Many Kerchunk catalogs store references to the original files (often as absolute HPC paths).
When opening through HTTPS, we rewrite those references from

/scale/project/lops-oh-fair2adapt/... → https://data-fair2adapt.ifremer.fr/...

The cell below:

  1. Detects whether the HPC path exists (meaning we are running on the cluster) and, if so, opens the catalog directly.

  2. Otherwise, loads the JSON over HTTPS, patches the references in memory, and opens the dataset.

  3. Optionally, the patched references can be cached locally as a Parquet file for faster re-opening.

%%time
import json
from pathlib import Path

import fsspec
import numpy as np
import xarray as xr

from healpix_regrid import (
    apply_polygon_mask,
    patch_kc_refs_inplace,
    start_client,
    to_healpix,
)

time_chunk_size = 24  # 1 day as a chunk
child_level = 13

with np.load("parent_ids.npz") as data:
    parent_ids = data["parent_ids"]
    parent_level = int(data["parent_level"])

HPC_PREFIX = "/scale/project/lops-oh-fair2adapt/"
HTTPS_PREFIX = "https://data-fair2adapt.ifremer.fr/"
CATALOG_PATH = "riomar-virtualizarr/YALL.json"
OUT_ZARR = "riomar-zarr_tina/Y2023.zarr"

# Start Dask client (auto-detects HPC vs local)
client = start_client(hpc_prefix=HPC_PREFIX)

# ------------------------------
# 1) HPC mode: open directly
# ------------------------------
if Path(HPC_PREFIX).exists():
    zarr_hp_file_path = HPC_PREFIX + OUT_ZARR
    KERCHUNK_CATALOG = HPC_PREFIX + CATALOG_PATH
    print("Running in HPC mode:", KERCHUNK_CATALOG)
    ds = xr.open_dataset(KERCHUNK_CATALOG, engine="kerchunk", chunks={})

# ------------------------------
# 2) HTTPS mode: fetch JSON, patch refs, open
# ------------------------------
else:
    zarr_hp_file_path = OUT_ZARR
    KERCHUNK_CATALOG = HTTPS_PREFIX + CATALOG_PATH
    print("Running in HTTPS mode:", KERCHUNK_CATALOG)

    with fsspec.open(KERCHUNK_CATALOG, "rt") as f:
        kc = json.load(f)

    kc = patch_kc_refs_inplace(kc)
    ds = xr.open_dataset(kc, engine="kerchunk", chunks={})

ds
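
The actual implementation of patch_kc_refs_inplace is not shown in this notebook; conceptually, the patching step only needs to rewrite the URL element of each chunk entry. A minimal sketch under that assumption (patch_refs is a hypothetical name):

# Hypothetical sketch of in-memory reference patching; the real
# patch_kc_refs_inplace in healpix_regrid may behave differently.
def patch_refs(kc: dict, old: str = HPC_PREFIX, new: str = HTTPS_PREFIX) -> dict:
    for ref in kc["refs"].values():
        # Chunk entries are [url, offset, length] lists; metadata entries are strings.
        if isinstance(ref, list) and ref and isinstance(ref[0], str):
            ref[0] = ref[0].replace(old, new)
    return kc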

2. Select variables and ensure lon/lat are explicit coordinates

The original dataset contains many variables. For this demo we keep:

  • temp (temperature)

  • salt (salinity)

  • zeta (sea surface height)

We also load the 2D longitude/latitude fields and attach them as coordinates (nav_lon_rho, nav_lat_rho).
Loading them explicitly avoids repeated remote reads later (plots, masking, regridding, etc.).

ds = ds[["temp", "salt", "zeta"]].assign_coords(
    nav_lon_rho=ds["nav_lon_rho"].load(),
    nav_lat_rho=ds["nav_lat_rho"].load(),
)
ds
/srv/conda/envs/notebook/lib/python3.12/site-packages/distributed/client.py:3375: UserWarning: Sending large graph of size 329.87 MiB.
This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.
  warnings.warn(

3. Build a polygon mask helper for the model grid

To extract a spatial subset, we load a boundary polygon (GeoJSON) and create a boolean mask on the dataset grid using apply_polygon_mask from the healpix_regrid package. A sketch of what such a helper could look like is shown below; the steps themselves are listed in the next section.
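
apply_polygon_mask's internals are not shown here; assuming the 2D nav_lon_rho/nav_lat_rho coordinates and shapely 2's vectorized predicates, a minimal boolean-mask variant (polygon_mask is a hypothetical name) could look like:

import shapely
import xarray as xr

def polygon_mask(da, poly):
    # True where the grid point's lon/lat falls inside `poly`.
    lon = da["nav_lon_rho"].values
    lat = da["nav_lat_rho"].values
    inside = shapely.contains_xy(poly, lon.ravel(), lat.ravel()).reshape(lon.shape)
    return xr.DataArray(inside, dims=("y_rho", "x_rho"))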

4. Define the ROI and apply the mask (load ROI polygon, mask, and subset)

For the regridding demo we focus on a limited area defined by an outer boundary polygon (stored in outer_boundary.geojson).
We will:

  1. Read the polygon from GeoJSON.

  2. Trim the dataset with the polygon mask.

  3. Find which points are 'ground' by computing the not-null values of zeta at time_counter=0.

  4. Stack the spatial coordinates and drop all ground points.

import geopandas as gpd

# 1. Read the polygon from GeoJSON.
gdf = gpd.read_file("outer_boundary.geojson")
poly = gdf.geometry.iloc[0]

# 2.-3. Trim the dataset with the polygon mask, then mark the non-null
#       (wet) values of zeta at time_counter=0 as the ROI mask.
zeta_mask = apply_polygon_mask(ds.zeta.isel(time_counter=0).compute(), poly).notnull()

ds_roi = ds
ds_roi["zeta_mask"] = zeta_mask

# 4. Stack the spatial coordinates and drop all ground points (ds_roi_1d)
ds_roi_1d = ds_roi.stack(point=("y_rho", "x_rho"))
ds_roi_1d = ds_roi_1d.where(ds_roi_1d.zeta_mask, drop=True).drop_vars("zeta_mask")
ds_roi_1d
from functools import partial

# Build a regrid function with our ROI parameters bound
regrid = partial(
    to_healpix,
    parent_ids=parent_ids,
    parent_level=parent_level,
    child_level=child_level,
    time_chunk_size=time_chunk_size,
)
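
to_healpix's internals are likewise not shown; conceptually, regridding to HEALPix starts by assigning each model point a NESTED cell index at child_level, which can be sketched with healpy (assuming healpy is available):

import healpy as hp

nside = 2 ** child_level  # NESTED scheme: nside doubles per level
cell_ids = hp.ang2pix(
    nside,
    ds_roi_1d["nav_lon_rho"].values,
    ds_roi_1d["nav_lat_rho"].values,
    nest=True,
    lonlat=True,  # inputs are lon/lat in degrees
)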

5. Regrid to HEALPix and write the Zarr

from healpix_regrid import write_zarr_chunked

block = 24
nt = min(ds_roi_1d.sizes["time_counter"], 48)  # limit to 48 time steps for the demo

write_zarr_chunked(
    ds_roi_1d.isel(time_counter=slice(0, nt)),
    zarr_hp_file_path,
    regrid_fn=regrid,
    block_size=block,
)
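
write_zarr_chunked's implementation is not shown here; a common pattern it may resemble is an initial write followed by appends along the time dimension with plain xarray:

# Hypothetical block-wise write; the real write_zarr_chunked may instead
# use region writes or other optimizations.
for start in range(0, nt, block):
    stop = min(start + block, nt)
    ds_hp = regrid(ds_roi_1d.isel(time_counter=slice(start, stop)))
    if start == 0:
        ds_hp.to_zarr(zarr_hp_file_path, mode="w")
    else:
        ds_hp.to_zarr(zarr_hp_file_path, append_dim="time_counter")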