This notebook creates a HEALPix Zarr for a region of interest (ROI).
It opens the dataset from a Kerchunk catalog generated on HPC, and the same workflow can open it:
directly on the HPC filesystem, or
over HTTPS via the DATARMOR export service (https://data-fair2adapt.ifremer.fr/).
Goal¶
Produce a temporary Zarr (e.g. small.zarr) that is then used as input to:
regrid_apply_ufunc.ipynb
Outputs¶
OUT_ZARR: a lightweight subset written to Zarr (local or HPC scratch)
Steps¶
Open the dataset from a Kerchunk catalog (portable: HPC filesystem or HTTPS export)
Select variables and ensure lon/lat are explicit coordinates
Build a polygon mask helper for the model grid
Define the ROI and apply the mask (load ROI polygon, mask, and subset)
Regrid to the HEALPix grid
Write a Zarr
Kerchunk catalog options¶
A Kerchunk catalog is a JSON mapping that lets Xarray open a collection of NetCDF (or similar) files as a virtual Zarr dataset.
Depending on where you run this notebook, you can point to:
HPC filesystem path (fast, when you have direct access to /scale/...)
HTTPS export (portable, when you access the same data through https://data-fair2adapt.ifremer.fr/)
Below are example catalog paths that have been created previously (kept here as a reference).
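For orientation, a Kerchunk catalog is just a JSON document whose `refs` mapping associates Zarr keys with either inline metadata or `[path, offset, length]` pointers into the original files. The schematic below is illustrative only (paths, shapes, and byte ranges are made up, not taken from the actual catalog):

```python
# Schematic Kerchunk catalog: "refs" maps Zarr keys to either inline
# JSON metadata strings or [file, byte_offset, byte_length] pointers
# into the original NetCDF files. All values here are illustrative.
catalog = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "temp/.zattrs": '{"long_name": "temperature"}',
        # One chunk of "temp": 480000 bytes starting at offset 8192
        "temp/0.0.0": ["/scale/project/lops-oh-fair2adapt/file_0001.nc", 8192, 480000],
    },
}

# A chunk reference is a 3-item list: source file, offset, length
ref = catalog["refs"]["temp/0.0.0"]
print(ref[0])  # source file path (here, an absolute HPC path)
```

Because chunk references typically store absolute paths like the one above, they must be rewritten before the catalog can be used over HTTPS, which is what the next section does.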
1. Open the dataset from a Kerchunk catalog (portable: HPC filesystem or HTTPS export)¶
Many Kerchunk catalogs store references to the original files (often as absolute HPC paths).
When opening through HTTPS, we rewrite those references from:
/scale/project/lops-oh-fair2adapt/... → https://data-fair2adapt.ifremer.fr/...
The cell below:
Detects whether the HPC path exists (so we are running on the cluster).
Otherwise loads the JSON over HTTPS, patches references in-memory, and opens the dataset.
(Optional) can cache the patched references locally as a parquet file for faster re-opening.
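The actual patching logic lives in `healpix_regrid.patch_kc_refs_inplace`; the core idea can be sketched as a prefix rewrite over the reference mapping (a minimal sketch, assuming the `[path, offset, length]` reference layout — the real helper may also handle inline refs, templates, and parquet caching):

```python
HPC_PREFIX = "/scale/project/lops-oh-fair2adapt/"
HTTPS_PREFIX = "https://data-fair2adapt.ifremer.fr/"

def patch_refs(kc: dict) -> dict:
    """Rewrite absolute HPC paths to HTTPS URLs in a Kerchunk mapping.

    Sketch only: patch_kc_refs_inplace from healpix_regrid is the
    authoritative implementation used by this notebook.
    """
    for key, ref in kc.get("refs", {}).items():
        # Chunk references are [path, offset, length]; rewrite the path.
        if isinstance(ref, list) and ref and isinstance(ref[0], str):
            if ref[0].startswith(HPC_PREFIX):
                ref[0] = HTTPS_PREFIX + ref[0][len(HPC_PREFIX):]
    return kc

kc = {"refs": {"temp/0.0.0": ["/scale/project/lops-oh-fair2adapt/f.nc", 0, 10]}}
print(patch_refs(kc)["refs"]["temp/0.0.0"][0])
# → https://data-fair2adapt.ifremer.fr/f.nc
```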
%%time
import json
from pathlib import Path
import fsspec
import numpy as np
import xarray as xr
from healpix_regrid import (
apply_polygon_mask,
patch_kc_refs_inplace,
start_client,
to_healpix,
)
time_chunk_size = 24 # 1 day as a chunk
child_level = 13
with np.load("parent_ids.npz") as data:
parent_ids = data["parent_ids"]
parent_level = int(data["parent_level"])
HPC_PREFIX = "/scale/project/lops-oh-fair2adapt/"
HTTPS_PREFIX = "https://data-fair2adapt.ifremer.fr/"
CATALOG_PATH = "riomar-virtualizarr/YALL.json"
OUT_ZARR = "riomar-zarr_tina/Y2023.zarr"
# Start Dask client (auto-detects HPC vs local)
client = start_client(hpc_prefix=HPC_PREFIX)
# ------------------------------
# 1) HPC mode: open directly
# ------------------------------
if Path(HPC_PREFIX).exists():
zarr_hp_file_path = HPC_PREFIX + OUT_ZARR
KERCHUNK_CATALOG = HPC_PREFIX + CATALOG_PATH
print("Running in HPC mode:", KERCHUNK_CATALOG)
ds = xr.open_dataset(KERCHUNK_CATALOG, engine="kerchunk", chunks={})
# ------------------------------
# 2) HTTPS mode: fetch JSON, patch refs, open
# ------------------------------
else:
zarr_hp_file_path = OUT_ZARR
KERCHUNK_CATALOG = HTTPS_PREFIX + CATALOG_PATH
print("Running in HTTPS mode:", KERCHUNK_CATALOG)
with fsspec.open(KERCHUNK_CATALOG, "rt") as f:
kc = json.load(f)
kc = patch_kc_refs_inplace(kc)
ds = xr.open_dataset(kc, engine="kerchunk", chunks={})
ds
2. Select variables and ensure lon/lat are explicit coordinates¶
The original dataset contains many variables. For this demo we keep:
temp (temperature)
salt (salinity)
zeta (sea surface height)
We also load the 2D longitude/latitude fields and attach them as coordinates (nav_lon_rho, nav_lat_rho).
Loading them explicitly avoids repeated remote reads later (plots, masking, regridding, etc.).
ds = ds[["temp", "salt", "zeta"]].assign_coords(
nav_lon_rho=ds["nav_lon_rho"].load(),
nav_lat_rho=ds["nav_lat_rho"].load(),
)
ds
3. Build a polygon mask helper for the model grid¶
To extract a spatial subset, we load a boundary polygon (GeoJSON) and create a boolean mask on the dataset grid using apply_polygon_mask from the healpix_regrid package.
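`apply_polygon_mask` is provided by the `healpix_regrid` package; conceptually it keeps grid points whose (lon, lat) fall inside the polygon and sets the rest to NaN. A minimal NumPy sketch of the underlying point-in-polygon test (ray casting; the real helper operates on an xarray DataArray and likely relies on shapely):

```python
import numpy as np

def point_in_polygon(lon, lat, poly_xy):
    """Ray-casting point-in-polygon test for arrays of points.

    poly_xy: (N, 2) array of polygon vertices. Sketch only; the real
    apply_polygon_mask from healpix_regrid masks an xarray DataArray.
    """
    x, y = np.asarray(lon), np.asarray(lat)
    inside = np.zeros(x.shape, dtype=bool)
    n = len(poly_xy)
    for i in range(n):
        x1, y1 = poly_xy[i]
        x2, y2 = poly_xy[(i + 1) % n]
        # An edge contributes a crossing when the horizontal ray from
        # (x, y) toward +x intersects it.
        crosses = (y1 > y) != (y2 > y)
        with np.errstate(divide="ignore", invalid="ignore"):
            xint = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
        inside ^= crosses & (x < xint)
    return inside

# Unit square polygon: (0.5, 0.5) is inside, (2, 2) is outside
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
print(point_in_polygon([0.5, 2.0], [0.5, 2.0], square))
# → [ True False]
```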
4. Define the ROI and apply the mask (load ROI polygon, mask, and subset)¶
For the regridding demo we focus on a limited area defined by an outer boundary polygon (stored in outer_boundary.geojson).
We will:
Read the polygon from GeoJSON.
Trim the dataset with the polygon mask.
Find which values are ‘ground’ by computing non-null values of zeta at time_counter=0.
Stack the spatial coordinates and drop all the ground points.
import geopandas as gpd
# 1. Read the polygon from GeoJSON.
gdf = gpd.read_file("outer_boundary.geojson", driver="GeoJSON")
poly = gdf.geometry.iloc[0]
# 2. Trim the dataset with polygon mask : zeta_mask
# 3. Find out which values are 'ground' by computing
# not null values of zeta at time_counter=0 : zeta_mask
zeta_mask = apply_polygon_mask(ds.zeta.isel(time_counter=0).compute(), poly).notnull()
# zeta_mask
ds_roi = ds.copy()  # shallow copy so assigning the mask does not mutate ds
ds_roi["zeta_mask"] = zeta_mask
# 4. Stack the spatial coordinate and drop all the ground point (ds_in)
ds_roi_1d = ds_roi.stack(point=("y_rho", "x_rho"))
ds_roi_1d = ds_roi_1d.where(ds_roi_1d.zeta_mask, drop=True).drop_vars("zeta_mask")
ds_roi_1d
from functools import partial
# Build a regrid function with our ROI parameters bound
regrid = partial(
to_healpix,
parent_ids=parent_ids,
parent_level=parent_level,
child_level=child_level,
time_chunk_size=time_chunk_size,
)
5. Convert to HEALPix¶
from healpix_regrid import write_zarr_chunked
block = 24
# Limit the demo to 48 time steps (or fewer, if the dataset is shorter)
nt = min(ds_roi_1d.sizes["time_counter"], 48)
write_zarr_chunked(
ds_roi_1d.isel(time_counter=slice(0, nt)),
zarr_hp_file_path,
regrid_fn=regrid,
block_size=block,
)