EGI DataHub Zarr Toolkit - Examples - RiOMar Project – Coastal Water Quality Anticipation to manage coastal zone ecosystem responses for biodiversity conservation

This notebook demonstrates how to use the egi_datahub_zarr toolkit to read and write Zarr v3 stores directly from/to EGI DataHub.

Setup¶

1. Get an access token from https://datahub.egi.eu (Tokens → Create new access token).
2. Save the token in a file named “egi-datahub-token” in your home directory.
3. Import the toolkit.

# Import the toolkit
from egi_datahub_zarr import DataHubClient, open_zarr, to_zarr
import xarray as xr
import numpy as np
import os

Initialize Client¶

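# Read the token from ~/egi-datahub-token and export it as DATAHUB_TOKEN
token_path = os.path.join(os.environ['HOME'], "egi-datahub-token")
with open(token_path) as f:
    token = f.read().strip()
os.environ["DATAHUB_TOKEN"] = token

# Create the client, passing the token explicitly
client = DataHubClient(token)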

# Check connection
user = client.get_user_info()
print(f"Connected as: {user.get('name')}")

Connected as: Anne Fouilloux
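
The token handling above can be made slightly more defensive. The following is a minimal sketch, not part of the toolkit: `load_token` is a hypothetical helper that prefers the `DATAHUB_TOKEN` environment variable and falls back to the token file, failing with a clear message when neither is usable.

```python
import os
from pathlib import Path


def load_token(env_var="DATAHUB_TOKEN", token_file=None):
    """Return the token from the environment, else from a file.

    Hypothetical helper for illustration; the default file location
    (~/egi-datahub-token) follows the Setup step of this notebook.
    """
    token = os.environ.get(env_var, "").strip()
    if token:
        return token
    path = Path(token_file) if token_file else Path.home() / "egi-datahub-token"
    if not path.is_file():
        raise FileNotFoundError(f"No token in ${env_var} and no file at {path}")
    token = path.read_text().strip()
    if not token:
        raise ValueError(f"Token file {path} is empty")
    return token
```

With this helper, the client could be created as `client = DataHubClient(load_token())`.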

List Spaces and Browse Files¶

# List available spaces
spaces = client.list_spaces()
print("Available spaces:")
for name, info in spaces.items():
    print(f"  📁 {name}")

Available spaces:
  📁 Pangeo
  📁 notebooks-shared
  📁 PLAYGROUND
  📁 Reliance
  📁 open-datasets

# Browse a directory
items = client.list_directory("Reliance/FAIR2Adapt")
print("Contents of Reliance/FAIR2Adapt:")
for item in items:
    icon = "📁" if item['type'] == 'DIR' else "📄"
    print(f"  {icon} {item['name']}")

Contents of Reliance/FAIR2Adapt:
  📁 CS1
  📁 CS2
  📁 CS3
  📁 CS4
  📁 CS5
  📁 CS6
  📄 README
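
Listing one directory at a time gets tedious when looking for Zarr stores in a deep tree. The sketch below walks the tree recursively and yields every directory whose name ends in `.zarr`. It assumes only what the cell above shows: a callable that returns dicts with `'name'` and `'type'` keys, where directories have type `'DIR'`. The function name `find_zarr_stores` and the `max_depth` parameter are illustrative, not part of the toolkit.

```python
def find_zarr_stores(list_directory, root, max_depth=3):
    """Yield paths of directories ending in .zarr below root.

    list_directory is any callable taking a path and returning dicts with
    'name' and 'type' keys (for example client.list_directory).
    max_depth limits how many levels below root are listed.
    """
    if max_depth < 0:
        return
    for item in list_directory(root):
        if item["type"] != "DIR":
            continue
        path = f"{root}/{item['name']}"
        if item["name"].endswith(".zarr"):
            yield path  # a store: report it, do not descend into it
        else:
            yield from find_zarr_stores(list_directory, path, max_depth - 1)
```

A possible call would be `list(find_zarr_stores(client.list_directory, "Reliance/FAIR2Adapt"))`. Each level costs one request per directory, so a small `max_depth` keeps the walk cheap.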

Read Zarr Store¶

Method 1: Using the client¶

# Open Zarr store directly from DataHub
ds = client.open_zarr("Reliance/FAIR2Adapt/CS1/sample_climate_data.zarr")
print(ds)

<xarray.Dataset> Size: 378MB
Dimensions:        (time: 365, lat: 180, lon: 360)
Coordinates:
  * time           (time) datetime64[ns] 3kB 2020-01-01 ... 2020-12-30
  * lat            (lat) float64 1kB -90.0 -88.99 -87.99 ... 87.99 88.99 90.0
  * lon            (lon) float64 3kB -180.0 -179.0 -178.0 ... 178.0 179.0 180.0
Data variables:
    precipitation  (time, lat, lon) float64 189MB dask.array<chunksize=(30, 45, 90), meta=np.ndarray>
    temperature    (time, lat, lon) float64 189MB dask.array<chunksize=(30, 45, 90), meta=np.ndarray>
Attributes:
    title:        Sample Climate Data for FAIR2Adapt CS1
    institution:  Science Live
    source:       Synthetic data for demonstration
    Conventions:  CF-1.8

# Access data - only fetches the chunks you need!
temp_slice = ds.temperature[0, :10, :10].values
print(f"Temperature slice shape: {temp_slice.shape}")
print(f"Values:\n{temp_slice}")

Temperature slice shape: (10, 10)
Values:
[[ 19.96714153  13.61735699  21.47688538  30.23029856  12.65846625
   12.65863043  30.79212816  22.67434729  10.30525614  20.42560044]
 [ 20.19346514  30.32738913  13.91239852  19.01711722  21.90143992
   10.98779528  17.24092482  15.12592401  15.97676099   7.26990216]
 [ 18.07801769  -2.10168393   1.51814578  22.43264094  16.70865438
   13.16016664  15.18433933  18.47581705   9.6024032    7.21695275]
 [ 11.51347866  11.50742296  11.78364949  35.76747984  18.81935452
   19.30041647  25.30283454  17.38789159  12.40957854  13.03650151]
 [ 21.62881269  26.73473857  16.81021559   2.03168052  18.99687952
    8.48643106   9.71383318  20.86364019  27.38283071  15.21271577]
 [ 34.01190686  14.39339186   7.91593233  -0.13714393  -3.03139676
   -0.84135943  17.67126651  20.08725023  -0.81190702  23.95038314]
 [  3.88542043  17.46504778  19.98221748  26.40149039  30.80540687
    4.84905808   6.89142489   2.42422141  12.65980141  19.66358374]
 [ 16.75211419  44.85259003  18.67481665  11.86470314  24.21801502
   19.82687887  19.2009449   21.06850593  35.56543565   3.69111565]
 [ -7.38231233  -6.20700153   8.93134822  19.57686586 -12.47504843
   10.00269824   9.73752143  28.88337778  11.14978188  18.82988985]
 [ 15.12499322  43.68403056  -1.68599259  25.5872853   13.27198263
   22.71920155  19.41307152   7.66843803  17.28996392  -3.57901451]]

Method 2: Using convenience function¶

# One-liner to open Zarr
ds = open_zarr("Reliance/FAIR2Adapt/CS1/sample_climate_data.zarr")
print(ds)

<xarray.Dataset> Size: 378MB
Dimensions:        (time: 365, lat: 180, lon: 360)
Coordinates:
  * time           (time) datetime64[ns] 3kB 2020-01-01 ... 2020-12-30
  * lat            (lat) float64 1kB -90.0 -88.99 -87.99 ... 87.99 88.99 90.0
  * lon            (lon) float64 3kB -180.0 -179.0 -178.0 ... 178.0 179.0 180.0
Data variables:
    precipitation  (time, lat, lon) float64 189MB dask.array<chunksize=(30, 45, 90), meta=np.ndarray>
    temperature    (time, lat, lon) float64 189MB dask.array<chunksize=(30, 45, 90), meta=np.ndarray>
Attributes:
    title:        Sample Climate Data for FAIR2Adapt CS1
    institution:  Science Live
    source:       Synthetic data for demonstration
    Conventions:  CF-1.8

Method 3: Using store directly with xarray¶

# Get the low-level store for more control
store = client.get_zarr_store("Reliance/FAIR2Adapt/CS1/sample_climate_data.zarr")

# Use with xarray directly
ds = xr.open_zarr(store, consolidated=False, zarr_format=3)
print(ds)

<xarray.Dataset> Size: 378MB
Dimensions:        (time: 365, lat: 180, lon: 360)
Coordinates:
  * time           (time) datetime64[ns] 3kB 2020-01-01 ... 2020-12-30
  * lat            (lat) float64 1kB -90.0 -88.99 -87.99 ... 87.99 88.99 90.0
  * lon            (lon) float64 3kB -180.0 -179.0 -178.0 ... 178.0 179.0 180.0
Data variables:
    precipitation  (time, lat, lon) float64 189MB dask.array<chunksize=(30, 45, 90), meta=np.ndarray>
    temperature    (time, lat, lon) float64 189MB dask.array<chunksize=(30, 45, 90), meta=np.ndarray>
Attributes:
    title:        Sample Climate Data for FAIR2Adapt CS1
    institution:  Science Live
    source:       Synthetic data for demonstration
    Conventions:  CF-1.8

Write Zarr Store¶

Create a sample dataset¶

# Create a sample dataset
times = np.arange('2024-01-01', '2024-01-11', dtype='datetime64[D]')
lats = np.linspace(-90, 90, 36)
lons = np.linspace(-180, 180, 72)

# Generate random data
np.random.seed(42)
temperature = 15 + 10 * np.random.randn(len(times), len(lats), len(lons))
precipitation = np.abs(np.random.randn(len(times), len(lats), len(lons))) * 10

# Create xarray dataset
new_ds = xr.Dataset(
    {
        'temperature': (['time', 'lat', 'lon'], temperature),
        'precipitation': (['time', 'lat', 'lon'], precipitation),
    },
    coords={
        'time': times,
        'lat': lats,
        'lon': lons,
    },
    attrs={
        'title': 'Test dataset from Python toolkit',
        'created_by': 'egi_datahub_zarr toolkit',
        'Conventions': 'CF-1.8',
    }
)

print(new_ds)

<xarray.Dataset> Size: 416kB
Dimensions:        (time: 10, lat: 36, lon: 72)
Coordinates:
  * time           (time) datetime64[s] 80B 2024-01-01 2024-01-02 ... 2024-01-10
  * lat            (lat) float64 288B -90.0 -84.86 -79.71 ... 79.71 84.86 90.0
  * lon            (lon) float64 576B -180.0 -174.9 -169.9 ... 169.9 174.9 180.0
Data variables:
    temperature    (time, lat, lon) float64 207kB 19.97 13.62 ... 0.3737 14.29
    precipitation  (time, lat, lon) float64 207kB 3.215 13.34 ... 0.04883 19.88
Attributes:
    title:        Test dataset from Python toolkit
    created_by:   egi_datahub_zarr toolkit
    Conventions:  CF-1.8

Write to DataHub¶

# Write using client
client.to_zarr(new_ds, "Reliance/FAIR2Adapt/CS1/my_test_output.zarr")

✅ Written to: Reliance/FAIR2Adapt/CS1/my_test_output.zarr

# Or use the convenience function
to_zarr(new_ds, "Reliance/FAIR2Adapt/CS1/another_output.zarr")

✅ Written to: Reliance/FAIR2Adapt/CS1/another_output.zarr

Verify the write¶

# Read back to verify
verified_ds = client.open_zarr("Reliance/FAIR2Adapt/CS1/my_test_output.zarr")
print("Verified dataset:")
print(verified_ds)
print(f"\nTemperature values match: {np.allclose(verified_ds.temperature.values, new_ds.temperature.values)}")

Verified dataset:
<xarray.Dataset> Size: 416kB
Dimensions:        (time: 10, lat: 36, lon: 72)
Coordinates:
  * time           (time) datetime64[ns] 80B 2024-01-01 ... 2024-01-10
  * lat            (lat) float64 288B -90.0 -84.86 -79.71 ... 79.71 84.86 90.0
  * lon            (lon) float64 576B -180.0 -174.9 -169.9 ... 169.9 174.9 180.0
Data variables:
    precipitation  (time, lat, lon) float64 207kB dask.array<chunksize=(10, 36, 72), meta=np.ndarray>
    temperature    (time, lat, lon) float64 207kB dask.array<chunksize=(10, 36, 72), meta=np.ndarray>
Attributes:
    title:        Test dataset from Python toolkit
    created_by:   egi_datahub_zarr toolkit
    Conventions:  CF-1.8

Temperature values match: True

Advanced: Low-Level Store Access¶

import os

import zarr
from egi_datahub_zarr import OnedataZarrStore

# Token exported to the environment in the "Initialize Client" step
TOKEN = os.environ.get("DATAHUB_TOKEN")

# First resolve the path to get the provider and the file ID
provider, file_id = client.resolve_path("Reliance/FAIR2Adapt/CS1/sample_climate_data.zarr")
print(f"Provider: {provider}")
print(f"File ID: {file_id[:50]}...")

# Create the store directly (read-only)
store = OnedataZarrStore(
    root_file_id=file_id,
    token=TOKEN,
    provider=provider,
    read_only=True
)

# Open the store with zarr directly
group = zarr.open_group(store, mode='r', zarr_format=3)
print(f"\nZarr group contents: {list(group.keys())}")

Provider: cesnet-oneprovider-01.datahub.egi.eu
File ID: 000000000052446A6775696423666132326565333666326565...

Zarr group contents: ['lat', 'lon', 'precipitation', 'temperature', 'time']
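
At this level a store is a flat mapping from keys to bytes. In Zarr v3, each array keeps its metadata under `<array>/zarr.json`, and with the "default" chunk key encoding each chunk lives under `<array>/c/<i>/<j>/...`. The sketch below maps an element index to the key of the chunk that holds it. It assumes the arrays in this store use that default encoding with the `/` separator, which the notebook does not state; `chunk_key` is an illustrative helper, not a toolkit function.

```python
def chunk_key(array_path, index, chunks, separator="/"):
    """Return the store key of the chunk holding element `index`.

    Assumes the Zarr v3 "default" chunk key encoding: "c" followed by
    the chunk grid coordinates, joined by the separator.
    """
    coords = [str(i // c) for i, c in zip(index, chunks)]
    return f"{array_path}/" + separator.join(["c", *coords])


# Element temperature[100, 50, 200] with chunks of (30, 45, 90)
print(chunk_key("temperature", (100, 50, 200), (30, 45, 90)))
# prints: temperature/c/3/1/2
```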

Cleanup (Optional)¶

# Delete test files if needed
client.delete_zarr("Reliance/FAIR2Adapt/CS1/my_test_output.zarr")
client.delete_zarr("Reliance/FAIR2Adapt/CS1/another_output.zarr")

✅ Deleted: Reliance/FAIR2Adapt/CS1/my_test_output.zarr
✅ Deleted: Reliance/FAIR2Adapt/CS1/another_output.zarr
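
Because deletion on a shared space cannot be undone from the notebook, a small guard around the delete call may be worthwhile. The sketch below is one possible approach: `delete_if_safe` is a hypothetical wrapper that forwards to a delete function (for example `client.delete_zarr`) only when the path ends in `.zarr` and sits under a prefix you name explicitly.

```python
def delete_if_safe(delete_fn, path, allowed_prefix):
    """Call delete_fn(path) only for .zarr paths under allowed_prefix."""
    prefix = allowed_prefix.rstrip("/") + "/"
    if ".." in path.split("/"):
        raise ValueError(f"Refusing path containing '..': {path}")
    if not path.startswith(prefix) or not path.endswith(".zarr"):
        raise ValueError(f"Refusing to delete {path}")
    return delete_fn(path)
```

For the cleanup above, this could be called as `delete_if_safe(client.delete_zarr, "Reliance/FAIR2Adapt/CS1/my_test_output.zarr", "Reliance/FAIR2Adapt/CS1")`.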

Summary¶

The egi_datahub_zarr toolkit provides:

Feature	Method
List spaces	`client.list_spaces()`
Browse directories	`client.list_directory(path)`
Read Zarr	`client.open_zarr(path)` or `open_zarr(path)`
Write Zarr	`client.to_zarr(ds, path)` or `to_zarr(ds, path)`
Delete Zarr	`client.delete_zarr(path)`
Low-level store	`client.get_zarr_store(path)`

All paths use the format: SpaceName/folder/subfolder/file.zarr
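
As a small illustration of this path format, the sketch below splits a path into the space name and the path inside that space. `split_datahub_path` is a hypothetical helper written for this example; the toolkit resolves paths itself through `client.resolve_path`.

```python
def split_datahub_path(path):
    """Split 'SpaceName/folder/file.zarr' into (space, path inside the space)."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts:
        raise ValueError("Path must start with a space name")
    return parts[0], "/".join(parts[1:])


space, inner = split_datahub_path("Reliance/FAIR2Adapt/CS1/sample_climate_data.zarr")
print(space)  # prints: Reliance
print(inner)  # prints: FAIR2Adapt/CS1/sample_climate_data.zarr
```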