Create RO-Crate from RiOMar dataset#

Context#

Purpose#

This notebook shows how to create an RO-Crate for a dataset using the rocrate Python library. It is a simple example that uses no specific RO-Crate profile and follows the RO-Crate 1.1 specification.

Standardized Metadata Packaging: RO-Crates provide a standardized way to bundle datasets with rich metadata, making it easier to understand, share, and reuse the data.
Enhanced FAIRness: By including machine-readable metadata, RO-Crates improve the Findability, Accessibility, Interoperability, and Reusability (FAIR) of the dataset.
Improved Discoverability: Metadata in an RO-Crate allows datasets to be easily indexed and discovered through search engines and data repositories.
Documentation and Provenance: RO-Crates document essential information about the dataset, such as its source, authorship, and creation process, ensuring transparency and traceability.
Facilitates Integration: The structured metadata makes it easier to integrate the dataset with other tools, workflows, or datasets, enhancing its usability.
Compliance with Standards: Many funding agencies and journals now require datasets to be published with detailed metadata. RO-Crates align with these expectations and promote best practices in data management.

Description#

In this notebook, we will learn how to create a simple RO-Crate from the RiOMar data. We will then identify any missing metadata that needs to be added to the original dataset’s metadata.

Contributions#

Notebook#

Anne Fouilloux (author), Simula Research Laboratory (Norway), @annefou
XX (reviewer)

Bibliography and other interesting resources#

rocrate Python package
Research Object documentation

Install and Import libraries#

pip install rocrate rocrateValidator

Requirement already satisfied: rocrate in /srv/conda/envs/notebook/lib/python3.12/site-packages (0.13.0)
Collecting rocrateValidator
  Downloading rocrateValidator-0.2.15-py3-none-any.whl.metadata (228 bytes)
Requirement already satisfied: requests in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (2.32.3)
Requirement already satisfied: arcp==0.2.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (0.2.1)
Requirement already satisfied: jinja2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (3.1.4)
Requirement already satisfied: python-dateutil in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (2.8.2)
Requirement already satisfied: click in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (8.1.7)
Requirement already satisfied: MarkupSafe>=2.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from jinja2->rocrate) (2.1.5)
Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from python-dateutil->rocrate) (1.16.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (1.26.19)
Requirement already satisfied: certifi>=2017.4.17 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (2024.7.4)
Downloading rocrateValidator-0.2.15-py3-none-any.whl (11 kB)
Installing collected packages: rocrateValidator
Successfully installed rocrateValidator-0.2.15
Note: you may need to restart the kernel to use updated packages.

import requests
import json
from rocrate.rocrate import ROCrate
from rocrate.model.person import Person
import pandas as pd
from datetime import datetime
import geopandas
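import shapely.geometry  # explicit submodule import for shapely.geometry.box
import shapely.wkt  # explicit submodule import for shapely.wkt.loads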
import xarray as xr
import numpy as np
import s3fs

Open RiOMar data to get metadata#

url_data = "https://data-fair2adapt.ifremer.fr/riomar/small.zarr"

ds = xr.open_zarr(url_data)
ds

<xarray.Dataset> Size: 498MB
Dimensions:       (y_rho: 838, x_rho: 727, s_rho: 40, time_counter: 5)
Coordinates:
    nav_lat_rho   (y_rho, x_rho) float64 5MB dask.array<chunksize=(838, 727), meta=np.ndarray>
    nav_lon_rho   (y_rho, x_rho) float64 5MB dask.array<chunksize=(838, 727), meta=np.ndarray>
  * s_rho         (s_rho) float32 160B -0.9875 -0.9625 ... -0.0375 -0.0125
  * time_counter  (time_counter) datetime64[ns] 40B 2004-01-01T00:58:30 ... 2...
    time_instant  (time_counter) datetime64[ns] 40B dask.array<chunksize=(1,), meta=np.ndarray>
Dimensions without coordinates: y_rho, x_rho
Data variables:
    ocean_mask    (y_rho, x_rho) bool 609kB dask.array<chunksize=(838, 727), meta=np.ndarray>
    temp          (time_counter, s_rho, y_rho, x_rho) float32 487MB dask.array<chunksize=(1, 40, 838, 727), meta=np.ndarray>
Attributes: (12/45)
    CPP-options:    REGIONAL GAMAR MPI TIDES OBC_WEST OBC_NORTH XIOS USE_CALE...
    Conventions:    CF-1.6
    Cs_r:           have a look at variable Cs_r in this file
    Cs_w:           have a look at variable Cs_w in this file
    SRCS:           main.F step.F read_inp.F timers_roms.F init_scalars.F ini...
    Tcline:         15.0
    ...             ...
    title:          GAMAR_GLORYS
    tnu4_expl:      biharmonic mixing coefficient for tracers
    units:          meter4 second-1
    uuid:           06f6b784-fcc0-4422-aceb-17da2a5aa9fa
    v_sponge:       0.0
    x_sponge:       0.0

Get metadata from RiOMAR#

Get the title#

title = ds.attrs["name"]

A better description should be available in the metadata; it could be constructed from the attributes if they were more complete#

description = "RiOMar dataset " + title 

Get bounding box in WKT#

Latitudes equal to -1 mark missing data, so they are replaced with NaN before computing the minimum latitude.

minlat = ds.nav_lat_rho.where(ds.nav_lat_rho > -1, np.nan).min().values
maxlat = ds.nav_lat_rho.max().values
minlon = ds.nav_lon_rho.min().values
maxlon = ds.nav_lon_rho.max().values
print(minlat, maxlat, minlon, maxlon)

43.285 50.867471190931404 -8.0 1.6800000000000015

geometry_wkt = shapely.geometry.box(minlon, minlat, maxlon, maxlat).wkt
geometry_wkt

'POLYGON ((1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285))'

Time range

ts = pd.to_datetime(str(ds.time_counter.min().values)) 
te = pd.to_datetime(str(ds.time_counter.max().values)) 
date_start = ts.strftime('%Y.%m.%d')
date_end = te.strftime('%Y.%m.%d')
date_start, date_end

('2004.01.01', '2004.01.01')

Creation date (we assume that the timeStamp attribute contains this information; to be confirmed)

dateCreated = ds.attrs["timeStamp"]
dateCreated

'2024-Apr-01 10:49:18 GMT'
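The `timeStamp` value above is not in ISO 8601 form, which is the format schema.org date properties such as `dateCreated` expect. A minimal sketch of a conversion follows, assuming the attribute always uses the `YYYY-Mon-DD HH:MM:SS GMT` layout shown above. The helper name `timestamp_to_iso` is hypothetical, and the month table avoids relying on the locale.

```python
from datetime import datetime, timezone

# English month abbreviations, mapped by hand so the result does not depend on the locale
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}


def timestamp_to_iso(stamp: str) -> str:
    """Convert 'YYYY-Mon-DD HH:MM:SS GMT' (assumed layout) to an ISO 8601 string."""
    date_part, time_part, zone = stamp.split()
    if zone != "GMT":
        raise ValueError(f"unexpected time zone: {zone}")
    year, month_abbr, day = date_part.split("-")
    hour, minute, second = (int(part) for part in time_part.split(":"))
    parsed = datetime(int(year), MONTHS[month_abbr], int(day),
                      hour, minute, second, tzinfo=timezone.utc)
    return parsed.isoformat()


print(timestamp_to_iso("2024-Apr-01 10:49:18 GMT"))
```

The returned string could be used in place of the raw attribute value when filling `dateCreated`.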

from datetime import date

today = date.today().strftime('%Y.%m.%d')
print("Today's date:", today)

sdDatePublished =  today # could be the date corresponding to the creation of the DOI (publishing)
dateModified =  today # could be the date of creation of the DGGS regridded data e.g. it needs to be added to Zarr when regridding

Today's date: 2025.01.19

Get the size of the dataset#

This information can usually be read from the metadata, but it is not available there yet and needs to be added.

contentSize = 0 # We need to get the total size in bytes
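Until the size is recorded in the metadata, one option is to measure a local copy of the store, since a Zarr directory store is a tree of files. The sketch below only uses the standard library; the helper name `directory_size_bytes` and the idea of having a local copy are assumptions, not part of the notebook.

```python
from pathlib import Path


def directory_size_bytes(root: str) -> int:
    """Return the total size in bytes of all regular files below root."""
    return sum(path.stat().st_size for path in Path(root).rglob("*") if path.is_file())


# Hypothetical usage, assuming the store was first copied to ./small.zarr:
# contentSize = directory_size_bytes("small.zarr")
```

This gives the on-disk (compressed) size, which can differ a lot from the in-memory size xarray reports.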

Get the persistent identifier#

The dataset should have a persistent identifier such as a DOI; it currently has none.

doi_data = "NONE"  # placeholder: no DOI has been assigned yet, which needs to be resolved

StudySubject and keywords#

studySubject_urls = [ "http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/environment"]
keywords = ["riomar", "croco"]

Version of the dataset#

version_data = "1.0"

Prepare information for the provenance#

prov = {
      "@id": "https://doi.org/10.5281/zenodo.13898339",
      "@type": "SoftwareApplication",
      "url": "https://www.croco-ocean.org",
      "name": "CROCO, Coastal and Regional Ocean COmmunity",
      "version": "CROCO GAMA model v2.0.1 https://doi.org/10.5281/zenodo.13898339"
}

Create a new RO-Crate#

crate = ROCrate()

Add the license for the RO-Crate#

The license of the Research Object (RO-Crate) may differ from the licenses of the data bundled in it.
Our RO-Crate is open and distributed under the CC BY 4.0 license.
The license must be referenced by its URL (here https://creativecommons.org/licenses/by/4.0/).

RO_license_id = "CC-BY-4.0"
RO_license_url = "https://creativecommons.org/licenses/by/4.0/"
RO_license_title = "Creative Commons Attribution 4.0"

Add the selected license to the RO-Crate#

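# Point the root dataset at the license entity
crate.update_jsonld({
    "@id": "./",
    "license": {"@id": RO_license_url},
})
# Describe the license itself as a contextual entity
license_entity = {
    "@id": RO_license_url,
    "@type": "CreativeWork",
    "name": RO_license_id,
    "description": RO_license_title,
}
crate.add_jsonld(license_entity)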

<https://creativecommons.org/licenses/by/4.0/ CreativeWork>

Add creators and their Organizations#

Add here the list of creators of the RO-Crate.
Go to https://ror.org and search for the organisation you would like to add. In this notebook, we create this information manually, but the process could be streamlined in the future (for instance using ROHub).
If there are several authors, add each of them to the RO-Crate following the same approach.

Add Persons and organisations#

list_authors = []

organisation_1 = {
    "name": "Simula Research Laboratory",
    "id": "https://ror.org/00vn06n10",
    "url" : "https://www.simula.no"
}
creator_1 = {
    "id": "https://orcid.org/0000-0002-1784-2920", # The id is the ORCID of the author
    "email": "annef@simula.no",
    "givenName": "Anne", 
    "familyName": "Fouilloux", 
    "affiliation": {"@id": organisation_1["id"]}
    
}
creator_1

{'id': 'https://orcid.org/0000-0002-1784-2920',
 'email': 'annef@simula.no',
 'givenName': 'Anne',
 'familyName': 'Fouilloux',
 'affiliation': {'@id': 'https://ror.org/00vn06n10'}}

organisation_2 = {
    "name": "Ifremer",
    "id": "https://ror.org/044jxhp58",
    "url" : "https://www.ifremer.fr"
}
creator_2 = {
    "id": "https://orcid.org/0000-0002-1500-0156", # The id is the ORCID of the author
    "email": "tina.odaka@ifremer.fr",
    "givenName": "Tina Erica", 
    "familyName": "Odaka", 
    "affiliation": {"@id": organisation_2["id"]}
    
}
creator_2

{'id': 'https://orcid.org/0000-0002-1500-0156',
 'email': 'tina.odaka@ifremer.fr',
 'givenName': 'Tina Erica',
 'familyName': 'Odaka',
 'affiliation': {'@id': 'https://ror.org/044jxhp58'}}

list_orcids = [ creator_1["id"], creator_2["id"]]
list_orcids

['https://orcid.org/0000-0002-1784-2920',
 'https://orcid.org/0000-0002-1500-0156']

Adding all the authors#

list_authors.append(creator_1['givenName'] + " " +  creator_1['familyName'])
list_authors.append(creator_2['givenName'] + " " +  creator_2['familyName'])
list_authors

['Anne Fouilloux', 'Tina Erica Odaka']

Add the two creators as Person entities to the RO-Crate

crate.add(Person(crate, creator_1.pop("id"), properties=creator_1))
crate.add(Person(crate, creator_2.pop("id"), properties=creator_2))

<https://orcid.org/0000-0002-1500-0156 Person>
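The `affiliation` properties above reference ROR identifiers, but the cells shown so far do not add matching Organization entities to the crate. A minimal sketch of how such entities could be built from dictionaries shaped like `organisation_1` and `organisation_2` follows; the helper name `organisation_entity` is hypothetical.

```python
def organisation_entity(org: dict) -> dict:
    """Turn a {'name', 'id', 'url'} dictionary into a JSON-LD Organization entity."""
    return {
        "@id": org["id"],
        "@type": "Organization",
        "name": org["name"],
        "url": org["url"],
    }


# Same values as organisation_1 and organisation_2 in the notebook
organisations = [
    {"name": "Simula Research Laboratory", "id": "https://ror.org/00vn06n10", "url": "https://www.simula.no"},
    {"name": "Ifremer", "id": "https://ror.org/044jxhp58", "url": "https://www.ifremer.fr"},
]
organisation_entities = [organisation_entity(org) for org in organisations]
print(organisation_entities)
```

Each resulting dictionary could then be passed to `crate.add_jsonld`, in the same way the notebook adds the license entity.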

Add the list of authors to the RO-Crate, referencing each Person entity by its @id

crate.update_jsonld({
    "@id": "./",
    "author": [{"@id": orcid} for orcid in list_orcids],
})

<./ Dataset>

Add information about data bundled in the RO-Crate#

Prepare Temporal coverage if available#

temporal_coverage = date_start + "/" + date_end
temporal_coverage

'2004.01.01/2004.01.01'
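schema.org describes `temporalCoverage` in terms of ISO 8601 values, which use hyphens rather than dots and a slash between the start and end of an interval. A minimal sketch of building such an interval directly from datetime objects follows (pandas timestamps such as `ts` and `te` are datetime subclasses). The helper name `iso_interval` is hypothetical.

```python
from datetime import datetime


def iso_interval(start: datetime, end: datetime) -> str:
    """Return an ISO 8601 date interval such as '2004-01-01/2004-01-05'."""
    return f"{start.date().isoformat()}/{end.date().isoformat()}"


print(iso_interval(datetime(2004, 1, 1, 0, 58, 30), datetime(2004, 1, 1, 4, 58, 30)))
```

This works from the datetime objects rather than from the dotted strings, so it does not interfere with the later cell that parses the dotted `sdDatePublished` value.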

Prepare Spatial coverage if available#

def get_geoshape(geometry):
    # We assume wkt geometry
    geo = shapely.wkt.loads(geometry)
    if hasattr(geo, 'geoms'):
        # take the first one
        geo = geo.geoms[0]
    geo = geo.wkt.replace("POLYGON", "").replace("(","").replace(")","").strip()   
    geolocation = { "@type": "GeoShape", "@id": geo, "polygon": geo}
    return geolocation


geolocation = get_geoshape(geometry_wkt)
geolocation

{'@type': 'GeoShape',
 '@id': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285',
 'polygon': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285'}
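An alternative way to express the same extent is a schema.org Place whose `geo` property points at a GeoShape with a `box` value. The sketch below is an assumption to check against the RO-Crate and schema.org documentation rather than the notebook's method: the identifiers `#riomar-area` and `#riomar-bbox` are hypothetical, and the latitude-first ordering follows a common convention for `box` strings.

```python
def spatial_coverage_entities(minlat: float, minlon: float, maxlat: float, maxlon: float):
    """Build a Place and a GeoShape entity describing a bounding box."""
    shape = {
        "@id": "#riomar-bbox",  # hypothetical local identifier
        "@type": "GeoShape",
        # Assumed ordering: lower corner then upper corner, latitude first
        "box": f"{minlat} {minlon} {maxlat} {maxlon}",
    }
    place = {
        "@id": "#riomar-area",  # hypothetical local identifier
        "@type": "Place",
        "geo": {"@id": shape["@id"]},
    }
    return place, shape


place, shape = spatial_coverage_entities(43.285, -8.0, 50.867471190931404, 1.6800000000000015)
print(place)
print(shape)
```

Both dictionaries are flat entities with their own `@id`, so they could be added with `crate.add_jsonld` and referenced from the dataset with a property such as `spatialCoverage`.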

Go through each dataset and add it to the RO-Crate#

In this example we add only one dataset.

properties =  {
    "modified_date": dateModified, 
    "name": url_data, 
    "location": geolocation,
    "temporalCoverage": temporal_coverage, 
    "sdDatePublished": sdDatePublished, 
    "dateCreated": dateCreated, 
    "dateModified": dateModified, # could be the date of creation of the DGGS regridded data
    # "contentSize": contentSize,  # TBC: enable once the size is known
    "encodingFormat": ' text/html; charset=us-ascii '
}

print("properties = ", properties)

resource = crate.add_file(url_data, fetch_remote = False, properties=properties)

properties =  {'modified_date': '2025.01.19', 'name': 'https://data-fair2adapt.ifremer.fr/riomar/small.zarr', 'location': {'@type': 'GeoShape', '@id': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285', 'polygon': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285'}, 'temporalCoverage': '2004.01.01/2004.01.01', 'sdDatePublished': '2025.01.19', 'dateCreated': '2024-Apr-01 10:49:18 GMT', 'dateModified': '2025.01.19', 'encodingFormat': ' text/html; charset=us-ascii '}

Add metadata to the RO-Crate#

Add the title and description#

crate.update_jsonld({
    "@id": "./",
    "description": description,
    "title": title,
    "name": title,
})

<./ Dataset>

Add the publisher and creator#

publisher_name = "Sigma2 AS"
publisher_url = "https://www.wikidata.org/wiki/Q12008197"
publisher = {
                "@id": publisher_url,
                "@type": "Organization",
                "name": publisher_name,
                "url": publisher_url
                }
crate.add_jsonld(publisher)
crate.update_jsonld(
{
    "@id": "./",
    "publisher": { "@id": publisher_url },
})

<./ Dataset>

Add the creator of the RO-Crate#

crate.update_jsonld(
{
    "@id": "ro-crate-metadata.json",
    "creator": { "@id": publisher_url },
})

<ro-crate-metadata.json CreativeWork>

Add Publication date#

date_published =  datetime.strptime(sdDatePublished, "%Y.%m.%d")

crate.update_jsonld({
    "@id": "./",
    "datePublished":  date_published.strftime("%Y-%m-%d") ,
})

<./ Dataset>

Add citation#

doi = "https://doi.org/" + doi_data
cite_as = " and ".join(list_authors) + ", " + title + ", " + publisher_name + ", " + date_published.strftime("%Y") + ". " +  doi_data + "."

crate.update_jsonld({
    "@id": "./",
    "identifier": doi_data,
    "url": doi_data,
    "cite-as":  cite_as ,
})

<./ Dataset>
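Because `doi_data` is currently the placeholder "NONE", the cell above stores that placeholder as the identifier and URL. A minimal sketch of a guard that only emits DOI-based properties when a DOI exists follows. The helper name `citation_properties` is hypothetical and the DOI in the usage line is an invented example, not a real identifier for this dataset.

```python
def citation_properties(doi, authors, title, publisher, year):
    """Build root-dataset citation properties, skipping DOI fields when no DOI exists."""
    cite_as = f"{' and '.join(authors)}, {title}, {publisher}, {year}."
    properties = {"@id": "./", "cite-as": cite_as}
    if doi and doi.upper() != "NONE":
        doi_url = "https://doi.org/" + doi
        properties["identifier"] = doi_url
        properties["url"] = doi_url
        properties["cite-as"] = f"{cite_as} {doi_url}."
    return properties


authors = ["Anne Fouilloux", "Tina Erica Odaka"]
print(citation_properties("NONE", authors, "GAMAR_GLORYS", "Sigma2 AS", 2025))
# Invented DOI, for illustration only
print(citation_properties("10.1234/example", authors, "GAMAR_GLORYS", "Sigma2 AS", 2025))
```

The returned dictionary has the same shape as the argument passed to `crate.update_jsonld` above.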

Add studySubject, keywords, etc.#

The studySubject comes from http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/. Go to that URL and select the study subject most relevant to your data.

study_subjects = []
for subject_url in studySubject_urls:
    study_subjects.append({
         "@id": subject_url
    })
study_subjects

[{'@id': 'http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/environment'}]

keywords = ", ".join(keywords)
keywords

'riomar, croco'

crate.update_jsonld({
    "@id": "./",
    "about": study_subjects,
    "keywords":  keywords,
})

<./ Dataset>

Add version#

crate.update_jsonld({
    "@id": "./",
    "version": version_data,
})

<./ Dataset>

Add Language#

#crate.update_jsonld({
#    "@id": ,
#    "@type": "Language",
#})

Write to disk#

crate.write("ro-crate")
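Writing the crate produces a `ro-crate-metadata.json` file inside the target directory. A quick way to check what was written is to list the `@id` and `@type` of each entity in its `@graph`. The sketch below uses only the standard library; the helper name `list_entities` is hypothetical.

```python
import json
from pathlib import Path


def list_entities(crate_dir: str):
    """Return (@id, @type) pairs for every entity in the crate's metadata file."""
    metadata_path = Path(crate_dir) / "ro-crate-metadata.json"
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    return [(entity.get("@id"), entity.get("@type")) for entity in metadata.get("@graph", [])]


# Usage, assuming the crate was written to the "ro-crate" directory as above:
# for entity_id, entity_type in list_entities("ro-crate"):
#     print(entity_id, entity_type)
```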

from rocrateValidator import validate

v = validate.validate("ro-crate")
v.validator()

This is an INVALID RO-Crate
{
    "File existence": [
        true
    ],
    "File size": [
        true
    ],
    "Metadata file existence": [
        true
    ],
    "Json check": [
        true
    ],
    "Json-ld check": [
        true
    ],
    "File descriptor check": [
        true
    ],
    "Direct property check": [
        true
    ],
    "Referencing check": [
        true
    ],
    "Encoding check": [
        true
    ],
    "Web-based data entity check": [
        false,
        "Semantic Error: Invalid ID at https://data-fair2adapt.ifremer.fr/riomar/small.zarr. It should be a downloadable url"
    ],
    "Person entity check": [
        true
    ],
    "Organization entity check": [
        true
    ],
    "Contact information check": [
        true
    ],
    "Citation property check": [
        true
    ],
    "Publisher property check": [
        true
    ],
    "Funder property check": [
        true
    ],
    "Licensing property check": [
        false,
        "Semantic Error: Invalid ID Value at https://creativecommons.org/licenses/by/4.0/. It must be an URL."
    ],
    "Places property check": [
        true
    ],
    "Time property check": [
        true
    ],
    "Scripts and workflow check": [
        false,
        "Semantic Error: Scripts and Workflow is Wrong"
    ]
}
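The report above lists three failing checks among many passing ones. A minimal sketch for extracting only the failures follows, assuming the report is available as a Python dictionary with the structure printed above (for example after parsing the printed JSON with `json.loads`). Whether the validator returns such a dictionary directly is not shown here, and the helper name `failed_checks` is hypothetical.

```python
def failed_checks(report: dict) -> dict:
    """Map each failing check name to its error message (empty string if none)."""
    failures = {}
    for name, result in report.items():
        passed = result[0]
        if not passed:
            failures[name] = result[1] if len(result) > 1 else ""
    return failures


# Excerpt of the report printed above
report = {
    "File existence": [True],
    "Licensing property check": [
        False,
        "Semantic Error: Invalid ID Value at https://creativecommons.org/licenses/by/4.0/. It must be an URL.",
    ],
    "Scripts and workflow check": [False, "Semantic Error: Scripts and Workflow is Wrong"],
}
for name, message in failed_checks(report).items():
    print(f"{name}: {message}")
```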