Create RO-Crate from RiOMar dataset#
Context#
Purpose#
This notebook shows how to create an RO-Crate for a dataset using the rocrate Python library. It is a simple example with no specific RO-Crate profile and follows the RO-Crate 1.1 specification. Packaging a dataset as an RO-Crate has several benefits:
Standardized Metadata Packaging: RO-Crates provide a standardized way to bundle datasets with rich metadata, making it easier to understand, share, and reuse the data.
Enhanced FAIRness: By including machine-readable metadata, RO-Crates improve the Findability, Accessibility, Interoperability, and Reusability (FAIR) of the dataset.
Improved Discoverability: Metadata in an RO-Crate allows datasets to be easily indexed and discovered through search engines and data repositories.
Documentation and Provenance: RO-Crates document essential information about the dataset, such as its source, authorship, and creation process, ensuring transparency and traceability.
Facilitates Integration: The structured metadata makes it easier to integrate the dataset with other tools, workflows, or datasets, enhancing its usability.
Compliance with Standards: Many funding agencies and journals now require datasets to be published with detailed metadata. RO-Crates align with these expectations and promote best practices in data management.
Description#
In this notebook, we will learn how to create a simple RO-Crate from the RiOMar data. We will then identify any missing metadata that needs to be added to the original dataset’s metadata.
Contributions#
Notebook#
Anne Fouilloux (author), Simula Research Laboratory (Norway), @annefou
XX (reviewer)
Bibliography and other interesting resources#
rocrate Python package
Install and Import libraries#
pip install rocrate rocrateValidator
Requirement already satisfied: rocrate in /srv/conda/envs/notebook/lib/python3.12/site-packages (0.13.0)
Collecting rocrateValidator
Downloading rocrateValidator-0.2.15-py3-none-any.whl.metadata (228 bytes)
Requirement already satisfied: requests in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (2.32.3)
Requirement already satisfied: arcp==0.2.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (0.2.1)
Requirement already satisfied: jinja2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (3.1.4)
Requirement already satisfied: python-dateutil in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (2.8.2)
Requirement already satisfied: click in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (8.1.7)
Requirement already satisfied: MarkupSafe>=2.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from jinja2->rocrate) (2.1.5)
Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from python-dateutil->rocrate) (1.16.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (1.26.19)
Requirement already satisfied: certifi>=2017.4.17 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (2024.7.4)
Downloading rocrateValidator-0.2.15-py3-none-any.whl (11 kB)
Installing collected packages: rocrateValidator
Successfully installed rocrateValidator-0.2.15
Note: you may need to restart the kernel to use updated packages.
import requests
import json
from rocrate.rocrate import ROCrate
from rocrate.model.person import Person
import pandas as pd
from datetime import datetime
import geopandas
import shapely
import xarray as xr
import numpy as np
import s3fs
Open RiOMar data to get metadata#
url_data = "https://data-fair2adapt.ifremer.fr/riomar/small.zarr"
ds = xr.open_zarr(url_data)
ds
<xarray.Dataset> Size: 498MB
Dimensions: (y_rho: 838, x_rho: 727, s_rho: 40, time_counter: 5)
Coordinates:
nav_lat_rho (y_rho, x_rho) float64 5MB dask.array<chunksize=(838, 727), meta=np.ndarray>
nav_lon_rho (y_rho, x_rho) float64 5MB dask.array<chunksize=(838, 727), meta=np.ndarray>
* s_rho (s_rho) float32 160B -0.9875 -0.9625 ... -0.0375 -0.0125
* time_counter (time_counter) datetime64[ns] 40B 2004-01-01T00:58:30 ... 2...
time_instant (time_counter) datetime64[ns] 40B dask.array<chunksize=(1,), meta=np.ndarray>
Dimensions without coordinates: y_rho, x_rho
Data variables:
ocean_mask (y_rho, x_rho) bool 609kB dask.array<chunksize=(838, 727), meta=np.ndarray>
temp (time_counter, s_rho, y_rho, x_rho) float32 487MB dask.array<chunksize=(1, 40, 838, 727), meta=np.ndarray>
Attributes: (12/45)
CPP-options: REGIONAL GAMAR MPI TIDES OBC_WEST OBC_NORTH XIOS USE_CALE...
Conventions: CF-1.6
Cs_r: have a look at variable Cs_r in this file
Cs_w: have a look at variable Cs_w in this file
SRCS: main.F step.F read_inp.F timers_roms.F init_scalars.F ini...
Tcline: 15.0
... ...
title: GAMAR_GLORYS
tnu4_expl: biharmonic mixing coefficient for tracers
units: meter4 second-1
uuid: 06f6b784-fcc0-4422-aceb-17da2a5aa9fa
v_sponge: 0.0
x_sponge: 0.0
Get metadata from RiOMar#
Get the title#
title = ds.attrs["name"]
A better description should be available in the metadata; it could be built from the existing attributes if they were better structured#
description = "RiOMar dataset " + title
Get bounding box in WKT#
Latitudes with a value of -1 are treated as missing and replaced by NaN before computing the minimum.
minlat = ds.nav_lat_rho.where(ds.nav_lat_rho > -1, np.nan).min().values
maxlat = ds.nav_lat_rho.max().values
minlon = ds.nav_lon_rho.min().values
maxlon = ds.nav_lon_rho.max().values
print(minlat, maxlat, minlon, maxlon)
43.285 50.867471190931404 -8.0 1.6800000000000015
geometry_wkt = shapely.geometry.box(minlon, minlat, maxlon, maxlat).wkt
geometry_wkt
'POLYGON ((1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285))'
Get the time range covered by the dataset.
ts = pd.to_datetime(str(ds.time_counter.min().values))
te = pd.to_datetime(str(ds.time_counter.max().values))
date_start = ts.strftime('%Y.%m.%d')
date_end = te.strftime('%Y.%m.%d')
date_start, date_end
('2004.01.01', '2004.01.01')
Creation date (we assume the timeStamp attribute contains this information; to be confirmed)
dateCreated = ds.attrs["timeStamp"]
dateCreated
'2024-Apr-01 10:49:18 GMT'
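The timeStamp string shown above is not in ISO 8601 form, which is what schema.org date properties such as dateCreated normally expect. Below is a minimal sketch of one way to convert it, assuming the attribute always follows the pattern seen in the output and that month abbreviations are in English. The function name `timestamp_to_iso` is hypothetical and not part of the original notebook.

```python
from datetime import datetime, timezone


def timestamp_to_iso(raw: str) -> str:
    """Convert a string such as '2024-Apr-01 10:49:18 GMT' to ISO 8601 (UTC)."""
    # %b parses abbreviated month names and depends on the active locale;
    # this sketch assumes an English (or C) locale.
    parsed = datetime.strptime(raw, "%Y-%b-%d %H:%M:%S GMT")
    # The literal "GMT" suffix is taken to mean UTC.
    return parsed.replace(tzinfo=timezone.utc).isoformat()


print(timestamp_to_iso("2024-Apr-01 10:49:18 GMT"))
# 2024-04-01T10:49:18+00:00
```

The converted value could be assigned to dateCreated in place of the raw attribute.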
from datetime import date
today = date.today().strftime('%Y.%m.%d')
print("Today's date:", today)
sdDatePublished = today # could be the date corresponding to the creation of the DOI (publishing)
dateModified = today # could be the date of creation of the DGGS regridded data e.g. it needs to be added to Zarr when regridding
Today's date: 2025.01.19
Get the size of the dataset#
This information can usually be obtained from the metadata (it still needs to be added for this dataset).
contentSize = 0 # We need to get the total size in bytes
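Until the size is recorded in the metadata, the uncompressed size can be estimated from the array shapes and data types listed in the dataset summary above. The sketch below repeats that arithmetic by hand. The shapes and element sizes are copied from the xarray output, and the helper name `uncompressed_size` is hypothetical. In a live session, `ds.nbytes` should report the same in-memory figure. The size of the compressed Zarr store on the server will differ.

```python
from math import prod

# (shape, bytes per element) for each array, copied from the xarray summary
arrays = {
    "nav_lat_rho": ((838, 727), 8),      # float64
    "nav_lon_rho": ((838, 727), 8),      # float64
    "s_rho": ((40,), 4),                 # float32
    "time_counter": ((5,), 8),           # datetime64[ns]
    "time_instant": ((5,), 8),           # datetime64[ns]
    "ocean_mask": ((838, 727), 1),       # bool
    "temp": ((5, 40, 838, 727), 4),      # float32
}


def uncompressed_size(arrays: dict) -> int:
    """Sum number of elements times element size over all arrays."""
    return sum(prod(shape) * itemsize for shape, itemsize in arrays.values())


content_size = uncompressed_size(arrays)
print(content_size)  # 497737882 bytes, i.e. about 498 MB
```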
Get the persistent identifier#
The dataset should have a persistent identifier, e.g. a DOI (it currently does not have one).
doi_data = "NONE" # it is a problem
StudySubject and keywords#
studySubject_urls = [ "http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/environment"]
keywords = ["riomar", "croco"]
Version of the dataset#
version_data = "1.0"
Prepare information for the provenance#
prov = {
"@id": "https://doi.org/10.5281/zenodo.13898339",
"@type": "SoftwareApplication",
"url": "https://www.croco-ocean.org",
"name": "CROCO, Coastal and Regional Ocean COmmunity",
"version": "CROCO GAMA model v2.0.1 https://doi.org/10.5281/zenodo.13898339"
}
Create a new RO-Crate#
crate = ROCrate()
Add the license for the RO-Crate#
The license of the Research Object (RO-Crate) may not be the same as the licenses of the data bundled in the RO-Crate.
Our RO-Crate is open and distributed under the CC-BY-4.0 license.
The license must be referenced by a URL (here https://creativecommons.org/licenses/by/4.0/).
RO_license_id = "CC-BY-4.0"
RO_license_url = "https://creativecommons.org/licenses/by/4.0/"
RO_license_title = "Creative Commons Attribution 4.0"
Add the selected license to the RO-Crate#
crate.update_jsonld(
{
"@id": "./",
"license": { "@id": RO_license_url},
})
license = {
"@id": RO_license_url,
"@type": "CreativeWork",
"name": RO_license_id,
"description": RO_license_title,
}
crate.add_jsonld(license)
<https://creativecommons.org/licenses/by/4.0/ CreativeWork>
Add creators and their Organizations#
Add the list of creators of the RO-Crate here.
To find the identifier of an organisation, go to https://ror.org and search for the organisation you would like to add.
In this notebook, we create this information “manually”, but it could be streamlined in the future (for instance using Rohub).
You may have several authors; add each of them to the RO-Crate following the same approach.
Add Persons and organisations#
list_authors = []
organisation_1 = {
"name": "Simula Research Laboratory",
"id": "https://ror.org/00vn06n10",
"url" : "https://www.simula.no"
}
creator_1 = {
"id": "https://orcid.org/0000-0002-1784-2920", # The id is the ORCID of the author
"email": "annef@simula.no",
"givenName": "Anne",
"familyName": "Fouilloux",
"affiliation": {"@id": organisation_1["id"]}
}
creator_1
{'id': 'https://orcid.org/0000-0002-1784-2920',
'email': 'annef@simula.no',
'givenName': 'Anne',
'familyName': 'Fouilloux',
'affiliation': {'@id': 'https://ror.org/00vn06n10'}}
organisation_2 = {
"name": "Ifremer",
"id": "https://ror.org/044jxhp58",
"url" : "https://www.ifremer.fr"
}
creator_2 = {
"id": "https://orcid.org/0000-0002-1500-0156", # The id is the ORCID of the author
"email": "tina.odaka@ifremer.fr",
"givenName": "Tina Erica",
"familyName": "Odaka",
"affiliation": {"@id": organisation_2["id"]}
}
creator_2
{'id': 'https://orcid.org/0000-0002-1500-0156',
'email': 'tina.odaka@ifremer.fr',
'givenName': 'Tina Erica',
'familyName': 'Odaka',
'affiliation': {'@id': 'https://ror.org/044jxhp58'}}
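# Author names, used later to build the citation string
list_authors = [c["givenName"] + " " + c["familyName"] for c in (creator_1, creator_2)]
list_orcids = [ creator_1["id"], creator_2["id"]]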
list_orcids
['https://orcid.org/0000-0002-1784-2920',
'https://orcid.org/0000-0002-1500-0156']
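The creator and organisation dictionaries above use an `id` key and are not yet attached to the crate. One possible way to prepare them is to convert each one to the JSON-LD form (`@id`, `@type`) that `crate.add_jsonld` accepted for the license earlier. The following is a sketch under that assumption. The helper names `person_entity` and `organization_entity` are hypothetical, and only one creator is repeated here to keep the example self-contained.

```python
def organization_entity(org: dict) -> dict:
    """Turn an organisation dictionary into a JSON-LD Organization entity."""
    return {
        "@id": org["id"],
        "@type": "Organization",
        "name": org["name"],
        "url": org["url"],
    }


def person_entity(creator: dict) -> dict:
    """Turn a creator dictionary into a JSON-LD Person entity."""
    return {
        "@id": creator["id"],
        "@type": "Person",
        "name": creator["givenName"] + " " + creator["familyName"],
        "givenName": creator["givenName"],
        "familyName": creator["familyName"],
        "email": creator["email"],
        "affiliation": creator["affiliation"],
    }


# Values copied from the notebook cells above
organisation_1 = {
    "name": "Simula Research Laboratory",
    "id": "https://ror.org/00vn06n10",
    "url": "https://www.simula.no",
}
creator_1 = {
    "id": "https://orcid.org/0000-0002-1784-2920",
    "email": "annef@simula.no",
    "givenName": "Anne",
    "familyName": "Fouilloux",
    "affiliation": {"@id": organisation_1["id"]},
}

entities = [organization_entity(organisation_1), person_entity(creator_1)]
for entity in entities:
    print(entity["@type"], entity["@id"])
```

Each resulting dictionary could then be passed to `crate.add_jsonld`, and the root dataset could reference the persons through a property such as `author` or `creator`.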
Add information about data bundled in the RO-Crate#
Prepare Temporal coverage if available#
temporal_coverage = date_start + "/" + date_end
temporal_coverage
'2004.01.01/2004.01.01'
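schema.org describes temporalCoverage as an ISO 8601 value, typically a start/end interval, whereas the string above uses dots as separators and drops the time of day. A minimal sketch of an ISO 8601 interval follows. The helper name `iso_interval` is hypothetical, and the end time is an illustrative assumption because the last value of time_counter is truncated in the dataset summary.

```python
from datetime import datetime


def iso_interval(start: datetime, end: datetime) -> str:
    """Format a start/end pair as an ISO 8601 time interval."""
    return start.isoformat() + "/" + end.isoformat()


start = datetime(2004, 1, 1, 0, 58, 30)  # first time step shown in the summary
end = datetime(2004, 1, 1, 4, 58, 30)    # hypothetical last time step
print(iso_interval(start, end))
# 2004-01-01T00:58:30/2004-01-01T04:58:30
```

The pandas timestamps `ts` and `te` also provide an `isoformat()` method, so the same helper should work on them directly.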
Prepare Spatial coverage if available#
def get_geoshape(geometry):
# We assume wkt geometry
geo = shapely.wkt.loads(geometry)
if hasattr(geo, 'geoms'):
# take the first one
geo = geo.geoms[0]
geo = geo.wkt.replace("POLYGON", "").replace("(","").replace(")","").strip()
geolocation = { "@type": "GeoShape", "@id": geo, "polygon": geo}
return geolocation
geolocation = get_geoshape(geometry_wkt)
geolocation
{'@type': 'GeoShape',
'@id': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285',
'polygon': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285'}
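The GeoShape above reuses the coordinate string as its `@id` and keeps the WKT ordering (longitude first, comma-separated points). As an alternative sketch for a simple bounding box, schema.org also defines a `box` property made of two space-separated points, and its examples list latitude before longitude. The helper name `geoshape_box` and the `#bounding-box` identifier are hypothetical, the coordinates are rounded for readability, and the ordering is worth checking against whichever tool will consume the crate.

```python
def geoshape_box(minlon: float, minlat: float, maxlon: float, maxlat: float,
                 entity_id: str = "#bounding-box") -> dict:
    """Build a GeoShape entity from the lower and upper corners of a box."""
    # Lower corner first, then upper corner, each written as "lat lon"
    box = f"{minlat} {minlon} {maxlat} {maxlon}"
    return {"@id": entity_id, "@type": "GeoShape", "box": box}


# Rounded values of the bounding box computed earlier in the notebook
shape = geoshape_box(-8.0, 43.285, 1.68, 50.8675)
print(shape)
```

A short local identifier keeps the `@id` stable even if the coordinates are later recomputed at a different precision.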
Go through each dataset and add it to the RO-Crate#
In this example we only add one dataset.
properties = {
"modified_date": dateModified,
"name": url_data,
"location": geolocation,
"temporalCoverage": temporal_coverage,
"sdDatePublished": sdDatePublished,
"dateCreated": dateCreated,
"dateModified": dateModified, # could be the date of creation of the DGGS regridded data
### "contentSize": contentSize, TBC
"encodingFormat": ' text/html; charset=us-ascii '
}
print("properties = ", properties)
resource = crate.add_file(url_data, fetch_remote = False, properties=properties)
properties = {'modified_date': '2025.01.19', 'name': 'https://data-fair2adapt.ifremer.fr/riomar/small.zarr', 'location': {'@type': 'GeoShape', '@id': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285', 'polygon': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285'}, 'temporalCoverage': '2004.01.01/2004.01.01', 'sdDatePublished': '2025.01.19', 'dateCreated': '2024-Apr-01 10:49:18 GMT', 'dateModified': '2025.01.19', 'encodingFormat': ' text/html; charset=us-ascii '}
Add metadata to the RO-Crate#
Add the title and description#
crate.update_jsonld({
"@id": "./",
"description": description,
"title": title,
"name": title,
})
<./ Dataset>
Add the publisher and creator#
publisher_name = "Sigma2 AS"
publisher_url = "https://www.wikidata.org/wiki/Q12008197"
publisher = {
"@id": publisher_url,
"@type": "Organization",
"name": publisher_name,
"url": publisher_url
}
crate.add_jsonld(publisher)
crate.update_jsonld(
{
"@id": "./",
"publisher": { "@id": publisher_url },
})
<./ Dataset>
Add the creator of the RO-Crate#
crate.update_jsonld(
{
"@id": "ro-crate-metadata.json",
"creator": { "@id": publisher_url },
})
<ro-crate-metadata.json CreativeWork>
Add Publication date#
date_published = datetime.strptime(sdDatePublished, "%Y.%m.%d")
crate.update_jsonld({
"@id": "./",
"datePublished": date_published.strftime("%Y-%m-%d") ,
})
<./ Dataset>
Add citation#
doi = "https://doi.org/" + doi_data
cite_as = " and ".join(list_authors) + ", " + title + ", " + publisher_name + ", " + date_published.strftime("%Y") + ". " + doi_data + "."
crate.update_jsonld({
"@id": "./",
"identifier": doi_data,
"url": doi_data,
"cite-as": cite_as ,
})
<./ Dataset>
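Because the dataset has no DOI yet, the cell above writes the placeholder "NONE" into `identifier` and `url`. One way to avoid this is to add those properties only when a DOI is actually available. The sketch below illustrates the idea with plain dictionaries. The helper name `citation_properties`, the example title, and the example DOI are hypothetical.

```python
def citation_properties(authors, title, publisher, year, doi=None):
    """Build root-dataset citation properties, skipping the DOI when absent."""
    parts = [" and ".join(authors), title, publisher, str(year)]
    # Drop empty parts so the citation never starts with a stray comma
    cite_as = ", ".join(part for part in parts if part) + "."
    props = {"@id": "./", "cite-as": cite_as}
    if doi:
        doi_url = "https://doi.org/" + doi
        props["identifier"] = doi_url
        props["url"] = doi_url
        props["cite-as"] = cite_as + " " + doi_url + "."
    return props


authors = ["Anne Fouilloux", "Tina Erica Odaka"]
without_doi = citation_properties(authors, "GAMAR_GLORYS", "Sigma2 AS", 2025)
with_doi = citation_properties(authors, "GAMAR_GLORYS", "Sigma2 AS", 2025,
                               doi="10.1234/example")  # hypothetical DOI
print(without_doi["cite-as"])
print(with_doi["identifier"])
```

The returned dictionary has the same shape as the argument passed to `crate.update_jsonld` in the cell above.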
Add studySubject, keywords, etc.#
The studySubject is taken from the INSPIRE topic categories at http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/.
Go to the URL and select the studySubject that is most relevant for your data.
study_subjects = []
for subject_url in studySubject_urls:
study_subjects.append({
"@id": subject_url
})
study_subjects
[{'@id': 'http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/environment'}]
keywords = ", ".join(keywords)
keywords
'riomar, croco'
crate.update_jsonld({
"@id": "./",
"about": study_subjects,
"keywords": keywords,
})
<./ Dataset>
Add version#
crate.update_jsonld({
"@id": "./",
"version": version_data,
})
<./ Dataset>
Add Language#
#crate.update_jsonld({
# "@id": ,
# "@type": "Language",
#})
Write to disk#
crate.write("ro-crate")
from rocrateValidator import validate as validate
v = validate.validate("ro-crate")
v.validator()
This is an INVALID RO-Crate
{
"File existence": [
true
],
"File size": [
true
],
"Metadata file existence": [
true
],
"Json check": [
true
],
"Json-ld check": [
true
],
"File descriptor check": [
true
],
"Direct property check": [
true
],
"Referencing check": [
true
],
"Encoding check": [
true
],
"Web-based data entity check": [
false,
"Semantic Error: Invalid ID at https://data-fair2adapt.ifremer.fr/riomar/small.zarr. It should be a downloadable url"
],
"Person entity check": [
true
],
"Organization entity check": [
true
],
"Contact information check": [
true
],
"Citation property check": [
true
],
"Publisher property check": [
true
],
"Funder property check": [
true
],
"Licensing property check": [
false,
"Semantic Error: Invalid ID Value at https://creativecommons.org/licenses/by/4.0/. It must be an URL."
],
"Places property check": [
true
],
"Time property check": [
true
],
"Scripts and workflow check": [
false,
"Semantic Error: Scripts and Workflow is Wrong"
]
}
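With twenty checks in the report, it can help to list only the ones that failed. The sketch below assumes the report is available as JSON text with the structure printed above, where each check maps to a list whose first item is a boolean and whose optional second item is a message. How rocrateValidator exposes the report programmatically is not shown in this notebook, so the sample is an excerpt pasted from the output, and the helper name `failed_checks` is hypothetical.

```python
import json

# Excerpt of the validation report printed above
report_text = """
{
  "File existence": [true],
  "Web-based data entity check": [false, "Semantic Error: Invalid ID at https://data-fair2adapt.ifremer.fr/riomar/small.zarr. It should be a downloadable url"],
  "Scripts and workflow check": [false, "Semantic Error: Scripts and Workflow is Wrong"]
}
"""


def failed_checks(report: dict) -> dict:
    """Return the failing checks mapped to their error message (if any)."""
    failures = {}
    for name, result in report.items():
        if result and result[0] is False:
            failures[name] = result[1] if len(result) > 1 else ""
    return failures


failures = failed_checks(json.loads(report_text))
for name, message in failures.items():
    print(name + ": " + message)
```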