Ingest Dataset

Ingesting datasets is one of the primary use cases for PySciCat. This page walks, step by step, through the example script provided in examples/ingest_dataset.py.

Create a ScicatClient

To begin with:

from datetime import datetime
from pathlib import Path

from pyscicat.client import encode_thumbnail, ScicatClient
from pyscicat.model import (
    Attachment,
    Datablock,
    DataFile,
    Dataset,
    Ownable
)

# Create a client object. The account used should have the ingestor role in SciCat
scicat = ScicatClient(base_url="http://localhost:3000/api/v3",
                        username="Zaphod",
                        password="heartofgold")

Here we simply import the classes and functions we need. Then we set up a ScicatClient instance with the username and password that you were given by your SciCat administrator.
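The example hard-codes the credentials for brevity. As a minimal sketch, assuming you would rather keep them out of the script, the client arguments could be read from environment variables. The helper load_scicat_settings and the variable names SCICAT_BASE_URL, SCICAT_USERNAME and SCICAT_PASSWORD are hypothetical and not part of PySciCat.

```python
import os


def load_scicat_settings(env=None):
    """Collect keyword arguments for ScicatClient from environment variables.

    The variable names used here are hypothetical; use whatever naming
    your deployment prefers.
    """
    env = os.environ if env is None else env
    required = ("SCICAT_USERNAME", "SCICAT_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        # Fall back to the local URL used in the example above.
        "base_url": env.get("SCICAT_BASE_URL", "http://localhost:3000/api/v3"),
        "username": env["SCICAT_USERNAME"],
        "password": env["SCICAT_PASSWORD"],
    }


# Usage (requires pyscicat and a reachable SciCat instance):
# scicat = ScicatClient(**load_scicat_settings())
```

The function only builds a dictionary, so it can be checked without a running SciCat server.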

Set up an Ownable

# Create an Ownable that will get reused for several other Model objects
ownable = Ownable(ownerGroup="magrathea", accessGroups=["deep_though"])
thumb_path = Path(__file__).parent.parent / "test/data/SciCatLogo.png"

Now we set up an Ownable instance. This is a model class that several other model classes inherit from. We do not have to create it explicitly (we could have simply added ownerGroup and accessGroups to each object that takes them), but reusing one instance keeps the code DRY (Don’t Repeat Yourself).
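The reuse pattern relies on plain Python keyword unpacking. The following sketch illustrates it with a standard-library dataclass instead of the real model. OwnableSketch and build_record are hypothetical stand-ins, and asdict() plays the role that ownable.dict() plays in the example.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class OwnableSketch:
    """Hypothetical stand-in for pyscicat's Ownable; not the real model."""

    ownerGroup: str
    accessGroups: list = field(default_factory=list)


def build_record(name, **fields):
    """Hypothetical helper: combine a name with arbitrary keyword fields."""
    return {"name": name, **fields}


ownable = OwnableSketch(ownerGroup="magrathea", accessGroups=["deep_though"])

# Convert the shared ownership fields to a dict once ...
shared = asdict(ownable)

# ... and unpack them into every object that needs them.
dataset_record = build_record("dataset", **shared)
datablock_record = build_record("datablock", **shared)

print(dataset_record)
print(datablock_record)
```

Both records end up with the same ownerGroup and accessGroups without repeating the values at each call site.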

Upload a Dataset

# Create a Dataset object of type "raw" with settings of your choosing. Notice how
# we pass the fields of the `ownable` instance.
dataset = Dataset(
    path="/foo/bar",
    size=42,
    owner="slartibartfast",
    contactEmail="slartibartfast@magrathea.org",
    creationLocation="magrathea",
    creationTime=str(datetime.now()),
    type="raw",
    instrumentId="earth",
    proposalId="deepthought",
    dataFormat="planet",
    principalInvestigator="A. Mouse",
    sourceFolder="/foo/bar",
    scientificMetadata={"a": "field"},
    sampleId="gargleblaster",
    **ownable.dict())
dataset_id = scicat.upload_raw_dataset(dataset)

Now we can create a Dataset instance and upload it! Notice how we passed the fields of the ownable instance there at the end.

Note that we store the dataset ID returned by the upload in the variable dataset_id for later use.

Also note the sourceFolder. This is a folder on the file system that SciCat has access to, and will contain the files for this Dataset.

Upload a Datablock

# Create Datablock with DataFiles
data_file = DataFile(path="file.h5", size=42)
data_block = Datablock(size=42,
                       version=1,
                       datasetId=dataset_id,
                       dataFileList=[data_file],
                       **ownable.dict())
scicat.upload_datablock(data_block)

The Datablock is a container for DataFile instances. We are not uploading the files themselves; rather, we are creating references to them that are used (and displayed) in SciCat.

In this example, there is only one DataFile instance. It has a path (“file.h5”). In the real world this would be a file that is in the folder identified in the sourceFolder of the Dataset.
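For a real source folder you would usually not type file names and sizes by hand. The sketch below, which uses only the standard library, walks a folder and collects a relative path and size for every file, plus the total size. The helper collect_file_entries is hypothetical and not part of PySciCat. It assumes that DataFile paths are meant to be relative to sourceFolder, as the example suggests.

```python
from pathlib import Path


def collect_file_entries(source_folder):
    """Return (entries, total_size) for all regular files under source_folder.

    Each entry is a dict holding a path relative to source_folder and the
    file size in bytes.
    """
    root = Path(source_folder)
    entries = []
    for file_path in sorted(root.rglob("*")):
        if file_path.is_file():
            entries.append(
                {
                    "path": file_path.relative_to(root).as_posix(),
                    "size": file_path.stat().st_size,
                }
            )
    total_size = sum(entry["size"] for entry in entries)
    return entries, total_size
```

Each entry could then be turned into a DataFile(path=..., size=...), with total_size used as the size of the Datablock.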

Upload an Attachment


# Create Attachment
attachment = Attachment(
    datasetId=dataset_id,
    thumbnail=encode_thumbnail(thumb_path),
    caption="scattering image",
    **ownable.dict()
)
scicat.upload_attachment(attachment)

Now we upload an Attachment. This is often used in SciCat to display thumbnails for a Dataset. Unlike the Datablock, which only references files, here we upload the actual content of a file, which is stored in SciCat’s database.
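A common way to carry image content inside a text field is a base64 data URI. The sketch below shows that general technique with the standard library. Whether encode_thumbnail produces exactly this format is an assumption, and to_data_uri is a hypothetical helper, not part of PySciCat.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(image_path):
    """Read a file and return its content as a base64 data URI string."""
    path = Path(image_path)
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None:
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Because the whole file is embedded in the string, this approach is best suited to small images such as thumbnails.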

So, to put it all together:

from datetime import datetime
from pathlib import Path

from pyscicat.client import encode_thumbnail, ScicatClient
from pyscicat.model import (
    Attachment,
    Datablock,
    DataFile,
    Dataset,
    Ownable
)

# Create a client object. The account used should have the ingestor role in SciCat
scicat = ScicatClient(base_url="http://localhost:3000/api/v3",
                        username="Zaphod",
                        password="heartofgold")

# Create an Ownable that will get reused for several other Model objects
ownable = Ownable(ownerGroup="magrathea", accessGroups=["deep_though"])
thumb_path = Path(__file__).parent.parent / "test/data/SciCatLogo.png"


# Create a Dataset object of type "raw" with settings of your choosing. Notice how
# we pass the fields of the `ownable` instance.
dataset = Dataset(
    path="/foo/bar",
    size=42,
    owner="slartibartfast",
    contactEmail="slartibartfast@magrathea.org",
    creationLocation="magrathea",
    creationTime=str(datetime.now()),
    type="raw",
    instrumentId="earth",
    proposalId="deepthought",
    dataFormat="planet",
    principalInvestigator="A. Mouse",
    sourceFolder="/foo/bar",
    scientificMetadata={"a": "field"},
    sampleId="gargleblaster",
    **ownable.dict())
dataset_id = scicat.upload_raw_dataset(dataset)

# Create Datablock with DataFiles
data_file = DataFile(path="file.h5", size=42)
data_block = Datablock(size=42,
                       version=1,
                       datasetId=dataset_id,
                       dataFileList=[data_file],
                       **ownable.dict())
scicat.upload_datablock(data_block)

# Create Attachment
attachment = Attachment(
    datasetId=dataset_id,
    thumbnail=encode_thumbnail(thumb_path),
    caption="scattering image",
    **ownable.dict()
)
scicat.upload_attachment(attachment)