Uploading datasets#

Please read Downloading datasets first as it explains the general setup.

We connect to SciCat and a file server using a Client:

from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer
client = Client.from_token(url="https://scicat.ess.eu/api/v3",
                           token=...,
                           file_transfer=SFTPFileTransfer(
                               host="login.esss.dk"
                           ))

This code is identical to the code used for downloading. As in the downloading guide, we use a fake client instead of the real one shown above.

[1]:
from scitacean.testing.docs import setup_fake_client
client = setup_fake_client()

This is especially useful here because regular users cannot delete datasets from SciCat, and we don’t want to pollute the database with our test data.

First, we need to generate some data to upload:

[2]:
from pathlib import Path

path = Path("data/witchcraft.dat")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as f:
    f.write("7.9 13 666")

Create a new dataset#

With the totally realistic data in hand, we can construct a dataset.

[3]:
from scitacean import Dataset

dset = Dataset(
    name="Spellpower of the Three Witches",
    description="The spellpower of the maiden, mother, and crone.",
    type="raw",

    owner_group="wyrdsisters",
    access_groups=["witches"],

    owner="Nanny Ogg",
    principal_investigator="Esme Weatherwax",
    contact_email="nogg@wyrd.lancre",

    creation_location="lancre/whichhut",
    data_format="space-separated",
    source_folder="/somewhere/on/remote",
)

There are many more fields that can be filled in as needed. See scitacean.Dataset.

Some fields require an explanation:

  • type is either raw or derived. The main difference is that derived datasets point to one or more input datasets (a sketch of a derived dataset follows this list).

  • owner_group and access_groups correspond to users/usergroups on the file server and determine who can access the files.
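For illustration, here is a hedged sketch of what a derived dataset might look like. The input_datasets and used_software fields are what set it apart from a raw dataset, and the PID below is only a placeholder for the pid of an existing dataset in SciCat; see scitacean.Dataset for the full field list.

from scitacean import PID

derived = Dataset(
    name="Average spellpower",
    type="derived",
    owner="Nanny Ogg",
    owner_group="wyrdsisters",
    contact_email="nogg@wyrd.lancre",
    # PIDs of the datasets this one was derived from (placeholder value):
    input_datasets=[PID.parse("20.500.12345/some-input-dataset")],
    used_software=["scitacean"],
    source_folder="/somewhere/on/remote/derived",
)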

Now we can attach our file:

[4]:
dset.add_local_files("data/witchcraft.dat", base_path="data")

Setting base_path to "data" means that the file will be uploaded to source_folder/witchcraft.dat, where source_folder will be determined by the file transfer. (See below.) If we did not set base_path, the file would end up in source_folder/data/witchcraft.dat.
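To see the difference without touching dset, here is a small sketch using a throwaway dataset; it assumes that File.remote_path holds the upload path relative to the source folder.

example = Dataset(type="raw", source_folder="/somewhere/on/remote")
# Without base_path, the local directory structure is kept, so this file
# would land at <source_folder>/data/witchcraft.dat after upload.
example.add_local_files("data/witchcraft.dat")
print(list(example.files)[0].remote_path)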

Now, let’s inspect the dataset.

[5]:
dset
[5]:
RawDataset
Name Type Value Description
* creation_time datetime 2024-05-29 08:17:09+0000 Time when dataset became fully available on disk, i.e. all containing files have been written. Format according to chapter 5.6 internet date/time format in RFC 3339. Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
* source_folder RemotePath RemotePath('/somewhere/on/remote') Absolute file path on file server containing the files of this dataset, e.g. /some/path/to/sourcefolder. In case of a single file dataset, e.g. HDF5 data, it contains the path up to, but excluding the filename. Trailing slashes are removed.
description str The spellpower of the maiden, mother, and crone. Free text explanation of contents of dataset.
name str Spellpower of the Three Witches A name for the dataset, given by the creator to carry some semantic meaning. Useful for display purposes e.g. instead of displaying the pid. Will be autofilled if missing using info from sourceFolder.
pid PID None Persistent Identifier for datasets derived from UUIDv4 and prepended automatically by site specific PID prefix like 20.500.12345/
proposal_id str None The ID of the proposal to which the dataset belongs.
sample_id str None ID of the sample used when collecting the data.
Advanced fields
* contact_email str nogg@wyrd.lancre Email of the contact person for this dataset. The string may contain a list of emails, which should then be separated by semicolons.
* creation_location str lancre/whichhut Unique location identifier where data was taken, usually in the form /Site-name/facility-name/instrumentOrBeamline-name. This field is required if the dataset is a Raw dataset.
* owner str Nanny Ogg Owner or custodian of the dataset, usually first name + last name. The string may contain a list of persons, which should then be separated by semicolons.
* owner_group str wyrdsisters Defines the group which owns the data, and therefore has unrestricted access to this data. Usually a pgroup like p12151
* principal_investigator str Esme Weatherwax First name and last name of principal investigator(s). If multiple PIs are present, use a semicolon separated list. This field is required if the dataset is a Raw dataset.
access_groups list[str] ['witches'] Optional additional groups which have read access to the data. Users which are members in one of the groups listed here are allowed to access this data. The special group 'public' makes data available to all users.
api_version str None Version of the API used in creation of the dataset.
classification str None ACIA information about AUthenticity,COnfidentiality,INtegrity and AVailability requirements of dataset. E.g. AV(ailabilty)=medium could trigger the creation of a two tape copies. Format 'AV=medium,CO=low'
comment str None Comment the user has about a given dataset.
created_at datetime None Date and time when this record was created. This property is added and maintained by mongoose.
created_by str None Indicate the user who created this record. This property is added and maintained by the system.
data_format str space-separated Defines the format of the data files in this dataset, e.g Nexus Version x.y.
data_quality_metrics int None Data Quality Metrics given by the user to rate the dataset.
end_time datetime None End time of data acquisition for this dataset, format according to chapter 5.6 internet date/time format in RFC 3339. Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
instrument_group str None Optional additional groups which have read and write access to the data. Users which are members in one of the groups listed here are allowed to access this data.
instrument_id str None ID of the instrument where the data was created.
is_published bool None Flag is true when data are made publicly available.
keywords list[str] None Array of tags associated with the meaning or contents of this dataset. Values should ideally come from defined vocabularies, taxonomies, ontologies or knowledge graphs.
license str None Name of the license under which the data can be used.
lifecycle Lifecycle None Describes the current status of the dataset during its lifetime with respect to the storage handling systems.
orcid_of_owner str None ORCID of the owner or custodian. The string may contain a list of ORCIDs, which should then be separated by semicolons.
owner_email str None Email of the owner or custodian of the dataset. The string may contain a list of emails, which should then be separated by semicolons.
relationships list[Relationship] None Stores the relationships with other datasets.
shared_with list[str] None List of users that the dataset has been shared with.
source_folder_host str None DNS host name of file server hosting sourceFolder, optionally including a protocol e.g. [protocol://]fileserver1.example.com
techniques list[Technique] None Stores the metadata information for techniques.
updated_at datetime None Date and time when this record was updated last. This property is added and maintained by mongoose.
updated_by str None Indicate the user who updated this record last. This property is added and maintained by the system.
validation_status str None Defines a level of trust, e.g. a measure of how much data was verified or used by other persons.
Files: 1 (10 B)
Local Remote Size
data/witchcraft.dat None 10 B
[6]:
len(list(dset.files))
[6]:
1
[7]:
dset.size  # in bytes
[7]:
10
[8]:
file = list(dset.files)[0]
print(f"{file.remote_access_path(dset.source_folder) = }")
print(f"{file.local_path = }")
print(f"{file.size = } bytes")
file.remote_access_path(dset.source_folder) = None
file.local_path = PosixPath('data/witchcraft.dat')
file.size = 10 bytes

The file has a local_path but no remote_access_path which means that it exists on the local file system (where we put it earlier) but not on the remote file server accessible by SciCat. The location can also be queried using file.is_on_local and file.is_on_remote.
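These checks can also be written out explicitly; a minimal sketch using the properties just mentioned:

file = list(dset.files)[0]
print(file.is_on_local)   # True  - we created the file on disk earlier
print(file.is_on_remote)  # False - nothing has been uploaded yet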

Likewise, the dataset only exists in memory on our local machine and not on SciCat. Nothing has been uploaded yet. So we can freely modify the dataset or bail out by deleting the Python object if we need to.
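For example, we can still fill in optional fields at this point; keywords and comment are regular scitacean.Dataset fields:

dset.keywords = ["magic", "lancre"]
dset.comment = "Measured during a full moon."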

Upload the dataset#

Once the dataset is ready, we can upload it using

[9]:
finalized = client.upload_new_dataset_now(dset)

WARNING:

This action cannot be undone by a regular user! Contact an admin if you uploaded a dataset accidentally.

scitacean.Client.upload_new_dataset_now uploads the dataset (i.e. the metadata) to SciCat and the files to the file server. It does so in such a way that it always creates a new dataset and new files; it never overwrites existing (meta)data.

It returns a new dataset that is a copy of the input with some updated information generated by SciCat and the file transfer. For example, it has been assigned a new ID:

[10]:
finalized.pid
[10]:
PID(prefix='PID.prefix.a0b1', pid='c0ce316f-2bd7-44dd-b31c-ebe98532bb6d')
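Note that the input dataset itself is not modified; only the returned copy carries the information assigned by SciCat:

print(dset.pid)        # None - the local dataset is untouched
print(finalized.pid)   # the PID assigned by SciCat during upload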

And the remote access path of our file has been set:

[11]:
list(finalized.files)[0].remote_access_path(finalized.source_folder)
[11]:
RemotePath('/somewhere/on/remote/witchcraft.dat')

Location of uploaded files#

All files associated with a dataset are uploaded to the same folder. This folder may be the path we specified when creating the dataset, i.e. dset.source_folder. However, the folder is ultimately determined by the file transfer (in this case SFTPFileTransfer), which may choose to override the source_folder that we set. The reason for this is that facilities may have a specific structure on their file server, and Scitacean’s file transfers can be used to enforce that. In this example, since we don't tell the file transfer otherwise, it respects dset.source_folder and uploads the files to that location. See the File transfer reference for information on how to control this behavior.
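If the facility requires a particular layout, the file transfer itself can be configured with an upload location. This is a hedged sketch: the source_folder argument and the "{pid.pid}" template are assumptions; check the SFTPFileTransfer reference for the placeholders supported by your version.

from scitacean.transfer.sftp import SFTPFileTransfer

transfer = SFTPFileTransfer(
    host="login.esss.dk",
    # Upload files into a per-dataset folder named after the PID:
    source_folder="/ess/data/upload/{pid.pid}",
)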

In any case, we can find out where files were uploaded by inspecting the finalized dataset that was returned by client.upload_new_dataset_now:

[12]:
finalized.source_folder
[12]:
RemotePath('/somewhere/on/remote')

Or by looking at each file individually as shown in the section above.

Attaching images to datasets#

It is possible to attach small images to datasets. In SciCat, this is done by creating ‘attachment’ objects which contain the image. Scitacean handles those via the attachments property of Dataset. For our locally created dataset, the property is an empty list and we can add an attachment like this:

[13]:
from scitacean import Attachment, Thumbnail

dset.attachments.append(
    Attachment(
        caption="Scitacean logo",
        owner_group=dset.owner_group,
        thumbnail=Thumbnail.load_file("./logo.png"),
    )
)
dset.attachments[0]
[13]:
Scitacean logo
Fields
Name Type Value
access_groups list[str] | None None
created_at datetime | None None
created_by str | None None
dataset_id PID | None None
id str | None None
instrument_group str | None None
is_published bool | None None
owner_group str wyrdsisters
proposal_id str | None None
sample_id str | None None
updated_at datetime | None None
updated_by str | None None

We used Thumbnail.load_file because it properly encodes the file for SciCat.

When we then upload the dataset, the client automatically uploads all attachments as well. Note that this creates a new dataset in SciCat. If you want to add attachments to an existing dataset after upload, you need to use the lower-level API through client.scicat.create_attachment_for_dataset or the web interface directly.

[14]:
finalized = client.upload_new_dataset_now(dset)
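A hedged sketch of the lower-level route mentioned above; the exact call (create_attachment_for_dataset taking an upload model and the dataset's PID) is an assumption, so check the scitacean.client.ScicatClient reference before relying on it.

extra = Attachment(
    caption="Extra plot",
    owner_group=finalized.owner_group,
    thumbnail=Thumbnail.load_file("./logo.png"),
)
# Assumed signature: an upload model plus the PID of the already-uploaded dataset.
client.scicat.create_attachment_for_dataset(
    extra.make_upload_model(),
    dataset_id=finalized.pid,
)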

In order to download the attachments again, we can pass attachments=True when downloading the dataset:

[15]:
downloaded = client.get_dataset(finalized.pid, attachments=True)
downloaded.attachments[0]
[15]:
Scitacean logo
Fields
Name Type Value
access_groups list[str] | None None
created_at datetime | None 2024-05-29 08:17:09+0000
created_by str | None fake
dataset_id PID | None PID.prefix.a0b1/796e8161-0bef-42d3-a9b6-5cdd44b04c90
id str | None 4468bec3-d852-48e8-b26c-20b734dcf921
instrument_group str | None None
is_published bool | None None
owner_group str wyrdsisters
proposal_id str | None None
sample_id str | None None
updated_at datetime | None 2024-05-29 08:17:09+0000
updated_by str | None fake