Uploading datasets#
Please read Downloading datasets first as it explains the general setup.
We connect to SciCat and a file server using a Client:

from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer

client = Client.from_token(
    url="https://scicat.ess.eu/api/v3",
    token=...,
    file_transfer=SFTPFileTransfer(
        host="login.esss.dk"
    ),
)
This code is identical to the code used for downloading. As in the downloading guide, we use a fake client instead of the real one shown above.
[1]:
from scitacean.testing.docs import setup_fake_client
client = setup_fake_client()
This is especially useful here as datasets cannot be deleted from SciCat by regular users, and we don’t want to pollute the database with our test data.
First, we need to generate some data to upload:
[2]:
from pathlib import Path

path = Path("data/witchcraft.dat")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as f:
    f.write("7.9 13 666")
Create a new dataset#
With the totally realistic data in hand, we can construct a dataset.
[3]:
from scitacean import Dataset

dset = Dataset(
    name="Spellpower of the Three Witches",
    description="The spellpower of the maiden, mother, and crone.",
    type="raw",
    owner_group="wyrdsisters",
    access_groups=["witches"],
    owner="Nanny Ogg",
    principal_investigator="Esme Weatherwax",
    contact_email="nogg@wyrd.lancre",
    creation_location="lancre/whichhut",
    data_format="space-separated",
    source_folder="/somewhere/on/remote",
)
There are many more fields that can be filled in as needed. See scitacean.Dataset.
Some fields require an explanation:

- `dataset_type` is either `raw` or `derived`. The main difference is that derived datasets point to one or more input datasets.
- `owner_group` and `access_groups` correspond to users/user groups on the file server and determine who can access the files.
Now we can attach our file:
[4]:
dset.add_local_files("data/witchcraft.dat", base_path="data")
Setting the `base_path` to `"data"` means that the file will be uploaded to `source_folder/witchcraft.dat`, where `source_folder` is determined by the file transfer. (See below.) If we did not set `base_path`, the file would end up in `source_folder/data/witchcraft.dat`.
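The effect of `base_path` can be sketched with plain `pathlib`. This is an illustrative model of the path arithmetic, not Scitacean's actual implementation; the helper name `remote_path` is invented for this example:

```python
from pathlib import PurePosixPath


def remote_path(local_file: str, source_folder: str, base_path: str = "") -> str:
    """Sketch: the part of the local path below base_path is appended to source_folder."""
    local = PurePosixPath(local_file)
    # Strip the base_path prefix if one was given; otherwise keep the full local path.
    rel = local.relative_to(base_path) if base_path else local
    return str(PurePosixPath(source_folder) / rel)


print(remote_path("data/witchcraft.dat", "/somewhere/on/remote", base_path="data"))
# /somewhere/on/remote/witchcraft.dat
print(remote_path("data/witchcraft.dat", "/somewhere/on/remote"))
# /somewhere/on/remote/data/witchcraft.dat
```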
Now, let’s inspect the dataset.
[5]:
dset
[5]:
| | Name | Type | Value | Description |
|---|---|---|---|---|
| * | creation_time | datetime | 2024-05-29 08:17:09+0000 | Time when dataset became fully available on disk, i.e. all containing files have been written. Format according to chapter 5.6 internet date/time format in RFC 3339. Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server. |
| * | source_folder | RemotePath | RemotePath('/somewhere/on/remote') | Absolute file path on file server containing the files of this dataset, e.g. /some/path/to/sourcefolder. In case of a single file dataset, e.g. HDF5 data, it contains the path up to, but excluding the filename. Trailing slashes are removed. |
| | description | str | The spellpower of the maiden, mother, and crone. | Free text explanation of contents of dataset. |
| | name | str | Spellpower of the Three Witches | A name for the dataset, given by the creator to carry some semantic meaning. Useful for display purposes e.g. instead of displaying the pid. Will be autofilled if missing using info from sourceFolder. |
| | pid | PID | None | Persistent Identifier for datasets derived from UUIDv4 and prepended automatically by site specific PID prefix like 20.500.12345/ |
| | proposal_id | str | None | The ID of the proposal to which the dataset belongs. |
| | sample_id | str | None | ID of the sample used when collecting the data. |
Advanced fields
| | Name | Type | Value | Description |
|---|---|---|---|---|
| * | contact_email | str | nogg@wyrd.lancre | Email of the contact person for this dataset. The string may contain a list of emails, which should then be separated by semicolons. |
| * | creation_location | str | lancre/whichhut | Unique location identifier where data was taken, usually in the form /Site-name/facility-name/instrumentOrBeamline-name. This field is required if the dataset is a Raw dataset. |
| * | owner | str | Nanny Ogg | Owner or custodian of the dataset, usually first name + last name. The string may contain a list of persons, which should then be separated by semicolons. |
| * | owner_group | str | wyrdsisters | Defines the group which owns the data, and therefore has unrestricted access to this data. Usually a pgroup like p12151 |
| * | principal_investigator | str | Esme Weatherwax | First name and last name of principal investigator(s). If multiple PIs are present, use a semicolon separated list. This field is required if the dataset is a Raw dataset. |
| | access_groups | list[str] | ['witches'] | Optional additional groups which have read access to the data. Users which are members in one of the groups listed here are allowed to access this data. The special group 'public' makes data available to all users. |
| | api_version | str | None | Version of the API used in creation of the dataset. |
| | classification | str | None | ACIA information about AUthenticity,COnfidentiality,INtegrity and AVailability requirements of dataset. E.g. AV(ailabilty)=medium could trigger the creation of a two tape copies. Format 'AV=medium,CO=low' |
| | comment | str | None | Comment the user has about a given dataset. |
| | created_at | datetime | None | Date and time when this record was created. This property is added and maintained by mongoose. |
| | created_by | str | None | Indicate the user who created this record. This property is added and maintained by the system. |
| | data_format | str | space-separated | Defines the format of the data files in this dataset, e.g Nexus Version x.y. |
| | data_quality_metrics | int | None | Data Quality Metrics given by the user to rate the dataset. |
| | end_time | datetime | None | End time of data acquisition for this dataset, format according to chapter 5.6 internet date/time format in RFC 3339. Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server. |
| | instrument_group | str | None | Optional additional groups which have read and write access to the data. Users which are members in one of the groups listed here are allowed to access this data. |
| | instrument_id | str | None | ID of the instrument where the data was created. |
| | is_published | bool | None | Flag is true when data are made publicly available. |
| | keywords | list[str] | None | Array of tags associated with the meaning or contents of this dataset. Values should ideally come from defined vocabularies, taxonomies, ontologies or knowledge graphs. |
| | license | str | None | Name of the license under which the data can be used. |
| | lifecycle | Lifecycle | None | Describes the current status of the dataset during its lifetime with respect to the storage handling systems. |
| | orcid_of_owner | str | None | ORCID of the owner or custodian. The string may contain a list of ORCIDs, which should then be separated by semicolons. |
| | owner_email | str | None | Email of the owner or custodian of the dataset. The string may contain a list of emails, which should then be separated by semicolons. |
| | relationships | list[Relationship] | None | Stores the relationships with other datasets. |
| | shared_with | list[str] | None | List of users that the dataset has been shared with. |
| | source_folder_host | str | None | DNS host name of file server hosting sourceFolder, optionally including a protocol e.g. [protocol://]fileserver1.example.com |
| | techniques | list[Technique] | None | Stores the metadata information for techniques. |
| | updated_at | datetime | None | Date and time when this record was updated last. This property is added and maintained by mongoose. |
| | updated_by | str | None | Indicate the user who updated this record last. This property is added and maintained by the system. |
| | validation_status | str | None | Defines a level of trust, e.g. a measure of how much data was verified or used by other persons. |
Files: 1 (10 B)
| Local | Remote | Size |
|---|---|---|
| data/witchcraft.dat | None | 10 B |
[6]:
len(list(dset.files))
[6]:
1
[7]:
dset.size # in bytes
[7]:
10
[8]:
file = list(dset.files)[0]
print(f"{file.remote_access_path(dset.source_folder) = }")
print(f"{file.local_path = }")
print(f"{file.size = } bytes")
file.remote_access_path(dset.source_folder) = None
file.local_path = PosixPath('data/witchcraft.dat')
file.size = 10 bytes
The file has a `local_path` but no `remote_access_path`, which means that it exists on the local file system (where we put it earlier) but not on the remote file server accessible by SciCat. The location can also be queried using `file.is_on_local` and `file.is_on_remote`.
Likewise, the dataset only exists in memory on our local machine and not on SciCat. Nothing has been uploaded yet. So we can freely modify the dataset or bail out by deleting the Python object if we need to.
Upload the dataset#
Once the dataset is ready, we can upload it using
[9]:
finalized = client.upload_new_dataset_now(dset)
WARNING:
This action cannot be undone by a regular user! Contact an admin if you uploaded a dataset accidentally.
`scitacean.Client.upload_new_dataset_now` uploads the dataset (i.e. the metadata) to SciCat and the files to the file server. It does so in such a way that it always creates a new dataset and new files, never overwriting any existing (meta)data.
It returns a new dataset that is a copy of the input with some updated information generated by SciCat and the file transfer. For example, it has been assigned a new ID:
[10]:
finalized.pid
[10]:
PID(prefix='PID.prefix.a0b1', pid='c0ce316f-2bd7-44dd-b31c-ebe98532bb6d')
And the remote access path of our file has been set:
[11]:
list(finalized.files)[0].remote_access_path(finalized.source_folder)
[11]:
RemotePath('/somewhere/on/remote/witchcraft.dat')
Location of uploaded files#
All files associated with a dataset are uploaded to the same folder. This folder may be the path we specified when creating the dataset, i.e. `dset.source_folder`. However, the folder is ultimately determined by the file transfer (in this case `SFTPFileTransfer`), which may choose to override the `source_folder` that we set. In this example, since we don't tell the file transfer otherwise, it respects `dset.source_folder` and uploads the files to that location. See the File transfer reference for information on how to control this behavior. The reason for this mechanism is that facilities may require a specific structure on their file server, and Scitacean's file transfers can be used to enforce it.
In any case, we can find out where files were uploaded by inspecting the finalized dataset that was returned by `client.upload_new_dataset_now`:
[12]:
finalized.source_folder
[12]:
RemotePath('/somewhere/on/remote')
Or by looking at each file individually as shown in the section above.
Attaching images to datasets#
It is possible to attach small images to datasets. In SciCat, this is done by creating 'attachment' objects which contain the image. Scitacean handles these via the `attachments` property of `Dataset`. For our locally created dataset, the property is an empty list, and we can add an attachment like this:
[13]:
from scitacean import Attachment, Thumbnail

dset.attachments.append(
    Attachment(
        caption="Scitacean logo",
        owner_group=dset.owner_group,
        thumbnail=Thumbnail.load_file("./logo.png"),
    )
)
dset.attachments[0]
[13]:
We used `Thumbnail.load_file` because it properly encodes the file for SciCat.
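To make the idea of "encoding for SciCat" concrete, here is a hand-rolled sketch. It assumes that thumbnails are transmitted as base64-encoded data URLs with a MIME type guessed from the file suffix; that assumption and the helper name `encode_image` are ours, not taken from this guide, so use `Thumbnail.load_file` in real code:

```python
import base64
from pathlib import Path


def encode_image(path: str) -> str:
    """Sketch: pack an image file into a base64 data URL (assumed wire format)."""
    suffix = Path(path).suffix.lstrip(".").lower()
    # Map common suffixes to MIME types; fall back to a generic binary type.
    mime = {"png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg"}.get(
        suffix, "application/octet-stream"
    )
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

Base64 makes the binary image safe to embed in the JSON document that SciCat stores for the attachment.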
When we then upload the dataset, the client automatically uploads all attachments as well. Note that this creates a new dataset in SciCat. If you want to add attachments to an existing dataset after upload, you need to use the lower-level API through `client.scicat.create_attachment_for_dataset` or the web interface directly.
[14]:
finalized = client.upload_new_dataset_now(dset)
To download the attachments again, we can pass `attachments=True` when downloading the dataset:
[15]:
downloaded = client.get_dataset(finalized.pid, attachments=True)
downloaded.attachments[0]
[15]: