Uploading datasets#
Please read Downloading datasets first as it explains the general setup.
We connect to SciCat and a file server using a Client:

from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer

client = Client.from_token(
    url="https://scicat.ess.eu/api/v3",
    token=...,
    file_transfer=SFTPFileTransfer(
        host="login.esss.dk"
    ),
)
This code is identical to the code used for downloading. As in the downloading guide, we use a fake client instead of the real one shown above.
[1]:
from scitacean.testing.docs import setup_fake_client
client = setup_fake_client()
This is especially useful here as datasets cannot be deleted from SciCat by regular users, and we don’t want to pollute the database with our test data.
First, we need to generate some data to upload:
[2]:
from pathlib import Path

path = Path("data/witchcraft.dat")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as f:
    f.write("7.9 13 666")
Create a new dataset#
With the totally realistic data in hand, we can construct a dataset.
[3]:
from scitacean import Dataset

dset = Dataset(
    name="Spellpower of the Three Witches",
    description="The spellpower of the maiden, mother, and crone.",
    type="raw",
    owner_group="wyrdsisters",
    access_groups=["witches"],
    owner="Nanny Ogg",
    principal_investigator="Esme Weatherwax",
    contact_email="nogg@wyrd.lancre",
    creation_location="lancre/whichhut",
    data_format="space-separated",
    source_folder="/somewhere/on/remote",
)
There are many more fields that can be filled in as needed. See scitacean.Dataset.
Some fields require an explanation:

- `dataset_type` is either `raw` or `derived`. The main difference is that derived datasets point to one or more input datasets.
- `owner_group` and `access_groups` correspond to users/user groups on the file server and determine who can access the files.
Now we can attach our file:
[4]:
dset.add_local_files("data/witchcraft.dat", base_path="data")
Setting the `base_path` to `"data"` means that the file will be uploaded to `source_folder/witchcraft.dat`, where `source_folder` is determined by the file transfer. (See below.) If we did not set `base_path`, the file would end up in `source_folder/data/witchcraft.dat`.
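The effect of `base_path` can be sketched with plain `pathlib`. This is an illustrative model of the path arithmetic, not Scitacean's actual implementation; the helper name `remote_path` is invented for this example:

```python
from pathlib import PurePosixPath


def remote_path(local_file: str, source_folder: str, base_path: str = "") -> str:
    """Sketch: the part of the local path below base_path is appended to source_folder."""
    local = PurePosixPath(local_file)
    # Strip the base_path prefix if one was given; otherwise keep the full local path.
    rel = local.relative_to(base_path) if base_path else local
    return str(PurePosixPath(source_folder) / rel)


print(remote_path("data/witchcraft.dat", "/somewhere/on/remote", base_path="data"))
# /somewhere/on/remote/witchcraft.dat
print(remote_path("data/witchcraft.dat", "/somewhere/on/remote"))
# /somewhere/on/remote/data/witchcraft.dat
```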
Now, let’s inspect the dataset.
[5]:
dset
[5]:
| | Name | Type | Value | Description |
|---|---|---|---|---|
| * | creation_time | datetime | 2024-05-29 08:17:09+0000 | Time when dataset became fully available on disk, i.e. all containing files have been written. Format according to chapter 5.6 internet date/time format in RFC 3339. Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server. |
| * | source_folder | RemotePath | RemotePath('/somewhere/on/remote') | Absolute file path on file server containing the files of this dataset, e.g. /some/path/to/sourcefolder. In case of a single file dataset, e.g. HDF5 data, it contains the path up to, but excluding the filename. Trailing slashes are removed. |
| | description | str | The spellpower of the maiden, mother, and crone. | Free text explanation of contents of dataset. |
| | name | str | Spellpower of the Three Witches | A name for the dataset, given by the creator to carry some semantic meaning. Useful for display purposes e.g. instead of displaying the pid. Will be autofilled if missing using info from sourceFolder. |
| | pid | PID | None | Persistent Identifier for datasets derived from UUIDv4 and prepended automatically by site specific PID prefix like 20.500.12345/ |
| | proposal_id | str | None | The ID of the proposal to which the dataset belongs. |
| | sample_id | str | None | ID of the sample used when collecting the data. |
Advanced fields
| | Name | Type | Value | Description |
|---|---|---|---|---|
| * | contact_email | str | nogg@wyrd.lancre | Email of the contact person for this dataset. The string may contain a list of emails, which should then be separated by semicolons. |
| * | creation_location | str | lancre/whichhut | Unique location identifier where data was taken, usually in the form /Site-name/facility-name/instrumentOrBeamline-name. This field is required if the dataset is a Raw dataset. |
| * | owner | str | Nanny Ogg | Owner or custodian of the dataset, usually first name + last name. The string may contain a list of persons, which should then be separated by semicolons. |
| * | owner_group | str | wyrdsisters | Defines the group which owns the data, and therefore has unrestricted access to this data. Usually a pgroup like p12151 |
| * | principal_investigator | str | Esme Weatherwax | First name and last name of principal investigator(s). If multiple PIs are present, use a semicolon separated list. This field is required if the dataset is a Raw dataset. |
| | access_groups | list[str] | ['witches'] | Optional additional groups which have read access to the data. Users which are members in one of the groups listed here are allowed to access this data. The special group 'public' makes data available to all users. |
| | api_version | str | None | Version of the API used in creation of the dataset. |
| | classification | str | None | ACIA information about AUthenticity,COnfidentiality,INtegrity and AVailability requirements of dataset. E.g. AV(ailabilty)=medium could trigger the creation of a two tape copies. Format 'AV=medium,CO=low' |
| | comment | str | None | Comment the user has about a given dataset. |
| | created_at | datetime | None | Date and time when this record was created. This property is added and maintained by mongoose. |
| | created_by | str | None | Indicate the user who created this record. This property is added and maintained by the system. |
| | data_format | str | space-separated | Defines the format of the data files in this dataset, e.g Nexus Version x.y. |
| | data_quality_metrics | int | None | Data Quality Metrics given by the user to rate the dataset. |
| | end_time | datetime | None | End time of data acquisition for this dataset, format according to chapter 5.6 internet date/time format in RFC 3339. Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server. |
| | instrument_group | str | None | Optional additional groups which have read and write access to the data. Users which are members in one of the groups listed here are allowed to access this data. |
| | instrument_id | str | None | ID of the instrument where the data was created. |
| | is_published | bool | None | Flag is true when data are made publicly available. |
| | keywords | list[str] | None | Array of tags associated with the meaning or contents of this dataset. Values should ideally come from defined vocabularies, taxonomies, ontologies or knowledge graphs. |
| | license | str | None | Name of the license under which the data can be used. |
| | lifecycle | Lifecycle | None | Describes the current status of the dataset during its lifetime with respect to the storage handling systems. |
| | orcid_of_owner | str | None | ORCID of the owner or custodian. The string may contain a list of ORCIDs, which should then be separated by semicolons. |
| | owner_email | str | None | Email of the owner or custodian of the dataset. The string may contain a list of emails, which should then be separated by semicolons. |
| | relationships | list[Relationship] | None | Stores the relationships with other datasets. |
| | shared_with | list[str] | None | List of users that the dataset has been shared with. |
| | source_folder_host | str | None | DNS host name of file server hosting sourceFolder, optionally including a protocol e.g. [protocol://]fileserver1.example.com |
| | techniques | list[Technique] | None | Stores the metadata information for techniques. |
| | updated_at | datetime | None | Date and time when this record was updated last. This property is added and maintained by mongoose. |
| | updated_by | str | None | Indicate the user who updated this record last. This property is added and maintained by the system. |
| | validation_status | str | None | Defines a level of trust, e.g. a measure of how much data was verified or used by other persons. |
Files: 1 (10 B)
| Local | Remote | Size |
|---|---|---|
| data/witchcraft.dat | None | 10 B |
[6]:
len(list(dset.files))
[6]:
1
[7]:
dset.size # in bytes
[7]:
10
[8]:
file = list(dset.files)[0]
print(f"{file.remote_access_path(dset.source_folder) = }")
print(f"{file.local_path = }")
print(f"{file.size = } bytes")
file.remote_access_path(dset.source_folder) = None
file.local_path = PosixPath('data/witchcraft.dat')
file.size = 10 bytes
The file has a `local_path` but no `remote_access_path`, which means that it exists on the local file system (where we put it earlier) but not on the remote file server accessible by SciCat. The location can also be queried using `file.is_on_local` and `file.is_on_remote`.
Likewise, the dataset only exists in memory on our local machine and not on SciCat. Nothing has been uploaded yet. So we can freely modify the dataset or bail out by deleting the Python object if we need to.
Upload the dataset#
Once the dataset is ready, we can upload it using
[9]:
finalized = client.upload_new_dataset_now(dset)
WARNING:
This action cannot be undone by a regular user! Contact an admin if you uploaded a dataset accidentally.
`scitacean.Client.upload_new_dataset_now` uploads the dataset (i.e. the metadata) to SciCat and the files to the file server. It does so in such a way that it always creates a new dataset and new files, never overwriting any existing (meta)data.
It returns a new dataset that is a copy of the input with some updated information generated by SciCat and the file transfer. For example, it has been assigned a new ID:
[10]:
finalized.pid
[10]:
PID(prefix='PID.prefix.a0b1', pid='c0ce316f-2bd7-44dd-b31c-ebe98532bb6d')
And the remote access path of our file has been set:
[11]:
list(finalized.files)[0].remote_access_path(finalized.source_folder)
[11]:
RemotePath('/somewhere/on/remote/witchcraft.dat')
Location of uploaded files#
All files associated with a dataset are uploaded to the same folder. This folder may be the path we specified when creating the dataset, i.e. `dset.source_folder`. However, the folder is ultimately determined by the file transfer (in this case `SFTPFileTransfer`), which may choose to override the `source_folder` that we set. In this example, since we don't tell the file transfer otherwise, it respects `dset.source_folder` and uploads the files to that location. See the File transfer reference for information on how to control this behavior. The reason for this mechanism is that facilities may require a specific structure on their file server, and Scitacean's file transfers can be used to enforce it.
In any case, we can find out where files were uploaded by inspecting the finalized dataset that was returned by `client.upload_new_dataset_now`:
[12]:
finalized.source_folder
[12]:
RemotePath('/somewhere/on/remote')
Or by looking at each file individually as shown in the section above.
Attaching images to datasets#
It is possible to attach small images to datasets. In SciCat, this is done by creating 'attachment' objects which contain the image. Scitacean handles these via the `attachments` property of `Dataset`. For our locally created dataset, the property is an empty list, and we can add an attachment like this:
[13]:
from scitacean import Attachment, Thumbnail

dset.attachments.append(
    Attachment(
        caption="Scitacean logo",
        owner_group=dset.owner_group,
        thumbnail=Thumbnail.load_file("./logo.png"),
    )
)
dset.attachments[0]
[13]:
We used `Thumbnail.load_file` because it properly encodes the file for SciCat.
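To make the idea of "encoding for SciCat" concrete, here is a hand-rolled sketch. It assumes that thumbnails are transmitted as base64-encoded data URLs with a MIME type guessed from the file suffix; that assumption and the helper name `encode_image` are ours, not taken from this guide, so use `Thumbnail.load_file` in real code:

```python
import base64
from pathlib import Path


def encode_image(path: str) -> str:
    """Sketch: pack an image file into a base64 data URL (assumed wire format)."""
    suffix = Path(path).suffix.lstrip(".").lower()
    # Map common suffixes to MIME types; fall back to a generic binary type.
    mime = {"png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg"}.get(
        suffix, "application/octet-stream"
    )
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

Base64 makes the binary image safe to embed in the JSON document that SciCat stores for the attachment.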
When we then upload the dataset, the client automatically uploads all attachments as well. Note that this creates a new dataset in SciCat. If you want to add attachments to an existing dataset after upload, you need to use the lower-level API through `client.scicat.create_attachment_for_dataset` or the web interface directly.
[14]:
finalized = client.upload_new_dataset_now(dset)
To download the attachments again, we can pass `attachments=True` when downloading the dataset:
[15]:
downloaded = client.get_dataset(finalized.pid, attachments=True)
downloaded.attachments[0]
[15]: