Downloading datasets

All communication with SciCat is handled by a client object. Normally, you would construct a client like this:

from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer
client = Client.from_token(url="https://scicat.ess.eu/api/v3",
                           token=...,
                           file_transfer=SFTPFileTransfer(
                               host="login.esss.dk"
                           ))

In this example, we use ESS’s SciCat. If you want to use a different instance, you need to find out its API URL. Note that this is not the URL that you open in a browser; the API URL typically ends in a suffix like /api/v3.

Here, we authenticate using a token. You can find your token in the web interface by logging in and opening the settings. Alternatively, we could use username and password via Client.from_credentials.

WARNING:

Do not hard code secrets like tokens or passwords in notebooks or scripts! There is a high risk of exposing them when code is under version control or uploaded to SciCat.

Scitacean currently requires secrets to be passed as function arguments. So you will have to find your own solution for now.
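
One common way to keep a token out of the source is to read it from an environment variable at run time. The following is a minimal sketch under assumptions: the variable name SCICAT_TOKEN and the helper read_token are hypothetical and not part of Scitacean. The returned string would be passed as the token argument of Client.from_token.

```python
import os


def read_token(var_name: str = "SCICAT_TOKEN") -> str:
    """Return the SciCat token stored in an environment variable.

    SCICAT_TOKEN is a hypothetical name chosen for this sketch.
    """
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "export it in your shell before running this script."
        )
    return token


# Hypothetical usage with the client shown above:
# client = Client.from_token(url=..., token=read_token(), file_transfer=...)
```

This keeps the secret out of version control, but the token still lives in the process environment, so the usual care with shared machines applies.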

While the client itself is responsible for talking to SciCat, a file_transfer object is required to download data files. Here, we use SFTPFileTransfer which downloads / uploads files via SFTP.

The file transfer needs to authenticate separately from the SciCat connection. By default, it requires an SSH agent to be running and set up for the selected host.

For the purposes of this guide, we don’t want to connect to a real SciCat server in order to avoid the complications associated with that. So we set up a fake client that only pretends to connect to SciCat and file servers. Everything else in this guide works in the same way with a real client. See Developer Documentation/Testing if you are interested in the details.

[1]:

from scitacean.testing.docs import setup_fake_client
client = setup_fake_client()

Metadata

We need the ID (pid) of a dataset in order to download it. The fake client provides a dataset with id 20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3. We can download it using

[2]:

dset = client.get_dataset("20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3")

Datasets can easily be inspected in Jupyter notebooks:

[3]:

dset

[3]:

RawDataset

	Name	Type	Value	Description
*	creation_time	datetime	2022-06-29 14:01:05+0000	Time when dataset became fully available on disk, i.e. all containing files have been written. Format according to chapter 5.6 internet date/time format in RFC 3339. Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
*	source_folder	RemotePath	RemotePath('/hex/ps/thaum')	Absolute file path on file server containing the files of this dataset, e.g. /some/path/to/sourcefolder. In case of a single file dataset, e.g. HDF5 data, it contains the path up to, but excluding the filename. Trailing slashes are removed.
	description	str	Measured the thaum flux	Free text explanation of contents of dataset.
	name	str	Thaum flux	A name for the dataset, given by the creator to carry some semantic meaning. Useful for display purposes e.g. instead of displaying the pid. Will be autofilled if missing using info from sourceFolder.
	pid	PID	20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3	Persistent Identifier for datasets derived from UUIDv4 and prepended automatically by site specific PID prefix like 20.500.12345/
	proposal_id	str	None	The ID of the proposal to which the dataset belongs.
	sample_id	str	None	ID of the sample used when collecting the data.

Advanced fields

*	contact_email	str	p.stibbons@uu.am	Email of the contact person for this dataset. The string may contain a list of emails, which should then be separated by semicolons.
*	creation_location	str	UnseenUniversity	Unique location identifier where data was taken, usually in the form /Site-name/facility-name/instrumentOrBeamline-name. This field is required if the dataset is a Raw dataset.
*	owner	str	Ponder Stibbons	Owner or custodian of the dataset, usually first name + last name. The string may contain a list of persons, which should then be separated by semicolons.
*	owner_group	str	uu	Defines the group which owns the data, and therefore has unrestricted access to this data. Usually a pgroup like p12151
*	principal_investigator	str	p.stibbons@uu.am	First name and last name of principal investigator(s). If multiple PIs are present, use a semicolon separated list. This field is required if the dataset is a Raw dataset.
	access_groups	list[str]	['faculty']	Optional additional groups which have read access to the data. Users which are members in one of the groups listed here are allowed to access this data. The special group 'public' makes data available to all users.
	api_version	str	None	Version of the API used in creation of the dataset.
	classification	str	None	ACIA information about AUthenticity,COnfidentiality,INtegrity and AVailability requirements of dataset. E.g. AV(ailabilty)=medium could trigger the creation of a two tape copies. Format 'AV=medium,CO=low'
	comment	str	None	Comment the user has about a given dataset.
	created_at	datetime	2022-08-17 14:20:23+0000	Date and time when this record was created. This property is added and maintained by mongoose.
	created_by	str	Ponder Stibbons	Indicate the user who created this record. This property is added and maintained by the system.
	data_format	str	None	Defines the format of the data files in this dataset, e.g Nexus Version x.y.
	data_quality_metrics	int	None	Data Quality Metrics given by the user to rate the dataset.
	end_time	datetime	None	End time of data acquisition for this dataset, format according to chapter 5.6 internet date/time format in RFC 3339. Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server.
	instrument_group	str	None	Optional additional groups which have read and write access to the data. Users which are members in one of the groups listed here are allowed to access this data.
	instrument_id	str	None	ID of the instrument where the data was created.
	is_published	bool	None	Flag is true when data are made publicly available.
	keywords	list[str]	None	Array of tags associated with the meaning or contents of this dataset. Values should ideally come from defined vocabularies, taxonomies, ontologies or knowledge graphs.
	license	str	None	Name of the license under which the data can be used.
	lifecycle	Lifecycle	None	Describes the current status of the dataset during its lifetime with respect to the storage handling systems.
	orcid_of_owner	str	None	ORCID of the owner or custodian. The string may contain a list of ORCIDs, which should then be separated by semicolons.
	owner_email	str	None	Email of the owner or custodian of the dataset. The string may contain a list of emails, which should then be separated by semicolons.
	relationships	list[Relationship]	None	Stores the relationships with other datasets.
	shared_with	list[str]	None	List of users that the dataset has been shared with.
	source_folder_host	str	None	DNS host name of file server hosting sourceFolder, optionally including a protocol e.g. [protocol://]fileserver1.example.com
	techniques	list[Technique]	None	Stores the metadata information for techniques.
	updated_at	datetime	2022-11-01 13:22:08+0000	Date and time when this record was updated last. This property is added and maintained by mongoose.
	updated_by	str	anonymous	Indicate the user who updated this record last. This property is added and maintained by the system.
	validation_status	str	None	Defines a level of trust, e.g. a measure of how much data was verified or used by other persons.

Files: 2 (95 B)

Local	Remote	Size
None	RemotePath('flux.dat')	20 B
None	RemotePath('logs/measurement.log')	75 B

Scientific Metadata

Name	Value
data_type	histogram
temperature	123 [K]

All attributes listed above can be accessed directly:

[4]:

dset.type

[4]:

<DatasetType.RAW: 'raw'>

[5]:

dset.name

[5]:

'Thaum flux'

[6]:

dset.owner

[6]:

'Ponder Stibbons'

See Dataset for a list of available fields.

In addition, datasets can have free-form scientific metadata, which can be accessed using

[7]:

dset.meta

[7]:

{'data_type': 'histogram', 'temperature': {'value': '123', 'unit': 'K'}}
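
In the output above, an entry that carries a unit is a dict with value and unit keys, and the value is a string. Here is a minimal sketch of turning such an entry into a number, using a literal copy of that output instead of a live dataset. The helper name as_quantity is hypothetical and not part of Scitacean.

```python
def as_quantity(entry):
    """Split a scientific-metadata entry into (value, unit).

    Entries without a unit are returned unchanged with unit None.
    """
    if isinstance(entry, dict) and "value" in entry:
        return float(entry["value"]), entry.get("unit")
    return entry, None


# Literal copy of the dset.meta output shown above.
meta = {"data_type": "histogram", "temperature": {"value": "123", "unit": "K"}}

temperature, unit = as_quantity(meta["temperature"])
print(temperature, unit)

# Plain entries pass through untouched.
data_type, no_unit = as_quantity(meta["data_type"])
```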

Files

The data files associated with this dataset can be accessed using

[8]:

for f in dset.files:
    print(f"{f.remote_access_path(dset.source_folder) = }")
    print(f"{f.local_path = }")
    print(f"{f.size = } bytes")
    print("----")

f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/flux.dat')
f.local_path = None
f.size = 20 bytes
----
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/logs/measurement.log')
f.local_path = None
f.size = 75 bytes
----

Note that the local_path for both files is None. This indicates that the files have not been downloaded. Indeed, client.get_dataset downloads only the metadata from SciCat, not the files.
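
The remote access paths printed above are the dataset's source_folder joined with each file's remote path. The sketch below reproduces that composition with pathlib.PurePosixPath as a stand-in for Scitacean's RemotePath. The stand-in is an assumption made so that the example runs without the library. All values are copied from the output above.

```python
from pathlib import PurePosixPath

# Values copied from the output above.
source_folder = PurePosixPath("/hex/ps/thaum")
files = {"flux.dat": 20, "logs/measurement.log": 75}  # remote path -> size in bytes

# Join the source folder with each relative remote path.
access_paths = [source_folder / remote for remote in files]
total_size = sum(files.values())

for path in access_paths:
    print(path)
print(f"{total_size} B in {len(files)} files")
```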

We can download the first file using

[9]:

dset_with_local_file = client.download_files(dset, target="download", select="flux.dat")

[10]:

for f in dset_with_local_file.files:
    print(f"{f.remote_access_path(dset.source_folder) = }")
    print(f"{f.local_path = }")
    print(f"{f.size = } bytes")
    print("----")

f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/flux.dat')
f.local_path = PosixPath('download/flux.dat')
f.size = 20 bytes
----
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/logs/measurement.log')
f.local_path = None
f.size = 75 bytes
----

The download populates the local_path of the selected file:

[11]:

file = list(dset_with_local_file.files)[0]

[12]:

file.local_path

[12]:

PosixPath('download/flux.dat')

We can use it to read the file:

[13]:

with file.local_path.open("r") as f:
    print(f.read())

5 4 9 11 15 12 7 6 1
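
The file shown above holds whitespace-separated integers, so it can be parsed with a few lines of standard Python. This sketch writes the content shown above to a temporary file so that it runs without a download. With a real dataset, file.local_path would take the place of the hypothetical path variable.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for file.local_path; content copied from the output above.
    path = Path(tmp) / "flux.dat"
    path.write_text("5 4 9 11 15 12 7 6 1")

    # Split on whitespace and convert each token to an integer.
    counts = [int(token) for token in path.read_text().split()]

print(counts)
print(sum(counts))
```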

If we wanted to download all files, we could pass select=True to client.download_files, or omit the argument, since True is the default. See Client.download_files for more options to select files.