Downloading datasets#
All communication with SciCat is handled by a client object. Normally, one would construct one using something like
from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer
client = Client.from_token(url="https://scicat.ess.eu/api/v3",
                           token=...,
                           file_transfer=SFTPFileTransfer(
                               host="login.esss.dk"
                           ))
In this example, we use ESS’s SciCat. To use a different instance, you need to find out its API URL. Note that this is not the URL that you open in a browser; the API URL typically ends in a suffix like /api/v3.
Here, we authenticate using a token. You can find your token in the web interface by logging in and opening the settings. Alternatively, we could use username and password via Client.from_credentials.
WARNING:
Do not hard-code secrets like tokens or passwords in notebooks or scripts! There is a high risk of exposing them when code is under version control or uploaded to SciCat.
Scitacean currently requires secrets to be passed as function arguments. So you will have to find your own solution for now.
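One common way to keep a token out of the source code is to read it from an environment variable at runtime. The following is only a sketch of that idea, not something Scitacean provides: the function name read_scicat_token and the variable name SCICAT_TOKEN are hypothetical choices made for this illustration.

```python
import os


def read_scicat_token(env_var: str = "SCICAT_TOKEN") -> str:
    """Return the token stored in an environment variable.

    Both the function name and the default variable name are
    hypothetical; pick whatever fits your setup.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set or is empty."
        )
    return token


# Usage (not executed here because it needs a real server and token):
# client = Client.from_token(
#     url="https://scicat.ess.eu/api/v3",
#     token=read_scicat_token(),
#     file_transfer=SFTPFileTransfer(host="login.esss.dk"),
# )
```

The token is still passed as a function argument, but it no longer appears in the notebook or script itself.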
While the client itself is responsible for talking to SciCat, a file_transfer object is required to download data files. Here, we use SFTPFileTransfer, which downloads and uploads files via SFTP.
The file transfer needs to authenticate separately from the SciCat connection. By default, it requires an SSH agent to be running and set up for the selected host.
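As a quick sanity check before attempting a transfer, one can test whether an SSH agent appears to be reachable from the current process. The sketch below assumes an OpenSSH-style agent, which advertises its socket through the SSH_AUTH_SOCK environment variable. The helper name ssh_agent_seems_available is hypothetical, and the check does not verify that the agent holds a key for the selected host.

```python
import os


def ssh_agent_seems_available() -> bool:
    """Return True if SSH_AUTH_SOCK is set to a non-empty value.

    This only indicates that an OpenSSH-style agent was started for
    this session. It does not prove that the agent has a usable key.
    """
    return bool(os.environ.get("SSH_AUTH_SOCK"))


if not ssh_agent_seems_available():
    print("No SSH agent detected; the SFTP file transfer may fail to authenticate.")
```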
For the purposes of this guide, we don’t want to connect to a real SciCat server in order to avoid the complications associated with that. So we set up a fake client that only pretends to connect to SciCat and file servers. Everything else in this guide works in the same way with a real client. See Developer Documentation/Testing if you are interested in the details.
[1]:
from scitacean.testing.docs import setup_fake_client
client = setup_fake_client()
Metadata#
We need the ID (pid) of a dataset in order to download it. The fake client provides a dataset with ID 20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3. We can download it using
[2]:
dset = client.get_dataset("20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3")
Datasets can easily be inspected in Jupyter notebooks:
[3]:
dset
[3]:
Name | Type | Value | Description | |
---|---|---|---|---|
creation_time | datetime | 2022-06-29 14:01:05+0000 | Time when dataset became fully available on disk, i.e. all containing files have been written. Format according to chapter 5.6 internet date/time format in RFC 3339. Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server. | * |
source_folder | RemotePath | RemotePath('/hex/ps/thaum') | Absolute file path on file server containing the files of this dataset, e.g. /some/path/to/sourcefolder. In case of a single file dataset, e.g. HDF5 data, it contains the path up to, but excluding the filename. Trailing slashes are removed. | * |
description | str | Measured the thaum flux | Free text explanation of contents of dataset. | |
name | str | Thaum flux | A name for the dataset, given by the creator to carry some semantic meaning. Useful for display purposes e.g. instead of displaying the pid. Will be autofilled if missing using info from sourceFolder. | |
pid | PID | 20.500.12269/72fe3ff6-105b-4c7f-b9d0-073b67c90ec3 | Persistent Identifier for datasets derived from UUIDv4 and prepended automatically by site specific PID prefix like 20.500.12345/ | |
proposal_id | str | None | The ID of the proposal to which the dataset belongs. | |
sample_id | str | None | ID of the sample used when collecting the data. |
Advanced fields
contact_email | str | p.stibbons@uu.am | Email of the contact person for this dataset. The string may contain a list of emails, which should then be separated by semicolons. | * |
creation_location | str | UnseenUniversity | Unique location identifier where data was taken, usually in the form /Site-name/facility-name/instrumentOrBeamline-name. This field is required if the dataset is a Raw dataset. | * |
owner | str | Ponder Stibbons | Owner or custodian of the dataset, usually first name + last name. The string may contain a list of persons, which should then be separated by semicolons. | * |
owner_group | str | uu | Defines the group which owns the data, and therefore has unrestricted access to this data. Usually a pgroup like p12151 | * |
principal_investigator | str | p.stibbons@uu.am | First name and last name of principal investigator(s). If multiple PIs are present, use a semicolon separated list. This field is required if the dataset is a Raw dataset. | * |
access_groups | list[str] | ['faculty'] | Optional additional groups which have read access to the data. Users which are members in one of the groups listed here are allowed to access this data. The special group 'public' makes data available to all users. | |
api_version | str | None | Version of the API used in creation of the dataset. | |
classification | str | None | ACIA information about AUthenticity,COnfidentiality,INtegrity and AVailability requirements of dataset. E.g. AV(ailabilty)=medium could trigger the creation of a two tape copies. Format 'AV=medium,CO=low' | |
comment | str | None | Comment the user has about a given dataset. | |
created_at | datetime | 2022-08-17 14:20:23+0000 | Date and time when this record was created. This property is added and maintained by mongoose. | |
created_by | str | Ponder Stibbons | Indicate the user who created this record. This property is added and maintained by the system. | |
data_format | str | None | Defines the format of the data files in this dataset, e.g Nexus Version x.y. | |
data_quality_metrics | int | None | Data Quality Metrics given by the user to rate the dataset. | |
end_time | datetime | None | End time of data acquisition for this dataset, format according to chapter 5.6 internet date/time format in RFC 3339. Local times without timezone/offset info are automatically transformed to UTC using the timezone of the API server. | |
instrument_group | str | None | Optional additional groups which have read and write access to the data. Users which are members in one of the groups listed here are allowed to access this data. | |
instrument_id | str | None | ID of the instrument where the data was created. | |
is_published | bool | None | Flag is true when data are made publicly available. | |
keywords | list[str] | None | Array of tags associated with the meaning or contents of this dataset. Values should ideally come from defined vocabularies, taxonomies, ontologies or knowledge graphs. | |
license | str | None | Name of the license under which the data can be used. | |
lifecycle | Lifecycle | None | Describes the current status of the dataset during its lifetime with respect to the storage handling systems. | |
orcid_of_owner | str | None | ORCID of the owner or custodian. The string may contain a list of ORCIDs, which should then be separated by semicolons. | |
owner_email | str | None | Email of the owner or custodian of the dataset. The string may contain a list of emails, which should then be separated by semicolons. | |
relationships | list[Relationship] | None | Stores the relationships with other datasets. | |
shared_with | list[str] | None | List of users that the dataset has been shared with. | |
source_folder_host | str | None | DNS host name of file server hosting sourceFolder, optionally including a protocol e.g. [protocol://]fileserver1.example.com | |
techniques | list[Technique] | None | Stores the metadata information for techniques. | |
updated_at | datetime | 2022-11-01 13:22:08+0000 | Date and time when this record was updated last. This property is added and maintained by mongoose. | |
updated_by | str | anonymous | Indicate the user who updated this record last. This property is added and maintained by the system. | |
validation_status | str | None | Defines a level of trust, e.g. a measure of how much data was verified or used by other persons. |
Files: 2 (95 B)
Local | Remote | Size |
---|---|---|
None | RemotePath('flux.dat') | 20 B |
None | RemotePath('logs/measurement.log') | 75 B |
All attributes listed above can be accessed directly:
[4]:
dset.type
[4]:
<DatasetType.RAW: 'raw'>
[5]:
dset.name
[5]:
'Thaum flux'
[6]:
dset.owner
[6]:
'Ponder Stibbons'
See Dataset for a list of available fields.
In addition, datasets can have free-form scientific metadata, which can be accessed using
[7]:
dset.meta
[7]:
{'data_type': 'histogram', 'temperature': {'value': '123', 'unit': 'K'}}
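In this example, the temperature value is stored as the string '123' rather than as a number, so it needs to be converted before it can be used in calculations. A minimal sketch, using a plain dict copied from the output above as a stand-in for dset.meta:

```python
# Stand-in for dset.meta, copied from the output above.
meta = {"data_type": "histogram", "temperature": {"value": "123", "unit": "K"}}

temperature = meta["temperature"]
# The value is a string in this example, so convert it explicitly.
value = float(temperature["value"])
unit = temperature["unit"]

print(f"temperature = {value} {unit}")  # prints: temperature = 123.0 K
```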
Files#
The data files associated with this dataset can be accessed using
[8]:
for f in dset.files:
    print(f"{f.remote_access_path(dset.source_folder) = }")
    print(f"{f.local_path = }")
    print(f"{f.size = } bytes")
    print("----")
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/flux.dat')
f.local_path = None
f.size = 20 bytes
----
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/logs/measurement.log')
f.local_path = None
f.size = 75 bytes
----
Note that the local_path for both files is None. This indicates that the files have not been downloaded. Indeed, client.get_dataset downloads only the metadata from SciCat, not the files.
We can download the first file using
[9]:
dset_with_local_file = client.download_files(dset, target="download", select="flux.dat")
[10]:
for f in dset_with_local_file.files:
    print(f"{f.remote_access_path(dset.source_folder) = }")
    print(f"{f.local_path = }")
    print(f"{f.size = } bytes")
    print("----")
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/flux.dat')
f.local_path = PosixPath('download/flux.dat')
f.size = 20 bytes
----
f.remote_access_path(dset.source_folder) = RemotePath('/hex/ps/thaum/logs/measurement.log')
f.local_path = None
f.size = 75 bytes
----
The download populates the local_path of the selected file:
[11]:
file = list(dset_with_local_file.files)[0]
[12]:
file.local_path
[12]:
PosixPath('download/flux.dat')
We can use it to read the file:
[13]:
with file.local_path.open("r") as f:
    print(f.read())
5 4 9 11 15 12 7 6 1
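Once the file is available locally, its contents can be processed like any other file. The sketch below assumes, based only on the printed output, that flux.dat holds whitespace-separated integers. To stay runnable without a SciCat client, it writes the same text to a temporary file instead of using file.local_path.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for file.local_path; the content is copied from the output above.
    local_path = Path(tmp) / "flux.dat"
    local_path.write_text("5 4 9 11 15 12 7 6 1")

    # Assumption: the file contains whitespace-separated integers.
    counts = [int(token) for token in local_path.read_text().split()]

print(counts)
print(f"total = {sum(counts)}")
```

With a real dataset, replace local_path with file.local_path from the downloaded dataset.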
If we wanted to download all files, we could pass select=True (or nothing; True is the default) to client.download_files. See Client.download_files for more options to select files.
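Because only the selected files get a local_path, code that processes a dataset may want to separate files that are available locally from those that are not. The following sketch uses simple stand-in objects that mimic the local_path and size values shown earlier. The stand-ins and the remote_path attribute name are assumptions made for illustration; with a real dataset, iterate over dset_with_local_file.files instead.

```python
from pathlib import PurePosixPath
from types import SimpleNamespace

# Stand-ins for the file objects of the dataset after downloading flux.dat.
files = [
    SimpleNamespace(
        remote_path="flux.dat",
        local_path=PurePosixPath("download/flux.dat"),
        size=20,
    ),
    SimpleNamespace(remote_path="logs/measurement.log", local_path=None, size=75),
]

# A file counts as downloaded once its local_path is set.
downloaded = [f for f in files if f.local_path is not None]
missing = [f for f in files if f.local_path is None]

print("downloaded:", [str(f.local_path) for f in downloaded])
print("not downloaded:", [f.remote_path for f in missing])
print("bytes still to download:", sum(f.size for f in missing))
```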