Historical Overview
Open Science
Part of all research data is produced with public funding. In that case, not only the creators but everyone who finances its production in any form (for example, through taxes) has a legitimate claim to access it. Even for data generated from sources other than public funding, there may be a need for broader accessibility. The Open Science movement aims to promote the open yet controlled dissemination of research data and of the publications based on it. Data repositories provide a suitable infrastructure for sharing research data.
Data Repositories
Data repositories make research data shareable and manageable. There are generic data repositories as well as domain-specific ones. Generic data repositories store uploaded files as data packages organized in some hierarchy or structure. These data packages come with editing and viewing permissions, version control, and possibly persistent identifiers. Their primary purpose is to provide reliable evidence for the results of publications, to enable verification of the claims made in those publications, and to support the reproducibility of experiments.
FAIR Principles
The FAIR principles are becoming increasingly prominent. The acronym stands for Findable, Accessible, Interoperable, and Reusable, and it summarizes the requirement that research data be findable, easily accessible, interoperable, and reusable. The Open Science initiative has readily adopted these principles because they make already published data easier to process for any knowledgeable person (although adhering to the FAIR principles does not guarantee that research data will be made publicly available). FAIR compliance should serve not only human users: the data should also be accessible to, and in some cases processable by, automated tools. The latter requirement, however, presupposes semantic annotation of research data, which existing data repositories either do not support or support only with significant compromises.
FAIR Digital Objects
In 2018, the European Union introduced the concept of FAIR Digital Objects through an action plan. FAIR Digital Objects (FDOs) are digital objects that implement the FAIR principles in a given environment: "These objects could represent data, software, protocols or other research resources." ... "They need to be accompanied by Persistent Identifiers (PIDs) and metadata rich enough to enable them to be reliably found, used and cited." ... "Software and algorithms, when shared, should include not just the source itself but also appropriate documentation including machine-actionable statements about dependencies and licencing."
In simplified terms, this definition means that an FDO should be self-descriptive in a given digital environment, both for humans and for automated processing tools. To satisfy the criteria of reuse and reproducibility, the metadata must formally specify the steps needed to process the research data and must identify both the source data and the results of processing. This criterion is difficult for generic data repositories to meet: the repository and its formal metadata typically operate at the data package level, which makes it hard to make assertions about individual files.
RO-Crate
RO-Crate (Research Object Crate) is a packaging technique that was selected during the ELKH ARP project to support the FAIR principles. From a user's perspective, an RO-Crate data package offers a hierarchical structure of files and URI-addressable resources (essentially, files and references organized into directories). Descriptive data and metadata can be associated with each element (that is, the entire data package, directories, files, and references), so each file can describe how it was created and what it contains. An RO-Crate data package is a bundled file that contains this structure and the associated metadata.
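To make the structure concrete, the following is a minimal sketch of how the metadata of such a package could be assembled. It assumes the RO-Crate 1.1 conventions (a JSON-LD file named ro-crate-metadata.json whose graph holds a metadata descriptor, a root dataset, and one entity per file). The helper name build_crate_metadata, the file paths, and the descriptive values are hypothetical and serve only as illustration; directories, which would be additional Dataset entities, are left out for brevity.

```python
import json


def build_crate_metadata(name, description, files):
    """Return an RO-Crate 1.1 style JSON-LD document as a dict.

    `files` maps a relative file path to a dict of descriptive
    properties for that file (hypothetical values in this sketch).
    """
    # Root dataset: describes the package as a whole and lists its parts.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "hasPart": [{"@id": path} for path in files],
    }
    # Metadata descriptor: identifies the metadata file and points at the root.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # One entity per file, carrying that file's own metadata.
    file_entities = [
        {"@id": path, "@type": "File", **props} for path, props in files.items()
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *file_entities],
    }


crate = build_crate_metadata(
    name="Example measurement series",
    description="Hypothetical dataset used only for illustration.",
    files={
        "data/raw.csv": {"name": "Raw measurements", "encodingFormat": "text/csv"},
        "data/clean.csv": {"name": "Cleaned measurements", "encodingFormat": "text/csv"},
    },
)
print(json.dumps(crate, indent=2))
```

The point of the sketch is that metadata is attached to each file entity individually, and not only to the package as a whole.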
The Relationship between RO-Crate and Data Repositories
An RO-Crate data package, being a data file, can be uploaded to generic repositories. The upload creates a repository data package that includes the contents of the uploaded RO-Crate data package and can be given its own metadata. In our interpretation, the RO-Crate data package represents a complete dataset intended for publication as part of a piece of research. The metadata for the entire RO-Crate data package and the metadata for the data package in the repository should therefore match, since they describe the same thing. However, data repositories offer no explicit support for this, and data can be discovered only through the metadata at the data package level, even though the RO-Crate package provides file-level metadata.
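The matching requirement can be illustrated with a small consistency check. This is only a sketch under stated assumptions: the crate metadata is taken to follow the RO-Crate 1.1 layout (a descriptor entity with the identifier ro-crate-metadata.json whose about property points at the root dataset), and the repository's package-level metadata is modeled as a plain dictionary. The function names, the compared fields, and the sample values are hypothetical and do not reflect any particular repository's API.

```python
def find_root_entity(crate):
    """Return the root dataset entity of an RO-Crate JSON-LD document."""
    graph = crate["@graph"]
    by_id = {entity["@id"]: entity for entity in graph}
    for entity in graph:
        # The metadata descriptor names the root dataset via `about`.
        if entity.get("@id") == "ro-crate-metadata.json":
            return by_id[entity["about"]["@id"]]
    raise ValueError("no metadata descriptor found in @graph")


def metadata_mismatches(crate, repository_record, fields=("name", "description")):
    """Map each differing field to (crate value, repository value)."""
    root = find_root_entity(crate)
    return {
        field: (root.get(field), repository_record.get(field))
        for field in fields
        if root.get(field) != repository_record.get(field)
    }


# Hypothetical crate metadata and repository record for illustration.
example_crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example measurement series",
            "description": "Version described inside the crate.",
        },
    ],
}
example_record = {
    "name": "Example measurement series",
    "description": "Version entered in the repository form.",
}

mismatches = metadata_mismatches(example_crate, example_record)
print(mismatches)
```

An empty result would mean that the two descriptions agree on the compared fields; any entry signals that the package-level metadata has drifted from the crate's root metadata.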
ARP Data Repository, AROMA
During the ARP project, a new data repository was developed with the support of ELKH. This generic repository is an enhancement of Harvard Dataverse that goes beyond simple file upload and supports the management of RO-Crate data packages. It allows RO-Crate data packages to be imported, exported, and edited locally in a way that keeps the metadata for the entire RO-Crate and the metadata for the repository data package in sync during every operation. The integrated tool for file-level metadata editing is the AROMA (ARP RO-Crate Manager) software component.