DiSSCo aims to describe relationships between specimens and, for example, the collections in which they are curated, the institutes that hold those collections, contributors, contributions, funders and scholarly publications. All these objects need to be uniquely identified so that their information can be connected.
To show the importance of connecting specimens with their institutes, we can look at the European Loans and Visits System, ELViS. This is a DiSSCo service under development that will provide physical and virtual access to the specimens that are curated and preserved in the collections held in our institutes. Many user stories collected for this service require institutes to be uniquely identifiable. A few examples:
- user story #130: Find institution that has holotype of a certain species
- user story #5: Request sample of specific specimens
- user story #121: Plan visit to use specialist equipment
- user story #65: Compare statistics with other institutions
- user story #129: Institution condition for loans (Nagoya related or other)
To be uniquely identified, an institute needs a persistent identifier, a PID. The Persistent Identifier policy for the European Open Science Cloud (EOSC) lists several requirements for a good persistent identifier: “A Persistent Identifier that supports and enables research that is FAIR is one that is globally unique, persistent, and resolvable”. It also describes what resolvable means: a PID is resolvable when it allows both human and machine users to access an object or its representation, and its Kernel Information. Kernel Information is a structured record that contains information (metadata) about the referred object, such as a pointer to the location where the data (bit sequence) for the object can be found.
When an object or its representation is no longer available, the PID still needs to resolve; in other words, resolution to Kernel Information must still be possible. The Kernel Information will then contain some ‘tombstone’ information about the object. The PID thus needs to persist indefinitely, which is very hard to achieve. It requires robust governance structures for PID registries, in which multiple organisations share the responsibility.
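The tombstone behaviour described above can be illustrated with a small sketch. It is not taken from the EOSC PID policy: the record fields (`pid`, `name`, `location`, `tombstone`), the in-memory `registry` dictionary and the function names are all hypothetical, chosen only to show that a PID keeps resolving to Kernel Information after the object itself is gone.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class KernelInformation:
    """Hypothetical minimal kernel record; the field names are illustrative."""
    pid: str
    name: str
    location: Optional[str]          # pointer to the object, None once it is gone
    tombstone: Optional[str] = None  # short note on what happened to the object


def resolve(registry: Dict[str, KernelInformation], pid: str) -> KernelInformation:
    """Resolution always returns the kernel record, also for withdrawn objects."""
    return registry[pid]


def withdraw(registry: Dict[str, KernelInformation], pid: str, reason: str) -> None:
    """Keep the PID in the registry, but replace the pointer with a tombstone."""
    registry[pid] = replace(registry[pid], location=None, tombstone=reason)
```

The key design point is that `withdraw` never deletes the registry entry; it only swaps the location for a tombstone note.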
To be able to connect scientific data on a global level, DiSSCo aims to use a global PID registry for research institutes, rather than to create its own local solution for only the DiSSCo institutes or to use a registry that lists only natural history collection institutes. Since all DiSSCo institutes participate in the DiSSCo Research Infrastructure, each of them is a research organisation by default. Our natural history museums have no properties that require them to have a PID that is different from that of other research organisations.
There are several existing registries that provide PIDs for organisations. An overview of organisation identifier providers can be found in the report from a survey that was carried out by ORCID. It lists, for example, ISNI (International Standard Name Identifier), LEI (Legal Entity Identifier) and GRID (Global Research Identifier Database). Which one to choose? For selecting an organisation identifier provider the DiSSCo technical team had several requirements:
- meet the requirements outlined in the EOSC PID Policy (see above)
- have an established registry that already contains enough research organisations (“critical mass”) and whose scope covers all DiSSCo institutes
- make the PIDs and their kernel data available in the public domain (Creative Commons Zero)
- have transparent, non-profit governance
- offer organisations the ability to manage their own records, if possible without significant costs
- have appropriate metadata associated with the PIDs (e.g. a name as a human-readable label, and relationship metadata for interoperability with other identifiers)
- resolve to HTTP(S) URIs to allow easy access by both humans and machines
None of the existing systems meets all the requirements, but GRID, the Global Research Identifier Database, is the registry that fits them best. It meets all of the requirements above except one important one: it is managed by a commercial company, Digital Science, and therefore does not have transparent, non-profit governance. However, there is a community effort to address that: ROR, the Research Organization Registry. This is a community-led project to develop an open, sustainable, usable, and unique identifier for every research organization in the world. Its steering group contains members from organisations like DataCite, Crossref, the California Digital Library and Digital Science. There is currently a 1:1 relationship between GRID and ROR identifiers, and both refer to each other in their metadata.
ROR has only just started: it published its minimum viable product in early 2019, a registry seeded with data from GRID. At the moment it contains the same number of organisations as GRID but less metadata, and the ROR organisation is seeking funding to become sustainable. Therefore we will use both, with GRID as the primary system during the development of the ELViS service. GRID contains some metadata that is not yet present in ROR but is useful for DiSSCo, such as the geolocation of an institute.
GRID and ROR currently have PIDs for almost 100,000 research organisations in 217 countries. The data is public domain, and contains information about an organisation such as its name, alternate names, and location. This data is extracted from research funding grants and research paper affiliations. Source data is associated manually with the corresponding GRID record in a process called mapping. Whenever a source data row cannot be mapped to a GRID record, a new record is created. This process of manual curation ensures that each organisation has only one record. Records are named using the generally recognised name of the institution, which is determined by consulting the official website, encyclopaedic records and other trusted data sources.
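The logic of the mapping step can be sketched as follows. This is only an illustration: in GRID the mapping is done manually by curators, whereas the sketch uses a naive exact match on normalised names. The record identifiers (`rec-1`, `new-2`), the function names and the matching rule are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def normalise(name: str) -> str:
    """Lower-case the name and collapse whitespace so trivial variants match."""
    return " ".join(name.casefold().split())


def map_rows(rows: List[str], records: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Map each source row to a record id, creating a record when none matches.

    `records` maps a (hypothetical) record id to its known names and aliases,
    and is extended in place when a new record is needed.
    """
    index = {
        normalise(name): record_id
        for record_id, names in records.items()
        for name in names
    }
    mapping = []
    for row in rows:
        key = normalise(row)
        if key not in index:
            # No existing record matches: create one, so later rows reuse it
            record_id = f"new-{len(records) + 1}"
            records[record_id] = [row]
            index[key] = record_id
        mapping.append((row, index[key]))
    return mapping
```

Because a newly created record is added to the index straight away, later rows with the same name are mapped to it instead of creating a duplicate, which mirrors the goal of one record per organisation.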
Since institutes in DiSSCo often participate in research grants and produce research papers, most of them likely already have a GRID and ROR identifier. In SYNTHESYS+ we are piloting the use of these identifiers. As the ELViS Minimum Viable Product, a system has been developed that supports a Virtual Access pilot provided by 22 institutes. Most of these institutes did indeed already have GRID and ROR identifiers. Those that did not have one turned out to be part of a university that had an identifier. However, GRID supports relationships such as child institutes and related institutes. The institutes without an identifier applied for one through a form supplied at https://www.grid.ac/institutes. This proved to be an easy process, which can also be carried out through ROR: https://ror.org/curation/. There are no costs involved; obtaining a GRID or ROR identifier is free. A few minor errors in the metadata were also fixed through the GRID form.
The use of GRID and ROR provides several benefits for DiSSCo. ROR identifiers are supported in version 4.3 of the DataCite Metadata Schema, making it easy to connect with datasets published through DataCite, such as the GBIF datasets. GRID identifiers are supported in ORCID, which provides identifiers for researchers, making it easy to connect people who have ORCID iDs with their institutes. Not only the institute name is supported as a name label, but also alternative name labels in the form of aliases, language variants and acronyms. These are commonly used in our community.
Let’s look at an example for the National Museum of Natural History in Paris:
The GRID identifier is grid.410350.3 and the ROR identifier is 03wkt5x30. Since the identifier records contain name labels, the institute could be displayed to a human user in a DiSSCo service as a label plus a link to the identifier landing page, for example as National Museum of Natural History or as MNHN, depending on the need.
With content negotiation, a machine sees the data differently from what a user sees in a web browser. To see what a machine or a piece of software sees, you need to use, for instance, a cURL command:
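As a minimal sketch of how a service might turn these identifiers into a label plus link, consider the following. The identifiers, name and acronym come from the example above, and the URI prefixes are the ones used for the landing pages in this post. The `Institute` class and the `landing_page` and `display` functions are hypothetical names, not part of any DiSSCo, GRID or ROR API.

```python
from dataclasses import dataclass

# Landing page prefixes as used for the example URIs in this post
GRID_BASE = "https://www.grid.ac/institutes/"
ROR_BASE = "https://ror.org/"


@dataclass(frozen=True)
class Institute:
    """Hypothetical container for the labels and identifiers of one institute."""
    name: str
    acronym: str
    grid_id: str
    ror_id: str


def landing_page(institute: Institute, scheme: str = "ror") -> str:
    """Build the landing page URI for the chosen identifier scheme."""
    if scheme == "ror":
        return ROR_BASE + institute.ror_id
    if scheme == "grid":
        return GRID_BASE + institute.grid_id
    raise ValueError(f"unknown identifier scheme: {scheme}")


def display(institute: Institute, use_acronym: bool = False, scheme: str = "ror") -> str:
    """Return 'label (link)', using the full name or the acronym as the label."""
    label = institute.acronym if use_acronym else institute.name
    return f"{label} ({landing_page(institute, scheme)})"


MNHN = Institute(
    name="National Museum of Natural History",
    acronym="MNHN",
    grid_id="grid.410350.3",
    ror_id="03wkt5x30",
)
```

A call such as `display(MNHN, use_acronym=True, scheme="grid")` would then give the short label with the GRID landing page, while the default gives the full name with the ROR link.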
curl -L -H "Accept: application/rdf+xml" https://www.grid.ac/institutes/grid.410350.3
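The same request can be made from a program. The sketch below is a rough Python equivalent of the cURL command, using only the standard library; the function names `build_request` and `fetch` are made up for this example. It assumes that the server honours the Accept header as described above, and what comes back depends entirely on the server.

```python
import urllib.request

GRID_URI = "https://www.grid.ac/institutes/grid.410350.3"


def build_request(uri: str, media_type: str = "application/rdf+xml") -> urllib.request.Request:
    """Prepare a GET request that asks for a machine-readable representation."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def fetch(uri: str, media_type: str = "application/rdf+xml") -> str:
    """Send the request and return the response body as text.

    urlopen follows HTTP redirects by default, similar to curl's -L option.
    """
    with urllib.request.urlopen(build_request(uri, media_type), timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `fetch(GRID_URI)` performs the actual network request; `build_request` on its own only prepares it, which makes it easy to inspect the header that will be sent.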
GRID and ROR unfortunately do not use the Handle system, as e.g. DOI and ePIC do. You therefore need to know the full URI, https://www.grid.ac/institutes/grid.410350.3 or https://ror.org/03wkt5x30, and the PID URIs cannot move to another location without breaking things, for instance if the ror.org or grid.ac websites cease to exist in the future. So although ROR and GRID provide the best solution so far for research organisation PIDs, there are still some improvements to make.
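One possible way for a consuming service to limit the impact of this is to store the bare identifier and the scheme separately from the resolver location, and to build the URI only when it is needed. The sketch below illustrates that idea. It is an assumption about how a client could be designed, not something GRID or ROR provide, and the names `KNOWN_PREFIXES`, `parse_pid_uri` and `to_uri` are hypothetical.

```python
from typing import Dict, Tuple

# Resolver prefixes as they appear in the URIs quoted in this post
KNOWN_PREFIXES: Dict[str, str] = {
    "grid": "https://www.grid.ac/institutes/",
    "ror": "https://ror.org/",
}


def parse_pid_uri(uri: str) -> Tuple[str, str]:
    """Split a PID URI into (scheme, bare identifier)."""
    for scheme, prefix in KNOWN_PREFIXES.items():
        if uri.startswith(prefix) and len(uri) > len(prefix):
            return scheme, uri[len(prefix):]
    raise ValueError(f"not a known GRID or ROR URI: {uri}")


def to_uri(scheme: str, identifier: str, prefixes: Dict[str, str] = KNOWN_PREFIXES) -> str:
    """Rebuild a URI from the stored scheme and identifier.

    Passing a different `prefixes` table is enough to point at a new
    resolver location without touching the stored identifiers.
    """
    return prefixes[scheme] + identifier
```

This does not make the identifiers location-independent in the way the Handle system does; it only keeps the resolver location in one place on the client side.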