Debunking reliability myths of PIDs for Digital Specimens

In this post I address an erroneous assertion, perhaps a myth: that the proposed Digital Specimen Architecture relies heavily on a centralized resolver and registry for persistent identifiers, that this arrangement is inherently not distributed, and that this makes the proposed “persistent” identifiers (PIDs) for Digital Specimens unreliable. Unreliable here means link rot (‘404 not found’) and/or content drift (content today is not the same as content yesterday).

This assertion and the concerns behind it (the myths) arose during a lively Q&A and associated ‘chat’ that took place while I was presenting recent progress in the development of the openDS standard at the virtual TDWG 2020 SYM07 symposium this week.

I want to show that any such issues are not those of the persistent identifier scheme itself or its associated service provider organizations but are usually human failings and inadequacies in the management and procedures adopted by users of such schemes.

Myth: The DOI and Handle systems are centralized systems for registration and resolution

The first myth is that identifier schemes such as the DOI system, the Handle System and the Domain Name System (DNS) are centralized and that this makes them unreliable.

With terms like centralized, decentralized and distributed people often talk at cross-purposes. We must separate how something is arranged organizationally from how it is arranged for technical implementation. This is the layering principle of architecture design.

A service can be organizationally centralized (i.e., offered and managed by a single organization) whilst being practically and technically distributed in its implementation (in terms of server location and/or subcontracts to provide the service). The converse is also true: a service can be offered and managed in a distributed manner by multiple organizations whilst being practically and technically centralized in implementation. In fact, if you look at the DOI system carefully you will see that it has both styles simultaneously, in different parts of what is overall a decentralized governance system with a distributed technical implementation. You can read more about its architecture in section 4 of RFC 3650. The system consists of multiple decentralized services, each of which is implemented in a distributed manner for resilience.

Myth: doi.org is centralized

You may think the doi.org resolution service is centralized, but it is not. It gives that appearance only in the sense that it has a single entry point via a proxy server at https://doi.org. Behind the scenes a resolution request is routed to the Local Handle Service (LHS) instance responsible for the specific prefix. This service is itself often distributed for resilience. LHSs are the points at which the metadata associated with a DOI/Handle name are maintained by the organization to which the Handle/DOI prefix is registered (or by a proxy on its behalf).
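The routing step can be pictured with a minimal sketch. It assumes a plain in-memory table and is not the Handle System's actual protocol; the prefixes, the host names and the function names `split_name` and `responsible_service` are all hypothetical.

```python
# Hypothetical table mapping a prefix to the service responsible for it.
# In the real system this information is held by the Handle System itself.
PREFIX_SERVICES = {
    "10.1234": "lhs-a.example.org",
    "10.5678": "lhs-b.example.org",
}


def split_name(name: str) -> tuple[str, str]:
    """Split a DOI/Handle name into prefix and suffix at the first slash."""
    prefix, sep, suffix = name.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix name: {name!r}")
    return prefix, suffix


def responsible_service(name: str) -> str:
    """Return the (hypothetical) service that holds the record for this name."""
    prefix, _suffix = split_name(name)
    try:
        return PREFIX_SERVICES[prefix]
    except KeyError:
        raise LookupError(f"no service registered for prefix {prefix}") from None
```

The single entry point only decides where to send the request; the record itself lives with whichever service is responsible for the prefix.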

Myth: DOIs/Handles are identifiers and resolvers at the same time

The identifier, the DOI name (for example, 10.XXXX/ABCD) is just that – a name that identifies a thing. Just like my name identifies me. When you use the name properly as in “doi: 10.XXXX/ABCD” it performs only the identification function. The name says nothing about the resolver you need to use nor the technology of resolution. When you make no assumption about this you retain the freedom to change the technology and to persist the identifier for as long as is needed. In the heritage collections sector, when identifying digital specimens, a reasonable target for that is 100+ years.

International Standard Book Numbers (ISBN) are another example of an identifier that says nothing about how to resolve it.

You can use any resolver you like, provided it is capable of resolving such names. Of course, the best-known resolver for DOIs is offered by the International DOI Foundation (IDF) via a proxy service at https://doi.org/ but there are several others. Common practice today is for humans to use DOIs within the environment of the World Wide Web, and so we often see DOIs presented in their Web-usable form: https://doi.org/10.xxxx/abcd. The IDF has recently changed its advice on the presentation of DOIs to encourage wider use of this form. In my opinion this is a mistake, although I can understand the logic when humans work with DOIs: it saves a copy-and-paste action when a DOI name is not clickable. Machines certainly don’t need it. In fact, in this form resolution becomes a multi-step process, with a DNS lookup needed as the first step.
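A minimal sketch of keeping the name separate from the resolver follows. The function names `bare_doi` and `as_url` are hypothetical, the list of recognized leading forms is an assumption and not exhaustive, and the second resolver base used in the usage note is my assumption of an alternative proxy.

```python
from urllib.parse import quote

# Presentation forms that wrap a DOI name (assumed list, not exhaustive).
_LEADS = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def bare_doi(text: str) -> str:
    """Strip any presentation form and return only the DOI name."""
    s = text.strip()
    lowered = s.lower()
    for lead in _LEADS:
        if lowered.startswith(lead):
            s = s[len(lead):].strip()
            break
    if not s.startswith("10.") or "/" not in s:
        raise ValueError(f"does not look like a DOI name: {text!r}")
    return s


def as_url(name: str, resolver: str = "https://doi.org/") -> str:
    """Combine a stored DOI name with whichever resolver is chosen at use time."""
    return resolver.rstrip("/") + "/" + quote(name, safe="/")
```

Storing only `bare_doi("https://doi.org/10.xxxx/abcd")` and calling `as_url(name)` or `as_url(name, "https://hdl.handle.net")` at the moment of use keeps the choice of resolver out of the stored identifier.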

Myth: DOI/Handles are unreliable

DOIs/Handles are not intrinsically unreliable. The issue of dead DOIs and of referenced content changing is not a problem of the DOI (or Handle) system per se but one of poor management and/or lack of proper procedures on the part of the users of such systems, i.e., those responsible for creating and maintaining DOI/Handle registrations.

Continuing with the ‘who I am and where I live’ analogy: when I move to a new house it is incumbent on me to tell the postal service, in particular, my new address, i.e., to update my metadata. Otherwise, mail posted to me will not go to the correct place. Of course, I have strong incentives to maintain my own metadata to ensure I receive my mail, but data guardians also have an incentive if they want their data to be found, re-used and cited more often.

Better automation has a key role to play in ensuring that when things move (migrate) their location data is updated, and that when a new version of the content is created an appropriate version control procedure is applied, including, where necessary, issuing a new identifier for the new version.

Content-based identifiers (e.g., as suffixes to DOI/Handle prefixes) have a role to play mitigating the latter problem (as discussed by Elliott et al., doi: 10.1016/j.ecoinf.2020.101132). The problem of identifying the content of time-series data sets at specific moments in time, of which the eBird data published to GBIF is one example, is well-known. However, content-based identifiers cannot solve issues of outdated location information or ‘404 not found’. Those problems can only be solved with adequate infrastructure functionality and procedures.
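As a minimal sketch of the idea, the suffix can be derived from a hash of the content, so that any change to the content yields a different name. The choice of SHA-256, the function names and the use of the placeholder prefix "10.XXXX" are my assumptions for illustration, not the scheme of the cited paper.

```python
import hashlib


def content_suffix(content: bytes) -> str:
    """Derive a suffix from the content itself (SHA-256 is an assumed choice)."""
    return hashlib.sha256(content).hexdigest()


def content_based_name(prefix: str, content: bytes) -> str:
    """Build a prefix/suffix name whose suffix is the content hash."""
    return f"{prefix}/{content_suffix(content)}"


def has_drifted(name: str, content: bytes) -> bool:
    """True if the content no longer matches the hash carried in the name."""
    _prefix, _sep, suffix = name.partition("/")
    return suffix != content_suffix(content)
```

A name built with `content_based_name("10.XXXX", snapshot)` lets anyone holding the bytes detect drift with `has_drifted`, but it says nothing about where those bytes can currently be fetched.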

Myth: Identifiers should be separate from the information needed for resolution

You cannot dissociate my name from the address where I live and expect to be able to find me. In the same way, you cannot separate a persistent identifier from the information needed to resolve (dereference) it to the location of the thing being identified. Any identifier resolution system must maintain a list of locations (one or more) where I, or another thing, could reasonably be expected to be found. If you encode that information into the identifier itself, then the identifier will become obsolete when the thing moves to a new location.
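A minimal sketch of such a resolver record, using an in-memory table, shows why the location belongs in the maintained record and not in the name. The class `Resolver`, its methods and the example.org URLs are hypothetical.

```python
class Resolver:
    """Toy resolver: maps a persistent name to its current locations."""

    def __init__(self) -> None:
        self._locations: dict[str, list[str]] = {}

    def register(self, name: str, location: str) -> None:
        """Record a location for a name, creating the record if needed."""
        known = self._locations.setdefault(name, [])
        if location not in known:
            known.append(location)

    def relocate(self, name: str, old: str, new: str) -> None:
        """Update the record when the thing moves; the name is unchanged."""
        known = self._locations[name]
        known[known.index(old)] = new

    def resolve(self, name: str) -> list[str]:
        """Return a copy of the current locations for a name."""
        try:
            return list(self._locations[name])
        except KeyError:
            raise LookupError(f"unknown name: {name}") from None
```

After `relocate("10.XXXX/ABCD", "https://old.example.org/s/1", "https://new.example.org/s/1")`, every existing citation of 10.XXXX/ABCD still resolves. Skipping that update is the human failing that produces a ‘404 not found’.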

As John Kunze (California Digital Library) asserts in the documentation of his ‘nice opaque identifiers’ (noid) generator: “No technology exists that automatically manages objects and associations [namestring+assertions]; persistence is a matter of service commitment, tools that support that commitment, and information that allows users receiving identifiers to make the best judgment regarding an organization’s ability and intention to maintain them.”

With its PID services specifically for Digital Specimens and other associated object types (digital collections, annotations and interpretations, loans and visits), DiSSCo certainly has the intention to make the commitment to long-term persistence of the identifiers it will use, and to their maintenance and (in specific, well-defined cases) graceful degradation.

END.
