Roundup of technical updates

Dear Readers,

In this post, we share some of our recent updates. We have been super busy with various technical developments. Here are a few highlights:

Upcoming conference presentations 📺

We will be at Biodiversity Information Standards (TDWG) 2022! This year the annual conference will be a hybrid meeting, hosted in Sofia, Bulgaria.

  1. https://doi.org/10.3897/biss.6.90987: “Human and Machine Working Together towards High Quality Specimen Data: Annotation and Curation of the Digital Specimen”. We are excited to share our proof of concept for a community annotation and curation service with a focus on data quality checks. This talk is part of SYM05: “Standardizing Biodiversity Data Quality”. SYM05 will start Tuesday at 09:00 EEST.
  2. https://doi.org/10.3897/biss.6.91168: “Zen and the Art of Persistent Identifier Service Development for Digital Specimen”. No, we won’t be talking about Zen philosophy here. The title is a nod to the 1974 book Zen and the Art of Motorcycle Maintenance. This talk (part of the LTD14 session entitled “Ensuring FAIR Principles and Open Science through Integration of Biodiversity Data”) will feature our current local Handle server setup and our work on persistent identifiers for Digital Specimens and related metadata. LTD14 will be held on Tuesday, 18 Oct, between 14:00 and 16:00 EEST.
  3. https://doi.org/10.3897/biss.6.91428: “Connecting the Dots: Joint development of best practices between infrastructures in support of bidirectional data linking”. This talk will focus on our work in the BiCIKL project, highlighting best practices for reliably linking specimen collection data with other data classes. Connecting the Dots is part of SYM09: “A Global Collections Network: building capacity and developing community” (Thu, Oct 20, 09:00-10:30 EEST). All three of the above talks have FAIR Digital Objects as their key focus. More on that below.
  4. https://doi.org/10.3897/biss.6.94350: “DiSSCo Flanders: A regional natural science collections management infrastructure in an international context” (part of SYM08: Monday 17 Oct 11:30-12:30 EEST) and https://doi.org/10.3897/biss.6.91391: “DiSSCo UK: A new partnership to unlock the potential of 137 million UK-based specimens” (part of SYM03: Thu 20 Oct 14:00-16:00 EEST). Both of these will highlight national-level initiatives that are doing some amazing work on scaling up digitisation, mobilising data and implementing the FAIR principles.

After TDWG, we are back in Leiden for the 1st International Conference on FAIR Digital Objects (26-28 Oct 2022 at the Naturalis Biodiversity Center). The following presentation will focus on DiSSCo and related collaborations such as the Biodiversity Digital Twin.

  1. https://doi.org/10.3897/rio.8.e93816: “From data pipelines to FAIR data infrastructures: A vision for the new horizons of bio- and geodiversity data for scientific research”. In this presentation, we will touch upon how we can move from various data pipelines and data aggregations to the next step of machine actionability: a FAIR (Findable, Accessible, Interoperable, and Reusable) and Fully AI Ready data infrastructure that can support pressing research questions. Check out the full program for exciting keynotes and panels.

openDS data modeling work 🏃

Within the openDS working group, we are focusing on what the different digital objects and their relationships should look like: Digital Specimen, Annotation, and Media objects. All of these are FAIR Digital Objects with their own persistent identifiers (Handle), PID Kernel, and structured serialisation (JSON and JSON-LD). We are paying close attention to the existing Darwin Core and ABCD elements in use and to the ongoing work with MIDS, to ensure the reusability of existing efforts. We also had several conversations that provided feedback on the new GBIF data model, and we are exploring various pilots to see how DiSSCo and GBIF can help each other.

The data modelling work is a community effort and often takes time to reach a consensus. So we are taking a two-pronged approach. First, we are continuing our regular working group meetings within the DiSSCo Prepare project and keeping outside stakeholders informed. We need this focused effort to fine-tune the details. Second, as the data model is evolving, we are taking an agile and DevOps approach to test and deploy the infrastructure. Over the past few months, we have deployed a robust test implementation of the DiSSCo Digital Specimen architecture following modern data and software architecture principles with FAIR and FAIR Digital Objects in mind. We will share some of these developments during TDWG and in this space as well.

CMS Roundtable 🦜

On Oct 10, we organised a virtual roundtable, inviting several Collection Management System (CMS) vendors and developers to think about how local data systems can interact with, integrate with or make use of the envisioned DiSSCo infrastructure. For a technical background, please check out DiSSCo Prepare report D6.1 (“Harmonization and migration plan for the integration of CMSs into the coherent DiSSCo Research Infrastructure”) where we talked about API integration and Event Driven Design. The report summarises a previous workshop we did on Event Storming. The CMS roundtable is a follow-up to this workshop to get more feedback from the community. We had a lively discussion around the future possibilities and challenges. A report and future undertakings based on this roundtable will be available soon.

Geo-diversity data 🌋

DiSSCo will be working with both bio and geo-diversity data. There are already several European and global efforts going on (such as GeoCASe and Mindat) that are using existing data pipelines from museums via BioCASe and ABCD Extension for Geosciences to mobilise minerals, rocks, meteorites and fossils. We are working closely with several different stakeholders (software developers, data managers, and collection managers) to understand current strengths and gaps. More on this soon.

FAIR Digital Objects and Machine-Actionability

Why it matters more in the age of AI and Machine Learning

One of the key concepts outlined in the 2016 FAIR principles paper is “machine-actionability”:

..the idea of being machine-actionable applies in two contexts—first, when referring to the contextual metadata surrounding a digital object (‘what is it?’), and second, when referring to the content of the digital object itself (‘how do I process it/integrate it?’). Either, or both of these may be machine-actionable, and each forms its own continuum of actionability.

Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

Since then, different interpretations and applications of machine-actionability have appeared. In recent times, with the growth of data repositories and data silos, new architecture principles (such as Lakehouse and Data Mesh), and critiques of large-scale AI models, the continuum of machine-actionability is ever expanding. Even with the larger scale and higher volume, the question remains the same. We still need to know “what is it?” and “how do I process it/integrate it?”. We still need to understand and process each data element (the different digital objects) with “minimum viable metadata” and do operations on them. This could be an image recognition program that distinguishes a husky from a wolf, or one that diagnoses cancerous cells. The attributes and context of the individual artefacts matter. As we are expanding the scale and usage of AI and Machine Learning, this matters even more now.

Furthermore, even though machine-actionability might imply minimal human intervention, the operations and results of these actions have real-world implications. Along with precise definitions and semantics, the context and provenance will become more and more relevant. The husky vs wolf example is often used to show bias in model training. The original research was designed to see how human users react to such errors and how to create confidence and trust in AI models. In order to move towards such trustworthy systems, we need to understand the implications and implementation of machine-actionability.

FAIR and FAIR Digital Objects can play a significant role in creating such confidence and trust, in particular when it comes to open science and data-intensive research. To begin with, precise definitions and formal semantics are essential. Along with that, capturing the context and provenance can tell us why, where, who and when. All these are building blocks that can make data and information “Fully AI-ready” (another interpretation of the FAIR acronym). This readiness needs to be a modular approach instead of one size fits all. At the same time, we need to provide an open and standard framework for better interoperability. A recent paper entitled “FAIR Digital Twins for Data-Intensive Research” by Schultes et al. proposes such a modular approach to building systems based on FAIR Digital Objects. DiSSCo is also working on similar efforts. Some of these ideas will also be explored in the newly launched Biodiversity Digital Twin project.

We welcome you to engage with us and think through these topics at the first international conference on FAIR Digital Objects. This event will be on Oct 26-28 (2022) in Leiden, the 2022 European City of Science. More information about the conference is here. Hope to see you in Leiden!

Why Spreadsheets Matter?

A news story published on Aug 6, 2020, in The Verge had the following to say: “Sometimes it’s easier to rewrite genetics than update Excel”.

The story was about a known problem where Excel by default converts gene names to dates. You can try this out at home with the gene symbol “MARCHF1”, formerly known as “MARCH1”. If you type “MARCH1” in an Excel cell, it converts it to “Mar-01”. The HUGO Gene Nomenclature Committee published guidelines in 2020 to work around the Excel issue. However, according to a 2021 study, the problem persists and the authors concluded: “These findings further reinforce that spreadsheets are ill-suited to use with large genomic data.”
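
As a rough illustration of the problem, the following Python sketch flags gene symbols that look like a month name or abbreviation followed by a day number, which is the kind of pattern a spreadsheet may read as a date. The month list, the regular expression and the function name `looks_like_excel_date` are assumptions made for this example; the sketch does not reproduce Excel's actual parsing rules.

```python
import re

# Month names and abbreviations that a spreadsheet might read as the start of a date.
# This list and the pattern below are illustrative assumptions, not Excel's real rules.
_MONTHS = (
    "JAN|FEB|MAR|MARCH|APR|APRIL|MAY|JUN|JUNE|JUL|JULY|"
    "AUG|SEP|SEPT|OCT|NOV|DEC"
)
_DATE_LIKE = re.compile(rf"^({_MONTHS})-?\d{{1,2}}$", re.IGNORECASE)


def looks_like_excel_date(symbol: str) -> bool:
    """Return True if a gene symbol resembles '<month><day>' (e.g. MARCH1, SEPT2)."""
    return bool(_DATE_LIKE.match(symbol.strip()))


symbols = ["MARCH1", "MARCHF1", "SEPT2", "SEPTIN2", "BRCA1"]
at_risk = [s for s in symbols if looks_like_excel_date(s)]
print(at_risk)
```

A check like this could run before a gene list is exported to a spreadsheet, so that risky symbols are spotted before they are silently converted.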

source: xkcd https://xkcd.com/2180/

But it seems we still cannot stop using Spreadsheets. Recently, a Twitter thread discussed the issue of using Spreadsheets for collections data and metrics. In this post, I try to summarise a few key points from that thread. The goal is partly to appreciate the hold Spreadsheets have on our lives, but at the same time to acknowledge that we probably do not have to take as drastic a measure as “rewriting genetics”. Maybe a few best practices can help us use our tools better.

The twitter discussion started when a curator posted this:

Inevitably the issue of Spreadsheets came up:

And the conversation soon veered into other domains about recording loan requests, visits, walled/commercial software.

The gist of the conversation can be summarised as follows: besides the vast amount of scientific data we store in our collections and the various links to molecular, historical and other data types, we also have transactional and provenance data, for instance digitisation, loan and visit requests. For a lot of us, these requests, and consequently their importance and that of the objects involved, are hidden (often in Spreadsheets and emails). As a result, we cannot fully comprehend the time and effort it takes to process these requests and the curatorial activity that happens behind the scenes. This has a direct impact on how scientific data is used and shared. The more time it takes for curators and institutions to process these requests, the more time it takes for the researcher to get the data. The more these efforts are invisible, the less funding and resources are allocated, which in turn slows down the data production lifecycle.

Some of these issues related to data curation work have also been raised in this paper: Groom, Q., Güntsch, A., Huybrechts, P., Kearney, N., Leachman, S., Nicolson, N., Page, R.D., Shorthouse, D.P., Thessen, A.E. and Haston, E., 2020. People are essential to linking biodiversity data. Database, 2020.

How do we get these hidden data out of the Spreadsheets and make them usable and FAIR? There are various initiatives (like ELVIS in DiSSCo) that are trying to create services and infrastructures that can allow us to build abstraction layers to share these hidden metrics better.

We are not there yet. In the meantime, however, it is important to understand that tools like Excel are easy to use. They are cheaper than large enterprise database systems or large-scale European projects. And of course, there’s also the resistance to change. When we use a tool for a long time with relatively good results (unless you are dealing with gene names!), it’s going to be hard for us to switch. There are also other issues, such as creating and maintaining the APIs and interfaces that need to be in place for the data elements to be usable.

https://thedatalabs.org/history-of-spreadsheets/

This is not to sing the praises of Spreadsheets, but we need to acknowledge their ubiquity and ease of use as we move forward to a FAIR data ecosystem. Instead of trying to change the genetic codes of our data and workflow, we need to find better ways to collaborate and share ideas (as we did in the Twitter thread!).

Identifiers and contextual data

Persistent identifiers, their links to other identifiers and various contextual data are essential components of the DiSSCo architectural design. Here is a brief example to demonstrate the importance of these identifiers, linking them and providing enough contextual information to perform operations on the digital specimen objects. The example here also highlights some of the challenges we are facing regarding standards mapping and interoperability.

At Naturalis, we are currently adding images to some of our bird’s nest specimen data. Here’s one nest for the bird Turdus merula merula (common blackbird, merel in Dutch).

Nest of Turdus merula merula from Naturalis Biodiversity Center. source: https://bioportal.naturalis.nl/specimen/ZMA.AVES.64793

We have data about this specimen in the museum portal and also in GBIF. However, we start to see some challenges while trying to find these specimens and link them. The museum system describes the nest using the ABCD schema with the following two terms: “recordBasis”: “PreservedSpecimen” and “kindOfUnit”: “nest”. recordBasis maps to the Darwin Core term basisOfRecord but kindOfUnit does not map to any terms (FYI: there is a mapping schema and the community is aware of these issues). As a result, searching for nests of Turdus merula merula can only be done from the Naturalis bioportal. Even though the GBIF record points back to the museum record, from the museum record we cannot get to GBIF. Other museums are using the field dynamicProperties. Here’s an example of a nest from NHMUK (this is a good example because it shows bi-directional links). As different standards and systems are involved in the data management and publishing pipeline that are not fully interoperable, we lose the context. Again, we are well aware of these problems. And various initiatives are currently addressing these issues from different data points.
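
To make the mapping gap concrete, here is a minimal Python sketch. It assumes a tiny, hypothetical mapping table (`ABCD_TO_DWC`) that contains only the one mapping mentioned above, and it keeps any unmapped ABCD term in `dynamicProperties` as a JSON string so that the context is not lost. This is one possible workaround shown for illustration; it is not how the Naturalis or GBIF pipelines actually work.

```python
import json

# Hypothetical, deliberately incomplete mapping: only recordBasis is mapped,
# mirroring the situation described in the text.
ABCD_TO_DWC = {"recordBasis": "basisOfRecord"}


def abcd_to_dwc(abcd_record: dict) -> dict:
    """Map ABCD terms to Darwin Core; keep unmapped terms in dynamicProperties."""
    dwc = {}
    unmapped = {}
    for term, value in abcd_record.items():
        if term in ABCD_TO_DWC:
            dwc[ABCD_TO_DWC[term]] = value
        else:
            unmapped[term] = value
    if unmapped:
        # dynamicProperties is a free-text Darwin Core term; JSON is one way to fill it.
        dwc["dynamicProperties"] = json.dumps(unmapped, sort_keys=True)
    return dwc


record = {"recordBasis": "PreservedSpecimen", "kindOfUnit": "nest"}
print(abcd_to_dwc(record))
```

With the unmapped term carried along, a downstream search for nests at least has something to match on, although every consumer then has to know to look inside `dynamicProperties`.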

For this example, I created a simple Digital Specimen. It has a persistent identifier and enough contextual information to tell us that this is a nest, not a bird specimen. It also shows that the collector “Max Weber” is not the famous sociologist but Max Wilhelm Carl Weber, a German-Dutch zoologist.

Screenshot from Bionomia, which uses Wikidata and GBIF identifiers to connect specimens to collectors. source: https://bionomia.net/Q63149

A snippet from our test digital specimen:

"ods:authoritative": {
  "ods:midsLevel": 1,
  "ods:curatedObjectID": "https://data.biodiversitydata.nl/naturalis/specimen/ZMA.AVES.64793",
  "ods:institution": "https://ror.org/0566bfb96",
  "ods:institutionCode": "Naturalis",
  "ods:objectType": "Bird's nest",
  "ods:name": "Turdus merula merula"
},

We can also add more information about this digital object to provide more context:

"ods:supplementary": {
  "gbifId": "https://www.gbif.org/occurrence/2434245775",
  "dwc:recordedBy": "Weber, Max",
  "dwc:recordedByID": "https://www.wikidata.org/wiki/Q63149"
}

To learn more about the elements and the structure of the Digital Specimen, please check out the openDS repo. The Python code for this example is available in this notebook.
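
For readers who prefer code to JSON, the two snippets above can be assembled in a few lines of Python. This is only a sketch and not the code from the notebook: the helper `build_digital_specimen` and the top-level `id` key with a placeholder Handle are assumptions made for illustration. The field names and values are taken from the snippets above.

```python
import json


def build_digital_specimen(pid: str, authoritative: dict, supplementary: dict) -> dict:
    """Assemble a digital-specimen-like dict (hypothetical helper, not the openDS API)."""
    return {
        "id": pid,  # assumption: placeholder key for the persistent identifier
        "ods:authoritative": authoritative,
        "ods:supplementary": supplementary,
    }


specimen = build_digital_specimen(
    pid="https://hdl.handle.net/<prefix>/<suffix>",  # placeholder, not a real Handle
    authoritative={
        "ods:midsLevel": 1,
        "ods:curatedObjectID": "https://data.biodiversitydata.nl/naturalis/specimen/ZMA.AVES.64793",
        "ods:institution": "https://ror.org/0566bfb96",
        "ods:institutionCode": "Naturalis",
        "ods:objectType": "Bird's nest",
        "ods:name": "Turdus merula merula",
    },
    supplementary={
        "gbifId": "https://www.gbif.org/occurrence/2434245775",
        "dwc:recordedBy": "Weber, Max",
        "dwc:recordedByID": "https://www.wikidata.org/wiki/Q63149",
    },
)

# The object type tells a machine that this is a nest, not a bird specimen.
print(specimen["ods:authoritative"]["ods:objectType"])
print(json.dumps(specimen, indent=2))
```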

This simple example shows the value of identifiers (such as Handle, ROR, Wikidata) that can help link various contextual information to make these specimens FAIR. In the coming years, DiSSCo with others will tackle some of these challenges.

Roundup of recent news

A lot has happened since our last post here. So here is a roundup of recent news.

Our very own Wouter Addink presented DiSSCo at the Chinese national biodiversity informatics congress. This is a great initiative to facilitate global conversations around data and infrastructures.

Several members of the DiSSCo Technical Team and members of DiSSCo Prepare WP5 (Common Resources and Standards) and WP6 (Technical Architecture & Services provision) participated in the first BiCIKL Hackathon (September 20-24, 2021) focusing on interoperability between the infrastructures. It was great to see everyone in person (the hackathon was also a hybrid event). Thanks to Meise Botanic Garden for hosting us.

In the team involving DiSSCo members, we worked on the DiSSCo modelling framework to see how to model heterogeneous concepts and data that are linked and related but need to be encapsulated for certain operations and workflows. We used Wikibase and the Shape Expressions Language (ShEx) to define a schema of data types and properties, and created a simple RDF validation pipeline to create FAIR Digital Objects in a Digital Object Repository. We will write more about this framework soon.

At the upcoming TDWG Virtual Annual Conference (Oct 18-22), you will hear more about the BiCIKL project and the hackathon. Also, during TDWG join us for SYM03: Specimen data provision and mobilisation: DiSSCo community e-services and standards.

DiSSCo also participated in the MOBILISE Action workshop on machine learning on images of Natural History Collections and workshop on Loans and Permits.

Recently, at the iDigBio Biodiversity Digitization 2021 conference, Alex Hardisty gave a presentation on Open Digital Specimen (openDS). If you missed it, the recording is available. Also, the Executive Director of DiSSCo, Dimitris Koureas, talked about “International research (data) infrastructure frameworks as leverage for sustainable natural science collections funding: The European example of the Distributed System of Scientific Collections”.

ELViS 1.0.0 is here: An important milestone for DiSSCo

On March 18, 2021 a new deployment of ELViS (European Loans and Visits System) became available. ELViS 1.0.0 is currently being used to facilitate the 3rd Transnational Access call for SYNTHESYS+ (to fund short-term research visits to consortium institutions) and the 2nd Virtual Access call (to fund digitisation-on-demand requests).

Preseucoela imallshookupis is a species of gall wasp. The genus name, Preseucoela, is named after Elvis Presley. Image source: Berenbaum, M., 2010. Preseucoela imallshookupis has left the building. American Entomologist, 56(4), pp.196-197. https://doi.org/10.1093/ae/56.4.196

The current version of ELViS is an important milestone in the SYNTHESYS+ project and also towards building DiSSCo — a new world-class Research Infrastructure for natural science collections. We have come a long way since mid 2019 when we started gathering user surveys and requirements with our development partner Picturae. The surveys, workshops and weekly meetings contributed to user-stories. Here is a collection of user stories that have been addressed in this version of ELViS.

GitHub project board from the ELViS repo

ELViS is a great example of a tool that is being built together with the community and the user base that will ultimately use it. As the members of the SYNTHESYS+ project are based in different parts of Europe, we were already holding regular Zoom meetings to facilitate the development process. GitHub was used extensively to create wireframes and guide the sprint activities. Although not all the efforts of the SYNTHESYS+ WP6 partners and the talented developers at Picturae are reflected in the following chart, you can see activities based on issues submitted over the past few months during our test and development process.

Chart generated in https://jerrywu.dev/github-issue-visualizer/

We still have a long way to go to support the loans and visits transactions but we are excited about the launch of the 3rd Transnational Access call and future of ELViS and DiSSCo.

The DiSSCo Knowledgebase

Authors: Mareike Petersen*, Julia Pim Reis*, Sabine von Mering*, Falko Glöckler*
* Museum für Naturkunde Berlin, Germany

Introduction

As an initiative formed by public research institutions, DiSSCo is committed to Open Science. We believe that Open Science not only makes scientific work more transparent and accessible but also enables a whole new set of collaborative and IT-based scientific methods. Therefore, the outputs of our common research projects are made openly available as far as possible, and research data are made Findable, Accessible, Interoperable and Reusable (the FAIR principles).

DiSSCo Prepare (DPP), the preparatory project phase of DiSSCo, will build on profound technical knowledge from various sources and initiatives. In order to allow for efficient knowledge and technology transfer for partners building the DiSSCo technical backbone, a central and freely accessible DiSSCo Knowledgebase will be designed and implemented within the project. The conceptual and developmental work is done under the Work Package “Common Resources and Standards” and the Task “DiSSCo Knowledgebase for technical development” (both led by the Museum für Naturkunde Berlin). This hub for knowledge management relevant within the DiSSCo context will not only store all research outputs from DiSSCo-linked projects in one place but also act as a reference for further building blocks relevant for the DiSSCo Research Infrastructure (RI).

Approach

As a first step, the extent of information types expected to be stored in the knowledgebase was collected. To get the most complete picture we discussed this topic within the respective project task group and work package, but also together with project overarching bodies such as the DiSSCo Technical Team. As a last preparatory step, we sent a survey to all task and work package leads of DPP to evaluate which information types partners are planning to make available via the knowledgebase. The feedback was included in the discussions and planning steps. The latest overview of desired information types is given in Figure 1.

Figure 1: Information Types in the DiSSCo Knowledgebase. Expected cluster of information categories (blue dots) based on DPP Project outcomes and relevant external resources. The format of resources varies within and among information types.

Because the term knowledgebase was traditionally used for a database of facts that machines draw on in reasoning processes, the partners agreed that in the DiSSCo Knowledgebase the main focus would, at least initially, be on human readability. The importance of machine readability varies amongst different information types. However, the metadata will be machine-readable in a consistent manner.

According to our findings, the different information types vary in formats and system requirements and cannot be stored in one single system. Whereas for some information types the target system is more or less set (e.g. GitHub for software code), for others a well considered decision is necessary. Task partners focused on a decision about a software system for the most common information type “Public Documents and External Resources” in order to aggregate references to distributed documents and sources in a single point of entry. A comprehensive landscape analysis with short presentations of each system took place during two task group meetings. For the decision process, requirements of the knowledgebase were collected and prioritised. 

Criteria of top priority for the decision of an appropriate component for the knowledgebase to serve the information type “Public Documents and External Resources” were:

  • Capability of storing documents and free text for referencing deliverables, publications and Questions and Answers / FAQs
  • Extensibility & customization (plugins or extensions)
  • Comprehensive public technical documentation and user documentation
  • Comprehensive REST API
  • Mechanisms for stable versioning of content
  • Search index (including the capability of indexing of customizable metadata)
  • Hierarchical structuring of pages and other entities
  • Capability of structuring the content by categories, tags or labels
  • File upload, storage and download
  • User-friendly search functionality
  • Regular security updates
  • View and download functionality for common document and image file formats
  • Option to run an instance in a cloud environment (rather than a Software as a Service approach)
  • Sustainability of the software product (e.g. organisation in place to support and maintain)

Based on the requirements, the most promising systems were DSpace, CKAN, and Alfresco. All three products meet the requirements for the information type “Public Documents and External Resources” in the knowledgebase according to the prioritized criteria. So, the following additional aspects of implementation and maintenance were included in the decision process: latest releases, size of the user community, regular support and good software maintenance allowing the correction of possible bugs, and regular security updates. On this basis, the team chose DSpace, an open source repository software package with rich and powerful features that focuses on long-term storage, access and preservation of digital content. It is available as free software under an open-source license in a public GitHub repository and has a huge user community and a very active group of developers. It offers customizable interfaces and a full-text search. The metadata provided for the content is indexed so that it is searchable, and it is accessible through a REST API, which helps make the data FAIR. A reliable search functionality allows end users to find content without delay even for huge amounts of data, which is essential for scalability as the amount of linked information increases. A list of further key features of DSpace can be found on the official website.

First Version

The implementation of DSpace as a first version of the DiSSCo Knowledgebase core will have a customized layout with the DiSSCo branding. It will allow the creation of a hierarchy of DiSSCo-linked projects and their respective collections of documents and references. In order to store content like Frequently Asked Questions (FAQs), best practices, guidelines, recommendations and documented decisions on the RI, the DiSSCo partners will be able to extend the knowledgebase with their own content (free text or files) with the help of easy-to-use web forms that include a rich text editor. An editorial workflow modelled in the system will allow the platform administrators to review the content prior to publication via role-based access. This will also allow documents to be prepared privately before they are published, and allow for thorough quality assurance.
The first version of the DiSSCo Knowledgebase will be launched by the end of January 2021 at http://know.dissco.eu

Next Steps

As a next step, the current results of the implementation of the DiSSCo Knowledgebase will be presented at the first All Hands (virtual) Meeting of DiSSCo Prepare (18 – 22 January 2021). This event will bring together leaders and partners of the project, with the objective of presenting, discussing and producing key elements of what will become Europe’s leading natural science collections Research Infrastructure, DiSSCo RI. In a dedicated session, the participants will have the opportunity to test the first version of the knowledgebase by browsing the software and testing the features, allowing us to collect feedback and requirements from the project partners.

The DiSSCo Knowledgebase, in its final version, will provide structured technical documentation of identified DiSSCo technical building blocks, such as web services, PID systems, controlled vocabularies, ontologies and data standards for bio- and geo-collection objects, collection descriptions, digital assets standards as well as domain-specific software products for quality assurance and monitoring; an assessment of their technical readiness for DiSSCo as well as specifications on their relevance for the overall DiSSCo technical infrastructure and the DiSSCo data model.

Outlook

DiSSCo uses a DOI namespace provided by DataCite for assigning DOIs to documents like public deliverables and reports. This process will be automated with the help of a DSpace plugin on the document’s submission. In addition, depositing and linking documents on Zenodo will be integrated. 

To increase the findability of content, the metadata will be linked and enriched by cross-references to related content and external resources (e.g. ORCID). In order to optimize findability even outside the search interface of the knowledgebase, JSON-LD will be embedded in the landing pages, so that the visibility of DiSSCo outcomes and knowledge is maximized in the big search engines.
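
As a rough illustration of embedding JSON-LD in a landing page, the Python sketch below renders a `<script type="application/ld+json">` element from a small metadata dict. The schema.org type and properties chosen here, the function name `jsonld_script_tag`, and the example values (including the placeholder DOI and ORCID) are assumptions for the example and are not taken from the DSpace configuration of the knowledgebase.

```python
import json


def jsonld_script_tag(metadata: dict) -> str:
    """Render metadata as a JSON-LD <script> element for a landing page."""
    payload = json.dumps(metadata, indent=2, ensure_ascii=False)
    # Escape "</" so that a value can never close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical document description using schema.org vocabulary.
doc = {
    "@context": "https://schema.org",
    "@type": "Report",
    "name": "Example DiSSCo Prepare deliverable",
    "identifier": "https://doi.org/10.xxxx/example",  # placeholder DOI
    "author": {
        "@type": "Person",
        "name": "Example Author",
        "sameAs": "https://orcid.org/0000-0000-0000-0000",  # placeholder ORCID
    },
}

print(jsonld_script_tag(doc))
```

Search engines that understand schema.org can read such a block directly from the page, without going through the search interface of the knowledgebase.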

Over the course of 2021, all the other information types will be accommodated or linked in the DiSSCo Knowledgebase. This can be assured by submitting at least a metadata description of information that will be managed outside of DSpace (e.g. software code on GitHub or controlled vocabularies in Wikibase). In addition, by building on machine-readable formats, custom plugins in DSpace will allow even richer connections between the different components of the DiSSCo Knowledgebase.

Want to get involved? Feel free to check our remote repository on GitHub or contact us here! 

Debunking reliability myths of PIDs for Digital Specimens

In this post I address an erroneous assertion, a myth perhaps: that the proposed Digital Specimen Architecture relies heavily on a centralized resolver and registry for persistent identifiers that is inherently not distributed, and that this makes the proposed “persistent” identifiers (PIDs) for Digital Specimens unreliable. “Unreliable” here means link rot (‘404 not found’) and/or content drift (content today is not the same as content yesterday).
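
The two failure modes can be told apart mechanically. The sketch below is an illustration only: it assumes that a SHA-256 checksum of the content was recorded when the identifier was last resolved, and the function `classify_resolution` and its labels are hypothetical. It does no network access itself; the HTTP status and body would come from whatever client resolves the PID.

```python
import hashlib
from typing import Optional


def checksum(content: bytes) -> str:
    """SHA-256 hex digest of the resolved content."""
    return hashlib.sha256(content).hexdigest()


def classify_resolution(status: int, content: Optional[bytes], known_checksum: str) -> str:
    """Label the result of resolving a PID as 'link rot', 'content drift' or 'ok'."""
    if status == 404 or content is None:
        return "link rot"  # the identifier no longer leads anywhere
    if checksum(content) != known_checksum:
        return "content drift"  # it resolves, but to different content than before
    return "ok"


yesterday = b'{"ods:name": "Turdus merula merula"}'
today = b'{"ods:name": "Turdus merula"}'
baseline = checksum(yesterday)

print(classify_resolution(200, yesterday, baseline))
print(classify_resolution(200, today, baseline))
print(classify_resolution(404, None, baseline))
```

A byte-level checksum is a blunt instrument: a legitimate, versioned update to a Digital Specimen would also be flagged as drift, so a real check would need to take versioning into account.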


Reflections on TDWG 2020 Virtual sessions and other thoughts on long term data infrastructures

This year the annual conference of Biodiversity Information Standards (historically known as the Taxonomic Databases Working Group, TDWG) is virtual and happening in two parts. The working sessions concluded a few weeks ago and are separate from the virtual conference, which will be held on October 19-23. All the recordings of the working sessions are now available on YouTube.

As several people have already mentioned on Twitter (#TDWG2020), the single track and the virtual format allowed participation from around the world, which generated a wide range of discussions not just on data standards but also on data curation, attribution, annotation, integration, publication and, most importantly, the human efforts behind the data and systems.

It is this human aspect in the midst of our current data-intensive approach that got me thinking about several contrasting aspects of biodiversity informatics and natural science collections management. Thinking about these two aspects together should be more at the forefront of our data and infrastructure discussions.

One contrast that lurks behind the “data-intensive” approach is the mix of structured collections of items (such as databases and spreadsheets) with narratives. This is what Lev Manovich, in his 1999 article, called the “database/narrative” opposition:

“As a cultural form, database represents the world as a list of items and it refuses to order the list. In contrast, a narrative creates a cause-and-effect trajectory of seemingly unordered items (events). Therefore, database and narrative are natural enemies. Competing for the same territory of human culture, each claims an exclusive right to make meaning out of the world.”

The physical objects stored and curated by natural history museums and other institutes (elements for scientific meaning-making of the world) provide an interesting opportunity to explore this contrast further. On the one hand, we have data collected about specimens and related objects, stored in different formats (databases, spreadsheets, etc.). Most often there is some structure to these datasets. For instance, this snippet from a GBIF occurrence record:

Branta bernicla nigricans (Lawrence, 1846)
Alaska
North America, United States, Alaska, Aleutians West Census Area
Saint Paul Island, Pribilof Islands
G. Hanna
NMNH Extant Biology
http://n2t.net/ark:/65665/396827066-bc6d-4419-83e7-25774fe2b0d3

With the help of APIs and parsing tools, we can figure out the structure of this snippet and arrive at an assessment that it contains a species name, a collector name, a place, and a specimen identifier. On the other hand, we find snippets like the following hidden among the structured elements. This is from the European Nucleotide Archive (ENA) accession data derived from the above specimen:

DT 02-JUL-2002 (Rel. 72, Created) 
DT 29-NOV-2004 (Rel. 81, Last updated, Version 4)
note="obtained from preserved Brant goose (Branta bernicula)
specimen from the Smithsonian Institution's Museum of Natural History; 
specimen was originally collected by G.D. Hanna from St. Paul Island, 
Pribilof Islands, Alaska, on Sept. 9, 1917

Here we find a narrative, an ordered event list: it describes who collected what, and when. Of course, from the point of view of linked data, semantic interoperability, machine readability, actionability and FAIR, there are plenty of issues here that the community is struggling with. But let’s focus on what it means when our systems and workflows encounter these two very different types of data.
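As a rough illustration of that contrast, the narrative note only yields its who, when, and where once someone writes a pattern for it. The Python sketch below does this for the single ENA note quoted above. The regular expression, the function name, and the output field names are all assumptions made for this example; it is not a general ENA parser, and a differently worded note would simply fail to match.

```python
import re
from datetime import date

# Free-text note from the ENA snippet above, with its line breaks joined.
ENA_NOTE = (
    "obtained from preserved Brant goose (Branta bernicula) "
    "specimen from the Smithsonian Institution's Museum of Natural History; "
    "specimen was originally collected by G.D. Hanna from St. Paul Island, "
    "Pribilof Islands, Alaska, on Sept. 9, 1917"
)

# Hypothetical pattern tailored to this one note.
NOTE_PATTERN = re.compile(
    r"\((?P<taxon>[A-Z][a-z]+ [a-z]+)\).*?"
    r"collected by (?P<collector>.+?) from (?P<locality>.+?), "
    r"on (?P<month>[A-Z][a-z]+)\.? (?P<day>\d{1,2}), (?P<year>\d{4})"
)

MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_note(note: str) -> dict:
    """Return who/when/where fields from the note, or {} if it does not match."""
    match = NOTE_PATTERN.search(note)
    if match is None:
        return {}
    # Abbreviations such as "Sept." are normalised to their first three letters.
    month = MONTHS.get(match["month"][:3])
    if month is None:
        return {}
    collected_on = date(int(match["year"]), month, int(match["day"]))
    return {
        "taxon": match["taxon"],
        "collector": match["collector"],
        "locality": match["locality"],
        "event_date": collected_on.isoformat(),
    }


if __name__ == "__main__":
    for field, value in parse_note(ENA_NOTE).items():
        print(f"{field}: {value}")
```

Even when the pattern succeeds, the extracted values are only strings: the collector appears as “G.D. Hanna” here and as “G. Hanna” in the GBIF snippet, so matching the two records still needs further work.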

First of all, with tools and APIs, these two datasets (GBIF and ENA) can eventually be linked and made interoperable and FAIR, which is definitely a useful endeavour. What is much harder is to study and convey the theoretical underpinning and the context of these data. From several publications related to this specimen (mentioned in the GBIF snippet above), we learn that it was used in research related to the 1918 pandemic virus (the Smithsonian has several thousand such specimens from the early part of the 20th century). As we are living through another pandemic, one might wonder: what were the historical, social, and political contexts of collecting and preserving these specimens? Who are the people behind these collection events? (See the Bionomia profile of G.D. Hanna.)

Scientists and data engineers might not be interested in these questions. Still, we often overlook that there is no such thing as raw data, and that context and history influence scientific reasoning and the direction of research. This is echoed by different philosophers and historians of science, most recently by Sabina Leonelli in the context of big data biology, where she says, “increasing power of computational algorithms requires a proportional increase in critical thinking”. And the more data-intensive and automated our research becomes, the more seriously we need to look at the:

“value- and theory-laden history of data objects. It also promotes efforts to document that history within databases, so that future data users can assess the quality of data for themselves and according to their own standards.”

The second point pertains to this aspect of history, in particular when data moves from one system to another. As data are collected from field sites, added to spreadsheets, imported into a database, and then published to an aggregator, they get denormalized and decontextualized, and then normalized and contextualized again. An API endpoint might provide some provenance information and a summary, but the narrative and “data events” are usually missing. And we probably do not expect all systems to capture all these events. But these practices, events, and data migrations leave traces of prior use that have impacts on later workflows (see the article by Andrea Thomer et al. that talks about data ghosts that haunt Collection Management System (CMS) data migration).

As we build and work on data infrastructures to support scientists and, ultimately, society, we should take a pragmatic and holistic approach to understanding the database/narrative mix. With our unbridled euphoria about all things machine learning, automation and AI, we should be cautious about the long-term implications and build something that is made to last.

This brings us back to the human aspect of the data. I will end the article with a quote by historian Mar Hicks. Recently COBOL (designed in 1959) became the scapegoat as the U.S. unemployment insurance systems were overwhelmed during the pandemic. It turns out the issue was not with COBOL; it was the web front end that people used to file the claims (written in Java). Hicks's article talks about the notion of “labor of care”: the engineers and other people behind COBOL, and the care and effort that go into maintaining large, complicated software and infrastructures, especially the ones that are needed during a crisis. Our tech innovation culture focuses too much on speed and structure rather than on the narrative. I leave you with the article's concluding words:

If we want to care for people in a pandemic, we also have to be willing to pay for the labor of care. This means the nurses and doctors who treat COVID patients; the students and teachers who require smaller, online classes to return to school; and the grocery workers who risk their lives every day. It also means making long-term investments in the engineers who care for the digital infrastructures that care for us in a crisis.

When systems are built to last for decades, we often don’t see the disaster unfolding until the people who cared for those systems have been gone for quite some time. The blessing and the curse of good infrastructure is that when it works, it is invisible: which means that too often, we don’t devote much care to it until it collapses.

Natural Science Identifiers & CETAF Stable Identifiers

The DiSSCo Technical Team gets asked a lot about Natural Science Identifiers (NSId). What are they? Why do we need them in addition to CETAF Stable Identifiers? Are they just for DiSSCo/Europe or are they global? In this post we answer those questions.

Q1. What is a Natural Science Identifier (NSId)?

A Natural Science Identifier (NSId) is a universal, unique persistent identifier for digitised natural science specimens (i.e., Digital Specimens) and other associated object types. An NSId will help you refer unambiguously to a specimen you are working with, or help you find a specimen that someone else has told you about by giving you its NSId, e.g., as a reference in a journal article.
