A news story published on Aug 6, 2020, in The Verge had the following to say: “Sometimes it’s easier to rewrite genetics than update Excel“.
The story was about a known problem where Excel by default converts gene names to dates . You can try this out at home with the gene symbol “MARCHF1” — formerly known as “MARCH1”. If you type “MARCH1” in an Excel cell, it converts it to “Mar-01”. The HUGO Gene Nomenclature Committee published guidelines in 2020 to work around the Excel issue. However, according to a 2021 study, the problem persists and the authors concluded: “These findings further reinforce that spreadsheets are ill-suited to use with large genomic data.”
But it seems we still cannot stop using Spreadsheets. Recently, a Twitter thread discussed the issue of using Spreadsheets for collections data and metrics. In this post, I try to summarise a few key points from that thread. The goal is partly to appreciate the hold Spreadsheets has in our lives but at the same time to acknowledge that we probably do not have to take a drastic measure as “rewrite genetics”, maybe a few best practices can help us use our tools better.
The twitter discussion started when a curator posted this:
Inevitably the issue of Spreadsheets came up:
And the conversation soon veered into other domains about recording loan requests, visits, walled/commercial software.
The gist of the conversation can be summarised as such: Besides the vast amount of scientific data we store in our collections and the various links to molecular, historical and other data types, we also have transactional and provenance data. For instance, digitisation, loans and visits requests. For a lot of us, these requests and subsequently the importance of these requests and the objects are hidden (often in Spreadsheets and Emails). And as a result, we cannot fully comprehend the time and effort it takes for processing these requests and the curatorial activity that happens behind the scene. This has a direct impact on how scientific data is used and shared. The more time it takes for curators and institutions to process this request, the more time it takes for the researcher to get the data. The more these efforts are invisible, the less funding and resources are allocated and as a result, slows down the data production lifecycle.
Some of these issues related to data curation work have also been raised in this paper : Groom, Q., Güntsch, A., Huybrechts, P., Kearney, N., Leachman, S., Nicolson, N., Page, R.D., Shorthouse, D.P., Thessen, A.E. and Haston, E., 2020. People are essential to linking biodiversity data. Database, 2020.)
How do we get these hidden data out of the Spreadsheets and make them usable and FAIR? There are various initiatives (like ELVIS in DiSSCo) that are trying to create services and infrastructures that can allow us to build abstraction layers to share these hidden metrics better.
We are not there yet. In the meantime, however, it is important to understand that tools like Excel are easy to use. It is cheaper than large enterprise database systems or large scale European projects. And of course, there’s also the resistance to change. When we use a tool for a long time with relatively good results (unless you are dealing with gene names!), it’s going to be hard for us to switch. There are also other issues such as creating and maintaining APIs and interfaces that need to be in place for the data elements to be useable.
This is not to sing the praise of Spreadsheets but we need to acknowledge the ubiquity, and ease of as we move forward to a FAIR data ecosystem. Instead of trying to change the genetic codes of our data and workflow, we need to find better ways to collaborate and share ideas (as we did in the Twitter thread!).