A news story published on Aug 6, 2020, in The Verge had the following to say: “Sometimes it’s easier to rewrite genetics than update Excel“.
The story was about a known problem where Excel by default converts gene names to dates . You can try this out at home with the gene symbol “MARCHF1” — formerly known as “MARCH1”. If you type “MARCH1” in an Excel cell, it converts it to “Mar-01”. The HUGO Gene Nomenclature Committee published guidelines in 2020 to work around the Excel issue. However, according to a 2021 study, the problem persists and the authors concluded: “These findings further reinforce that spreadsheets are ill-suited to use with large genomic data.”
But it seems we still cannot stop using Spreadsheets. Recently, a Twitter thread discussed the issue of using Spreadsheets for collections data and metrics. In this post, I try to summarise a few key points from that thread. The goal is partly to appreciate the hold Spreadsheets has in our lives but at the same time to acknowledge that we probably do not have to take a drastic measure as “rewrite genetics”, maybe a few best practices can help us use our tools better.
The twitter discussion started when a curator posted this:
Inevitably the issue of Spreadsheets came up:
And the conversation soon veered into other domains about recording loan requests, visits, walled/commercial software.
The gist of the conversation can be summarised as such: Besides the vast amount of scientific data we store in our collections and the various links to molecular, historical and other data types, we also have transactional and provenance data. For instance, digitisation, loans and visits requests. For a lot of us, these requests and subsequently the importance of these requests and the objects are hidden (often in Spreadsheets and Emails). And as a result, we cannot fully comprehend the time and effort it takes for processing these requests and the curatorial activity that happens behind the scene. This has a direct impact on how scientific data is used and shared. The more time it takes for curators and institutions to process this request, the more time it takes for the researcher to get the data. The more these efforts are invisible, the less funding and resources are allocated and as a result, slows down the data production lifecycle.
Some of these issues related to data curation work have also been raised in this paper : Groom, Q., Güntsch, A., Huybrechts, P., Kearney, N., Leachman, S., Nicolson, N., Page, R.D., Shorthouse, D.P., Thessen, A.E. and Haston, E., 2020. People are essential to linking biodiversity data. Database, 2020.)
How do we get these hidden data out of the Spreadsheets and make them usable and FAIR? There are various initiatives (like ELVIS in DiSSCo) that are trying to create services and infrastructures that can allow us to build abstraction layers to share these hidden metrics better.
We are not there yet. In the meantime, however, it is important to understand that tools like Excel are easy to use. It is cheaper than large enterprise database systems or large scale European projects. And of course, there’s also the resistance to change. When we use a tool for a long time with relatively good results (unless you are dealing with gene names!), it’s going to be hard for us to switch. There are also other issues such as creating and maintaining APIs and interfaces that need to be in place for the data elements to be useable.
This is not to sing the praise of Spreadsheets but we need to acknowledge the ubiquity, and ease of as we move forward to a FAIR data ecosystem. Instead of trying to change the genetic codes of our data and workflow, we need to find better ways to collaborate and share ideas (as we did in the Twitter thread!).
One thought on “Why Spreadsheets Matter?”
This is a difficult issue because so many people in the biodiversity data community use Excel for purposes other than sharing data, e.g. tracking loans etc. It becomes easy to think that Excel is free of data management problems and can be used for compiling and sharing biodiversity data. This leads to disastrous results. As a data auditor who looks at shared biodiversity data I often find
– data formatted in sych a way that it is no longer usable except in spreadsheet form
– data items shifted in blocks “left” or “right” into fields where they do not belong
– fill-down errors, including incrementing ones (https://www.datafix.com.au/BASHing/2021-05-26.html)
– formula residues (“#NAME?”)
– dates, numbers or certain types of strings converted by Excel into something else
– Excel date twins (https://www.datafix.com.au/BASHing/2021-03-09.html)
– formatting problems after export from Excel to plain text (usually CSV)
Problems like these have long been recognised and the Carpentries trainers have a series of lessons on how to avoid the most obvious ones:
There are also some widely ignored advisories as published papers, e.g.
Excel users feel comfortable in spreadsheets. They might think that they can spot and correct any data entry or Excel-caused errors just by running their eye over thousands of records, which is not humanly possible. They trust their data entry skills and do not use Excel validation rules because they are too complicated. They want to view their dataset neatly laid out in a table, and don’t know how to get that view of their data without using Excel.
That last problem, at least, has been solved. There is a free+premium table editor for Windows, Macs and Linux that doesn’t have Excel’s failings. I strongly recommend ModernCSV:
I don’t know how to fix the other two user prejudices.
LikeLiked by 1 person