Dispelling common data sharing myths

This blog post is a distillation of a number of common misunderstandings about data sharing that I have encountered during my career, especially while working as a scientific editor. There are plenty of reasons why data sharing can be complicated. Academic researchers can face conflicting incentives, a confusing legal landscape, and substantial technical challenges, especially when dealing with large or dynamic datasets. Pernicious misconceptions, however, create unnecessary confusion and make data sharing seem more complicated than it needs to be. Below I list six of what are, in my experience, the most common and most easily dispelled of these “data sharing myths”.

1. Excel uses a proprietary file format that is not appropriate for open data sharing

Excel and other Microsoft Office products have used open file formats since 2007. These are zipped, XML-based formats that conform to publicly available specifications. These files can be distinguished from the older ‘closed’ formats by the ‘x’ at the end of the file extension (e.g. .xlsx vs .xls). They can be opened and edited with various open source spreadsheet alternatives, and, after being unzipped, the XML files inside can be read with a simple text viewer.
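To make the "zipped XML" point concrete, here is a minimal sketch using only Python's standard library. It does not open a real workbook: `make_demo_archive` is a hypothetical helper that builds a tiny stand-in archive containing a single worksheet part, so the example runs anywhere. A file saved by Excel contains more parts, and text cells usually point into a separate shared-strings part rather than holding the text directly.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by SpreadsheetML worksheet parts.
NS = {"s": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# A hand-written worksheet part with two numeric cells (illustrative only).
SHEET_XML = (
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<sheetData><row r="1">'
    '<c r="A1"><v>1</v></c><c r="B1"><v>2.5</v></c>'
    "</row></sheetData></worksheet>"
)


def make_demo_archive():
    """Build a stand-in for an .xlsx file: a zip holding one worksheet part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("xl/worksheets/sheet1.xml", SHEET_XML)
    return buffer.getvalue()


def list_parts(xlsx_bytes):
    """Return the names of the files stored inside the archive."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        return archive.namelist()


def read_cell_values(xlsx_bytes, part="xl/worksheets/sheet1.xml"):
    """Map cell references (e.g. 'A1') to the raw text of their <v> element."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        root = ET.fromstring(archive.read(part))
    values = {}
    for cell in root.iterfind(".//s:c", NS):
        value = cell.find("s:v", NS)
        if value is not None:
            values[cell.get("r")] = value.text
    return values


demo = make_demo_archive()
parts = list_parts(demo)
cells = read_cell_values(demo)
print(parts)
print(cells)
```

With a real file, the same two functions could be given the bytes read from disk, although the sheet name and part path may differ from the default assumed here.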

Pro Tip: If you use Excel, there is nothing wrong with sharing the XLSX files directly. They are a reasonable way to distribute small to medium-sized datasets in a manner that will be accessible to a wide range of users, especially if you need to include graphs or simple analyses. There are, however, plenty of reasons to be wary of spreadsheets, especially in critical data management situations. For tips on how to work better with spreadsheets see the Turing Way’s guidance.

2. CSV is the ‘best’ file format for sharing data

The “Comma Separated Values” file format (CSV) is loved by digital archivists and data managers for good reasons. Files in this format are simple to open with both text editors and spreadsheet programs, and are appropriate for a wide range of data types. In my experience, however, confusion arises when CSV is promoted as the ‘best’ file format, with the implication that using other file formats must somehow be ‘wrong’. The CSV format has its own limitations. It exists in a range of implementations (differing, for example, in delimiters and quoting conventions), which can make machine-processing of files from different sources error-prone, and it is less suited to semantic data structures (compared to e.g. RDF, XML or JSON) or very large datasets (compared to e.g. HDF).
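As a small illustration of how CSV implementations can differ, the sketch below parses the same two-row table as it might be exported under two conventions: comma-delimited with a decimal point, and semicolon-delimited with a decimal comma. The column names and values are invented for the example, and `parse_table` is a hypothetical helper rather than a standard function.

```python
import csv
import io

# The same table under two export conventions (made-up data).
comma_text = "site,temp_c\nA,21.5\nB,19.0\n"
semicolon_text = "site;temp_c\nA;21,5\nB;19,0\n"


def parse_table(text, delimiter, decimal):
    """Parse the table, stating the delimiter and decimal mark explicitly."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [
        (row["site"], float(row["temp_c"].replace(decimal, ".")))
        for row in reader
    ]


# Reading the semicolon file with the default comma delimiter raises no
# error; it just splits each line in the wrong place.
naive_rows = list(csv.reader(io.StringIO(semicolon_text)))

from_comma = parse_table(comma_text, delimiter=",", decimal=".")
from_semicolon = parse_table(semicolon_text, delimiter=";", decimal=",")

print(naive_rows[1])
print(from_comma == from_semicolon)
```

The naive read fails silently, which is why documenting the delimiter and decimal convention alongside a shared CSV file helps downstream users.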

Pro Tip: When converting between CSV and spreadsheet formats, watch out for character encoding errors!
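A quick sketch of what such an encoding error looks like, and one way to guard against it. The sample text is made up. The example assumes the file was written as UTF-8 and then read as Windows-1252, which is one common mismatch but not the only one. It also shows Python's `utf-8-sig` codec, which adds a byte order mark that some spreadsheet programs use as a hint that a file is UTF-8.

```python
# A CSV line containing non-ASCII characters (invented sample data).
original = "Møller,São Paulo,5 µg\n"

# Written to disk as UTF-8 ...
raw = original.encode("utf-8")

# ... but decoded with the wrong codec: no exception, just garbled text.
garbled = raw.decode("cp1252")

# Decoding with the codec that was actually used round-trips cleanly.
recovered = raw.decode("utf-8")

# "utf-8-sig" prepends a byte order mark on encoding and strips it on decoding.
with_bom = original.encode("utf-8-sig")
recovered_from_bom = with_bom.decode("utf-8-sig")

print(repr(garbled))
print(recovered == original)
print(with_bom[:3])
```

The practical takeaway is to state the encoding explicitly whenever a file is read or written, and to spot-check names and units containing accented or special characters after each conversion.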

3. A non-commercial license will ensure that my data are only used by academic researchers

Commercial use and research use are two distinct concepts. Non-commercial licenses, like Creative Commons’ popular CC BY-NC license, do not limit use to only research settings. This may seem like a fairly obvious distinction, but this is a point of confusion that I have encountered surprisingly often. I have even spoken with research groups that believed a CC BY-NC license would help them distribute human data in line with consent agreements that only permit “research use”. This is a dangerous misunderstanding that risks violating ethical obligations.

Pro Tip: For anyone working with sensitive human data, always speak with an appropriate human data protection expert before choosing any data license or data use agreement. For researchers interested in learning more about open data licensing, I would recommend starting with this paper by Labastida & Margoni (2020).

4. The CC BY license is a good choice for sharing data because it requires attribution

The CC BY license and the CC0 waiver are widely used and recommended for sharing open data. They both give users broad freedom to reuse and redistribute data. The CC BY license, however, adds a legal requirement for attribution. Researchers should be wary of guidance that suggests that data shared under CC BY must be attributed while data shared under CC0 need not be. Scholarly attribution is mainly enforced by ethical norms, not legal considerations, and these norms still apply to data shared under the CC0 waiver. Indeed, a number of scholarly organizations recommend using the CC0 waiver for research data (see also Hrynaszkiewicz & Cockerill 2012). Make sure you attribute your data sources, regardless of the licenses or waivers used.

Whether the added legal attribution requirement in the CC BY license is helpful in a scholarly context is debatable. It can be hard to determine when this requirement would actually be enforceable for factual scientific data. Moreover, researchers should be aware that the kind of attribution that the CC BY license requires may not align with the kind of “credit” they actually care about, particularly formal citations in peer-reviewed papers.

Pro Tip: Want to get credit for sharing your data? In most cases, there are better ways to expend your effort than worrying about licensing details. For instance, make sure your dataset landing page has clear citation information. Not sure what your data citation should include? I built a webtool that generates sample citations for a range of common dataset identifiers. Also, make sure your dataset is as FAIR as possible, so that others will be able to use it and credit you!
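To show what "clear citation information" might look like in practice, here is a minimal sketch that assembles a citation string from a few metadata fields. It is not the webtool mentioned above. The field order (creators, year, title, publisher, identifier) is an assumption modeled on common dataset citation guidance, and every value below, including the DOI, is a placeholder. Journals and repositories may ask for a different style.

```python
def format_data_citation(creators, year, title, publisher, doi):
    """Build a simple dataset citation string (hypothetical helper).

    creators: list of names already formatted as "Family, Initials".
    doi: the bare DOI, without the https://doi.org/ prefix.
    """
    names = "; ".join(creators)
    return f"{names} ({year}). {title}. {publisher}. https://doi.org/{doi}"


# Placeholder metadata, not a real dataset.
citation = format_data_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2024,
    title="Example stream temperature dataset",
    publisher="Example Repository",
    doi="10.1234/example.5678",
)
print(citation)
```

Showing a ready-to-copy string like this on a dataset landing page saves readers from having to guess which elements to include.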

5. If it’s on the web, it’s in the public domain

Just because something is freely available on the web does NOT mean it is in the public domain. “Public domain” is a specific legal designation that applies to creative works to which no copyright or other intellectual property rights apply. If a researcher develops a dataset that incorporates content from other online sources, it is their responsibility to determine whether any of this content may be covered by copyright or have other restrictions on reuse.

Facts themselves are not covered by copyright, so it is usually fine to aggregate individual data points from the literature, for example when performing a meta-analysis, without worrying about licensing or copyright. Care must be taken, though, when copying whole sets of data, especially from databases or data portals. Always check the terms of use carefully in these instances, and ask for permission if you are unsure whether you can redistribute data. One should also, of course, always cite one’s sources, regardless of copyright status.

Pro Tip: As you perform your research, carefully keep track of the copyright status and any terms of use statements associated with data downloaded from online sources. This will make it much easier to determine later how best to share your data.

6. There is no repository for my data

Some fields, especially molecular biology, have a history of sharing data through repositories that are dedicated to specific data types, for example the global nucleic acid sequence repositories of the INSDC. While this has been a powerful paradigm, it can lead to the misconception that a researcher needs a specialist data repository to share data effectively. There is now a wide range of excellent “generalist” data repositories serving the research community, including options like figshare, Zenodo and Dryad, as well as many institutional data repositories. With so many options now available, almost any type of data can be shared, assuming of course that there are no legal or ethical limitations that otherwise prevent open sharing.

Pro Tip: There are a number of resources on the web that can help you find a suitable data repository, including FAIRsharing.org and re3data.org.

For academic research groups that aren’t generating data from humans, or producing other special kinds of sensitive or restricted data, the message we send as open science advocates should be clear: Data sharing is not scary. File formats you are already using are probably fine. There is almost certainly a repository for your data. Pick a common unrestricted open data license or waiver, like the CC0 waiver, and don’t worry about the details. Help others credit you by providing clear citation information for your data. Make sure you set a good example by always citing your data sources in your own publications.

I hope that sharing these thoughts will help some other researchers or data managers grappling with these same myths!

This work is licensed under a Creative Commons Attribution 4.0 International License.

Andrew Lee Hufton
