Building data citations from ‘roadmap-compliant’ metadata sources

As an editor, I often need to advise authors on how to format their citations. At Scientific Data, we, unsurprisingly, take particular care with citations to datasets. It can be difficult, however, to divine the ‘right’ citation information for a dataset.

Three papers published in 2018 and 2019 collectively presented a ‘roadmap’ designed to take the guesswork out of this task (see Further reading below). A key aspect was a set of standards by which scholarly data repositories could provide machine-readable citation metadata (Fenner et al. 2019).

As a bit of a learn-to-code project, I have been building a web tool that checks whether a dataset has citation metadata in one of the roadmap-recommended formats and then builds a formatted citation. The result, so far, is here: Data Citation Formatter. The current version can handle datasets with either DOI or identifiers.org persistent identifiers, and with metadata provided by DataCite or Crossref, or embedded in the webpage as Schema.org JSON-LD. It outputs citations in the Nature Research style, which is similar in content to the DataCite recommended style.
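
To make the Schema.org route a little more concrete, here is a minimal Python sketch of the general idea: pull JSON-LD out of a landing page, look for a Dataset record, and assemble a citation string. This is not the code behind Data Citation Formatter. The function names (find_dataset, format_citation) are hypothetical, the output only loosely follows the pattern used in the reference list at the end of this post, and it assumes creator names arrive already formatted. It also ignores real-world wrinkles such as @graph containers and identifier objects.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def _as_list(value):
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _name(entity):
    # Schema.org values may be plain strings or objects with a "name" key.
    if isinstance(entity, dict):
        return str(entity.get("name", "")).strip()
    return str(entity).strip() if entity else ""


def _sentence(text):
    return text if text.endswith(".") else text + "."


def find_dataset(html):
    """Return the first JSON-LD object typed as a Dataset, or None."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing outright
        for item in _as_list(data):
            if isinstance(item, dict) and "Dataset" in _as_list(item.get("@type")):
                return item
    return None


def format_citation(record):
    """Build 'Creators. Title. Publisher, URL (year).' from a Dataset record."""
    creators = [_name(c) for c in _as_list(record.get("creator") or record.get("author"))]
    creators = [c for c in creators if c]
    if len(creators) > 1:
        author_text = ", ".join(creators[:-1]) + " & " + creators[-1]
    else:
        author_text = "".join(creators)
    title = _name(record.get("name"))
    publisher = _name(record.get("publisher"))
    link = _name(record.get("url"))
    year = _name(record.get("datePublished"))[:4]
    tail = ", ".join(p for p in (publisher, link) if p)
    if year:
        tail = f"{tail} ({year})".strip()
    parts = [p for p in (author_text, title, tail) if p]
    return " ".join(_sentence(p) for p in parts)
```

Calling find_dataset on the HTML of a landing page and passing the result to format_citation gives a single citation string. Any field missing from the metadata is simply left out, which is roughly the behaviour you want when checking what a repository actually exposes.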

Services like DataCite, Crossref, doi2bib and Crosscite, as well as many reference management software packages, already provide access to DOI-based metadata and can generate formatted citations. But, with data citation standards still evolving, the outputs from these services can vary. And, to my knowledge, no citation formatting service currently supports identifiers.org datasets. This tool provides a quick, independent way to validate a provided citation regardless of the persistent identifier type or metadata source, and to compare different metadata sources when a repository uses more than one.

A few important caveats: First, I am a terrible coder. The tool is simple and probably also quite buggy. This is a side project of mine, so don’t expect bugs to be fixed quickly or ever. The tool also presently offers no machine-readable outputs. If a BibTeX or RIS output would be useful to you, please let me know. It would be relatively easy to add one of those options if there is interest. In any case, advanced users would be better served by directly connecting with the underlying metadata sources.

The source code is openly available on GitHub (alhufton2/data-citer). The public web tool caches results for some hours, so it may not immediately recognize changes in the metadata sources. You may want to run the tool locally if you need to check for metadata changes on a shorter time scale.
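
For readers wondering why a cache delays updates, here is a small Python sketch of a time-limited cache. It illustrates the general technique only. I am not describing how the public tool actually stores results, and the class name TTLCache and its parameters are made up for this example.

```python
import time


class TTLCache:
    """Keep values for a fixed number of seconds, then treat them as missing."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl:
            # Entry is stale: drop it so the caller fetches fresh metadata.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock(), value)
```

With a lifetime of a few hours, a repeated lookup for the same identifier returns the stored citation until the entry expires. That is why a repository's metadata fix may not show up straight away.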

A few words on the metadata sources used: Crossref is not actually mentioned as a metadata source in the Fenner et al. roadmap paper, but it predates DataCite and still serves some well-established scholarly data repositories. It also has a dedicated ‘dataset’ record type. Therefore, it made sense to me to include it. The tool will also grab repository names from the identifiers.org registry API, another source not explicitly mentioned in the roadmap. This allows it to create a very minimal citation for identifiers.org datasets that do not have other metadata. The tool currently doesn’t support metadata in Dublin Core, DATS or Schema.org microdata. If you know of any repositories using these formats, please send me an example!
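
As a rough illustration of that fallback, the Python sketch below first decides whether an input looks like a DOI or an identifiers.org compact identifier (prefix:accession), and then builds a bare-bones citation from a repository name and the resolvable link. The patterns are simplified assumptions rather than full validators, the function names are hypothetical, and the repository name is passed in by hand instead of being looked up in the registry.

```python
import re

# Simplified patterns: good enough for a sketch, not complete validators.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
COMPACT_ID_PATTERN = re.compile(r"^([A-Za-z0-9._]+):([^/\s]\S*)$")

DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)
IDORG_PREFIXES = ("https://identifiers.org/", "http://identifiers.org/")


def _strip_prefix(text, prefixes):
    for prefix in prefixes:
        if text.lower().startswith(prefix):
            return text[len(prefix):]
    return text


def classify_identifier(raw):
    """Return (kind, normalized) where kind is 'doi', 'compact' or 'unknown'."""
    text = raw.strip()
    candidate = _strip_prefix(text, DOI_PREFIXES)
    if DOI_PATTERN.match(candidate):
        return ("doi", candidate)
    candidate = _strip_prefix(text, IDORG_PREFIXES)
    if COMPACT_ID_PATTERN.match(candidate):
        return ("compact", candidate)
    return ("unknown", text)


def minimal_citation(repository_name, compact_id):
    """Fallback citation when only the repository name is known."""
    return f"{repository_name}, https://identifiers.org/{compact_id}."
```

For a made-up identifier such as examplerepo:ABC123, classify_identifier reports a compact identifier, and minimal_citation("Example Repository", "examplerepo:ABC123") produces a citation containing just the repository name and the link. That is about as minimal as a citation can get while still pointing somewhere.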

It’s worth noting that there are some good discovery portals for biomedical datasets that can be used as indirect sources of citation metadata, particularly DataMed and OmicsDI. OmicsDI provides Schema.org metadata, and at least some of its records are queryable through identifiers.org links and therefore can be used as metadata sources by my tool (e.g. PXD001416 via OmicsDI). DataMed provides metadata in the DATS format, but I haven’t yet figured out how to query it programmatically.

If you have read this far and still have no idea what this is all about, don’t worry. Scholarly citation can be a pretty arcane topic. Most people will have had to format a reference list at least once for a school report, but if you aren’t working in academic research it is probably hard to appreciate why those lists matter so much, and why there is squabbling over seemingly trivial points like exactly where one puts a period or how one truncates an author list. Reference lists are so important to researchers because they assign credit to earlier work, they help record the evidence supporting a paper, and, in the modern era, they get ingested into all kinds of web services that bind together related content and drive (oft-maligned) point systems that can affect a researcher’s career progression. Improving how we handle data citation is therefore an essential step in making data sharing a more fully recognized and rewarded part of research culture.

You may also note that, so far, I haven’t mentioned the ongoing coronavirus pandemic. If you find that refreshing, you’re welcome. If you find it insensitive, my apologies. I appreciate that this is a challenging time for many people, and that the above may seem pretty trivial in the face of a global crisis of this magnitude.

In fact, I managed to push this through to a working stage because I had a week of holiday that I spent at home due to the travel restrictions. I actually found writing bad code therapeutic during this stressful time. Don’t judge me.

Note: Data Citation Formatter is a personal project, and not a service provided by Springer Nature or Scientific Data.

Further reading

  1. Data Citation Synthesis Group. Joint Declaration of Data Citation Principles. FORCE11, https://doi.org/10.25490/a97f-egyk (2014).
  2. Ohno-Machado, L. et al. Finding useful data across multiple biomedical data repositories using DataMed. Nat. Genet. 49, 816–819, https://doi.org/10.1038/ng.3864 (2017).
  3. Perez-Riverol, Y. et al. Discovering and linking public omics data sets using the Omics Discovery Index. Nat. Biotechnol. 35, 406–409, https://doi.org/10.1038/nbt.3790 (2017).
  4. Wimalaratne, S. M. et al. Uniform resolution of compact identifiers for biomedical data. Sci. Data 5, 180029, https://doi.org/10.1038/sdata.2018.29 (2018).
  5. Cousijn, H. et al. A data citation roadmap for scientific publishers. Sci. Data 5, 180259, https://doi.org/10.1038/sdata.2018.259 (2018).
  6. Fenner, M. et al. A data citation roadmap for scholarly data repositories. Sci. Data 6, 31, https://doi.org/10.1038/s41597-019-0031-8 (2019).
  7. Parsons, M. A., Duerr, R. E. & Jones, M. B. The History and Future of Data Citation in Practice. Data Sci. J. 18, 52, http://doi.org/10.5334/dsj-2019-052 (2019).

This work is licensed under a Creative Commons Attribution 4.0 International License.