Questions of granularity – Dryad’s use of DataCite DOIs for data citation, and the Annotation Ontology

DataCite is an international organisation, founded in 2009, which promotes the use of DOIs (Digital Object Identifiers) for published datasets, in order to establish easier access to research data, to increase acceptance of research data as legitimate contributions in the scholarly record, and to support data archiving to permit results to be verified and re-purposed for future study.

Its founding members were the British Library; the Technical Information Center of Denmark; TU Delft Library; the National Research Council’s Canada Institute for Scientific and Technical Information (NRC-CISTI); California Digital Library; Purdue University; and the German National Library of Science and Technology. Since its foundation, it has been joined by several other leading organisations from around the world, and it therefore provides a stable basis for the ongoing use of DOIs for data.

The recent availability of DataCite DOIs for identifying data entities has made all the difference to data repositories wishing to give unique global identifiers to their data holdings. DOIs are widely recognised and respected throughout the academic world because of their widespread prior use for identifying journal articles, made possible by CrossRef.

However, in their recent discussion paper Data Citation and Linking, published on 8th June 2011, Alex Ball and Monica Duke of UKOLN at the University of Bath ask:

“At what granularity should data be made citable? If single datasets are given identifiers, what about collections of datasets, or subsets of data?”

Individual data files and metadata documents will, of course, have their own unique internal identifiers within any data repository, but may not have externally resolvable identifiers such as DOIs.  Practice varies.

This post explains how DOIs are employed in the Dryad Data Repository, which specializes in publishing data linked to peer-reviewed biological journal articles. Dryad’s approach is both elegant and addresses at least some of the issues raised by Alex and Monica.

The Dryad DOI usage policy is described at https://www.nescent.org/wg_dryad/DOI_Usage, and involves assigning unique DOIs to each version of every data package, and to each version of every data file, in a principled and easy-to-understand manner. In summary:

  • Each data package is given a DataCite DOI, which can be versioned by adding “.2”, “.3”, etc. after the original DOI to create new DOIs for new versions of the same data package.
  • Within each data package, each data file has a unique DOI defined by suffixing the data package DOI with “/1”, “/2”, etc., with versions indicated as for data packages.

Thus the third version of the second data file in the second version of a Dryad data package would have a DOI of the form doi:10.5061/dryad.1234.2/2.3.
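The naming scheme summarised above can be sketched in a few lines of Python. This is only an illustration of the pattern, not Dryad’s own code: the names DryadDoi, format_doi and parse_doi are hypothetical, the package identifier is assumed to be a dot-free alphanumeric string such as “1234”, and the first version is assumed to carry no suffix, since new versions are described as adding “.2”, “.3”, etc. to the original DOI.

```python
import re
from typing import NamedTuple, Optional


class DryadDoi(NamedTuple):
    """Hypothetical container for the parts of a Dryad-style DOI."""
    package: str                       # e.g. "10.5061/dryad.1234"
    package_version: int = 1           # assumption: 1 means the original, unsuffixed DOI
    file_number: Optional[int] = None  # None means the DOI names the whole package
    file_version: int = 1              # assumption: 1 means no file-version suffix


def format_doi(doi: DryadDoi) -> str:
    """Build the DOI string: package[.version][/file[.version]]."""
    text = doi.package
    if doi.package_version > 1:
        text += f".{doi.package_version}"
    if doi.file_number is not None:
        text += f"/{doi.file_number}"
        if doi.file_version > 1:
            text += f".{doi.file_version}"
    return "doi:" + text


# Assumption: the package identifier after "dryad." contains no dots,
# so a trailing ".N" can be read unambiguously as a version number.
_DOI_PATTERN = re.compile(
    r"^doi:(10\.5061/dryad\.[A-Za-z0-9]+)"  # data package
    r"(?:\.(\d+))?"                         # optional package version
    r"(?:/(\d+)(?:\.(\d+))?)?$"             # optional file number and file version
)


def parse_doi(text: str) -> DryadDoi:
    """Split a Dryad-style DOI string back into its parts."""
    match = _DOI_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a Dryad-style DOI: {text!r}")
    package, pkg_version, file_number, file_version = match.groups()
    return DryadDoi(
        package=package,
        package_version=int(pkg_version) if pkg_version else 1,
        file_number=int(file_number) if file_number else None,
        file_version=int(file_version) if file_version else 1,
    )


example = DryadDoi("10.5061/dryad.1234", package_version=2, file_number=2, file_version=3)
print(format_doi(example))
```

For the example in the text (third version of the second file in the second version of the package), format_doi gives doi:10.5061/dryad.1234.2/2.3, and parse_doi recovers the same four parts from that string.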

One might argue that it would result in an awfully large number of DOIs if a single data package was made up of thousands of data files. True, but numbers themselves are limitless and free, and the cost of a DataCite DOI is small relative to the cost of data creation and preservation. The real problem at present is lack of identifiable, citable data entities within repositories – to have so many that the cost of DOIs becomes an issue should be regarded as an achievement, not a problem!

Dryad does not have a mechanism for assigning identifiers to a portion of a data file (“a subset of data”), and DOIs are probably not the correct identifiers for that purpose, since they are primarily designed for citation and resource discovery.

A more appropriate method for identifying portions of a data file, or of any other digital object or document, is to use the Annotation Ontology (AO) developed by Paolo Ciccarese of Harvard University, described at http://code.google.com/p/annotation-ontology/wiki/Homepage. AO can be used to identify and annotate portions of a wide variety of resources such as HTML, PDF, Word, Excel, XML documents, images, videos, databases, web services, experimental data and metadata files. Paolo is currently working with a group at Harvard that focuses on biodiversity, who are using AO to address databases and data, and he anticipates publishing version 2.0 of AO in September.


4 Responses to Questions of granularity – Dryad’s use of DataCite DOIs for data citation, and the Annotation Ontology

  1. Pingback: JISC Open Citations Project – Final Project Blog Post | JISC Open Citations

  2. When it comes to versions, one could imagine leveraging the HTTP rendering of a DOI (i.e. http://dx.doi.org/…) alongside a datetime. Using Memento, one could actually arrive at the correct version. See, for example, http://arxiv.org/abs/1003.3661.

    When it comes to annotation:
    – As far as I understand, Annotation Ontology does not explicitly provide means to identify portions of resources; it provides hooks to do so. How those hooks are implemented for different types of resources, and for different mime types remains to be made explicit.
    – The same is true for the similar Open Annotation work, see http://www.openannotation.org/ and http://www.openannotation.org/spec/alpha3/ . Via the notion of Constraints, segments of resources can be identified. How exactly that is done needs to be specified, depending on resource type and mime type.

    Overall, I think that segment identification is an issue that the DOI community did not have to deal with in the days when DOIs were mainly used for papers. And, in order for DOIs to be really convincing as a means to identify datasets, it is an issue that needs to be resolved.

  3. Andrew Gilmartin says:

    Note that the CrossRef deposit schema does allow for a database and/or a dataset to have zero or more components. Perhaps using components is a way of collecting a set of resources into an identifiable object. See http://www.crossref.org/schema/documentation/4.3.0/schema-deposit.html#id168

  4. Dave Bridges says:

    Will this also apply to figures? If I wanted to refer to a specific figure panel, it would be nice to be able to link to something like doi:10.5061/dryad.1234.2/f.2c or something. That might make following references easier than only focusing on datasets.
