JISC Open Citations Project – Final Project Blog Post

Executive summary

Introduction

To general readers of this blog, this post will appear different from normal posts. Rather than being about a particular topic, it pulls together a summary of the work undertaken over the past year within the Open Citations Project supported by the JISC, and is primarily intended to assist JISC evaluation of the project and its outputs. Details of the work undertaken and the outputs from this project have mostly been described in previous blog posts, to which this post will frequently refer.

Project scope and purpose

The Open Citations Project is global in scope, designed to change the face of scientific publishing and scholarly communication. Specifically, it aims to make it possible to publish bibliographic information in RDF and to make citation links as easy to traverse as Web links.

Project aims

To achieve this goal, we have had four primary aims:

  • To create a semantic infrastructure that makes possible the description of citations, references and bibliographic entities in RDF, since we found existing ontologies inadequate for our purpose.
  • To extend that semantic infrastructure to handle data citations and data entities, as well as bibliographic citations and bibliographic entities, mindful of Philip Bourne’s prediction that soon there will be no meaningful difference between a journal article and a database entry.
  • To provide exemplars of how these ontologies can be applied to real-world data, by creating mappings from existing encodings to RDF, and by creating RDF metadata relating to bibliographic and data entities and their citations.
  • To convert the reference lists within all the PMC Open Access subset articles to RDF, and to publish them as open linked data that third parties can use in novel ways.

Principal deliverables and outputs

  • The SPAR (Semantic Publishing and Referencing) Ontologies.
  • Graffoo and LODE, two novel tools for ontology visualization and documentation.
  • Mappings of various existing metadata schemes to RDF using SPAR.
  • Development of data citation methods and protocols.
  • The Open Citations Blog in which activities and outputs are described.
  • The Open Citations Corpus of bibliographic citation data encoded in RDF and published as Open Linked Data.
  • The OpenCitations.net web site, to provide user access to the Open Citations Corpus.
  • The Open Citations Project software used for processing the PubMed Central Open Access corpus into Open Linked Data.

The net result is open citation data from life science journal articles available on the web, for use by academics, for citation network analysis applications, and for tracking the impact of research grant funding.

Primary beneficiaries

  • Scholars worldwide, particularly in the biomedical sciences, by providing better access to bibliographic and citation data.
  • Academic publishers and repository managers, by providing a semantic infrastructure and tools to enable their outputs and holdings to join the semantic web of open linked data.

Background

In 2008, Katie Portwin and I had an enjoyable summer ‘souping up’ a PLoS Neglected Tropical Diseases article by Reis et al. (2008) [1] that I had downloaded as an XML file from the journal web site in late April, one week after it had been published. The resulting enhanced publication, available here, became an exemplar of what is possible in the realm of semantic publishing, while undertaking that work was very influential in shaping the course of my more recent activities.

One of the things we undertook was to mark up the reference list with annotations that clarified the nature of each cited entity (e.g. book, journal article, medical report) and the reason the authors had cited it (used data from, obtained background from, extended, etc.) – annotations that we took care to verify with the authors themselves before publishing them!

While we undertook that work manually, it quickly became apparent that what we needed was an ontology from which a controlled vocabulary of such terms could be used to create both human- and machine-readable metadata describing the citations and the cited entities. We therefore developed a draft ontology that we subsequently split to form the basis of the first two ontologies of the suite of SPAR (Semantic Publishing and Referencing) Ontologies described elsewhere on this blog, namely CiTO, the Citation Typing Ontology to describe the relationships between the citing and cited entities, and FaBiO, the FRBR-aligned Bibliographic Ontology, to describe the cited entities themselves.

Using these tools, we were able to mark up the reference list from Reis et al., and publish it as RDF.
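As a rough illustration of what such markup amounts to, the following Python sketch prints two RDF statements in N-Triples form: one typing a cited entity with a FaBiO class, and one recording the reason for the citation with a CiTO property. It is a minimal sketch rather than the code we actually used. The two article IRIs are hypothetical placeholders, and the helper names `triple` and `describe_citation` are invented for this example; only the ontology namespaces and the terms `fabio:JournalArticle` and `cito:usesDataFrom` come from the SPAR ontologies themselves.

```python
# Namespaces of the two SPAR ontologies mentioned above.
CITO = "http://purl.org/spar/cito/"
FABIO = "http://purl.org/spar/fabio/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def triple(subject, predicate, obj):
    """Serialize one triple of IRIs as a single N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe_citation(citing, cited, cited_type, reason):
    """Return two N-Triples lines: the type of the cited entity (FaBiO)
    and the typed citation linking citing to cited entity (CiTO)."""
    return [
        triple(cited, RDF_TYPE, FABIO + cited_type),
        triple(citing, CITO + reason, cited),
    ]


# Hypothetical identifiers, for illustration only.
citing_article = "http://example.org/article/citing"
cited_article = "http://example.org/article/cited"

lines = describe_citation(
    citing_article, cited_article, "JournalArticle", "usesDataFrom"
)
print("\n".join(lines))
```

Swapping `usesDataFrom` for another CiTO property such as `obtainsBackgroundFrom` or `extends` records a different reason for the same citation.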

From there it was but a small step to dream of the day when the references from all biomedical research articles would be published as open linked data, and to think what we could do to make that dream a reality.

And it was obvious where to start – with the Open Access subset of journal articles available in PubMed Central (PMC), all nicely marked up in XML using the National Library of Medicine DTD.
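To give a flavour of that starting point, here is a minimal Python sketch that pulls the reference list out of a small, hand-written XML fragment in the style of the NLM markup. It is not the project's processing pipeline. The sample record is invented, the function name `extract_references` is hypothetical, and the set of citation element names it looks for is an assumption about the DTD versions likely to be met in practice.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written fragment in the style of NLM article markup.
# The content is invented for illustration.
SAMPLE = """
<article>
  <back>
    <ref-list>
      <ref id="B1">
        <citation citation-type="journal">
          <article-title>An example cited article</article-title>
          <source>Example Journal</source>
          <year>2007</year>
          <pub-id pub-id-type="pmid">00000000</pub-id>
        </citation>
      </ref>
      <ref id="B2">
        <citation citation-type="book">
          <source>An example cited book</source>
          <year>2005</year>
        </citation>
      </ref>
    </ref-list>
  </back>
</article>
"""

# Assumed element names for a citation in different DTD versions.
CITATION_TAGS = ("citation", "nlm-citation", "element-citation", "mixed-citation")


def extract_references(xml_text):
    """Return one dict per <ref> element found in the article XML."""
    root = ET.fromstring(xml_text.strip())
    refs = []
    for ref in root.iter("ref"):
        # Pick the child that holds the citation, skipping e.g. <label>.
        citation = next((c for c in ref if c.tag in CITATION_TAGS), None)
        if citation is None:
            continue
        # Look for a PubMed identifier among the <pub-id> children.
        pmid = None
        for pub_id in citation.iter("pub-id"):
            if pub_id.get("pub-id-type") == "pmid":
                pmid = pub_id.text
        refs.append({
            "id": ref.get("id"),
            "type": citation.get("citation-type"),
            "title": citation.findtext("article-title"),
            "source": citation.findtext("source"),
            "year": citation.findtext("year"),
            "pmid": pmid,
        })
    return refs


refs = extract_references(SAMPLE)
for r in refs:
    print(r["id"], r["type"], r["source"], r["year"], r["pmid"])
```

Each dictionary returned here is the kind of record that could then be expressed as RDF using the ontologies described above.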

Table of contents

The various aspects of the Open Citations Project and its outputs are described in the following blog posts, complete with diagrams, data tables, and screen shots where appropriate. These are organized into the following set of distinct topics:

  • Standards
  • The SPAR ontologies for bibliographic and data entities and their citations
  • Graffoo and LODE – tools for ontology visualization and documentation
  • Third-party applications of our ontologies
  • Mappings to the SPAR ontologies, and exemplar RDF encodings
  • Development of data citation methods and protocols
  • The creation of the Open Citations Corpus of linked bibliographic citation data

1 Standards

Advantages of Ontological Standards in Scholarly Publishing

Nomenclature for citations and references

2 The SPAR ontologies for categorizing bibliographic and data entities and their citations

The SPAR ontologies described in the following blog posts were developed jointly with Silvio Peroni, a brilliant graduate student from the University of Bologna, who spent the last six months of 2010 working with me as an intern in Oxford, where he became an honorary member of the Open Citations Project, contributing very significantly to our achievements. His supervisor Fabio Vitali and the Department of Information Science at the University of Bologna are to be congratulated and thanked for their enlightened requirement that all their graduate students spend an internship overseas, since without Silvio’s collaboration and great skill, much of this development would not have been possible within the available time, if at all.

Introducing the Semantic Publishing and Referencing (SPAR) Ontologies

New web site for the SPAR ontologies

Functional clustering of CiTO properties

Extending FRBR within FaBiO

Categorising bibliographic resources with FaBiO and SKOS

CiTO4Data – a new data-centric citation typing ontology

Using FaBiO to describe data entities

3 Graffoo and LODE – tools for ontology visualization and documentation

These tools have been developed by Silvio Peroni, an honorary member of the Open Citations Project, as explained above.

Graffoo, a Graphical Framework for OWL Ontologies

Using LODE for ontology visualization

4 Third-party applications of our ontologies

Our work to develop a standard semantic infrastructure for bibliographic and data entities and their citation is new. Nevertheless, we have received encouraging responses when we have presented our work at international publishing venues such as the 2010 ALPSP Conference and the 2010 STM Innovation Conference.

Apart from local applications at the University of Oxford, the University of Bologna, and the University of Manchester (for the Utopia Project), the SPAR ontologies have been adopted at Harvard University both to complement SWAN (Semantic Web Applications in Neuromedicine) and to mark up astrophysics data (Accomazzi and Dave (2011) Semantic Interlinking of Resources in the Virtual Observatory Era. arXiv:1103.5958). In addition, we have expressions of interest from PLoS, Nature and Il Mulino, a major academic publisher in Italy, who are looking to improve their metadata encoding as RDF. We are also interacting with the British Library in mapping the DataCite Metadata Kernel to RDF (see below), and with the Dryad Data Repository in mapping Dryad metadata to RDF and, as part of the JISC Dryad-UK Project, in developing MIIDI and MIIDI-structured RDF metadata for infectious disease papers and datasets, using SPAR ontologies where appropriate, with the aim of permitting authors to submit rich metadata to Dryad.

The following blog posts describe uptake and use of CiTO in CiteULike and WordPress.

Use of CiTO in CiteULike

How to employ CiTO in CiteULike

Using CiTO in WordPress

5 Mappings to the SPAR ontologies, and exemplar RDF encodings

Comparison of BIBO and FaBiO

BIBO2SPAR, an RDF Mapping of BIBO to the SPAR Ontologies

DataCite2RDF – Mapping DataCite Metadata Scheme Terms to ontologies

6 Development of data citation methods and protocols

Nomenclature for data publications and citations

Questions of granularity – Dryad’s use of DataCite DOIs for data citation

How to cite data

Pensoft Journals policy and author guidelines on data publication and citation

7 The creation of the Open Citations Corpus of linked bibliographic citation data

This achievement is almost entirely the result of the excellent work of our chief data wrangler Alex Dutton, whose skill and natural feel for linked data have done wonders for this project.

The following set of blog posts describes the starting corpus from PubMed Central, our transformation of it to RDF, the problems we encountered along the way, the resulting Open Citations Corpus, and the potential uses to which the resulting open citation data can now be put.

Input data for Open Citations – the PMC Open Access Subset

Garbage in, garbage out – problems with bibliographic references

Who wrote this paper? Author list problems in PubMed Central references

Citation correction methods

The citation processing pipeline and the Open Citations Corpus

JISC Open Citations Project web site

Like a kid with a new train set! Exploring citation networks


JISC Administrative Data for Open Citations Project

This information, extracted from the JISC Expo DOAP (Description of a Project) spreadsheet, is to be found in a separate blog post, here.

The Future

While this is the formal Final Blog Post for the JISC-funded Open Citations Project, which was funded for a year from 1st July 2010, our work is not yet finished. We cherish grand ideas for the liberation of the reference lists from all scholarly journal articles, using the Open Citations Corpus as an exemplar, in collaboration with publishers and organizations such as CrossRef who handle such citation data on behalf of publishers on a daily basis.

This work will only be finished when it is no longer up to an individual academic research group to take on the task of citation liberation, but when each publisher publishes the citation data from each of their journal articles as open linked data on their own web sites, marked up using the agreed ontological standards that we have proposed, and freely available for scholars around the world, from Bangladesh to Zimbabwe, and from Holland to New Zealand, to use and explore, independent of their ability to afford subscription access to the journal articles from which the citations are made.
