To general readers of this blog, this post will appear different from normal posts. Rather than being about a particular topic, it pulls together a summary of the work undertaken over the past year within the Open Citations Project supported by the JISC, and is primarily intended to assist JISC evaluation of the project and its outputs. Details of the work undertaken and the outputs from this project have mostly been described in previous blog posts, to which this post will frequently refer.
Project scope and purpose
The Open Citations Project is global in scope, designed to change the face of scientific publishing and scholarly communication. Specifically, it aims to make it possible to publish bibliographic information in RDF and to make citation links as easy to traverse as Web links.
To achieve this goal, we have had four primary aims:
- To create a semantic infrastructure that makes possible the description of citations, references and bibliographic entities in RDF, since we found existing ontologies inadequate for our purpose.
- To extend that semantic infrastructure to handle data citations and data entities, as well as bibliographic citations and bibliographic entities, mindful of Philip Bourne’s prediction that soon there will be no meaningful difference between a journal article and a database entry.
- To provide exemplars of how these ontologies can be applied to real-world data, by creating mappings from existing encodings to RDF, and by creating RDF metadata relating to bibliographic and data entities and their citations.
- To convert the reference lists within all the PMC Open Access subset articles to RDF, and their publication as open linked data that third parties can use in novel ways.
Principal deliverables and outputs
- The SPAR (Semantic Publishing and Referencing) Ontologies.
- Graffoo and LODE, two novel tools for ontology visualization and documentation.
- Mappings of various existing metadata schemes to RDF using SPAR.
- Development of data citation methods and protocols.
- The Open Citations Blog in which activities and outputs are described.
- The Open Citations Corpus of bibliographic citation data encoded in RDF and published as Open Linked Data.
- The Open Citations Project software used for processing the PubMed Central Open Access corpus into Open Linked Data.
The net result is open citation data from life science journal articles available on the web, for utilization by academics, for citation network analysis applications, and for tracking the impact of research grant funding.
- Scholars worldwide, particularly in the biomedical sciences, by providing better access to bibliographic and citation data.
- Academic publishers and repository managers, by providing a semantic infrastructure and tools to enable their outputs and holdings to join the semantic web of open linked data.
In 2008, Katie Portwin and I had an enjoyable summer ‘souping up’ a PLoS Neglected Tropical Diseases article by Reis et al. (2008) that I had downloaded as an XML file from the journal web site in late April, one week after it had been published. The resulting enhanced publication, available here, became an exemplar of what is possible in the realm of semantic publishing, while undertaking that work was very influential in shaping the course of my more recent activities.
One of the things we undertook was to mark up the reference list with annotations that clarified the nature of the cited entity (e.g. book, journal article, medical report) and the reason the authors had cited those entities (used data from, obtained background from, extended, etc.) – annotations that we took care to verify with the authors themselves before publishing them!
While we undertook that work manually, it quickly became apparent that what we needed was an ontology from which a controlled vocabulary of such terms could be used to create both human- and machine-readable metadata describing the citations and the cited entities. We therefore developed a draft ontology that we subsequently split to form the basis of the first two ontologies of the suite of SPAR (Semantic Publishing and Referencing) Ontologies described elsewhere on this blog, namely CiTO, the Citation Typing Ontology to describe the relationships between the citing and cited entities, and FaBiO, the FRBR-aligned Bibliographic Ontology, to describe the cited entities themselves.
Using these tools, we were able to mark up the reference list from Reis et al., and publish it as RDF.
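To give a flavour of what such markup looks like, here is a minimal sketch in Python that writes one typed citation as N-Triples. It is not the code we used for Reis et al.; the two `example.org` URIs and the helper names `triple` and `describe_citation` are hypothetical, chosen only for illustration. The CiTO and FaBiO namespaces and the terms `cito:cites`, `cito:usesDataFrom` and `fabio:JournalArticle` are taken from the published ontologies.

```python
# Namespaces of the SPAR ontologies used in this sketch.
CITO = "http://purl.org/spar/cito/"
FABIO = "http://purl.org/spar/fabio/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def triple(subject, predicate, obj):
    """Serialize one triple of URIs as a line of N-Triples."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe_citation(citing, cited, cited_type, reason):
    """Return triples stating what the cited entity is (FaBiO)
    and why the citing entity cites it (CiTO)."""
    return [
        triple(cited, RDF_TYPE, FABIO + cited_type),
        triple(citing, CITO + "cites", cited),
        triple(citing, CITO + reason, cited),
    ]


# Hypothetical URIs standing in for a citing and a cited article.
citing = "http://example.org/article/citing"
cited = "http://example.org/article/cited"

lines = describe_citation(citing, cited, "JournalArticle", "usesDataFrom")
print("\n".join(lines))
```

Swapping `"usesDataFrom"` for another CiTO property such as `"obtainsBackgroundFrom"` or `"extends"` records a different reason for the same citation, which is the kind of distinction we annotated by hand.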
From there it was but a small step to dream of the day when the references from all biomedical research articles would be published as open linked data, and to think what we could do to make that dream a reality.
And it was obvious where to start – with the Open Access subset of journal articles available in PubMed Central (PMC), all nicely marked up in XML using the National Library of Medicine DTD.
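As a rough illustration of why that XML is such a convenient starting point, the following sketch pulls the reference identifiers out of a reference list using only the Python standard library. It is an assumption-laden toy, not the project's processing software: the sample document is invented, the function name `extract_references` is hypothetical, and it relies only on the `ref-list`, `ref` and `pub-id` elements of the NLM DTD.

```python
import xml.etree.ElementTree as ET

# Invented sample in the style of the NLM DTD; titles and IDs are placeholders.
SAMPLE = """<article>
  <back>
    <ref-list>
      <ref id="B1">
        <citation citation-type="journal">
          <article-title>A hypothetical cited article</article-title>
          <pub-id pub-id-type="pmid">00000001</pub-id>
        </citation>
      </ref>
      <ref id="B2">
        <citation citation-type="book">
          <source>A hypothetical cited book</source>
        </citation>
      </ref>
    </ref-list>
  </back>
</article>"""


def extract_references(xml_text):
    """Return one dict per <ref>, holding its id and any <pub-id> values."""
    root = ET.fromstring(xml_text)
    refs = []
    for ref in root.iter("ref"):
        # Collect identifiers regardless of which citation element wraps them.
        pub_ids = {
            p.get("pub-id-type"): (p.text or "").strip()
            for p in ref.iter("pub-id")
        }
        refs.append({"id": ref.get("id"), "pub_ids": pub_ids})
    return refs


references = extract_references(SAMPLE)
for r in references:
    print(r["id"], r["pub_ids"])
```

References without any identifier, like the second one here, are exactly the cases that need more work before they can be linked to the entity they cite.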
Table of contents
The various aspects of the Open Citations Project and its outputs are described in the following blog posts, complete with diagrams, data tables, and screen shots where appropriate. These are organized into the following set of distinct topics:
- The SPAR ontologies for bibliographic and data entities and their citations
- Graffoo and LODE – tools for ontology visualization and documentation
- Third-party applications of our ontologies
- Mappings to the SPAR ontologies, and exemplar RDF encodings
- Development of data citation methods and protocols
- The creation of the Open Citations Corpus of linked bibliographic citation data
2 The SPAR ontologies for categorizing bibliographic and data entities and their citations
The SPAR ontologies described in the following blog posts were developed jointly with Silvio Peroni, a brilliant graduate student from the University of Bologna, who spent the last six months of 2010 working with me as an intern in Oxford, where he became an honorary member of the Open Citations Project, contributing very significantly to our achievements. His supervisor Fabio Vitali and the Department of Information Science at the University of Bologna are to be congratulated and thanked for their enlightened requirement that all their graduate students spend an internship overseas, since without his collaboration and great skill, much of this development would not have been possible within the available time, if at all.
3 Graffoo and LODE – tools for ontology visualization and documentation
These tools have been developed by Silvio Peroni, an honorary member of the Open Citations Project, as explained above.
4 Third-party applications of our ontologies
Our work to develop a standard semantic infrastructure for bibliographic and data entities and their citation is new. Nevertheless, we have received encouraging responses when we have presented our work at international publishing venues such as the 2010 ALPSP Conference and the 2010 STM Innovation Conference.
Apart from local applications at the University of Oxford, the University of Bologna, and the University of Manchester (for the Utopia Project), the SPAR ontologies have been adopted by Harvard University both to complement SWAN (Semantic Web Applications in Neuromedicine) and to mark up astrophysics data (Accomazzi and Dave (2011) Semantic Interlinking of Resources in the Virtual Observatory Era. arXiv:1103.5958). We also have expressions of interest from PLoS, Nature and Il Mulino, a major academic publisher in Italy, who are looking to improve their metadata encoding as RDF. We are also interacting with the British Library in mapping the DataCite Metadata Kernel to RDF (see below), and with the Dryad Data Repository in mapping Dryad metadata to RDF and, as part of the JISC Dryad-UK Project, in developing MIIDI and MIIDI-structured RDF metadata for infectious disease papers and datasets, using SPAR ontologies where appropriate, with the aim of permitting authors to submit rich metadata to Dryad.
The following blog posts describe uptake and use of CiTO in CiteULike and WordPress.
5 Mappings to the SPAR ontologies, and exemplar RDF encodings
6 Development of data citation methods and protocols
7 The creation of the Open Citations Corpus of linked bibliographic citation data
This achievement is almost entirely the result of the excellent work of our chief data wrangler Alex Dutton, whose skill and natural feel for linked data have done wonders for this project.
The following set of blog posts describes the starting corpus from PubMed Central, our transformation of it to RDF, the problems we encountered along the way, the resulting Open Citations Corpus, and the potential uses to which the resulting open citation data can now be put.
JISC Administrative Data for Open Citations Project
This information, extracted from the JISC Expo DOAP (Description of a Project) spreadsheet, is to be found in a separate blog post, here.
While this is the formal Final Blog Post for the JISC-funded Open Citations Project, which was funded for a year from 1st July 2010, our work is not yet finished. We cherish grand ideas for the liberation of the reference lists from all scholarly journal articles, using the Open Citations Corpus as an exemplar, in collaboration with publishers and organizations such as CrossRef who handle such citation data on behalf of publishers on a daily basis.
This work will only be finished when it is no longer up to an individual academic research group to take on the task of citation liberation, but when each publisher publishes the citation data from each of their journal articles as open linked data on their own web sites, marked up using the agreed ontological standards that we have proposed, and freely available for scholars around the world, from Bangladesh to Zimbabwe, and from Holland to New Zealand, to use and explore, independent of their ability to afford subscription access to the journal articles from which the citations are made.