Following the decision to merge the planned development of the Open Citations Corpus with the Related Work Project, described in the previous blog post, we proposed to develop one underlying data store, one data model and one RDF representation.
We are presently creating a new OpenStack server environment on one of our Zoology Department Web servers. It will host the data storage system, to be called the Open Citations Corpus Datastore (OCCD); the reference processing software; and the web server providing web sites, blog hosting and the Related Work user interfaces to the underlying data within the new, enhanced Open Citations Corpus. This environment will be available for use shortly.
We anticipate three primary data streams into the new expanded Open Citations Corpus, from PubMed Central, from ArXiV and from CrossRef.
PubMed Central has more than doubled the number of its Open Access full-text papers since they were last harvested into the Open Citations Corpus early in 2011. There are now over 530,000 articles in the Open Access Subset, the majority of which are new for the Open Citations Corpus. These contain approx. 16 million references to approx. 9 million individual cited papers. The PubMed Central articles have rich metadata, marked up using the National Library of Medicine DTD v3. The current Open Access Corpus from PubMed Central has now been downloaded, prior to processing.
ArXiV has approximately 750,000 papers, containing references to approx. 16 million papers, about 2 million of which are within ArXiV itself. The papers are held in LaTeX format, and are accompanied by basic Dublin Core metadata. The ArXiV corpus was downloaded in October, and will be updated in the near future.
CrossRef holds the reference lists for approx. 18.8 million articles provided by publishers participating in the CrossRef ‘CitedBy Linking’ service. These are marked up in XML according to the CrossRef data schema, which differs from the NLM DTD. Access to these reference lists via CrossRef will require specific permission from the publishers concerned, which is currently being sought, as detailed in a recent blog post. Trial downloads of CrossRef-derived reference lists will begin shortly.
New data model
The metadata requirements for the revised Open Citations Corpus, closely based on the data model originally developed by Alex Dutton during the initial JISC Open Citations Project, are now available online at http://goo.gl/eeDe8.
One task that we have already undertaken is to determine, for each of the anticipated data input streams, which metadata entities (e.g. title) are already clearly marked up and can be recognised automatically in the input data – these we call ‘raw’ metadata terms – and which (e.g. author affiliation) will have to be deduced or parsed out from the text of the input source – these we call ‘derived’ metadata terms. A spreadsheet showing these mappings is available at http://goo.gl/XTlQ1.
The native format chosen for storing the reference data within the OCCD is BibJSON. The mapping from BibJSON to RDF, currently being finalized, will closely reflect the RDF metadata mapping already in use for the original Open Citations Corpus. BibServer, which handles BibJSON natively, will be used as the underlying storage technology for the OCCD.
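To give a feel for what a BibJSON-to-RDF mapping involves, here is a minimal Python sketch that turns one BibJSON-style record into N-Triples lines. It is an illustration only, not the project's actual mapping: the record contents, the "citations" key, the DOI-based URIs and the choice of the dcterms:title and cito:cites predicates are all assumptions made for this example.

```python
import json

# Hypothetical BibJSON-style record. The "author" and "identifier" shapes
# follow common BibJSON conventions; the "citations" key is an assumption.
record = {
    "type": "article",
    "title": "An example citing article",
    "year": "2012",
    "author": [{"name": "A. Author"}],
    "identifier": [{"type": "doi", "id": "10.1234/example.1"}],
    "citations": [
        {"identifier": [{"type": "doi", "id": "10.1234/example.2"}]},
    ],
}

# Predicates assumed for this sketch (Dublin Core Terms and CiTO).
DCTERMS_TITLE = "http://purl.org/dc/terms/title"
CITO_CITES = "http://purl.org/spar/cito/cites"


def doi_uri(rec):
    """Return a DOI-based URI for a record, or None if it has no DOI."""
    for ident in rec.get("identifier", []):
        if ident.get("type") == "doi":
            return "http://dx.doi.org/" + ident["id"]
    return None


def to_ntriples(rec):
    """Emit one title triple and one 'cites' triple per cited DOI."""
    subject = doi_uri(rec)
    if subject is None:
        return []  # this sketch only handles records that carry a DOI
    lines = []
    if "title" in rec:
        # json.dumps yields a double-quoted, escaped literal, which is
        # adequate for simple titles in N-Triples.
        literal = json.dumps(rec["title"], ensure_ascii=False)
        lines.append(f"<{subject}> <{DCTERMS_TITLE}> {literal} .")
    for cited in rec.get("citations", []):
        target = doi_uri(cited)
        if target is not None:
            lines.append(f"<{subject}> <{CITO_CITES}> <{target}> .")
    return lines


for line in to_ntriples(record):
    print(line)
```

A real mapping would cover many more fields (authors, journal, dates) and would need a policy for minting identifiers for records that have no DOI.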
The existing data processing pipeline developed for the Open Citations Corpus is currently being reviewed so that it can handle the new data, and so that it can be modularized and fitted into a messaging system. This will allow the various tasks that have to be performed at import to be executed simultaneously and their status to be controlled.
Technical developments on the unified data store and the unified processing pipeline will be undertaken in close collaboration between Cottage Labs and Related Work staff; this collaboration has already commenced. The import tasks include:
- Zip extraction.
- Checking the incoming data for each article from each source for completeness and correctness according to the supplier’s data schema.
- Extraction of the article’s bibliographic metadata and its reference list.
- Conversion from the input format into the common BibJSON format.
- Data clean-up – correction of errors within the references.
- Data augmentation, by use of Web services accessing information from PubMed, CrossRef and other authorities for bibliographic records and DOI identifiers. (Well over half of the references in published reference lists lack DOIs; where a DOI exists, it will be found and added.)
- Matching of citation targets: many papers cite the same target article in slightly different ways. We will determine whether or not references are to a unique target, and then ensure that the correct bibliographic record for that cited article is used.
- Conversion of the BibJSON records to RDF using an XSLT transformation.
- Authoritative versions of semantically described bibliographic records for the citing and cited articles, together with the citation links, will then be held in both BibJSON and RDF formats in the Open Citations Corpus Datastore, to be used by front-end Related Work services.
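The matching of citation targets in the list above can be illustrated with a small Python sketch. It groups references under a shared key, using the DOI where one is present and otherwise a normalised title plus year. The field names ("doi", "title", "year"), the sample references and the normalisation rule are assumptions for this example; the real pipeline may well match on different evidence.

```python
import re
from collections import defaultdict


def match_key(ref):
    """Build a matching key: the DOI if present, else normalised title + year."""
    doi = ref.get("doi")
    if doi:
        return "doi:" + doi.strip().lower()
    # Lower-case the title and collapse punctuation and whitespace runs.
    title = re.sub(r"[^a-z0-9]+", " ", ref.get("title", "").lower()).strip()
    return f"title:{title}|{ref.get('year', '')}"


def group_by_target(refs):
    """Group references that appear to point at the same cited article."""
    groups = defaultdict(list)
    for ref in refs:
        groups[match_key(ref)].append(ref)
    return dict(groups)


# Invented references: two spellings of one title, and two records that
# share a DOI but differ in title and DOI capitalisation.
refs = [
    {"title": "Example Paper on Citation Networks.", "year": "2010"},
    {"title": "example paper on  citation networks", "year": "2010"},
    {"title": "A Different Paper", "year": "2001", "doi": "10.1234/ABC"},
    {"title": "A different paper (reprint)", "year": "2001", "doi": "10.1234/abc"},
]

groups = group_by_target(refs)
for key, members in groups.items():
    print(key, len(members))
```

Exact keys like these only catch trivial variations such as case and punctuation. References that differ in abbreviations or word order need approximate matching on top.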
The bibliographic reference information will then be indexed and held in an Elastic Search system, which deals natively with JSON data. A variety of indexes will be built, to provide rapid response times for the user-facing services.
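Because the records are already JSON, preparing them for indexing is mostly a matter of serialisation. The sketch below, in Python, builds a request body in the newline-delimited format used by the Elasticsearch bulk API: an action line followed by the document itself. The index name "occ-records" and the record contents are invented for this example, the metadata fields accepted in the action line vary between Elasticsearch versions, and no request is actually sent.

```python
import json


def bulk_body(records, index_name):
    """Serialise (id, record) pairs into a newline-delimited bulk body."""
    lines = []
    for rec_id, rec in records:
        # Action line: tells the bulk API where to index the next document.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": rec_id}}))
        # Source line: the BibJSON record itself.
        lines.append(json.dumps(rec))
    # The bulk format expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical records and index name.
records = [
    ("a1", {"title": "An example citing article", "year": "2012"}),
    ("a2", {"title": "An example cited article", "year": "2009"}),
]
body = bulk_body(records, "occ-records")
print(body)
```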
The interface provided to data in the original Open Citations Corpus was limited in functionality.
During the short Open Citations Extension Project, which extends into 2013, we will experiment with two sets of user interfaces, both of which will be developed and maintained over the new data. This will permit a ‘show and tell’ comparison at the end of the project, before we decide which of these demonstration interfaces to retain for the on-going service.
First, the BibServer software developed to use Elastic Search by Cottage Labs during the JISC Open Bibliography Project will be employed to provide faceted search and browse over the data.
Additionally, the user interfaces currently under development for Related Work will be employed, using additional text-mining activities ‘under the hood’. These will have an emphasis on social media interactions, with personalized auto-completion and instant filtering of reference lists. This will permit the user to move from a selected paper to semantically related papers, to find other papers by the same author, to look at co-authorship and self-citation patterns, etc.
Finally, we hope to develop some time-line visualizations of citation networks.
We aim to provide a clean separation between the underlying data and the services built on top, using RESTful APIs, providing maximum flexibility for future development.
For the time being, until our new combined project server becomes available, the existing Open Citations Corpus and Related Work web sites will be kept active on their existing separate servers, but will eventually be phased out.
At present, each project (Open Citations and Related Work) has its own blog. This will change. I will open this blog to contributions from other members of the Open Citations and Related Work team, and will rename it Open Citations and Related Work. In addition, I will create a new blog, named Semantic Publishing and available at http://semanticpublishing.wordpress.com/, and will use that to host all new posts concerning semantic publishing and related topics that do not specifically relate to our joint open citations work. Previous semantic publishing posts within this blog will also be copied to the new Semantic Publishing blog.
Further development of Related Work and the Open Citations Corpus will be dependent upon additional funding, for which we will apply early this year.
There are two developments that seem particularly beneficial to pursue in the near future:
- Improved matching of the bibliographic reference strings that appear in reference lists without DOIs or other identifiers, so that the correct bibliographic record for the cited paper can be associated with each reference string. At present, only two million out of the 16 million reference targets in the ArXiV data are correctly matched in this way.
- User generated annotations, particularly provision of CiTO-based citation typing to characterize why one paper cites another. Tanya Gray has already created a plug-in to Chrome that permits this over native PubMed Central articles, and we are seeking to adapt this to use as a Related Work service with any browser.
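As a rough illustration of the first development, the Python sketch below scores a free-text reference string against candidate bibliographic records, using the similarity ratio from the standard library's difflib. It is one simple approach among many and is not what either project currently uses; the candidate records, their field names and the 0.6 threshold are assumptions chosen for this example.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and collapse punctuation so formatting differences matter less."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def best_match(reference_string, candidates, threshold=0.6):
    """Return (record, score) for the most similar candidate, or (None, score)."""
    ref = normalise(reference_string)
    best, best_score = None, 0.0
    for cand in candidates:
        target = normalise(f"{cand['author']} {cand['title']} {cand['year']}")
        score = SequenceMatcher(None, ref, target).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best_score < threshold:
        return None, best_score  # nothing similar enough to trust
    return best, best_score


# Invented candidate records for illustration.
candidates = [
    {"author": "Smith J", "title": "A study of widgets", "year": "2010"},
    {"author": "Jones K", "title": "Another topic entirely", "year": "2005"},
]

match, score = best_match("Smith, J. (2010). A study of widgets.", candidates)
print(match["title"] if match else None, round(score, 2))
```

Comparing every reference string against every record does not scale to millions of references, so a production system would first narrow the candidates, for example through an index lookup, and only then score them.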
While our initial data sources, ArXiV and PubMed Central, have been in the physical and biomedical sciences respectively, we hope to expand our services to people in the social sciences, humanities and arts, and to provide broad coverage of these other disciplines, if we can tap into open sources of bibliographic reference lists from publications in these areas.