The input PubMed Central Open Access subset XML reference data, our starting corpus, were transformed into Open Citations RDF in multiple stages:
- The original XML was first transformed into an intermediate form using XSLT. Different publishers have developed many ways of encoding the same information, and these are more easily handled by generating an intermediate XML dataset in which things are described more consistently. This also enables the resulting information to be parsed more easily from within a non-XML-based programming environment. Our transform pulled out information about articles, people, organisations, in-text reference pointers and the reference list, together with the links between them.
- The intermediate XML dataset was then transformed into BibJSON using a Python script. BibJSON is a relatively standard method of encoding bibliographic information. Each of the ~200,000 generated BibJSON datasets contains the information extracted from one marked-up Open Access article. We extended the standard BibJSON records with additional attributes (named with an ‘x-’ prefix) for other properties we wished to encode later as RDF. At this and later stages, the BibJSON datasets are packed into a single gzipped tarball. Since unpacking such a tarball into ~200,000 independent files would cause data management problems, the contents are extracted from the tarball as required using the Python tarfile module.
- Another Python script then extracted all the PubMed IDs and used them as inputs to the Entrez API, to retrieve independent information about the cited entities from the PubMed database. The returned PubMed records were then added alongside the original BibJSON records. These additional data were extremely useful for comparison when attempting to spot erroneous citations, as previously described.
- Next we ran a ‘sanitization script’ over the data, which performed the following functions:
- URL normalization (e.g. adding URL schemes) and undoing character substitutions (e.g. en-dashes for hyphens, quotation marks for apostrophes).
- Splitting issue information from journal attributes.
- Fixing malformed DOIs (e.g. those missing the ‘10.’ prefix). Where DOIs could not be fixed they were removed.
- Pulling “doi:****” DOIs out of “http://dx.doi.org/****” URLs.
- Removing spurious publication dates (those before 1900 or after 2011).
These corrections are easily extensible if we discover other classes of error in the data.
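The kind of corrections listed above can be sketched in Python as follows. This is a minimal illustration and not the project's actual sanitization script: the function names, the regular expressions, the default of `http://` for a missing URL scheme, and the mapping of curly quotation marks back to apostrophes are all assumptions made for the example.

```python
import re

# Assumed shape of a well-formed DOI: "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
# Matches the resolver prefix of DOI URLs such as http://dx.doi.org/...
DOI_URL_PREFIX = re.compile(r"^https?://(?:dx\.)?doi\.org/", re.IGNORECASE)
# Matches a leading URL scheme such as "http://" or "ftp://".
URL_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def normalize_characters(text):
    """Undo substitutions: en-dash to hyphen, curly single quotes to apostrophe."""
    return (text.replace("\u2013", "-")
                .replace("\u2018", "'")
                .replace("\u2019", "'"))


def normalize_url(url):
    """Clean up characters and add a scheme if none is present (assumed http)."""
    url = normalize_characters(url.strip())
    if not URL_SCHEME.match(url):
        url = "http://" + url
    return url


def fix_doi(raw):
    """Return a repaired DOI, or None if it cannot be fixed (and should be removed)."""
    doi = normalize_characters(raw.strip())
    doi = DOI_URL_PREFIX.sub("", doi)
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    if not doi.startswith("10."):
        doi = "10." + doi  # restore a missing '10.' prefix
    return doi if DOI_PATTERN.match(doi) else None


def doi_from_url(url):
    """Pull a 'doi:...' identifier out of a DOI resolver URL, else return None."""
    match = DOI_URL_PREFIX.match(url)
    if not match:
        return None
    doi = fix_doi(url[match.end():])
    return "doi:" + doi if doi else None


def plausible_year(year, earliest=1900, latest=2011):
    """Keep only publication years inside the accepted range."""
    return earliest <= year <= latest
```

Keeping each correction as a small independent function is one way to get the extensibility mentioned above: a new class of error becomes one more function in the list that is applied to every record.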
- The records were next unified by taking the transitive closure on a number of identifiers. These identifiers included DOIs, PubMed IDs, PubMed Central IDs and URLs for articles and other cited works, and ISSNs, eISSNs and ISO title abbreviations for journals.
- The BibJSON data were then rearranged so that each dataset contains all the records believed to refer to the same bibliographic entity, which means several records wherever that entity had been cited more than once.
- Owing to mis-citation (in this case, the use of incorrect or incomplete identifiers) there were a number of clearly different records that had been mistakenly declared to refer to the same entity. For this reason we used a distance metric to recluster the record groups based on similarity.
- Finally, a Python script transforms the BibJSON tarball into RDF. The input tarball contains datasets, each of which comprises records believed to refer to the same entity. The script takes each of these datasets and merges them into a single ‘best’ record using the majority vote procedure previously described. The resultant record is then transformed into a number of quads for inclusion in the final RDF N-Quads Open Citations Corpus, principally modelled using the suite of SPAR ontologies created for this purpose.
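To make the unification and merging steps above more concrete, here is a rough Python sketch. It assumes records are flat dictionaries with hashable values and uses hypothetical field names (`doi`, `pmid`, `pmcid`, `url`). The transitive closure is computed with a union-find structure, and the merge is a simple per-field majority vote. The project's own scripts may well do this differently, and the distance-based reclustering step is not shown.

```python
from collections import Counter, defaultdict

# Hypothetical identifier fields; the real records use BibJSON attributes.
ID_FIELDS = ("doi", "pmid", "pmcid", "url")


def group_by_shared_identifiers(records):
    """Group records that are linked, directly or transitively, by any identifier."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # (field, value) -> index of the first record carrying it
    for index, record in enumerate(records):
        for field in ID_FIELDS:
            value = record.get(field)
            if value is None:
                continue
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], index)
            else:
                first_seen[key] = index

    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[find(index)].append(record)
    return list(groups.values())


def merge_by_majority_vote(group):
    """Build one 'best' record by taking the most common value of each field."""
    merged = {}
    fields = {field for record in group for field in record}
    for field in sorted(fields):
        votes = Counter(record[field] for record in group
                        if record.get(field) is not None)
        if votes:
            merged[field] = votes.most_common(1)[0][0]
    return merged


records = [
    {"doi": "10.1/a", "title": "A study"},
    {"doi": "10.1/a", "pmid": "111", "title": "A study"},
    {"pmid": "111", "title": "A Study."},
    {"pmid": "222", "title": "Other"},
]
groups = group_by_shared_identifiers(records)
best = [merge_by_majority_vote(group) for group in groups]
```

In this toy input the first and third records share no identifier, yet they end up in the same group because the second record links them. That is the effect of taking the transitive closure, and it is also why a single wrong identifier can pull unrelated records together and make reclustering necessary.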
This Open Citations Corpus of RDF citation data, extracted from the Open Access subset (OASS) of PubMed Central, details every reference list in the OASS articles. It holds each reference list as an individual named graph (hence the storage in N-Quads rather than triples), and comprises 236,499,781 quads occupying 2.1 gigabytes of storage in its compressed state. It includes references to ~20% of all post-1980 papers recorded in PubMed, including all the highly cited papers in every field of biomedical research, and is freely available under a CC0 waiver from http://opencitations.net/data/.
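As an illustration of why named graphs call for quads, the sketch below serialises one reference list as N-Quads lines, where the fourth term of each line names the graph the statement belongs to. The `example.org` URIs and the function names are made up for the example, and using the CiTO `cites` property directly between articles is an assumption; the modelling in the actual corpus may be richer.

```python
# CiTO is one of the SPAR ontologies; its use here is illustrative only.
CITES = "http://purl.org/spar/cito/cites"

# Characters that may not appear unescaped inside an N-Quads IRI.
FORBIDDEN_IN_IRI = set('<>"{}|^`\\ ')


def iri(value):
    """Wrap a URI in angle brackets, rejecting characters N-Quads disallows."""
    if any(ch in FORBIDDEN_IN_IRI or ord(ch) <= 0x20 for ch in value):
        raise ValueError("not a valid IRI for N-Quads: %r" % value)
    return "<" + value + ">"


def reference_list_quads(citing_uri, cited_uris, graph_uri, predicate=CITES):
    """Return one N-Quads line per citation, all placed in the same named graph."""
    return [
        f"{iri(citing_uri)} {iri(predicate)} {iri(cited)} {iri(graph_uri)} ."
        for cited in cited_uris
    ]


quads = reference_list_quads(
    "http://example.org/article/1",                 # hypothetical citing article
    ["http://example.org/article/2",
     "http://example.org/article/3"],               # hypothetical cited works
    "http://example.org/article/1/reference-list",  # hypothetical graph name
)
for line in quads:
    print(line)
```

A plain triple has no slot for the graph name, so the link between a citation and the reference list it came from would be lost without the fourth term.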
The Open Citations Corpus can be queried via a Web query form or via a SPARQL interface from the Open Citations Project web site at http://opencitations.net/, described in a subsequent blog post, where more information about the project is given.
All the scripts used to transform the OASS input data into the Open Citations Corpus, described above, are available under an MIT Open Source licence at https://github.com/opencitations/.