The Open Citations Project has aimed to liberate bibliographic references from biomedical research literature as Open Linked Data, using as its starting corpus the Open Access Subset (OASS) of articles within PubMed Central. The greatest problem faced during this project, naively unanticipated before we started, was the extent of incompleteness, noise and errors of various sorts within the reference information extracted from the OASS articles. So significant has this problem been that sorting it out has taken almost the entire time and effort of Alex Dutton, our skilled data munger working on the project; without his skill, dedication and effort the project would not have succeeded.
In this context, any deviation of the bibliographic reference in an OASS article from the bibliographic citation text for that paper provided by the original publisher is taken to be an error. Such errors may be as slight as the substitution of a “β” character by the word “beta” in the title of a cited work, made by the author of the OASS article while creating the reference list, or as severe as including in the reference to one paper the DOI of another unrelated paper.
So it is worth taking time, before explaining how we addressed this problem, to describe its nature and magnitude, and to illustrate it with typical examples. We have found that errors of one sort or another occur in about 1% of all extracted references from the OASS. Since we extracted 6,325,178 individual references from our starting corpus, this constitutes well over 50,000 references containing errors.
Errors have different sources. Authors are largely to blame, for not exercising due care when creating the reference lists of their papers (or earlier, when creating bibliographic records in a reference management system such as EndNote). However, if one of the EndNote records used to populate a reference list had been pulled automatically from some third-party source, the error might be due to that source, something of which the author was totally unaware.
Some references have incorrect punctuation or capitalisation of titles, or omit some sub-part, as exemplified by the following screenshot – note the use of “alternate”, “alternative” or “Alternate”, and of “intensivist” and “intensive” in different OASS references to the same paper. The boxed title is the correct one:
Some references omit one or more author names, or omit diacritics, a tendency particularly correlated with the degree to which a name is ‘foreign’ or ‘unusual’, as the next example illustrates – note the lack of “Mariotte-Labarre S” in the first two references, and the variation in punctuation of the journal name abbreviation in the third:
There are also numerous instances where the text of a reference is correct, but the associated identifiers (e.g. DOIs and PubMed IDs) are incorrect. By way of example, references 15 and 16 of PMC1839102 are both given the same DOI; in PMC2896208 the DOIs for references 52 and 72 are swapped; and in PMC2778786 references 15, 40, and 49 are all given the DOI of another (uncited) paper.
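One of these identifier problems, the same DOI attached to several references in a single article, is easy to screen for mechanically. The following is a minimal sketch, not the project's actual code: the function name `shared_dois` and the placeholder DOIs are hypothetical, and the input is assumed to be a list of (reference number, DOI) pairs already extracted from one article.

```python
from collections import defaultdict

def shared_dois(references):
    """Map each DOI that appears on more than one reference to the
    reference numbers that carry it.

    `references` is an iterable of (reference_number, doi) pairs; a DOI
    of None or "" means the reference has no DOI and is skipped.
    """
    by_doi = defaultdict(list)
    for number, doi in references:
        if doi:
            # DOIs are case-insensitive, so compare them in lower case.
            by_doi[doi.strip().lower()].append(number)
    return {doi: nums for doi, nums in by_doi.items() if len(nums) > 1}

# Placeholder DOIs for illustration only; these are not real records.
example = [
    (15, "10.1000/example.a"),
    (16, "10.1000/EXAMPLE.A"),
    (17, "10.1000/example.b"),
    (18, None),
]
suspects = shared_dois(example)
```

A check like this only flags candidates for inspection. It cannot detect two swapped DOIs, since each still appears exactly once, and it cannot tell which of the duplicated references, if any, holds the correct identifier.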
We should point out that these examples are entirely anecdotal, and that we haven’t investigated the frequency with which these or any other classes of error occur.
Publishers are the other main culprits, particularly for introducing errors into documents during the XML encoding stage. The automated parsing systems we have used to extract information into our RDF-encoded records can almost never recover from such errors.
Individual publishers, while all working to the same National Library of Medicine DTD for encoding the XML markup of their articles submitted to PubMed Central, might take different approaches to encoding the same information. For example, the text
“… was found to be significant[1,3–6]”
might be marked up as either of the following:
“… was found to be significant<sup>[<xref rid="CR1">1</xref>,<xref rid="CR3">3</xref>–<xref rid="CR6">6</xref>]</sup>”
“… was found to be significant<sup><xref rid="CR1,CR3,CR4,CR5,CR6">[1,3–6]</xref></sup>”.
Note that in the former case there is no explicit mention of the references with identifiers CR4 and CR5, making things a little harder to parse.
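To illustrate the extra work the former encoding requires, here is a minimal sketch of how both forms might be reduced to the same list of reference identifiers. It is not the parser we used: the function names are hypothetical, and it assumes identifiers consist of a prefix followed by an unpadded number (as in CR1 … CR6) and that a range is signalled by a dash between two adjacent xref elements.

```python
import re
import xml.etree.ElementTree as ET

DASHES = ("-", "\u2013", "\u2014")  # hyphen, en dash, em dash

def split_id(rid):
    """Split an identifier such as 'CR12' into ('CR', 12), or return None."""
    m = re.fullmatch(r"(.*?)(\d+)", rid)
    return (m.group(1), int(m.group(2))) if m else None

def cited_ids(fragment):
    """Return every reference id cited in an XML fragment, filling in
    ids that are only implied by a range such as 3-6."""
    root = ET.fromstring(fragment)
    ids = []
    prev_tail = ""  # text between the previous xref and the current one
    for xref in root.iter("xref"):
        # The rid attribute may hold several ids (comma or space separated).
        rids = [r for r in re.split(r"[,\s]+", xref.get("rid", "")) if r]
        if not rids:
            continue
        if ids and prev_tail.strip() in DASHES:
            start, end = split_id(ids[-1]), split_id(rids[0])
            if start and end and start[0] == end[0]:
                # Insert the ids strictly between the two range endpoints.
                ids.extend(f"{start[0]}{n}" for n in range(start[1] + 1, end[1]))
        ids.extend(rids)
        prev_tail = xref.tail or ""
    return ids

form_a = ('<sup>[<xref rid="CR1">1</xref>,<xref rid="CR3">3</xref>'
          '\u2013<xref rid="CR6">6</xref>]</sup>')
form_b = '<sup><xref rid="CR1,CR3,CR4,CR5,CR6">[1,3\u20136]</xref></sup>'

ids_a = cited_ids(form_a)
ids_b = cited_ids(form_b)
```

Under these assumptions both fragments yield CR1, CR3, CR4, CR5 and CR6. Real articles would need more care, for example with zero-padded identifiers or ranges whose endpoints use different prefixes.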
We have found occasions where a four-digit number in the title has been marked up as the publication year. For example, reference 21 of PMC2743650 claims that the cited article was published in 7942. The cited article’s real title refers to the bacterium strain PCC7942, information that has been removed from the title in the OASS reference.
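Mis-tagged years of this kind can at least be flagged by a simple plausibility test. The sketch below is an illustration rather than our method: the function name and the lower bound of 1600 are assumptions chosen for the example.

```python
from datetime import date

def plausible_year(value, earliest=1600, latest=None):
    """Return True if `value` looks like a believable publication year.

    `earliest` is an arbitrary illustrative lower bound; `latest`
    defaults to next year, to allow for articles published ahead of print.
    """
    if latest is None:
        latest = date.today().year + 1
    try:
        year = int(str(value).strip())
    except ValueError:
        return False
    return earliest <= year <= latest
```

A test like this would reject 7942, but it would not catch a four-digit number from a title that happens to fall within the plausible range.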
The information returned by the Entrez API, which we used as our ‘gold standard’ against which to check OASS references containing PubMed IDs, was itself not without error. We found a number of PubMed records where DOIs had been truncated to just the prefix, or were missing a prefix.
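Both kinds of damaged DOI can be recognised from their shape alone. The following sketch classifies a DOI string; the function name, the category labels and the regular expressions are assumptions for illustration, based on the usual layout of a DOI as a prefix beginning "10." followed by a slash and a suffix.

```python
import re

# Assumed shape of a DOI prefix: "10." plus a registrant code of 4-9
# digits, optionally subdivided (e.g. 10.1000.10).
PREFIX = r"10\.\d{4,9}(?:\.\d+)*"
FULL_DOI = re.compile(rf"{PREFIX}/\S+")
PREFIX_ONLY = re.compile(rf"{PREFIX}/?")

def classify_doi(value):
    """Classify a DOI string as 'ok', 'prefix_only', 'missing_prefix'
    or 'malformed'."""
    doi = value.strip()
    if FULL_DOI.fullmatch(doi):
        return "ok"
    if PREFIX_ONLY.fullmatch(doi):
        return "prefix_only"       # truncated to just the prefix
    if doi and not doi.startswith("10."):
        return "missing_prefix"    # suffix present, prefix absent
    return "malformed"
```

For example, the DOI of the Pickart et al. paper cited below, 10.1371/journal.pone.0000104, is classified as "ok", whereas "10.1371" alone is "prefix_only" and "journal.pone.0000104" is "missing_prefix". A well-formed DOI may of course still point to the wrong paper, so this complements rather than replaces a comparison against the publisher's record.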
Editors and referees
Others are also culpable. At the end of paragraph 12 of the PLoS ONE paper by Pickart et al. (2006) we find the text:
We consider the current 12% detection rate to be a lower estimate of observable specific phenotypes from the screen, as additional screening will examine the morpholino collection using a variety of novel assays (such as newly generated enhancer and gene trap lines; Balciunas et al., 2004; Kawakami et al., 2004; Parinov et al., 2004) and may reveal developmental and/or functional aspects not readily visible by morphological criteria.
However, these three references do not appear in the reference list and so are totally lost to the system – the authors knew whom they were citing, but no one else did. How this escaped the eagle eyes of the authors, the journal editor and the reviewers is beyond my understanding!
Pickart MA et al. (2006). Genome-Wide Reverse Genetics Framework to Identify Novel Functions of the Vertebrate Secretome. PLoS ONE 1(1): e104. doi:10.1371/journal.pone.0000104.