Input data for Open Citations – the PMC Open Access Subset

PubMed, created by the US National Library of Medicine in DATE, holds bibliographic records and abstracts for essentially all journal articles published in the biomedical sciences. It currently records almost a million new entries each year!

PubMed Central (PMC), created as an extension of PubMed, is designed to hold full text articles from among the PubMed entries. At present, PMC holds entries for ~9.3% of the papers indexed in PubMed that were published between 1980 and 2010: 1,428,675 out of a total of 15,319,102. Many of these PMC articles (192,452 for the years 1980 to 2010, ~13.5% of the PMC holdings) are truly Open Access articles, which users can download and repurpose as they will. However, the majority are articles from subscription access journals, deposited in PMC under licence agreements with funding agencies that, while providing read access to the full text, prevent readers from downloading the articles and from making derivative works.

The Open Citations Project has to date worked exclusively with the Open Access subset (OASS) of PMC. As of 24 January 2011, there were 204,637 OASS articles, including a few published before 1980. In almost all of these OASS articles, the reference lists were nicely marked up in NLM-DTD XML, making the task of identifying individual references straightforward. In a few cases, the articles were present as scanned page images, lacking any internal markup – those we were unable to process.
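As a rough illustration of why marked-up reference lists are easy to process, here is a minimal sketch in Python. It is not the project's actual code. The sample XML is a made-up, heavily abbreviated fragment, and the element names (`ref-list`, `ref`, `citation`, `pub-id`) are the ones commonly found in NLM-DTD articles; the alternative tags `element-citation` and `mixed-citation` are included on the assumption that some articles use a later version of the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily abbreviated article fragment in the NLM-DTD style.
SAMPLE_XML = """
<article>
  <back>
    <ref-list>
      <ref id="B1">
        <citation citation-type="journal">
          <article-title>An example cited paper</article-title>
          <source>Example Journal</source>
          <year>2005</year>
          <pub-id pub-id-type="pmid">12345678</pub-id>
        </citation>
      </ref>
      <ref id="B2">
        <citation citation-type="book">
          <source>An example cited book</source>
          <year>1999</year>
        </citation>
      </ref>
    </ref-list>
  </back>
</article>
"""

# Tag names that may wrap a single reference, depending on the DTD version.
CITATION_TAGS = ("citation", "element-citation", "mixed-citation")


def extract_references(xml_text):
    """Return one dict per <ref> element that contains a citation."""
    root = ET.fromstring(xml_text.strip())
    references = []
    for ref in root.iter("ref"):
        # Take the first child that is a citation element, whichever variant is used.
        citation = next((child for child in ref if child.tag in CITATION_TAGS), None)
        if citation is None:
            continue
        # Look for a PubMed ID among the citation's <pub-id> elements.
        pmid = None
        for pub_id in citation.iter("pub-id"):
            if pub_id.get("pub-id-type") == "pmid":
                pmid = (pub_id.text or "").strip()
        references.append({
            "ref_id": ref.get("id"),
            "type": citation.get("citation-type") or citation.get("publication-type"),
            "title": citation.findtext("article-title"),
            "source": citation.findtext("source"),
            "year": citation.findtext("year"),
            "pmid": pmid,
        })
    return references


refs = extract_references(SAMPLE_XML)
for ref in refs:
    print(ref)
```

Articles that exist only as scanned page images have no such elements, so a function like this would simply return an empty list for them.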

From the XML reference lists of these papers, we were able to identify and extract 6,325,178 individual references, which, together with the bibliographic information we had on the 204,637 OASS articles themselves, gave us 6,529,815 independent bibliographic records of both citing and cited entities. As explained in the next blog post, these records showed varying degrees of completeness and accuracy.

Where PubMed IDs were available in the references, we used them with the Entrez API to extract a further 2,304,143 bibliographic records from PubMed. In an ideal world, each of these would exactly duplicate the information we had previously obtained from the OASS bibliographic reference containing that PubMed ID. As we shall describe, these additional PubMed records proved exceptionally useful in correcting imperfect OASS references.
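The post does not say which Entrez calls were used, so the following is only a sketch of one plausible approach: building requests to the public E-utilities `efetch` endpoint for batches of PubMed IDs. The batch size and the helper names (`chunked`, `build_efetch_urls`) are assumptions made for illustration, and the PubMed IDs are made up. The code only constructs the URLs; it does not contact the server.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for fetching full records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunked(items, size):
    """Yield successive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_efetch_urls(pmids, batch_size=200):
    """Build one efetch URL per batch of PubMed IDs (batch size is an assumption)."""
    urls = []
    for batch in chunked(list(pmids), batch_size):
        query = urlencode({"db": "pubmed", "id": ",".join(batch), "retmode": "xml"})
        urls.append(f"{EFETCH_URL}?{query}")
    return urls


# Made-up PubMed IDs, split into batches of two for demonstration.
urls = build_efetch_urls(["12345678", "23456789", "34567890"], batch_size=2)
for url in urls:
    print(url)

# Each URL could then be fetched with, for example, urllib.request.urlopen(url),
# taking care to respect NCBI's usage guidelines on request rates.
```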

Since the OASS articles cite papers outside the OASS, as well as a few within it, the majority of the bibliographic information we thus acquired related to papers represented within PubMed but not within PubMed Central. And because many OASS papers independently contained references to the most highly cited biomedical papers, many of our records referred to the same bibliographic entities.

An important part of our data processing was thus to coalesce independent references from different OASS articles to the same multiply cited papers into a set of unique bibliographic records, each for one paper. Once this had been achieved, we were left with 3,578,598 unique bibliographic records, 204,637 describing the OASS articles themselves, and 3,373,961 describing articles outside the OASS, mostly from subscription-access journals.
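To make the idea of coalescing concrete, here is a minimal sketch in Python. It is not the project's actual method, which is described in later posts. It assumes each record is a dict with optional `pmid`, `title` and `year` fields, groups records by PubMed ID where one is present, and otherwise falls back to a crude normalised title-plus-year key. The field names, the fallback key and the sample records are all hypothetical.

```python
from collections import defaultdict


def record_key(record):
    """Prefer the PubMed ID; otherwise fall back to a normalised (title, year) key."""
    if record.get("pmid"):
        return ("pmid", record["pmid"])
    # Lower-case the title and drop everything except letters and digits.
    title = "".join(ch for ch in (record.get("title") or "").lower() if ch.isalnum())
    return ("title-year", title, record.get("year"))


def coalesce(records):
    """Merge records that share a key into one record per cited paper."""
    groups = defaultdict(list)
    for record in records:
        groups[record_key(record)].append(record)

    unique = []
    for members in groups.values():
        merged = {}
        for member in members:
            for field, value in member.items():
                # Keep the first non-empty value seen for each field.
                if value and not merged.get(field):
                    merged[field] = value
        # Number of independent references that were merged into this record.
        merged["reference_count"] = len(members)
        unique.append(merged)
    return unique


# Made-up references, as they might appear in different citing articles.
records = [
    {"pmid": "11111111", "title": "An example paper", "year": "2001"},
    {"pmid": "11111111", "title": "An example paper.", "year": None},
    {"pmid": None, "title": "Another Example", "year": "1995"},
    {"pmid": None, "title": "another example.", "year": "1995"},
]

unique = coalesce(records)
for record in unique:
    print(record)
```

One limitation of this simple keying is that a reference carrying a PubMed ID will never be merged with a reference to the same paper that lacks one. Handling such cases needs the kind of record correction discussed in the following posts.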

The following table and figure tabulate and illustrate the number of papers in each category between 1980 and 2010 inclusive. The most striking thing about these data is that they show how the relatively small number of articles in the Open Access subset of PMC (approx. 200,000 articles) referenced >20% of all PubMed papers published between 1980 and 2010 (approx. 15.3 million papers), and in doing so referenced all the most important highly cited papers in every field of biomedical endeavour. This inclusive coverage means that citation graphs created from the Open Citations dataset will capture all the important aspects of any field.

Table 1: Number of papers by year

| Year | PubMed | PMC | OASS | Cited by OASS |
|---|---|---|---|---|
| 1950-1979 | 5,128,602 | 427,877 | 8,352 | 146,027 |
| 1980 | 278,069 | 23,218 | 631 | 15,708 |
| 1981 | 278,069 | 23,685 | 543 | 16,627 |
| 1982 | 292,219 | 25,215 | 740 | 18,389 |
| 1983 | 305,725 | 25,688 | 738 | 21,263 |
| 1984 | 314,737 | 26,316 | 543 | 23,249 |
| 1985 | 331,706 | 25,916 | 637 | 25,780 |
| 1986 | 345,501 | 26,721 | 590 | 28,761 |
| 1987 | 363,754 | 27,834 | 555 | 32,222 |
| 1988 | 381,976 | 28,802 | 442 | 36,320 |
| 1989 | 398,620 | 29,855 | 616 | 42,005 |
| 1990 | 398,620 | 30,143 | 704 | 48,422 |
| 1991 | 407,465 | 31,337 | 733 | 53,655 |
| 1992 | 412,457 | 32,325 | 719 | 61,091 |
| 1993 | 420,935 | 33,203 | 1,055 | 70,272 |
| 1994 | 431,160 | 33,456 | 1,279 | 80,206 |
| 1995 | 441,967 | 34,276 | 1,148 | 91,814 |
| 1996 | 452,218 | 34,755 | 1,155 | 101,853 |
| 1997 | 451,533 | 34,800 | 1,314 | 114,967 |
| 1998 | 469,466 | 36,179 | 1,341 | 131,510 |
| 1999 | 469,466 | 37,534 | 1,420 | 146,623 |
| 2000 | 528,243 | 39,047 | 1,608 | 170,330 |
| 2001 | 542,854 | 40,235 | 2,546 | 179,203 |
| 2002 | 560,006 | 43,265 | 3,199 | 195,879 |
| 2003 | 590,317 | 46,442 | 4,015 | 211,423 |
| 2004 | 634,432 | 51,416 | 6,005 | 229,423 |
| 2005 | 694,687 | 60,411 | 10,333 | 236,678 |
| 2006 | 740,007 | 72,295 | 14,264 | 238,387 |
| 2007 | 777,311 | 87,744 | 20,070 | 222,085 |
| 2008 | 824,612 | 120,004 | 31,416 | 190,071 |
| 2009 | 862,372 | 146,413 | 41,848 | 124,894 |
| 2010 | 918,598 | 120,145 | 40,245 | 27,877 |
| Total (1980-2010) | 15,319,102 | 1,428,675 | 192,452 | 3,186,987 |
| % of PubMed | | 9.33% | | 20.80% |
| % of PMC | | | 13.47% | |
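The percentages in Table 1 and the record counts quoted earlier in the post can be re-derived with a few lines of arithmetic. The short Python check below uses only figures taken from the text and the table; the variable and function names are our own.

```python
# Totals for 1980-2010, as given in Table 1.
pubmed_total = 15_319_102
pmc_total = 1_428_675
oass_total = 192_452
cited_by_oass_total = 3_186_987


def percent(part, whole):
    """Return `part` as a percentage of `whole`, rounded to two decimal places."""
    return round(100 * part / whole, 2)


pmc_share_of_pubmed = percent(pmc_total, pubmed_total)
oass_share_of_pmc = percent(oass_total, pmc_total)
cited_share_of_pubmed = percent(cited_by_oass_total, pubmed_total)

print(f"PMC as % of PubMed:           {pmc_share_of_pubmed:.2f}%")
print(f"OASS as % of PMC:             {oass_share_of_pmc:.2f}%")
print(f"Cited by OASS as % of PubMed: {cited_share_of_pubmed:.2f}%")

# Record counts quoted in the text (all OASS articles as of 24 January 2011).
oass_articles = 204_637
extracted_references = 6_325_178
records_outside_oass = 3_373_961

records_before_coalescing = extracted_references + oass_articles
unique_records = oass_articles + records_outside_oass

print(f"Records before coalescing: {records_before_coalescing:,}")
print(f"Unique records afterwards: {unique_records:,}")
```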

Figure 1

Figure 2

The OASS source data give the types of cited entity, aggregated after coalescing, shown in Figure 3.

Figure 3

This entry was posted in JISC, Open Citations.
