Ongoing work on the Open Citations extensions project is now reaching the point of visualising – at very much a prototype level at this stage – the outputs of our earlier efforts to import and index the PubMed Central Open Access subset and arXiv.
Earlier in this project I asked David to specify a list of questions that he thought researchers might hope to answer by querying our Open Citations Corpus; the aim was to use these questions to guide our developments, in the hope of providing a striking interface that also did something useful – there are too many visualisations that look very pretty but do not actually add much to the data. So, considering that list of questions and how one might visualise the data in a way that is functional as well as pretty, I set myself the following problem:
Identify what it is in a dataset that is not easy to find in a textual representation, and make it useful for search
Based on our earlier text search demonstrator the answer pretty soon became – of course – interactions; whilst the properties of a result object are obvious in a textual result set, the interactions between those objects are not – and sometimes, it is the interactions that one wishes to use as search parameters.
What I did
Having found my purpose, I set about applying the superb D3.js library to the problem, using it to draw SVG representations of elasticsearch query results directly into the browser. After testing a number of different result layouts, I settled upon a zoomable, pannable force-directed network graph and combined it with some code from my PhD work to build in some connections on the fly. This is, as mentioned earlier, still a work in progress, but results so far are pretty good.
Take the image above, for example: this is a static representation of the interactions between David Shotton and all other authors (purple) with whom he has published an article (green) in the PMC OA subset. The red dots are the journals these articles are in, and the brown are citations. As a static image this could be fairly informative when marked up with appropriate metadata, and it does look quite nice; but, more than that, it can act as part of a search interface to enable a much improved search experience.
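To make the structure of such a graph concrete, here is a minimal sketch of how article records could be turned into the node and link lists that a force-directed layout consumes. It is not the code behind the demonstrator: the record fields (`title`, `authors`, `journal`, `citations`) and the function name `build_graph` are assumptions made for illustration only.

```python
def build_graph(articles):
    """Turn article records into nodes and index-based links.

    Assumes (hypothetically) that each record is a dict with a "title",
    plus optional "authors" (list), "journal" (string) and "citations" (list).
    """
    node_ids = {}  # (type, label) -> position in the node list
    links = []

    def node_index(node_type, label):
        # Reuse the existing node if this entity has been seen before,
        # so an author shared by two articles becomes a single bubble.
        key = (node_type, label)
        if key not in node_ids:
            node_ids[key] = len(node_ids)
        return node_ids[key]

    for article in articles:
        source = node_index("article", article["title"])
        for author in article.get("authors", []):
            links.append({"source": source, "target": node_index("author", author)})
        if article.get("journal"):
            links.append({"source": source, "target": node_index("journal", article["journal"])})
        for citation in article.get("citations", []):
            links.append({"source": source, "target": node_index("citation", citation)})

    # Dicts keep insertion order, so list position matches the stored index.
    nodes = [{"type": node_type, "label": label} for (node_type, label) in node_ids]
    return {"nodes": nodes, "links": links}
```

The links refer to nodes by their position in the node list, which is a form that D3's force layout accepts; the node `type` is what a front end could map to the colours described above.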
So far, the production of a given image also reduces the result set size; so whilst viewing the above image, the available suggestion dropdowns are automatically restricted to the subset of values relevant to the currently displayed image. Dropdown suggestions are listed in order of popularity count; upon typing one letter they switch to alphabetical order, and with multiple letters they become term searches. By typing in free text search values or choosing suggestions, this visual representation of the current subset of results, combined with the automated restriction of further suggestions, should offer a simple yet powerful search experience. It is also possible to switch back to “list” view at any time, to see the current result set in a more traditional form. Further work – described below – will bring enhancements that add functionality to the elements of the visualisation too.
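The three suggestion modes can be sketched roughly as follows. This is an illustration rather than the demonstrator's implementation: the function name `suggest` is hypothetical, the counts in the usage note are made up, and two details are assumptions, namely that a single typed letter filters by first letter and that a "term search" can be approximated by a case-insensitive substring match.

```python
def suggest(counts, typed=""):
    """Order dropdown suggestions for the current result subset.

    counts maps each candidate value to the number of results it appears in.
    """
    text = typed.strip().lower()

    if not text:
        # Nothing typed yet: most popular values first, ties broken by name.
        return sorted(counts, key=lambda value: (-counts[value], value.lower()))

    if len(text) == 1:
        # One letter: alphabetical list of values starting with that letter
        # (the prefix filter is an assumption).
        matches = [value for value in counts if value.lower().startswith(text)]
        return sorted(matches, key=str.lower)

    # Several letters: treat the input as a search term. Substring matching
    # stands in here for whatever the search index actually does.
    matches = [value for value in counts if text in value.lower()]
    return sorted(matches, key=lambda value: (-counts[value], value.lower()))
```

For example, with the toy counts `{"Malaria": 5, "Mosquito": 3, "Anopheles": 3, "malaria vaccine": 1}`, an empty input lists Malaria first, typing `m` gives the three values beginning with m in alphabetical order, and typing `mal` narrows the list to the two values containing that text.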
Similarly to the search result list demonstrator, it is possible to embed the visual search tool in any web page. However, as it looks better with full screen real estate I have saved that particular trick for the time being, and simply made it available at http://occ.cottagelabs.com/graphview.
Now before you rush off to try it, given the prototype state, you will need some pointers. Taking the above image as example once more, in order to reproduce it, do the following:
- Choose authors as a search suggestion type
- Start typing Shotton – click on Shotton David when it appears in the list
- (If the error described below occurs, where the search appears to return all results again, just keep going)
- Tick the various display options to add authors, journals and citations objects to the display
The next step would be to click an author or other entity bubble then choose to add that to the search terms, or start a new search based on that bubble or perhaps a subset of the returned bubbles; however this is all still in development.
For a more complex example, try choosing keywords then type Malaria. Once displayed, increase the result set size to 400 so that all results are displayed. Then try selecting the various authors, journals and citations tickboxes to add those objects; try increasing the sizes to see how many you can get before your computer melts… On my laptop, asking for more than about 1000 of each object results in poor performance. But here is an example of the output – all 383 articles with the Malaria keyword in the PMC OA subset, showing all 70 journals in which they are published, with links to the top 100 authors and citations. Which journal do you think is the large purple dot in the middle?
Further work
- Numerous buttons have no action yet – clear / help / prev / next / + search / labels. Once these and other search action buttons are added, the visualisation can become a true part of the search experience rather than just a pretty picture
- Searches are sent asynchronously and occasionally overlap, resulting in large query result sizes overwriting smaller ones. A delay on user interactions needs to be added.
- Some objects should become one – for example some citations are to the same article via both DOI and PMID, and some citations are also open access articles in our index, so they should be linked up as such.
- There is as yet no visual cue that results are still loading, so it feels a bit in limbo. Easy fix.
- Some of the client-side processing can be shifted to the backend (already in progress)
- The date slider at the bottom is twitchy and needs smoother implementation and better underlying data (see below)
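As an illustration of the "some objects should become one" item in the list above, here is a minimal sketch of merging citation records that point at the same article through a shared DOI or PMID. The record layout (dicts with optional `doi` and `pmid` keys) and the function name `merge_citations` are assumptions for the example, not a description of our index.

```python
def merge_citations(citations):
    """Merge citation records that share a DOI or a PMID.

    Assumes (hypothetically) that each record is a dict that may carry
    "doi" and/or "pmid" alongside other metadata.
    """
    merged = []         # one record per distinct cited article
    by_identifier = {}  # ("doi", value) or ("pmid", value) -> merged record

    for citation in citations:
        keys = [(field, citation[field]) for field in ("doi", "pmid") if citation.get(field)]

        # Collect the distinct records already known under any of these identifiers.
        matches = []
        for key in keys:
            record = by_identifier.get(key)
            if record is not None and not any(record is m for m in matches):
                matches.append(record)

        if matches:
            target = matches[0]
            # A record carrying both identifiers can bridge two earlier records:
            # fold the others into the first and repoint their identifiers.
            for other in matches[1:]:
                for field, value in other.items():
                    target.setdefault(field, value)
                merged[:] = [m for m in merged if m is not other]
                for key, record in list(by_identifier.items()):
                    if record is other:
                        by_identifier[key] = target
        else:
            target = {}
            merged.append(target)

        # Keep the first non-empty value seen for each field.
        for field, value in citation.items():
            if value:
                target.setdefault(field, value)
        for key in keys:
            by_identifier[key] = target

    return merged
```

Where two records disagree on a field, this sketch simply keeps the value seen first; a real pipeline would want a more considered rule for such conflicts.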
Apart from the above technical tasks, we will need to revisit our data pipeline in order to answer more of the questions set by David. For example, we have very little affiliation data at present, and we are also missing a large amount of date information. Some data cleaning is also necessary – for example, keywords should all be lowercased to ensure we do not have subsets due solely to capitalisation. There are also certain types of data that we have no idea about as yet – for example author location, h-index, ORCID. However, this is all to be expected at this stage, and overall our ability to spot these issues so easily shows great progress.
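The keyword cleaning mentioned above is simple enough to sketch. The following assumes, purely for illustration, that each article record carries a `keywords` list; the function name `keyword_counts` is hypothetical too.

```python
from collections import Counter


def keyword_counts(articles):
    """Count articles per keyword after lowercasing and trimming whitespace."""
    counts = Counter()
    for article in articles:
        # A set, so that "Malaria" and "malaria" on the same article count once.
        normalised = {
            keyword.strip().lower()
            for keyword in article.get("keywords", [])
            if keyword.strip()
        }
        counts.update(normalised)
    return counts
```

With this in place, "Malaria", "MALARIA" and "malaria" all fall into one subset rather than three.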
More to come
There is still work to be done on this graph interface, and we have some more demonstrators on the way too. In combination with the work on improving the pipeline and data quality, we should soon be able to perform queries that will answer more of our set questions – then we will identify what needs to be done next to answer the remaining ones!