Performing live time-traversal queries on RDF datasets

Guest post by Arcangelo Massari, University of Bologna

In this post, Arcangelo Massari, who recently graduated in Digital Humanities and Digital Knowledge under Professor Silvio Peroni at the University of Bologna, shares the results of his master's thesis.

A particular problem in information retrieval is that of obtaining data from an evolving dataset, independent of the time at which a given item of data was added, changed or removed. To permit such time-independent queries to be performed over evolving RDF datasets, I have developed two new pieces of open source software, time-agnostic-library [1] and time-agnostic-browser [2], that are now available from the OpenCitations GitHub repository.

The time-agnostic-library is a Python library for performing live time-traversal queries on RDF datasets. Time-traversal means being agnostic about time: a SPARQL query is run not on the current state of the collection alone, but over its entire history or over a specified timespan of that history [3]. The library supports materializations, that is, obtaining all versions of an entity over time or its status at a given time. Furthermore, SPARQL queries can be performed to get the delta between two or more versions of one or more resources. The time-agnostic-library thereby realizes all the retrieval functionalities described in the taxonomy by Fernández et al. [3].
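To make the notion of a delta concrete, here is a minimal sketch in plain Python. It is not the library's API: it simply assumes that each version of a resource can be represented as a set of (subject, predicate, object) triples, and the `Delta` type, the `delta` function and the example URIs are hypothetical names chosen for illustration.

```python
from typing import NamedTuple, Set, Tuple

# A triple is modelled as a plain (subject, predicate, object) tuple.
Triple = Tuple[str, str, str]


class Delta(NamedTuple):
    """Hypothetical container for the difference between two versions."""
    added: Set[Triple]
    removed: Set[Triple]


def delta(old: Set[Triple], new: Set[Triple]) -> Delta:
    """Return the triples added to and removed from `old` to obtain `new`."""
    return Delta(added=new - old, removed=old - new)


# Two illustrative versions of the same bibliographic resource.
v1 = {
    ("ex:br/1", "dcterms:title", "Draft title"),
    ("ex:br/1", "prism:publicationDate", "2020"),
}
v2 = {
    ("ex:br/1", "dcterms:title", "Final title"),
    ("ex:br/1", "prism:publicationDate", "2020"),
}

change = delta(v1, v2)
```

In this example `change.added` contains only the triple with the final title and `change.removed` only the triple with the draft title; the unchanged publication date appears in neither set.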

To complement this query software, the time-agnostic-browser is a web application built on top of the time-agnostic-library to achieve the same results via a graphical user interface.

The primary purpose of these developments is to offer a system for browsing the provenance [4] of RDF statements across time: who produced them, when, where the information was taken from, and what changes were made compared to the previous state of the resource. Knowledge of such information is essential because data changes over time, either because of the natural evolution of concepts or due to the correction of mistakes. Indeed, the latest version of knowledge may not be the most accurate. Such phenomena are particularly tangible in the Web of Data, as highlighted in a study by the Dynamic Linked Data Observatory, which noted the modification of about 38% of the nearly 90,000 RDF documents monitored for 29 weeks, and the permanent disappearance of 5% of them [5] (Figure 1).

Figure 1. Donut chart showing the results of the study conducted by the Dynamic Linked Data Observatory on the evolution of RDF documents [5].

Additionally, the truthfulness of data cannot be assessed without provenance records and a system to query them. In fact, the truth value of an assertion on the Web is never absolute, as demonstrated by Wikipedia, which in its official policy on the subject states: “The threshold for inclusion in Wikipedia is verifiability, not truth.” [6]. The Semantic Web does not alter that condition, and trustworthiness has to be evaluated by each application by probing the context of the statements [7]. It is a challenging task and thus, in the Semantic Web Stack, trust is the highest and most complex level to satisfy, subsuming all the previous ones (Figure 2).

Figure 2. The Semantic Web layers [7]. Trust is the uppermost level of the stack, subsuming all the others.

Notwithstanding these premises, at present the most extensive RDF datasets – DBpedia [8], Wikidata [9], Yago [10], and the Dynamic Linked Data Observatory [11] – do not use RDF to track changes and record the provenance of such changes. Instead, they all adopt backup-based archiving policies. Some of them, such as Yago 4, record provenance but not changes. As far as citation databases are concerned, OpenCitations is the only infrastructure to implement change-tracking mechanisms and to record full RDF provenance records for each data entity. Among the leading players in this field, neither Web of Science nor Scopus has adopted similar solutions.

In accordance with the OpenCitations Data Model (OCDM) [12], a provenance snapshot is generated by OpenCitations every time a bibliographical entity is created or modified. Each snapshot (prov:Entity) records the responsible agent (prov:wasAttributedTo), the generation time (prov:generatedAtTime), the invalidation time (prov:invalidatedAtTime), the primary source (prov:hadPrimarySource), and a link to the previous snapshot (prov:wasDerivedFrom), using terms from the Provenance Ontology. In addition, OCDM introduced a system to simplify restoring an entity’s status at a given time, by saving the delta between two versions as a SPARQL update query (prov:hasUpdateQuery) [13] (Figure 3). This approach enables one to restore an entity to a specific timepoint (snapshot) in a straightforward way by applying the inverse operations, i.e., deletions instead of additions, etc.
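The restoration mechanism can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the OpenCitations implementation: the hypothetical `Snapshot` class stands in for a provenance snapshot, and its `added` and `removed` sets stand in for the additions and deletions that a real snapshot would carry as a SPARQL update query.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Snapshot:
    """Hypothetical stand-in for one provenance snapshot of an entity."""
    generated_at: datetime
    added: Set[Triple] = field(default_factory=set)    # triples inserted by the update
    removed: Set[Triple] = field(default_factory=set)  # triples deleted by the update


def state_at(current: Set[Triple], snapshots: List[Snapshot],
             moment: datetime) -> Set[Triple]:
    """Rebuild the entity as it was at `moment`, starting from the current state."""
    state = set(current)
    # Walk back from the newest snapshot, undoing every update made after `moment`.
    for snap in sorted(snapshots, key=lambda s: s.generated_at, reverse=True):
        if snap.generated_at <= moment:
            break
        state -= snap.added    # inverse of an addition is a deletion
        state |= snap.removed  # inverse of a deletion is an addition
    return state


TITLE = "dcterms:title"
history = [
    Snapshot(datetime(2021, 1, 1), added={("ex:br/1", TITLE, "Draft title")}),
    Snapshot(datetime(2021, 6, 1),
             added={("ex:br/1", TITLE, "Final title")},
             removed={("ex:br/1", TITLE, "Draft title")}),
]
current = {("ex:br/1", TITLE, "Final title")}

march = state_at(current, history, datetime(2021, 3, 1))
```

Here `march` holds only the draft title, because the June update is undone while the January creation is left in place. Only the current state and the deltas need to be stored, which is the design choice the paragraph above describes.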

Figure 3. Provenance in the OpenCitations Data Model.

This solution is concretely used in all the datasets related to the OpenCitations infrastructure, such as COCI, an open index containing almost 1.2 billion DOI-to-DOI citation links derived from the open reference data available in Crossref [14]. It is important to note that this OpenCitations provenance model is generic and reusable in any other context. Since the time-agnostic-library leverages OCDM, it too is generic and can be used for any RDF dataset that tracks changes and provenance as OpenCitations does.

The time-agnostic-library is released under the ISC license and is downloadable through pip [1]. Test-driven development was adopted as a software development process during its creation [15]. It makes three main classes available to the user: AgnosticEntity, VersionQuery, and DeltaQuery, for materializations, version queries, and delta queries, respectively (Listing 1).

Listing 1. Code template to achieve materializations, time-traversal queries, and delta queries.

All three operations can be performed over the entire available history of the dataset, or by specifying a time interval via a tuple in the form (START, END).
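As a rough illustration of what restricting an operation to a (START, END) tuple means, the following plain-Python sketch filters a reconstructed history by timestamp. It does not use the library: `select_versions` is a hypothetical helper, the history is assumed to be a mapping from snapshot time to the set of triples valid from that time, and treating a `None` bound as open-ended is an assumption of this sketch. For simplicity it selects versions by generation time only.

```python
from datetime import datetime
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Interval = Tuple[Optional[datetime], Optional[datetime]]


def select_versions(history: Dict[datetime, Set[Triple]],
                    interval: Optional[Interval] = None) -> Dict[datetime, Set[Triple]]:
    """Keep only the versions generated within (START, END), bounds included."""
    if interval is None:
        # No interval given: the entire available history is returned.
        return dict(history)
    start, end = interval
    return {
        moment: triples
        for moment, triples in history.items()
        if (start is None or moment >= start) and (end is None or moment <= end)
    }


# Illustrative history with three versions of one resource.
history = {
    datetime(2021, 1, 1): {("ex:br/1", "dcterms:title", "Draft title")},
    datetime(2021, 6, 1): {("ex:br/1", "dcterms:title", "Revised title")},
    datetime(2022, 1, 1): {("ex:br/1", "dcterms:title", "Final title")},
}

during_2021 = select_versions(history, (datetime(2021, 1, 1), datetime(2021, 12, 31)))
```

With this data, `during_2021` keeps the two versions generated in 2021 and drops the one from 2022.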

The time-agnostic-browser [2] is also released under the ISC license and can be run as a Flask application. It is organized into two macro-sections: “Explore” and “Query”. In the former, a text input accepts a URI. By submitting it, the entire history of the corresponding resource is displayed. In the latter, a text area receives a SPARQL query, which is resolved on all dataset states. Its main added value is hiding the triples and the complexity of the underlying RDF model: predicate URIs, as well as subjects and objects, appear in a human-readable format. Moreover, all the entities are displayed as links, providing shortcuts to reconstruct the history of the related resources (Figure 4).

Figure 4. Graphical user interface of an entity history reconstruction through the time-agnostic-browser.

The efficiency of time-agnostic-library was measured with two types of benchmarks [16], one on execution times and the other on the amount of computer memory (RAM) required, across ten different use cases, each repeated ten times to limit the influence of outliers. In light of these benchmarks, time-agnostic-library has proven effective for any materialization. Structured queries are swift if all subjects are known or deducible. On the other hand, the presence of unknown subjects in the user’s SPARQL query requires identifying all present and past entities that satisfy that pattern, and so demands considerably more time and resources. Specifically, all materializations and the cross-version structured query with known subjects required about half a second and about 50 MB of RAM; conversely, with unknown subjects, 581 seconds and 519 MB of RAM were required. It can be concluded that the proposed software can be used effectively in all cases where the subject is known, that is, for any materialization and for SPARQL queries that contain no isolated triple patterns with unknown subjects.
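For readers who want to run comparable measurements on their own data, a repeated timing and memory benchmark can be sketched with the Python standard library alone. This is not the benchmark code behind the published results [16]: the `benchmark` function is a hypothetical helper, the measured task is a placeholder, and `tracemalloc` reports only memory allocated by Python itself, which may differ from the RAM figures quoted above.

```python
import statistics
import time
import tracemalloc
from typing import Callable, Dict


def benchmark(task: Callable[[], object], repetitions: int = 10) -> Dict[str, float]:
    """Run `task` several times and summarise execution time and peak memory."""
    durations = []
    peaks_mb = []
    for _ in range(repetitions):
        tracemalloc.start()
        started = time.perf_counter()
        task()
        durations.append(time.perf_counter() - started)
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks_mb.append(peak_bytes / 1_000_000)
    return {
        "mean_seconds": statistics.mean(durations),
        "stdev_seconds": statistics.stdev(durations),
        "mean_peak_mb": statistics.mean(peaks_mb),
    }


# Placeholder workload; in practice this would be a materialization or a query.
result = benchmark(lambda: sorted(range(100_000), reverse=True))
```

Reporting the standard deviation alongside the mean makes it easier to see whether a single slow run is distorting the average.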

Other software solutions for such problems have been proposed. Table 1 shows the list of available software to perform materializations and time-traversal queries on RDF datasets. As can be observed, time-agnostic-library is the only one to support all retrieval functionalities without requiring pre-indexing processes. This feature makes it particularly suitable for use in scenarios with large amounts of data that often change over time. Moreover, compared to the approach of Im, Lee and Kim [17] and OSTRICH [18], the OpenCitations Data Model only requires storing the current state of the dataset, rather than the original one, allowing one to query the latest version, without additional computational effort to first re-create the original version.

Software | Version materialization | Delta materialization | Single-version structured query | Cross-version structured query | Single-delta structured query | Cross-delta structured query | Live
PromptDiff [19] +++
SemVersion [20] +++
Im, Lee, & Kim, 2012 [17] +++++
R&Wbase [21] ++++
x-RDF-3X [22] +++
v-RDFCSA [23] ++++++
OSTRICH [18] +++
Tanon & Suchanek, 2019 [24] ++++++
time-agnostic-library [1] +++++++
Table 1. Comparison between time-agnostic-library and preexisting software to achieve materializations and time-traversal queries on RDF datasets.

The OpenCitations Data Model and the time-agnostic-library software are the prerequisites that will allow OpenCitations to involve third parties, for example members of staff in academic libraries, in the submission, curation and updating of OpenCitations bibliographic and citation data. At this stage, all entities in COCI have a single snapshot: the one made at the time of creation. However, since these entities may be modified, corrected or enriched over time, it is imperative to have appropriate software tools available for use by curators. With the time-agnostic-library software and its associated time-agnostic-browser, it will be possible for a curator to explore the entire history of the changes within an RDF dataset, to know when they were made, based on which source, and by which responsible agent, thus ensuring the reliability and verifiability of data, and facilitating any necessary further changes.

References

[1] A. Massari, time-agnostic-library. 2021. Available: https://archive.softwareheritage.org/swh:1:snp:d7fd1754377f45d16afb61efc770815b5a3c8f83

[2] A. Massari, time-agnostic-browser. 2021. Available: https://archive.softwareheritage.org/swh:1:dir:337f641375cca034eda39c2380b4a7878382fc4c

[3] J. D. Fernández, A. Polleres, and J. Umbrich, ‘Towards Efficient Archiving of Dynamic Linked’, in DIACHRON@ESWC, Portorož, Slovenia: Computer Science, 2015, pp. 34–49.

[4] ‘Provenance XG Final Report’. W3C, Dec. 2010. Available: http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/

[5] T. Käfer, A. Abdelrahman, J. Umbrich, P. O’Byrne, and A. Hogan, ‘Observing Linked Data Dynamics’, in The Semantic Web: Semantics and Big Data, vol. 7882, P. Cimiano, O. Corcho, V. Presutti, L. Hollink, and S. Rudolph, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 213–227. doi: 10.1007/978-3-642-38288-8_15

[6] S. L. Garfinkel, ‘Wikipedia and the Meaning of Truth’, MIT Technology Review, 2008, [Online]. Available: https://stephencodrington.com/Blogs/Hong_Kong_Blog/Entries/2009/4/11_What_is_Truth_files/Wikipedia%20and%20the%20Meaning%20of%20Truth.pdf

[7] M.-R. Koivunen and E. Miller, ‘Semantic Web Activity’, W3C, Nov. 02, 2001. https://www.w3.org/2001/12/semweb-fin/w3csw

[8] F. Orlandi and A. Passant, ‘Modelling provenance of DBpedia resources using Wikipedia contributions’, Journal of Web Semantics, vol. 9, no. 2, pp. 149–164, Jul. 2011, doi: 10.1016/j.websem.2011.03.002.

[9] P. Dooley and B. Božić, ‘Towards Linked Data for Wikidata Revisions and Twitter Trending Hashtags’, in Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services, Munich Germany, Dec. 2019, pp. 166–175. doi: 10.1145/3366030.3366048.

[10] Yago Project, ‘Download data, code, and logo of Yago projects’, Yago, 2021. https://yago-knowledge.org/downloads (accessed Sep. 24, 2021).

[11] J. Umbrich, M. Hausenblas, A. Hogan, A. Polleres, and S. Decker, ‘Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources’, in Proceedings of the WWW2010 Workshop on Linked Data on the Web, Raleigh, USA, 2010. Available: http://ceur-ws.org/Vol-628/ldow2010_paper12.pdf

[12] M. Daquino, S. Peroni, and D. Shotton, ‘The OpenCitations Data Model’, 2020, doi: 10.6084/M9.FIGSHARE.3443876.V7.

[13] S. Peroni, D. Shotton, and F. Vitali, ‘A Document-inspired Way for Tracking Changes of RDF Data’, in Detection, Representation and Management of Concept Drift in Linked Open Data, Bologna, 2016, pp. 26–33. Available: http://ceur-ws.org/Vol-1799/Drift-a-LOD2016_paper_4.pdf

[14] I. Heibi, S. Peroni, and D. Shotton, ‘Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations’, Scientometrics, vol. 121, no. 2, pp. 1213–1228, Nov. 2019, doi: 10.1007/s11192-019-03217-6.

[15] K. Beck, Test-driven development: by example. Boston: Addison-Wesley, 2003.

[16] A. Massari, ‘time-agnostic-library: benchmark results on execution times and RAM’. Zenodo, Oct. 05, 2021. doi: 10.5281/ZENODO.5549648.

[17] D.-H. Im, S.-W. Lee, and H.-J. Kim, ‘A Version Management Framework for RDF Triple Stores’, Int. J. Softw. Eng. Knowl. Eng., vol. 22, pp. 85–106, 2012.

[18] R. Taelman, M. V. Sande, and R. Verborgh, ‘OSTRICH: Versioned Random-Access Triple Store’, in Companion Proceedings of the Web Conference 2018, 2018, pp. 127–130. Available: https://core.ac.uk/download/pdf/157574975.pdf

[19] N. F. Noy and M. A. Musen, ‘Promptdiff: A Fixed-Point Algorithm for Comparing Ontology Versions’, in Proc. of IAAI, 2002, pp. 744–750.

[20] M. Völkel, W. Winkler, Y. Sure, S. Kruk, and M. Synak, ‘SemVersion: A Versioning System for RDF and Ontologies’, 2005.

[21] M. V. Sande, P. Colpaert, R. Verborgh, S. Coppens, E. Mannens, and R. V. Walle, ‘R&Wbase: Git for triples’, 2013.

[22] T. Neumann and G. Weikum, ‘x-RDF-3X: Fast Querying, High Update Rates, and Consistency for RDF Databases’, Proceedings of the VLDB Endowment, vol. 3, pp. 256–263, 2010.

[23] A. Cerdeira-Pena, A. Farina, J. D. Fernandez, and M. A. Martinez-Prieto, ‘Self-Indexing RDF Archives’, in 2016 Data Compression Conference (DCC), Snowbird, UT, USA, Mar. 2016, pp. 526–535. doi: 10.1109/DCC.2016.40.

[24] T. Pellissier Tanon and F. Suchanek, ‘Querying the Edit History of Wikidata’, in The Semantic Web: ESWC 2019 Satellite Events, vol. 11762, P. Hitzler, S. Kirrane, O. Hartig, V. de Boer, M.-E. Vidal, M. Maleshkova, S. Schlobach, K. Hammar, N. Lasierra, S. Stadtmüller, K. Hose, and R. Verborgh, Eds. Cham: Springer International Publishing, 2019, pp. 161–166. doi: 10.1007/978-3-030-32327-1_32.
