Software for ensuring semantic data integrity

Nicholas Rejack, MS (1), Christopher P Barnes (1), Michael Conlon, PhD (2)
firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
(1) Clinical and Translational Science Informatics and Technology, University of Florida
(2) Clinical and Translational Science Institute, University of Florida

Introduction

Upkeep of data integrity requires constant vigilance. The problem of ensuring data integrity in a DBMS is well understood, but this is not the case for semantic web triple data.

UF VIVO: a researcher networking application

The University of Florida has implemented VIVO (http://vivo.ufl.edu), a semantic web application for researcher networking. Although VIVO will reject malformed RDF/XML that does not pass validation, it places relatively few restrictions on the data it will accept.

Types of data integrity

Unique identifiers must truly be unique per individual, a property defined as a book title must not hold chapter headings, people must not also be classed as organizations, and so forth. Ensuring that data is properly defined semantically helps supplementary processes such as automated semantic reasoning.

Dchecker is a Python script that runs daily, executing a set of associated SPARQL queries. Queries can be added indefinitely to expand its capabilities.

Some examples of data constraints checked:

- referential integrity: links between authors and their publications must be valid. Positions must be linked to people and organizations.
- domain integrity: numeric identifiers must consist only of digits.
- semantic integrity: unique identifiers must not be duplicated across people, or on a single person.
- data restrictions integrity: restricted data must not be exposed.
  People within sensitive organizations must be protected from public display.

Constraint            | DBMS                                                        | Semantic systems
----------------------+-------------------------------------------------------------+-------------------------------------------------------------------------
referential integrity | foreign keys must reference a primary key in a parent table | references to other URIs must be valid
domain integrity      | columns must be declared on a defined domain                | data properties must hold proper data types (strings, ints, etc.)
semantic integrity    | often defined identically to domain integrity               | data must conform to the definitions inherent in the ontology being used

Conclusion

Unlike a DBMS, where data can be de-normalized across many tables, semantic data on a particular subject collects on one URI. Future semantic data checking should consider the totality of facts collected on a URI to ensure semantic correctness: for example, people should not have facts that would correspond to books, such as page numbers.

Future expansion of dchecker should be able to circumvent some of the constraints of the SPARQL query language and perform multiple parallel queries to compare the results. Listing the problematic entries in the dchecker report would make data repair easier. In addition, rules for automated data correction would enable hands-free cleanup.

Fig. 1: example SPARQL queries in command line
Fig. 2: sample data quality report. Note the non-zero entries representing errors that need to be corrected.
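
The constraint types described above can be illustrated with a minimal sketch. Dchecker itself runs SPARQL queries against VIVO; the sketch below instead uses plain Python over a small in-memory list of triples so that it is self-contained. The `ex:` predicate names, the sample data, and all function names are hypothetical and are not taken from dchecker or from the VIVO ontology.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; the ex: names are
# illustrative assumptions, not VIVO ontology terms.
TRIPLES = [
    ("ex:person1", "rdf:type", "ex:Person"),
    ("ex:person2", "rdf:type", "ex:Person"),
    ("ex:pub1", "rdf:type", "ex:Publication"),
    ("ex:person1", "ex:identifier", "12345678"),
    ("ex:person2", "ex:identifier", "12345678"),  # shared with person1
    ("ex:person2", "ex:identifier", "1234A678"),  # contains a non-digit
    ("ex:person1", "ex:authorOf", "ex:pub1"),
    ("ex:person2", "ex:authorOf", "ex:pub9"),     # ex:pub9 is never typed
]


def dangling_references(triples, predicate):
    """Referential integrity: objects of `predicate` that have no rdf:type."""
    typed = {s for s, p, _ in triples if p == "rdf:type"}
    return sorted({o for _, p, o in triples if p == predicate and o not in typed})


def non_numeric_identifiers(triples, predicate):
    """Domain integrity: identifier values that are not all ASCII digits."""
    return sorted(
        {o for _, p, o in triples
         if p == predicate and not (o.isascii() and o.isdigit())}
    )


def duplicated_identifiers(triples, predicate):
    """Semantic integrity: identifier values held by more than one subject."""
    owners = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            owners[o].add(s)
    return {v: sorted(subs) for v, subs in owners.items() if len(subs) > 1}


def multiple_identifiers(triples, predicate):
    """Semantic integrity: subjects holding more than one identifier value."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            values[s].add(o)
    return {s: sorted(vals) for s, vals in values.items() if len(vals) > 1}


def build_report(triples):
    """Return {check name: violation count}; non-zero counts need repair."""
    return {
        "dangling author links": len(dangling_references(triples, "ex:authorOf")),
        "non-numeric identifiers": len(non_numeric_identifiers(triples, "ex:identifier")),
        "identifiers shared across people": len(duplicated_identifiers(triples, "ex:identifier")),
        "people with several identifiers": len(multiple_identifiers(triples, "ex:identifier")),
    }


if __name__ == "__main__":
    for check, count in build_report(TRIPLES).items():
        print(f"{check}: {count}")
```

Each check returns the offending values rather than only a count, so a report built this way could list the problematic entries as well as tally them.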