University of Florida's dchecker: Software for ensuring semantic data integrity

http://www.ufl.edu ( Publisher's URL )

Material Information

Title:
University of Florida's dchecker: Software for ensuring semantic data integrity
Physical Description:
Conference Poster
Creator:
Nicholas Rejack
Christopher Barnes
Michael Conlon
Publisher:
University of Florida
Place of Publication:
San Francisco, CA
Publication Date:

Notes

Acquisition:
Collected for University of Florida's Institutional Repository by the UFIR Self-Submittal tool. Submitted by Michael Conlon.
Publication Status:
Published

Record Information

Source Institution:
University of Florida Institutional Repository
Holding Location:
University of Florida
Rights Management:
All rights reserved by the submitter.
System ID:
IR00003884:00001




Full Text


University of Florida's dchecker: Software for ensuring semantic data integrity

Nicholas Rejack, MS (1), nrejack@ufl.edu
Christopher P. Barnes (1), cpb@ufl.edu
Michael Conlon, PhD (2), mconlon@ufl.edu
(1) Clinical and Translational Science Informatics and Technology, University of Florida
(2) Clinical and Translational Science Institute, University of Florida

Introduction

Upkeep of data integrity requires constant vigilance. The problem of ensuring data integrity in a DBMS is well understood, but this is not the case for semantic web triple data.

UF VIVO: a researcher networking application

The University of Florida has implemented VIVO (http://vivo.ufl.edu), a semantic web application for researcher networking. Although VIVO will reject malformed RDF/XML that does not pass validation, it places relatively few restrictions on the data it will accept.

Types of data integrity

Unique identifiers must truly be unique per individual, a property defined as a book title must not hold chapter headings, people must not also be classed as organizations, and so forth. Ensuring data is properly defined semantically helps supplementary processes such as automated semantic reasoning.

Dchecker is a Python script that runs daily over a set of associated SPARQL queries. Queries can be added indefinitely to expand its capabilities. Some examples of data constraints checked:

- Referential integrity: links between authors and their publications must be valid. Positions must be linked to people and organizations.
- Domain integrity: numeric identifiers must consist only of integers.
- Semantic integrity: unique identifiers must not be duplicated across people, or on a single person.
- Data restrictions integrity: restricted data must not be exposed.
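The semantic integrity constraint above (unique identifiers must not be duplicated across people) can be illustrated with a minimal sketch. dchecker's actual SPARQL queries are not shown in this record, so the sketch below models triples as plain (subject, predicate, object) tuples, and the predicate name `ufVivo:ufid` is an assumption for illustration, not dchecker's real query.

```python
# Hypothetical sketch of a dchecker-style semantic integrity check:
# find unique identifier values attached to more than one person.
# Triples are plain (subject, predicate, object) tuples; the predicate
# name below is illustrative, not taken from dchecker itself.
HAS_UFID = "ufVivo:ufid"  # assumed predicate name

def find_duplicate_ids(triples):
    """Return {identifier: sorted list of subjects} for identifiers
    that appear on more than one subject."""
    owners = {}
    for s, p, o in triples:
        if p == HAS_UFID:
            owners.setdefault(o, set()).add(s)
    return {uid: sorted(subs) for uid, subs in owners.items() if len(subs) > 1}

triples = [
    ("person/1", HAS_UFID, "12345678"),
    ("person/2", HAS_UFID, "12345678"),  # duplicate across two people
    ("person/3", HAS_UFID, "87654321"),
]
print(find_duplicate_ids(triples))
# {'12345678': ['person/1', 'person/2']}
```

In the real system this check would be a SPARQL query with a GROUP BY on the identifier and a HAVING clause counting distinct subjects; the in-memory version above only shows the logic.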
People within sensitive organizations must be protected from public display.

Comparison of constraints in DBMS and semantic systems:

Constraint | DBMS | Semantic systems
Referential integrity | foreign keys must reference a primary key in a parent table | references to other URIs must be valid
Domain integrity | columns must be declared on a defined domain | data properties must hold proper data types (strings, ints, etc.)
Semantic integrity | often defined identically to domain integrity | data must conform to the definitions inherent in the ontology being used

Conclusion

Unlike DBMS systems, where data can be denormalized across many tables, semantic data on a particular subject collects on one URI. Future semantic data checking should consider the totality of facts collected on a URI to ensure semantic correctness: people should not have facts that would correspond to books, such as page numbers. Future expansion of dchecker should be able to circumvent some of the constraints of the SPARQL query language and perform multiple parallel queries to compare the results. Insertion of problematic entries into the dchecker report would make data repair easy. In addition, rules for automated data correction would enable hands-free cleanup.

Fig. 1: example SPARQL queries on the command line.
Fig. 2: sample data quality report. Note the non-zero entries representing errors that need to be corrected.
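The daily report described in the poster (non-zero entries flag errors needing correction, as in Fig. 2) can be sketched as a set of named checks, each returning a violation count. The check names, sample data, and digit-only rule below are assumptions for illustration; dchecker's real queries and report format are not reproduced in this record.

```python
# Hypothetical sketch of a dchecker-style data quality report: run each
# named check and collect its violation count; non-zero counts flag
# data that needs correction. Names and data are illustrative.
import re

def count_noninteger_ids(ids):
    """Domain integrity: numeric identifiers must consist only of digits."""
    return sum(1 for i in ids if not re.fullmatch(r"\d+", i))

def run_report(checks):
    """Run each named check callable and collect its violation count."""
    return {name: check() for name, check in checks.items()}

sample_ids = ["12345678", "87654321", "1234-5678"]  # one malformed value
report = run_report({
    "non-integer identifiers": lambda: count_noninteger_ids(sample_ids),
})
for name, count in report.items():
    print(f"{name}: {count}")
# non-integer identifiers: 1
```

In production each check would instead issue a SPARQL query against the VIVO triple store and count the rows returned, but the report structure (check name mapped to error count) is the same.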