7 Points about RDF Validation

kendall on May 14 2013


Given the upcoming RDF Validation Workshop at the W3C, here's a brief analysis of the RDF validation landscape. There are three primary systems to consider: SPIN, Stardog ICV, and IBM's Resource Shapes.

Existing Systems: SPIN, IBM Resource Shapes, and Stardog ICV

Users can, uh..., use any of these three systems to validate RDF (i.e., Linked Data). SPIN works with TopBraid's toolchain; Stardog ICV works with Stardog, our RDF database, and with any other system that can evaluate SPARQL queries; and IBM's Resource Shapes works (or will work) with parts of IBM's Rational suite of OSLC tools.

So far, so good...and so simple. No big deal.

What are the differences between these three RDF validation tools? They are largely superficial, i.e., a matter of syntax. (That's not entirely true, but it's close enough for now.) The most obvious difference, from the point of view of users, is surface syntax; that is, the syntax that is used to capture the constraints.

  1. IBM Resource Shapes is a grammar approach: users write RDF triples using the Resource Shapes grammar or vocabulary to define constraints against RDF data to be executed by systems that support Resource Shapes.
  2. SPIN uses SPARQL for its syntax (plus some tool support). Users write SPARQL queries to define constraints against RDF data to be executed by systems that support SPIN.
  3. Stardog ICV is a polyglot approach. Users write SPARQL queries, or OWL axioms, or SWRL rules--or a mix of all three--to define constraints against RDF data to be executed by Stardog. Stardog translates SPARQL, OWL, and SWRL into equivalent SPARQL queries to be executed by any system that can evaluate SPARQL queries.

That's why we say Stardog can provide ICV services even for other RDF databases that don't support ICV natively, that is, for all the other RDF databases in the world that aren't Stardog. That's real interoperability available today in a shipping, production system.

From the user's point of view, they have a choice:

  1. Write constraints in SPARQL using SPIN
  2. Write constraints in an RDF vocabulary using Resource Shapes
  3. Write constraints in OWL, SWRL, or SPARQL using Stardog ICV

Some constraints are easier to write in one syntax than in the others. There isn't any particular reason to force users to use one and only one syntax for writing all constraints since the only reasonable basis of interoperability is SPARQL queries. The expressivity of RDF validation should be precisely the expressivity of SPARQL query evaluation against RDF data. No more, no less.
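To make the "many surface syntaxes, one SPARQL target" idea concrete, here is a minimal sketch in Python. It is not how Stardog, SPIN, or Resource Shapes actually translate anything; the `RequiredProperty` class, the `to_sparql` function, and the `example.org` IRIs are all hypothetical names invented for illustration. It only shows that a small declarative constraint ("every instance of this class must have this property") can be compiled to a SPARQL query whose results are the violations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequiredProperty:
    """Hypothetical constraint: every instance of `cls` must have `prop`."""
    cls: str   # full IRI of the constrained class
    prop: str  # full IRI of the required property


def to_sparql(constraint: RequiredProperty) -> str:
    """Return a SPARQL query that selects the resources violating the constraint."""
    return (
        "SELECT ?x WHERE {\n"
        f"  ?x a <{constraint.cls}> .\n"
        f"  FILTER NOT EXISTS {{ ?x <{constraint.prop}> ?value }}\n"
        "}"
    )


# Illustrative IRIs only; any class/property pair would do.
query = to_sparql(RequiredProperty(
    cls="http://example.org/Employee",
    prop="http://example.org/name",
))
print(query)
```

An empty result set from the generated query means the data satisfies the constraint; any system that evaluates SPARQL 1.1 could run it, which is the interoperability point made above.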

By and large, it will be RDF databases that provide RDF validation services, and the lingua franca of RDF databases is SPARQL, not nested for-loops in Jena or Sesame SAILs or OWL axioms or SWRL rules or RDF vocabularies. SPARQL queries should be the basis of interoperability and exchange, with as many surface syntaxes as the market cares to support.

Depending on how the market turns and how the W3C takes up these matters in a future standardization effort, Stardog will add support for IBM's grammar-based approach because that's trivial for us to do. In fact, Stardog will support any constraint syntax that can be efficiently translated into legal, valid SPARQL, because life is too short to obsess about syntax.


Okay, so there are some wrinkles that we care about:

  1. There are some meta-syntactic things that people want to constrain... I think that should be handled out of band, but reasonable people may differ. It's a small point in any case.
  2. When I say "SPARQL is the lingua franca" above, I mean to include SPARQL 1.1 entailment regimes; that is, Stardog ICV constrains the "inferred graph" (handwaving some...), not just the explicit graph. In Stardog ICV, an inferred statement can violate or satisfy a constraint just like an explicit statement. That's non-negotiable in terms of functionality for users. TLDR: Closed world for constraints; open world for inference.
  3. Point of pride: Not only is Stardog the only RDF database that does this at all, it's the only system that can explain validation results, that is, can automatically explain to users why (in terms of data and schema) a constraint is violated or satisfied transactionally (or not).
  4. SPARQL syntax is a perfectly good means of exchanging SPARQL queries; there's simply no need to serialize them into RDF. If vendors or systems want to do that, it's fine. But it's not a sane means of interoperability or exchange.
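The "closed world for constraints; open world for inference" point in item 2 can be sketched with a toy example in plain Python. This is an illustration under loud assumptions, not Stardog ICV's implementation: the triples, the `ex:` names, and the `infer_types` and `violations` helpers are all hypothetical, and the only inference performed is `rdfs:subClassOf` type propagation.

```python
# Triples are (subject, predicate, object) tuples; all names are illustrative.
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer_types(triples):
    """Open-world step: add the rdf:type triples implied by rdfs:subClassOf."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new triples appear (fixpoint)
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS]
        for s, p, o in list(inferred):
            if p != TYPE:
                continue
            for sub, sup in subclass_pairs:
                if o == sub and (s, TYPE, sup) not in inferred:
                    inferred.add((s, TYPE, sup))
                    changed = True
    return inferred


def violations(triples, cls, prop):
    """Closed-world step: an instance of cls with no prop value is a violation."""
    instances = {s for s, p, o in triples if p == TYPE and o == cls}
    with_prop = {s for s, p, o in triples if p == prop}
    return sorted(instances - with_prop)


data = {
    ("ex:Manager", SUBCLASS, "ex:Employee"),
    ("ex:alice", TYPE, "ex:Manager"),   # an Employee only by inference
    ("ex:bob", TYPE, "ex:Employee"),
    ("ex:bob", "ex:name", "Bob"),
}

# Constraint: every ex:Employee must have an ex:name.
explicit_only = violations(data, "ex:Employee", "ex:name")
with_inference = violations(infer_types(data), "ex:Employee", "ex:name")
print(explicit_only)    # checked against the explicit graph
print(with_inference)   # checked against the inferred graph
```

Checked against the explicit graph alone, the constraint finds nothing wrong; checked against the inferred graph, `ex:alice` is reported because she is an `ex:Employee` by inference and has no name. That difference is why validating only explicit statements is not enough.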


A few words about Stardog ICV's history. We described the idea in a research proposal to NIST, which they funded, in early 2008. That was the culmination of about 18 months of behind-the-scenes conversations in the OWL research community about how to do RDF validation. At that early stage, we were already focused on how to re-use OWL syntax to provide a high-level constraint language, which we eventually generalized to SPARQL and SWRL syntaxes, too.

The earliest published (peer reviewed, no less) description of this work from us came at OWLED 2008: Opening, Closing Worlds: On Integrity Constraints.

We delivered the first prototype to NIST in early 2009; that prototype was based on the SPARQL query engine in Pellet. So the ICV work that's in Stardog now is based on work that was done before Stardog development even started. Sometimes research to market is a series of long lines between vague dots.

We released the first version of ICV integrated with Stardog in 2011 and have been working on extending it since then, including the ability to explain ICV results automatically. That explanation work is ongoing today as we're working on automated repair plans for ICV violations. That means RDF validation isn't merely a system that tells a user that data is wrong in some way, but tells users why it's wrong and what they can do to repair it.

