“Schemas do not imply any semantics of documents”

Posted on October 9, 2017 by Rick Jelliffe

I liked this quote by RELAX NG inventor Dr Makoto Murata on a mailing list recently. I thought it was really clearly put.

Here is the exchange:

C:  The order of elements (in Schema X) {actually} seems to matter.

M:  The schema allows any order.  But this does not mean that any order is equivalent.  Different orders may well have different semantics.

C:  This is a little cryptic. If the order changes the semantics then, surely, the order is important and a sequence rather than a choice should be used?

M:  Schemas do not imply any semantics of documents. Schemas merely describe permissible sets of documents.

I think Murata-san is entirely correct as long as you think a schema is what a grammar or data definition language does. However, it shows the problem with schemas, three things really:

  • Grammar schema languages don’t express semantics because they cannot. Contrast with Schematron, where the natural language assertion is the primary focus: the assertions attempt to capture the semantics, while the XPaths attempt to model the assertions. For example, when a grammar says one element should come before another, the grammar-based schema languages provide no requirement to say why: Is it because the data is textual and therefore has a print order? Is it because the data is an array with an order? Is it to keep like things together? Is it to exclude some other elements?
    Now you might say that those questions are not “semantics”, but I think that is making a virtue out of necessity. Schematron encourages you to be explicit about those design decisions. The only reason grammars can get away with not documenting these things is that there are unspoken conventions, obvious to gurus but unclear to newbies, for each kind of document class (literature, data tables, database data) that allow those semantics to be elided without disaster.
  • It is impossible to draw an absolute line between schemas and business rules based on just “semantics”. For example, most people used to think that co-occurrence rules (if the value of property X is this, then the value of property Y can be one of those) fall in the category of business rules. But reality impinged: if a rule can never be broken, how is it a business rule? For example, that February can never have more than 29 days. Or that a CALS table cannot have more columns than the colspec fixes. I am not saying that everything is ambiguous, of course. But the demarcation between a document subset (schema) and semantics (business rules) is often really a reflection of a separation of concerns between technical analysts and business analysts: it is a manifestation of Conway’s Law.
  • You don’t know what rules the developer used to decide what constraints to enforce or not.
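
To make the contrast in the first two points concrete, here is a minimal sketch in Python rather than in any real schema language. Everything in it is an assumption made up for illustration: the `<date>` document type, the names `grammar_valid`, `RULES` and `failed_assertions`, and the choice of the February rule as the co-occurrence constraint. It is not how RELAX NG or Schematron are implemented; it only mimics the difference between a grammar that accepts a set of documents and a rule that carries its reason with it.

```python
import xml.etree.ElementTree as ET

# Hypothetical document type: <date> holds one <month> and one <day>, in any order.
ALLOWED_CHILDREN = {"month", "day"}

def grammar_valid(doc):
    """Grammar-style check: is the document in the permitted set?
    It answers yes or no and gives no reason why the set is what it is."""
    names = [child.tag for child in doc]
    return doc.tag == "date" and sorted(names) == sorted(ALLOWED_CHILDREN)

# Schematron-style rules: the natural-language assertion comes first,
# and the test is only an attempt to model that assertion.
RULES = [
    ("February can never have more than 29 days.",
     lambda d: not (d.findtext("month") == "2" and int(d.findtext("day")) > 29)),
]

def failed_assertions(doc):
    """Return the text of every assertion the document breaks.
    Assumes the document has already passed grammar_valid."""
    return [text for text, test in RULES if not test(doc)]

month_first = ET.fromstring("<date><month>2</month><day>30</day></date>")
day_first = ET.fromstring("<date><day>30</day><month>2</month></date>")

# Both orders are in the permitted set; the grammar says nothing about what order means.
print(grammar_valid(month_first), grammar_valid(day_first))

# The co-occurrence rule reports its reason in words a human can read.
print(failed_assertions(month_first))
```

Both documents pass the grammar-style check, so the "schema" tells you nothing about whether the two orders mean the same thing. The co-occurrence rule, which many would once have filed under business rules, is the part that actually says something a human can argue with.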

So I think Murata-san has a good rule-of-thumb there, that schemas don’t imply any semantics. Certainly schemas never imply all semantics, and often not even all the structural or data type or linking rules. But that is not to say that schemas don’t reflect semantic considerations through and through. The implying (ultimately, by humans) is the problem: the idea that any schema completely expresses everything about a document type.

That being said, I expect that a schema that has fixed RDF links for every element type could be said to imply semantics. But the semantics of the order of elements in a free content model would be difficult for RDF to describe, in the same way that the semantics of a computer program are difficult to describe (otherwise we would have better automated IDEs that just say “I see what you mean, this is what you should do” (ISWIM), which has been the Holy Grail of IDEs for 30 years).