Converting Content Models To Schematron

This article originally appeared in an O'Reilly.com blog  November 16, 2006.

Regular grammars model constraints in the form "when we have an element with particular preceding siblings, which elements can be the following sibling?"

To do this in Schematron, we make assertions in the form

<rule context=" element [ preceding-siblings ]">
    <assert test=" immmediately-following-siblings ">The following siblings are
    expected after an <name />.</assert>
 </rule>

The preceding-siblings and following siblings may have several items, connected by the or operator. We only use one-step lookahead, so each path in the following-sibling does not have a predicate; however, the more complex the regular expression, the more predicates (including nested predicates) are needed for the preceding siblings axis.

Lets define a term. the "run", to describe a sequence of elements that corresponds to a group (sequence, choice) in the XML Schema.

Next, lets bootstrap the content model, by providing a rule that determines which elements can appear first. (This is actually an edge case of the subsequent rules, but it simplifies things conceptually to treat it separately.) Given a parent P, make a closure list of all the elements that can start S1.

<rule context=" P ">
    <assert test=" S1 or S2 ... or not(*) ">The <name /> should start with S1 or S2 ....</assert>
 </rule>

where P is the parent element, and Sn are the names of the child elements.

If the content model may be empty, the not(*) test above is used.

Next, we proceed by matching the content model to a case, in increasing order of complexity, and generating the appropriate Schematron rules for the case. Before we do this, lets handle the built-in special cases ANY, ALL, EMPTY (XSD can have particularly baroque markup for the last case). After we do this, we look at handling cardinality constraints.

NOTE: By and large, consider all the terms in the XPaths in this note either as shortcuts for longer statements or as placeholders for information calculated from the schema: I think it is clearer like this.

SPECIAL CASES

Most grammar-based schema languages have some special declared content types. We can get these out of the way.

EMPTY or XSD’s simpleTypes can be handled as

count(child::node()) = 0)

ANY is not needed, when it means completely unconstrained content. When it means a wildcard, so that only declared elements are allowed, the assertion is trivial, as are namespace-based wildcards.

ALL can be handled as

count(*) = count( {e1}...{en} )
       count({e1}) <= e1.occurrence
        ...
       count({en}) <= en.occurrence

where the occurrence is 0 or 1 in XSD 1.0 at least.

Element content can be handled by making sure there is no non-whitespace text, such as

string-length(space-normalize(.) = 0

The version for EMPTY would be something like

string-length(.) = 0

XSD has various type abstractions such as derivation by extension and restiction, however these do not need any special handling at all. (Type derivation checking is an issue for schema-checkers not instance validators.)

CONTENT MODELS

Case 1. First, lets consider the simple case of an element which is only used once in a content model: E1. The rule is easy to construct:

<rule context=" E1">
    <assert test=" F1 or F2 ... or not(following-sibling::*) ">F1 or F2 ... are
    expected after a <name />.</assert>
 </rule>

where E1 is an element name, and Fn is shorthand for following-sibling::*[1][self::Fn]

There is no preceding siblings context required. If the element can appear as the last element, the not(following-sibling::*) test is used.

This rule is enough to cope with XML DTD mixed content, with simple lists of elements, and with many simple content models such as those typically created from ER database schemas.

Case 2. Next, lets consider a case where an element E2 is used multiple times, but each time it is used it has distinct preceding sibling set. The rule to use is:

<rule context=" E2[ P1 or P2 ... or not(previous-sibling::*)]">
    <assert test=" F1 or F2 ...) ">F1 or F2 ... are
    expected after a <name />.</assert>
 </rule>
  <rule context=" E2[ P3 or P4 ...]">
    <assert test=" F3 or F4 ... or not(following-sibling::*) ">F3 or F4 ... are
    expected after a <name />.</assert>
 </rule>

where E2 is an element name, and Fn is shorthand for following-sibling::*[1][self::Fn] and Pn is shorthand for preceding-sibling::*[1][self::Pn]

We have multiple rules, one for each use of E2 in the content model. If E2 can appear as the first item,
in the content model, the not(previous-sibling::*) test can be used.

Case 2a: Simplification is possible: duplicate rules can be removed; if only one rule remains, it can be simplified to use the form of rule from case 1.

Case 2b: Note that because Schematron rules are exclusive, an implementation aimed at providing editing assistance might add another general rule expected to fail such as

<rule context=" E2">
    <assert test="F1 or F2 or F3 or F4 ...">The elements F1 or F2 or F3 or F4 ... may be used
    after a <name />, depending on its position.</assert>
 </rule>

This kind of rule will also help in the presense of wildcards.

This leaves us with cases of elements that are used multiple times with different following sets and non-unique preceding siblings.

Case 3: Next we deal with where an element name is multiply used, but there is some unique element required between each use. abcdabe for example: the d element divides the string into runs in which the b element has different successors.

<rule context=" E3[ not(D1 or D2...)]">
    <assert test=" F1 or F2 ...) ">F1 or F2 ... are
    expected after a <name />.</assert>
 </rule>
  <rule context=" E3[ D1 ][not (D2 or D3 ...)]">
    <assert test=" F3 or F4 ... or not(following-sibling::*) ">F3 or F4 ... are
    expected after an <name />.</assert>
 </rule>

where E3 is an element name, and Fn is shorthand for following-sibling::*[1][self::Fn] and Dn is shorthand for preceding-sibling::Dn

Case 3a: Next we cope with elements which are multiply used and non-identical successor sets but
where each use can be distinguished by a combination of a unique element between each use and the immediately preceding elements.

<rule context=" E3[ not(D1 or D2...)][P1]">
    <assert test=" F1 or F2 ...) ">F1 or F2 ... are
    expected after a <name />.</assert>
 </rule>
  <rule context=" E3[ not(D1 or D2...)][P2]">
    <assert test=" F3 or F4 ...) ">F3 or F4 ... are
    expected after a <name />.</assert>
 </rule>
  <rule context=" E3[ D1 ][not (D2 or D3 ...)]">
    <assert test=" F3 or F4 ... or not(following-sibling::*) ">F3 or F4 ... are
    expected after an <name />.</assert>
 </rule>

Case 4: Next we extend the string matching to multiple characters. So that P1 becomes P2P1: [preceding-sibling::*[1][self::P1]] and [preceding-sibling::*[2][self::P2]] D1 becomes D2D1: preceding-sibling::Dn[preceding-sibling::*[1][self::D2]]

Case 4b: We can also conceivably extend the match to do string matching on successors, at the cost of our 1 character lookahead (i.e. streaming) to some extent. So F1 becomes F1F2: following-sibling::*[1][self::F1] and following-sibling::*[2][self::F2]

We now are down to a fairly small class of elements: particles which are used multiple time where a few successors are possible but we cannot determine which successor to use by any combination of the strings that appear immediately before the particle and the presense or absense of unique strings before or after the element.
This kind of occurrence would occur when there is a group (sequence) with duplicate element runs, where that group has multiple repititions, and is itself multiply repeated. In a sense, a grouping is an unmarked-up structure, and XPath works on mark-up not unmark-up. Furthermore, these kinds of structures, being difficult to describe in XPaths, may indeed be difficult to for humans to understand too.

We could further elaborate case 4, and allow Dn discriminators themsleves to have context and sub-discriminators, and these in turn to also allow context and sub-discriminators, but at increasing complexity and diminishing utility. XPath2 offers a richer approach too: the FLWR construct in an XPath may allow some kinds of
runs to be extracted more easily and validated against.

For the purposes of this note, we treat any remaining elements as case 4: this gives weaker validation but no false negatives.

CARDINALITY CONSTRAINTS

The content model rules above handle the cases ?, + and * as used by DTDs and RELAX NG, but not cardinality requirements of XSD: minOccurs >1 and maxOccurs > 1. One strategy would be to unroll the cardality constraints, but this would be susceptible to pathological combinatorial explosion.

Instead, lets model the constraints as futher rules.

Case 1. First, lets consider the simple case of an element which is only used once in a content model: E1. First we give some totals

<rule context=" P ">
    <assert test=" count(E1) >= E1.effectiveMinOccurs ">
        In a P, an E1 must occur at least <value-of select=" E1.effectiveMinOccurs "/> times.
    </assert>
    <assert test=" count(E1) >= E1.effectiveMaxOccurs ">
        In a P, an E1 must occur maximum <value-of select=" E1.effectiveMaxOccurs "/> times.
    </assert>
  </rule>

where P is the parent element, E1 is the child being counted, E1.effectiveMinOccurs is the multiple of all minOccurs on E1 and all ancestor groupings (zero if any are zero), and E1.effectiveMaxOccurs is the multiple of all maxOccurs on E1 and all ancestor groupings (unbounded if any are unbounded.)

If the element can appear in multiple consecutive runs with an occurrence different to the effective occurrence (e.g. (a[1-4]bc)[2-5]) then we can add assertions for these constraints as well:

<rule context="P/E1[not(following-sibling::E1)]">
        <let name="run-length" value="position() - preceding-sibling::*[not(self::E1)][1])/position()">
                <assert test="$run-length > E1.minRunLength">
                The minimum number of consecutive <name /> elements is <value-of select="E1.minRunLength"/>.
                </assert>
                <assert test="$run-length > E1.maxRunLength">
                The maximum number of consecutive <name /> elements is <value-of select="E1.maxRunLength"/>.
                </assert>
   </rule>

Case 1a: Instead of the bounds, the ancestor occurs could be used to create a formula

Case 2. Next, lets consider a case where an element E2 is used multiple times, but each time it is used it has distinct preceding sibling set.

<rule context=" P ">
    <assert test=" count(E2 [ P1 or P2 ..]) &gt;= E2.1.effectiveMinOccurs ">
        In a P, an E2 must occur at least <value-of select=" E2.1.effectiveMinOccurs "/> times
        when it occurs immediately after a P1 or P2.
    </assert>
    <assert test=" count(E2 [ P3 or P4 ..]) &gt;= E2.2.effectiveMinOccurs ">
        In a P, an E2 must occur at least <value-of select=" E2.2.effectiveMinOccurs "/> times
        when it occurs immediately  after a P3 or P4.
    </assert>
    <assert test=" count(E2 [ P1 or P2 ..]) &gt;= E2.1.effectiveMaxOccurs ">
        In a P, an E2 must occur  maximum  <value-of select=" E2.2.effectiveMaxOccurs "/> times
        when it occurs immediately  after a P1 or P2.
    </assert>
    <assert test=" count(E2 [ P3 or P4 ..]) &gt;= E2.2.effectiveMaxOccurs ">
        In a P, an E2 must occur maximum <value-of select=" E2.2.effectiveMaxOccurs "/> times
        when it occurs immediately  after a P3 or P4.
    </assert>
  </rule>

where P is the parent element, E2 is the child being counted, E2.n.effectiveMinOccurs is the multiple of all minOccurs on E2 and all ancestor groupings (zero if any are zero) of the nth use of E2, and E2.n.effectiveMaxOccurs is the multiple of all maxOccurs on E2 and all ancestor groupings (unbounded if any are unbounded) of the nth use of E2.

If the element can appear in multiple consecutive runs with an occurrence different to the effective occurrence then we can add assertions for these constraints as well:

<rule context="P/E2[not(following-sibling::E2)][P1 or P2 ...]">
        <let name="run-length.1" value="position() - preceding-sibling::*[not(self::E2)][1])/position()">
        <assert test="$run-length.1 > E2.1.minRunLength">
                The minimum number of consecutive <name /> elements in this position
                is <value-of select="E2.1.minRunLength"/>.
                </assert>
                <assert test="$run-length.1 > E2.1.maxRunLength">
                The maximum number of consecutive <name /> elements in this position
                is <value-of select="E2.1.maxRunLength"/>.
                </assert>
    </rule> 

   <rule context="P/E2[not(following-sibling::E2)][P3 or P4 ...]">
        <let name="run-length.2" value="position() - preceding-sibling::*[not(self::E2)][1])/position()">
                <assert test="$run-length.2 > E2.2.minRunLength">
                The minimum number of consecutive <name /> elements in this position
                is <value-of select="E2.2.minRunLength"/>.
                </assert>
                <assert test="$run-length.2 > E2.2.maxRunLength">
                The maximum number of consecutive <name /> elements in this position
                is <value-of select="E2.2.maxRunLength"/>.
                </assert>
   </rule>

Case 3: Next we deal with where an element name is multiply used, but there is some unique element required between each use.

<rule context=" P ">
    <assert test=" count(E3[not(D1 or D2 ...)]) &gt;= E3.1.effectiveMinOccurs ">
        In a P, an E3 must occur at least <value-of select=" E3.1.effectiveMinOccurs "/> times
        when it occurs before a D1 or D2.
    </assert>
    <assert test=" count(E3[not(D1 or D2 ...)]) &gt;= E3.2.effectiveMinOccurs ">
        In a P, an E3 must occur at least <value-of select=" E3.2.effectiveMinOccurs "/> times
        when it occurs after a D1 or D2.
    </assert>
    <assert test=" count(E3[D1 or D2 ...]) &gt;= E3.1.effectiveMaxOccurs ">
        In a P, an E3 must occur maximum  <value-of select=" E3.1.effectiveMaxOccurs "/> times
        when it occurs before a D1 or D2.
    </assert>
    <assert test=" count(E3[D1 or D2 ...]) &gt;= E3.2.effectiveMaxOccurs ">
        In a P, an E3 must occur maximum  <value-of select=" E3.2.effectiveMaxOccurs "/> times
        when it occurs after a D1 or D2
    </assert>
  </rule>

where E3 is an element name, Dn is shorthand for preceding-sibling::Dn, E3.n.effectiveMinOccurs
is the multiple of all minOccurs on E3 and all ancestor groupings (zero if any are zero) of the nth use of E3,
and E3.n.effectiveMaxOccurs is the multiple of all maxOccurs on E3 and all ancestor groupings (unbounded if any are unbounded) of the nth use of E3.

If the element can appear in multiple consecutive runs with an occurrence different to the effective occurrence then we can add assertions for these constraints as well:

<rule context="P/E3[not(following-sibling::E3)][not(D1 or D2 ...)]">
        <let name="run-length.1" value="position() - preceding-sibling::*[not(self::E2)][1])/position()">
        <assert test="$run-length.1 > E2.1.minRunLength">
                The minimum number of consecutive <name /> elements in this position
                is <value-of select="E2.1.minRunLength"/>.
                </assert>
                <assert test="$run-length.1 > E2.1.maxRunLength">
                The maximum number of consecutive <name /> elements in this position
                is <value-of select="E2.1.maxRunLength"/>.
                </assert>
    </rule> 

   <rule context="P/E2[not(following-sibling::E1)][D1 or D2 ...]">
        <let name="run-length.2" value="position() - preceding-sibling::*[not(self::E2)][1])/position()">
                <assert test="$run-length.2 > E2.2.minRunLength">
                The minimum number of consecutive <name /> elements in this position
                is <value-of select="E2.2.minRunLength"/>.
                </assert>
                <assert test="$run-length.2 > E2.2.maxRunLength">
                The maximum number of consecutive <name /> elements in this position
                is <value-of select="E2.2.maxRunLength"/>.
                </assert>
   </rule>

Case 4: Constucted on similar lines.

For the purposes of this note, we treat any remaining elements as case 4: and combine them This gives weaker validation but no false negatives.

POSITION-DEPENDENT TYPES AND OTHER FEATURES

The discussion above assumes named typing where the name of an element determines its type, such as in DTDs.

However, in XSD and RELAX NG, an element may have different content models depending on its position in a content model as well. In these cases, the context also may be tested against the various cases in exactly the same way as above. (If there are multiple element uses remaining, the constraints of all of them need to be unioned, for weaker validation with no false negatives.)

This note does not deal with other features such as RELAX NG's interleave, SGML's & connector, XSD's nillibility, xsi:type, or substitution groups. I don't anticipate they will pose unsolvable problems.

SUMMARY

It is readily possible to convert many grammar-based schemas to Schematron mechanically. We don’t need to give up streaming implementations either, though obviously Schematron implentations build on XSLT would usually just use an in-memory DOM-like structure that limits file size: a streaming implementation is possible, since forward axes such as descendent:: are not used: strictly only the child, self, ancestor axes and their siblings need to be held in memory at any time.

The method shown is not complete: some identifiable elements in some kinds of rare, convoluted content models may have some of their positional constraints merged with the constraints of another element with the same name that appears in the same context and some invalid subranges of cardinality on elements and runs may not be checked. In both cases, this slightly weaker validation will give no false negatives.