Converting XML Schemas to Schematron: (#11) Partial order validation for following sibling elements

This article originally appeared on the O'Reilly blog site, January 30, 2008.  Some links to the WayBack Machine may not work.

This article is part of a series describing how to convert from W3C XML Schemas to ISO Schematron. They are very different schema languages! This time we look at some code with quite complex XPaths: we want to validate that the element that follows another element in a document is one that “goes after” the first. But not necessarily immediately after: the schema might require extra elements in between, for example.

Why would we want to do that? Well, because we are approaching this systematically, and gradually expressing constraints from the most general to the most specific. XML schema content models are rather difficult, even when simplified in the way we already do when pre-processing the schema. By having a pattern that validates consecutive elements for partial order we can cope with all manner of inter-nested choice and element groups and cardinalities. We leave testing of required immediate following elements to another test (See the previous in this series Required pairs in sequences for a start.) By plugging the hole with a big rock, we need smaller pebbles for the remaining gaps.

(I have previously raised the use of partial order for schemas in my single-element schema language Hook,   and readers.)

Output

Lets start off by showing what we want to achieve. We want to generate from any XSD content model rules like the following:

<sch:rule context="Address/StreetOrPOBox">
         <sch:assert test="not (following-sibling::*)  or
                   (following-sibling::*[1][name() ='Suburb']  or
                    following-sibling::*[1][name() ='State']  or
                    following-sibling::*[1][name() ='Postcode'])">
                        When in a  "Address" element, the element "StreetOrPOBox" can only be followed
                        (perhaps with other elements intervening)
                        by the following elements: Suburb, State, Postcode
      </sch:rule>

And for elements which cannot have any followers, we want to generate rules like the following:

<sch:rule context="Address/Postcode">
         <sch:assert test="not(following-sibling::*)">
                When in a "Address" element, the element "Postcode" should not be
                followed by any other element.
         </sch:assert>
      </sch:rule>

Main Loop

Here is the start of the named template

<xsl:template name="generate-following-elements-checking-rule">

In this we first select all the element declarations or references in the XSD schema which are particles in a content model. Remember that we have pre-processed the schema modules into a single file so we don’t need to worry about import, include and global complexType declarations. And we are not supporting some features at this stage, such as wildcards, substitution groups and dynamic typing, which simplifies our life quite a lot, though unfortunately not entirely.

<!--  For every use of an element in any content model -->
        <xsl:for-each select="//xs:schema/xs:element//xs:element[not(parent::xs:all)]">
                <!-- Sort them so that local declarations come before globals, and so that deep path
                declarations come before shallow ones -->
                <xsl:sort select="count(ancestor::xs:element)" order="descending" />

Note that we are not worrying about elements in an xs:all group, because these have no partial order constraints that are not tested by the patterns for allowed elements and required elements. We sort our particles longest first so that local declarations are tested before global ones.

Now in this scope we make some convenience variables:

<!--  Store the name of the parent element -->
                <xsl:variable name="parent-element-name" select="ancestor::xs:element[1]/@name"/>
                <xsl:variable name="parent-element" select="ancestor::xs:element[1]"/>
                <!--  Store the context path -->
                <xsl:variable name="path-to-parent">
                        <xsl:for-each select="ancestor::xs:element"
                            ><xsl:value-of select="@name"/>/</xsl:for-each>
                </xsl:variable>

Handling repeating choice elements

Next, we handle a special case. Probably there are more special cases like this, and identifying them would help trim the output Schematron schema and reduce redundant messages.

This is the common case of (a | b | c)*, a single repeating choice group. Like an xs:all there is no need to generate declarations for these (though declarations could be made.)

<!--  Handle special case -->
        <xsl:when test="parent::xs:choice
                [@maxOccurs='unbounded' or @maxOccurs > 1 ]
                [parent::xs:complexType
                        [count(xs:choice)=1]
                        [count(xs:sequence)=0]
                or parent::xs:element
                        [count(xs:choice)=1]
                        [count(xs:sequence)=0]]
                [count(child::xs:choice)=0]
                [count(child::xs:sequence)=0]">
                <!--  If the parent is a repeating choice element and its parents only have that choice,
                        and that choice element only has element particles for children
                        then we can treat it as a special case: it has no extra positional constraints than the
                        presence constraints don't catch. -->

                <!--  only generate the rule when we come to the first subelement -->
                <xsl:if test="not(preceding-sibling::xs:element)">
                        <xsl:comment> No sequence constraints for element <xsl:value-of select="$parent-element-name"/>.</xsl:comment>
                </xsl:if>
                </xsl:when>

Identify followers

Now comes the more heart of the matter. We want to identify various kinds of followers, each in variables containing the sequence of possible elements; we use various XPaths to locate the possible elements and put them into variables. This is fraught with error!

At the end, we collect them into a variable followers which hopefully has all the elements we need.

Remember that we are looping through all the element particles in all the content models (except for xs:all and (a | b | c)* models) one by one.

The first variable repeating-cousins holds all the elements particles which belong to the same parent as the current element, but have anywhere between them and the parent element, some kind of repetition&,dash; it could be on the element itself or on a parent sequence or choice. Any of these elements can follow our candidate element, by partial order.

The second variable subsequent-cousins traverses up the document tree-of-nodes from our candidate element and finds every time there is a sequence element: all elements that are direct particles are selected.

The third variable subsequent-nephews is a more elaborate version of this: it selects all the descendants element particles of following groups in sequences.

<!--  Handle the normal case -->
                <xsl:otherwise>

                <xsl:variable name="repeating-cousins"
                    select="$parent-element//*
                        [@maxOccurs='unbounded' or @maxOccurs > 1]
                        [.//*=current() or .=current()]
                        /descendant-or-self::xs:element
                                [ancestor::xs:element[1] is $parent-element]" />

                <xsl:variable name="subsequent-cousins"
                        select="ancestor-or-self::*
                                [parent::xs:sequence]
                                [ancestor::xs:element[1] is $parent-element]
                                /following-sibling::xs:element "/>

                <xsl:variable name="subsequent-nephews"
                        select="ancestor-or-self::*
                                [parent::xs:sequence]
                                [ancestor::xs:element[1] is $parent-element]
                                /following-sibling::*//xs:element "/>                        

                <xsl:variable name="followers"
                        select="$repeating-cousins | $subsequent-cousins | $subsequent-nephews" />

Now we are set up to generate our rules.

Note: I suspect that people would expect the code to work by generating a transitive closure for the reachable following elements of each particle, finding the possible immediate following sibling elements, then finding their possible immediately following sibling elements, and so on repeated. But recursion in this situation seems to me to be prone to exploding (in time, if not in memory) based on some other recent work I was doing on XML schema re-factoring. However, the method above (if it is correct!) uses no recursion and may be better for that reason.

Rules for elements with followers

The odd use of concat() in the context attribute is just to cope with some element particles being locally declared and others being globally declared.

The code here is not difficult. There is a little xs:if section to customize the assertion text when there is only one possible follower.

<!--  Make a rule using the current context path -->
                <sch:rule context="{concat($path-to-parent, (@name | @ref))}">
                  <!--  select  all the elements that are under any choice or sequence group which allows
                        repetition and has the current element under it-->

                <xsl:if test=" $followers " >
                            <sch:assert>
                                <xsl:attribute name="test">
                                <xsl:text>not (following-sibling::*)  or (</xsl:text>
                                <xsl:for-each select=" $followers ">
                                    <xsl:choose>
                                        <xsl:when test="@name"
                                                >following-sibling::*[1][name() ='<xsl:value-of select="@name  " />']</xsl:when>
                                        <xsl:when test="@ref"
                                                >following-sibling::*[1][name() ='<xsl:value-of select="@ref  " />']</xsl:when>
                                    </xsl:choose> 

                                 <xsl:if test="position()!=last()"> or </xsl:if>
                                </xsl:for-each>
                                <xsl:text>)</xsl:text>
                                  </xsl:attribute>
                        When in a  "<xsl:value-of select=" $parent-element-name" />" element,
                        the element "<xsl:value-of select="concat(@ref, @name)" />" can only be followed
                                <xsl:if test="count( $followers ) != 1">(perhaps with other elements intervening)</xsl:if>
                                by the following elements:
                                <xsl:for-each select=" $followers">
                                        <xsl:value-of select="@name | @ref" />
                                        <xsl:if test="position()!=last()">, </xsl:if>
                                </xsl:for-each>
                        </sch:assert>
                </xsl:if>

Rules for elements with no followers

Finally, we handle the case of elements at the end of the content model. This is the case where there are no elements in the follower set.

<xsl:if test=" not( $followers ) ">
                        <sch:assert test="not(following-sibling::*)">
                        When in a "<xsl:value-of select=" $parent-element-name" />" element,
                        the element "<xsl:value-of select="concat(@ref, @name)"/>" should not be
                        followed by any other element.
                        </sch:assert>
                </xsl:if>

                </sch:rule>

          </xsl:otherwise>

          </xsl:choose>
        </xsl:for-each>
</xsl:template>

(Acute people might be wondering whether we need any rules to test that an element must start with a particular element. When the same element can appear more than once in a content model, that might indeed be useful. When it only appears once, and is required, the required element rules will report it. Where it is optional, if it goes in the wrong place these partial order rules should catch it. So I am not sure it is an important case at this grain, for us: we are not really doing making any effort to handle multiple particles, though certainly I expect most patterns we have used so far will cope with them. But it remains another issue to audit!)