Converting XML Schemas to Schematron: (#4) Validating the XSD built-in simple types

This article first appeared on a blog at O'Reilly on October 26, 2007.

Because we are using XSLT2 as the query language for the generated Schematron schema, validating the built-in simple types from XSD is almost trivial. If we want to validate that, say, an element is a valid boolean, then the test is simply: . castable as xs:boolean
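To see what that test accepts, here is a minimal stand-alone sketch, not part of the converter. The template name main and the sample values are invented for illustration; it assumes an XSLT2 processor that can start from a named template.

<xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xsl:output method="text"/>

<!-- hypothetical entry point: report whether each sample string is a valid xs:boolean -->
<xsl:template name="main">
        <xsl:for-each select="('true', '0', 'yes', 'TRUE')">
                <xsl:value-of select="concat(., ' : ', . castable as xs:boolean, '&#10;')"/>
        </xsl:for-each>
</xsl:template>

</xsl:stylesheet>

The lexical forms of xs:boolean are true, false, 1 and 0, so the first two samples should report true and the last two false.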

What we get is an abstract rule declaration for each built-in type.

<sch:rule abstract="true" id="xsd-datatype-boolean">
  <sch:assert test=". castable as xs:boolean">
    <sch:name/> elements or attributes should have an xs:boolean type value.
  </sch:assert>
</sch:rule>

This can then be used by an element. Say the element imametadataman is boolean:

<sch:rule context="imametadataman">
  <sch:extends rule="xsd-datatype-boolean"/>
</sch:rule>

(What the optimization in the previous entry in this series does is merge types, so that instead of multiple rules we can have a single rule with a combined context.)

<sch:rule context="imametadataman | ockadocknocka">
  <sch:extends rule="xsd-datatype-boolean"/>
</sch:rule>

So hurray for XSLT2 and XPath2!

Not so fast Boy Wonder

There is a rub, however. XSLT2 defines a basic conformance level (which is what the free SAXON XSLT2 transformer uses) that uses the basic XPath2 features. However, the XPath2 working group apparently went mad with their desire to make life simple for implementers, and decided that the basic level of XPath2 would understand (for castable) the built-in primitive types of XSD but not the built-in derived types. Err, well, except for integer. So then of course, because it is silly and confusing for them to be missing, a diligent implementer like Michael Kay of SAXON has to add support as a custom extension: no one's life is made simpler.

So in order to use castable without the gratuitous omissions, SAXON has to be invoked with a special attribute, which in turn has meant I have had to alter the Schematron skeleton to generate that code. (I'll release it in the next few days.) I hope the XPath2 committee realizes that the more distinctions they make, the more complex their technology becomes and the more difficult it is for us punters. I am less than impressed. Boo for XSLT2 and XPath2!

Peek at the code

Anyway, here is the basic code, which is part of the larger converter script. This is the most straightforward part of the whole project! Hurray for XSD Datatypes! First we have a list of all the type names, so we can refer to them later. Move constants to headers!

<xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:sch="http://purl.oclc.org/dsdl/schematron"
>

<xsl:output method="xml" encoding="UTF-8" indent="yes" omit-xml-declaration="no"/>

<!-- supported by Basic XSLT 2.0 processor and XPath 2.0 -->
<xsl:variable name="standard-datatypes">
        <datatype>anyAtomicType</datatype>
        <datatype>anyURI</datatype>
        <datatype>anySimpleType</datatype>
        <datatype>anyType</datatype>
        <datatype>base64Binary</datatype>
        <datatype>boolean</datatype>
        <datatype>date</datatype>
        <datatype>dateTime</datatype>
        <datatype>dayTimeDuration</datatype>
        <datatype>decimal</datatype>
        <datatype>double</datatype>
        <datatype>duration</datatype>
        <datatype>gDay</datatype>
        <datatype>gMonth</datatype>
        <datatype>gMonthDay</datatype>
        <datatype>gYear</datatype>
        <datatype>gYearMonth</datatype>
        <datatype>hexBinary</datatype>
        <datatype>integer</datatype>
        <datatype>QName</datatype>
        <datatype>string</datatype>
        <datatype>time</datatype>
        <datatype>untyped</datatype>
        <datatype>untypedAtomic</datatype>
        <datatype>yearMonthDuration</datatype>
</xsl:variable>

<!-- not supported by Basic XSLT 2.0 processor -->
<xsl:variable name="extended-datatypes">
        <datatype>byte</datatype>
        <datatype>ENTITIES</datatype>
        <datatype>ENTITY</datatype>
        <datatype>float</datatype>
        <datatype>ID</datatype>
        <datatype>IDREF</datatype>
        <datatype>IDREFS</datatype>
        <datatype>int</datatype>
        <datatype>language</datatype>
        <datatype>long</datatype>
        <datatype>Name</datatype>
        <datatype>NCName</datatype>
        <datatype>negativeInteger</datatype>
        <datatype>NMTOKEN</datatype>
        <datatype>NMTOKENS</datatype>
        <datatype>nonNegativeInteger</datatype>
        <datatype>nonPositiveInteger</datatype>
        <datatype>normalizedString</datatype>
        <datatype>NOTATION</datatype>
        <datatype>positiveInteger</datatype>
        <datatype>short</datatype>
        <datatype>token</datatype>
        <datatype>unsignedByte</datatype>
        <datatype>unsignedInt</datatype>
        <datatype>unsignedLong</datatype>
        <datatype>unsignedShort</datatype>
</xsl:variable>

...

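Now we generate an abstract rule for each of these types. The unrestricted string type never needs validation, so its assertion test is always true(). Each rule first binds the whitespace-normalized value to $norm, and every other type is tested with castable against that. We also generate a custom diagnostic element for each abstract type.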

<xsl:for-each select="$standard-datatypes/datatype">
        <xsl:variable name="dataType" select="."/>
        <sch:rule abstract="true" id="{concat('xsd-datatype-', $dataType)}">
                <sch:let name="norm" value="normalize-space(.)"/>
                <!-- pick the assertion test for this type -->
                <xsl:choose>
                        <xsl:when test=" $dataType = 'string' ">
                                <!-- strings don't need checking -->
                                <sch:assert test="true()"
                                        diagnostics="{concat($dataType, '-diagnostic')}">
                                        <sch:name/><xsl:text> elements or attributes should have a </xsl:text>
                                        <xsl:value-of select="$dataType"/><xsl:text> type value.</xsl:text>
                                </sch:assert>
                        </xsl:when>
                        <xsl:otherwise>
                                <!-- every other type: can the normalized value be cast to it? -->
                                <sch:assert test="{concat('$norm castable as xs:', $dataType)}"
                                        diagnostics="{concat($dataType, '-diagnostic')}">
                                        <sch:name/><xsl:text> elements or attributes should have a </xsl:text>
                                        <xsl:value-of select="$dataType"/><xsl:text> type value.</xsl:text>
                                </sch:assert>
                        </xsl:otherwise>
                </xsl:choose>
        </sch:rule>
</xsl:for-each>
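As an illustration, for the date entry in the list the loop should emit roughly the following abstract rule. This is worked out by hand from the template, not captured from a run; it is re-indented for readability and any namespace declarations the serializer adds are left out.

<sch:rule abstract="true" id="xsd-datatype-date">
  <sch:let name="norm" value="normalize-space(.)"/>
  <sch:assert test="$norm castable as xs:date" diagnostics="date-diagnostic">
    <sch:name/> elements or attributes should have a date type value.
  </sch:assert>
</sch:rule>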

And here is the code for generating a diagnostic element. The intent is that a user can tailor these if needed. (In Schematron we make a distinction between the assertion text, which is a positive statement of what should be true in the document, and the diagnostic, which contains specific messages for describing, locating and correcting the problem.)

<!-- generate diagnostics for standard datatypes check -->
<xsl:template name="generate-standard-datatypes-diagnostics">
        <xsl:for-each select="$standard-datatypes/datatype">
                <xsl:variable name="dataType" select="."/>
                <sch:diagnostic id="{concat($dataType, '-diagnostic')}">
                        <xsl:text> "</xsl:text><sch:value-of select="."/>
                        <xsl:text>" is not a value allowed for xs:</xsl:text>
                        <xsl:value-of select="$dataType"/><xsl:text> datatypes.</xsl:text>
                </sch:diagnostic>
        </xsl:for-each>
</xsl:template>
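Again taking date as the example, the generated diagnostic should look roughly like this (worked out by hand from the template, not captured from a run). Note that the sch:value-of is copied through to the output, so it is evaluated against the failing node when the Schematron schema runs, not when the converter runs.

<sch:diagnostic id="date-diagnostic"> "<sch:value-of select="."/>" is not a value allowed for xs:date datatypes.</sch:diagnostic>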

And finally we use it when we find an element with a built-in simple type:

<sch:extends rule="{concat('xsd-datatype-', substring-after($baseon, ':'))}"/>

where $baseon is the prefixed built-in simple type name.
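For instance, suppose a hypothetical element startDate (a name invented here, not from the converter) is declared with type xs:date, so that $baseon is 'xs:date'. Assuming the surrounding rule is generated with the element name as its context, the output would be along these lines:

<sch:rule context="startDate">
  <sch:extends rule="xsd-datatype-date"/>
</sch:rule>

The substring-after call just strips the prefix, so this relies on $baseon always carrying a prefix and on the local name matching an entry in one of the datatype lists.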

On top of this, of course, there is great scope for adding much better diagnostics for problems with the datatypes. But not yet.