Using XSLT as a validation language

This is the article that led to Schematron; it was published in a US newsletter too. The innovation was validation as transformation; it met a surprising amount of resistance from some SGML-ers.

Francis Norton followed it up in May with an article, "Generating XSL for Schema Validation", suggesting that it would be useful to use XSLT to convert parts of a conventional schema language into XSLT.

From that, the next step was "why do a poor job on a conventional language, why not make a schema language entirely to take advantage of XPaths and validation-generation?" and Schematron was born by the end of the year. - Rick Jelliffe

Using XSL as a Validation Language

Rick Jelliffe
Academia Sinica
Taipei, Taiwan

1999-01-24

Abstract

XSL can be used as a validation language. An XSL stylesheet can be used as a validation specification. Because XSL uses a tree-pattern-matching approach, it validates documents against fundamentally different criteria than the content model. This paper gives some examples.

XSL can be used on structured documents which do not use markup declarations. And XSL used in concert with XML markup declarations seems a very nice and straightforward approach: two small languages, each good at different things.

What is missing? The current XSL does not have some features which would be desirable (how to report the current line and entity, in particular) for a user-friendly system. Regular expression pattern matching on strings would be very useful. (The main thing missing from this note is a definite way to create the message "This file is valid"; validity is shown by an empty list of validity errors.)

Definitions

A validator is software which examines a structured document (e.g., an XML or HTML document, a WebCGM document) and reports on the conformance of that document's structures against some patterns.

A validation specification is the set of these patterns expressed in some formal way, in particular for use by a program. In object-oriented software engineering terms (refer B. Meyer), a validation specification gives the pre- and post-conditions we want to assert about a structured document's structures; it is useful to make such assertions, because doing so clarifies a programmer's tasks and the capabilities and nature of the data. It can also have a valuable role in contractual conformance. In markup terms (refer TEI), a validation specification (such as a DTD) gives a theory about a document's structure.

A validator can be specified with a general purpose language, or a specific validation language. A validation language therefore embodies a theory about which kinds of patterns are common, useful, important, interesting, expected by users, easy to implement, or which cannot be validated readily by other validators or validation languages. Theories about which patterns are common, useful, etc. are in turn judgements based on particular technologies and usage domains.

Just as with programming languages, the syntax and operation of a validation language are controversial. So a validation language also embodies a theory about which syntactic and paradigmatic features are common, useful, important, interesting, expected by users, easy to implement, or which are not available in other validation languages.

A schema is a collection of rules about a document's structures. A schema definition language is not a validation language, but may contain a validation language. A schema definition language may also allow any of the following:

information about data storage, encoding, transmission and notation;
human readable documentation;
information to allow the automatic construction of input front-ends;
information about the meaning of elements, and various linkages to other schemas.

An important distinction between a schema language and a validation language is that a schema language will specify, for example, "this element is a date", while a validation language will concentrate on more lexical/structural issues: "this element should conform to the regular expression /nnnn-nn-nn/".
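
To make the distinction concrete, here is a minimal sketch of the lexical check, written in Python rather than XSL because the current XSL has no regular expression matching. The function name has_date_shape is my own, and reading "n" as any decimal digit is an assumption.

```python
import re

# Lexical pattern "nnnn-nn-nn": four digits, two digits, two digits.
DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def has_date_shape(text):
    """Return True if the whole string has the shape nnnn-nn-nn."""
    return DATE_SHAPE.fullmatch(text) is not None
```

Note that a string such as "9999-99-99" passes: the check is purely lexical, and deciding whether the value really is a date is left to the schema layer.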

Examples of validation languages are:

W3C XML markup declarations;
ISO SGML markup declarations, which are a superset of XML markup declarations;
ISO Architectural Forms, which allow a document to be validated against multiple parallel content models, keyed not only against element type names, but also against attribute values;
ISO Lexical Type Definitions, which allow element or attribute values to be validated against a POSIX regular expression;
DDML (formerly XSchema), a subset of the XML markup declarations expressed using XML instance syntax.

Limitations of Markup Declarations

The XML markup declarations (in particular, the content models) have many desirable properties as a validation language:

terse;
declarative;
simple, and modest in its aims;
fragment-friendly, since the interpretation of content models does not depend on the document context;
familiar, since their operation is familiar to people exposed to BNF or formal grammars;
standard, through the ISO heritage;
widely implemented;
understood--the nature and deficiencies of content models have been well explored for more than a decade on many projects.

However, there are situations which the markup declarations do not address, and some other system would be useful:

the markup declarations are not available as structured documents in their own right (in the absence of nodes in DOM to do this);
this in turn prevents hypertext linking, structured annotations, and extending the validation language to become a full schema definition language;
various kinds of partial validation, where only targeted structures are checked;
extended validation, where more than the immediate context is checked--for example to check that

if a certain attribute is specified with a particular value, some other attribute has also been specified; or
a certain element type is not used if its parent's parent is some other element type (e.g., to exclude an RDF:RDF element from any subelement of an RDF:RDF element).

The XML markup declarations are part of XML. In my view, there is scope for the development of a validation language which complements XML markup declarations rather than reinventing them. (No disrespect, criticism or lack of enthusiasm for any schema definition language or validation language is intended by this comment.)

XSL Match Patterns

Such a language already exists: XSL. XSL match-patterns represent a very different view of a document's structure than XML content models. XSL match-patterns therefore can be used to complement and enhance XML content models, as well as any other content-model-based validation language.

Doing this enables us to see validation as merely another kind of document transformation. In this case, the input document is transformed into a document which marks up structures in the original which are not valid.
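
The idea can be sketched outside XSL as well. The following Python fragment is only an illustration under my own assumptions (the function name validate_to_report and the banned tuple are hypothetical): it transforms an input document into a UL element with one LI per invalid structure, so an empty list means no validity errors were found.

```python
import xml.etree.ElementTree as ET


def validate_to_report(xml_text, banned=("BLINK",)):
    """Transform a document into a UL element listing invalid structures."""
    root = ET.fromstring(xml_text)
    report = ET.Element("UL")
    for element in root.iter():
        if element.tag in banned:
            # One list item per structure that breaks the rule.
            item = ET.SubElement(report, "LI")
            item.text = 'Element "%s" has been used.' % element.tag
    return report
```

Serialising the result with ET.tostring(report, encoding="unicode") gives the output document of the transformation.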

(Note, a kind of validation can also be provided by treating validation as a kind of formatting: for example, a CSS stylesheet could be provided which highlights in red any element which is not valid. The CSS pattern-matching rules may be complex enough to create a useful validator based on this idea in some circumstances.)

This use of a transformation language for validation is hardly novel. Indeed, one reason why SGML systems constructed on top of transformation languages (e.g. OmniMark, Perl) have a good rate of success is that system developers can (and do) build extended validation systems readily. Such validators help the programmers discover structural patterns, whether useful or pathological. They can also allow looser and simpler content models in the markup declarations, resulting in better layering of validation.

The advantages of using XSL as a validation language are that it is:

terse--the match patterns are very terse, like XML content models;
declarative;
simple, and modest in its aims;
fragment-friendly, since the interpretation of content models does not depend on the document context;
familiar, since their operation will be familiar to people using XSL for transformation or formatting purposes;
widely implemented--James Clark and IBM already have XSL tools available;
understood--the nature and deficiencies of tree-based patterns have been well explored for more than a decade on many projects in languages such as OmniMark.

Template for the Validator

Following is a stub which can be used to construct a validator.

<?xml version="1.0"?>
<!-- Template for XSL Validator -->
<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/TR/WD-xsl" 
    xmlns="http://www.w3.org/TR/REC-html40" 
    result-ns=""
    xmlns:rdf="http://w3.org/TR/1999/PR-rdf-syntax-19990105#"
><!-- add any other namespace declarations above -->
  <!-- Root template - start processing here -->
  <xsl:template match="/">
    <HTML>
      <HEAD>
        <META http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
        <META http-equiv="Expires" content="0"/>
        <TITLE>Results of Validation (using XSL)</TITLE>
      </HEAD>
      <BODY>
        <H1>Results of Validation (using XSL)</H1>
        <UL>
          <xsl:apply-templates/>    
        </UL>
      </BODY>
    </HTML>
  </xsl:template>                        
  
  <xsl:macro name="element_warning_message">
    The invalid element is found at tree location <xsl:number level="multi" count="*" format="1."/>
    <xsl:if test='.[@ID]'>
    The element's ID is <xsl:value-of select="@ID"/>.
    </xsl:if>
    <xsl:if test='..[@ID]'>
    The element's parent's ID is <xsl:value-of select="../@ID"/>.
    </xsl:if>
  </xsl:macro>

  <xsl:macro name="attribute_warning_message">
    The element with the invalid attribute is found at tree location <xsl:number level="multi" count="*" format="1."/>
    <xsl:if test='.[@ID]'>
    The element's ID is <xsl:value-of select="@ID"/>.
    </xsl:if>
    <xsl:if test='..[@ID]'>
    The element's parent's ID is <xsl:value-of select="../@ID"/>.
    </xsl:if>
  </xsl:macro>
  
 <!-- Good patterns. Put your instructions here. -->
                       
 <!-- Bad patterns. Put your instructions here. -->

  <!-- Do not change after here. This handles defaulting. -->  

  <xsl:template match="text()" priority="-1">
    <!-- strip characters -->
  </xsl:template>

</xsl:stylesheet>

Accept good patterns using the following template:

<xsl:template match="pattern" priority="2">
    <xsl:apply-templates/>
</xsl:template>

Validate against bad patterns using the following template:

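<xsl:template match="pattern">
    <LI>
        <!-- put message here -->
        <!-- invoke one of the macros defined in the stub:
             element_warning_message or attribute_warning_message -->
        <xsl:invoke macro="element_warning_message"/>
    </LI>
    <xsl:apply-templates/>
</xsl:template>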

You can use these in two ways.

The positive way is to make "good patterns" which cover every context in which your element type (if that is what you are validating) is allowed to appear. Then you add, as the "bad pattern", a simple catch-all case which matches any other occurrence of the element.

The negative way is to make "bad patterns" which find element types in contexts you specifically want to deem invalid. The "good patterns" can contain any exceptions to this: you can use them to create a stop list of specific cases which break a more general rule about "bad patterns". Use the priority attribute to show that the "good patterns" should be tested before the "bad patterns".
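
The interplay of priorities can be illustrated with a small Python sketch. It is not how an XSL processor is implemented; the check function, the rule triples and the "note"/"section" house rule are all hypothetical, chosen only to show a higher-priority "good pattern" shadowing a general "bad pattern".

```python
import xml.etree.ElementTree as ET


def check(xml_text, rules):
    """Apply the highest-priority matching rule to each element.

    Each rule is (priority, predicate, message); a message of None marks
    a "good pattern" that accepts the element silently.
    """
    root = ET.fromstring(xml_text)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    ordered = sorted(rules, key=lambda rule: rule[0], reverse=True)
    errors = []
    for element in root.iter():
        for _priority, predicate, message in ordered:
            if predicate(element, parent_of.get(element)):
                if message is not None:
                    errors.append(message)
                break  # only the best-matching rule fires for an element
    return errors


# Hypothetical house rule: "note" is allowed only directly inside "section".
RULES = [
    # Good pattern, priority 2: a note whose parent is a section.
    (2, lambda el, parent: el.tag == "note" and parent is not None
        and parent.tag == "section", None),
    # Bad pattern, priority 1: any other note.
    (1, lambda el, parent: el.tag == "note",
        'Element "note" used outside "section".'),
]
```

Here the positive way is used: the good pattern lists the allowed context, and the simple bad pattern catches every remaining occurrence.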

Examples

These examples were developed with the LotusXSL beta. Slightly different syntaxes may be required for the other XSL betas (i.e., James Clark's and Microsoft's). Each example validates something which an XML markup declaration cannot directly specify.

1: Unwanted Element

This example imposes additional requirements compared to the HTML DTD. It acts a little like an SGML global exclusion, in that the content model of the markup declarations may allow the blink element, but this validation layer exposes the invalidity.

<!-- Put this in the "bad patterns" section in the template -->
<xsl:template match="BLINK">
    <LI>
        Element "BLINK" has been used. This is against our house style.
        <xsl:invoke macro="element_warning_message"/>
    </LI>
    <xsl:apply-templates/>
</xsl:template>

If a BLINK is found, a warning is generated. The location in the tree is given. The ID attribute of the element (if any exists) is given.
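
For readers without an XSL beta to hand, the same report can be approximated in Python. This is a sketch under assumptions: find_unwanted is a hypothetical name, and the location is built by numbering each element among its sibling elements, which is my reading of what the number instruction in the macro produces.

```python
import xml.etree.ElementTree as ET


def find_unwanted(xml_text, unwanted="BLINK"):
    """Return a warning for each unwanted element, with its tree location."""
    warnings = []

    def walk(element, location):
        if element.tag == unwanted:
            message = 'Element "%s" has been used at tree location %s' % (
                unwanted, "".join("%d." % n for n in location))
            if element.get("ID") is not None:
                message += " The element's ID is %s." % element.get("ID")
            warnings.append(message)
        # Number child elements 1, 2, 3, ... among their siblings.
        for position, child in enumerate(element, start=1):
            walk(child, location + [position])

    walk(ET.fromstring(xml_text), [1])
    return warnings
```

A location such as 1.2.2.1. reads as: root element, its second child, that element's second child, and that element's first child.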

2: Element Context

This example checks that an rdf:RDF element never appears as a descendant of another rdf:RDF element.

<!-- Put this in the "bad patterns" section in the template -->
<xsl:template match="rdf:RDF[ancestor(rdf:RDF)]">
    <LI>
        The element "rdf:RDF" has been found inside another element "rdf:RDF".
        <xsl:invoke macro="element_warning_message"/>
    </LI>
    <xsl:apply-templates/>
</xsl:template>
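
The ancestor test can also be sketched in Python, as a rough cross-check of what the pattern is meant to catch. The function name nested_rdf_count is hypothetical; the namespace name is copied from the stub stylesheet, and a real document may use a different one.

```python
import xml.etree.ElementTree as ET

# Namespace name copied from the stub stylesheet; adjust to your documents.
RDF_NS = "http://w3.org/TR/1999/PR-rdf-syntax-19990105#"
RDF_TAG = "{%s}RDF" % RDF_NS


def nested_rdf_count(xml_text):
    """Count rdf:RDF elements that have an rdf:RDF ancestor."""
    def walk(element, inside_rdf):
        is_rdf = element.tag == RDF_TAG
        count = 1 if (is_rdf and inside_rdf) else 0
        for child in element:
            # Descendants of an rdf:RDF element stay "inside" at any depth.
            count += walk(child, inside_rdf or is_rdf)
        return count

    return walk(ET.fromstring(xml_text), False)
```

A count of zero corresponds to the stylesheet producing no list items for this rule.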

3: Attribute Context

This example checks that whenever the "unit" attribute has the value "other", the "other-unit" attribute is also specified and has a non-empty value.

<!-- Put this in the "Bad patterns" section of your template -->
<xsl:template match='fig[(@unit="other") and (@other-unit="")]' priority="2">
    <LI>
        The element "fig" has attribute "unit" specified as "other".
        But the attribute "other-unit" has a zero length.
        <xsl:invoke macro="attribute_warning_message"/>
    </LI>
    <xsl:apply-templates/>
</xsl:template>

<xsl:template match='fig[(@unit="other") and (not(@other-unit))]'>
    <LI>
        The element "fig" has attribute "unit" specified as "other".
        But the attribute "other-unit" has not been specified.
        <xsl:invoke macro="attribute_warning_message"/>
    </LI>
    <xsl:apply-templates/>
</xsl:template>

Checking attributes requires answering two questions. First, has the attribute been specified in the document? Second, even if it is specified, does it have a zero-length value?
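
The two questions can be kept apart explicitly, as in this Python sketch (the function name check_fig_units is hypothetical, and the messages are shortened versions of the ones in the templates): a missing attribute is returned as None, while a specified but empty attribute is returned as a zero-length string.

```python
import xml.etree.ElementTree as ET


def check_fig_units(xml_text):
    """Report fig elements with unit="other" but no usable other-unit."""
    errors = []
    for fig in ET.fromstring(xml_text).iter("fig"):
        if fig.get("unit") != "other":
            continue
        other_unit = fig.get("other-unit")  # None when not specified
        if other_unit is None:
            errors.append('"other-unit" has not been specified.')
        elif other_unit == "":
            errors.append('"other-unit" has a zero length.')
    return errors
```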