Converting XML Schemas to Schematron (#1)

This article first appeared in a blog on O'Reilly on September 24, 2007. It was part of a series of articles.

In my blog Converting Content Models to Schematron I outlined some code ideas. Recently we (Topologi) have been working on an actual implementation for a client: a series of XSLT 2 scripts (xsd2sch) that we want to release as open source in a few months’ time.

Why would you want to convert XSD to Schematron?

The prime reason is to get better diagnostics: grammar-based diagnostics basically don’t work, as the last two decades of SGML/XML DTD/XSD experience make plain. People find them difficult to interpret, and they give the response in terms of the grammar, not the information domain. And error messages are reported in terms of where the error was detected, not where the error was. For example, given a content model (a, (b, c)?, c, d) and a document <a/><c/><c/><d/> you will get an error “Expected a d” at the location of the second c element; however, the problem really is that the b is missing.
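To make that example concrete, here is a minimal Python sketch. It is not part of xsd2sch, and the names ACCEPTED and first_detected_error are made up for illustration. It simply lists the two child sequences that the content model (a, (b, c)?, c, d) allows, and reports the position at which a left-to-right check first gives up.

```python
# The content model (a, (b, c)?, c, d) accepts exactly two child sequences.
ACCEPTED = [
    ["a", "c", "d"],
    ["a", "b", "c", "c", "d"],
]


def first_detected_error(children):
    """Return (index, expected_names) where a left-to-right check fails,
    or None if the children satisfy the content model."""
    candidates = ACCEPTED
    for i, name in enumerate(children):
        # Names that some still-viable sequence allows at this position.
        expected = sorted({seq[i] for seq in candidates if len(seq) > i})
        candidates = [seq for seq in candidates if len(seq) > i and seq[i] == name]
        if not candidates:
            return i, expected
    # All children consumed: valid only if some sequence ends exactly here.
    if any(len(seq) == len(children) for seq in candidates):
        return None
    expected = sorted({seq[len(children)] for seq in candidates})
    return len(children), expected


print(first_detected_error(["a", "c", "c", "d"]))  # (2, ['d'])
```

For the children a, c, c, d this reports position 2 (the second c) with d expected, even though inserting a b after the a would also repair the document.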

Schematron converted from a grammar still does not have much info to go on. Of course, the Schematron scripts should be easier to customize for tailored assertions and diagnostics. But the phase mechanism is also very useful: we can implement multiple different ways of checking the grammar and let the user decide which one provides the best information.

A secondary reason is that Schematron only needs an XSLT implementation. There is still quite a suspicion that XML Schema implementations are partial or broken. Japan Industrial Standards’ comments on Open XML were that they could not in fact even get the Schemas to run under Xerces and another major implementation. XSLT is much more common. However, we have decided to use XSLT2, and SAXON in particular, because it offers us some shortcuts.

Possibilities

One shortcut that is quite fun is this possibility (I am not sure whether we will implement this method this round, as it is outside our initial brief): by converting the child element names of an element into a string, such as “H1 p div div div table ht p” for example, and then converting a grammar such as (H1 | H2 | H3 | P | div | table)* into a regular expression equivalent, we can actually use the built-in regex recogniser of the XPath2 functions to validate the document, just using a vanilla XSLT2. And this even copes with the minOccurs/maxOccurs cardinality constraints, too.

This is rather exciting as these things go because it means that we can have a fallback validator that completely covers all the constraints of a grammar system, without leaving Schematron or the world of assertions. The downside? If implemented in a simple way, you only get the same kinds of diagnostics as a conventionally implemented XSD system will give you. But the advantage of having a complete Plan B means that we can concentrate on useful messages for the Plan A.
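The regular-expression fallback described above can be sketched in a few lines. This is only an illustration in Python, not the XSLT2 we would actually ship: the helper names children_to_string and choice_to_regex are hypothetical, and re.fullmatch stands in for an anchored pattern passed to the XPath2 matches() function. It handles just a repeated choice group, with minOccurs/maxOccurs mapped onto a {min,max} quantifier.

```python
import re


def children_to_string(names):
    """Join child element names, each followed by one space, so that
    adjacent names cannot run together (H1 + p is not H1p)."""
    return "".join(name + " " for name in names)


def choice_to_regex(names, min_occurs=0, max_occurs=None):
    """Turn a repeated choice such as (H1 | H2 | div)* into a regex.
    max_occurs=None plays the role of maxOccurs="unbounded"."""
    alternatives = "|".join(re.escape(name) for name in names)
    upper = "" if max_occurs is None else str(max_occurs)
    return "((%s) ){%d,%s}" % (alternatives, min_occurs, upper)


def is_valid(children, pattern):
    """True if the whole child-name string matches the content model."""
    return re.fullmatch(pattern, children_to_string(children)) is not None


model = choice_to_regex(["H1", "H2", "H3", "P", "div", "table"])
print(is_valid(["H1", "P", "div", "div", "table"], model))  # True
print(is_valid(["H1", "P", "ht", "P"], model))              # False
```

A full content model would need sequences and nested groups translated in the same way, but the principle is the same: one string, one pattern, one match.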

I’ll blog on how we implemented it over the next few weeks. Basically, we have a two-stage architecture: the first stage (3 XSLTs) takes all the XSD schema files and does a big series of macro processes on them, to make a single document that contains all the top-level schemas for each namespace, with all references resolved by substitution (except for simple types, which we keep). This single big file gets rid of almost all the complications of XSD, which in turn makes it much simpler to then generate the Schematron assertions.
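To give a flavour of what “resolved by substitution” means, here is a rough Python sketch. It is not the xsd2sch preprocessor: the sample schema, the PersonType name and the inline_named_types function are all invented for illustration, and it assumes type names are written without a prefix and makes only a single pass. It copies a named complex type into each element that refers to it, and leaves references to simple types such as xs:string alone.

```python
import copy
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: one named complex type and one element that uses it.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="PersonType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:element name="author" type="PersonType"/>
</xs:schema>
"""


def inline_named_types(schema_root):
    """Replace type="X" references to top-level complex types with an
    inline copy of the type definition (one pass, no prefix handling)."""
    named = {ct.get("name"): ct
             for ct in schema_root.findall("{%s}complexType" % XS)}
    # Take a snapshot first, because the loop adds children to the tree.
    for element in list(schema_root.iter("{%s}element" % XS)):
        type_name = element.get("type")
        if type_name in named:
            inlined = copy.deepcopy(named[type_name])
            del inlined.attrib["name"]   # the inline copy is anonymous
            del element.attrib["type"]   # the reference is now resolved
            element.append(inlined)
    return schema_root


root = inline_named_types(ET.fromstring(SCHEMA))
author = root.find("{%s}element" % XS)
print(author.get("type"))  # None
```

After this step the author element carries its own anonymous complex type, so a later stage can generate assertions for it without looking anything up.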

We have so far made the preprocessor, implemented simple type checking (including derivation by restriction) and the basic exception content models (empty, ALL, mixed content), with content models under way at the moment. I think the pre-processor stage might be useful for other projects involving XML Schemas.

An unexpected difficulty

Actually, the difficulty has been in an unexpected direction. XML Schemas is so unpleasant to work with that one programmer asked to be taken off the project because it was simply too much to cope with, and another has left the company (to take up an overseas appointment) but not before also getting frustrated, boggled and bogged down by XSD! Things like a complex type with simple content derived by extension from a simple type with simple content, etc., become a maze or rat’s nest. (Hopefully we have that under control and we’ll be able to attend to our backlog of other work ASAP: we have been pretty poor.)

It is interesting that in all the last almost eight years of Schematron, I don’t recall anyone complaining it was too difficult. Instead, I regularly get surprised to hear of quite important projects where it has been quietly used without fuss or drama, and just chugs away doing its thing, with everyone involved feeling (and being) in control. This week, for example, I heard about the UK taxation office’s use of Schematron for checking incoming documents being lodged. I think some of the reason for the success might be that because Schematron is small, it can be kept under control and understood, and that because there is zero support from the large software players, it is never used as part of an attempt to up-sell big hardware or message buses or protocols or enterprise systems etc.: it gets used for POX (Plain Old XML) sites.