Jürgen Rennau’s Location Trees welcomed

Posted on November 7, 2017 by Rick Jelliffe

The Proceedings of XML London 2017 don’t seem to be on their front page, but you can find them here (PDF). One paper that has really caught my eye is Hans-Jürgen Rennau’s Location trees enable XSD based tool. It seems to provide a great missing step in making XSD (W3C XML Schemas) easier to integrate into XML-based tool-chains. I was on the committee that designed XSD, and the persisting problem with XSD (even though XSD 1.1 was subsequently much improved on XSD 1.0) is that XSD closes the door on XML processing as much as it opens the door to processing by other systems. There are two problems.

  • First, there is no standard format for showing a fully-resolved and denormalized version of an XSD schema: all the tricks and gotchas of assembling a schema from the modules and namespaces into a single document where everything you need to know about any element is in one place only.
    Without this, the schemas are dead-ends for XML processing: when processing a document with, say, XSLT, you cannot consult a single document that holds the plan for your documents without a lot of complexity.
  • Second, there is no standard format (actual or de facto) for showing an XML document’s Post Schema Validation Infoset (PSVI) in XML: again it means that processing the document is a dead-end for basic XML tools.
    This situation is made a little better by the advent of schema-aware languages, such as XSLT3. But that means that instead of querying element names and attribute values, you need to have special API calls, and you are captured by a non-minimal toolchain.

I have experienced this problem myself: more than a decade ago, JSTOR put in exploratory funding for my company to develop an XSLT2-based XSD validator, where we converted XSD to Schematron. We only needed to support the subset that JSTOR used, and the approach worked, but the project was terribly unpleasant for the developers. The code is up on the GitHub Schematron site, but it is semi-abandonware.

Now, the architecture is two passes: first, create a fairly fully resolved and denormalized schema document, that is, a single file with the whole schema for all namespaces in it, and with complex/simple type definitions inlined in place of their references; second, run off Schematron equivalents. The second pass was easy, the first was horrible.
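To make the first pass a little more concrete, here is a minimal Python sketch of just one part of it: inlining named complex types in place of element/@type references. It is not the JSTOR converter's code (that was XSLT2) and not Rennau's implementation; the function name inline_named_types and the toy schema are hypothetical, and the sketch deliberately ignores imports, includes, namespaces of type names, simple types, groups and recursive types, which is where most of the real difficulty lies.

```python
import copy
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ELEMENT = f"{{{XS}}}element"
COMPLEX_TYPE = f"{{{XS}}}complexType"


def inline_named_types(schema_root, max_passes=10):
    """Copy top-level named complex types into the elements that reference them.

    Assumption: type names are matched by local name only (prefix ignored).
    The top-level definitions are left in place; max_passes bounds the work
    so that a recursive type cannot loop forever.
    """
    named = {ct.get("name"): ct
             for ct in schema_root.findall(COMPLEX_TYPE)
             if ct.get("name")}
    for _ in range(max_passes):
        changed = False
        # Snapshot the element list, because the tree is modified below.
        for el in list(schema_root.iter(ELEMENT)):
            ref = el.get("type")
            local = ref.split(":")[-1] if ref else None
            if local in named:
                inlined = copy.deepcopy(named[local])
                del inlined.attrib["name"]   # an inline type is anonymous
                del el.attrib["type"]
                el.append(inlined)
                changed = True
        if not changed:
            break
    return schema_root


# Toy schema, invented for illustration only.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book" type="BookType"/>
  <xs:complexType name="BookType">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="author" type="PersonType"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="PersonType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

root = inline_named_types(ET.fromstring(SCHEMA))
book = root.find(ELEMENT)  # the top-level "book" declaration, now self-contained
```

After the call, everything needed to understand book (its title, its author, and the author's name) hangs directly under the book declaration, which is the property the second pass relies on.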

As a side note: I was interested to see that the XSD 1.1 specification actually itself provides an XSLT transformation showing how to perform inclusions properly.  Well done, Michael or whoever was responsible for that. Perhaps the XSD to Schematron converter showed that XSLT is useful in some situations, or perhaps that running code

But this first pass is, in essence, what Jürgen Rennau’s Location Trees do. If I were doing the XSD to Schematron project today, I would certainly use his code as the first step.

But wait, there’s more! He goes on to describe some things that Location Trees might be useful for:

  • Schema queries, such as “Does the schema allow element X?”
  • Treesheets, which are pretty-printed summaries for convenient reading
  • Fact trees, which are Location Trees decorated with information gleaned from a particular document, such as the number of occurrences of a particular element in that context.
  • Metadata trees, which are Location Trees stripped of schema-ish info but decorated with extra processing or metadata attributes. The example he looks at is transformation info, to say “when you find this element in this context, output it” (using this name, for example): so you start with the Location Tree as a template for a kind of pull-based stylesheet.
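As a rough illustration of the schema-query idea in the list above, here is a small Python sketch. The nested-dictionary representation, the sample tree and the function names allows and allowed_children are all assumptions made for this example; Rennau's actual Location Trees are XML documents carrying far more information (occurrence constraints, types, attributes).

```python
# Hypothetical, heavily simplified location tree: each key is an element name
# allowed at that location, and its value is the subtree of allowed children.
LOCATION_TREE = {
    "book": {
        "title": {},
        "author": {"name": {}},
    }
}


def allows(tree, path):
    """Answer "does the schema allow an element at this path?"."""
    node = tree
    for step in path.strip("/").split("/"):
        if step not in node:
            return False
        node = node[step]
    return True


def allowed_children(tree, path):
    """List the element names allowed under the given path (empty if unknown)."""
    node = tree
    for step in path.strip("/").split("/"):
        node = node.get(step)
        if node is None:
            return []
    return sorted(node)
```

A context-sensitive editor could call something like allowed_children(LOCATION_TREE, "/book") to offer suggestions, and allows(LOCATION_TREE, "/book/name") to flag an element that has turned up in the wrong place.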

I am really excited by the queries and the fact trees because, bang, here is an implementation of something I have been wanting (and sketching) for years. I have mentioned that it would make the XSD to Schematron transformation more tractable. And it would obviously help context-sensitive editors: you can query whether the context in the actual document is allowed by the schema, or ask for suggestions about what can come next. That would have been great to have when developing the Topologi Markup Editor, where we left that part out thinking that surely some library would come along.

But the Fact Tree interests me most.   Having such a document available allows several kinds of optimizations that I have been sketching out over the years, wishing I had the energy or time or reason to implement them:

  • for DOM  builders (know exactly how big your arrays need to be to hold your document rather than using the heap, and get XPath position information for free from the index of the node in the array)
  • for Schematron, you could first check each rule and assertion to see whether it will always fail. Let’s say you have a <rule context="book"><assert test="title">A book should have at least one title</assert></rule>. If your Fact Tree says there are no book/title elements, you could replace the assert/@test with false() and fail every time. If the Fact Tree says there are no books either, you could just remove or disable that whole rule and never evaluate it. More generally, you can partially evaluate XPaths in assertions or variables using information in the Fact Tree. Is it worthwhile? It depends on how fast the Fact Tree generator is, but for queries that may have exponential performance (such as tests with absolute descendant tests “//X” that go through the whole document), it might be useful.
  • Jürgen Rennau does not mention it, but the Location Tree or the Fact Tree can be generated directly from an instance, without a schema. In effect, it forms a kind of trie data structure for the document’s markup.
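The last two points can be sketched together in Python: build a trie-like table of element paths and their occurrence counts straight from an instance, then use it to decide what to do with a simple Schematron rule. This is only a sketch under stated assumptions. The names fact_counts and classify_assert are hypothetical, the rule context is assumed to be a bare element name, the assertion test is assumed to be a single child element name, and namespaces and attributes are ignored.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def fact_counts(root):
    """Count how often each element path occurs in the instance."""
    counts = Counter()

    def walk(el, parent_path):
        path = f"{parent_path}/{el.tag}"
        counts[path] += 1
        for child in el:
            walk(child, path)

    walk(root, "")
    return counts


def classify_assert(counts, context, required_child):
    """Decide how to treat <rule context=C><assert test=child> for this document."""
    # Every path whose last step is the context element name.
    ctx_paths = [p for p in counts if p.rsplit("/", 1)[-1] == context]
    if not ctx_paths:
        return "skip rule"        # the rule can never fire
    if not any(f"{p}/{required_child}" in counts for p in ctx_paths):
        return "always fails"     # equivalent to test="false()"
    return "must evaluate"        # some contexts have the child; check each one


# Toy instance, invented for illustration only.
doc = ET.fromstring(
    "<library><book><author/></book><book><author/></book></library>"
)
counts = fact_counts(doc)
total_elements = sum(counts.values())  # what an array-based DOM builder would preallocate
```

With this instance, classify_assert(counts, "book", "title") reports that the assertion always fails, while a rule whose context is "chapter" can be skipped altogether. The same table also gives total_elements, the kind of figure an array-based DOM builder would want before it starts allocating.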

Which brings us to the conclusion that XML performance is weaker than it could be because it is so well layered: we cannot pre-index documents, or find out information about them on one day that can help faster processing of that information on another day. I guess that is the business justification for XML databases, but it seems a bit sub-optimal to me to have no indexability/optimizability with the file-based tools and too much with the database tools. Fact Trees (or something like them) would be a way to have a middle ground.

But, more immediately, Location Trees provide a real missing link for XML document processing.

Now it all comes crashing down. The old problem we had with the XSD to Schematron converter was that it was implemented in XSLT2.  The functional style was not friendly to developers, nor is XSLT2 a standard part of any OS or platform.  So it was not a project I would expect to get a community, and the code could not readily be run or turned into a standalone library.

Location Trees have a similar problem. The code is available from a GitHub project, xsdplus, but it is written in another niche language, XQuery. Using a niche language is great for prototyping and getting features for free (my Schematron implementation benefited from this), but it makes integration and distribution harder: the more we get used to App Stores and Package Managers, the less viable this will be. (This is currently an issue for Schematron too.)