“Schematron is inefficient”? plus a challenge!

Posted on January 8, 2017 by Rick Jelliffe

I think the only real complaint I sometimes hear about Schematron is that some users find it inefficient.

I am completely cool with this, in general: Schematron is a general-purpose tool/technology designed to be trivially constructed from COTS components, and the modern world's answer to inefficiency is usually to add more grunt, such as cloud solutions. We are CPU and storage rich nowadays. But many organizations have budget for software development rather than hardware, or have failed to implement their validation in a horizontally scalable way. (Or they have an optimization itch that needs to be scratched: I think that is a really good itch to have, by the way, as long as you know all the places to scratch and don't get preoccupied with just one spot.)

So I am fine with the idea that Schematron implementations may sometimes be less efficient than would be ideal, and that sometimes Schematron can be considered the first step in getting your business rules under control and expressed in a testable, concrete format, rather than in toy languages that cannot be executed (Lord, spare us!); sometimes having the business rules captured is what makes the job of creating high-efficiency validators for them tractable.

Nevertheless, with hackles firmly unraised, when I have looked into implementations claimed to be inefficient, I have each time found that the inefficiency was not intrinsic to Schematron at all. In one case, benchmarks against a proprietary product used perhaps the most inefficient implementation of Schematron possible. This indicates that caution is warranted.

Let's be brutal: any claim that "Schematron is inefficient" is bunk on the face of it: Schematron is an abstract technology, not an application. A particular implementation of Schematron may indeed be inefficient compared to some customized program that does the same task. And it may be that some users have purchasing or SOE constraints that limit them to particular implementations that are inefficient. But that is a different matter from saying that Schematron is necessarily inefficient. Too often I see a sleight-of-hand switch that attributes the performance of a particular implementation of Schematron to Schematron itself.

So the first question to ask when confronted with "Schematron is inefficient" is "What implementation and architecture are you using?" Are you using ISO Schematron, SVRL, and the normal skeleton engine? Are you using a horrible XSLT implementation (such as Xerces) rather than Saxon or the .NET version? Are you using XSLT1 rather than XSLT2? (XSLT2 has a lot of things that are complex but add power, such as regular expressions.) If you are using Java, please tell me you are not firing up the JVM for each invocation of each stage of the Schematron pipeline, and please tell me you are not recompiling the schema each time and then passing off these platform costs as if they were intrinsic to Schematron: there are tools such as Nailgun to pre-start JVMs, and anyone implementing Schematron in a service-based system would keep the JVM running. Please also tell me you have not switched to JSON because it is supposedly more efficient, and then attributed the latency of the JSON-to-XML conversion to Schematron (and XML), rather than admitting that JSON can bring certain inefficiencies too. If you are using .NET, don't forget that there are .NET versions of Saxon, including free ones, with quite good performance.
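
To give one concrete taste of that extra power: with the xslt2 query binding, an assertion can use matches() and a regular expression right in its test. This is only a hypothetical sketch; the invoice element and its number attribute are invented names.

    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
                queryBinding="xslt2">
      <sch:pattern>
        <sch:rule context="invoice">
          <!-- matches() is an XPath 2.0 function, so this needs an XSLT2-capable
               processor such as Saxon; under XSLT1 you would have to fake it. -->
          <sch:assert test="matches(@number, '^INV-[0-9]{6}$')">
            An invoice number should look like INV-123456.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
    </sch:schema>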

Second, have you forgotten how, before Schematron, you couldn't actually get the validation working and distributable properly? If you make up your own home-made language, you have to rely on developers and their skill. A system that works is always more efficient than a system that doesn't work! I am pretty proud that Schematron is not snake-oil: it has proved itself time and time again; beware vendors of proprietary systems and NIH junkies, who may find that, in order to get something of equal power and generality, they end up with something more complicated and less efficient. (Is that FUD? I think I am just talking babies and bathwater.)

Third, is there some pathological or unique problem that causes the issue? Many systems do not cope well with very large XML documents. Once your documents can go over a hundred megabytes, you need to pay attention to memory (RAM), particularly with Java and Windows. If the virtual machine's memory is getting full and it is forced into reallocation, futile garbage collection, or swapping, that can indeed show up as inefficiency, but the problem is really platform congestion. One simple way to cope can be to see whether your document can be preprocessed into sub-documents first, each of which gets validated separately, with the results then combined. This is not rocket science to implement, and it can get rid of some issues.
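
As a rough sketch of that pre-processing idea, assuming a (hypothetical) document whose root records element holds many independent record children, a small XSLT 2.0 stylesheet can write each one out as its own sub-document; each part can then be validated on its own and the SVRL reports combined afterwards.

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
      <xsl:template match="/records">
        <xsl:for-each select="record">
          <!-- Write each record to its own file: part-1.xml, part-2.xml, ... -->
          <xsl:result-document href="part-{position()}.xml">
            <xsl:copy-of select="."/>
          </xsl:result-document>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>

(For truly huge inputs you would want to do the split with a streaming tool instead, but the principle is the same.)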

Fourth, do your Schematron XPaths have some problem that is causing an explosion of path exploration? Factoring common parts of XPaths out into variables to reduce re-evaluation can be critical, especially wherever you see the dreaded //; so can using xsl:key to allow indexed lookup, and using phases so that you only test the assertions of interest for that kind of document (a sketch of these follows below). Are you reloading large tables from external XML files each time? Consider putting in a web service that the Schematron script can call to ask for the data values for some key, or even a direct ODBC/JDBC connection to a DBMS using the XSLT custom-function capability. Indeed, have you factored your Schematron validation out into a simple web service to simplify the integration point? And if you need to fail fast in a validation, consider making use of the terminate-on-assertion-fail option.
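
Here is a hedged sketch of the first three of those ideas together, with invented element names: a variable factors a shared XPath out of two asserts, an xsl:key (which skeleton implementations using the XSLT binding pass through into the generated stylesheet) replaces a // scan with an indexed lookup, and a phase lets a caller run only the order rules.

    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                queryBinding="xslt2">

      <sch:phase id="orders-only">
        <sch:active pattern="order-rules"/>
      </sch:phase>

      <!-- Index customers by @id once, instead of re-walking //customer for every order. -->
      <xsl:key name="customer-by-id" match="customer" use="@id"/>

      <sch:pattern id="order-rules">
        <sch:rule context="order">
          <!-- Evaluate the shared expression once per order, not once per assert. -->
          <sch:let name="lines" value="line[@status = 'active']"/>
          <sch:assert test="count($lines) ge 1">
            An order should have at least one active line.
          </sch:assert>
          <sch:assert test="every $l in $lines satisfies exists(key('customer-by-id', $l/@customer))">
            Every active line should refer to a known customer.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
    </sch:schema>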

I guess these points can be summarized as "Have you seriously bothered to optimize?" If you cannot be bothered to optimize Schematron, why would you expect that a system redeveloped in some other language would not also need optimization?

Now let me say again that of course I accept that for some schemas it may indeed be possible to implement a faster system using custom tools. Simple static constraints on small flat documents, or a small number of business rules relative to the document size: there are several cases where I can easily imagine less validation latency. For example, a constraint such as "The text 'HOW SILLY' should appear once in any document" could be tested much faster using the UNIX grep or Windows find utilities (assuming no complication from XML entities). No problem. The aim of Schematron is to provide a good tool for interrogating our data in certain areas where there has been a complete gap; the aim of Schematron is not that everyone should use Schematron all the time! But Schematron's combination of executability plus natural language does provide a 'virtuous circle' that I don't see in custom solutions. Indeed, you could operate by keeping Schematron rules for all constraints, but moving assertions that have some external implementation into a phase whose only purpose is to hold externally implemented assertions (see the sketch below).
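
A sketch of that last idea, with made-up names: the rule stays captured in the schema, but only a dedicated phase activates it, so the everyday validation run skips it and leaves the enforcement to the cheaper external tool.

    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">

      <sch:phase id="normal">
        <sch:active pattern="core-rules"/>
      </sch:phase>

      <!-- Assertions in this phase are enforced day-to-day by something faster
           (grep, a custom check); they are kept here so every rule is still
           stated in one testable place. -->
      <sch:phase id="externally-implemented">
        <sch:active pattern="delegated-rules"/>
      </sch:phase>

      <sch:pattern id="core-rules">
        <sch:rule context="document">
          <sch:assert test="title">A document should have a title.</sch:assert>
        </sch:rule>
      </sch:pattern>

      <sch:pattern id="delegated-rules">
        <sch:rule context="/">
          <sch:assert test="count(//text()[contains(., 'HOW SILLY')]) = 1">
            The text 'HOW SILLY' should appear once in the document.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
    </sch:schema>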

And finally, let me put out a friendly challenge:

Instead of spending your money on replacing your Schematron implementation with a non-Schematron solution, why not spend the same money to make an optimized version of Schematron more suited to your uses, and make it open source? Be a contributor: shape the technical ecosystem to reflect your organization's druthers. Write code that can benefit from a world-wide network effect rather than just being a drag on your internal budget.

Let me give an example of how this works: in the 1990s I had a project, just one day a week or less, from an Australian company that really tried to look at efficient markup systems in a smart way: the project was to figure out how to improve SGML to cope with Australian and Asian requirements. As part of this, every time I went to a conference or a client I asked about their SGML systems. What I found was that, in so many cases, people were not using full SGML tools, but had some ratty home-made simplification of SGML. Knowing this then fed into the efforts to reform SGML by simplifying it, which ultimately produced XML (because so many other organizations had the same problem). XML has succeeded beyond reason (literally), that company, Allette Systems, is still in business, and markup systems are easier to build for everyone now: participation in user-sponsored efforts to channel the technical ecosystem in directions suitable for organizations in peripheral countries is not doomed to failure.

Schematron is used throughout the world, on mission-critical projects for numerous government agencies, including homeland security. And do you know how much money these organizations have spent on developing Schematron to improve its performance? Zero. (Chuckle: I am not saying "employ me"! I am saying "be a winner: take charge of your destiny: contribute code or expertise or *anything* to the project".) Now, many organizations have great education and evaluation efforts; I admire US NIST in particular, which has helped move Schematron forward. I am not trying to be negative or to condemn anyone or anything! But there is such scope for Schematron to be improved, if only organizations could manage to be less inward-thinking and more community-minded. For example, XSLT3 allows streaming of XML, and StAX does too: this kind of approach is much better for memory handling and can be quite efficient. Create a three-year project to take the use of XPath3 with Schematron through the SDLC. Or fund some other Schematron-based component: for example, one for integrating ISO Schematron efficiently into Maven, Jenkins, Selenium, JUnit or whatever. (Here in Australia this is a particular problem: technical management sees technology as something that the Gods pass down from America, which we mortals are incapable of influencing: the failure to hop aboard the Open Source train as full participants does nothing to ensure that the direction of the major streams of technology is favourable to our companies.)
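
To make the streaming suggestion concrete (a hypothetical sketch, not a finished component): an XSLT 3.0 stylesheet can declare its default mode streamable, so a processor that supports streaming, such as Saxon-EE, can make a single pass over an arbitrarily large input without building the whole tree in memory.

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
      <!-- Process the input as a stream; elements with no matching template
           are skipped without being buffered. -->
      <xsl:mode streamable="yes" on-no-match="shallow-skip"/>

      <xsl:template match="/">
        <report>
          <xsl:apply-templates/>
        </report>
      </xsl:template>

      <!-- 'invoice' is an invented record element: each one is checked as it
           streams past, and only a small result element is kept. -->
      <xsl:template match="invoice">
        <checked id="{@id}"/>
      </xsl:template>
    </xsl:stylesheet>

Only rules whose XPaths fit streaming's single-pass restrictions can be handled this way, which is part of why turning this into a general Schematron capability would take a real, funded project.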