Using Schematron to Test Transformations: Lesson Learned

Posted on April 15, 2017 by Rick Jelliffe

This decade, I have twice been involved in quite intense year-long projects to do exhaustive acceptance testing of XML transformations using Schematron. What lessons did I learn?

Projects

So the context here is the need for error-free transformations of large numbers of high-value complex documents, where the number of documents and the dollar values of the project are in the multiple millions.

One, at the Australian Prudential Regulation Authority (APRA), involved up- and down-translations for a system that converted between financial XBRL documents and a pre-existing in-house format suited to the database: so the testing was round-trip testing (A-B-A) to compare a re-generated XBRL document to the original.
The other was at a major publisher, to test conversions of documents (laws, cases, commentary, etc.) from the previous generation of schemas to the new generation that their ambitious new platform needed. These were more like acceptance tests: there was a whiff of pioneering about the offshore conversion team's work. There was not a one-to-one correspondence between the original and new schemas: one old document might be split into several new documents of different schemas. So this tested a family of related schemas (and related transformations), not just a single one. Ultimately, we made a single generic Schematron schema (slightly parameterizable) for assertions that would be true for every transformation in the family, and a smaller Schematron schema for each specific transformation.

In both cases, the tests can be summarized as figuring out what is invariant in any pair of input/output documents: the basic thing is to count corresponding elements, and to compare corresponding data values. For the XBRL documents, tests had to be order-neutral, while for the legal documents, order was critical.

In both cases, the variation possible in the documents was enormous.

In the APRA case, it was possible to generate a comprehensive set of test documents with some effort and use those. Importantly, the set was generated by finding every possibility afforded by the schemas, not by some lesser abstraction like what simple XPaths could be found in the documents.
In the publisher case, we decided to take the approach of testing the conversions of the entire corpus, millions of documents, because of the high risk of “edge cases” that had emerged in early testing. After the first six months of testing, when many issues were fixed, we moved to a quasi-statistical basis: assuming a normal distribution of errors, and wanting 99% error detection at 95% confidence, we would use a random set of 10,000 documents. Towards the end of the project, as most issues had been found, we moved to even fewer samples. We would be presented with the output documents from the latest run of conversions every few months by the offshore conversion team.

Lessons

So both projects were quite complex and successful. For me, it was interesting to try out various techniques that took Schematron out from the realm of single document validation.

Lesson: Schematron is good for this kind of testing. But the better you know XPath the more sophisticated your tests can be.
Blind Freddy. The initial validation of a document transformation does not need to be complete or thorough to be useful. In fact, jumping in to do large-scale testing may just delay prompt resolution of dumb mistakes and swamp you with errors. Lesson: don’t run validation until you first check a couple of files by eye, to check that there is not some silly human mistake. When you run validation, just run the smallest number you can first, and get those initial in-your-face errors actioned fast. Then switch to the automated tests. [velocity]

Errors cascade. A single error can, if you are not careful, cause multiple results in the SVRL report. Lesson: Have some way of validating broadly first: for example you could have a simplified schema, or you could ignore some assertions, or you could have a Schematron phase to group the patterns, or you could filter out results from the SVRL output. One useful approach is to have a guard variable, which I will show below. [velocity]
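The phase option in that lesson might look something like the following. This is only a sketch: the pattern ids, the phase names and the tests themselves are invented for illustration, and it assumes the input and output documents sit under a dummy wrapper element. How a phase is selected at run time depends on your Schematron implementation.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            queryBinding="xslt2" defaultPhase="broad">

  <!-- First pass: only the coarse counting pattern is active -->
  <sch:phase id="broad">
    <sch:active pattern="counts"/>
  </sch:phase>

  <!-- Later pass: add the finer-grained checks -->
  <sch:phase id="detailed">
    <sch:active pattern="counts"/>
    <sch:active pattern="ids"/>
  </sch:phase>

  <sch:pattern id="counts">
    <sch:rule context="/dummy">
      <!-- Hypothetical element names: p in the input, para in the output -->
      <sch:assert test="count(input//p) = count(output//para)">
        The output should have as many paragraphs as the input.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

  <sch:pattern id="ids">
    <sch:rule context="/dummy/input//p[@id]">
      <sch:assert test="/dummy/output//para[@id = current()/@id]">
        Each input paragraph with an id should have an output paragraph with the same id.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

</sch:schema>
```

Running the broad phase first gives a small report that is quick to triage; the detailed phase is worth running once the broad one is mostly clean.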
Sample size. In order to get good results, you probably need to have much bigger sample sizes than you expect. A sample size of a dozen or a few dozen may be good for the kind of Blind Freddy testing mentioned above, but is useless for a high level of quality assurance. (Unless your documents are highly uniform with only trivial variation.) Be very careful about any non-random method of generating test sets, because the brilliant criteria you use to remove unnecessary files may in fact embody the problem that needs to be tested. Lesson: Try to use statistics to guide how many documents to sample, given your quality criteria. If you don’t have the expertise to estimate, then err on the side of a large number, such as 10,000 documents randomly selected. [estimation]
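As a rough guide to why small samples are not enough, here is one common back-of-envelope model. It is not the calculation used on the project, and it assumes that flawed documents turn up independently at some rate p in a random sample of n documents. The symbols p, n and c (the confidence of seeing at least one flawed document) are introduced here only for illustration.

```latex
% Chance that a random sample of n documents contains at least one
% document with a flaw that affects a fraction p of the corpus,
% and the smallest n that achieves confidence c:
\[
  P(\text{at least one hit}) = 1 - (1 - p)^{n}
  \qquad\Longrightarrow\qquad
  n \ge \frac{\ln(1 - c)}{\ln(1 - p)}
\]
% Worked example: a flaw affecting 1% of documents, 95% confidence
\[
  c = 0.95,\quad p = 0.01
  \;\Rightarrow\;
  n \ge \frac{\ln 0.05}{\ln 0.99} \approx 298.1,
  \quad\text{so } n = 299
\]
```

Under the same assumptions, a sample of 10,000 documents has a 95% chance of catching a flaw that affects only about 3 documents in every 10,000, which is the kind of edge case a few dozen samples will almost never show.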
Plumbing. A good deal of work is involved in setting up the framework. We used XProc for one of the cases, and I don’t know that it made life less complicated, but in a Java environment it certainly reduced load times. Lesson: Estimate enough time for implementing the running and reporting framework. [estimation]
Primary document. The main technical decision is whether to validate the input document as-is and pull the output document into a variable (by passing the file name as a command-line parameter), or to make a fake root element and copy the input and output documents under that. This makes a definite difference in Schematron: in the first edition of ISO Schematron you can only have rules matching patterns in a single document. (In ISO Schematron 2016, you can run patterns on external documents too.) So in the first method, you can only make assertions like “when you find this context in the original, you should be able to extract some count and that count should also be found in the output document”, while in the second you can validate forward and back. Lesson: Putting the input and the output documents under a single dummy element provides the most flexibility.
<sch:rule context="/dummy/input/.../chapter">
  <sch:assert test="count(p) = count(/dummy/output/.../chap[@id=current()/@id]/para)">
    The output should have the same number of paragraphs in each chapter as the input has.
  </sch:assert>
</sch:rule>
[effectiveness]
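To make the dummy-root approach concrete, a combined document might look like the sketch below. The dummy, input, output, chapter, chap, p and para names come from the XPaths in this post; the book and doc wrappers, the id values and the text are made up, and the sketch assumes the id sits on the chapter and chap elements.

```xml
<dummy>
  <!-- Copy of the original document, unchanged -->
  <input>
    <book>
      <chapter id="c1">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </chapter>
    </book>
  </input>
  <!-- Copy of the converted document, unchanged -->
  <output>
    <doc>
      <chap id="c1">
        <para>First paragraph.</para>
        <para>Second paragraph.</para>
      </chap>
    </doc>
  </output>
</dummy>
```

Because both branches live in one tree, a rule whose context is in the input branch can reach into the output branch with an absolute path, and a rule in the output branch can look back at the input in the same way.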
Reporting. If you are validating tens of thousands of documents against hundreds of assertions, you need to distill the results. I found that spreadsheets, with a row for each document, a column for each assertion, and a count of failures in each cell, were most useful. This could be further distilled into totals per assertion. This can then feed into your reporting and prioritization: if you have a large corpus of documents and you find that only a handful of documents fail an assertion, check whether there is some alternative markup that can be used to avoid triggering the flaw: sometimes this is a more pragmatic approach than getting the bug fix described and implemented. If there is some incredibly common structural error, is it a sign that the transformation is doing something sensible but the schemas are not sensible? Lesson: Use spreadsheets to aggregate the SVRL into counts and numbers that can inform management on the big picture: details are great but the big picture is vital. [value]
Severity. Each assertion should be given (for example using the assert/@role attribute) a severity level. For example, you might decide that dropped or out-of-order text is a SEV1 error, while text that has the bolding tags omitted is a SEV3. This provides a rational way of prioritizing errors, and is critical for acceptance testing, to know what is beyond the pale. And, going the other way, try to develop assertions for the high-severity/high-risk/high-impact errors first. Lesson: Relate assertions to business value, and report in terms of that value. [value]
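A minimal sketch of severity levels carried on the role attribute follows. The SEV1 and SEV3 labels are the ones used in the lesson above; the b and bold element names, and the assumption that paragraphs carry matching id attributes, are hypothetical.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern id="severity-demo">
    <sch:rule context="/dummy/input//p[@id]">
      <!-- The first output paragraph with the same id, if any -->
      <sch:let name="out" value="(/dummy/output//para[@id = current()/@id])[1]"/>

      <!-- SEV1: text was dropped or altered -->
      <sch:assert role="SEV1" test="normalize-space(.) = normalize-space($out)">
        The output paragraph should have the same text as the input paragraph.
      </sch:assert>

      <!-- SEV3: bolding tags were omitted -->
      <sch:assert role="SEV3" test="count(b) = count($out/bold)">
        The output paragraph should have the same number of bold runs as the input paragraph.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Implementations that produce SVRL generally copy the role value onto each failed assertion in the report, so the severity can be used later for sorting and filtering; check how your own toolchain handles it.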
Sharing. For the publishing conversion, the offshore conversion team asked to be given our tests, to help their development. Initially, I was not keen: wouldn’t it mean the developers would just try to satisfy our acceptance tests rather than their conversion instructions? And how would we coordinate bug fixes in the Schematron schemas, given that a certain proportion of failed assertions were always because of bad test XPaths, and given that we added more tests as we found problems? My product was the validation results, not the validators. Would they have the sophistication to maintain the very complex code (we had XPaths with 40 predicate expressions in some cases, easily four times more complex than any XPaths I had seen in the wild before)? In the end, I think I relented and sent the generic Schematron schema. Lesson: The closer the tests are to the source of the problem, and the more that they can be automated at that point, the better. [feedback]
Human effort. When you have a large number of documents to test, and a large number of assertion failures, you still need human eyes to triage the results, in particular to spot assertions that have failed because of an error in the assertion XPaths. And that puts a resource limit on the number of errors that can be inspected. Lesson: Machines are good for things that humans are bad at, and vice versa. [estimation]
Masking errors. And often you may find that an assertion fails because of more than one flaw. So adding in more specific assertions, with guards as described below, to distinguish cases can be very useful. When you find and report a failed assertion in terms of a specific flaw you find in triage, refactor the schema to have an assertion to cover that specific flaw only, and a guard so that the broader assertion is still otherwise active. Lesson: Don’t assume that only one flaw triggered all the failures of an assertion in a corpus that is too large to check every failure in. [effectiveness]
Focus. For the publishing project, we tried to test everything we could for order and text dropping, but we only added tests for attributes that were particularly valuable or risky: id attributes in particular. This was a pragmatic choice. When testing found a problem in passing, we would add a test for that flaw, as part of the feedback loop. Lesson: Respond to risk, and don’t lose your investment in test creation. [value]
Weakness. Sometimes, you need a bit of imagination. Will you have to make do with a weaker test that captures most but not all of a particular assertion test? Will you need to do something difficult to really get the test you need? For example, since we had decided that text order was a SEV1 error, we wanted to test it better: the method we used was, for each paragraph in the input document, to search for a paragraph in the output document that started with the same 10 characters and which had the same immediately preceding non-space text. For example, if there was some input document pattern government.</p></section><section><title>Responsible ... then we would look in the output document for
title[starts-with(., 'Responsible')][ends-with(preceding::text()[normalize-space(.) != ''][1], 'government.')]
This assertion does not actually test whether the corresponding element is in the correct position, since there could be more than one occurrence of the input and output pattern: instead it asks whether there is at least one of these expected outputs, and that is good enough. In most cases, the cost of the acceptance testing needs to be cheaper than the cost of the development.
Lesson: Use a good weak test if a strong test is impractical. [effectiveness]

Guard Variable

A “guard” is a negative predicate or term in an XPath that filters out unwanted cases.

Schematron requires fewer guard constructs than many other languages. This is because in a pattern, the lexically first rule whose context matches a node is the one that fires. A pattern is like a big case statement, or an if/then/else-if chain. A more specific rule can come before a more general rule. A nice side-effect of this is that you can have a final catch-all rule with just a context of “*” and it will catch all elements that have not been caught by a previous rule.
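As a sketch of that first-match behaviour, with made-up element names and assertions:

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern id="chapter-kinds">

    <!-- Most specific context first: appendix chapters are handled here -->
    <sch:rule context="chapter[@type='appendix']">
      <sch:assert test="title">An appendix should have a title.</sch:assert>
    </sch:rule>

    <!-- Every other chapter: no not(@type='appendix') guard is needed,
         because appendix chapters were already claimed by the rule above -->
    <sch:rule context="chapter">
      <sch:assert test="p">A chapter should contain at least one paragraph.</sch:assert>
    </sch:rule>

    <!-- Catch-all: only elements not matched by an earlier rule in this pattern -->
    <sch:rule context="*">
      <sch:report test="true()">
        Element <sch:name/> is not covered by any specific rule.
      </sch:report>
    </sch:rule>

  </sch:pattern>
</sch:schema>
```

Swapping the first two rules would silence the appendix check entirely, since the general chapter rule would then claim every chapter first.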

But sometimes guards are still necessary, for example when there are document dialects or classes of documents. We make a variable so that the guard has a name, in this case that we are interested in documents about books only:

<sch:pattern>
  <sch:let name="is-book" value="/*/@type='book' and /*/@isbn"/>
  <sch:rule context="paragraph[ $is-book ]">
    ...

So how do you use a guard variable to reduce duplicate SVRL outputs?

The scenario is that you want to chain assertions. So you put each assertion test into a variable, and then you can use those variables as guards in subsequent assertions.

<sch:rule context="/dummy/input/.../chapter">
  <sch:let name="same-para-count" value="count(p) = count(/dummy/output/.../chap[@id=current()/@id]/para)"/>
  <sch:let name="same-child-count" value="count(*) = count(/dummy/output/.../chap[@id=current()/@id]/*)"/>

  <sch:assert test="$same-para-count">
    The output should have the same number of paragraphs in each chapter as the input has.
  </sch:assert>
  <!-- Guard: if the paragraph counts already differ, this test is true and nothing more is reported -->
  <sch:assert test="not($same-para-count) or $same-child-count">
    The output should have the same number of child elements in each chapter as the input has.
  </sch:assert>

</sch:rule>

So in the above we have two assertions, a more specific assertion that the number of paragraphs is the same in input and output branches, and a more general assertion about any children. But we don’t want the second assertion to be tested if the first assertion has failed. So we put in the guard as shown, so that the second assertion will only fail if the first assertion has succeeded. That avoids floods. This kind of guard is actually very useful for tables, where you might have assertions for tables, rows, cells, colspecs and so on, with lots of opportunity for multiple errors caused by the same underlying flaw.

(There has been a request to put in a command-line option for Schematron, so that if an assertion fails then subsequent assertions in the same rule will not be tested. It seems a reasonable thing to provide, being no different to filtering the SVRL and avoiding this kind of guard.)