SHRVL: Schematron Hierarchical Report View Language

It is possible to take an SVRL document and convert it into a hierarchical XML document. This can give us another weapon in our toolbelt for feature extraction, understanding patterns of invalidity, and for detecting or taking a stab at implicit structures in documents without useful explicit markup.

The advent of ODF, OOXML and Epub/XHTML means it is not a big struggle to pull word-processed documents into our XML pipelines, and we have a fair idea of what the markup means. 

But this does nothing to simplify the remaining problem: a document that has not been marked up structurally may have dozens of ways to represent the same thing.  The issue then becomes how to keep potentially hundreds of intertwined and overlapping considerations under control. How do we stop our code from becoming malignant spaghetti which we cannot reason about?

I am calling this approach SHRVL, as it shrinks an SVRL document.  I have looked at related aspects before, in posts on Feature Grammars and Probabilistic Schemas, Hidden Markov Models, and Neural Nets for XML.  (I don't expect to put out any implementation of this, unless some job requires it, but if it is useful, please go ahead and roll your own.)

An SVRL document may be transformed into a SHRVL document, or a Schematron implementation could generate it directly instead of the SVRL.

Here is the Schematron pattern that describes any SHRVL document:

<sch:pattern name="SHRVL">
  <sch:rule context="/*">
     <sch:assert test="self::shrvl">The top-level element is "shrvl"</sch:assert>
  </sch:rule>
  <sch:rule context="*">
     <sch:assert test="number(@pos)">
     Elements other than the top element "shrvl" need a pos attribute.
     It is the position of this element under its parent, in the original document.
     </sch:assert>
     <sch:assert  test="count(@*) = count(@pos) + count(@found) + count(@missed)">
     The only attributes allowed are pos, found and missed.
     @found are any svrl:successful-report/@role tokens for that position.
     @missed are any svrl:failed-assert/@role tokens for that position.
     </sch:assert>
  </sch:rule>
</sch:pattern>

Example

Let's imagine we have run Schematron over some large HTML document that has no styles etc., just simple elements.

  • For simplicity, our "Schematron Schema for Title Detection"  only uses sch:report not sch:assert.
  • So as not to swamp the reader, I will give this Schematron schema later.

We have generated SVRL output, and chosen to have each location reported in the path-position-name form of XPath (see the examples below).

Now let's fiddle with our SVRL document:

  • discard every element except svrl:successful-report elements;
  • remove attributes except  @role and @location;
  • sort them by their @location XPath into document order, if not already so;
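
The three fiddling steps above can be sketched in a few lines of Python. This is only a rough illustration, not a reference implementation: the function names fiddle and location_key are made up for this sketch, and it assumes every @location uses the path-position-name form, so that the bracketed position numbers alone are enough to sort into document order.

```python
import re
import xml.etree.ElementTree as ET

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"  # the SVRL namespace


def location_key(location):
    """Turn '/html/*[2][self::body]/*[4][self::p]' into (2, 4) for sorting.

    Assumption: locations are in the path-position-name form, so comparing
    the tuples of positions gives document order.
    """
    return tuple(int(n) for n in re.findall(r"\*\[(\d+)\]", location))


def fiddle(svrl_text):
    """Return (location, role) pairs for the successful reports only,
    sorted into document order."""
    root = ET.fromstring(svrl_text)
    pairs = []
    for report in root.iter("{%s}successful-report" % SVRL_NS):
        # Keep only @location and @role; drop whitespace inside the path.
        location = "".join(report.get("location", "").split())
        pairs.append((location, report.get("role", "")))
    return sorted(pairs, key=lambda pair: location_key(pair[0]))
```

Because Python's sort is stable, several reports at the same location keep the order they had in the SVRL.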

Let's see three examples. This is actually loosely based on a real case, and the problem should be familiar to many people in text processing. All three examples are the same scenario: we want to find the title of the document, but the title we are interested in (saying which government legislation is being dealt with) has been marked up in many different ways, including not as a heading at all, and with spurious other markup preceding it.

This is the common, nightmare scenario: a  large, growing, undisciplined corpus of undisciplined (here, XHTML, but they could be OOXML, ODF, etc.) documents, sourced from a variety of humans and processes and times and toolchains and original binary formats.  The messiness, variety and ill-discipline of the markup is our problem. 

In the examples below, the heading we ultimately want to use is marked with a ¶. So you can see there are many things that look like headings that are not, and some headings that do not look like headings but are, such as the heading interspersed in a table for a presentation effect.

Example 1.   

This one just has the heading marked up as a paragraph.

Input

<html>
  <head>
     ...
  </head>
  <body>
    <title />
    <p><b> ¶ Government Rule 1: Nutty Duck Noise Abatement</b></p>
    <p>Keep your crazed drakes quiet.</p>
    <p>But you might also look at <b>British Government Mallard Psychopathy Report 2011:37</b> for more information</p>
   ...
</html>

SVRL

In this case the fiddled SVRL looks like this:

<svrl:successful-report  role="p-with-potential-heading"
    location="/html/*[2][self::body]/*[2][self::p]" />
<svrl:successful-report  role="p-with-potential-heading"
   location="/html/*[2][self::body]/*[4][self::p]" />

SHRVL

And here is our first sighting of SHRVL:  we (or our software) transpose that fiddled SVRL into this:

<shrvl>
 <html>
  <body pos="2">
     <p pos="2" found="p-with-potential-heading"/>
     <p pos="4" found="p-with-potential-heading"/>
  </body>
 </html>
</shrvl>

You can see that the XPaths in @location have been transposed into ordered elements, and that the SVRL @role attribute has been added as @found.
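
For those rolling their own, here is one way the transposition might be sketched in Python. It is an illustration under assumptions, not a definitive implementation: build_shrvl is a made-up name, the input is assumed to be a list of (location, role) pairs already in document order, and element names are assumed to carry no namespace prefixes.

```python
import re
import xml.etree.ElementTree as ET

# One step of a path-position-name location, e.g. '*[2][self::body]'.
STEP = re.compile(r"\*\[(\d+)\]\[self::([^\]]+)\]")


def build_shrvl(pairs):
    """Build a SHRVL tree from (location, role) pairs in document order."""
    shrvl = ET.Element("shrvl")
    for location, role in pairs:
        location = "".join(location.split())   # drop stray whitespace
        root_name = location.split("/")[1]     # e.g. 'html'
        node = shrvl.find(root_name)
        if node is None:
            node = ET.SubElement(shrvl, root_name)
        # Walk down the path, creating each missing step as an element.
        for pos, name in STEP.findall(location):
            child = next((c for c in node if c.get("pos") == pos), None)
            if child is None:
                child = ET.SubElement(node, name, pos=pos)
            node = child
        # Append the role as one more token of @found.
        found = node.get("found")
        node.set("found", role if found is None else found + " " + role)
    return shrvl
```

Calling ET.tostring(build_shrvl(pairs)) serializes the result. Failed assertions could be handled the same way, writing to @missed instead of @found.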

So what we have is a view of the original document, with important features marked up and extraneous markup removed.

As promised, here is the Schematron schema we are running to generate the SVRLs.

<sch:pattern id="p1">
  <sch:rule id="r1" context="div//table[count(.//td) &lt; 5]">
    <sch:report id="a6"
      test=".//b and not(title)"
      role="div-with-simple-table"
      subject="ancestor::div[1]">A small table without a title but with some bold text may be the document heading.</sch:report>
  </sch:rule>
  <sch:rule id="r2" context="table[count(.//td) &lt; 8]">
    <sch:report id="a6b"
      test=".//b and not(title)"
      role="simple-table-not-in-div">A small table without a title but with some bold text may be the document heading.</sch:report>
  </sch:rule>
</sch:pattern>
<sch:pattern id="p2">
  <sch:rule id="r33" context="div/p">
    <sch:report id="a7"
      test=".//text()[contains(., 'Government')][contains(., ':')]"
      role="div-with-potential-heading"
      subject="parent::div">
      A div with the right words might be a heading
    </sch:report>
  </sch:rule>
  <sch:rule id="r88" context="p">
    <sch:report id="a88"
      test=".//text()[contains(., 'Government')][contains(., ':')]"
      role="p-with-potential-heading">
      A p with the right words may be a heading
    </sch:report>
  </sch:rule>
  <sch:rule id="r99" context="td">
    <sch:report id="a99"
      test=".//text()[contains(., 'Government')][contains(., ':')]"
      role="td-with-potential-heading"
      subject="ancestor::table[1]">
      A table data with the right words may be a heading
    </sch:report>
  </sch:rule>
</sch:pattern>

Example 2.   

This one has the heading in a table.

Input

<html>
  <head>
  ...
  </head>
  <body>
    <title >Duck News</title>
    <div>
         <h1>Legislation update</h1>
         <table>
           <tr> ¶
             <td><b>Government Rule 1: </b></td>
              <td><b>Nutty Duck</b></td>
           </tr>
           <tr><td></td><td>Noise Abatement</td>
           </tr>
         </table>
         <p>Keep your crazed drakes quiet.</p>
    </div>
    <div>  
        <h1>What does this mean to YOU?</h1>
        <p>Look at <b>British Government Mallard Psychopathy Report 2011:37</b> for more information</p>
   ...
</html>

SVRL

In this case the fiddled  SVRL looks like this:

<svrl:successful-report  role="div-with-simple-table"
    location="/html/*[2][self::body]/*[2][self::div]" />
<svrl:successful-report  role="td-with-potential-heading"
    location="/html/*[2][self::body]/*[2][self::div]/*[2][self::table]" />
<svrl:successful-report  role="div-with-potential-heading"
    location="/html/*[2][self::body]/*[3][self::div]" />

SHRVL

And we (or our software) transpose this into the SHRVL document:

<shrvl>
 <html>
  <body pos="2">
       <div pos="2"  found="div-with-simple-table">
          <table pos="2" found="td-with-potential-heading"/>
       </div>
       <div pos="3"  found="div-with-potential-heading" />
  </body>
 </html>
</shrvl>

Example 3.   

This one has the heading marked up as a paragraph, but there is a disguising table before it.

Input

<html>
  <head>
    ...
  </head>
  <body>
    <title >Duck Hunters' News</title>
    <div>
         <h1>Keeping Track...</h1>
         <table>
           <tr>
              <td><p>
                 <b>Latest Government Insanity #2:</b>
                 How Fulvous is your Whistling Duck anyway?</p>
              </td>
           </tr>
           <tr>
              <td><p>
                 <b>Latest Government Insanity #5:</b>
                 Marbled Duck losing its marbles?</p>
              </td>
           </tr>
           <tr>
              <td><p>
                 <b>Latest Government Insanity #8:</b>
                 Mama, bring my Canvasback!</p>
              </td>
           </tr>
          </table> 
    </div>
    <div>  
        <h1>What does this mean to YOU?</h1>
         <p><b>¶ Government Rule 1: Nutty Duck Noise Abatement</b></p>
   ...
 </html>

SVRL

In this case the fiddled SVRL looks like this:

<svrl:successful-report  role="td-with-potential-heading"
    location="/html/*[2][self::body]/*[2][self::div]/*[2][self::table]" />
<svrl:successful-report  role="td-with-potential-heading"
    location="/html/*[2][self::body]/*[2][self::div]/*[2][self::table]" />
<svrl:successful-report  role="td-with-potential-heading"
    location="/html/*[2][self::body]/*[2][self::div]/*[2][self::table]" />
<svrl:successful-report  role="div-with-potential-heading"
    location="/html/*[2][self::body]/*[3][self::div]" />

SHRVL

And we (or our software) transpose this into the SHRVL document:

<shrvl>
 <html>
  <body pos="2">
       <div pos="2" >
          <table pos="2" found="td-with-potential-heading td-with-potential-heading td-with-potential-heading"/>
       </div>
       <div pos="3"  found="div-with-potential-heading" />
  </body>
 </html>
</shrvl>

How to use it?

So now we have our three distinct patterns of documents discovered.  Our three SHRVL documents, to recap, are:

1. Heading in paragraph

<shrvl>
 <html>
  <body pos="2">
    <p pos="2" found="p-with-potential-heading"/>
    <p pos="4" found="p-with-potential-heading"/>
  </body>
 </html>
</shrvl>

2. Heading in table

<shrvl>
 <html>
  <body pos="2">
    <div pos="2" found="div-with-simple-table">
      <table pos="2" found="td-with-potential-heading"/>
    </div>
    <div pos="3" found="div-with-potential-heading" />
  </body>
 </html>
</shrvl>

3. Heading in paragraph with disguising table

<shrvl>
 <html>
  <body pos="2">
    <div pos="2" >
      <table pos="2"
        found="td-with-potential-heading td-with-potential-heading td-with-potential-heading"/>
    </div>
    <div pos="3" found="div-with-potential-heading" />
  </body>
 </html>
</shrvl>

We now can write code to decide, based on this SHRVL, where to find our title.  We might decide to favour the first td-with-potential-heading  we find (as long as it is under a div-with-simple-table), but otherwise take the first div-with-potential-heading or p-with-potential-heading.

Here is a sample XSLT.

<!-- Load the SHRVL document produced from the SVRL -->
<xsl:variable name="shrvl" select="document('shrvl.xml')"  />

<!-- @found is a list of tokens, so test each token rather than the whole value -->
<xsl:variable name="title-in-table"  as="element()?"
          select="($shrvl//*[tokenize(@found, '\s+') = 'div-with-simple-table']//
              *[tokenize(@found, '\s+') = 'td-with-potential-heading'])[1]"/>

<xsl:variable name="first-candidate-outside-table" as="element()?"
          select="($shrvl//*[tokenize(@found, '\s+') =
              ('div-with-potential-heading', 'p-with-potential-heading')])[1]"/>

<xsl:template name="select-title">
  <!-- Determine the best title from the candidates and document features -->
  <xsl:value-of select="
     if ($title-in-table[not(preceding::*[tokenize(@found, '\s+') = 'div-with-potential-heading'])])
        then
           my-fun:get-title-from-corresponding-table( / , $shrvl,
                      $title-in-table )
        else
           my-fun:get-title-from-corresponding-p-or-div( / , $shrvl,
                      $first-candidate-outside-table )" />
</xsl:template>

<xsl:template match="/">
     <section>
      <title>
         <xsl:call-template name="select-title" />
      </title>
...

Of course, where the decisions are uncomplicated, you can do it without the Schematron->SHRVL stage.    But...  the SHRVL shows clearly what you are detecting, and you can see what your logic will have to be to decide what to do. So it gives you a very intuitive way to look synoptically at the SHRVL and your troublesome new input, in a way that SVRL does not.

  • You can see whether a document has no candidates, or whether they appear in the wrong place in the document, or whether they meet some rules but there are better candidates available.
  • You are not limited to making a decision on the spot, when you find some markup: your decision can be based on all the features found anywhere in the document.
  • When a new document comes in that does not have the heading properly recognized, you can see from the SHRVL whether it is just an issue of already-detected features in some new arrangement that is not recognized, or whether more Schematron rules are needed.
  • For most developers, the SHRVL document might be easier to navigate than the XPaths in the SVRL. (Of course, dynamic XPath evaluation, such as xsl:evaluate in XSLT 3.0, could be useful here.)

The reason for doing it this way may become more obvious if we consider that we might, say, want to process our document by finding milestones, each constrained by some other milestone, before or after.

For example we might decide that image references before this detected main title are to be stripped while image references after must be processed. (This is a good job for a cascade of variables.)  But at any time, there may be some extra milestones or special cases discovered.   A recipe for spaghetti!
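
To make the milestone idea concrete, here is a small Python sketch that reads a SHRVL document and splits one kind of feature into "before" and "after" the first occurrence of another. It is only an illustration: the function names are invented, and the role img-reference is hypothetical, since the Schematron schema above has no rule that reports images.

```python
import xml.etree.ElementTree as ET


def pos_paths(shrvl, role):
    """Return the pos path, as a tuple of ints, of every element whose
    @found holds the given role token, in document order."""
    def walk(node, path):
        for child in node:
            pos = child.get("pos")
            # Elements without @pos (such as html here) add no step.
            child_path = path + (int(pos),) if pos else path
            if role in child.get("found", "").split():
                yield child_path
            yield from walk(child, child_path)
    return list(walk(shrvl, ()))


def split_at_milestone(shrvl, milestone_role, item_role):
    """Split items into those before and after the first milestone.

    Returns None if the milestone was not found at all.
    """
    milestones = pos_paths(shrvl, milestone_role)
    if not milestones:
        return None
    milestone = milestones[0]
    items = pos_paths(shrvl, item_role)
    # Tuple comparison on pos paths matches document order.
    before = [p for p in items if p < milestone]
    after = [p for p in items if p > milestone]
    return before, after
```

Note that an item nested inside the milestone element itself compares as "after" it; whether that is what you want is one more decision to make explicitly.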

So it makes sense to divide our transform into two processes: first to locate all patterns of interest (hence Schematron->SHRVL is used), second to make decisions based on the patterns found. 

Also

What this also gives us, for those interested, is a set of observations on a document: this could be pulled into a Hidden Markov Model to allow a more probabilistic approach to automated markup and document classification.

It can also provide a way to do meta-grammars (architectural forms-ish), where you have a RELAX NG (or Schematron) schema to validate the SHRVL document.  This can let you know when your flat document has the features expected for some transformation, without having to actually do that heavyweight transformation.