Highly Generic Schemas

This article originally appeared in a blog on O'Reilly on July 6, 2010.

Developer Christophe Lauret recently commented:

A schema is like an aircraft: it can be designed for stability or maneuverability but not both.

I have recently been trying a different method for designing intermediate schemas in publication chains. It is an exercise in taking the three-layer model for XML with Schematron to an extreme.

Below, I make a highly generic schema for the well-known Purchase Order: basically

  • a top-level object has html and other objects
    • html has a body and rich-block
      • rich-block is h1, h2, p, ul etc
        • these have rich-text
  • non-top objects have a name, metadata and sets
    • a set has metadata and data-objects
      • data-objects have objects, price, quantities, codes, flags, members, text, rich-text, rich-block and a (HTML links)

The best name I can think of for this is Highly Generic Schemas, but there is probably something better. (It is in the same kind of area that might be addressed by abstract base types in XSD, and by the old architectural forms for that matter.) I have only tried it on one particularly problematic application so far, but it seemed to reduce the programming time needed to implement a pair of transformations (initial data in to the intermediate format, and intermediate format out to publication) by about 2/3, which looks pretty good.

The application has some peculiarities. The data inputs are changing in schema as well as in data (not only the immediate inputs: all the components of the processes that feed the initial data are being renovated over time), the business rules are changing, and the outputs and their formats are changing regularly. In fact, the changes to the actual data values each month are relatively few compared to the size and coordination effort of coping with the system and requirements changes. The output fields in each publication have many different variations too, such as calculated values for different markets and different statutory wording. Coping with these small variations in records that were otherwise similar was difficult for developers, who have in the past felt there was always some fresh unstated gotcha lurking. The lack of specific-enough identification of fields created a problem in expressing and understanding requirements. And there is a must-have monthly deadline.

So the emphasis has to be on agility, clarity of markup, and hackability: it must be easy for the developers to understand immediately what the data in the intermediate format is, what kind of operations they will need to be looking at, and how to refactor it with minimal disruption and minimal bureaucratic impediments.

The Approach

The approach is to take the corollary of the requirement that the specific requirements are changing rapidly and that data from different sources with slightly different semantics may be present: viz "How do we have markup that reduces the amount of cross-referencing a developer needs to do?" i.e. how do we make as much of this context and metadata manifest?

The approach is a combination of

  • dividing the schema organization into parts that more closely fit the organization (shades of Conway's Law?), with the use of the highly generic schema, the controlled vocabulary and the arrangement (detailed below),
  • using highly generic element names which make obvious the nature and purpose of the field: if it is a <flag> element then it will only be used for tests rather than printed, for example; but if it is a <rich-block> then it may have paragraphs and inline markup that need to be dealt with,
  • giving each field quite long, unequivocal and explicit names (using @is) which prevents any need to resort to context to understand the semantics of the element (and maintaining this in a controlled vocabulary), and
  • denormalizing the data so that if the developer is interested in a set of information it is all there (i.e. both set-specific fields and fields that belong to the context) but if they are iterating over multiple sets (or multiple objects) then the fields relevant to the object are available immediately

We separate the concerns of the schema into three parts:

  • highly generic schema (we use a RELAX NG Compact grammar like the one below and provide an XSD mapping of it using Trang), which has a minimal number of elements (about 20), is highly systematic, and has a low semantic fanout, which I will explain below. The elements in this highly generic schema are not related to the particular business at all, but to the general class of document: we have the containers object, set and metadata, and the fields price, quantity, member, code, flag, text, rich-text, rich-block, object-ref and sort, all of which can have the attributes @is (giving the controlled vocabulary name), @code (giving the machine-processable value of the field), and @id,
  • controlled vocabulary (we use Schematron), which has complete lists of the specific names that can appear in @is attribute values. Anything that belongs to a business rule is in this schema,
  • an arrangement (we use UML and text), which says how information is arranged into the generic schema using the controlled vocabulary. For example, that we want to have one intermediate file per output (to maximize the independence of developers working in parallel on different outputs) and so on.

So the highly generic schema is fairly fixed, and documents can be validated against it easily using the simple grammar, but that validation does not check any business rules. The idea of semantic fanout is that, rather than having possibly hundreds of element names or relying on XML context to understand the meaning, it is better to figure out a small number of common names that make the general processing semantics very obvious to the programmer. In this case we came up with about 10 field names, 3 container names, and a handful of HTML elements:

  • an intermediate data file can have multiple objects,
  • objects contain multiple named sets of information plus object metadata,
  • sets contain multiple fields and objects or object references plus set metadata such as effectivity dates and sort keys,
  • fields are quantities, prices, codes, flags (like boolean), members (cross-object grouping), text, a (HTML's a), rich-text (inline HTML subset) or rich-block (block HTML subset).
  • Plus almost everything can have the common attributes mentioned: @is, @code and @id.
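
The containment rules in this list are small enough to fit in one lookup table, which is one way to see what low semantic fanout buys. The following is a minimal Python sketch, not part of the original tool chain (which validates with RELAX NG): the names FIELDS, ALLOWED_CHILDREN and misplaced_children are made up for illustration, and the check ignores order, cardinality and attributes.

```python
import xml.etree.ElementTree as ET

# Field element names from the highly generic schema.
FIELDS = {"price", "quantity", "code", "flag", "member", "object-ref",
          "text", "rich-text", "rich-block", "a"}

# Which child elements each container may hold (illustrative table only).
ALLOWED_CHILDREN = {
    "objects": {"html", "object"},
    "object": {"name", "metadata", "set"},
    "set": {"metadata", "object"} | FIELDS,
}

def misplaced_children(root):
    """Return (parent, child) tag pairs that break the containment rules."""
    return [
        (parent.tag, child.tag)
        for parent in root.iter()
        if parent.tag in ALLOWED_CHILDREN
        for child in parent
        if child.tag not in ALLOWED_CHILDREN[parent.tag]
    ]

GOOD = ('<objects><object is="Address"><set>'
        '<text is="city">Sydney</text>'
        '</set></object></objects>')
# Here the field sits directly in the object instead of inside a set.
BAD = ('<objects><object is="Address">'
       '<text is="city">Sydney</text>'
       '</object></objects>')

good_result = misplaced_children(ET.fromstring(GOOD))
bad_result = misplaced_children(ET.fromstring(BAD))
print(good_result)
print(bad_result)
```

A developer who knows these three rows knows where every field in every intermediate file can live, whatever the business vocabulary happens to be that month.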

The controlled vocabulary is a much more ad hoc (or, at least, extensible) affair: indeed, one of the aims is that by bringing the controlled vocabulary under the control of the developers, it provides a mechanism for them to cooperate where they can prototype changes or make initial versions without having to wait for prior review from the schema god.
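
To make the division of labour concrete, here is a rough Python analogue of what the controlled vocabulary layer checks. It is only a sketch under stated assumptions: the project described here keeps these lists in Schematron, and the VOCABULARY table and check_vocabulary function below are hypothetical names, populated with @is values from the address example later in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary: generic element name -> allowed @is values.
# A real project would maintain its own lists (the author uses Schematron).
VOCABULARY = {
    "object": {"Address", "BillingAddress", "OfficeAddress"},
    "text": {"suburb", "city", "country", "po-box.billing", "street.office"},
}

def check_vocabulary(root):
    """Return a message for every @is value that is not in the vocabulary."""
    problems = []
    for element in root.iter():
        name = element.get("is")
        if name is None:
            continue  # @is is optional on some elements, such as set
        if name not in VOCABULARY.get(element.tag, set()):
            problems.append(
                f"<{element.tag} is='{name}'> is not in the controlled vocabulary")
    return problems

SAMPLE = """
<objects>
  <object is="Address">
    <set>
      <text is="suburb" code="2009">Pyrmont</text>
      <text is="town">Sydney</text>
    </set>
  </object>
</objects>
"""

problems = check_vocabulary(ET.fromstring(SAMPLE))
for message in problems:
    print(message)
```

Adding a new name is a one-line change to the table, which is the kind of low-ceremony edit a developer can prototype without waiting for a grammar change.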

A Highly Generic Schema

Here is a version of the schema:

# RELAX NG Compact schema for Highly Generic Information
# Written: Rick Jelliffe, Allette Systems
# Version: 2010-06-25

namespace local = ""
datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"

start = objects

## objects is the top-level element
## It contains a section in simplified HTML which should explain the
## purpose of the file and give links to codes and object types enough
## so that other developers can easily use it.
objects = element objects { html?, object+ }

## The object is the primary unit of organization.
## An object is equivalent to an important table, such as a Drug, or Brand
##
## Fields:
##   @is       The specific kind of object. Objects are very generic,
##             so @is gives the details.
##             This is a token, controlled by the project. Required.
##   @code     If the object has a natural code, this is where it goes.
##             This is a token, depending on the data. Optional.
##   @id       A unique identifier.
##             This may be any token, not just conforming to XML ID rules. Optional.
##   @sort     A sort key that may be used for objects of this kind
##   name      A clear name for the object. For higher-level objects this is
##             probably for information and clear markup, not for direct usage.
##   metadata  These are housekeeping elements used to tie into the feeding
##             system and help with any effectivity requirements
##   set       All information in an object is grouped into sets.
##             This allows more accurate representation of containers
object = element object {
    attribute is { xs:token },
    attribute code { xs:token }?,
    attribute id { xs:token }?,
    attribute sort { text }?,
    element name { text }?,
    metadata?,
    set+ }   # eg drug

## A set is a group of related information.
## A set is equivalent to an important collection, in particular for a
## line in a schedule
##
## Fields:
##   @is       The specific kind of set. Sets are very generic,
##             so @is gives the details.
##             This is a token, controlled by the project. Optional.
##   @code     If the set has a natural code, this is where it goes.
##             This is a token, depending on the data. Optional.
##   @id       A unique identifier.
##             This may be any token, not just conforming to XML ID rules. Optional.
##   metadata  These are housekeeping elements used to tie into the feeding
##             system and help with any effectivity requirements
set = element set {
    attribute is { xs:token }?,
    attribute code { xs:token }?,
    attribute id { xs:token }?,
    metadata?,
    data-object+ }

## A rich and generic set of data objects are allowed.
## These use @is and @code to keep business requirements (almost)
## completely out of this schema.
##
## price       Any simple price. Units can be represented using @units if needed.
## quantity    Any simple quantities of any amount. Units can be represented
##             using @units if needed.
## code        Any simple code value. (The value of the element is the code,
##             not the @code attribute.)
## flag        Any boolean value. True is signalled by having the element,
##             false by not having the element.
##             Any binary information that will simplify formatting the data etc.
##             can be a flag.
## member      The member group code goes as a data value, not a @code.
## object-ref  This is a reference to an object elsewhere in the XML,
##             by the target's @id.
##             These might be used, for example, if shared restrictions were useful.
##             The sort attribute is an order key that may be used for ordering
##             the object.
## text        A simple plain text field. Be careful to only use this when necessary
## rich-block  A rich-block field. This text is marked up using the simplest
##             HTML blocks
## rich-text   A rich-text field. This text is marked up using the simplest
##             inline HTML
## a           An HTML link to related information.

data-object =
    object |
    element price { attribute is { xs:token }, attribute code { xs:token }?,
                    attribute units { text }?, attribute upto { "Y" | "N" }?, text } |
    element quantity { attribute is { xs:token }, attribute code { xs:token }?,
                       attribute units { text }?, text } |
    element code { attribute is { xs:token }, attribute code { xs:token }?, text } |
    element flag { attribute is { xs:token }, attribute code { xs:token }? } |
    element member { attribute is { xs:token }, attribute code { xs:token }?, text } |
    element object-ref { attribute is { xs:token }?, attribute code { xs:token }?,
                         attribute sort { text }?, attribute ref { xs:token } } |
    element text { attribute is { xs:token }, attribute code { xs:token }?, text } |
    element rich-block { attribute is { xs:token }, attribute code { xs:token }?, rich-blocks } |
    element rich-text { attribute is { xs:token }, attribute code { xs:token }?, rich-text } |
    element a { attribute is { xs:token }, attribute code { xs:token }?,
                attribute href { text }, text }

## metadata        These are housekeeping elements used to tie into the feeding
##                 system and help with any effectivity requirements
##   start-date    Optional.
##   end-date      Optional.
##   last-updated  Optional.
##   action-sheet  The number of the action sheet used to drive the change. Optional.
##   sort          Sort keys that may be used for this object or set of information.
metadata = element metadata {
    element start-date { attribute code { text }?, text }?,
    element end-date { attribute code { text }?, text }?,
    element last-updated { attribute code { text }?, text }?,
    element action-sheet { attribute code { text }?, text }?,
    element sort { attribute is { text }, attribute type { 'alpha' | 'numeric' }?,
                   attribute primary { text }, attribute secondary { text }? }? }

# Rich text uses a simple subset of HTML
html = element html { element body { rich-blocks } }

rich-blocks =
    ( element p { attribute is { xs:token }?, attribute class { xs:token }?, rich-text } |
      element h1 { attribute is { xs:token }?, attribute class { xs:token }?, rich-text } |
      element h2 { attribute is { xs:token }?, attribute class { xs:token }?, rich-text } |
      element ul { attribute is { xs:token }?, attribute class { xs:token }?,
                   element li { rich-text }+ } )+

rich-text =
    ( text |
      element sub { text } |
      element sup { text } |
      element a {
          attribute class { xs:token }?,
          attribute href { text },
          text } )+

Example

So here is a little example, which is enough to show what the structures are but undoubtedly too simple to demonstrate why you would want to do something like this.

Here is a conventional XML instance:

<dataset>
  <BillingAddress>
    <set>
      <po-box>22224</po-box>
      <suburb code="2009">Pyrmont</suburb>
      <city>Sydney</city>
      <country code="AU">Australia</country>
    </set>
  </BillingAddress>
  <OfficeAddress>
    <set>
      <street>2/73 Union St</street>
      <suburb code="2009">Pyrmont</suburb>
      <city>Sydney</city>
      <country code="AU">Australia</country>
    </set>
  </OfficeAddress>
</dataset>

In this conventional case, we might have an XML Schema which has abstract base types (or elements) for objects, text etc and use type derivation (or substitution groups) for the specific schema.

The version of this instance using a highly generic schema is a permutation which takes the generic information (the abstract type or substitution group head) and uses it as the element name (i.e. the generic identifier), and denormalizes (repeats) common fields:

<objects>
  <object is="Address">
    <set>
      <text is="suburb" code="2009">Pyrmont</text>
      <text is="city">Sydney</text>
      <text is="country" code="AU">Australia</text>

      <object is="BillingAddress">
        <set>
          <text is="po-box.billing">22224</text>
          <text is="suburb.billing" code="2009">Pyrmont</text>
          <text is="city.billing">Sydney</text>
          <text is="country.billing" code="AU">Australia</text>
        </set>
      </object>
      <object is="OfficeAddress">
        <set>
          <text is="street.office">2/73 Union St</text>
          <text is="suburb.office" code="2009">Pyrmont</text>
          <text is="city.office">Sydney</text>
          <text is="country.office" code="AU">Australia</text>
        </set>
      </object>
    </set>
  </object>
</objects>

I think this second way has advantages over the first when maneuverability is a concern.
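
The permutation itself is mechanical: the specific name moves into @is, and the generic kind of thing becomes the element name. Here is a small Python sketch of that step, assuming (for illustration only) that every address field maps to the generic text element and that each specific object has a short suffix. The GENERIC_NAME and SUFFIX tables and the to_generic function are hypothetical, and the sketch leaves out the wrapping Address object with its repeated common fields.

```python
import xml.etree.ElementTree as ET

# Assumptions for illustration: every field in the address example maps to
# the generic <text> element, and each specific object gets a short suffix.
GENERIC_NAME = {"po-box": "text", "street": "text", "suburb": "text",
                "city": "text", "country": "text"}
SUFFIX = {"BillingAddress": "billing", "OfficeAddress": "office"}

def to_generic(dataset):
    """Rewrite a conventional <dataset> into <objects>/<object>/<set> form."""
    objects = ET.Element("objects")
    for specific in dataset:
        # The specific element name moves into @is on a generic <object>.
        obj = ET.SubElement(objects, "object", {"is": specific.tag})
        for old_set in specific.findall("set"):
            new_set = ET.SubElement(obj, "set")
            for field in old_set:
                # Long, explicit name: field name plus the object's suffix.
                attrs = {"is": f"{field.tag}.{SUFFIX[specific.tag]}"}
                attrs.update(field.attrib)  # keep @code and any other attributes
                generic = ET.SubElement(new_set, GENERIC_NAME[field.tag], attrs)
                generic.text = field.text
    return objects

CONVENTIONAL = """
<dataset>
  <BillingAddress>
    <set>
      <po-box>22224</po-box>
      <suburb code="2009">Pyrmont</suburb>
    </set>
  </BillingAddress>
</dataset>
"""

result = to_generic(ET.fromstring(CONVENTIONAL))
print(ET.tostring(result, encoding="unicode"))
```

In a real chain this step would more likely be an XSLT transformation from the initial data; the point here is only that the mapping tables carry all the business knowledge and the code stays generic.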

XML Schemas is weak in the area of maintenance: if you want to mark that a certain element is obsolete but can still be used, there is simply no reliable way, which makes it a poor choice when the schema is changing rapidly. In Schematron, you just make a report element that lets you know that an old name is being used, but that this is not significant for validation. And having to look up the (abstract) types in the schema is a level of indirection (and of knowledge about schemas) more than should be necessary: many good XML developers have, want or need little awareness of XML Schemas. Furthermore, using an XML Schema will inevitably mean that a schema change cannot be made ad hoc but will have to be negotiated with some supervising party.
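
The distinction between "flag it" and "fail it" can be sketched outside Schematron too. The following Python fragment is a loose analogue of a Schematron report, not the author's actual rules: the OBSOLETE table, the old name postcode.billing and the report_obsolete function are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical renames: obsolete @is value -> its replacement.
OBSOLETE = {"postcode.billing": "suburb.billing"}

def report_obsolete(root):
    """Return notes for obsolete @is values; these do not fail validation."""
    return [
        f"@is='{element.get('is')}' is obsolete; use '{OBSOLETE[element.get('is')]}'"
        for element in root.iter()
        if element.get("is") in OBSOLETE
    ]

doc = ET.fromstring(
    '<objects><object is="BillingAddress"><set>'
    '<text is="postcode.billing" code="2009">Pyrmont</text>'
    '</set></object></objects>'
)

notes = report_obsolete(doc)
for note in notes:
    print(note)
```

Because the notes are returned separately from any validity verdict, old names keep working while developers are told what to migrate to.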

And the lack of denormalization means that the developer using the data will always have to do an extra lookup step: if the city field is part of the Address element then it won't be in the BillingAddress and the OfficeAddress, while if it is part of those two it won't be part of the Address element.

Functions make using a Highly Generic Schema and XSLT2 easier

In this approach, access to data using XSLT2 has some extra safeguards too: we use custom functions for XPaths used in xsl:for-each or other selection and sorting cases. This overcomes the common problem that long XPaths need to be documented, but infrequently are: having to make up a function name is a way to tightly couple simple documentation (i.e. the name) with an XPath. So function accesses look like this:

<xsl:for-each select="mps2:get-every-dog-in-window( / )">
<xsl:sort select="mps2:get( . , 'Name.WindowDog') "/>

  <xsl:choose>
    <xsl:when test="mps2:get( . , 'IsPartOfLitter.WindowDog')">
       <xsl:call-template name="mps2:handle-litter"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:call-template name="mps2:handle-single-dog"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:for-each>

where the get() function can be a little fuzzy, like

<xsl:function name="mps2:get">
  <xsl:param name="context"/>
  <xsl:param name="key"/>
  <xsl:choose>
    <!-- 1. The context node itself carries the requested name -->
    <xsl:when test="$context[@is=$key]">
      <xsl:sequence select="($context)[1]"/>
    </xsl:when>
    <!-- 2. Otherwise take the first descendant with that name -->
    <xsl:when test="$context//*[@is= $key ]">
      <xsl:sequence select="($context//*[@is= $key ])[1]"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- 3. Attempt to find it elsewhere: look in the nearest ancestor object
           for the generic name, i.e. the part of the key before the '.' -->
      <xsl:sequence select="($context/ancestor::object[1]//*[@is= substring-before( $key, '.' )  ])[1]"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:function>
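
For readers who want to trace the lookup order without an XSLT 2 processor, here is a rough Python port of the same three steps. It is a sketch rather than the production function: ElementTree has no ancestor axis, so the port takes the document root as an extra argument and builds a parent map, and it returns None where the XSLT returns an empty sequence. The sample document is a cut-down version of the address example.

```python
import xml.etree.ElementTree as ET

def get(root, context, key):
    """Find the element whose @is equals `key`, starting from `context`."""
    # 1. The context element itself carries the requested name.
    if context.get("is") == key:
        return context
    # 2. Otherwise, the first descendant (in document order) with that name.
    for element in context.iter():
        if element is not context and element.get("is") == key:
            return element
    # 3. Fall back to the nearest ancestor object and the generic name,
    #    i.e. the part of the key before the first '.'.
    generic = key.split(".", 1)[0] if "." in key else ""
    parents = {child: parent for parent in root.iter() for child in parent}
    ancestor = parents.get(context)
    while ancestor is not None and ancestor.tag != "object":
        ancestor = parents.get(ancestor)
    if ancestor is None:
        return None
    for element in ancestor.iter():
        if element is not ancestor and element.get("is") == generic:
            return element
    return None

DOC = """
<objects>
  <object is="Address">
    <set>
      <text is="city">Sydney</text>
      <object is="BillingAddress">
        <set>
          <text is="po-box.billing">22224</text>
        </set>
      </object>
    </set>
  </object>
</objects>
"""

root = ET.fromstring(DOC)
billing = root.find(".//object[@is='BillingAddress']")

direct = get(root, billing, "po-box.billing")    # found inside the context
fallback = get(root, billing, "city.billing")    # found via the ancestor object
missing = get(root, billing, "street.billing")   # not present anywhere
print(direct.text, fallback.text, missing)
```

The second call shows the fuzziness: the billing address has no city.billing field of its own in this sample, so the lookup settles for the generic city field on the enclosing Address object.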