Weak Validation

This article was written at Academia Sinica Computing Centre, Taiwan on 2 June, 1999. 

The further idea of feasible validation (a document is feasibly valid if elements can be added to make it valid) was presented at XML Europe 2002 in a paper "When Well-formed is too much and Validity is too little" and implemented in James Clark's Jing validator.  - Rick Jelliffe

Regular-expression content-models can be processed in various ways to obtain constraints which are easier to implement. Weak validation has some attractive properties that may coincide with the nature of many XML documents.

Different Algorithms have Different Strengths

Some content models are difficult to validate: in particular, content models containing many "&" operators may be subject to combinatorial explosions if validated using a conventional automaton.
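
To give a feel for the size of the problem, here is a small Python sketch. It assumes the naive strategy of expanding an "&" group into every ordered sequence it permits before building an automaton. The function names are illustrative only and do not come from any particular validator. The number of sequences grows as n factorial, whereas a direct order-independent check stays simple.

```python
from itertools import permutations
from math import factorial


def expand_and_group(names):
    """Expand (n1 & n2 & ...) into every ordered sequence it allows."""
    return [", ".join(order) for order in permutations(names)]


def matches_and_group(names, children):
    """Order-independent check: every name exactly once, in any order."""
    return sorted(children) == sorted(names)


# The naive expansion has n! alternatives for n distinct element types.
for n in range(1, 7):
    names = [f"e{i}" for i in range(n)]
    assert len(expand_and_group(names)) == factorial(n)

print(expand_and_group(["a", "b", "c"]))
```

Real automaton constructions can be smarter than this expansion, so the sketch shows only the worst-case flavour of the problem.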

Conversely, validators implemented using XSL, as proposed in my note "Using XSL as a Validation Language", can readily validate content models containing many "&" operators. However, they may suffer a combinatorial explosion from content models that use the various repeating and optional operators.

An alternative approach has been mooted in the XML Schema draft, which is to allow "open" and "closed" schemas.

This is an interesting idea; it means that we can interpret content model schemas in different, but useful ways.

Weak Validation

One way that I think may be promising is to allow forms of weak validation.

One form of weaker constraint that can be extracted from a content model is a list of all element types that are always required. For example, taking the following model:

<!ELEMENT eg ( a, b, a?, ( c | ( f, b, c )), ( d | e )* )>

a weak content model would be (if #ANY means a single element):

<!ELEMENT eg ( a , b, #ANY*, c, #ANY* )>

or (if #ANYSEQ means any string of elements in any following positions)

<!ELEMENT eg ( a , b, ( #ANYSEQ* & c ) )>

where that content model would be interpreted as "very open" (that is the function of the #ANY or #ANYSEQ tokens). The leftmost consecutive required element types are specified using "," but after the first optionality or grouping indicator, the "&" connector is used. As mentioned, this kind of content model is trivial for XSL validation.
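
The extraction of always-required element types can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the content model is written by hand as nested tuples (a hypothetical representation, not a DTD parser), and the names `required` and `split_weak` are invented for the example. Multiplicity is ignored, so an element type required twice is reported once.

```python
def required(node):
    """Return the set of element types present in every match of node."""
    kind, arg = node
    if kind == "elem":
        return {arg}
    if kind == "seq":                      # a, b: both sides are needed
        return set().union(*(required(c) for c in arg))
    if kind == "alt":                      # a | b: only what all branches share
        return set.intersection(*(required(c) for c in arg))
    if kind == "plus":                     # a+: at least one occurrence
        return required(arg)
    if kind in ("opt", "star"):            # a? and a*: may be absent
        return set()
    raise ValueError(f"unknown node kind: {kind}")


def split_weak(model):
    """Split a top-level sequence into the leading run of plain required
    elements (the "," list) and the other required types (the "&" list)."""
    kind, items = model
    if kind != "seq":
        return [], required(model)
    i = 0
    while i < len(items) and items[i][0] == "elem":
        i += 1
    prefix = [name for _, name in items[:i]]
    rest = set()
    for item in items[i:]:
        rest |= required(item)
    return prefix, rest


# ( a, b, a?, ( c | ( f, b, c )), ( d | e )* )
EG = ("seq", [
    ("elem", "a"),
    ("elem", "b"),
    ("opt", ("elem", "a")),
    ("alt", [("elem", "c"),
             ("seq", [("elem", "f"), ("elem", "b"), ("elem", "c")])]),
    ("star", ("alt", [("elem", "d"), ("elem", "e")])),
])

print(split_weak(EG))
```

For the example model this yields the leading sequence a, b and the single remaining required type c, which agrees with the weak content models shown above.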

Why would this be useful?

  • The first reason is that it is very simple for implementors of validators: they just need to look down two lists, the "," sequence and the "&" sequence.
  • It is a useful form of validation for implementors of software that accepts XML. Software is often written to test for non-required element types itself: in some scenarios, there is little value in validating non-required element types.
  • Database records are often flat, and typically position-independent; the "&" connector meets this requirement.
  • The "required element missing" error message is the most common and important for document systems. For simple content models, weak validation provides these messages.
  • When a DTD is "very open" or extensible by arbitrary cut-and-paste, only the always-required element types can be salvaged from the content model as originally designed. In other words: (a, b, c* ) +(#ANY) is the equivalent of ( a & b & #ANY*). This is likely to be the case with XML documents moved into word processors. We want to retain as much validatability as possible, while keeping it simple enough to be implemented and understood, with known behaviour.

Suggestion

It would be useful if XML supported a kind of weak validator that supported very-open content models with only "," or "&" connectors: this validates "always-required" element types only. This could use information from XML markup declarations and XML schema, but be trivial to implement.
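
A minimal sketch of such a validator, in Python, might look as follows. It assumes the two lists have already been obtained somehow (from markup declarations or a schema) and that the children of an element are available as a list of element type names. The function name `weak_validate` and the message wording are my own inventions for the example.

```python
def weak_validate(children, sequence, anywhere):
    """Check a list of child element names against a very-open weak model.

    sequence: element types required, in order, at the start (the "," list)
    anywhere: element types required somewhere afterwards (the "&" list)
    Returns a list of error messages; an empty list means weakly valid.
    """
    errors = []
    for i, name in enumerate(sequence):
        if i >= len(children) or children[i] != name:
            errors.append(
                f"required element missing at position {i + 1}: {name}")
    rest = children[len(sequence):]
    for name in anywhere:
        if name not in rest:
            errors.append(f"required element missing: {name}")
    return errors


# Weak model ( a, b, ( #ANYSEQ* & c ) )
print(weak_validate(["a", "b", "a", "f", "b", "c", "d"], ["a", "b"], ["c"]))
print(weak_validate(["a", "b", "d"], ["a", "b"], ["c"]))
```

Everything not named in either list is accepted without comment, which is what makes the model "very open".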

Note

The above examples raise an interesting issue. Is

( (a)+ & b )

different from

( (a+) & b )

or from

( a & a* & b )?

Other Forms of Weak Validation

Also worth considering is the question: how can validation schemas assist the authors of in-progress documents? It seems that the distinction between valid and invalid is too extreme to be useful during authoring. I note that the FrameMaker+SGML structured editor presents the document creator with several interesting choices in this regard:

  • the elements menu can be arranged to show all elements, the elements available in the parent in any order, or the elements currently possible at the current position. The second one is in fact a weaker content model: as if all sequence operators had been replaced by or operators.
  • when creating an element, an option is available so that the schema (the "EDD") can nominate another sub-element that is also inserted automatically. This is a form of stronger schema: it gives an example of why a schema for constructing documents can have different information than a validator. Perhaps there is something in between "optional" and "required" for an element type: "should be present by default"!

So we can identify several different strengths of validity:

  • valid/invalid against a DTD or closed schema;
  • valid/invalid against an architecture or open schema;
  • weakly valid/weakly invalid against a DTD, architecture or schema;
  • potential validity against a DTD or closed schema;
  • well-formed/ill-formed (though this is strictly not a schema issue).

The fourth type, potential validity, is where no impossible elements are present. It can be checked by weak validation, where a model like

(a, b?, (c | d), (b)*)

is replaced by the unambiguous

(a?, (b | c | d)* )

or the ambiguous

(a?, b?, ( c| d)?, b* )
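
As a rough illustration, the unambiguous weak model above can be tried out with an ordinary regular expression in Python. The encoding is an assumption made for this sketch: each child element name is followed by a single space, and the name `potentially_valid` is hypothetical.

```python
import re

# (a?, (b | c | d)* ) over element names, each terminated by a space
WEAK_MODEL = re.compile(r"(?:a )?(?:(?:b|c|d) )*")


def potentially_valid(children):
    """True if the child names fit the weakened content model."""
    text = "".join(name + " " for name in children)
    return WEAK_MODEL.fullmatch(text) is not None


print(potentially_valid(["b", "c"]))   # a can still be added in front
print(potentially_valid(["c", "a"]))   # a can never follow c
```

Note that this weak model is looser than the original: it accepts a document containing both c and d, which no amount of added elements can make valid against (a, b?, (c | d), (b)*). It catches impossible element types and a misplaced a, but not every impossible combination.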