Three Text-based Complexity Metrics for Schemas: RoughDTD, RoughXSD, RoughSchematron

Posted on July 18, 2019 by Rick Jelliffe

Sometimes we want to measure the complexity of schemas.  Here are three metrics that provide a rough count of the declarations and alternatives in a schema. A schema that uses its schema language's macro facility a lot (for DTDs, entities; for XSD, named datatype declarations; for Schematron, abstract patterns and rules) will measure as less complex than an equivalent one that fully expands the macros (that is, puts the details on each rule explicitly).

The emphasis here is on ease of creation. The metrics only require text processing, not parsing, but should be more meaningful than LOC. The three metrics are probably not directly comparable with each other, but metric values for different schemas of the same type can be compared. (They are probably best thought of as metrics for the complexity of the schema itself as much as for the complexity of the language defined by the schema.)

RoughDTD:  Do not count inside comments, from “<!--” to “-->”.  Count every occurrence of the regex “<!.[^INO]” to get a count of the ELEMENT and ATTLIST declarations (the second character after “<!” rules out ENTITY, NOTATION and DOCTYPE declarations). Add a count of every occurrence of “|” and “,” as well.  Aggregate the resulting numbers for every DTD file, ignoring files that only contain ENTITY declarations.

Justification: complexity is the number of declarations for elements and attributes, with the number of choice points in the content models.

Limitation:  items in IGNORE marked sections may be counted.
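
The RoughDTD rule above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the names rough_dtd and rough_dtd_total are made up for the example, and treating a file with no ELEMENT or ATTLIST match as an "entity-only" file is an assumption about how to apply the last part of the rule.

```python
import re

# Comments run from "<!--" to "-->" and may span several lines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# The pattern from the rule: matches <!ELEMENT and <!ATTLIST, but not
# <!ENTITY, <!NOTATION, <!DOCTYPE, <![IGNORE[ or <![INCLUDE[.
DECLARATION = re.compile(r"<!.[^INO]")


def rough_dtd(text: str) -> int:
    """Return the RoughDTD count for the text of a single DTD file."""
    text = COMMENT.sub(" ", text)
    declarations = len(DECLARATION.findall(text))
    if declarations == 0:
        # Assumption: no ELEMENT or ATTLIST match means an entity-only
        # file, which the rule says to ignore.
        return 0
    return declarations + text.count("|") + text.count(",")


def rough_dtd_total(texts) -> int:
    """Aggregate the metric over the texts of several DTD files."""
    return sum(rough_dtd(t) for t in texts)


sample = """<!-- title, then para | list -->
<!ENTITY % inline "b | i">
<!ELEMENT doc (title, (para | list)+)>
<!ATTLIST doc id ID #IMPLIED>
<!ELEMENT title (#PCDATA)>
"""
print(rough_dtd(sample))  # 3 declarations + 2 "|" + 1 "," = 6
```

Note that the “|” inside the entity value is counted, because the text-only approach cannot tell where a separator occurs. That matches the spirit of the metric: cheap to compute and only roughly right.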

RoughXSD:  Do not count between “<.*:annotation” and “</.*:annotation”.  Do not count inside comments, from “<!--” to “-->”.  Count every occurrence of the regex “<[^!/]” to get a count of the elements. Aggregate the resulting numbers from processing every XSD file.  (Adjust “.*:” if no prefix is used in the schema.)

Justification: apart from annotations, each element in XSD represents a single kind of complexity, even if this complexity is one that tightens up what a broader type allows.

Limitation: this does not count XPath or regex complexity.
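
A possible Python sketch of the RoughXSD rule follows. The function name rough_xsd is hypothetical. The sketch assumes the schema uses a namespace prefix such as xs: or xsd:, and that annotations are never written as empty self-closed elements. It narrows the “.*:” of the rule to a prefix-shaped pattern so that the match cannot run across unrelated tags.

```python
import re

# Comments run from "<!--" to "-->" and may span several lines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Everything from <prefix:annotation ...> to the next </prefix:annotation>.
# Assumption: a prefix is always present (adjust the pattern if not).
ANNOTATION = re.compile(
    r"<[\w.-]+:annotation\b.*?</[\w.-]+:annotation\s*>", re.DOTALL
)
# The pattern from the rule: any "<" not followed by "!" or "/".
# This also matches "<?", so an XML declaration would count as one.
START_TAG = re.compile(r"<[^!/]")


def rough_xsd(text: str) -> int:
    """Return the RoughXSD count for the text of a single XSD file."""
    text = COMMENT.sub(" ", text)
    text = ANNOTATION.sub(" ", text)
    return len(START_TAG.findall(text))


sample = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- <xs:element name="ignored"/> -->
  <xs:element name="note" type="xs:string">
    <xs:annotation>
      <xs:documentation>A <b>short</b> note.</xs:documentation>
    </xs:annotation>
  </xs:element>
  <xs:simpleType name="code">
    <xs:restriction base="xs:string">
      <xs:maxLength value="8"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""
# Counted: schema, element, simpleType, restriction, maxLength.
print(rough_xsd(sample))  # 5
```

To aggregate over a set of files, sum rough_xsd over the text of each file.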

RoughSchematron:  Do not count between “<.*:p[>\s]” and “</.*:p[>\s]”.  Do not count inside comments, from “<!--” to “-->”.  Count every occurrence of the regex “<[^!/]” to get a count of the elements. Add a count of every occurrence of “/”, “|”, “[” and “,” (if processing as text, this is anywhere in the file; if processing with SAX, only in attribute values).  Aggregate the resulting numbers from processing every Schematron file.  (Adjust “.*:” if no prefix is being used in the schema.)

Justification:  Most of the complexity of a Schematron schema is in the XPaths, therefore we need to count the contexts and so on in the XPaths.

Limitation: no count for use of functions and the XQuery-ish parts of XPath2.
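
The text-processing variant of RoughSchematron might look like the following Python sketch. The name rough_schematron is invented for the example, and the code assumes a prefixed schema (for instance sch:). Because it works on raw text, the “/” count also picks up end tags, self-closing tags and namespace URIs; the SAX variant mentioned in the rule, which looks only at attribute values, is not shown.

```python
import re

# Comments run from "<!--" to "-->" and may span several lines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Prose paragraphs: <prefix:p> ... </prefix:p>. The [>\s] after "p"
# keeps <prefix:pattern> from being treated as a paragraph.
# Assumption: a prefix is always present (adjust the pattern if not).
P_ELEMENT = re.compile(r"<[\w.-]+:p[>\s].*?</[\w.-]+:p[>\s]", re.DOTALL)
# The pattern from the rule: any "<" not followed by "!" or "/".
START_TAG = re.compile(r"<[^!/]")
# Characters taken as a rough proxy for XPath complexity.
XPATH_MARKS = "/|[,"


def rough_schematron(text: str) -> int:
    """Return the RoughSchematron count for one Schematron file's text."""
    text = COMMENT.sub(" ", text)
    text = P_ELEMENT.sub(" ", text)
    elements = len(START_TAG.findall(text))
    marks = sum(text.count(ch) for ch in XPATH_MARKS)
    return elements + marks


sample = """<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:p>Prose, with / and [ and | that must not count.</sch:p>
  <!-- comment, also ignored / -->
  <sch:pattern>
    <sch:rule context="book/chapter[title]">
      <sch:assert test="para | list">A chapter needs content.</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
"""
# 4 elements + 9 "/" (4 in the namespace URI, 1 in the context path,
# 4 in end tags) + 1 "|" + 1 "[" + 0 ","
print(rough_schematron(sample))  # 15
```

The slashes contributed by end tags and the namespace URI inflate every schema's score in roughly the same way, so comparisons between Schematron schemas should still be meaningful.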