The Goals of XML at 25: and the one change that XML really now needs

XML had a goal: terseness is of minimal importance. JSON seems to prove otherwise, some people think. But does JSON really demonstrate that terseness of markup is important, or does it demonstrate even more that terseness of declaration is important? I suggest one change that would give XML a lot more bang per buck.

XML was standardized in 1998, and has had only minor successful revisions since then. It had two subsequent upgrades: XML Namespaces and XML Schemas. I had a ring-side seat on the development of all of these, and was able to put my own 10c in.

In XML's case, the method of development was to pass SGML (as successfully enhanced with my ERCS proposal for better support of CJK documents) and HTML through a filter called the "Goals": at the time of writing this post, these Goals are coming up to 25 years since they were formulated. Let's look at how they have stood up.

Here is what was in the first published draft (1996):

The design goals for XML are:
1. XML shall be straightforwardly usable over the Internet.
2. XML shall support a wide variety of applications.
3. XML shall be compatible with SGML.
4. It shall be easy to write programs which process XML documents.
5. The number of optional features in XML is to be kept to
the absolute minimum, ideally zero.
6. XML documents should be human-legible and reasonably clear.
7. The XML design should be prepared quickly.
8. The design of XML shall be formal and concise.
9. XML documents shall be easy to create.
10. Terseness is of minimal importance.

After 10 years' experience with XML, and particularly after the re-complicification of XML by XML Namespaces and its bloatification by XML Schemas, there were growing calls for a Simplified XML or a Minimal XML. I resisted these, as I believed then (as now) that the issues they were reporting would not, in the main, be solved by reducing the features of XML: you would be trying to fit a marginally smaller square peg into a round hole.

Several attempts at a minimal XML were made, but none found traction. Similarly, some attempts at minimal XML in other pet syntaxes were tried, and the only one that has been a runaway success is of course JSON, which is not a markup language at all.

I was frankly delighted to see JSON, because I think it does provide that round peg: of course, its dictated approach meant that it lacked some things that were really needed: comments in particular. Reusing JavaScript data declaration syntax was a great simplifying idea. (Now, predictably, JSON's minimalism meant that people have had to make extensions to make it useful: JSON is enough for ephemeral data exchange, but to use it even for static configuration files people adopted the superset YAML.)

I want to look at one of these goals only:

10. Terseness is of minimal importance.

Of all of XML's goals, this was the most important: it provided a chisel to carve XML out of SGML (a la Michelangelo).

SGML was developed in the 80s with the goal of making a tree-based markup language system that allowed, hopefully with minimal change, all the kinds of markup languages that then existed:

from fully tagged languages (XML is an example of this)
to tag-omitted tagging (HTML is an example)
to delimiter-tagged (JSON is an example of this)
to unnamed-delimiter tagged (like CSV)
to lines-starting-with-special-character (like troff markup)
to what is now called MarkDown (years ago, I wrote a blog for O'Reilly giving the DTD for this in SGML.)

The idea was that this provided an easy on-ramp so that syntax would not matter: you do the work (a lot, as it turned out) of making the right declarations, and then with a bit of massage your documents become part of a unified ecosystem, and more capable of being re-purposed and re-targeted as new systems come along.

The trouble was that SGML was not sufficiently layered, and so contained mostly features that any individual publisher would not use: indistinguishable from bloat to most people.

XML comes along, and with Goal 10, trims them. And the consequence? MarkDown immediately comes along, premised on the idea that in fact terseness is of extreme importance for humans typing documents or for data transfer.

(Now, it is true that if you are using some editor, the format makes no difference, and if you are compressing your data, then documents in any format which have the same information should compress down to about the same size. But MarkDown users wanted to be able to type directly; and JSON has type information in delimiters that in XML would require some system of extra annotations, which means the XML document could not, in all likelihood, compress down to the same size.)

Doing this typing externally was the hope of XML Schemas, of course. But the idea of downloading potentially very large schemas (even if cached) every time you wanted to parse was ludicrous in practice, and it was a level of formality that did not suit the more ad hoc way that developers made JSON systems work.

XML is clearly well down the downward slope of the hype cycle: not a bad thing. And JSON is clearly used for the things it is good for: no complaints from me. And HTML went off on its own way: IMHO its early pollination with SGML ideas and then XML ideas (XHTML) did it no harm, and it is a good thing that it can age gracefully now.

A Better "XML" for Editing?

When people raised the idea of a simplified XML, I sometimes would say that I wanted simplification, but that what I considered simplifications were what others considered complexifications.

For example, I thought that the way to remove external (and internal) entities from XML would be to build the standard ISO/W3C entity sets into XML. And I could see no reason why limited end-tag omission could not be enabled again: if you find an end-tag you close off any unclosed elements inside it; this could be enabled as a form of error-handling, so that SGML well-formedness would still be in place. Both of these are utterly trivial to implement, but the people who wanted simplification and minimalization were really attached to matching tags.

In 2002, I put out my own idea for SlackXML, or Editor's Concrete Syntax. It was implemented as a mode in my Topologi Collaborative Markup Editor, and was intended to allow less Draconian XML and many HTML/SGML documents. It was designed for DTD-less interactive colouring editors, the idea being that nothing (such as some earlier non-well-formedness or invalidity) should prevent the user from concentrating on the portion of the document they were working on: for example, it kept the SGML-isms that a </> closes the current element, and that an end-tag closes any elements still open inside it. Notably, it pre-defined all the standard SGML/W3C entity sets.
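The two recovery rules just mentioned (a </> closes the current element, and a named end-tag closes everything still open inside it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Topologi implementation: the function name slack_close and the regex tokenizer are hypothetical, and the regex ignores CDATA sections and any ">" inside attribute values.

```python
import re

# Hypothetical, deliberately naive tag matcher: group 1 is "/" for end-tags,
# group 2 is the element name (absent for "</>"), group 3 is "/" for <empty/>.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.\-]*)?[^<>]*?(/?)>")

def slack_close(text: str) -> str:
    """Fill in omitted end-tags using a stack of open element names."""
    out, stack, pos = [], [], 0
    for m in TAG.finditer(text):
        out.append(text[pos:m.start()])   # copy character data unchanged
        pos = m.end()
        closing, name, empty = m.group(1), m.group(2), m.group(3)
        if not closing:
            out.append(m.group(0))        # start-tag, comment, PI: pass through
            if name and not empty:
                stack.append(name)
        elif name is None:                # "</>" closes the current element
            if stack:
                out.append(f"</{stack.pop()}>")
        elif name in stack:               # close everything down to `name`
            while stack:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
        # an end-tag that matches no open element is silently dropped here
    out.append(text[pos:])
    out.extend(f"</{n}>" for n in reversed(stack))  # close what is left at EOF
    return "".join(out)

print(slack_close("<a><b><c>x</a>"))
print(slack_close("<a><b>x</></a>"))
```

The point of the sketch is only that the recovery needs nothing beyond the stack a parser already keeps; no DTD is consulted.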

I still think that XML development went wrong by not building in these standard entities for characters. The opposition was a combination of people: those who thought this complicated the parser, those who thought people should use numeric character references only, those who thought that adopting UTF-8 as the only character encoding for files meant we needed no references (entity or numeric), and so on. (It never happened, but I suspect there might be some math people who would argue that, because there is a single character in the MathML sets that requires bolding, and is therefore DTD-dependent, adopting all but that character would be impossible.)

So we are left now where entities cannot be removed from XML, because so many documents rely on them: e.g. documents using this math markup. And that means that DTDs could never be superseded. Which has meant that there is no impetus to reform XML Schemas (in the direction of a simplified core) as people doing documents rather than data often stick to DTDs (or switch to RELAX NG).

The One Thing XML Now Needs

Now, what would be the result of removing DOCTYPE declarations entirely from XML but, as I suggest, building-in the declarations for the standard SGML/W3C/MathML public entity sets for "special characters"?

It would allow in-place parsing. This is perhaps the big thing that prevents various kinds of optimized XML parsers, and when you look at academic papers on techniques for speeding up XML parsing, you can see that most of them come up with some brilliant technique then sheepishly admit it only parses a subset of XML.

The problem is that entity expansion means that parsing an XML document and transforming it into its infoset cannot be done simply by opening the file and adding a tree pointing to various parts of the file: you may have to open multiple files, you may have to open multiple declaration files, and where you find an entity reference &hello; it may expand to more than 7 characters and may even contain more markup.

At this point, people then say, well then, let's just get rid of entity declarations; indeed of DOCTYPE declarations as a whole. And then, immediately, all the people who do need to type these characters by name, not by number or as the character itself, put the kibosh on it.

But this does not need to apply to these standard characters, because every one of them (except for that bold one) expands in Unicode (UTF-8 or UTF-16) to the same or fewer bytes than the entity reference took. Removing the DOCTYPE declaration halves the complexity of XML; pre-defining the entities is really no extra complexity, it is just a big lookup table.
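A rough way to check the size claim is to run through a table of named characters and compare byte counts. The sketch below assumes Python's html.entities.html5 dictionary as a stand-in; it is the HTML5 list, which is related to but not identical with the SGML/W3C/MathML sets discussed here, so any names it reports are a property of that table rather than a verdict on the claim. The function name oversized is hypothetical.

```python
from html.entities import html5

def oversized(encoding: str = "utf-8") -> list[str]:
    """Return the names whose expansion needs more bytes than '&name;' does."""
    bad = []
    for name, text in html5.items():
        if not name.endswith(";"):
            continue  # skip HTML's legacy forms written without a semicolon
        reference = "&" + name  # keys already carry the trailing ";"
        if len(text.encode(encoding)) > len(reference.encode(encoding)):
            bad.append(name)
    return sorted(bad)

# "utf-16-le" is used rather than "utf-16" so no byte-order mark is counted.
print("UTF-8 exceptions: ", oversized("utf-8"))
print("UTF-16 exceptions:", oversized("utf-16-le"))
```

Whatever the lists contain for a given table, the exceptions can be handled the way the bold character is above: left out of the built-in set, so that every built-in reference can be expanded without growing the document.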

But the advantage of in-place parsability is that it allows techniques such as incremental parsing and parallel parsing, which speed up XML.
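To make "in-place" concrete, here is a minimal sketch that expands named references inside the same buffer that holds the document, without allocating a second one. It assumes Python's html.entities.html5 table as a stand-in for the built-in sets; expand_in_place is a hypothetical name, and numeric character references and all other markup are left alone. It relies on exactly the property argued above: because an expansion is never longer than its reference, the write position can never overtake the read position.

```python
from html.entities import html5

def expand_in_place(buf: bytearray) -> int:
    """Expand known '&name;' references inside buf; return the new length."""
    read = write = 0
    n = len(buf)
    while read < n:
        if buf[read] == ord("&"):
            # Entity names are short, so only look a little way ahead for ";".
            end = buf.find(b";", read, min(n, read + 40))
            if end != -1:
                name = buf[read + 1:end + 1].decode("ascii", "replace")
                text = html5.get(name)
                if text is not None:
                    out = text.encode("utf-8")
                    if len(out) <= end + 1 - read:   # must fit in the reference
                        buf[write:write + len(out)] = out
                        write += len(out)
                        read = end + 1
                        continue
        buf[write] = buf[read]   # ordinary byte (or unknown reference): copy
        write += 1
        read += 1
    return write

doc = bytearray("a &lt; b &eacute; c".encode("utf-8"))
length = expand_in_place(doc)
print(bytes(doc[:length]).decode("utf-8"))
```

Because every reference can be resolved locally from a fixed table, separate chunks of a document could in principle be handed to separate workers, which is what incremental and parallel parsing need.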

XML Standard needs to be revised, at W3C or just de facto

How would I define this "simplified XML" language? I would define it using Canonicalized XML (without ordering restrictions) as the base, UTF-8 or UTF-16 only, then allow these standard entity references.

In other words, I think Goal 5 is getting in the way. We need this one dialect.

5. The number of optional features in XML is to be kept to
the absolute minimum, ideally zero.

And I think Goal 10 needs to be revised:

10. Terseness of markup is of minimal importance.
Terseness of declarations is of utmost importance.

Isn't it too late now? Yes. But maybe those who really are better off with JSON are now happy with that, and the remaining markup types who would like to revive XML as a viable format in the incoming world of parallel parsing might benefit from a way forward.

For several decades I have dabbled with methods to speed up parsing UTF-8 and XML using SIMD and parallel parsing: my conclusion is that the approach I am suggesting here is the only feasible way for XML to not be sidelined as slow and complex. I think the lack of papers and experience demonstrating otherwise indicates it too.

(I saw an article last week discovering that the major bottleneck of JSON was all the parsing that was required, by the way!)

Now I said at the start that this would give XML a lot more bang per buck: what I should have said is that this would give XML almost the same bang (range of uses) for much less buck (complexity of implementation and understanding).