Recently I have been looking at *probabilistic schemas*. For the sake of helping along the conversation, what untapped possibilities are
there? Let's look at some ideas in increasing complexity:

**Probabilistic Pairs:** This approach is just a table of each possible element pair, and the probability of some relationship between one and the other. You might have a table for the XPath following:: axis, or for child:: or following-sibling:: and so on. You can use inductive logic on these pairs.
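As a sketch of such a pair table, you could estimate P(child | parent) on the child:: axis by counting observed pairs; the element names and the tiny corpus here are hypothetical, standing in for pairs harvested from a real document collection:

```python
from collections import Counter, defaultdict

# Hypothetical corpus: observed (parent, child) pairs on the child:: axis,
# as would be collected by walking each document's tree.
pairs = [
    ("SECTION", "TITLE"), ("SECTION", "P"), ("SECTION", "P"),
    ("SECTION", "TABLE"), ("P", "EMPH"), ("P", "XREF"),
]

counts = defaultdict(Counter)
for parent, child in pairs:
    counts[parent][child] += 1

def pair_probability(parent, child):
    """P(child | parent) on the child:: axis, estimated from the pair counts."""
    total = sum(counts[parent].values())
    return counts[parent][child] / total if total else 0.0

print(pair_probability("SECTION", "P"))  # 0.5: two of the four SECTION pairs
```

The same counting scheme works for any other axis; only the pair-collection step changes.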

**Bayesian Networks:** Bayesian refers to systems where the probability of some event is determined from knowledge of other prior events. (An example of a non-Bayesian probability is a coin toss; the entertaining Monty Hall problem can be cast as a Bayesian problem.) For example, see this paper, which gives the probability that some element will contain text based on that element alone – i.e. probabilities for each x/text() – and probabilities that an element will contain some child based on the parent as well – i.e. probabilities for each x/y/z, if I read it correctly.
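A minimal sketch of the first kind of estimate, counting how often each element name carries text; the one-document corpus is invented for illustration, and a real estimate would of course need many documents:

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical single-document "corpus" for estimating
# P(element contains text | element name), i.e. the x/text() probabilities.
doc = ET.fromstring(
    "<SECTION><TITLE>Intro</TITLE><P>Some text</P><P/></SECTION>"
)

with_text = Counter()
total = Counter()
for el in doc.iter():
    total[el.tag] += 1
    if el.text and el.text.strip():
        with_text[el.tag] += 1

def p_text(tag):
    """Estimated probability that an element with this name contains text."""
    return with_text[tag] / total[tag] if total[tag] else 0.0

print(p_text("P"))  # 0.5: one of the two P elements has text
```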

**Probabilistic Grammars:** These are schemas to which probabilities have been added for each child. To get a basic feel, consider this case where we take away any consideration of position or requirement from the grammar, just using Probabilistic Mixed Content Models:

`<!ELEMENT SECTION ( #PCDATA#0% | TITLE#100% | P#100% | TABLE#10% )* >`

where the percentage indicates how many of each parent would contain that child in a large document corpus or average document. You can imagine the uses:

- if the percentage is very low, you have identified outliers that might be better engineered out;
- if the percentage is 100%, you might check that the schema enforces this;
- you might use this information to provide better editor prediction of what elements come next;

and so on.
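A sketch of how such percentages might be derived from a corpus; the two tiny SECTION documents are invented, and each child type is counted at most once per parent, since the annotation asks "what fraction of parents contain this child at all":

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical mini-corpus of SECTION elements; in practice you would
# walk a large document collection.
corpus = [
    ET.fromstring(s) for s in (
        "<SECTION><TITLE/><P/><P/></SECTION>",
        "<SECTION><TITLE/><P/><TABLE/></SECTION>",
    )
]

contains = Counter()
for section in corpus:
    for tag in {c.tag for c in section}:  # each child type once per parent
        contains[tag] += 1

# Percentage of SECTIONs containing each child, as in the annotated
# content model above.
for tag in sorted(contains):
    print(f"{tag}#{100 * contains[tag] // len(corpus)}%")
```

This prints one `NAME#NN%` line per child type, which could then be folded back into the annotated content-model notation.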

**Hidden Markov Models:** a hidden Markov model (HMM) is a two-layer model (think states and recognizers, called *latent variables* and *observations*) and typically would be applied to make a schema out of the linear stream of SAX events, say (i.e. the probability that the start-tag for TITLE follows the start-tag for SECTION). The characteristic Markov Property is that the probabilities of the next state depend only on the current state and the observation made, not on previous observations. So a schema using an HMM would really be interested in typical sequences, on XPath's following::*[1] and child::*[1]. It could be used to detect outliers of sequence (i.e. where a common child element appears in an unusual position), and for implying tags in sparse documents. Once you have this model, where the probabilities of observations from a state and of going from state to state are in place, it can be used for all sorts of statistical inferencing about documents, especially for calculating the most likely missing markup.

An HMM schema would be most effective for schemas where there is no recursion or re-use of element names in different contexts, and no element grouping or multiple occurrence in a content model. So flat content models along the lines of `( a, b?, c{1-10}, d*, e )`, etc.
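To make the sequence idea concrete, here is a sketch using a plain Markov chain over the observable tag events; a full HMM would add hidden states behind these observations, and the event stream here is invented:

```python
from collections import Counter, defaultdict

# Hypothetical linear stream of SAX-like events (start- and end-tags)
# from a single small document.
events = [
    "start:SECTION", "start:TITLE", "end:TITLE",
    "start:P", "end:P", "start:P", "end:P", "end:SECTION",
]

trans = defaultdict(Counter)
for cur, nxt in zip(events, events[1:]):
    trans[cur][nxt] += 1

def p_next(cur, nxt):
    """Markov property: the next event depends only on the current one."""
    total = sum(trans[cur].values())
    return trans[cur][nxt] / total if total else 0.0

print(p_next("end:TITLE", "start:P"))  # 1.0 in this tiny sample
```

An event with a low `p_next` given its predecessor is exactly the "outlier of sequence" mentioned above.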

**Context-Free Semi-Hidden Markov Models:** an HMM might be classed as a kind of state machine, but if we use a stack machine instead, allowing a push or pop as one of the actions to go to the next state (i.e. to return to a previous state), then we get a Context-Free Semi-Hidden Markov Model (semi, because the Markov Property does not hold for pop operations). So here we are looking at elements proper as a tree, not as a linear list of start- and end-tags. In this case, we can get a much closer implementation of the Probabilistic Mixed Content Models above: the latent variables are the LHS (element name) and the observations are the RHS (the content model).

Many variants are possible. In fact, using the parent as the latent variable might get rid of one of the nice features of HMMs: that you can express probabilities that vary in different contexts (e.g. it rains more in Winter than Summer). So instead of each latent variable being the simple element, you could have one latent variable for each absolute path (in which case you don't actually need a stack, though you risk combinatorial explosion). More interesting would be to allow latent variables not based on explicit markup, for example capturing that the first paragraph in a section has no likelihood of containing lists.
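A rough sketch of the latent-variable/observation split just described, with the element name as LHS and the sequence of child names as RHS; the document is invented, and the recursive descent and return play the push and pop roles:

```python
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET

# Sketch: latent variable = element name (LHS), observation = the sequence
# of child element names (RHS). Recursing into a child is the "push";
# returning from it is the "pop" for which the Markov Property fails.
doc = ET.fromstring(
    "<SECTION><TITLE/><P/><P><EMPH/></P></SECTION>"
)

emissions = defaultdict(Counter)

def walk(el):
    rhs = tuple(c.tag for c in el)   # the observed content model
    emissions[el.tag][rhs] += 1      # emission from latent state el.tag
    for child in el:                 # push: descend into the child state
        walk(child)                  # pop: return to the parent state

walk(doc)
print(dict(emissions["P"]))
```

Normalising each counter would give a probability distribution over content models per element name, i.e. a probabilistic grammar of the kind discussed above.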

**Neural Networks**: It should be clear that if you relax the Markov property, you can start to get things
that look more like neural nets.

**Recursive Neural Networks:** A recursive neural network (RNN) is a tree structure of networks. There is in fact a 2014 research paper on this, *Self Learning Recursive Neural Networks for Structured Data Classification*, where they make a trie for the document corpus so that each unique absolute path in the corpus has a corresponding absolute path of neural networks, one per element.

The paper arrives at an interesting conclusion: "*The results show that equipping RNN with semi-supervision mechanisms can enhance the classification accuracy of structured data*", which relates to the training problem: training is expensive and time-consuming and requires lots of examples, so if you can help with some upfront information (to label some documents as typical of known clusters) then that provides a head start for calculations and outlier detection.

Readers: If you know any other research or ideas on probabilistic schemas, please let me know and I may update this article accordingly.

Feature detection and clustering are key techniques with many potential applications to XML processing, particularly for large-scale industrial publishers with large corpora. However, there has not been much research or public information about projects that do it. I think, in part, it is because the industry tends to have a reactive rather than pro-active approach to these things: you fix a problem as it occurs rather than trying to second-guess it upfront.