Extended Lexical Datatypes

RAN attribute names or values can be marked up with double quotes. These are treated as strings, with no parsing of their contents. (Unlike XML, the apostrophe cannot be used as the delimiter, and the “=” character cannot be used directly: use &eq; instead.)

However, if they are just tokens with no double quotes, then their lexical signature will be used to assign them to a datatype, which they should conform to. These types are considerably more extensive than, e.g., XML Schema Datatypes or JSON datatypes.

A start- or end-tag can also include an ellipsis “...”, which indicates that the markup is known to be incomplete. This does not affect the lexical typing of tokens: it merely means a validator should not report errors relating to absent or incomplete items.

Token Classification


Attribute values specified using tokens (i.e. not inside double quotes) can be quickly allocated to various types using the following productions (where any is like the regex .*):

TOKEN         ::= DATE-TIME-INTERVAL | PREFIX  |  INFIX | SIMPLE

DATE-TIME-INTERVAL ::= any+ "-" any+

PREFIX        ::= QUANTITY  | MONEY | RELATIVE-PATH
QUANTITY      ::= ("+" | "-" | digit)  any+
MONEY         ::= ("$" | ({>ASCII} && \p{Sc}))  any+
RELATIVE-PATH ::= "."  any+

INFIX         ::= ANCHOR-PATH | URI |  PREFIXED-NAME
ANCHOR-PATH   ::= any+ ( "+" | "~" ) any+
URI           ::= any* "/"  any+
PREFIXED-NAME ::= SIMPLE ":" SIMPLE+

SIMPLE        ::= any+

These productions are in order, with a  SIMPLE as the fall-through case. The SIMPLE production covers names and logic identifiers.

Note: This means that a SIMPLE token is slightly different from an XML name: it cannot start with "." or contain any "-". Element, attribute, link and PI names cannot be the same as a logic keyword.
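
As a rough illustration, the ordered productions above can be sketched in Python. This is a minimal sketch, not a normative lexer: the function name classify_token is hypothetical, and each production is read literally (any+ as one or more characters), so a lone digit such as 5 falls through to SIMPLE.

```python
import unicodedata


def classify_token(token: str) -> str:
    """Classify an unquoted token by trying the productions in order."""
    if not token:
        raise ValueError("empty token")

    # DATE-TIME-INTERVAL ::= any+ "-" any+  (a "-" with text on both sides)
    if "-" in token[1:-1]:
        return "DATE-TIME-INTERVAL"

    first, rest = token[0], token[1:]
    if rest:  # every PREFIX production needs any+ after the first character
        if first in "+-0123456789":
            return "QUANTITY"
        # "$", or a non-ASCII character in Unicode category Sc
        if first == "$" or (ord(first) > 127 and unicodedata.category(first) == "Sc"):
            return "MONEY"
        if first == ".":
            return "RELATIVE-PATH"

    # INFIX productions, in the order given
    if any(ch in "+~" for ch in token[1:-1]):
        return "ANCHOR-PATH"
    if "/" in token[:-1]:
        return "URI"
    if ":" in token[1:-1]:
        return "PREFIXED-NAME"

    return "SIMPLE"  # fall-through case
```

Because the productions are tried in order, any token with an interior "-" is claimed by DATE-TIME-INTERVAL before the later productions are considered.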

Lexically-Typed Tokens

Any attribute value that is not specified inside double quotes is parsed (lexed) using lexical typing: the presence of some non-letter character in the token determines its type.  

lexically-typed-token ::= quantity | money | date-time-interval
    | uri  | relative-path | anchor-path | logic |  name

Logic

Familiar: The traditional Boolean tokens of "true" and "false" are recognized as logic values.

Extended: RAN recognizes various tokens useful for the major 3- and 4-value logics. No system of logic is implied: RAN only provides token recognition.

For 3-value logic:

  • for SQL-style 3-value logic, where there is some comparison with NULL,  the token "unknown" may be appropriate;
  • for Kleene-style 3-value logic, where no information is available, the token "unknown" may be appropriate;
  • for Jan Łukasiewicz-style 3-value logic, where the value has yet to be determined (i.e. for a future event), the token "possible" may be appropriate.

For 4-value logic:

  • for Belnap-style 4-value logic, 
    • "both" means conflicting reports have been made, or that some parts are true, some are false.
    • "neither" means no information is available, or that some answer other than Boolean true and False are
  • for SAE J1939  CAN logic:
    • "error"
    • "unknowable" may be used for the CAN "Not Installed" state, indicating that no (e.g. Boolean) answer was possible because the component needed to determine it (such as a sensor) was not installed.
  • for Stoic modal logic, loosely:
    • "possible"   - can become "true"
    • "impossible"   - cannot become "true"
    • "necessary"  - cannot become "false"
    • "non-necessary" - can become "false"
  • For Larson logic  (?)
    • unknown
    • unknowable

Some ad hoc additions:

  • "dontcare" is also available (from IEEE 1164), appropriate when:
    • a result was not calculated for pragmatic or role reasons,
    • a Boolean value was not
  • "missing" as a logic equivalent to NULL.

logic         ::= boolean | indeterminate | modes | unavailable | significance
boolean       ::= “true” | “false”
indeterminate ::= "both" | "neither" | "unknown"
modes         ::= "possible" | "impossible" | "necessary" | "non-necessary"
unavailable   ::= "unknowable" | "missing"
significance  ::= "impossible" | "error"  | "dontcare"
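
Since RAN only recognizes these tokens and implies no system of logic, recognition can be as simple as a table lookup. The following is a minimal sketch under that assumption: the names LOGIC_CATEGORIES and logic_categories are hypothetical. It returns a list because "impossible" appears under both modes and significance in the productions above.

```python
# Category names and tokens are copied from the productions above.
LOGIC_CATEGORIES = {
    "boolean": {"true", "false"},
    "indeterminate": {"both", "neither", "unknown"},
    "modes": {"possible", "impossible", "necessary", "non-necessary"},
    "unavailable": {"unknowable", "missing"},
    "significance": {"impossible", "error", "dontcare"},
}


def logic_categories(token: str) -> list[str]:
    """Return every logic category that lists the token (possibly none)."""
    return [name for name, tokens in LOGIC_CATEGORIES.items() if token in tokens]
```

A token that belongs to no category yields an empty list and would then be treated as an ordinary name.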

Example:

Here is an example of  ad hoc use of the logic datatypes.

<<<"Vote Report"        id:=vr>>>
   <vote value=true     name:="Pete">Pete agreed</vote>
   <vote value=false    name:="Computer">Computer says "no"</vote>
   <vote value=missing  name:="hoffa"
              >The vote by Hoffa is registered but not found</vote>
   <vote value=error    name:="Santa"
              >Santa is not real, so the vote is recorded as an error</vote>
   <vote value=impossible name:="Napoleon"
              >The late French Emperor's vote is recorded as not possible.</vote>
   <vote value=possible name:="Julian"
              >Julian's vote is possible, but not yet received</vote>
   <vote value=both     name:="Vicky Pollard"
              >Vicky voted "Yeah but no but yeah"</vote>
   <vote value=dontcare name:="Lauren Cooper">Lauren ain't bovered</vote>
<<</"Vote Report">>>

Quantity

Familiar:  The traditional numbers such as 5 and +6.2  and -099.999   are recognized as numbers. RAN supports hexadecimal numbers, with 0x prefix, such as 0xBEEF.  

Extended: RAN supports metric units, such as 99_kg for 99 kilograms. An underscore followed by a Système international (d'unités) metric prefix followed by any name is recognized. Conventions also exist for regional units.

quantity      ::= ( number unit? )  
number        ::= ( “+” | “-”)? ( decimal | hexadecimal )
decimal       ::=  DIGIT+ (“.” DIGIT+)?
hexadecimal   ::= "0x" [0-9a-fA-F]+  (“.” [0-9a-fA-F]+ )?
unit          ::=  "_" ( si_unit | regional_unit )
si_unit       ::= metric_prefix? name (("/" | "·") name)*  
regional_unit ::= name "_" name
metric_prefix ::= "Y"|"Z"|"E"|"P"|"T"|"G"|"M"|"k"|"h"
                |"d"|"c"|"m"|"μ"|"n"|"p"|"f"|"a"|"z"| "y"

In general, one would expect that quantities would be marked up just as numbers, with an elided unit that is given elsewhere, e.g. in a column header. Such column headers can represent the units using a "0" introducer: e.g. <head-col number=1 unit=0_Hz >

Hint: An implementation may use this as a picture, to allow fewer characters in markup: e.g. the header column can have <head-col number=1 unit=00000.00_Hz > and a data value of "440" would be padded to "00440.00", e.g. for comparison purposes. This is not required by RAN.
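
The padding described in the hint can be sketched as follows. This is only an illustration of one possible reading: the function name pad_to_picture is hypothetical, and the sketch assumes non-negative decimal values and a picture made of "0" digits with an optional "." before the "_" of the unit.

```python
def pad_to_picture(value: str, picture: str) -> str:
    """Pad a plain decimal value to the widths given by a picture such as 00000.00_Hz."""
    number_picture = picture.split("_", 1)[0]      # drop the unit, keep "00000.00"
    int_pic, _, frac_pic = number_picture.partition(".")
    int_part, _, frac_part = value.partition(".")

    padded = int_part.zfill(len(int_pic))          # left-pad the integer digits
    if frac_pic:
        padded += "." + frac_part.ljust(len(frac_pic), "0")  # right-pad the fraction
    return padded
```

With the picture from the hint, pad_to_picture("440", "00000.00_Hz") gives "00440.00", so padded values of the same column can be compared as strings.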

Note: Capitalization is significant. The prefix "da" (for ×10) is not available, as it uses two letters.

Examples:

  • Simple numbers:  100, +1, -1, 1.000, -001.0
  • Hex numbers: 0xBEEF  (uppercase)
  • Quantity with simple unit:  26_kg
  • Quantity with simple unit (derived):  18_kΩ
  • Quantity with simple compound unit:  +83_kg·m/s
  • Non-SI quantity with regional qualifier:  24_pints_US
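
The examples above can be split into a number, a unit and an optional regional qualifier with a single regular expression. This is a minimal sketch rather than a full implementation of the grammar: the names QUANTITY_RE, Quantity and parse_quantity are hypothetical, the unit is kept as an opaque string (no attempt is made to separate a metric prefix from the unit name), and the value is converted to a float purely for illustration.

```python
import re
from typing import NamedTuple, Optional

QUANTITY_RE = re.compile(
    r"(?P<number>[+-]?(?:0x[0-9a-fA-F]+(?:\.[0-9a-fA-F]+)?|[0-9]+(?:\.[0-9]+)?))"
    r"(?:_(?P<unit>[^_]+)(?:_(?P<region>[^_]+))?)?"
)


class Quantity(NamedTuple):
    value: float
    unit: Optional[str]
    region: Optional[str]


def parse_quantity(token: str) -> Optional[Quantity]:
    """Split a quantity token into value, unit and region, or return None."""
    match = QUANTITY_RE.fullmatch(token)
    if match is None:
        return None
    text = match.group("number")
    # float.fromhex accepts an optional sign and the 0x prefix
    value = float.fromhex(text) if "0x" in text else float(text)
    return Quantity(value, match.group("unit"), match.group("region"))
```

For instance, parse_quantity("24_pints_US") yields the value 24.0 with unit "pints" and region "US", while a token that is not a quantity yields None.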

"·" has a higher precedence than "/", as conventional.  Compound units using "/" must start with "-" or "+" to prevent lexical clashes with paths.  Complex primitive units, such as s² are not supported directly, nor any bracketing or digit: the units are not intended to provide mathematical markup.  For compound quantities, make up your own unit, e.g. 

An extended example is this:  

<consumption of:=["beer" "alcohol"] 
             amount =[12.5_pints_UK  %1500-X ]
>12.5 Elizabethan pints</consumption>

In this case, we are defining that the current element has anchors of "beer" and "alcohol". It has an amount that is 12.5 of something called "pints" of somewhere called "UK" at a year of approximately 1500 (and any month or day).

Common Units  of Metrology

Unit names are application-dependent and not built-in to RAN, but the conventional SI quantities of g, m, s are preferred.  The 7 base units: second (s), meter (m), kilogram (kg), ampere (A), kelvin (K), mole (mol), candela (cd).

The standard 22 derived units: radian (rad), steradian (sr), hertz (Hz), newton (N), pascal (Pa), joule (J), watt (W), coulomb (C), volt (V), farad (F), ohm (Ω, Unicode U+2126), siemens (S), weber (Wb), tesla (T), henry (H), lumen (lm), lux (lx), becquerel (Bq), gray (Gy), sievert (Sv), katal (kat), degree Celsius (°C, using U+00B0).

Additionally:

  • Common derived metric units: liter (l), metric ton (ton or tonne)
  • Temperature units can be represented using "°" (U+00B0)
    • Celsius "°C", Kelvin "°K", Fahrenheit "°F"
  • Regional measures can be used, but must have the

For time, consider using the more powerful date-time-interval datatype.

Money

Familiar:

Extended: Money can be an amount or a currency. Both start with a Unicode currency character; an amount continues with a [0-9]+ number. Compound currencies and the ISO 4217 currency codes are supported.

Some examples of amounts:

  • $10                       - 10 dollars (country not specified)
  • $10.00_USD     - 10  U.S. dollars (ISO4217 3-alpha currency code)
  • $10.0_US           - 10  US  dollars (ISO3166 2-alpha country code)
  • ₵10_GHS           - 10 Ghanaian cedi

money       ::= sign (amount | currency)
sign        ::= ( "$" | "¤" | "€" | \p{Sc} )
amount      ::= digit+ letter* ("." digit+ letter*)*  ("_" code)?
currency    ::= (commonname ("_" code )?) |  code
code        ::= iso4217  |  iso3166
iso4217     ::= letter{3}
iso3166     ::= letter{2}
commonname  ::= letter+ ("." letter+)*

A currency that does not have a unique symbol must use ¤ to indicate a currency is being specified, and the country code. The choice of whether to specify the currency (using the common name, ISO 3166 country code or ISO4217 currency code, or some mix) is up to the user.   The RAN lexer/parser will merely divide the money into its parts. 

Further examples of amounts

  • ¤100zł                   - 100 of the currency to be called "zł"  i.e. "100 zł"
  • ¤1000_IDR          - 1000 Indonesian  rupiah     i.e.  "1000 IDR"
  • ¤1000Rp             - 1000 of currency to be called "Rp"   i.e. "1000 Rp"
  • ¤1000Rp_IDR   - 1000 Indonesian rupiah to be called "Rp"    i.e.  "1000 Rp"
  • ¤10bucks_USD  - 10  U.S. dollars (ISO4217 3-alpha currency code) with no currency sign specified, and to be called "bucks"   i.e. "10 bucks"

Examples of compound currencies:

  • £4.3s.8d        -  4 pounds, 3 shillings, and 8 pence (using "." instead of space)
  • ¤100.10zł_PLN   - 100.10 Polish złoty, i.e. "100.10 zł"
  • ¤0zł.10gr_PLN      - 10 Polish grosz ,  i.e. "0zł, 10 gr"       
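
Dividing an amount into its parts, as the examples above suggest, might look like the following. This is a minimal sketch under stated assumptions: the function name split_amount and the dictionary keys are hypothetical, each "."-separated part is assumed to be digits followed by zero or more letters, and a trailing "_" code is assumed to be two or three letters. No attempt is made to decide whether a second part is a decimal fraction or a sub-unit.

```python
import re
import unicodedata

# digits followed by zero or more letters, e.g. "10", "3s", "10gr"
PART_RE = re.compile(r"([0-9]+)([^\W\d_]*)")


def split_amount(token: str):
    """Divide a money amount into sign, (number, label) parts and code."""
    if len(token) < 2 or unicodedata.category(token[0]) != "Sc":
        return None
    sign, body = token[0], token[1:]

    code = None
    if "_" in body:
        body, _, code = body.rpartition("_")
        if not (code.isalpha() and len(code) in (2, 3)):
            return None

    parts = []
    for piece in body.split("."):
        match = PART_RE.fullmatch(piece)
        if match is None:
            return None  # not an amount (it may still be a currency)
        parts.append((match.group(1), match.group(2)))
    return {"sign": sign, "parts": parts, "code": code}
```

For £4.3s.8d this produces three parts, ("4", ""), ("3", "s") and ("8", "d"), leaving any interpretation of the labels to the application.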

Examples of currencies:

  • ¤Rp                 - The currency specified as "Rp"
  • ¤PLN             -  Polish złoty,  the unambiguous currency
  • ¤zł_PLN       -  Polish złoty,  the currency specified as "zł"
  • $dollars.cents_AUD  - Australian currency, called dollars and cents

 Example of using tuples to group an amount, a point in time (for exchange rate), a scenario and a validation status:

  • income =[ ¤ 1000000_IDR  20

Date Time Interval

Familiar: The simple ISO 8601 year-month-day date will be recognized, such as 2021-12-20. All dates must use the “-” delimited form in order to be recognized as dates (not URIs or numbers) and, at implementer discretion, be lexically checked and type-converted.

The usual time and timezone indications can be used:

  • Times  2021-10-07T

Due to lexical typing,

Extended: more of the capabilities of ISO 8601 are supported:  intervals, uncertainty, wildcards. 

The standard for dates, ISO 8601:2019, has several enhancements19 whose lexical form is also allowed20. These include:

  • Specifying intervals, etc, such as 1999/2021 meaning 1999 to 2021. An interval may be open-ended such as 2000/.. or unknown ending such as 2046/

  • Wildcarding dates, etc. Such as 202X-XX-01 meaning the first day of any month in the decade of the 2020s.

  • Indications that a date etc. is approximate, uncertain or both. Such as 1000-01-01~ meaning approximately the first day of that millennium, with some uncertainty. Or a birthday of %1818-?01-?15 signifies that the year is approximate but the month and day are uncertain.

  • Putting these together, we can specify a time range as X-XT12:01/X-XT13:01 meaning times from 12:01 to 13:01 on any (or every) day. (The X-X is needed to satisfy the detection requirement of having at least one “-”.)

As well, the “open end-time interval” and “unknown end-time interval” of Level 2 are allowed.21 The group and individual qualifications of Level 3 are also allowed, to represent uncertain, unspecified and approximate values.22 Unless the standard precludes them, the following patterns are possible23:

modern-date ::= date-time-interval | range
range ::= date-time-interval “/” date-time-interval

date-time-interval ::= date ( “T” time (“Z” | ((“+” | “-”)? shift))?)?
date ::= year “-” month (“-” day)?     Month or day precision.24

year ::= qual? [\dX]+ qual?
month ::= qual? [\dX]+ qual?
day ::= qual? [\dX]+ qual?

time ::= qual? [\dX]+ qual? ( “:” qual? [\dX]+ qual? )+
shift ::= qual? [\dX]+ qual? ( “:” qual? [\dX]+ qual? )?
qual ::= “?” | “%” | “~”
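
The productions above translate fairly directly into a regular expression for lexical checking. This is a minimal sketch, assuming ASCII digits only and ignoring the open-ended forms such as 2000/.. that the productions do not cover. The names MODERN_DATE_RE and is_modern_date are hypothetical.

```python
import re

# One numeric field: optional qualifier, digits or X wildcards, optional qualifier
NUM = r"[?%~]?[0-9X]+[?%~]?"
DATE = rf"{NUM}-{NUM}(?:-{NUM})?"      # year-month, optionally -day
TIME = rf"{NUM}(?::{NUM})+"            # hours:minutes, optionally more fields
SHIFT = rf"{NUM}(?::{NUM})?"           # timezone shift
DTI = rf"{DATE}(?:T{TIME}(?:Z|[+-]?{SHIFT})?)?"

# modern-date ::= date-time-interval | range
MODERN_DATE_RE = re.compile(rf"{DTI}(?:/{DTI})?")


def is_modern_date(token: str) -> bool:
    """Check a token against the date-time-interval and range productions."""
    return MODERN_DATE_RE.fullmatch(token) is not None
```

The check is purely lexical: it accepts 202X-XX-01 and %1818-?01-?15, but it does not verify that a month lies between 1 and 12 or that a day exists in its month.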

Anchor Path

An anchor path is like an XPath, except that you specify the fragment key plus any  anchors of elements. For example:

  • "f1+e22+g44" means "the (first) element with anchor of "g44" contained in the element with anchor "e22" contained in the fragment with fragment key of "f1"".
  • "*~person~city" means "all elements with anchor of "city" in all elements with anchor of "person" in any fragment". This is a 1:many link, with wildcarding.
  • "*~" is a wildcard meaning all fragments.

anchor-path   ::= "*~" |
          ((number | name | "*" )
           (("+" | "~") (number | name ))*
           ("+" | "~") (number | name | "*")?)
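
Before any lookup, an anchor path has to be divided into its fragment key and the anchors that follow it. The sketch below shows one way to do that split. The function name split_anchor_path is hypothetical, and the sketch does not resolve the path or interpret the difference between "+" and "~": it only keeps each separator with the anchor it introduces.

```python
import re


def split_anchor_path(path: str):
    """Split an anchor path into (fragment key, [(separator, anchor), ...])."""
    pieces = re.split(r"([+~])", path)   # keeps the separators in the result
    fragment = pieces[0]
    steps = [(pieces[i], pieces[i + 1]) for i in range(1, len(pieces) - 1, 2)]
    # A trailing separator with nothing after it (as in "*~") adds no step
    if steps and steps[-1][1] == "":
        steps.pop()
    return fragment, steps
```

For "f1+e22+g44" this gives the fragment key "f1" followed by the steps ("+", "e22") and ("+", "g44").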

Relative Path

A relative path is a directory path starting from a current location provided by the application. For example:

  • ../x/y

relative-path ::= “.” “.”? (“/” ( “..” | TEXT+))*

To specify an absolute file path, use e.g. a URI starting with "file:" 

URI

A URI follows the W3C conventions.

uri    ::= LETTER{1..16} “/” TEXT+

Name

Any other token is treated as a name. 

name ::= name-token (“:” name-token+ )?
name-token ::= [^\.\-][^-:]*

A name should follow the constraints of XML names. For RAN, syntax checking should confirm that any character below U+00B0 is an alphabetic or numeric character or one of ":", ".", "-" or "_": i.e. no control characters, whitespace or other punctuation characters.
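
The character check described above could be sketched as follows. The function name is_plausible_name is hypothetical, and the sketch covers only this character check, not the full XML name constraints or the name-token production.

```python
def is_plausible_name(token: str) -> bool:
    """Check that every character below U+00B0 is alphanumeric or one of : . - _"""
    return bool(token) and all(
        ord(ch) >= 0xB0 or ch.isalnum() or ch in ":.-_"
        for ch in token
    )
```

Characters at or above U+00B0 are passed through unchecked here, so a name such as "zł" is accepted while "a b" and "a=b" are rejected.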

Footnotes

17Implementation Note: This functionality must be implemented and exposed in some API. Given a reference to F1:X123, the value can be found in unparsed raw text by first scanning for the fragment (a linear scan of the text for “<<[^/]” until the fragment with @id of “F1” is found), then scanning start-tags until the next “>>” for the first attribute value of “X123”. This is a rough-and-ready text operation that can be performed on the raw text, to simulate ID/IDREF or keyed links, and relies on the target attribute value being unique among all attribute values in the fragment (not only ID values: it has nothing to do with ID types).

19See http://www.loc.gov/standards/datetime/

20Implementation Note: The implementer may provide parse-time lexical checking according to the rules above, or some subset, or may defer it, or may make it a user option, or may only check for the “-” character. The implementer decides which features the transducer reports (i.e. what is exposed in an API.) If some other form of ISO 8601 date etc is required, it must be put in a quoted attribute literal and catered for as a string.

21Implementation Note: An example of an open end-time interval is “1985-04-13/..”. An example of an unknown end-time interval is “1985/”. (Open or unknown start-times are not supported.)

22Implementation Note: A number in the date-time-interval may have an X instead of any expected digit, which is a lexical wildcard.

23Implementation Note: “%” before a year, month, etc, indicates that it is approximate. “?” indicates it is uncertain. “~” indicates it is both uncertain and approximate. “%” or “?” or “~” applies this to everything to the left.

24A year by itself will be treated as a number, not a year. The main point of the datatypes is that they allow lexical checking: the lexical checking of a year is trivial. The secondary point is to reduce explicit type-conversion in clients.