RAN-CSV

(Initial version: 2021-09-16)

RAN-CSV is an adjunct to RAN, to provide a built-in conventions for storing a relational database and retrieving certain row-sets (grouped data-rows) fast.  CRUD operations on row-sets are defined below. 

An implementation does not need to support all features for simplicity. Dynamic creation of tables is allowed. 

The basic model is that each relational table is a RAN fragment.  Within the RAN fragment, there are one or data-rows elements containing unordered CSV. The CSVs can be grouped into separate data-rows each with some group, corresponding to some common course-grain query option, perhaps even the most common query. The fragments are in logical order, such as by date, and so allow appending.

As with RAN fragments, this allows many common retrieval operations to be performed without fine-grained parsing.  (This allows the document to perform more like a REDIS database, with quite efficient linear scans that are O(n/m) where n is the file size and m is a blocksize.)

Use Cases

The uses cases are:

  • 2-way Interconvertable CSV-RAN: A system can read a RAN-CSV file in and expand it to RAN. This makes the CSV a form of compression and data-loading and integration-sugar.
  • Direct-to-database: A system can read a RAN-CSV file in and populate a relational database with it directly, without going through an RAN-DOM.  (This also applies to databinding to objects.)
    • Similarly, the database can be serialized to a RAN-CSV file without creating a RAN-DOM.
  • Lo-Parse: A system can read the RAN-CSV file in and locate particular fragments, and return the CSV directly. This allows coarse-grain, pre-sorted pre-arrangements of table data, without requiring a query.  (Only standard RAN delimiting of field values, namely > and & will be performed.)
  • Appending: The RAN-CSV file can be updated by appending new fragments. The RAN-CSV processor takes care of combining fragments for the same table.
  • Journal/Instruct: The RAN-CSV file has attributes enough to allow a system to create, read (i.e. find matching rows), update and delete table rows. The RAN-CSV file can act as a journal.  This CRUD functionality is not provided by the base, only create.

Features

A processor can support, and an API call can require, the following features or processing options. These are specified here to establish minimum expectations, and to explain the use of the markup conventions.

Output

  1. GENERATE-HEADER - generate the header at the start of the CSV or RAN. For RAN conversion, the header name is an empty element with that name. The name can be any string, as RAN element names can be literals e.g. <"Common Name">...
  2. APPEND-GROUP - add the @group into a column @group-name at the end of each row

Coarse Query

  • GET ROWS FOR TABLE: The coarsest query is for a single table, to return the effective table, after assembly.
    • All implementations should support this.
  • GET ROWS FOR TABLE BY GROUP:  return a single table, with just the information that matches the @group.
    • All implementations should support this.
  • GET ROWS FOR TABLE BY DATE-TIME-RANGE: returns a single table, with all entries using the data-rows using data-rows/@to and @from  (these may be ranges of dates, not just dates, using RAN modern ISO 8601 datatypes.)
    • This is an optional feature.

Coarse Assembly

RAN-CSV fragments for each @name are processed in document order.  The following operations apply to grouped data rows of a table,  not to individual rows in the data-rows or to tables per se (except for simple tables with only one data-rows element and no grouping.)

The model is that a RAN-CSV processor goes through each fragment in order, apply the appropriate CRUD operation for the fragment to the accumulated data-tables, to assemble each effective table.

The term swallows below means a processor ignores the data-rows element, does not process it, does not delete it, skips it. ignores it: the data-rows has no effect on the output.

A RAN-CSV file API may allow any of the following features, minimally ALLOW-CREATE only. An function-call to use the data may require the following features be supported. If an implementation does not support the feature, it should presumably be an error.

Mode:

RAN-CSV documents allow a range of CRUD operations to take advantage of the provided markup. The following processing features use the markup on the fragment /data-table and /data-table/data-rows

  1. (DIS)ALLOW-CREATE - take the first fragment element with the appropriate @name as the base. The default is ALLOW-CREATE.
    • DISALLOW-CREATE specifies that data-table elements marked @treat=create should be swallowed. For example, if only updates were of interest.
    • An implementation only needs to find the first fragment with a name and @treat=create, if only ALLOW-CREATE is enabled, by definition.
  2. (DIS)ALLOW-REPLACE - replace the data-rows[@groups = $groups] of the matching  accumulated data-tables. The default is DISALLOW-REPLACE, to swallow any data-rows of that name.
    • An implementation can ignore any fragments with a given @name and all the same groups before the last one in the file, by definition.
    • No duplicate checking of CSV rows is involved
    • When performing a replace operation, a RAN-CSV processor does well to look at the last fragment in the file, and if it is a matching table with the same groups, overwrite it. A processor may replace the previous fragments for this group with spaces and newlines. An implementation may decide to compact the RAN-file after any operations.  
  3. (DIS)ALLOW-UPDATE - append data-rows[@groups = $groups]  to the matching accumulated data-tables. The default is DISALLOW-UPDATE, to swallow any data-rows of that name, for speed.
    • No duplicate checking of CSV rows is involved.
  4. (DIS)ALLOW-DELETE - delete all data-rows[@groups = $groups] of the matching  accumulated data-tables.

Dynamic:

  1. (DIS)ALLOW-DYNAMIC - Pass the data-rows information to some callback, e.g. to dynamically populate the information. The default is DISALLOW-DYNAMIC. This is entirely implementation dependent.
    • This is an operation of on data-rows, not the data-table (table). Thus they operate by groups. The data-table/@treat will determine how the data-row is supposed to be handled.

Structure

In a spontaneous DTD-like syntax:

<!FRAGMENT data-table  (data-header?, data-rows*) >
<!ATTLIST  data-table 
   name    CDATA #REQUIRED
   to      CDATA #REQUIRED

   applies CDATA #OPTIONAL
   version CDATA  #OPTIONAL

   row-separator       #IMPLIED
   field-separator     #OPTIONAL ","
   string-delimiter    #OPTIONAL '"'
   delimiter-delimiter #OPTIONAL "/"

   row-separator-in-string ( error | allow | new-row | delimit )  allow
   data-type  (none | non-strings | all) non-strings
   treat      (create | replace | update | delete ) create
>
<!ELEMENT data-header CDATA >
<!ELEMENT data-rows   CDATA >
<!ATTLIST data-rows
   group        CDATA #OPTIONAL
   group-name   CDATA #OPTIONAL

   range        CDATA #OPTIONAL
   from         CDATA #OPTIONAL
   to           CDATA #OPTIONAL

   dynamic  (yes | true | no | false)  "no" 
>

In RAN-CSV, each table is a RAN Fragment named data-table.

  • The usual pattern of storing a relational database as Comma-Separated-Values is for each table to be separate file.
  • Any top-level elements outside the fragment are not relevant to RAN-CVS.
  • The data-table has the following attributes:
    • @name is the name of the table. There may be multiple fragments with the same name.
    • @applies is a value (typicaly a RAN date-time-range value) specify when the data was produced, or the period it applies to.
    • @version is a token that indicates some version information
    • @id is the unique identified for the fragment. According to RAN conventions, @id is required and should be the first attribute after the fragment start-tag name. An @id might typically be composed by the values @name "_" @version, however this is not a default.
    • @row-separator is the character used as the row separator. By default it is 7#x0D; newline. (The implementation will automatically allow UNIX and DOS forms of newlines.)
    • @field-separator is the character used as the field separator. By default it is , comma.
    • @string-delimiter is the character used to delimit strings. By default it is " double quote.
    • @delimiter-delimiter is the character used inside strings to delimit delimiter characters. By default it is \ backslash. So:
      • \\ is \
      • \" is "
      • \n is newline
      • \t is tab
      • Any < or & character in the external CSV should be delimited using XML rules, namely &gt; and &amp;
    • @row-separator-in-string determines the treatment of row separator delimiters found in string. The values are:
      • error
      • allow (do not treat as delimiter)
      • new-row (close current row and start a new one)
      • delimit (error unless preceding delimiter-delimiter).
    • @datatype determines how fields are lexed into datatypes:
      • none
      • non-strings - only recognize as RAN types items outside strings
      • all - attempt to type strings and non-strings
    • There is no provision currently for fields within fields.
    • @treat is a hint, which may be used by the RAN-CSV processor to select elements if capable.   The default is "create". There should only be one "create" fragment for a given name. The initial token may be:
      • new - if it is the base information
      • replace - if it is supposed to supersede any previous base information
      • update - add rows to already created data - this is an append behaviour with no checking for duplicates
      • delete - if the data row has been deleted.  data-row is not actioned by the simple CVS-RAN processor and would cause an error.

A data-table contains (data-header?, data-rows*).

  • data-header
    • is a row of header names, using the field-delimiter.
  • data-rows
    • contains CSV data, by default in in lines.
    • the contents of multiple data-rows elements are all part of the same table.
    • data-row may have the following attributes:
      •  @group gives a column value shared by all rows. If there is an @group-name then all rows get final column with this name and the value. Only one is provided. This allows faster identification of a data-rows, and some compression.
        • A data-rows element without an @group attribute contains all data-rows that do not belong to any other group. 
      • @range gives a column in the data-rows data. The attributes @to and @from give the highest and lowest values of these found in the table. Only one is provided. This allows faster identification of a data-rows when searching for a particular row.
      • @dynamic =true or yes on an empty data-rows element indicates that the filled dynamically by some exception process.  If the data-rows has values, then these are a cache of the data, and may be updated based on some time consideration, for example, using the data-table/@applies value. The default is no.

Examples

Simple Archive

A RAN-CSV file contains multiple fragments. There is one fragment per table only. There is one table-rows element per fragment.  An existing relational database CSVs can be stored in it.  In this way it acts as an archive, an alternative to a file-system or UNIX .tar file.

The archive can be extracted by streaming the RAN file, scanning for each fragment-start-tag open-delimiter "<<[^/]" and then scanning for "name", then "=" then following token, which can be used as the filename with ".csv" appended.

Here is our simple file, with two simple tables.

<<<data-table name:="alphas">>>
<data-rows>
a,b,c,d
e,f,g,h
i,j,k,l
</data-rows>
<<</data-table  name:="alphas">>>
<<<data-table   name:="nums">>>
<data-rows>
1,2,3,4
5,6,7,8
9,0,1,2
</data-rows>
<<</data-table  name:="nums">>>

If we know that only create has been used, then a streaming processing can halt after using the appropriate fragment.

Full Feature Use

Our scenario is we have a Web API that returns CVS tables, where there are one or two parameters that characterize most accesses.  Our API supports the full RAN-CSV feature set.

For example, say we have a pharmaceutical database with tables for drugs,forms, prices, restrictions, brands.

-> So our RAN-CSV file will have fragments each containing a /data-table/@name  for 'drugs',  'manner-of-administration', 'brands', 'prices'

We have additional data which is added each week.

-> So we mark our initial fragments with /data-table/@treat=create, and then append new fragments to the document with the same @name, with @treat=update.

<<<data-table name:="prices" ...>>>...<<</data-table name:="prices">>>

<<<data-table name:="drug" treat=create>>>
<data-header>
name,manufacturer,manner-of-admistration,use
</data-header>
<data-row>
aspirin,Beyer,pill,pain
cannabis,DrZappy,oil,pain
vaxevria,AstraZenica,injectable,vaccine
freddy,Gallan,animal,bad-blood
</data-row>
<<</data-table name:="drug">>>

<<<data-table name:="drug-update" to=:"drug" treat=update>>>
<data-row>
spikevax,Moderna,injectable,vaccine
</data-row>
<<</data-table name:="drug-update">>>

We want to be to then get the current data for a whole  table.

->Our processor goes through each fragment for the table name, matching the /data-table/@name, and appends the all the data

Sometimes we make a mistake and need to replace a whole table.

-> We append a new fragment with that name to the document, marked with /data-table/@treat=replace. The next time the RAN-CSV processor knows to replace

Each month a consolidated version is put out as a new RAN file.

-> So we go though the old RAN-CSV file,  using the @treat actions to create our effective tables, and then output that, one fragment for each table.

The most common query is that people want to get the drugs from the drug table in a CSV file that correspond to a manner-of-administration, such as 'suppository' or 'pill'. Also, they are not interested in drugs that are not currently allowed.

->We group our fragments so there is one data-rows for each manner-of-administration, using the table-rows/@group attribute.

-> For each manner of administration, we have two data-row elements, one for current drugs, the other for old drugs.  We use the @applies field to specify a currency  date-time-range on them both

<?RAN align=10 skip=7 ?>

<<<data-table name:="drug" treat=create>>>
<data-header>
name,manufacturer,manner-of-admistration,use
</data-header>

<data-row group="pain" group-name="use">
aspirin,Beyer,pill
cannabis,DrZappy,oil
</data-row>

<data-row group="vaccine">
vaxevria,AstraZenica,injectable
spikevax,Moderna,injectable
</data-row>

</data-row>
<data-row name="bad-blood" group-name="use"
   applies=X-X-X/%1900-X-X>
freddy,Gallan,animal
</data-row>
<<</data-table  name:="drug" >>>

Lets go through this example:

So to perform this query, our implementation might:

  • First scan through the file using fast SIMD operations (256 bytes at a time) to find all fragment-start-tag identifiers.
    • The RAN Pragma PI tells the processor that fragments (except the first) are aligned on 2^10 byte boundaries, within the first 2^7 bytes. So it only needs to go to each 1k address, then look at the first 128 bytes with one or two SIMD vector instructions, then skip the following 896 characters. (A good implementation might first test using SIMD instructions a vector of the first 16 bytes only, i.e. one cache-line, to avoid unnecessary loading of bytes.)
  • Parse the fragment start-tag for the attributes, and determines if it has a match on the name
    • It only needs to scan until '>' to find the end of the start-tag. Within that range, it can tokenize.  In fact, it only needs to look for 'name\s*\=\s*\"?' to locate the name, since @name is required here.
  • For each of these fragments for that table, find the data-rows.
    • It only needs to scan for "<data-row" to locate all data rows. There is no CDATA that could confuse it, nor < inside attribute values. Parse the element start-tag for the attributes.
  • Select the matching data-rows elements for the group and the date range, and perform the CRUD operations on the data-rows.
    • At no stage is there are requirement to read in the CSV data or completely parse the RAN document.
    • The table-rows/@applies attribute uses the ISO8601 date-time-range subset built-into RAN.  It is interpreted as follows:
      • X-X-X means any date
      • / mean until (i.e. a range
      • %1900-X-X means approximately the year 1900.  See the Grammar section.
  • Return the effective CSV to output. For each data-rows in the effective table, scan until "<" is found. If is not </data-rows it is an error. If the text has any "&" then perform character reference replacement, otherwise merely copy the text to the output.
    • In the absence of & and < in data, the original data buffer used to read the RAN-CSV file can probably be used (blocks permitting), without the need for any re-allocation.  For-example, if the whole RAN-CSV file is in-memory. 

Limitation

RAN-CSV provides no complementary mechanism for elements.  The fast runtime location of fragments is all, plus the possibility of two-part IDREFs, where the prefix of the IDREF is the @id (or @name?) of the fragment element.

One intent of RAN-CSV is to show how the provision of fast-accessible fragments, a lazily parseable syntax with no modes, etc,  can be a springboard for new uses of markup languages, and for applying markup languages to areas that they have been excluded from for performance and complexity reason, in real life.  Some of the conventions proposed could also be done with XML; but they have not, and so the XML ecosystem does not have APIs etc to support it out-of-the-box.  My feeling would be that RAN ecosystem would be better to start with a good complement of these kind of capabilities.  If RAN systems merely parse the whole document into a DOM, as the only mode of processing, it would defeat much of the purpose.

Possibility: Index

Define a fragment that can appear at the end, such as

<<<data-index>>>
  <data-table index-in-bytes="2345" name:="drug" ... >
    <data-rows value-index-in-bytes="1367" value-length="86" group=:"pain" ... />
    <data-rows value-index-in-bytes="1367" value-length="92" group=:"vaccine" ... />
    ...
  </data-table>
  <data-table ... >
    ...
<<</data-index>>

This has byte offsets to each data-row, allowing the grouping, CRUD assembling and locating to be done only using this index, without have to parse the document at all. An indexing function could add this. It could be searched for by searching backwards from the end of the document for  the first (i.e., the last) <<[^/] delimiter, then seeing if it is a data-index.  Other data-index fragments in the document would be ignored.