Next Previous Contents

2. Data formats

2.1 NoSQL table (relation) structure.

A table (or relation) is an ordinary ASCII file, with some additional rules that make it possible to use it as a database table. The file has records (rows) and fields (columns). The relation, or table structure, is achieved by separating the columns with ASCII TAB characters and terminating the rows with ASCII NEWLINE characters. That is, each row of data in a file contains the data values (each in a data field) separated by TAB characters and terminated with a NEWLINE character. Therefore a fundamental rule is that data values must NOT contain TAB characters.

The first section of the file, called the header, contains the file structure information used by the operators. The rest of the file, called the body, contains the actual data values. A file of data, so structured, is said to be a "table."

The header consists of exactly one line that contains the structure information: the column name row. Each column name begins with an octal "\001" character, i.e. ASCII start of header ("SOH") character, and is separated from the next column name by an ASCII TAB. The SOH character must uniquely mark the beginning of column names, and especially the beginning of the table header, and therefore it may not appear anywhere else as the first character in a row. The fields in the column name row contain the names of each column, and are separated from each other by a single TAB character. In the rest of this document, the SOH characters usually will be omitted from the explanations, but it is implied that they must always be present at the beginning of each column name.

The SOH character was choosen for a reason: it is the character that sorts at the top in lexicographical order, so that the table header does not usually need to be treated especially when sorting or performing most other table manipulations. It has been added to the beginning of each column name, not just the first one, to make it survive "table slicing" and column reordering.

The column names are case sensitive, i.e. "COUNT" is different from "Count." The guideline for characters that may be used in column names is that alphabetic, numeric, and the underscore (_) are good choices. Numeric-only column names are not allowed.

The TAB character must never be used in column names, nor should spaces or UNIX I/O redirection characters (<,>,|) be used. To be on the safe side, column names should always start with a letter and contain only upper and lower case letters, numbers, and the underscore (_). The following names are reserved to the AWK programming language, and should not be used to indicate column names:

BEGIN, END, break, continue, else, exit, exp, for, getline, gsub, if, in, index, int, length, log, next, print, printf, split, sprintf, sqrt, sub, substr, while, and possibly others. Refer to the mawk(1) man page. Furthermore, the '_nosql_' prefix is reserved for NoSQL internal use, and should never be used at the beginning of column names.

For instance, if you have a table that maps names to nicknames, then the table's two columns could be called Name and Nickname. Some NoSQL operators create new columns that have the same name as pre-existing table columns with lower-case letters prepended to them. This is why you really should stick to these rules.

Not abiding by these naming rules may still work, but there may be unexpected results.

A sample table (named SAMPLE) that will be used in later examples is shown in Table 1. The picture in Table 1 is for illustrative purposes; what the file would actually look like is shown in Table 2, where a TAB character is represented by "<T>," a NEWLINE is represented by "<N>" and a SOH by "<S>."

                          Table 1

                       table (SAMPLE)

                NAME    COUNT   TYP     AMT
                Bush       44     A     133
                Hansen     44     A     23 
                Jones      77     X     77 
                Perry      77     B     244
                Hart       77     D     1111
                Holmes     65     D     1111
  

                         Table 2
  
                table (SAMPLE) actual content

                <S>NAME<T><S>COUNT<T><S>TYP<T><S>AMT<N>
                Bush<T>44<T>A<T>133<N>
                Hansen<T>44<T>A<T>23<N>
                Jones<T>77<T>X<T>77<N>
                Perry<T>77<T>B<T>244<N>
                Hart<T>77<T>D<T>1111<N>
                Holmes<T>65<T>D<T>1111<N>
    

It is important to note that only actual data is stored in the data fields, with no leading or trailing space characters. This fact can (and usually does) have a major effect on the size of the resulting datafiles (tables) compared to data stored in "fixed field width" systems. The datafiles in NoSQL are almost always smaller, sometimes dramatically smaller.

A table also can be represented in a different format, called "list format". The list format of the above SAMPLE table is:


      NAME  Bush
      COUNT 44
      TYP   A
      AMT   133     
      
      NAME  Hansen
      COUNT 44
      TYP   A
      AMT   23      
      
      NAME  Jones
      COUNT 77
      TYP   X
      AMT   77      
      
      NAME  Perry
      COUNT 77
      TYP   B
      AMT   244     
      
      NAME  Hart
      COUNT 77
      TYP   D
      AMT   1111    
      
      NAME  Holmes
      COUNT 65
      TYP   D
      AMT   1111
      
    

The actual contents of a table in 'list' format, showing newlines and TABs is:


      <N>
      NAME<T>Bush<N>
      COUNT<T>44<N>
      TYP<T>A<N>
      AMT<T>133<N>
      <N>
      NAME<T>Hansen<N>
      COUNT<T>44<N>
      TYP<T>A<N>
      AMT<T>23<N>
      <N>
      NAME<T>Jones<N>
      COUNT<T>77<N>
      TYP<T>X<N>
      AMT<T>77<N>
      <N>
      NAME<T>Perry<N>
      COUNT<T>77<N>
      TYP<T>B<N>
      AMT<T>244<N>
      <N>
      NAME<T>Hart<N>
      COUNT<T>77<N>
      TYP<T>D<N>
      AMT<T>1111<N>
      <N>
      NAME<T>Holmes<N>
      COUNT<T>65<N>
      TYP<T>D<N>
      AMT<T>1111<N>
      <N>

    

Long lines, i.e. lines that are too long to fit in the width of the screen, may be folded over multiple rows in the 'list' format, provided that each continuation row starts with one TAB character. Field (column) names need to be separated by the associated data by exactly one TAB characters. The data part may contain physical TABs and newlines, which will be turned into '\t' and '\n' escapes respectively by the 'listtotable' operator when the list is turned into a table.


      COMMENTS  This is a very looong comment, that I want to fold over
        multiple lines.

    

and the actual content is :


      <N>
      COMMENTS<T>This is a very looong comment, that I want to fold over<N>
      <T>multiple lines.<N>
      <N>

    

As we will see, there are NoSQL operators that convert back and forth between "table" and "list" formats.

2.2 NoSQL and Data Types.

Unlike most other database systems, NoSQL knows nothing about data types. Everything is just a string, that occurs between one TAB character and the next one. This was done on purpose, of course, as NoSQL tables can be accessed in a number of ways, even directly with a text editor. NoSQL has no way of enforcing any data typing that we may possibly establish, so why bother about types at all. This model goes well with the plethora of text utilities that come with most Unices, and with Linux in particular, and is a very natural way of representing data, more on the human level than other conventions. The drawback is that it is up to the application to enforce data types if necessary.

As I have already pointed out, NoSQL should be seen just as a simple data dictionary toolkit. Its main purpose is to attach names to slices of an otherwise flat data file. Having a dictionary means that you can reference individual pieces of data by name rather than by their physical position in the file, thus attaining a basic level of information abstraction.

A table column can contain anything except physical tabs and newlines. The data itself can be anything that is considered to be text according to the local character set (mine is iso-8859-1, or Latin1). A field can even contain an entire text-encoded file (a BLOB). Common encodings are uuencode, base64 and quoted-printable. Large fields may of course break AWK or the other utilities, but that must be seen as a limitation in those programs or in the operating system, not something pertaining to the paradigm.

A valid NoSQL table needs always to contain the header. Keeping the latter in a separate file is possible but strongly deprecated.

Table editing/writing/locking/unlocking/versioning should not be seen as core NoSQL features, but simply add-on facilities. In real applications the locking policy may become quite complicated and very application-specific. Locking should be provided by the application program itself according to its needs. The same is true for modifying/versioning a table and ensuring overall database consistency.

The structure of a NoSQL table is record oriented, so that it easily can be acted by the wealth of existing UNIX utilities which are mostly record oriented. This does not mean that a table cannot map a more complicated structure, like an XML document or any other hierarchical tree-like structure. Such higher order dictionaries will not pertain to the paradigm though, but rather to the application that uses the table.

2.3 Notes on similar database packages.

Besides NoSQL and RDB there are other UNIX DBMS's, both commercial and free, that are based on ASCII tables. A commercial implementation is /rdb, by Revolutionary Software, while among the free ones there are Starbase, developed by John Roll, JDB, written by John Heidemann, and Gunnar Stefansson's reldb, a collection of interesting tools available at sites that carry archives of the comp.sources.unix Usenet newsgroup.

The ASCII table format of those database engines is very close to that of NoSQL, therefore data can easily be converted back and forth between them and NoSQL.


Next Previous Contents