
2. Data formats

2.1 NoSQL table (rdbtable) structure.

Besides the regular UNIX editors and utilities, a good way to view the data is the NoSQL operator that prints such datafiles: 'nsq-pr' (named after the UNIX 'pr' utility).

The relation, or table, structure is achieved by separating the columns with ASCII TAB characters and terminating the rows with ASCII NEWLINE characters. That is, each row of a file contains data values (one data field per column) separated by TAB characters and terminated with a NEWLINE character. A fundamental rule, therefore, is that data values must NOT contain TAB characters.
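
As a minimal sketch of the rule above, the following Python function writes one row in this layout and refuses values that would break it. The function name 'format_row' is hypothetical and not part of NoSQL; rejecting embedded NEWLINE characters is an assumption that follows from NEWLINE being the row terminator.

```python
def format_row(values):
    """Join data values into one rdbtable row: TAB-separated, NEWLINE-terminated."""
    for value in values:
        # A TAB inside a value would be read back as a column separator.
        # Rejecting NEWLINE as well is an assumption: it is the row terminator.
        if "\t" in value or "\n" in value:
            raise ValueError("data value contains TAB or NEWLINE: %r" % value)
    return "\t".join(values) + "\n"
```

Note that the values are written exactly as given, without any padding.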

The first section of the file, called the header, contains the file structure information used by the operators. The header also contains optional embedded documentation relating to the entire datafile (table documentation) and/or each data column (column documentation). The rest of the file, called the body, contains the actual data values. A file of data, so structured, is said to be an 'rdbtable'.

The header consists of two or more lines: zero or more lines of table documentation, followed by exactly two lines that contain the structure information, namely the column name row and the column definition row. Each table documentation line starts with either a sharp sign (#) followed by a space character, or one or more space characters followed by a sharp sign (#). The rest of each line may contain any documentation desired. Note that the table documentation lines are the only lines in an rdbtable that are not required to conform to the table structure defined above. The fields in the column name row contain the column names, one per field. The fields in the column definition row contain the data definition and optional column documentation for each column.
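
The header layout just described can be sketched in Python as follows. This is an illustration only, not how the NoSQL operators are implemented; the function name 'split_rdbtable' is hypothetical.

```python
def is_doc_line(line):
    # Either '#' followed by a space, or one or more spaces followed by '#'.
    return line.startswith("# ") or (
        line.startswith(" ") and line.lstrip(" ").startswith("#")
    )

def split_rdbtable(text):
    """Split rdbtable text into (doc_lines, column_names, definitions, rows)."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final NEWLINE
    i = 0
    while i < len(lines) and is_doc_line(lines[i]):
        i += 1
    if len(lines) - i < 2:
        raise ValueError("header needs a column name row and a column definition row")
    doc_lines = lines[:i]
    names = lines[i].split("\t")            # column name row
    definitions = lines[i + 1].split("\t")  # column definition row
    rows = [line.split("\t") for line in lines[i + 2:]]  # body
    return doc_lines, names, definitions, rows
```

Everything after the two structure lines is treated as the body, one list of fields per row.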

The column names are case sensitive, i.e. 'COUNT' is different from 'Count'. The guideline for characters that may be used in column names is that alphabetic, numeric, and the non-alphanumeric characters listed below are good choices.

Non-alphanumeric characters that are acceptable in column names are underscore (_) and dash (-), but they must not be the first character in a column name. The TAB character must never be used in column names, nor should internal spaces or UNIX I/O redirection characters (<,>,|) be used. To be on the safe side, column names should always start with a letter and contain only upper and lower case letters, numbers and the underscore (_). To be really safe, column names should always start with an upper-case letter, to prevent them from conflicting with reserved keywords of the various programming languages involved (if, then, else, ...). For instance, if you have a table that maps names to nicknames, its two columns could be called Name and NName. Some NoSQL operators create new columns that have the same name as pre-existing table columns, with lower-case letters prepended to them. This is why you really should stick to these rules.

Not abiding by these naming rules may still work, but there may be unexpected results.
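
The "safe" and "really safe" guidelines above can be checked mechanically. The sketch below is one possible reading of them; the function name 'name_safety' and its three return labels are hypothetical. A name containing a dash is reported as "not safe" here, even though the dash is listed as acceptable, because it falls outside the conservative rule.

```python
import re

# A letter first, then only letters, digits and underscore.
SAFE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def name_safety(name):
    """Classify a column name as 'really safe', 'safe' or 'not safe'."""
    if SAFE_NAME.fullmatch(name) is None:
        return "not safe"
    # An upper-case first letter avoids clashes with keywords such as 'if'.
    return "really safe" if name[0].isupper() else "safe"
```

For example, 'NName' is classified as "really safe", 'count' as "safe", and '_count' as "not safe".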

The data definitions include column width, data type, and justification. The column width must be explicitly specified; the others are optional and are frequently left to their defaults.

The data definitions are specified by adjacent characters in a single word. The width of each field is specified by a numeric count. The type of data is "string", "numeric", or "month". The types are specified by an 'S', 'N', or 'M' respectively, and the default is type string. Printout justification is 'left' or 'right', and is specified by a '<' or '>' character respectively. If not specified, data types string and month will be left justified and type numeric will be right justified.
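
A data definition word such as '5N' or '8>' can be decoded as sketched below. The names 'ColumnDef' and 'parse_definition_word' are hypothetical. The sketch assumes that the width comes first and that the type and justification characters may follow in either order, which the description above does not state explicitly.

```python
import re
from typing import NamedTuple

class ColumnDef(NamedTuple):
    width: int
    type: str      # 'S', 'N' or 'M'
    justify: str   # '<' (left) or '>' (right)

def parse_definition_word(word):
    """Decode a data definition word, applying the documented defaults."""
    # Assumption: digits first, then type and justification in any order.
    match = re.fullmatch(r"(\d+)([SNM<>]*)", word)
    if match is None:
        raise ValueError("not a data definition word: %r" % word)
    width = int(match.group(1))
    flags = match.group(2)
    col_type = next((c for c in flags if c in "SNM"), "S")  # default: string
    justify = next((c for c in flags if c in "<>"), None)
    if justify is None:
        # Numeric columns default to right, string and month to left.
        justify = ">" if col_type == "N" else "<"
    return ColumnDef(width, col_type, justify)
```

With these defaults, '6' decodes to a left-justified string column of width 6, and '5N' to a right-justified numeric column of width 5.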

Note that column width is used primarily by the operator 'nsq-pr' and in no way limits the actual data size. It is not an error if some actual data in a column is wider than the defined width; a listing produced with 'nsq-pr' may be out of alignment however.
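
The effect of a value that is wider than its defined width can be illustrated with a small formatter. This is not the 'nsq-pr' implementation, only a sketch of padding without truncation; the function name 'render_row' and the single space between columns are assumptions.

```python
def render_row(values, columns):
    """Pad each value to its column width; never truncate.

    columns is a list of (width, justification) pairs, where
    justification is '<' for left or '>' for right.
    """
    cells = []
    for value, (width, justify) in zip(values, columns):
        # ljust/rjust return the value unchanged when it is already wider
        # than the width, so such a value pushes the rest of the line over.
        cells.append(value.rjust(width) if justify == ">" else value.ljust(width))
    return " ".join(cells)
```

A ten-character name in a column of width 6 is printed in full, and the columns to its right shift by four positions on that line.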

The optional documentation for each column follows the data definition word in the field. There must be one or more space characters after the data definition word and before the column documentation; the column documentation may be as long as necessary. Note that the data definition and the optional column documentation are contained in a single field in the row.
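
Since the data definition word and the column documentation share one field, separating them amounts to splitting at the first run of spaces. A minimal sketch, with the hypothetical function name 'split_definition_field':

```python
def split_definition_field(field):
    """Split one column definition field into (definition_word, documentation)."""
    # Everything up to the first space is the data definition word.
    word, _, doc = field.partition(" ")
    # One or more spaces may separate the word from the documentation.
    return word, doc.lstrip(" ")
```

A field without documentation yields an empty documentation string.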

If the column name and/or column definition rows contain much information and/or column documentation they can become long and confusing to read. However the operators 'nsq-valid' and 'nsq-headchg' have options to print the header contents as a 'template' file, an organized list of information about the header.

A sample rdbtable (named SAMPLE) that will be used in later examples is shown in Table 1. The picture in Table 1 is for illustrative purposes; what the file would actually look like is shown in Table 2, where a TAB character is represented by '<T>' and a NEWLINE character is represented by '<N>'.


                          Table 1

                     rdbtable (SAMPLE)

      # Table documentation lines. These describe and
      # identify the rdbtable contents.
      # They may be read by many normal UNIX utilities,
      # which is useful to easily identify a file.
      # May also contain RCS or SCCS control information.
      NAME    COUNT   TYP     AMT     OTHER   RIGHT
      6       5N      3       5N      8       8>
      Bush    44      A       133     Another This
      Hansen  44      A       23      One     Is
      Jones   77      X       77      Here    On
      Perry   77      B       244     And     The
      Hart    77      D       1111    So      Right
      Holmes  65      D       1111    On      Edge
  

                         Table 2
  
              rdbtable (SAMPLE) actual content

      # Table documentation lines. These describe and<N>
      # identify the rdbtable contents.<N>
      # They may be read by many normal UNIX utilities,<N>
      # which is useful to easily identify a file.<N>
      # May also contain RCS or SCCS control information.<N>
      NAME<T>COUNT<T>TYP<T>AMT<T>OTHER<T>RIGHT<N>
      6<T>5N<T>3<T>5N<T>8<T>8><N>
      Bush<T>44<T>A<T>133<T>Another<T>This<N>
      Hansen<T>44<T>A<T>23<T>One<T>Is<N>
      Jones<T>77<T>X<T>77<T>Here<T>On<N>
      Perry<T>77<T>B<T>244<T>And<T>The<N>
      Hart<T>77<T>D<T>1111<T>So<T>Right<N>
      Holmes<T>65<T>D<T>1111<T>On<T>Edge<N>
    

It is important to note that only actual data is stored in the data fields, with no leading or trailing space characters. This fact can (and usually does) have a major effect on the size of the resulting datafiles (rdbtables) compared to data stored in "fixed field width" systems. The datafiles in NoSQL are almost always smaller, sometimes dramatically smaller.
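
The size difference can be worked out for the body of the SAMPLE table. The sketch below compares the TAB-separated layout with a hypothetical fixed-width layout in which every field is padded to the width given in the column definition row. It assumes one byte per character and no separators between fixed-width fields; both function names are hypothetical.

```python
# Widths taken from the SAMPLE column definition row: 6, 5N, 3, 5N, 8, 8>
WIDTHS = [6, 5, 3, 5, 8, 8]
ROWS = [
    ["Bush", "44", "A", "133", "Another", "This"],
    ["Hansen", "44", "A", "23", "One", "Is"],
    ["Jones", "77", "X", "77", "Here", "On"],
    ["Perry", "77", "B", "244", "And", "The"],
    ["Hart", "77", "D", "1111", "So", "Right"],
    ["Holmes", "65", "D", "1111", "On", "Edge"],
]

def rdbtable_size(rows):
    # Field characters, one TAB between fields, one NEWLINE per row.
    return sum(sum(len(v) for v in row) + (len(row) - 1) + 1 for row in rows)

def fixed_width_size(rows, widths):
    # Every field padded to its declared width, one NEWLINE per row.
    return sum(
        sum(max(w, len(v)) for v, w in zip(row, widths)) + 1 for row in rows
    )

print(rdbtable_size(ROWS), fixed_width_size(ROWS, WIDTHS))  # prints: 143 216
```

Under these assumptions the six body rows take 143 bytes as an rdbtable against 216 bytes in the fixed-width layout.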

2.2 Notes on /rdb table format.

Besides NoSQL there are other UNIX DBMSs, both commercial and free, that are based on ASCII tables. A commercial implementation is /rdb, by Revolutionary Software, while among the free ones there are Starbase, developed at the Harvard Smithsonian Astrophysical Observatory, and Gunnar Stefansson's reldb, a collection of interesting tools available at sites that carry archives of the comp.sources.unix Usenet newsgroup.

The ASCII table format of those database engines is very close to that of NoSQL, therefore data can easily be converted back and forth between them and NoSQL.

Here is what the basic /rdb and Starbase table format looks like:


                          Table 1a

                     Starbase table (SAMPLE)

      Table documentation lines. These describe and
      identify the rdbtable contents.
      They may be read by many normal UNIX utilities,
      which is useful to easily identify a file.
      May also contain RCS or SCCS control information.

      NAME    COUNT   TYP     AMT
      ----    -----   ---     ---
      Bush    44      A       133
      Hansen  44      A       23
      Jones   77      X       77
      Perry   77      B       244
      Hart    77      D       1111
      Holmes  65      D       1111
    

As with the NoSQL format, the actual table contents are:


                         Table 2a
  
              Starbase table (SAMPLE) actual content

      Table documentation lines. These describe and<N>
      identify the rdbtable contents.<N>
      They may be read by many normal UNIX utilities,<N>
      which is useful to easily identify a file.<N>
      May also contain RCS or SCCS control information.<N>
      <N>
      NAME<T>COUNT<T>TYP<T>AMT<N>
      ----<T>-----<T>---<T>---<N>
      Bush<T>44<T>A<T>133<N>
      Hansen<T>44<T>A<T>23<N>
      Jones<T>77<T>X<T>77<N>
      Perry<T>77<T>B<T>244<N>
      Hart<T>77<T>D<T>1111<N>
      Holmes<T>65<T>D<T>1111<N>
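
Judging only by Tables 1a and 2a, the main difference in the header is that the column definition row is replaced by a row of dashes, one run of dashes per column, as long as the column name. Under that assumption, writing NoSQL column names and body rows in this layout can be sketched as follows. The function name 'to_dashed_table' is hypothetical, and table documentation lines are left out of the sketch.

```python
def to_dashed_table(names, rows):
    """Render column names and rows with a dash row instead of definitions."""
    # Assumption from Table 2a: each dash run is as long as its column name.
    dashes = ["-" * len(name) for name in names]
    lines = [names, dashes] + rows
    return "".join("\t".join(fields) + "\n" for fields in lines)
```

Going the other way requires choosing a column definition for each column, since the dash row carries no type or justification information.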
    

And here is the corresponding list format of the same table:



      NAME      Bush
      COUNT     44
      TYP       A
      AMT       133     
      
      NAME      Hansen
      COUNT     44
      TYP       A
      AMT       23      
      
      NAME      Jones
      COUNT     77
      TYP       X
      AMT       77      
      
      NAME      Perry
      COUNT     77
      TYP       B
      AMT       244     
      
      NAME      Hart
      COUNT     77
      TYP       D
      AMT       1111    
      
      NAME      Holmes
      COUNT     65
      TYP       D
      AMT       1111
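
The list layout shown above, one "name value" line per column and an empty line after each record, can be produced from the column names and body rows as sketched below. The function name 'to_list_format' is hypothetical, and padding the column name to ten characters merely mimics the listing; the separator actually used by /rdb or Starbase is not specified here.

```python
def to_list_format(names, rows, name_width=10):
    """Render rows as records of 'name value' lines, one record per row."""
    out = []
    for row in rows:
        for name, value in zip(names, row):
            # name_width=10 is an assumption matching the listing above.
            out.append(name.ljust(name_width) + value + "\n")
        out.append("\n")  # empty line terminates the record
    return "".join(out)
```

Each record repeats the column names, so this layout trades file size for readability when rows have many columns.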
      
    

