Skip to content
Matt Newville edited this page Apr 29, 2015 · 2 revisions

Implementation Overview

With the motivation from the previous section, the goals for the format and library are:

  1. store spectra for exchange. Special emphasis is to allow for data on "model" systems to be stored. Raw data, direct from the beamline will probably need to be converted to this format.
  2. store information about the sample, measurement conditions, etc.
  3. store multiple spectra, either on the same sample or multiple samples, and possibly taken at many facilities.
  4. provide programming libraries and simple standalone applications that can read, write, and manage such data libraries. Programming libraries would have to support multiple languages.

There are a few reasonable ways to solve this problem. What follows below is a methods which makes heavy use of ''relational databases'' and SQL. The principle argument here is that relational databases offer a well-understood, proven way to store data with extensible meta-data. The use of SQL also makes the programming libraries simpler, as they can rely on tested SQL syntax to access the underlying data store, and can be accessed from multiple programming languages and platforms.

Development Code and Data

As the XAS Data Library (XDL) is being developed, code and examples will be available at https://github.com/XraySpectroscopy/XASDataLibrary

Why PostgresQL and SQLite

We propose supporting both Postgresql and SQLite as backend storage for the data library. Both of these database systems are Free, well-supported, and in wide use. These have some important differences, but access to the underlying data is very similar using the two systems, and most software using these is completely interchangeable. Postgresql is a full-featured relational database using a client/server model to provide multiple users simultaneous access to the data. This is most appropriate for a web-interface to the XDL, and we expect to use Postgresql with a public web server for the initial, main XDL web site. SQLite provides an SQL database in a single, portable file, and is most appropriate for stand-alone applications, such as an application for storing and accessing personal XAFS data. The idea that two such applications (public web service, desktop application for private data) can use the same data storage approach means that one can imagine easily downloading from the public server into your private store of data, and uploading some of your data onto the public server.

Challenges Using SQL Tables for Numerical Array

SQL-based relational databases may not be the most obvious choice for storing scientific data. One obvious limitation is that relational databases don't store array data very well. Thus storing array data in a portable way within the confines of an SQL database needs special attention. The approach adopted here is to JSON, which can encapsulate an array, or other complex data structure into a string.

Using JSON to store array data

JSON -- Javascript Object Notation -- provides a standard, easy-to-use method for encapsulating complex data structures into strings that can be parsed and used by a large number of programming languages as the original data. In this respect, the requirements for the XAS Data Library -- numerical arrays of data -- are fairly modest. Storing array data in strings is, of course, what ASCII Column Files have done for years, only not with the benefit of a standard programming interface to read them. As an example, an array of data [8000, 8001.0 , 8002.0] would be encoded in JSON as

'[8000, 8001.0, 8002.0]'

This is considerably easier and lighter weight than using XML to encode array data.

In addition to encoding numerical arrays, JSON can also encode an associative array (also known as a Hash Table, Dictionary, Record, or Key/Value List. This can be a very useful construct for storing attribute information. It might be tempting to use such Associative Arrays for many pieces of data inside the database, this would prevent those data from being used in SQL SELECT and other statements: such data would not be available for making relations. But, as Associative Arrays can so useful and extensible, several of the tables in the database include a attributes column that is always stored as text. This data will be expected to hold a JSON-encoded Associative Array that may be useful to complement the corresponding notes column. This data cannot be used directly in searching the database, but may be useful to particular applications.

Challenges When Using SQLite and Postgresql

While more similar than different, the SQLite and Postgresql have some differences in detail, for example in the way they normally store dates, boolean values, and enums. These can be ironed out so that a common subset of each is used. This is actually not very difficult, but needs to be kept in mind. For example, Enum values (which may have been used to encode Elements, Collection Modes, etc) are implemented as indexes into foreign tables, and JOINs must be used to relate the data in the tables.