Overview

Erik Fäßler edited this page Jun 6, 2019 · 6 revisions

Background

The Corpus Storage System (CoStoSys) was created to deal with the quickly growing PubMed/MEDLINE data in the context of natural language processing (NLP). It is part of the JeDIS framework, whose higher-level goal is to provide an infrastructure for NLP with UIMA on large corpora.

The overarching goal of CoStoSys is an infrastructure that satisfies the following requirements:

  1. Store the documents of an XML corpus.
  2. Allow quick and easy lookup of individual documents.
  3. Manage the concurrent access to the corpus.
  4. Allow the creation of subsets of the corpus.
  5. Track the processing state of each document of a subset when doing NLP (Is a document in process? Has it been finished? Were there processing errors?)
  6. Apply updates to the corpus.
  7. Provide an easy-to-use command line interface to the corpus database for frequent tasks.
  8. Provide an API for programmatic access to the corpus and the functions provided by CoStoSys.

CoStoSys relies on a PostgreSQL database for storage of the main corpus data and the representation of corpus subsets. The Data Model page provides details about the way this is achieved.

Functional Scope

CoStoSys' main purpose is twofold: to import documents into the database, which creates a primary data table, and to define subsets of that data in the form of subset tables, which offer columns to record the processing state of each document.

CoStoSys provides means to import XML data into the database. It stores whole XML elements, possibly extracted from larger XML data, in database tables. The contents of the XML data stored in the database are opaque to the database. The goal of CoStoSys is to store and retrieve whole documents, not to operate on them.

Internally, CoStoSys uses the JULIE Lab XML Tools, which offer XPath-oriented support for quick parsing of large XML files that contain individual documents as records. This model is a consequence of the focus on MEDLINE. The MEDLINE XML format defines the MedlineCitationSet element, which contains a number of MedlineCitation subelements. Thus, the goal of the XML tools in the context of CoStoSys is to extract the contents of the MedlineCitation elements given an XML file containing a MedlineCitationSet. Nowadays, the officially distributed MEDLINE data comes in PubMed format, which embeds the MedlineCitation elements in PubmedArticle elements collected in a PubmedArticleSet. This is essentially the same structure.
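The record-extraction idea described above can be illustrated with a minimal sketch. It uses only Python's standard library and is not the JULIE Lab XML Tools API (those are a separate Java library). The sample XML is heavily simplified, and the function name `iter_records` and its parameters are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Heavily simplified stand-in for a PubMed-format file (real records are much larger).
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>1</PMID><Article><ArticleTitle>First</ArticleTitle></Article></MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID>2</PMID><Article><ArticleTitle>Second</ArticleTitle></Article></MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def iter_records(stream, record_tag="MedlineCitation", id_path="PMID"):
    """Yield (document id, serialized record element) pairs from an XML stream.

    Hypothetical helper: the file is parsed incrementally, so a large file
    never has to be held in memory as one tree.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == record_tag:
            doc_id = elem.findtext(id_path)
            # Serialize the whole record; it is stored as-is, not interpreted.
            xml = ET.tostring(elem, encoding="unicode").strip()
            yield doc_id, xml
            elem.clear()  # release the record's subtree once it has been consumed


records = list(iter_records(io.BytesIO(SAMPLE)))
```

Each resulting pair corresponds to one row that an import would write: the identifier serves as the primary key and the serialized element as the opaque document payload.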

Once the data is inserted into the database, the original format is no longer important. The datatype of the document data may simply be text or even bytea. In this sense, CoStoSys could be used with non-XML data, but it currently provides no built-in means of importing such data into the database.

The main function of CoStoSys is then to create and manage subsets of the imported document data. A subset is a table that lists some or all of the primary keys of the primary document data table, as is explained in more detail in Data Model. Additionally, subset tables have columns to record the processing state of each document. These states are set by external applications using the CoStoSys API.
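To make the relationship between the data table and a subset table concrete, here is a minimal sketch. It uses an in-memory SQLite database purely so that it runs without a server, whereas CoStoSys itself relies on PostgreSQL. All table and column names (`documents`, `my_subset`, `in_process`, `is_processed`, `has_errors`) are assumptions for illustration; the actual schema is described on the Data Model page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database

conn.executescript("""
    -- Primary data table: one row per document, payload is opaque.
    CREATE TABLE documents (
        pmid TEXT PRIMARY KEY,
        xml  TEXT NOT NULL
    );
    -- Subset table: references primary keys and adds processing-state columns.
    CREATE TABLE my_subset (
        pmid         TEXT PRIMARY KEY REFERENCES documents(pmid),
        in_process   INTEGER NOT NULL DEFAULT 0,
        is_processed INTEGER NOT NULL DEFAULT 0,
        has_errors   INTEGER NOT NULL DEFAULT 0
    );
""")

conn.executemany(
    "INSERT INTO documents (pmid, xml) VALUES (?, ?)",
    [("1", "<doc>a</doc>"), ("2", "<doc>b</doc>"), ("3", "<doc>c</doc>")],
)

# A subset may list only a portion of the primary keys; here documents 1 and 2.
conn.execute(
    "INSERT INTO my_subset (pmid) SELECT pmid FROM documents WHERE pmid IN ('1', '2')"
)
conn.commit()

subset_rows = conn.execute(
    "SELECT pmid, in_process, is_processed, has_errors FROM my_subset ORDER BY pmid"
).fetchall()
```

The subset holds no document content of its own. It only points at rows of the data table, so several subsets over the same corpus can track independent processing runs.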

After the document data is imported and a subset is created (which is not strictly necessary but the common use case for concurrent document NLP), external programs may work on the tables set up by CoStoSys using the CoStoSys API. The actual NLP of the document data is then done by consumers that retrieve the documents from the database and possibly write results back.
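The consumer workflow described above might look roughly like the following sketch. It is not the CoStoSys API: the functions `claim_batch`, `process` and `run_consumer` and the schema are hypothetical, and SQLite stands in for PostgreSQL so that the example is self-contained. With several concurrent consumers on a real database server, claiming a batch would also need row locking, which this single-connection sketch does not show.

```python
import sqlite3


def setup():
    """Create a tiny data table and a subset covering all documents (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (pmid TEXT PRIMARY KEY, xml TEXT NOT NULL);
        CREATE TABLE subset (
            pmid         TEXT PRIMARY KEY REFERENCES documents(pmid),
            in_process   INTEGER NOT NULL DEFAULT 0,
            is_processed INTEGER NOT NULL DEFAULT 0,
            has_errors   INTEGER NOT NULL DEFAULT 0
        );
    """)
    docs = [("1", "<doc>alpha beta</doc>"), ("2", "<doc>gamma</doc>"), ("3", "")]
    conn.executemany("INSERT INTO documents VALUES (?, ?)", docs)
    conn.execute("INSERT INTO subset (pmid) SELECT pmid FROM documents")
    conn.commit()
    return conn


def claim_batch(conn, limit):
    """Mark up to `limit` untouched documents as in process and return their ids."""
    rows = conn.execute(
        "SELECT pmid FROM subset WHERE in_process = 0 AND is_processed = 0 "
        "ORDER BY pmid LIMIT ?",
        (limit,),
    ).fetchall()
    ids = [row[0] for row in rows]
    with conn:  # commits the updates, or rolls them back on an exception
        conn.executemany(
            "UPDATE subset SET in_process = 1 WHERE pmid = ?", [(i,) for i in ids]
        )
    return ids


def process(xml):
    """Stand-in for real NLP: count whitespace-separated chunks, reject empty input."""
    if not xml:
        raise ValueError("empty document")
    return len(xml.split())


def run_consumer(conn, batch_size=2):
    """Claim batches until none are left, recording the outcome of each document."""
    results = {}
    while True:
        ids = claim_batch(conn, batch_size)
        if not ids:
            break
        for pmid in ids:
            (xml,) = conn.execute(
                "SELECT xml FROM documents WHERE pmid = ?", (pmid,)
            ).fetchone()
            try:
                results[pmid] = process(xml)
                failed = 0
            except ValueError:
                failed = 1
            with conn:
                conn.execute(
                    "UPDATE subset SET in_process = 0, is_processed = 1, has_errors = ? "
                    "WHERE pmid = ?",
                    (failed, pmid),
                )
    return results


conn = setup()
results = run_consumer(conn)
```

In this sketch a failed document is still marked as processed, with its error flag set, so that it is not claimed again. Whether failed documents should be retried instead is a design choice left open here.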
