Overview

Erik Fäßler edited this page Jun 6, 2019 · 6 revisions

Background

The Corpus Storage System (CoStoSys) was created to handle the rapidly growing PubMed/MEDLINE data in the context of natural language processing (NLP). It is part of the JeDIS framework, whose higher-level goal is to provide an infrastructure for UIMA-based NLP on large corpora.

The overarching goal of CoStoSys is an infrastructure that satisfies the following requirements:

  1. Store the documents of an XML corpus.
  2. Allow quick and easy lookup of individual documents.
  3. Manage concurrent access to the corpus.
  4. Allow the creation of subsets of the corpus.
  5. Track the processing state of each document in a subset during NLP processing (Is the document in process? Has it been finished? Does it have errors?).
  6. Apply updates to the corpus.
  7. Provide an easy-to-use command line interface to the corpus database for frequent tasks.
  8. Provide an API for programmatic access to the corpus and the functions provided by CoStoSys.

CoStoSys relies on a PostgreSQL database for storage of the main corpus data and the representation of corpus subsets.
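
To make requirements 4 and 5 more concrete, the following is a minimal sketch of how a subset table with per-document processing state could look. It is not the CoStoSys schema or API: the table names (`documents`, `subset`), the column names (`in_process`, `processed`, `has_errors`), and the helper functions are assumptions made for illustration, and SQLite from the Python standard library stands in for PostgreSQL so that the example runs without a database server.

```python
import sqlite3

# Hypothetical schema: one table for the corpus, one for a subset of it.
def create_corpus(conn):
    conn.execute(
        "CREATE TABLE documents (doc_id TEXT PRIMARY KEY, xml TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE subset ("
        " doc_id TEXT PRIMARY KEY REFERENCES documents(doc_id),"
        " in_process INTEGER NOT NULL DEFAULT 0,"
        " processed INTEGER NOT NULL DEFAULT 0,"
        " has_errors INTEGER NOT NULL DEFAULT 0)"
    )

def claim_batch(conn, limit):
    """Mark up to `limit` untouched subset documents as in process; return their IDs."""
    with conn:  # commits on success, rolls back on an exception
        rows = conn.execute(
            "SELECT doc_id FROM subset"
            " WHERE in_process = 0 AND processed = 0"
            " ORDER BY doc_id LIMIT ?",
            (limit,),
        ).fetchall()
        ids = [row[0] for row in rows]
        conn.executemany(
            "UPDATE subset SET in_process = 1 WHERE doc_id = ?",
            [(doc_id,) for doc_id in ids],
        )
    return ids

def finish(conn, doc_id, error=False):
    """Record that processing of a document ended, with or without errors."""
    with conn:
        conn.execute(
            "UPDATE subset SET in_process = 0, processed = 1, has_errors = ?"
            " WHERE doc_id = ?",
            (int(error), doc_id),
        )

conn = sqlite3.connect(":memory:")
create_corpus(conn)
for doc_id in ("1001", "1002", "1003"):
    conn.execute(
        "INSERT INTO documents VALUES (?, ?)", (doc_id, "<doc id='%s'/>" % doc_id)
    )
# Here the subset simply contains every document of the corpus.
conn.execute("INSERT INTO subset (doc_id) SELECT doc_id FROM documents")
conn.commit()

first = claim_batch(conn, 2)    # two documents handed to a worker
second = claim_batch(conn, 2)   # only one unclaimed document is left
finish(conn, "1001")
finish(conn, "1002", error=True)
states = conn.execute(
    "SELECT doc_id, in_process, processed, has_errors FROM subset ORDER BY doc_id"
).fetchall()
```

With several workers on a real PostgreSQL server, the claim step would additionally need row locking (for example `SELECT ... FOR UPDATE SKIP LOCKED`) so that two workers cannot pick the same documents; the single-connection SQLite stand-in does not exercise that part.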

Functional Scope
