Overview
Erik Fäßler edited this page Jun 6, 2019 · 6 revisions
The Corpus Storage System (CoStoSys) was created to cope with the rapidly growing PubMed/MEDLINE data in the context of natural language processing (NLP). It is part of the JeDIS framework, whose higher-level goal is to provide an infrastructure for NLP on large corpora using UIMA.
The overarching goal of CoStoSys is an infrastructure that satisfies the following requirements:
- Store the documents of an XML corpus.
- Allow quick and easy lookup of individual documents.
- Manage concurrent access to the corpus.
- Allow the creation of subsets of the corpus.
- Track the processing state of each document in a subset during NLP (Is a document in process? Has it been finished? Does it have errors?).
- Apply updates to the corpus.
- Provide an easy-to-use command line interface to the corpus database for frequent tasks.
- Provide an API for programmatic access to the corpus and the functions provided by CoStoSys.
CoStoSys relies on a PostgreSQL database for storage of the main corpus data and the representation of corpus subsets.
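To make the idea of subsets and processing-state tracking more concrete, here is a minimal sketch in Python. It is not the CoStoSys schema or API: the table names (`documents`, `my_subset`), the column names, and the helper functions `claim_batch` and `finish` are all assumptions made for illustration, and an in-memory SQLite database stands in for PostgreSQL so that the example runs without a server.

```python
import sqlite3

# In-memory SQLite is only a stand-in for PostgreSQL (assumption for runnability).
conn = sqlite3.connect(":memory:")

# Hypothetical layout: one table with the corpus data, one table per subset
# that references the documents and carries their processing state.
conn.executescript("""
CREATE TABLE documents (doc_id TEXT PRIMARY KEY, xml TEXT NOT NULL);
CREATE TABLE my_subset (
    doc_id        TEXT PRIMARY KEY REFERENCES documents(doc_id),
    is_in_process INTEGER NOT NULL DEFAULT 0,
    is_processed  INTEGER NOT NULL DEFAULT 0,
    has_errors    INTEGER NOT NULL DEFAULT 0
);
""")

# Five documents in the corpus, four of them in the subset.
docs = [(str(i), f"<doc id='{i}'/>") for i in range(1, 6)]
conn.executemany("INSERT INTO documents VALUES (?, ?)", docs)
conn.executemany("INSERT INTO my_subset (doc_id) VALUES (?)",
                 [(doc_id,) for doc_id, _ in docs[:4]])
conn.commit()


def claim_batch(conn, limit):
    """Mark up to `limit` untouched subset documents as in process; return their IDs."""
    with conn:  # commits on success, rolls back on an exception
        rows = conn.execute(
            "SELECT doc_id FROM my_subset "
            "WHERE is_in_process = 0 AND is_processed = 0 "
            "ORDER BY doc_id LIMIT ?",
            (limit,),
        ).fetchall()
        ids = [row[0] for row in rows]
        conn.executemany(
            "UPDATE my_subset SET is_in_process = 1 WHERE doc_id = ?",
            [(doc_id,) for doc_id in ids],
        )
    return ids


def finish(conn, doc_id, failed=False):
    """Record that a document is done, optionally flagging an error."""
    with conn:
        conn.execute(
            "UPDATE my_subset "
            "SET is_in_process = 0, is_processed = 1, has_errors = ? "
            "WHERE doc_id = ?",
            (int(failed), doc_id),
        )


first = claim_batch(conn, 2)   # one worker takes two documents
second = claim_batch(conn, 2)  # the next worker gets the remaining two
finish(conn, first[0])
finish(conn, first[1], failed=True)
remaining = conn.execute(
    "SELECT COUNT(*) FROM my_subset WHERE is_processed = 0"
).fetchone()[0]
```

The sketch uses a single connection, so it does not show real concurrency. With several workers on PostgreSQL, the select-and-mark step would additionally need row locking, for example `SELECT ... FOR UPDATE SKIP LOCKED`, so that two workers cannot claim the same document.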