Skip to content

Latest commit

 

History

History
93 lines (76 loc) · 4.66 KB

Development.md

File metadata and controls

93 lines (76 loc) · 4.66 KB

OpusTools

Tools for accessing and processing OPUS data.

  • opus-tools: general python library of OPUS-related tools as class library
  • opus-read: read parallel data sets and convert to different output formats
  • opus-cat: extract given OPUS document from release data
  • opus-api: web-service that provides information about OPUS data and download URLs
  • opus-get: download parallel corpora
  • opus-register: tool for updating info about OPUS resources (resource registry DB)

opus-get

  • download data files from OPUS (ObjectStorage)
  • requires some sort of database with info about available OPUS resources (resource registry DB)
  • allow to download several corpus files (from different corpora) - Do we need to restrict it somehow?
  • store statistics about downloads
  • open question: How do we handle regional language variants?
  • command-line options
    • -s (required)
    • -t (optional?)
    • -c (optional)
    • -v (optional, default = latest found in registry)
    • -f (optional, default = tokenized xml + sentalign-xml)
    • -o (optional, original language - requires meta-data!)
    • -l (optional - list resources only, no download)
    • -d (optional)

opus-register

  • maintain resource registery in some database
  • add, delete, modify entries
  • info about source/target language, size (translation units, nr of sentences, tokens), original language (if available, default = unknown), versions (which one is the latest)
  • import from .info files
  • should be able to query from outside (for opus-get), update only from inside with password ...

opus-read related task- and wish-list:

  • NEW opus-read and opus-cat download necessary files from OPUS if they don't find the source files they need for reading certain corpus data (requires access to OPUS registery DB via the online API)

  • NEW opus-cat can run recursively through an entire corpus for a selected language and reads all files (basically creates the monolingual files from a corpus, tokenized or untokenized)

  • NEW new filter for opus_read and opus_cat: exclude/include files that match a certain file path/name pattern (regex).

  • NEW filename filter as above but with a given list of file names to be read from a file (exclude or include)

  • a python library of data access functions

    • status: to do
    • reader and writer classes for data in different formats
    • installable package (pip package?)
  • OPUS reader

    • status: started
    • read OPUS corpora in their native XML formats (sentence-aligned)
    • obligatory parameters: corpus name, source and target language
    • otional parameters: release version, type (raw, xml, parsed), opus root-dir
  • filter on alignment type and link likelihood

    • status: started
    • extract certain alignment types (for example only one-to-one sentence alignments or even ranges like 1-2 sentences in source aligned to one sentence in target etc)
    • filter on alignment certainty (if the attribute is given in the sentence alignment file)
    • filter on other attributes (e.g. overlap in OpenSubtitles)
    • possible flag to output only the sentence alignments information after filtering (in XML format)
  • Moses writer

    • status: started
    • output in Moses format (2 files with aligned sentence on the same line)
    • skip empty alignments (source sentences or target sentences that are not aligned) - using flags like in filtering mode (see above)
  • TMX writer

    • status: to do
    • output in TMX format
    • possibly: even for more than 2 aligned languages? (that would require transitive expansion of sentence alignments - difficult task - should we define that as a separate task/feature?)
  • output with (token) annotation (factors)

    • extract token factors from tokenized corpora besides of the plain text (part-of-speech tags etc) if they exist
    • this only works with tokenized corpora
    • format like in factored Moses? Important: same number of factors for each token in that format! (possibility to change the factor delimiter symbol - default = '|'), example:
word1|POS1|lemma1 word2|POS2|lemma2
  • OPUS root can be changed

    • status: to do
    • OPUS root directory is hard-coded now - make it more flexible
    • option to change it
    • possibly work automatically on Abel (Norwegian cluster) without changing with the optional flag
  • read parallel corpora that are not packed into zip-files

    • status: to do
    • can read given sentence alignment file
    • checks whether aligned xml documents exist in the path before trying to find the zipped OPUS release files
  • simple cat function (opus-cat)

    • status: to do
    • read a file from a given corpus in a given language (extract from corresponding zip file)