Pre-processing: Process the input corpus for ELMo training. Call the script with either --pre_process or --gen_vocab.
usage: pre_process.py [-h] [--pre_process PRE_PROCESS] [--gen_vocab GEN_VOCAB]
                      [--train_prefix TRAIN_PREFIX] [--vocab_file VOCAB_FILE]
                      [--heldout_prefix HELDOUT_PREFIX] [--min_count MIN_COUNT]

optional arguments:
  -h, --help            show this help message and exit
  --pre_process PRE_PROCESS
                        The corpus to pre-process.
  --gen_vocab GEN_VOCAB
                        Only generate the vocabulary of this corpus, no training data.
  --train_prefix TRAIN_PREFIX
                        Prefix for train files.
  --vocab_file VOCAB_FILE
                        Vocabulary file.
  --heldout_prefix HELDOUT_PREFIX
                        The path and prefix for heldout files.
  --min_count MIN_COUNT
                        The minimum count for a vocabulary item.

Example calls:

python3 bin/pre-process.py --pre_process /home/data/corpora/corpus_a \
    --train_prefix '/home/data/pre/corpus_a_training/train.corpus_a*' \
    --heldout_prefix '/home/data/pre/corpus_a_heldout/heldout.corpus_a*' \
    --vocab_file /home/data/pre/corpus_a.100.vocab \
    --min_count 100

python3 bin/pre-process.py --gen_vocab /home/data/corpora/corpus_a \
    --vocab_file /home/data/pre/corpus_a.50.vocab \
    --min_count 50

--pre_process:    The corpus to pre-process. The corpus is split into 100 parts,
                  which are saved with the train_prefix. A heldout portion (1%) is
                  saved in a different directory and split into 50 parts there.
--train_prefix:   The file prefix for training files. '.$slice_number' is appended
                  to this name.
--heldout_prefix: The file prefix for heldout files. '.$heldout_slice_number' is
                  appended to this name. The first training slice is also saved
                  into the directory given by this prefix.
--vocab_file:     The file in which the vocabulary is stored.
--min_count:      The minimum number of occurrences a token needs to be included
                  in the vocabulary.
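For orientation, the sketch below illustrates the two modes described above: building a vocabulary with a minimum count and splitting a corpus into training and heldout slices. It is an illustrative sketch only, not the actual pre_process.py implementation. It assumes one sentence per line, whitespace tokenisation and the '<prefix>.<slice_number>' naming scheme, and it lists the special tokens <S>, </S> and <UNK> first because the bilm-tf trainer expects them in the vocabulary file.

# Illustrative sketch only -- assumptions: one sentence per line,
# whitespace tokenisation, '<prefix>.<slice_number>' output naming.
import random
from collections import Counter

def gen_vocab(corpus_path, vocab_file, min_count):
    """Write all tokens occurring at least min_count times, most frequent first."""
    counts = Counter()
    with open(corpus_path, encoding='utf-8') as corpus:
        for line in corpus:
            counts.update(line.split())
    with open(vocab_file, 'w', encoding='utf-8') as out:
        out.write('<S>\n</S>\n<UNK>\n')  # special tokens expected by bilm-tf
        for token, count in counts.most_common():
            if count < min_count:
                break  # most_common() is sorted, so all remaining tokens are rarer
            out.write(token + '\n')

def split_corpus(corpus_path, train_prefix, heldout_prefix,
                 n_train_slices=100, n_heldout_slices=50, heldout_fraction=0.01):
    """Send ~1% of the lines to the heldout slices, the rest to the training slices."""
    train = [open(f'{train_prefix}.{i}', 'w', encoding='utf-8')
             for i in range(n_train_slices)]
    heldout = [open(f'{heldout_prefix}.{i}', 'w', encoding='utf-8')
               for i in range(n_heldout_slices)]
    rng = random.Random(42)
    with open(corpus_path, encoding='utf-8') as corpus:
        for n, line in enumerate(corpus):
            if rng.random() < heldout_fraction:
                heldout[n % n_heldout_slices].write(line)
            else:
                train[n % n_train_slices].write(line)
    for f in train + heldout:
        f.close()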
Training: Train the biLM for ELMo embeddings on pre-processed data.
usage: train_elmo_n_gpus.py [-h] [--train_prefix TRAIN_PREFIX] [--save_dir SAVE_DIR]
                            [--vocab_file VOCAB_FILE] [--n_tokens N_TOKENS]
                            [--stats STATS] [--use_gpus USE_GPUS] [--epochs EPOCHS]
                            [--batchsize BATCHSIZE] [--pre_process PRE_PROCESS]
                            [--heldout_prefix HELDOUT_PREFIX] [--min_count MIN_COUNT]

optional arguments:
  -h, --help            show this help message and exit
  --train_prefix TRAIN_PREFIX
                        Prefix for train files.
  --save_dir SAVE_DIR   Location of checkpoint files.
  --vocab_file VOCAB_FILE
                        Vocabulary file.
  --n_tokens N_TOKENS   The number of tokens in the training files.
  --stats STATS         Use a .stat file for input data statistics, like the token count.
  --use_gpus USE_GPUS   The number of GPUs to use.
  --epochs EPOCHS       The number of epochs to run.
  --batchsize BATCHSIZE
                        The batch size for each GPU.

Example call:

python3 bin/train_elmo_n_gpus.py \
    --train_prefix '/home/public/stoeckel/data/Leipzig40MT2010_raw_training/train.Leipzig40MT2010_raw*' \
    --vocab_file /home/public/stoeckel/data/Leipzig40MT2010_lowered.100.vocab \
    --save_dir /home/public/stoeckel/models/biLM/Leipzig40MT2010_raw/ \
    --n_tokens 1093717542 \
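The values passed via --n_tokens, --batchsize, --use_gpus and --epochs determine how many optimisation steps the trainer runs. The sketch below shows that arithmetic under the assumption that the script wraps a bilm-tf style training loop with its default unroll length of 20 tokens per sequence; the batch size, GPU count and epoch count in the usage line are assumed values, not taken from the example call above.

# Back-of-the-envelope sketch (assumption: bilm-tf style training with a
# default unroll length of 20 tokens per sequence).
def n_training_batches(n_tokens, batchsize, use_gpus, epochs, unroll_steps=20):
    # Each step consumes `batchsize` sequences of `unroll_steps` tokens on every GPU.
    tokens_per_batch = batchsize * unroll_steps * use_gpus
    batches_per_epoch = n_tokens // tokens_per_batch
    return epochs * batches_per_epoch

# With the 1,093,717,542 tokens from the example call, a hypothetical batch size
# of 128, 1 GPU and 10 epochs, this comes to roughly 4.3 million batches.
print(n_training_batches(1093717542, 128, 1, 10))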