In the paper we consider:

  • different architectures for acoustic modeling:
    • ResNet
    • TDS
    • Transformer
  • different criteria:
    • Seq2Seq
    • CTC
  • different settings:
    • supervised LibriSpeech 1k hours
    • supervised LibriSpeech 1k hours + unsupervised LibriVox 57k hours (for LibriVox we generate pseudo-labels and use them as targets)
  • and different language models:
    • word-piece (ngram, ConvLM)
    • word-based (ngram, ConvLM, transformer)

Dependencies

Data preparation

Run the preparation of data and auxiliary files (lexicon, token set, etc.), setting the necessary paths in place of [...]: --data_dst is the path where the data will be stored, --model_dst is the path where the auxiliary files will be stored.

pip install sentencepiece==0.1.82
python3 ../../utilities/prepare_librispeech_wp_and_official_lexicon.py --data_dst [...] --model_dst [...] --nbest 10 --wp 10000
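
For example, the destination paths can be exported as environment variables so that the $MODEL_DST references below resolve; the paths here are illustrative placeholders, not the ones used in the paper:

# illustrative paths; point these at your own storage locations
export DATA_DST=/path/to/librispeech_data
export MODEL_DST=/path/to/auxiliary_files
python3 ../../utilities/prepare_librispeech_wp_and_official_lexicon.py --data_dst $DATA_DST --model_dst $MODEL_DST --nbest 10 --wp 10000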

Besides the data, the auxiliary files for acoustic and language model training/evaluation will be generated:

cd $MODEL_DST
tree -L 2
.
├── am
│   ├── librispeech-train-all-unigram-10000.model
│   ├── librispeech-train-all-unigram-10000.tokens
│   ├── librispeech-train-all-unigram-10000.vocab
│   ├── librispeech-train+dev-unigram-10000-nbest10.lexicon
│   ├── librispeech-train-unigram-10000-nbest10.lexicon
│   └── train.txt
└── decoder
    ├── 4-gram.arpa
    ├── 4-gram.arpa.lower
    └── decoder-unigram-10000-nbest10.lexicon

Instructions to reproduce training and decoding

  • To reproduce acoustic model training on LibriSpeech (1k hours) and beam-search decoding of these models, check the librispeech directory.
  • Details on pseudo-label preparation are in the lm_corpus_and_PL_generation directory (the raw LM corpus, which has no intersection with the LibriVox data, is prepared in raw_lm_corpus).
  • To reproduce acoustic model training on LibriSpeech 1k hours + unsupervised LibriVox data (with generated pseudo-labels) and beam-search decoding of these models, check the librivox directory.
  • Details on language model training can be found in the lm directory.
  • Beam dumps for the best models and beam rescoring can be found in the rescoring directory.
  • The analysis of disentangling acoustic and linguistic representations (TTS and segmentation experiments) is in lm_analysis.

Tokens and Lexicon sets

Lexicon | Tokens | Beam-search lexicon | WP tokenizer model

The tokens and lexicon files generated in $MODEL_DST/am/ and $MODEL_DST/decoder/ are the same as those in the table.
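
As an optional sanity check (a minimal sketch; the downloaded file path is a placeholder), the locally generated token set can be compared against the released one:

# placeholder path for the downloaded token set; adjust to where you saved it
diff $MODEL_DST/am/librispeech-train-all-unigram-10000.tokens /path/to/downloaded/librispeech-train-all-unigram-10000.tokens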

Pre-trained acoustic models

Below is information about the pre-trained acoustic models, which can be used, for example, to reproduce the decoding step.

Dataset | Acoustic model dev-clean | Acoustic model dev-other
LibriSpeech | Resnet CTC clean | Resnet CTC other
LibriSpeech + LibriVox | Resnet CTC clean | Resnet CTC other
LibriSpeech | TDS CTC clean | TDS CTC other
LibriSpeech + LibriVox | TDS CTC clean | TDS CTC other
LibriSpeech | Transformer CTC clean | Transformer CTC other
LibriSpeech + LibriVox | Transformer CTC clean | Transformer CTC other
LibriSpeech | Resnet S2S clean | Resnet S2S other
LibriSpeech + LibriVox | Resnet S2S clean | Resnet S2S other
LibriSpeech | TDS Seq2Seq clean | TDS Seq2Seq other
LibriSpeech + LibriVox | TDS Seq2Seq clean | TDS Seq2Seq other
LibriSpeech | Transformer Seq2Seq clean | Transformer Seq2Seq other
LibriSpeech + LibriVox | Transformer Seq2Seq clean | Transformer Seq2Seq other

Pre-trained language models

LM type | Language model | Vocabulary | Architecture | LM fairseq | Dict fairseq
ngram word | 4-gram | - | - | - | -
ngram wp | 6-gram | - | - | - | -
GCNN word | GCNN | vocabulary | Archfile | fairseq | fairseq dict
GCNN wp | GCNN | vocabulary | Archfile | fairseq | fairseq dict
Transformer | - | - | - | fairseq | fairseq dict

To reproduce the decoding step from the paper, download these models into $MODEL_DST/am/ and $MODEL_DST/decoder/, respectively.
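
For example (a sketch only; the actual download URLs are the ones linked in the tables above, and the placeholders below must be replaced with them):

wget -P $MODEL_DST/am/ "<acoustic model URL from the table above>"
wget -P $MODEL_DST/decoder/ "<language model URL from the table above>"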

Non-overlap LM corpus (the official LibriSpeech LM corpus with the LibriVox data excluded)

One can use the prepared corpus to train an LM for generating pseudo-labels on the LibriVox data: the raw corpus, the normalized corpus, and a 4-gram LM with a 200k vocabulary.
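
As a rough sketch of how such an n-gram model can be trained on the normalized corpus with KenLM (file names are illustrative; the exact commands and vocabulary restriction used in the paper are covered in the lm directory):

# illustrative file names; see the lm directory for the exact setup
lmplz -o 4 < lm_corpus_normalized.txt > 4gram_200k.arpa
build_binary 4gram_200k.arpa 4gram_200k.bin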

Generated pseudo-labels used in the paper

We have also open-sourced the generated pseudo-labels on which we trained our models: pl and pl with overlap. Make sure to fix the prefixes of the file names in the lists; right now they are set to /root/librivox.
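
For example, the prefix can be rewritten in place with sed (a minimal sketch; the list file name and target path are placeholders):

# replace the /root/librivox prefix with the actual location of your LibriVox audio
sed -i 's|/root/librivox|/path/to/your/librivox|g' librivox_pseudo_labels.lst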

Citation

@article{synnaeve2019end,
  title={End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures},
  author={Synnaeve, Gabriel and Xu, Qiantong and Kahn, Jacob and Grave, Edouard and Likhomanenko, Tatiana and Pratap, Vineel and Sriram, Anuroop and Liptchinsky, Vitaliy and Collobert, Ronan},
  journal={arXiv preprint arXiv:1911.08460},
  year={2019}
}