In the paper we are considering:
- different architectures for acoustic modeling:
  - ResNet
  - TDS
  - Transformer
- different training criteria:
  - Seq2Seq
  - CTC
- different settings:
  - supervised LibriSpeech 1k hours
  - supervised LibriSpeech 1k hours + unsupervised LibriVox 57k hours (for LibriVox we generate pseudo-labels and use them as targets)
- and different language models:
  - word-piece (n-gram, ConvLM)
  - word-based (n-gram, ConvLM, Transformer)
Dependencies:
- flashlight branch v0.2
- [wav2letter](https://github.com/facebookresearch/wav2letter/tree/v0.2) branch v0.2
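As a minimal sketch (not the official build instructions), the pinned branches can be checked out as follows; the flashlight repository location and its `v0.2` branch are assumptions based on the list above:

```bash
# Check out the pinned v0.2 branches (repo locations and branch names are assumptions).
git clone -b v0.2 https://github.com/facebookresearch/wav2letter.git
git clone -b v0.2 https://github.com/facebookresearch/flashlight.git
# Then follow each project's own build instructions for this branch.
```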
Run data and auxiliary files (like lexicon, token set, etc.) preparation. Set the necessary paths instead of `[...]`: `data_dst` is the path where the data will be stored, `model_dst` is the path where the auxiliary files will be stored.
pip install sentencepiece==0.1.82
python3 ../../utilities/prepare_librispeech_wp_and_official_lexicon.py --data_dst [...] --model_dst [...] --nbest 10 --wp 10000
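For example, with placeholder paths (`/data/librispeech` and `/data/models` below are hypothetical; use any writable locations):

```bash
# Placeholder paths; substitute your own locations for --data_dst and --model_dst.
export DATA_DST=/data/librispeech
export MODEL_DST=/data/models
python3 ../../utilities/prepare_librispeech_wp_and_official_lexicon.py \
  --data_dst $DATA_DST --model_dst $MODEL_DST --nbest 10 --wp 10000
```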
Besides the data, auxiliary files for acoustic and language model training/evaluation will be generated:
cd $MODEL_DST
tree -L 2
.
├── am
│   ├── librispeech-train-all-unigram-10000.model
│   ├── librispeech-train-all-unigram-10000.tokens
│   ├── librispeech-train-all-unigram-10000.vocab
│   ├── librispeech-train+dev-unigram-10000-nbest10.lexicon
│   ├── librispeech-train-unigram-10000-nbest10.lexicon
│   └── train.txt
└── decoder
    ├── 4-gram.arpa
    ├── 4-gram.arpa.lower
    └── decoder-unigram-10000-nbest10.lexicon
- To reproduce acoustic models training on LibriSpeech (1k hours) and beam-search decoding of these models, check the `librispeech` directory.
- Details on pseudo-labels preparation are in the `lm_corpus_and_PL_generation` directory (the raw LM corpus, which has no intersection with LibriVox data, is prepared in the `raw_lm_corpus` directory).
- To reproduce acoustic models training on LibriSpeech 1k hours + unsupervised LibriVox data (with generated pseudo-labels) and beam-search decoding of these models, check the `librivox` directory.
- Details on language models training can be found in the `lm` directory.
- Beam dump for the best models and beam rescoring can be found in the `rescoring` directory.
- Analysis of disentangling acoustic and linguistic representations (TTS and segmentation experiments) is in the `lm_analysis` directory.
Lexicon | Tokens | Beam-search lexicon | WP tokenizer model |
---|---|---|---|
Lexicon | Tokens | Beam-search lexicon | WP tokenizer model |
Tokens and lexicon files generated in `$MODEL_DST/am/` and `$MODEL_DST/decoder/` are the same as in the table above.
Below is info about the pre-trained acoustic models, which one can use, for example, to reproduce the decoding step.
Dataset | Acoustic model dev-clean | Acoustic model dev-other |
---|---|---|
LibriSpeech | Resnet CTC clean | Resnet CTC other |
LibriSpeech + LibriVox | Resnet CTC clean | Resnet CTC other |
LibriSpeech | TDS CTC clean | TDS CTC other |
LibriSpeech + LibriVox | TDS CTC clean | TDS CTC other |
LibriSpeech | Transformer CTC clean | Transformer CTC other |
LibriSpeech + LibriVox | Transformer CTC clean | Transformer CTC other |
LibriSpeech | Resnet S2S clean | Resnet S2S other |
LibriSpeech + LibriVox | Resnet S2S clean | Resnet S2S other |
LibriSpeech | TDS Seq2Seq clean | TDS Seq2Seq other |
LibriSpeech + LibriVox | TDS Seq2Seq clean | TDS Seq2Seq other |
LibriSpeech | Transformer Seq2Seq clean | Transformer Seq2Seq other |
LibriSpeech + LibriVox | Transformer Seq2Seq clean | Transformer Seq2Seq other |
LM type | Language model | Vocabulary | Architecture | LM Fairseq | Dict fairseq |
---|---|---|---|---|---|
ngram | word 4-gram | - | - | - | - |
ngram | wp 6-gram | - | - | - | - |
GCNN | word GCNN | vocabulary | Archfile | fairseq | fairseq dict |
GCNN | wp GCNN | vocabulary | Archfile | fairseq | fairseq dict |
Transformer | - | - | - | fairseq | fairseq dict |
To reproduce the decoding step from the paper, download these models into `$MODEL_DST/am/` and `$MODEL_DST/decoder/`, respectively.
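A minimal sketch of the download step, assuming `<ACOUSTIC_MODEL_URL>` and `<LM_URL>` are placeholders for the model links in the tables above:

```bash
# Placeholders: substitute the actual model links from the tables above.
wget -P $MODEL_DST/am/ <ACOUSTIC_MODEL_URL>
wget -P $MODEL_DST/decoder/ <LM_URL>
```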
One can use the prepared corpus to train the LM used to generate pseudo-labels on LibriVox data: raw corpus, normalized corpus, and a 4-gram LM with 200k vocabulary.
We also open-sourced the generated pseudo-labels on which we trained our models: pl and pl with overlap. **Make sure to fix the prefixes of the file names in the lists: right now the prefix is set to `/root/librivox`.**
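For example, the prefix can be rewritten in place with `sed`; `librivox-train.lst` below is a placeholder name for a downloaded pseudo-label list file, and the replacement path is whatever location holds your LibriVox audio:

```bash
# Replace the hard-coded /root/librivox prefix with your own LibriVox path
# (librivox-train.lst is a placeholder name for the downloaded list file).
sed -i 's|/root/librivox|/path/to/your/librivox|g' librivox-train.lst
```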
@article{synnaeve2019end,
title={End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures},
author={Synnaeve, Gabriel and Xu, Qiantong and Kahn, Jacob and Grave, Edouard and Likhomanenko, Tatiana and Pratap, Vineel and Sriram, Anuroop and Liptchinsky, Vitaliy and Collobert, Ronan},
journal={arXiv preprint arXiv:1911.08460},
year={2019}
}