WER for conformer update #124

Open
gandroz opened this issue Jan 22, 2021 · 52 comments
Labels: enhancement (New feature or request)

gandroz commented Jan 22, 2021

Hi,
I've just finished training a conformer with the SentencePiece featurizer on LibriSpeech for 50 epochs.
Here are the results if you want to update your readme:

dataset_config:
    train_paths:
      - /data/datasets/LibriSpeech/train-clean-100/transcripts.tsv
      - /data/datasets/LibriSpeech/train-clean-360/transcripts.tsv
      - /data/datasets/LibriSpeech/train-other-500/transcripts.tsv
    eval_paths:
      - /data/datasets/LibriSpeech/dev-clean/transcripts.tsv
      - /data/datasets/LibriSpeech/dev-other/transcripts.tsv
    test_paths:
      - /data/datasets/LibriSpeech/test-clean/transcripts.tsv

Test results:
G_WER = 5.22291565
G_CER = 1.9693377
B_WER = 5.19438553
B_CER = 1.95449066
BLM_WER = 100
BLM_CER = 100

The strange part is that I got the same metrics on the test-other dataset, hmmm...

nglehuy commented Jan 23, 2021

@gandroz Wow, cool! If you got the same result for test-other, you should check the transcript file to see whether it actually points to the test-other files, and check the test-clean transcript file too.
Anyway, I'm thinking the authors may have some tricks that reduce the WER to 2.7% that we didn't see.

nglehuy commented Jan 23, 2021

And one more thing: at this WER level there is only a very small difference between greedy and beam search, so we can ignore the difference and test only with greedy decoding to see if it gets near 2.7-3%, for faster results.

gandroz commented Jan 23, 2021

I'll continue training for several more epochs, since training does not seem to have converged yet. I'll also read the paper again to look for any clue on how to reduce the WER further.
But I don't have anything special in my transcripts; test-clean and test-other are properly separated.

nglehuy commented Jan 23, 2021

@gandroz You should check or regenerate the transcript file; maybe when creating the test-other transcript file you pointed to the test-clean directory.
If everything is right, then it's really weird haha 😆

nglehuy added the enhancement label Jan 23, 2021
gandroz commented Jan 23, 2021 via email

gandroz commented Jan 26, 2021

I found out why I always got the same test metrics: I tested on the test-clean dataset and it saved a test.tsv file, and each time I ran another test, since that file already existed, only the metrics were recomputed and no inference was done. I've removed that file and launched another test on the test-other dataset to continue the update.
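In other words, the behaviour amounts to a guard like the sketch below (a paraphrase of what the test script presumably does, with hypothetical helper names, not the actual repo code): delete the cached file and the inference step runs again.

import os

# Hypothetical sketch: if a previous results file exists, only the metrics are
# recomputed from it; delete the file to force fresh inference.
results_path = os.path.join(outdir, "test.tsv")   # outdir from running_config
if not os.path.exists(results_path):
    run_inference_and_write(results_path)         # hypothetical helper
compute_metrics(results_path)                     # hypothetical helper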

ncilfone commented:

@gandroz Can you post the full config file you're using to get the ~5% WER results?

Thanks!!!

gandroz commented Jan 29, 2021

@ncilfone sure!

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  output_path_prefix: /data/models/asr/conformer_sentencepiece_subword
  model_type: unigram
  target_vocab_size: 1024
  blank_at_zero: True
  beam_width: 5
  norm_score: True
  corpus_files:
    - /data/datasets/LibriSpeech/train-clean-100/transcripts.tsv
    - /data/datasets/LibriSpeech/train-clean-360/transcripts.tsv
    - /data/datasets/LibriSpeech/train-other-500/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid_concat
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0.1
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 1
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  joint_activation: tanh

learning_config:
  augmentations:
    after:
      time_masking:
        num_masks: 10
        mask_factor: 100
        p_upperbound: 0.05
      freq_masking:
        num_masks: 1
        mask_factor: 27

  dataset_config:
    train_paths:
      - /data/datasets/LibriSpeech/train-clean-100/transcripts.tsv
      - /data/datasets/LibriSpeech/train-clean-360/transcripts.tsv
      - /data/datasets/LibriSpeech/train-other-500/transcripts.tsv
    eval_paths:
      - /data/datasets/LibriSpeech/dev-clean/transcripts.tsv
      - /data/datasets/LibriSpeech/dev-other/transcripts.tsv
    test_paths:
      - /data/datasets/LibriSpeech/test-clean/transcripts.tsv
      - /data/datasets/LibriSpeech/test-other/transcripts.tsv
    tfrecords_dir: null

  optimizer_config:
    warmup_steps: 10000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    accumulation_steps: 4
    num_epochs: 50
    outdir: /data/models/asr/conformer_sentencepiece_subword
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
    checkpoint:
      filepath: /data/models/asr/conformer_sentencepiece_subword/checkpoints/{epoch:02d}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
    states_dir: /data/models/asr/conformer_sentencepiece_subword/states
    tensorboard:
      log_dir: /data/models/asr/conformer_sentencepiece_subword/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: 'epoch'
      profile_batch: 2

I used a SentencePiece (unigram) model as the vocab; I'm currently trying the BPE version.
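For reference, training such a unigram model with the sentencepiece Python package boils down to something like the sketch below (the plain-text transcript dump and file names are assumptions; the repo has its own generation script):

import sentencepiece as spm

# Minimal sketch: train a 1024-piece unigram model, matching the decoder_config above.
# "transcripts.txt" is an assumed plain-text file (one transcript per line)
# extracted from the TSV corpus files.
spm.SentencePieceTrainer.train(
    input="transcripts.txt",
    model_prefix="conformer_sp_unigram",
    vocab_size=1024,
    model_type="unigram",
)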

ncilfone commented Jan 29, 2021

Thanks @gandroz!

Is that the vocab here: vocabularies/librispeech_train_4_1030.subwords

Edit: Based on the config it seems like you might generate one before training?

Also is this just single GPU training?

gandroz commented Jan 29, 2021

No, it's not that vocab. However, you can train your own with script\generate_vocab_sentencepiece.py, passing it your config file.
And I'm training on two GTX 1080 Ti cards. It takes so long to train that I'm looking for a way to pre-compute the fbanks, since they are currently computed on the fly, which takes some time.

ncilfone commented:

Yeah just realized that you generate it based on the config options. Thanks for letting me know!

I'm assuming you are doing the featurization of the WAV files in TF as the stft etc. should be a bit faster on the GPU. DALI might be another place to look too although I've never used it...

ncilfone commented:

Final question, I promise... It looks like you are using <sos> and <eos> tokens in SentencePiece, but I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one?

gandroz commented Jan 29, 2021

I think the best way to accelerate processing is to pre-process the fbanks, just as it is done in fairseq.
For your information, featurization is done by the class tensorflow_asr\featurizers\speech_featurizers.py::TFSpeechFeaturizer.

I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one?

I'm not sure I fully understand your question. SentencePiece is an unsupervised text tokenizer and detokenizer, so you have to train a model on the transcripts from LibriSpeech. During training, tokenized transcripts are padded to the longest sentence in each batch.
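As a rough idea of what pre-computing could look like, here is a tf.signal sketch that writes log-mel features to .npy files so the data pipeline can load them instead of recomputing them on the fly (file names are placeholders, and preemphasis/normalization from speech_config are omitted; this approximates, but is not, TFSpeechFeaturizer):

import numpy as np
import tensorflow as tf

def log_mel_fbank(path, sample_rate=16000, frame_ms=25, stride_ms=10, n_mels=80):
    # Offline log-mel extraction with tf.signal, mirroring the speech_config values.
    audio = tf.io.read_file(path)
    wav, _ = tf.audio.decode_wav(audio, desired_channels=1)
    wav = tf.squeeze(wav, -1)
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_step = int(sample_rate * stride_ms / 1000)
    stft = tf.signal.stft(wav, frame_length=frame_len, frame_step=frame_step, fft_length=512)
    mel_w = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels, num_spectrogram_bins=stft.shape[-1], sample_rate=sample_rate)
    mel = tf.tensordot(tf.abs(stft) ** 2, mel_w, 1)
    return tf.math.log(mel + 1e-6).numpy()

# cache once, then load the .npy files in the input pipeline
np.save("sample_0001.npy", log_mel_fbank("sample_0001.wav"))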

ncilfone commented:

Ugh, forgot that markdown removed the notation I used... This is what I meant:

It looks like you are using <sos> and <eos> tokens in SentencePiece but I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one?

gandroz commented Jan 29, 2021

Oh I see. You are right, the transcripts do not have those tokens, and they are useless as far as I understand. However, you can add them when encoding text. You can find more details in the SentencePiece repo, and I've just realized that there is a TensorFlow binding... I think I'll try it instead of the Python implementation I used.
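For example, with recent versions of the sentencepiece Python package the ids can be added at encode time (the model file name here is just a placeholder):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="conformer_sp_unigram.model")
ids = sp.encode("hello world", out_type=int, add_bos=True, add_eos=True)
text = sp.decode(ids)  # control tokens are dropped again on decode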

tund commented Jan 30, 2021

Hi @gandroz,
Have you tested on the test-other set, and what is the result?
Thanks!

gandroz commented Jan 30, 2021

@tund not yet, it took me a week to test on test-clean and I have not had the time yet

tund commented Jan 30, 2021

Thanks for your reply @gandroz.
Since the performance using beam search is quite close to greedy search, I think only running greedy search will be much faster.
Another question: do you use gradient accumulation for training? I saw "accumulation_steps: 4" in the config file, but I'm not sure what exactly your training command is.

gandroz commented Jan 31, 2021

Indeed, I could just perform greedy search for this test. In the near future, perhaps...
And yes, I used gradient accumulation.

ncilfone commented Feb 1, 2021

@gandroz any chance you can post your loss curves?

gandroz commented Feb 2, 2021

sure
[two images: loss curves]

The glitches at the end are due to an infinite-loop bug that was corrected afterwards (evaluation occurred endlessly after training ended). I trained the model for 40 epochs first and continued for 10 more epochs.

mjurkus commented Feb 13, 2021

How are you able to achieve such good results with your models? I've trained a conformer subword model, but it stops improving after ~20 epochs.

I've updated the Keras trainer to use EarlyStopping, which stops the training process after 5 epochs without improvement in validation loss.

What am I missing?

Train data: 50hrs
Eval data: 7hrs
Using TF RNN Loss

Audio length distribution (not sure if relevant):

mean       2.646981
std        2.420535
min        0.100000
25%        0.900000
50%        1.570000
75%        4.030000
max       20.000000

The test results are complete rubbish:

G_WER = 114.837982
G_CER = 88.0064
B_WER = 100
B_CER = 100
BLM_WER = 100
BLM_CER = 100

config

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: vocabularies/lithuanian.subwords
  target_vocab_size: 4096
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /tf_asr/manifests/liepa.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid_concat
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: False
  prediction_projection_units: 0
  joint_dim: 320
  joint_activation: tanh

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      after:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /tf_asr/manifests/liepa_train.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-train
    shuffle: True
    cache: False
    buffer_size: 100
    drop_remainder: True

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/liepa_eval.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-eval
    shuffle: False
    cache: False
    buffer_size: 100
    drop_remainder: True

  test_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/liepa_test.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-test
    shuffle: False
    cache: False
    buffer_size: 100
    drop_remainder: True

  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    accumulation_steps: 4
    num_epochs: 20
    outdir: /tf_asr/models
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
    early_stopping:
      monitor: "val_val_rnnt_loss"
      mode: "min"
      patience: 5
      verbose: 1
    checkpoint:
      filepath: /tf_asr/models/checkpoints/epoch-{epoch:02d}-{val_val_rnnt_loss:.4f}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
      verbose: 1
      monitor: "val_val_rnnt_loss"
      mode: "min"
    states_dir: /tf_asr/models/states
    tensorboard:
      log_dir: /tf_asr/models/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: 'epoch'
      profile_batch: 2

nglehuy commented Feb 13, 2021

@mjurkus Could you show the loss curves?

gandroz commented Feb 13, 2021

@mjurkus my training was performed on the LibriSpeech data, 960h of audio for training. ASR needs lots of data to converge, so maybe you need more. Furthermore, maybe the LibriSpeech data is cleaner than yours? I also have some proprietary data, but it is way worse than LibriSpeech (not even the same sampling rate). But perhaps you could share your training curves?

mjurkus commented Feb 13, 2021

Yeah, the amount of data is the answer... That's what I thought.

Here are a couple:
Very clean, 16k data, 50hrs:
[image: train_rnnt_loss / val_val_rnnt_loss curves]

Mixed data: clean and noisy, 16k, 100hrs:
[image: train_rnnt_loss / val_val_rnnt_loss curves]

It's hard to get good labeled data for my language.

gandroz commented Feb 13, 2021

Your model does not seem to be learning anything... Try reducing your LR, and explore some data augmentation as it could help.

mjurkus commented Feb 14, 2021

Using the conformer with characters worked way better than using subwords. I managed to get decent results (WER ~15%), though I don't have the graphs for those.

Regarding augmentation: I figured that this config enables augmentation.

    augmentation_config:
      after:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27

jinggaizi commented:

I've just finished training with ESPnet using the same setup except join_dim=640; the WER results are test-clean: 4.9, test-other: 11.9. How can I get the results from the Conformer paper? @gandroz have you received any reply from the Conformer authors?

nglehuy commented Feb 18, 2021

I've just finished training with ESPnet using the same setup except join_dim=640; the WER results are test-clean: 4.9, test-other: 11.9. How can I get the results from the Conformer paper? @gandroz have you received any reply from the Conformer authors?

@jinggaizi What vocabulary size did you use: 1k, 4k, or English characters (around 28)?

jinggaizi commented:

1k

nglehuy pinned this issue Feb 18, 2021
gandroz commented Feb 18, 2021

@jinggaizi no, I have no news from the author. I could try to email him again, he's smart. However, I am surprised by the WER you achieved with ESPnet. They say they had much better results (though I suspect that was not with the small model, but anyway). Did you use the RNN-T or a transformer as the decoder? When ESPnet announced they had the same or better results as the paper, it was with a transformer, as you can see in their sources.

Maybe you could ask ESPnet how they manage to achieve such good results: on which machine, with which config, etc.

jinggaizi commented:

@usimarit thanks for your reply. My result used RNN-T as the decoder: the encoder is the small-size conformer, the decoder is 1 LSTM layer (dim=320), and the dimension of the joint network is 640. ESPnet (https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1) has no RNN-T result, and I suspect theirs is better because of speed augmentation.

jinggaizi commented:

@gandroz hi, do you have any news from the author? Do you train the model on GPU or TPU? Have you ever tried a larger batch size? I assume Google always uses a larger batch size. I have only worked on a Titan Xp with a small batch size; maybe a larger batch size can improve the transducer result.

ncilfone commented:

@jinggaizi I've run it with a batch size of 2048 (which is what I think they used in the original paper taken from this ref here http://arxiv.org/abs/2011.06110) via batch accumulation on 8 GPUs (with a joint dim of 320) for days and I can barely get below 5.9% on dev-clean.

jinggaizi commented Feb 23, 2021 via email

nglehuy commented Feb 23, 2021

@ncilfone batch accumulation is just to mimic the large batch size, I believe they use actual large batch size, which is way more efficient.

jinggaizi commented:

@ncilfone what version of GPU did you use with the 2048 batch size? Did you improve the RNN-T training following https://arxiv.org/pdf/1909.12415.pdf?

gandroz commented Feb 26, 2021

Just a follow-up with the author of the paper. I asked him for some clues to help figure out how we can achieve the same results: a question about the dataset and whether the model was pre-trained or not, and details on the hyperparameters not always mentioned in the paper.
He was kind enough to answer me, but without enough detail to help us a lot.
Here it is:

Re: training set. We use the Librispeech 960h train set as mentioned in our paper.

Re: batch sizes. What batch-size do you use and what's the WER do you see on Librispeech Dev/Devother/Test/Testother datasets? I think this can be one reason, I can actually run an experiment with the same small batch size as yours and update you with the result. We ran our experiments on a batch size of 2048 and trained till 90-100k steps. To evaluate, we sampled 5 ckpts and picked the best one based on the dev/devother performance. Let me know what settings do you use and I can train and report back to you with the results.

So maybe a major difference comes from the batch size, which is... HUGE! I really don't know how they manage to train the large (or even the small) model with so much data. Maybe an avenue could be to split the model over multiple GPUs (model parallelism) instead of replicating the model on multiple GPUs. We could surely increase the batch size by doing so.
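For scale: with batch_size: 2, accumulation_steps: 4 and two GPUs, the effective batch per optimizer update is at most 2 × 4 × 2 = 16 utterances (assuming batch_size is per replica), versus the 2048 they report, i.e. roughly 128× smaller.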

nglehuy commented Feb 26, 2021

Thanks @gandroz, they have their HUGE TPUs, that's why they're able to get SOTA results. I'll try to implement gradient accumulation in the Keras built-in training loop and test on Colab TPUs; hopefully it will get nearer to their result.
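For what it's worth, a minimal custom-loop sketch of gradient accumulation looks like the following (a generic TensorFlow illustration, not the repo's implementation; it assumes the model is already built and that every variable receives a gradient):

import tensorflow as tf

def train_with_accumulation(model, optimizer, loss_fn, dataset, accum_steps=4):
    # Average gradients over `accum_steps` micro-batches before applying them,
    # to mimic a larger effective batch size.
    accum = [tf.Variable(tf.zeros_like(v), trainable=False)
             for v in model.trainable_variables]

    @tf.function
    def micro_step(x, y):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        for acc, g in zip(accum, grads):
            acc.assign_add(g / accum_steps)
        return loss

    @tf.function
    def apply_and_reset():
        optimizer.apply_gradients(zip(accum, model.trainable_variables))
        for acc in accum:
            acc.assign(tf.zeros_like(acc))

    for i, (x, y) in enumerate(dataset):
        micro_step(x, y)
        if (i + 1) % accum_steps == 0:
            apply_and_reset()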

MadhuAtBerkeley commented Feb 26, 2021

Hi @usimarit, I see a high-bias issue: the rnnt_loss stays in the 240s and does not go down further in the conformer trainer (both the Keras and non-Keras versions). I tried learning rates of 0.5/sqrt(dmodel), 0.05/sqrt(dmodel) and 0.005/sqrt(dmodel) with the 960 hours of LibriSpeech, and there is not much difference in the loss curve. Please let me know if I need to modify anything in the config file to train a model that matches the WER performance of the reference latest.h5 (WER of 6.5 in my testing). Thanks
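(Those factors are the peak scale of the usual transformer/Noam schedule; below is a sketch of what the optimizer_config above presumably corresponds to, written as a Keras LearningRateSchedule. The exact scaling used by the repo may differ.)

import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    # lr(step) = scale * dmodel^-0.5 * min(step^-0.5, step * warmup^-1.5)
    def __init__(self, dmodel=144, warmup_steps=10000, scale=1.0):
        super().__init__()
        self.dmodel = tf.cast(dmodel, tf.float32)
        self.warmup_steps = tf.cast(warmup_steps, tf.float32)
        self.scale = scale

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return self.scale * tf.math.rsqrt(self.dmodel) * tf.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5)

optimizer = tf.keras.optimizers.Adam(
    TransformerSchedule(dmodel=144, warmup_steps=10000),
    beta_1=0.9, beta_2=0.98, epsilon=1e-9)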

nglehuy commented Feb 28, 2021

@MadhuAtBerkeley I trained with that config on Google Drive, except that I used use_tf: False (the config on Drive is not updated to the latest version, but it still has the same meaning).

MadhuAtBerkeley commented:

@usimarit Thanks! I confirm that use_tf: False does help, and now I see the loss curve going below 100.

thanatl commented Mar 3, 2021

Why does setting use_tf to False help the training, when both the TF version and the numpy version perform a similar method?

BuaaAlban commented:

I've just finished training with ESPnet using the same setup except join_dim=640; the WER results are test-clean: 4.9, test-other: 11.9. How can I get the results from the Conformer paper? @gandroz have you received any reply from the Conformer authors?

Hi, could you please post your config in espnet?

nglehuy commented Mar 11, 2021

Why does setting use_tf to False help the training, when both the TF version and the numpy version perform a similar method?

The only difference is that the numpy version uses nlpaug, which randomly chooses between time masking and freq masking for augmentation, whereas the tf version applies both time and freq masking.
The tf version works fine for me on TPUs.
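For reference, applying both masks on a [time, freq] log-mel spectrogram can be sketched with plain TF ops like this (a generic SpecAugment-style illustration keyed to the config values above, not the repo's exact augmentation code):

import tensorflow as tf

def mask_both(mel, num_time_masks=10, time_mask_size=100, p_upperbound=0.05,
              num_freq_masks=1, freq_mask_size=27):
    # mel: [T, F] log-mel spectrogram
    T, F = tf.shape(mel)[0], tf.shape(mel)[1]
    for _ in range(num_freq_masks):
        f = tf.random.uniform([], 0, freq_mask_size + 1, dtype=tf.int32)
        f0 = tf.random.uniform([], 0, tf.maximum(F - f, 1), dtype=tf.int32)
        keep = tf.concat([tf.ones([f0], mel.dtype), tf.zeros([f], mel.dtype),
                          tf.ones([F - f0 - f], mel.dtype)], 0)
        mel = mel * keep[tf.newaxis, :]
    # each time mask is capped at p_upperbound * T frames
    max_t = tf.cast(tf.cast(T, tf.float32) * p_upperbound, tf.int32)
    for _ in range(num_time_masks):
        t = tf.random.uniform([], 0, tf.minimum(time_mask_size, max_t) + 1, dtype=tf.int32)
        t0 = tf.random.uniform([], 0, tf.maximum(T - t, 1), dtype=tf.int32)
        keep = tf.concat([tf.ones([t0], mel.dtype), tf.zeros([t], mel.dtype),
                          tf.ones([T - t0 - t], mel.dtype)], 0)
        mel = mel * keep[:, tf.newaxis]
    return mel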

jinggaizi commented Mar 12, 2021

I've just finished training with ESPnet using the same setup except join_dim=640; the WER results are test-clean: 4.9, test-other: 11.9. How can I get the results from the Conformer paper? @gandroz have you received any reply from the Conformer authors?

Hi, could you please post your config in espnet?

batch-size: 6
maxlen-in: 800
maxlen-out: 150

criterion: loss
early-stop-criterion: "validation/main/loss"
sortagrad: 0
opt: noam
epochs: 50
patience: 0
accum-grad: 4
grad-clip: 5.0

etype: transformer
enc-block-arch:
  - type: conformer
    d_hidden: 144
    d_ff: 576
    heads: 4
    macaron_style: True
    use_conv_mod: True
    conv_mod_kernel: 31
    dropout-rate: 0.1
    pos-dropout-rate: 0.1
enc-block-repeat: 16
dtype: lstm
dlayers: 1
dec-embed-dim: 320
dunits: 320
trans-type: warp-rnnt
joint-dim: 640

transformer-lr: 10
transformer-warmup-steps: 25000

transformer-enc-positional-encoding-type: rel_pos
transformer-enc-self-attn-type: rel_self_attn

rnnt-mode: 'rnnt' # switch to 'rnnt-att' to use transducer with attention
model-module: "espnet.nets.pytorch_backend.e2e_asr_transducer:E2E"

jinggaizi commented:

@gandroz hi, have you had any response from the author about running some experiments with a small batch size? Did you try any other methods to improve the result?

gandroz commented Mar 23, 2021

@jinggaizi No, no news from the author; I'll let you know as soon as I have some. I cannot work on the project at the moment, so no news from me either.

AgaDob commented Apr 16, 2021

No, it's not that vocab. However, you can train your own with script\generate_vocab_sentencepiece.py, passing it your config file.
And I'm training on two GTX 1080 Ti cards. It takes so long to train that I'm looking for a way to pre-compute the fbanks, since they are currently computed on the fly, which takes some time.

Hey, thanks for the updated config! Any rough estimates of how long it took to train (I'm guessing a few days at least)? Also, any luck with pre-computing fbanks?

changji95 commented:

@ncilfone batch accumulation is just to mimic the large batch size, I believe they use actual large batch size, which is way more efficient.

Hello, is gradient accumulation not supported in the latest version (v1.0.0)?

nglehuy commented May 13, 2021

@changji-ustc I haven't supported it in the Keras training loop yet; I'm working on it.

nglehuy unpinned this issue May 10, 2022
nglehuy pinned this issue May 10, 2022
gcervantes8 commented:

@usimarit Have you been able to get a better WER with conformer? I see a lot of changes in the word piece branch.

With mixed precision and batch size 16 (effective batch size 96), the best LibriSpeech WER I've gotten is 6.4%.

With a medium conformer model with mixed precision and batch size 12 (effective batch size 72), the best WER I've gotten is 4.6% (warmup steps 40k, with only LibriSpeech).
Using a transformer language model, I'm only able to lower the WER by 0.15% on test-clean.
