ConST: Cross-modal Contrastive Learning for Speech Translation

This is an implementation of the NAACL 2022 paper "Cross-modal Contrastive Learning for Speech Translation" (read the paper here). The implementation is based on the fairseq codebase.

CONTRIBUTION: You are more than welcome to test our code on your own machines and report feedback on results, bugs, and performance!

👀 Overview

The motivation of our ConST model is to learn similar representations for semantically similar speech and text.

ConST (1) inherits the advantages of multi-task learning (as shown in our previous paper, XSTNet (with code)), while (2) employing a contrastive learning approach to bridge the gap between low-level speech representations and text embeddings.
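The contrastive idea in (2) can be illustrated with a generic InfoNCE-style loss. The sketch below is an assumption for illustration, not the loss as implemented in this repository: it uses cosine similarity, a hypothetical temperature of 0.1, in-batch negatives, and plain Python lists standing in for pooled speech and text representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(speech_vecs, text_vecs, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    speech_vecs[i] and text_vecs[i] form the positive pair; every other
    text vector in the batch serves as a negative (an assumption here).
    """
    losses = []
    for i, speech in enumerate(speech_vecs):
        logits = [cosine(speech, text) / temperature for text in text_vecs]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each speech vector is close to the text vector at the same index.
speech = [[1.0, 0.1], [0.0, 1.0]]
aligned_text = [[0.9, 0.0], [0.1, 1.1]]
swapped_text = [aligned_text[1], aligned_text[0]]

loss_aligned = info_nce(speech, aligned_text)
loss_swapped = info_nce(speech, swapped_text)
print(loss_aligned < loss_swapped)
```

The loss is small when matching speech and text vectors point in the same direction and large when the pairing is swapped, which is the behaviour a contrastive objective is meant to reward.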

Results on the MuST-C En-X dataset

We report case-sensitive detokenized BLEU, computed with the sacrebleu toolkit.

Model	En-De	En-Es	En-Fr	En-It	En-Nl	En-Pt	En-Ro	En-Ru	Avg.
ConST-base	25.7	30.4	36.8	26.3	30.6	32.0	24.8	17.3	28.0
ConST-expand	28.3	32.0	38.3	27.2	31.7	33.1	25.6	18.9	29.4
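
The Avg. column is the unweighted mean of the eight per-language BLEU scores, rounded to one decimal place. A quick check, using only the numbers from the table above:

```python
# Per-language BLEU scores copied from the table (En-De ... En-Ru).
scores = {
    "ConST-base": [25.7, 30.4, 36.8, 26.3, 30.6, 32.0, 24.8, 17.3],
    "ConST-expand": [28.3, 32.0, 38.3, 27.2, 31.7, 33.1, 25.6, 18.9],
}

# Unweighted mean over the eight language pairs, rounded to one decimal.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}
print(averages)  # {'ConST-base': 28.0, 'ConST-expand': 29.4}
```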

🤗 Huggingface Space Demo available now!

Experience our end-to-end voice translation system on Huggingface Space now! Record a sentence in English and translate it into other languages! You are a TRANSLATOR!

HERE IS THE WEBSITE:

https://huggingface.co/spaces/ReneeYe/ConST-speech2text-translator

P.S. Since Huggingface Space only provides CPUs, it takes 12-20 seconds to run inference and generate the translation result.

⬇️ Download Trained Models

The models are trained with PyTorch. You may download all the models from the 🤗 Huggingface model hub.

Dataset	Model	SPM & Vocab
En-De	Download	SPM model; Vocab
En-Es	Download	SPM model; Vocab
En-Fr	Download	SPM model; Vocab
En-It	Download	SPM model; Vocab
En-Nl	Download	SPM model; Vocab
En-Pt	Download	SPM model; Vocab
En-Ro	Download	SPM model; Vocab
En-Ru	Download	SPM model; Vocab

Training & Generation Instructions

⚙️ Requirements and Installation

PyTorch version >= 1.5.0
Python version >= 3.6
For training new models, you'll also need an NVIDIA GPU and NCCL

git clone git@github.com:ReneeYe/ConST.git
cd ConST
pip3 install -r requirements.txt
pip3 install --editable ./
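
After installation, the version requirements listed above can be checked from Python. This is only a convenience sketch: the helpers `version_tuple` and `meets_minimum` are hypothetical names and are not part of this repository or of PyTorch. Versions are compared as integer tuples because a plain string comparison would wrongly rank "1.10.0" below "1.5.0".

```python
import re
import sys


def version_tuple(version):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1)."""
    core = str(version).split("+")[0]  # drop local build suffixes
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '1.5' and '1.5.0' compare as equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


python_ok = sys.version_info >= (3, 6)

try:
    import torch
    torch_ok = meets_minimum(torch.__version__, "1.5.0")
except ImportError:
    torch_ok = False  # PyTorch is not installed in this environment

print("Python >= 3.6:", python_ok)
print("PyTorch >= 1.5.0:", torch_ok)
```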

📉 Pre-processing and Training

The instructions for data pre-processing are here. To train the model, taking En-De as an example, you may run:

bash ConST/scripts/train_en2x.sh de checkpoint/model_saved

🤖️ Inference, Generation and Evaluation

We strongly recommend averaging the checkpoints once you have identified the best checkpoint, i.e. the one with the highest BLEU on the dev set.

python3 ConST/scripts/average_checkpoints.py --inputs checkpoint/model_saved \
--num-update-checkpoints 10 --checkpoint-upper-bound ${step-to-get-the-best-dev} \
--output ${path-to-averaged-ckpt}
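
Checkpoint averaging takes the element-wise mean of each parameter across several saved checkpoints. The following is a minimal sketch of that idea on plain Python lists. It is not the code in average_checkpoints.py, which operates on PyTorch checkpoint files, and the parameter names are made up for illustration.

```python
def average_params(checkpoints):
    """Element-wise mean of parameters across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats; all checkpoints are assumed to share the same keys and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(column) / count for column in columns]
    return averaged


# Two toy "checkpoints" with hypothetical parameter names.
ckpts = [
    {"encoder.weight": [1.0, 2.0], "decoder.bias": [0.0]},
    {"encoder.weight": [3.0, 4.0], "decoder.bias": [1.0]},
]
avg = average_params(ckpts)
print(avg)  # {'encoder.weight': [2.0, 3.0], 'decoder.bias': [0.5]}
```

In the command above, --num-update-checkpoints 10 combined with --checkpoint-upper-bound is expected to restrict the average to the 10 most recent update-based checkpoints at or before the given step, so the chosen step should be the one that gave the best dev BLEU.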

Then generate and evaluate your model.

fairseq-generate data/ --gen-subset tst-COMMON_st --task speech_to_text --prefix-size 1 \
--max-tokens 4000000 --max-source-positions 4000000 --beam 10 \
--config-yaml config_st.yaml --path ${path-to-averaged-ckpt} \
--scoring sacrebleu

✏️ Citation

@InProceedings{ye2022cross,
  author    = {Rong Ye and Mingxuan Wang and Lei Li},
  booktitle = {Proc. of NAACL},
  title     = {Cross-modal Contrastive Learning for Speech Translation},
  year      = {2022}
}
