About

This repository provides data and code for the paper "CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription".

The collected transcriptions are stored in data/*-crowd.tsv, and the ground-truth transcriptions are stored in data/*-gt.txt. We also provide code for the annotation process and speech synthesis in the annotation and speech_sythesis folders, respectively.

Citation

Pavlichenko N., Stelmakh I., and Ustalov D. CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. 2021. arXiv: 2107.01091 [cs.SD].

@inproceedings{CrowdSpeech,
  author    = {Pavlichenko, Nikita and Stelmakh, Ivan and Ustalov, Dmitry},
  title     = {{CrowdSpeech and Vox~DIY: Benchmark Dataset for Crowdsourced Audio Transcription}},
  year      = {2021},
  booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
  eprint    = {2107.01091},
  eprinttype = {arxiv},
  eprintclass = {cs.SD},
  url       = {https://openreview.net/forum?id=3_hgF1NAXU7},
  language  = {english},
  pubstate  = {forthcoming},
}

Data

The CrowdSpeech and VoxDIY datasets are stored in the data folder. Each dataset is associated with two files: <dataset>-<split>-crowd.tsv and <dataset>-<split>-gt.txt. The first file contains three columns: INPUT:audio (an audio file given to crowd workers), OUTPUT:transcription (the worker's transcription), and ASSIGNMENT:worker_id (a unique worker identifier). The second file contains two tab-separated columns without a header: an audio file and its ground-truth transcription.
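
The two file layouts described above can be read with the Python standard library alone. The following is a minimal sketch, not part of the repository: the function names `read_crowd` and `read_ground_truth` and the sample rows are hypothetical, and it assumes the crowd file is tab-separated with a header row containing the three column names listed above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical file contents that follow the layout described above.
CROWD_TSV = (
    "INPUT:audio\tOUTPUT:transcription\tASSIGNMENT:worker_id\n"
    "a.wav\thello world\tw1\n"
    "a.wav\thello word\tw2\n"
    "b.wav\tgood morning\tw1\n"
)
GT_TXT = "a.wav\thello world\nb.wav\tgood morning\n"


def read_crowd(handle):
    """Group (worker_id, transcription) pairs by audio file."""
    by_audio = defaultdict(list)
    # QUOTE_NONE keeps quote characters inside transcriptions untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        by_audio[row["INPUT:audio"]].append(
            (row["ASSIGNMENT:worker_id"], row["OUTPUT:transcription"])
        )
    return dict(by_audio)


def read_ground_truth(handle):
    """Map each audio file to its ground-truth transcription."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Rows that do not have exactly two fields (e.g. blank lines) are skipped.
    return {row[0]: row[1] for row in reader if len(row) == 2}


crowd = read_crowd(io.StringIO(CROWD_TSV))
ground_truth = read_ground_truth(io.StringIO(GT_TXT))
print(crowd["a.wav"])
print(ground_truth["b.wav"])
```

To read real files, replace the `io.StringIO` objects with `open(path, newline="", encoding="utf-8")` handles.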

You can also download the CrowdSpeech dataset from HuggingFace.

Evaluation

First, you may need to install some dependencies:

pip3 install crowd-kit toloka-kit jiwer

Then, you can evaluate all our baseline aggregation methods with a single command:

python3 baselines.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv

To get the Oracle result, run:

python3 oracle.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv

You can also compute the Inter-Rater Agreement by running:

python3 agreement.py data/<dataset>-crowd.tsv
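
The scripts above compare transcriptions against the ground truth. The jiwer dependency suggests that the metric is word error rate (WER), but that is an assumption here, and the scripts may normalize text or average over the corpus differently. As a rough illustration of the metric itself, the sketch below computes WER for a single pair of strings in pure Python; the function name `word_error_rate` is hypothetical and not taken from the repository.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word is missing
            insertion = curr[j - 1] + 1  # extra hypothesis word
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("hello world", "hello world"))  # 0.0
print(word_error_rate("hello world", "hello"))        # 0.5
```

Lower values are better, and the value can exceed 1.0 when the hypothesis contains many extra words.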

VoxDIY

We provide an IPython notebook with the code for the VoxDIY data collection process. For quality control, we use a special class, TaskProcessor, that fetches all submissions that have not yet been accepted or rejected, calculates workers' skills, and decides whether each submission should be accepted or rejected.

T5 Model

Our data is also available on the HuggingFace Hub, along with a T5 model trained on the train-clean, dev-clean, and dev-other parts of CrowdSpeech.

The following snippet shows an example of model inference:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

mname = "toloka/t5-large-for-text-aggregation"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# Noisy candidate transcriptions, separated by " | "
text = "samplee text | sampl text | sample textt"
input_ids = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # sample text
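
The model input in the snippet above is a single string of candidate transcriptions joined by " | ". A small helper can build that string from a list of worker transcriptions. This is a sketch under the assumption that the separator shown in the example is the only formatting the model expects; the name `build_aggregation_input` is hypothetical.

```python
def build_aggregation_input(transcriptions, separator=" | "):
    """Join candidate transcriptions into one model input string."""
    # Drop empty entries and surrounding whitespace before joining.
    cleaned = [t.strip() for t in transcriptions if t.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty transcription is required")
    return separator.join(cleaned)


candidates = ["samplee text", "sampl text", "sample textt"]
model_input = build_aggregation_input(candidates)
print(model_input)  # samplee text | sampl text | sample textt
```

The resulting string can be passed to `tokenizer.encode` in place of the hard-coded example.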

License

Code

Data

Acknowledgements

The LibriSpeech dataset is used under the Creative Commons Attribution 4.0 license.

The CrowdWSA2019 dataset is used under the Creative Commons Attribution 4.0 license.
