Skip to content

Toloka/CrowdSpeech

Repository files navigation

About

This repository provides data and code for "CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription" paper.

The collected transcriptions stored in data/*-crowd.tsv, ground-truth transcriptions stored in data/*-gt.txt. We also provide a code for the annotation process and speech synthesis in annotation and speech_sythesis folders, respectively.

Citation

@inproceedings{CrowdSpeech,
  author    = {Pavlichenko, Nikita and Stelmakh, Ivan and Ustalov, Dmitry},
  title     = {{CrowdSpeech and Vox~DIY: Benchmark Dataset for Crowdsourced Audio Transcription}},
  year      = {2021},
  booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
  eprint    = {2107.01091},
  eprinttype = {arxiv},
  eprintclass = {cs.SD},
  url       = {https://openreview.net/forum?id=3_hgF1NAXU7},
  language  = {english},
  pubstate  = {forthcoming},
}

Data

CrowdSpeech and VoxDIY datasets stored in the data folder. Each dataset is associated with two filed: <dataset>-<split>-crowd.tsv and <dataset>-<split>-gt.txt. The first one contains three columns INPUT:audio — an audio file given to crowd workers, OUTPUT:transcription — worker's transcription and ASSIGNMENT:worker_id — a unique worker identifier. The second file contains two tab-separated columns without header: an audio file and the ground-truth transcription.

You can also download the CrowdSpeech dataset from HuggingFace.

Evaluation

First, you may need to install some dependencies:

pip3 install crowd-kit toloka-kit jiwer

Then, you can easily evaluate all our baseline aggregation methods by a single command:

python3 baselines.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv

In order to get the Oracle result, run

python3 oracle.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv

You can also get the Inter-Rater Agreement by running

python3 agreement.py data/<dataset>-crowd.tsv

VoxDIY

You can find an IPython notebook with a code for the data collection process for the VoxDIY. For the quality control, we use a special class, TaskProcessor, that gets all the submits that are not accepted or rejected at the moment, calculates workers' skills, and checks if a submit should be accepted or rejected.

T5 Model

Our data is also available at HuggingFace Hub as well as the T5 model trained on train-clean, dev-clean and dev-other parts of CrowdSpeech.

This snippet shows the example of the model's inference:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoConfig
mname = "toloka/t5-large-for-text-aggregation"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

input = "samplee text | sampl text | sample textt"
input_ids = tokenizer.encode(input, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # sample text

License

Code

© YANDEX LLC, 2021. Licensed under the Apache License, Version 2.0. See LICENSE file for more details.

Data

© YANDEX LLC, 2021. Licensed under the Creative Commons Attribution 4.0 license. See data/LICENSE file for more details.

Acknowledgements

LibriSpeech dataset is used under the Creative Commons Attribution 4.0 license.

CrowdWSA2019 dataset is used under the Creative Commons Attribution 4.0 license.