Yu Zhang*, Changhao Pan*, Wenxiang Guo*, Ruiqi Li, Zhiyuan Zhu, Jialei Wang, Wenhao Xu, Jingyu Lu, Zhiqing Hong, Chuxin Wang, LiChao Zhang, Jinzheng He, Ziyue Jiang, Yuxin Chen, Chen Yang, Jiecheng Zhou, Xinyu Cheng, Zhou Zhao | Zhejiang University
Dataset and code of GTSinger (NeurIPS 2024 Spotlight): A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks.
We introduce GTSinger, a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks.
We provide the corpus, the processing code for our dataset, and the implementation of our benchmarks in this repository.
You can also visit our Demo Page for audio samples from our dataset and the results of our benchmarks.
- 2024.09: We released the full dataset of GTSinger!
- 2024.09: GTSinger is accepted by NeurIPS 2024 (Spotlight)!
- 2024.05: We released the code of GTSinger!
✅ Release the code.
✅ Release the full dataset.
✅ Release the processed data of Chinese, English, Spanish, German, Russian.
✅ Refine the paired speech data of each language.
✅ Refine Chinese, Spanish, German, Russian annotations.
🔲 Further refine English, French, Japanese, Korean, Italian annotations (planned to be completed by February 2025).
🔲 Release the remaining processed data (planned to be completed by February 2025).
- 80.59 hours of singing voices in GTSinger are recorded in professional studios by skilled singers, ensuring high quality and clarity and making it the largest recorded singing dataset.
- Contributed by 20 singers across nine widely spoken languages (Chinese, English, Japanese, Korean, Russian, Spanish, French, German, and Italian) and all four vocal ranges, GTSinger enables zero-shot SVS and style transfer models to learn diverse timbres and styles.
- GTSinger provides controlled comparison and phoneme-level annotations of six singing techniques (mixed voice, falsetto, breathy, pharyngeal, vibrato, and glissando) for songs, thereby facilitating singing technique modeling, recognition, and control.
- Unlike fine-grained music scores, GTSinger features realistic music scores with regular note durations, helping singing models learn and adapt to real-world musical composition.
- The dataset includes manual phoneme-to-audio alignments, global style labels (singing method, emotion, range, and pace), and 16.16 hours of paired speech, ensuring comprehensive annotations and broad task suitability.
Click to access our full dataset (audio along with TextGrid, JSON, and MusicXML files) and processed data (metadata.json, phone_set.json, spker_set.json) on Hugging Face for free! We hope our data is helpful for your research.
We also provide our dataset on .
Please note that by using GTSinger, you accept the terms of its license.
Our dataset is organized hierarchically.
It contains nine top-level folders, each corresponding to a distinct language.
Within each language folder are five sub-folders, each representing a specific singing technique.
These technique folders contain numerous song entries, with each song further divided into several controlled comparison groups: a control group (natural singing without the specific technique) and a technique group (densely employing the specific technique).
Our singing voices and speech are recorded in WAV format at a 48 kHz sampling rate with 24-bit resolution.
Alignments and annotations are provided in TextGrid files, including word boundaries, phoneme boundaries, phoneme-level annotations for six techniques, and global style labels (singing method, emotion, pace, and range).
We also provide realistic music scores in MusicXML format.
Notably, we provide an additional JSON file for each singing voice, facilitating data parsing and processing for singing models.
Here is the data structure of our dataset:
```
.
├── Chinese
│   ├── ZH-Alto-1
│   └── ZH-Tenor-1
├── English
│   ├── EN-Alto-1
│   │   ├── Breathy
│   │   ├── Glissando
│   │   │   └── my love
│   │   │       ├── Control_Group
│   │   │       ├── Glissando_Group
│   │   │       └── Paired_Speech_Group
│   │   ├── Mixed_Voice_and_Falsetto
│   │   ├── Pharyngeal
│   │   └── Vibrato
│   ├── EN-Alto-2
│   │   ├── Breathy
│   │   ├── Glissando
│   │   ├── Mixed_Voice_and_Falsetto
│   │   ├── Pharyngeal
│   │   └── Vibrato
│   └── EN-Tenor-1
│       ├── Breathy
│       ├── Glissando
│       ├── Mixed_Voice_and_Falsetto
│       ├── Pharyngeal
│       └── Vibrato
├── French
│   ├── FR-Soprano-1
│   └── FR-Tenor-1
├── German
│   ├── DE-Soprano-1
│   └── DE-Tenor-1
├── Italian
│   ├── IT-Bass-1
│   ├── IT-Bass-2
│   └── IT-Soprano-1
├── Japanese
│   ├── JA-Soprano-1
│   └── JA-Tenor-1
├── Korean
│   ├── KO-Soprano-1
│   ├── KO-Soprano-2
│   └── KO-Tenor-1
├── Russian
│   └── RU-Alto-1
└── Spanish
    ├── ES-Bass-1
    └── ES-Soprano-1
```
The code for processing the dataset is provided in the `./Data-Process` folder.
A suitable conda environment named `gt_dataprocess` can be created and activated with:

```bash
conda create -n gt_dataprocess python=3.8 -y
conda activate gt_dataprocess
pip install -r requirements.txt
```
The code for checking the dataset is provided in `./Data-Process/data_check/`, including the following files:

- `check_file_and_folder.py`: Check the file and folder structure of the dataset.
- `check_valid_bandwidth.py`: Check the sample rate and valid bandwidth of the dataset.
- `count_time.py`: Count the duration of the singing voices and speech in the dataset.
- `plot_f0.py`: Plot the pitch (F0) of the singing voice audio.
- `plot_mel.py`: Plot the mel-spectrogram of audio.
The code for preprocessing the dataset is provided in `./Data-Process/data_preprocess/`, including the following files:

- `gen_final_json.py`: Generate the final JSON file for each singing voice from the annotated TextGrid and MusicXML files.
- `global2tgjson.py`: Add global style labels to the JSON and TextGrid files.
- `seg_singing.py` & `seg_speech.py`: Segment the singing voices and speech based on the TextGrid files.
The code for our benchmarks for Technique-Controllable Singing Voice Synthesis. You can also use GTSinger to train TCSinger!
The code for our benchmarks for Technique Recognition.
The code for our benchmarks for Style Transfer. You can use GTSinger to train StyleSinger and TCSinger!
The code for our benchmarks for Speech-to-Singing Conversion. You can use GTSinger to train AlignSTS!
If you find this code useful in your research, please cite our work:
```
@article{zhang2024gtsinger,
  title={GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks},
  author={Zhang, Yu and Pan, Changhao and Guo, Wenxiang and Li, Ruiqi and Zhu, Zhiyuan and Wang, Jialei and Xu, Wenhao and Lu, Jingyu and Hong, Zhiqing and Wang, Chuxin and others},
  journal={arXiv preprint arXiv:2409.13832},
  year={2024}
}
```
Any organization or individual is prohibited from using any technology mentioned in this paper to generate anyone's singing voice without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this item may place you in violation of copyright laws.