<html>
<head>
<meta charset="UTF-8">
<title>Audio samples from "DiDiSpeech: A Large Scale Mandarin Speech Corpus"</title>
<link rel="stylesheet" type="text/css" href="../../stylesheet.css"/>
<link rel="shortcut icon" href="../../images/taco.png">
</head>
<body>
<article>
<header>
<h1>Audio samples from "DiDiSpeech: A Large Scale Mandarin Speech Corpus"</h1>
</header>
</article>
<div><b>Paper: </b><a href="https://arxiv.org/abs/2010.09275">arXiv</a></div>
<div><b>Authors:</b> Tingwei Guo, Cheng Wen, Dongwei Jiang, Ne Luo, Ruixiong Zhang, Shuaijiang Zhao, Wubo Li, Cheng Gong, Wei Zou, Kun Han, Xiangang Li</div>
<div><b>Abstract:</b> This paper introduces a new open-source Mandarin speech corpus called DiDiSpeech. It consists of about 800 hours of speech data at a 48 kHz sampling rate from 6000 speakers, together with the corresponding texts. All speech data in the corpus was recorded in a quiet environment and is suitable for various speech processing tasks, such as voice conversion, multi-speaker text-to-speech and automatic speech recognition. We conduct experiments on multiple speech tasks and evaluate the performance, showing that the corpus is promising for both academic research and practical applications. The corpus is available at https://outreach.didichuxing.com/research/opendata/.<br/><br/>
</div>
<div>For more information, refer to the paper "DiDiSpeech: A Large Scale Mandarin Speech Corpus", Tingwei Guo, Cheng Wen, Dongwei Jiang, Ne Luo, Ruixiong Zhang, Shuaijiang Zhao, Wubo Li, Cheng Gong, Wei Zou, Kun Han, Xiangang Li, arXiv:2010.09275, 2020. If you use the DiDiSpeech corpus in your work, please cite this paper, in which the corpus was introduced.</div>
<h2>Multi-speaker TTS</h2>
<div>This section presents audio samples synthesized by our multi-speaker speech synthesis models trained on the DiDiSpeech corpus. Each column corresponds to a single speaker. The first row contains the reference audio for each speaker, and the rows below contain audio samples synthesized by our models for that speaker.</div>
<h3>1. Seen Speakers</h3>
<blockquote>
<table>
<tr>
<td align=center width=200></td><td align=center width=200>Speaker 1</td><td align=center width=200>Speaker 2</td><td align=center width=200>Speaker 3</td><td align=center width=200>Speaker 4</td><td align=center width=200>Speaker 5</td>
</tr>
<tr height="10px"></tr>
<tr>
<td align=center width=200>Reference</td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00112025-ori.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00004573-ori.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00004552-ori.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00004519-ori.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/10004126-ori.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=200>Synthesized audio 1</td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00112025-syn_1.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00004573-syn_1.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00004552-syn_1.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00004519-syn_1.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/10004126-syn_1.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=200>Synthesized audio 2</td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00112025-syn_2.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00004573-syn_2.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00004552-syn_2.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/00004519-syn_2.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/seen_speakers/10004126-syn_2.wav"></audio></td>
</tr>
</table>
</blockquote>
<h3>2. Unseen Speakers</h3>
<blockquote>
<table>
<tr>
<td align=center width=100> </td><td align=center width=200>Speaker 1</td><td align=center width=200>Speaker 2</td><td align=center width=200>Speaker 3</td><td align=center width=200>Speaker 4</td><td align=center width=200>Speaker 5</td>
</tr>
<tr height="10px"></tr>
<tr>
<td align=center width=300>Reference</td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00005186-ori.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00012581-ori.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00117201-ori.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00006063-ori.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00006002-ori.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300>Synthesized audio 1</td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00005186-syn_1.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00012581-syn_1.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00117201-syn_1.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00006063-syn_1.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00006002-syn_1.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300>Synthesized audio 2</td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00005186-syn_2.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00012581-syn_2.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00117201-syn_2.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00006063-syn_2.wav"></audio></td>
<td><audio controls><source src="multi_speaker_tts/unseen_speakers/00006002-syn_2.wav"></audio></td>
</tr>
</table>
</blockquote>
<h2>Voice conversion</h2>
<div>Audio samples from both the parallel and non-parallel voice conversion (VC) models trained on the DiDiSpeech corpus are provided here. In the tables below, the source and target audio are speech samples recorded from the source and target speakers, respectively; both were held out from the training data. The converted audio in each row is produced from the source audio in the same row using our VC models.</div>
<h3>1. Parallel VC</h3>
<blockquote>
<table>
<tr>
<td align=center width=200></td><td align=center width=200>Source audio</td><td align=center width=200>Target audio</td><td align=center width=200>Converted audio</td>
</tr>
<tr>
<td style="white-space:nowrap" align=center width=300>Intra-gender sample (Female)</td>
<td><audio controls><source src="voice_conversion/parallel_vc/source_1.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/target_1.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/convert_1.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300> </td>
<td><audio controls><source src="voice_conversion/parallel_vc/source_5.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/target_5.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/convert_5.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300>Intra-gender sample (Male)</td>
<td><audio controls><source src="voice_conversion/parallel_vc/source_2.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/target_2.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/convert_2.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300> </td>
<td><audio controls><source src="voice_conversion/parallel_vc/source_6.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/target_6.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/convert_6.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300>Inter-gender sample (Male to Female)</td>
<td><audio controls><source src="voice_conversion/parallel_vc/source_3.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/target_3.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/convert_3.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300> </td>
<td><audio controls><source src="voice_conversion/parallel_vc/source_7.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/target_7.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/convert_7.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300>Inter-gender sample (Female to Male)</td>
<td><audio controls><source src="voice_conversion/parallel_vc/source_4.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/target_4.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/convert_4.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300> </td>
<td><audio controls><source src="voice_conversion/parallel_vc/source_8.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/target_8.wav"></audio></td>
<td><audio controls><source src="voice_conversion/parallel_vc/convert_8.wav"></audio></td>
</tr>
</table>
</blockquote>
<h3>2. Non-parallel VC</h3>
<blockquote>
<table>
<tr>
<td align=center width=200></td><td align=center width=200>Source audio</td><td align=center width=200>Target audio</td><td align=center width=200>Converted audio</td>
</tr>
<tr>
<td style="white-space:nowrap" align=center width=300>Intra-gender sample (Female)</td>
<td><audio controls><source src="voice_conversion/noparallel_vc/source_1.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/target_1.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/convert_1.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300> </td>
<td><audio controls><source src="voice_conversion/noparallel_vc/source_5.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/target_5.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/convert_5.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300>Intra-gender sample (Male)</td>
<td><audio controls><source src="voice_conversion/noparallel_vc/source_2.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/target_2.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/convert_2.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300> </td>
<td><audio controls><source src="voice_conversion/noparallel_vc/source_6.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/target_6.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/convert_6.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300>Inter-gender sample (Male to Female)</td>
<td><audio controls><source src="voice_conversion/noparallel_vc/source_3.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/target_3.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/convert_3.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300> </td>
<td><audio controls><source src="voice_conversion/noparallel_vc/source_7.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/target_7.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/convert_7.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300>Inter-gender sample (Female to Male)</td>
<td><audio controls><source src="voice_conversion/noparallel_vc/source_4.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/target_4.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/convert_4.wav"></audio></td>
</tr>
<tr height="10px"></tr>
<tr>
<td style="white-space:nowrap" align=center width=300> </td>
<td><audio controls><source src="voice_conversion/noparallel_vc/source_8.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/target_8.wav"></audio></td>
<td><audio controls><source src="voice_conversion/noparallel_vc/convert_8.wav"></audio></td>
</tr>
</table>
</blockquote>
</body>
</html>