All progress updates #37
jpc announced in Announcements
Progress updates (from newest):
Progress update [2024-01-29]
We successfully trained a tiny S2A model on an en+pl+fr dataset and it can do voice cloning in French:
fr-voice-clone-2.mp4
fr-voice-clone-1.mp4
We were able to do this with frozen semantic tokens that were trained only on English and Polish. This supports the idea that we will be able to train a single semantic token model to cover all the languages in the world, quite likely even ones that are not currently well supported by the Whisper model. Stay tuned for more updates on this front. :)
Progress update [2024-01-18]
We spent the last week optimizing inference performance. We integrated torch.compile, added KV caching, and tuned some of the layers – we now run over 12x faster than real-time on a consumer 4090!
We also added an easy way to test voice cloning. Here is a sample voice cloned from a famous speech by Winston Churchill:
en-cloning.mp4
We can also mix languages in a single sentence (here the highlighted English project names are seamlessly mixed into Polish speech):
pl-en-mix.mp4
You can test all of these in our Colab. A Hugging Face Space is coming soon.
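For the curious, here is a minimal, illustrative sketch of the two optimizations mentioned above – compiling the model with torch.compile and keeping a preallocated key/value cache so each decoding step only computes attention for the new token. This is not the actual WhisperSpeech code, just the general pattern:

```python
import torch

class KVCache:
    """Preallocated per-layer key/value buffers for autoregressive decoding."""
    def __init__(self, batch, heads, max_len, head_dim, device="cpu"):
        shape = (batch, heads, max_len, head_dim)
        self.k = torch.zeros(shape, device=device)
        self.v = torch.zeros(shape, device=device)
        self.pos = 0  # number of timesteps cached so far

    def update(self, k_new, v_new):
        # Write the new step's keys/values, then return views over the whole
        # prefix so attention never recomputes projections for past tokens.
        t = k_new.shape[2]
        self.k[:, :, self.pos:self.pos + t] = k_new
        self.v[:, :, self.pos:self.pos + t] = v_new
        self.pos += t
        return self.k[:, :, :self.pos], self.v[:, :, :self.pos]

# torch.compile fuses and specializes the forward pass; with stable cache
# shapes the compiled graph can be reused for every decoding step:
# model = torch.compile(model, mode="reduce-overhead")
```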
Progress update [2024-01-10]
We’ve pushed a new SD S2A model that is a lot faster while still generating high-quality speech. We’ve also added an example of voice cloning based on a reference audio file.
As always, you can check out our Colab to try it yourself!
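If you want to try the reference-audio cloning from this update locally, the snippet below is a hedged usage sketch based on this repo's Pipeline API; the model reference string and keyword names may differ between releases, and `reference-speaker.wav` is a hypothetical local file:

```python
from whisperspeech.pipeline import Pipeline

# Model reference is illustrative; check the README for current checkpoints.
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Voice cloning: condition generation on a reference recording.
pipe.generate_to_file(
    "cloned.wav",
    "This sentence should come out in the reference speaker's voice.",
    speaker="reference-speaker.wav",  # hypothetical path to reference audio
)
```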
Progress update [2023-12-10]
Another trio of models, this time supporting multiple languages (English and Polish). Here are two new samples for a sneak peek. You can check out our Colab to try it yourself!
English speech, female voice (transferred from a Polish language dataset):
whisperspeech-sample.mp4
A Polish sample, male voice:
whisperspeech-sample-pl.mp4
Progress update [2023-07-14]
We have trained a new pair of models, added support for multiple speakers and integrated the Vocos vocoder to deliver a big overall quality boost. And this is not even our last word because we are doing hyperparameter tuning to train bigger, higher-quality models.
An end-to-end generation example, inspired by a famous president's speech (don't forget to unmute the videos):
Female voice:
we-choose-tts.mp4
Male voice:
we-choose-tts-s467.mp4
We have streamlined the inference pipeline and you can now test the model yourself on Google Colab.
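The Vocos integration mentioned above swaps the EnCodec decoder for a higher-quality vocoder. As a rough sketch of how Vocos reconstructs audio from EnCodec tokens (following the public Vocos API; the checkpoint name and token shapes here are assumptions, not WhisperSpeech's exact integration):

```python
import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

# 8 EnCodec codebooks x 200 frames of (here random) audio tokens.
audio_tokens = torch.randint(low=0, high=1024, size=(8, 200))
features = vocos.codes_to_features(audio_tokens)

# bandwidth_id selects which EnCodec bitrate the tokens came from.
audio = vocos.decode(features, bandwidth_id=torch.tensor([2]))  # 6 kbps
```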
Progress update [2023-04-13]
We have trained a preliminary T->S model and a new 3 kbps S->A model which improves the speech quality. Both models are far from perfect yet, but we are clearly moving in the right direction (to the moon 🚀🌖!).
End-to-end TTS model with ≈ 6% WER (both T->S and S->A sampled with simple multinomial sampling at T = 0.7, no beam search); see #9 for more details:
(don't forget to unmute the video)
test-e2e-jfk-T0.7.mp4
Ground truth:
we-choose.mp4
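"Simple multinomial sampling at T = 0.7" just means scaling the logits by 1/T, taking a softmax, and drawing one token id per step instead of searching over beams. A minimal sketch:

```python
import torch

def sample_token(logits: torch.Tensor, temperature: float = 0.7) -> torch.Tensor:
    """Draw one token id per batch row from temperature-scaled logits."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Example: one decoding step over a toy 1024-entry token vocabulary.
next_token = sample_token(torch.randn(1, 1024))
```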
Progress update [2023-04-03]
We have trained a working S->A model. It does not sound amazing, but that is mostly because of EnCodec quality at 1.5 kbps.
Validation set ground truth (don't forget to unmute):
ground-truth.mov
The generated output from the S->A model (multinomial sampling, temperature 0.8):
saar-1300hr-2l-20e-T0.8.mov
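To hear the quality ceiling that 1.5 kbps EnCodec imposes (independent of the S->A model), you can round-trip any recording through EnCodec at that bandwidth. This sketch uses the public EnCodec API; `sample.wav` is a hypothetical input file:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)  # kbps; the bottleneck discussed above

wav, sr = torchaudio.load("sample.wav")  # hypothetical input recording
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)        # quantized audio tokens
    roundtrip = model.decode(frames)  # best-case 1.5 kbps reconstruction

torchaudio.save("roundtrip-1.5kbps.wav", roundtrip.squeeze(0), model.sample_rate)
```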