US English Female Dataset #191
Replies: 5 comments 9 replies
-
Hi! Really, really happy to see more people experimenting with training models for Piper!
-
I'm not doing anything super fancy. I've been experimenting with getting
Eleven Labs TTS to generate files based on the same script used in Piper
Recording Studio, most of which is already normalized correctly.
Just out of curiosity, why are you putting a ton of time into making a
public domain dataset when LJ Speech already exists? I'm surprised Piper
hasn't been trained on it yet, as it could make a good base for
fine-tuning other models down the line. LibriVox is definitely a good
source of material, but be very careful about which titles you choose.
Non-fiction is probably best: you don't want the voice actor doing
different character voices throughout the dataset, as that will confuse
the model.
When looking at the data you provided, I noticed Whisper completely missed
most of the text in the first file, and likely in others as well. It's
extremely important to manually check everything the model outputs; I've
had a ton of issues in the past when relying on it for TTS datasets.
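One way to automate the first pass of that check is to score each Whisper transcript against the known recording script and flag low-similarity clips for manual review. A rough sketch with Python's difflib (the file names, sample text, and 0.8 threshold are all made up for illustration):

```python
from difflib import SequenceMatcher

def suspicious(script: str, transcript: str, threshold: float = 0.8) -> bool:
    """Flag a clip whose Whisper transcript diverges badly from the script."""
    ratio = SequenceMatcher(None, script.lower(), transcript.lower()).ratio()
    return ratio < threshold

# Hypothetical (file, (script line, Whisper output)) pairs.
pairs = {
    "0001.wav": ("the rainbow touched the valley", "the rainbow touched the valley"),
    "0002.wav": ("anne walked along the shore", "uh"),  # Whisper missed most of it
}
flagged = [name for name, (s, t) in pairs.items() if suspicious(s, t)]
print(flagged)  # ['0002.wav']
```

Anything flagged still needs a human listen; the ratio only tells you where to look first.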
Right now, nothing about this process is quick and easy other than the
training. The reason you don't see a ton of voices from open source is
because there is a lot of work involved in doing this correctly and making
something robust and high-quality.
…On Mon, Sep 4, 2023 at 12:18 PM StoryHack ***@***.***> wrote:
After some initial success with the finetuning, I started training from
scratch, and I'm not happy with the results. I'll need to either do as you
suggested and make a lot of manual edits, or figure out some way to more
automatically check/correct the text against the gutenberg.org text that
the narrator used to record.
I really would love to have a good dataset and voice to release to the
public domain so that other experimenters can work without licensing
worries.
What do you do to normalize text?
-
Also, somehow I thought that LJ Speech was under a different license. I think I'll give it a try.
-
Good luck! I'm on an Apple silicon machine, so unfortunately I can't train
locally, as Piper isn't compatible with MPS; I have to resort to using
Colab. Quick question: in the dataset there are three columns, the file
name, the regular text, and the normalized text. Did you modify the CSV
file to only include the file name and the normalized text? That's what
you should do, since Piper doesn't use the regular LJ Speech format; the
third column is reserved for multi-speaker models.
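For reference, dropping the middle column of an LJSpeech-style metadata.csv takes only a few lines of Python. This is a sketch, not part of any Piper tooling; the function name and sample line are mine.

```python
import csv
import io

def keep_id_and_normalized(metadata_text: str) -> str:
    """Reduce id|text|normalized rows to id|normalized rows."""
    reader = csv.reader(io.StringIO(metadata_text), delimiter="|",
                        quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", quoting=csv.QUOTE_NONE,
                        lineterminator="\n")
    for file_id, _raw_text, normalized in reader:
        writer.writerow([file_id, normalized])
    return out.getvalue()

sample = "LJ001-0001|Printing, in the only sense|printing in the only sense\n"
print(keep_id_and_normalized(sample), end="")  # LJ001-0001|printing in the only sense
```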
…On Mon, Sep 4, 2023 at 4:11 PM StoryHack ***@***.***> wrote:
Ok, I have LJSpeech training now on my puny RTX 3060. Each epoch takes
about 16 minutes. Due to the dataset size, I doubt it'll take the
recommended 2000 epochs to be useful.
-
Has anyone published any models using this dataset?
-
I just uploaded a ~12hr, ~7500 wav dataset to Kaggle. https://www.kaggle.com/datasets/storyhack/karen-us-female-tts-dataset
I've been using it overnight to finetune an HQ voice. I ran some tests this morning, and it sounds pretty good after relatively few (~150) epochs, so I figured it would be good to post in case anyone else wants to use it. I may use it to start a voice from scratch, rather than finetune, after I'm done testing.
About
This dataset was created from audiobooks recorded by Karen Savage and released into the public domain on Librivox.org. My end goal is to create a couple of public domain TTS voices to use as a base for finetuning. Some of the big datasets currently available allow only educational/research use, and the licensing is either complicated or unclear about exactly what you can do with voices made from them.
The three audiobooks used were: "Anne's House of Dreams" by Lucy Maud Montgomery, "Rainbow Valley," also by Lucy Maud Montgomery, and "The Sky Is Falling" by Lester del Rey. These were chosen for the quality of the recordings (no hiss, echoes, etc.) and because the reader spoke in an American accent. She has several high-quality recordings done in a British accent, too, and I suppose those could be used for a different dataset.
This dataset contains about 12 hours of recordings in 6847 total utterances. All recordings are between 3 and 15 seconds long.
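If anyone wants to apply the same duration bounds to their own clips, a simple check with Python's wave module might look like this (a sketch; it only handles uncompressed WAV files):

```python
import wave

def clip_duration(path: str) -> float:
    """Return the length of an uncompressed WAV clip in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def in_bounds(path: str, lo: float = 3.0, hi: float = 15.0) -> bool:
    """True if the clip falls inside the 3-15 second window used here."""
    return lo <= clip_duration(path) <= hi
```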
Format
The format and files were made for training a Piper voice, but I'm sure they can be used in many other systems fairly easily. The zip contains:
Process
The audio and text processing was done with the assistance of several automatic tools.
Here's a basic outline of what I did:
License
This dataset is placed into the public domain.