US English Female Dataset #191
Replies: 5 comments 9 replies
-
Hi! Really, really happy to see more people experimenting with training models for Piper!
-
I'm not doing anything super fancy. I've been experimenting with getting
Eleven Labs TTS to generate files based on the same script used in Piper
Recording Studio, most of which is already normalized correctly.
Just out of curiosity, why are you putting a ton of time into making a
public domain dataset when LJ Speech already exists? I'm surprised Piper
hasn't been trained on it yet, as it could make a good base for
fine-tuning other models down the line. LibriVox is definitely a good
source of material, but be very careful about which titles you choose.
Non-fiction is probably best: you don't want the voice actor doing
different character voices throughout the dataset, as that will confuse
the model.
When looking at the data you provided, I noticed Whisper completely missed
most of the text in the first file, and likely in others as well. It's
extremely important to manually check everything the model outputs; I've
had a ton of issues in the past when relying on it for TTS datasets.
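One way to automate the first pass of that check is to score each Whisper transcript against the known recording script and flag low-similarity clips for manual review. A rough sketch with Python's difflib (the file names, sample text, and 0.8 threshold are all made up for illustration):

```python
from difflib import SequenceMatcher

def suspicious(script: str, transcript: str, threshold: float = 0.8) -> bool:
    """Flag a clip whose Whisper transcript diverges badly from the script."""
    ratio = SequenceMatcher(None, script.lower(), transcript.lower()).ratio()
    return ratio < threshold

# Hypothetical (file, (script line, Whisper output)) pairs.
pairs = {
    "0001.wav": ("the rainbow touched the valley", "the rainbow touched the valley"),
    "0002.wav": ("anne walked along the shore", "uh"),  # Whisper missed most of it
}
flagged = [name for name, (s, t) in pairs.items() if suspicious(s, t)]
print(flagged)  # ['0002.wav']
```

Anything flagged still needs a human listen; the ratio only tells you where to look first.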
Right now, nothing about this process is quick and easy other than the
training. The reason you don't see a ton of voices from open source is
because there is a lot of work involved in doing this correctly and making
something robust and high-quality.
…On Mon, Sep 4, 2023 at 12:18 PM StoryHack ***@***.***> wrote:
After some initial success with the finetuning, I started training from
scratch, and I'm not happy with the results. I'll need to either do as you
suggested and make a lot of manual edits, or figure out some way to more
automatically check/correct the text against the gutenberg.org text that
the narrator used to record.
I really would love to have a good dataset and voice to release to the
public domain so that other experimenters can work without licensing
worries.
What do you do to normalize text?
-
Also, somehow I thought that LJ Speech was under a different license. I think I'll give it a try.
-
Good luck! I'm on an Apple silicon machine, so unfortunately I can't train
locally, as Piper isn't compatible with MPS; I have to resort to using
Colab. Quick question: in the dataset there are three columns, the file
name, the regular text, and the normalized text. Did you modify the CSV
file to only include the file name and the normalized text? That's what
you should do, since Piper doesn't use the regular LJ Speech format; the
third column is reserved for multi-speaker models.
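For reference, dropping the middle column of an LJSpeech-style metadata.csv takes only a few lines of Python. This is a sketch, not part of any Piper tooling; the function name and sample line are mine.

```python
import csv
import io

def keep_id_and_normalized(metadata_text: str) -> str:
    """Reduce id|text|normalized rows to id|normalized rows."""
    reader = csv.reader(io.StringIO(metadata_text), delimiter="|",
                        quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", quoting=csv.QUOTE_NONE,
                        lineterminator="\n")
    for file_id, _raw_text, normalized in reader:
        writer.writerow([file_id, normalized])
    return out.getvalue()

sample = "LJ001-0001|Printing, in the only sense|printing in the only sense\n"
print(keep_id_and_normalized(sample), end="")  # LJ001-0001|printing in the only sense
```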
…On Mon, Sep 4, 2023 at 4:11 PM StoryHack ***@***.***> wrote:
Ok, I have LJSpeech training now on my puny RTX 3060. Each epoch takes
about 16 minutes. Due to the dataset size, I doubt it'll take the
recommended 2000 epochs to be useful.
-
Has anyone published any models using this dataset?
-
I just uploaded a ~12hr, ~7500 wav dataset to Kaggle. https://www.kaggle.com/datasets/storyhack/karen-us-female-tts-dataset
I've been using it overnight to finetune an HQ voice. I ran some tests this morning, and it sounds pretty good after relatively few (~150) epochs, so I figured it would be good to post in case anyone else wants to use it. I may use it to start a voice from scratch, rather than finetune, after I'm done testing.
About
This dataset was created from audiobooks recorded by Karen Savage and released into the public domain on Librivox.org. My end goal is to create a couple of public domain TTS voices to use as a base for finetuning. Some of the big datasets currently available allow only educational/research use, and the licensing is either complicated or unclear about exactly what you can do with voices made from them.
The three audiobooks used were: "Anne's House of Dreams" by Lucy Maud Montgomery, "Rainbow Valley," also by Lucy Maud Montgomery, and "The Sky Is Falling" by Lester del Rey. These were chosen for the quality of the recordings (no hiss, echoes, etc.) and because the reader spoke in an American accent. She has several high-quality recordings done in a British accent, too, and I suppose those could be used for a different dataset.
This dataset contains about 12 hours of recordings in 6847 total utterances. All recordings are between 3 and 15 seconds long.
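If anyone wants to apply the same duration bounds to their own clips, a simple check with Python's wave module might look like this (a sketch; it only handles uncompressed WAV files):

```python
import wave

def clip_duration(path: str) -> float:
    """Return the length of an uncompressed WAV clip in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def in_bounds(path: str, lo: float = 3.0, hi: float = 15.0) -> bool:
    """True if the clip falls inside the 3-15 second window used here."""
    return lo <= clip_duration(path) <= hi
```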
Format
The format and files were made for training a Piper voice, but I'm sure they can be used in many other systems fairly easily. The zip contains:
Process
The audio and text processing was done with the assistance of several automatic tools.
Here's a basic outline of what I did:
License
This dataset is placed into the public domain.