PolyLangVITS

Multilingual Speech Synthesis System Using VITS

Prerequisites

A Windows/Linux system with a minimum of 16GB RAM.
A GPU with at least 12GB of VRAM.
Python == 3.8
Anaconda installed.
PyTorch installed.
CUDA 11.x installed.
Zlib DLL installed.

PyTorch install command:

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

CUDA 11.7 install: https://developer.nvidia.com/cuda-11-7-0-download-archive

Zlib DLL install: https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-zlib-windows

Install pyopenjtalk manually: pip install -U pyopenjtalk --no-build-isolation

If this command fails, install the following libraries first and then retry: cmake, Cython
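
The prerequisites above can be sanity-checked before installing anything else. The following is a minimal sketch, not part of PolyLangVITS: the function name check_environment and the report keys are hypothetical, and it only inspects the Python version and, if PyTorch is importable, whether PyTorch can see a CUDA device.

```python
import importlib.util
import sys


def check_environment():
    """Return a dict of prerequisite checks (hypothetical helper, not part of the repo)."""
    report = {
        # The prerequisites pin Python 3.8.
        "python_3_8": sys.version_info[:2] == (3, 8),
        # find_spec returns None when the package cannot be found.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "cuda_version": None,
    }
    if report["torch_installed"]:
        try:
            import torch
        except (ImportError, OSError):
            # A broken install (for example, a missing DLL) can fail at import time.
            report["torch_installed"] = False
        else:
            report["cuda_available"] = torch.cuda.is_available()
            # torch.version.cuda is None for CPU-only builds.
            report["cuda_version"] = torch.version.cuda
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

If cuda_available is False on a machine with a supported GPU, the usual suspects are a CPU-only PyTorch build or a driver that is too old for the installed CUDA version.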

Installation

Create an Anaconda environment:

conda create -n polylangvits python=3.8

Activate the environment:

conda activate polylangvits

Clone this repository to your local machine:

git clone https://github.com/ORI-Muchim/PolyLangVITS.git

Navigate to the cloned directory:

cd PolyLangVITS

Install the necessary dependencies:

pip install -r requirements.txt

Prepare_Datasets

Place the audio files as shown below.

Both .mp3 and .wav files are accepted.

Each speaker folder name must end with the language code in square brackets, for example speaker0[KO].

PolyLangVITS
├────datasets
│       ├───speaker0[KO]
│       │   ├────1.mp3
│       │   └────1.wav
│       ├───speaker1[JA]
│       │   ├────1.mp3
│       │   └────1.wav
│       ├───speaker2[EN]
│       │   ├────1.mp3
│       │   └────1.wav
│       ├───speaker3[ZH]
│       │   ├────1.mp3
│       │   └────1.wav
│       ├───integral.py
│       └───integral_low.py
│
├────vits
├────get_pretrained_model.py
├────inference.py
├────main_low.py
├────main_resume.py
├────main.py
├────Readme.md
└────requirements.txt

This layout is only an example; you can add more speakers.
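
Before training, it can help to confirm that every speaker folder follows the naming rule. The sketch below is not part of the repository: parse_speaker_dir and scan_datasets are hypothetical helpers, and the pattern assumes a two-letter uppercase code in square brackets, as in the example tree (KO, JA, EN, ZH).

```python
import re
from pathlib import Path

# Assumption: a folder name is "<speaker>[<two uppercase letters>]", e.g. speaker0[KO].
SPEAKER_DIR = re.compile(r"^(?P<speaker>.+)\[(?P<lang>[A-Z]{2})\]$")
AUDIO_SUFFIXES = {".mp3", ".wav"}


def parse_speaker_dir(name):
    """Split 'speaker0[KO]' into ('speaker0', 'KO'); return None if the name does not match."""
    match = SPEAKER_DIR.match(name)
    if match is None:
        return None
    return match.group("speaker"), match.group("lang")


def scan_datasets(root):
    """Map each well-formed speaker folder under root to (language code, audio file count)."""
    result = {}
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            continue  # skips files such as integral.py
        parsed = parse_speaker_dir(entry.name)
        if parsed is None:
            continue  # folder name lacks a language code
        speaker, lang = parsed
        count = sum(1 for f in entry.iterdir() if f.suffix.lower() in AUDIO_SUFFIXES)
        result[speaker] = (lang, count)
    return result


if __name__ == "__main__":
    for speaker, (lang, count) in scan_datasets("datasets").items():
        print(f"{speaker}: language={lang}, audio files={count}")
```

Folders that are silently skipped by scan_datasets are the ones to rename.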

Usage

To start this tool, use the following command, replacing {language}, {model_name}, and {sample_rate} with your respective values:

python main.py {language} {model_name} {sample_rate}

On lower-spec machines (VRAM < 12GB), use this command instead:

python main_low.py {language} {model_name} {sample_rate}

If the data configuration is already complete and you want to resume training, run:

python main_resume.py {model_name}
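
The choice between the three entry points can be summarized in code. This is only an illustrative sketch: build_train_command and its vram_gb and resume parameters are hypothetical, and the 12GB threshold is taken from the note above. The accepted values for {language} are not listed in this section, so the function passes the argument through unchanged.

```python
import sys

LOW_VRAM_THRESHOLD_GB = 12  # from the note above: use main_low.py when VRAM < 12GB


def build_train_command(language, model_name, sample_rate, vram_gb=None, resume=False):
    """Return the argument list for the matching training script (hypothetical helper)."""
    if resume:
        # Resuming only needs the model name.
        return [sys.executable, "main_resume.py", model_name]
    script = "main.py"
    if vram_gb is not None and vram_gb < LOW_VRAM_THRESHOLD_GB:
        script = "main_low.py"
    return [sys.executable, script, language, model_name, str(sample_rate)]


if __name__ == "__main__":
    # Placeholder values; replace them with your own before running anything.
    print(build_train_command("{language}", "{model_name}", 22050, vram_gb=8))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root.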

Inference

After the model has been trained, you can generate predictions by using the following command, replacing {model_name} and {model_step} with your respective values:

python inference.py {model_name} {model_step}

For text-to-speech inference, use the following:

python inference-stt.py {model_name} {model_step}

You can also pass the text directly on the command line instead of editing the code:

python inference-stt.py {model_name} {model_step} {text}
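
To find a value for {model_step}, one option is to look at the saved checkpoints. The sketch below assumes generator checkpoints are named G_<step>.pth, as in the upstream jaywalnut310/vits training code; this README does not confirm the naming or the checkpoint directory, so latest_model_step is a hypothetical helper and the directory is a parameter you must supply.

```python
import re
from pathlib import Path

# Assumption: generator checkpoints follow the upstream VITS pattern G_<step>.pth.
CHECKPOINT_NAME = re.compile(r"^G_(\d+)\.pth$")


def latest_model_step(checkpoint_dir):
    """Return the highest step among G_<step>.pth files, or None if there are none."""
    steps = []
    for path in Path(checkpoint_dir).iterdir():
        match = CHECKPOINT_NAME.match(path.name)
        if match is not None:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None
```

The returned integer can then be used as the {model_step} argument in the commands above.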

References

For more information, please refer to the following repositories:

jaywalnut310/vits
CjangCjengh/vits
Kyubyong/g2pK
tenebo/g2pk2
henrymass/AudioSlicer
