# UTAUTAI: Unrestricted Tune Automated Technology Artificial Intelligence


## 📖 Quick Index

- Model Architecture
- What is UTAUTAI?
- Method
- TODO
- Appreciation

## 🚀 Model Architecture

UTAUTAI main architecture (🙇 sorry for the hand-drawn figure)

## 🤔 What is UTAUTAI?

An open-source repository aimed at generating matching vocal and instrumental tracks from lyrics, similar to Suno AI's Chirp and Riffusion.

## 🐍 Method

UTAUTAI's method is mainly inspired by SPEAR-TTS.

During training, the input consists of semantic tokens obtained from the 'lyrics2semantic AR' model, which extracts semantic tokens from lyrics, together with acoustic tokens. Additionally, MERT representations extracted from the music are quantized with k-means to obtain further semantic tokens.
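As a rough illustration of this quantization step, the sketch below turns frame-level MERT features into semantic tokens with k-means. This is a minimal sketch, not UTAUTAI's confirmed pipeline: the `m-a-p/MERT-v1-95M` checkpoint and the cluster count of 1024 are assumptions, and `train_features` is a placeholder for features pooled over the training set.

```python
# Minimal sketch: MERT features -> k-means semantic tokens.
# Checkpoint and cluster count are assumptions; UTAUTAI's settings may differ.
import torch
from transformers import AutoModel, Wav2Vec2FeatureExtractor
from sklearn.cluster import KMeans

MERT_ID = "m-a-p/MERT-v1-95M"  # assumed checkpoint
model = AutoModel.from_pretrained(MERT_ID, trust_remote_code=True)
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MERT_ID, trust_remote_code=True)

def mert_features(waveform, sr=24000):
    """Frame-level MERT representations for one audio clip: (frames, dim)."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state.squeeze(0).numpy()

# Fit k-means once on features pooled over the training set
# (`train_features` is a placeholder), then use the cluster ids
# as the extra semantic tokens described above.
kmeans = KMeans(n_clusters=1024, n_init=10).fit(train_features)
semantic_tokens = kmeans.predict(mert_features(waveform))
```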

During inference, however, MERT representations cannot be obtained from the music, since the music does not yet exist. We therefore train a Style Module, following the methodology of PromptTTS 2, to predict the target MERT representations from the prompt at inference time. The Style Module is a transformer-based diffusion model.
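To make the inference-time flow concrete, here is a hedged sketch of sampling a target MERT representation from such a diffusion-based Style Module. The `denoiser` interface, the linear noise schedule, and the tensor shapes are all assumptions for illustration, not UTAUTAI's confirmed implementation.

```python
import torch

@torch.no_grad()
def style_module_inference(denoiser, prompt_emb, steps=50, frames=100, dim=768):
    """Sample a target MERT representation from noise, conditioned on a prompt.
    `denoiser(x, t, cond=...)` is an assumed interface for the transformer denoiser."""
    betas = torch.linspace(1e-4, 0.02, steps)          # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, frames, dim)                    # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond=prompt_emb)   # predict noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])   # DDPM mean update
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # sampling noise
    return x  # stands in for the ground-truth MERT features at inference time
```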

I think this approach can successfully accomplish the target tasks. What do you think?

## 🧠 TODO

- How can we obtain lyrics that match the cropped audio? Or should we crop the audio at all? code
- Examine the handling of phonemization and special tokens, and make the necessary code modifications. code
- Fix the collator in the dataset (a padding/masking sketch is given after this list). code
- Complete the StyleModule inference code. code
- Other minor code fixes, such as masking strategies.
- Replace the diffusion model with a consistency model.
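For the collator item above, here is a minimal sketch of a padding collator that also returns an attention mask; the `tokens` field name and `pad_id` value are hypothetical, not the repository's actual schema.

```python
import torch

def collate_batch(batch, pad_id=0):
    """Hypothetical collator: right-pads variable-length token sequences
    and returns an attention mask (1 = real token, 0 = padding)."""
    lengths = [len(item["tokens"]) for item in batch]
    max_len = max(lengths)
    tokens = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros(len(batch), max_len, dtype=torch.long)
    for i, item in enumerate(batch):
        n = lengths[i]
        tokens[i, :n] = torch.as_tensor(item["tokens"], dtype=torch.long)
        mask[i, :n] = 1
    return {"tokens": tokens, "attention_mask": mask}
```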

## 🙏 Appreciation

### ⭐️ Show Your Support

If you find UTAUTAI interesting and useful, give us a star on GitHub! ⭐️ It encourages us to keep improving the model and adding exciting features.

### 🙆 Welcome Contributions

Contributions are always welcome.