- Datasets
- Survey
- Funny Work
- Audio-driven
- Text-driven
- NeRF & 3D & Gaussian Splatting
- Metrics
- Tools & Software
- Slides & Presentations
- References
- Star History
This repository organizes papers, code, and resources related to generative adversarial networks (GANs) 🤗 and neural radiance fields (NeRF) 🎨, with a main focus on image-driven and audio-driven talking head synthesis papers and released code. 👤
A curated collection of talking head synthesis papers and their released code. ✍️
Most papers are linked to PDFs on arXiv or journal/conference websites 📚. However, some papers require an academic license to view 🔐.
🔆 This project Awesome-Talking-Head-Synthesis is ongoing - pull requests are welcome! If you have any suggestions (missing papers, new papers, key researchers or typos), please feel free to edit and submit a PR. You can also open an issue or contact me directly via email. 📩
⭐ If you find this repo useful, please give it a star! 🤩
2023.12 Update 📆
Thank you to https://github.com/Curated-Awesome-Lists/awesome-ai-talking-heads; I have added some of its content, such as Tools & Software and Slides & Presentations. 🙏 I hope this is helpful. 😊
If you have any feedback or ideas on extending this aggregated resource, please open an issue or PR - community contributions are vital to advancing this shared knowledge. 🤝
Let's keep pushing forward to recreate ever more realistic digital human faces! 💪 We've come so far but still have a long way to go. With continued research 🔬 and collaboration, I'm sure we'll get there! 🤗
Please feel free to star ⭐ and share this repo if you find it a valuable resource. Your support helps motivate me to keep maintaining and improving it. 🥰 Let me know if you have any other questions!
Dataset | Download Link | Description |
---|---|---|
FaceForensics++ | Download link | |
CelebV | Download link | |
VoxCeleb | Download link | VoxCeleb, a comprehensive audio-visual dataset for speaker recognition, encompasses both the VoxCeleb1 and VoxCeleb2 datasets. |
VoxCeleb1 | Download link | VoxCeleb1 contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube. |
VoxCeleb2 | Download link | Extracted from YouTube videos, VoxCeleb2 includes video URLs and discourse timestamps. As the largest public audio-visual dataset, it is primarily used for speaker recognition tasks. However, it can also be utilized for training talking-head generation models. To obtain download permission and access the dataset, apply here. Requires 300 GB+ storage space. |
ObamaSet | Download link | ObamaSet is a specialized audio-visual dataset focused on analyzing the visual speech of former US President Barack Obama. All video samples are collected from his weekly address footage. Unlike previous datasets, it exclusively centers on Barack Obama and does not provide any human annotations. |
TalkingHead-1KH | Download link | The dataset consists of 500k video clips, of which about 80k are at a resolution greater than 512x512. Only videos under permissive licenses are included. Note that the number of videos differs from that in the original paper because a more robust preprocessing script was used to split the videos. |
LRW (Lip Reading in the Wild) | Download link | LRW, a diverse English-speaking video dataset from the BBC program, features over 1000 speakers with various speaking styles and head poses. Each video is 1.16 seconds long (29 frames) and involves the target word along with context. |
MEAD 2020 | Download link | MEAD 2020 is a Talking Head dataset annotated with emotion labels and intensity labels. The dataset focuses on facial generation for natural emotional speech, covering eight different emotions on three intensity levels. |
CelebV-HQ | Download link | CelebV-HQ is a high-quality video dataset comprising 35,666 clips with a resolution of at least 512x512. It includes 15,653 identities, and each clip is manually labeled with 83 facial attributes, spanning appearance, action, and emotion. The dataset's diversity and temporal coherence make it a valuable resource for tasks like unconditional video generation and video facial attribute editing. Baidu Netdisk / Google Drive |
HDTF | Download link | HDTF, the High-definition Talking-Face Dataset, is a large in-the-wild high-resolution audio-visual dataset consisting of approximately 362 different videos totaling 15.8 hours. Original video resolutions are 720p or 1080p, and each cropped video is resized to 512 × 512 (a minimal crop-and-resize sketch follows this table). |
CREMA-D | Download link | CREMA-D is a diverse dataset with 7,442 original clips featuring 91 actors, including 48 male and 43 female actors aged 20 to 74, representing various races and ethnicities. The dataset includes recordings of actors speaking from a set of 12 sentences, expressing six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) at four emotion levels (Low, Medium, High, and Unspecified). Emotion and intensity ratings were gathered through crowd-sourcing, with 2,443 participants rating 90 unique clips each (30 audio, 30 visual, and 30 audio-visual). Over 95% of the clips have more than 7 ratings. For additional details on CREMA-D, refer to the paper link. |
LRS2 | Download link | LRS2 is a lip reading dataset that includes videos recorded in diverse settings, suitable for studying lip reading and visual speech recognition. |
GRID | Download link | The GRID dataset was recorded in a laboratory setting with 34 volunteers, each speaking 1000 phrases, totaling 34,000 utterance instances. Phrases follow specific rules, with six words randomly selected from six categories: "command," "color," "preposition," "letter," "number," and "adverb." Access the dataset here. |
SAVEE | Download link | The SAVEE (Surrey Audio-Visual Expressed Emotion) database is a crucial component for developing an automatic emotion recognition system. It features recordings from 4 male actors expressing 7 different emotions, totaling 480 British English utterances. These sentences, selected from the standard TIMIT corpus, are phonetically balanced for each emotion. Recorded in a high-quality visual media lab, the data undergoes processing and labeling. Performance evaluation involves 10 subjects rating recordings under audio, visual, and audio-visual conditions. Classification systems for each modality achieve speaker-independent recognition rates of 61%, 65%, and 84% for audio, visual, and audio-visual, respectively. |
BIWI (3D) | Download link | The Biwi 3D Audiovisual Corpus of Affective Communication serves as a compromise between data authenticity and quality, acquired at ETHZ in collaboration with SYNVO GmbH. |
VOCA | Download link | VOCA is a 4D-face dataset with approximately 29 minutes of 4D face scans and synchronized audio from 12 speakers. It greatly facilitates research in 3D visual speech generation (VSG). |
Multiface (3D) | Download link | The Multiface Dataset consists of high-quality multi-view video recordings of 13 people displaying various facial expressions. It contains approximately 12,200 to 23,000 frames per subject, captured at 30 fps from around 40 to 160 camera views with uniform lighting. The dataset's size is 65TB and includes raw images (2048x1334 resolution), tracked and meshed heads, 1024x1024 unwrapped face textures, camera calibration metadata, and audio. This repository provides code for downloading the dataset and building a codec avatar using a deep appearance model. |
MMFace4D | Download link | The MMFace4D dataset is a large-scale multi-modal dataset for audio-driven 3D facial animation research. It contains over 35,000 sequences captured from 431 subjects ranging in age from 15 to 68 years old. Various sentences from scenarios such as news broadcasting, conversations and storytelling were recorded, totaling around 11,000 utterances. High-fidelity data was captured using three synchronized RGB-D cameras to obtain high-resolution 3D meshes and textures. A reconstruction pipeline was developed to fuse the multi-view data and generate topology-consistent 3D mesh sequences. In addition to the 3D facial motions, synchronized speech audio is also provided. The final dataset covers a wide range of expressive talking styles and facial expressions through a diverse set of subjects and utterances. With its large scale, high quality of data and strong diversity, the MMFace4D dataset provides an ideal benchmark for developing and evaluating audio-driven 3D facial animation models. |
VFHQ | Download link | Most existing video face super-resolution (VFSR) methods are trained and evaluated on VoxCeleb1, which is designed for speaker identification and contains low-quality frames; as a consequence, VFSR models trained on it cannot produce visually pleasing results. VFHQ was collected with an automatic and scalable pipeline and contains over 16,000 high-fidelity clips of diverse interview scenarios. Experiments show that VFSR models trained on VFHQ generate results with sharper edges and finer textures than those trained on VoxCeleb1, and that temporal information plays a pivotal role in eliminating video consistency issues and further improving visual quality. VFHQ also supports a benchmarking study of several state-of-the-art algorithms under both bicubic and blind settings. |
MultiTalk | Download link | MultiTalk dataset is a new multilingual 2D video dataset featuring over 420 hours of talking videos across 20 languages. It contains 293,812 clips with a resolution of 512x512, a frame rate of 25 fps, and an average duration of 5.19 seconds per clip. The dataset shows a balanced distribution across languages, with each language representing between 2.0% and 9.7% of the total. |
CN-CVS | Download link | CN-CVS is a large-scale continuous visual-speech dataset in Mandarin Chinese consisting of short clips collected from TV news and Internet speech shows. The related CN-Celeb-AV is a multi-genre audio-visual person recognition dataset covering 11 real-world genres, collected from multiple Chinese open media sources. |
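Several of the datasets above (e.g. HDTF, CelebV-HQ, MultiTalk) share the convention of square face crops resized to 512 × 512 at a fixed frame rate. Below is a minimal, hedged sketch of that crop-and-resize step using OpenCV; the file paths are hypothetical and no dataset's official script is reproduced here. A real pipeline would crop around a per-frame detected face box rather than the fixed center square used for illustration.

```python
# A minimal sketch (assumed convention, not any dataset's official script) of
# cropping each frame to a center square and resizing to 512x512 with OpenCV.
# Real pipelines crop around a per-frame detected face box instead.
import cv2


def crop_and_resize(src_path: str, dst_path: str, size: int = 512, fps: float = 25.0) -> None:
    """Center-crop every frame of a video to a square, resize, and re-encode."""
    cap = cv2.VideoCapture(src_path)
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (size, size))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        side = min(h, w)
        y0, x0 = (h - side) // 2, (w - side) // 2  # top-left of the center square
        crop = frame[y0:y0 + side, x0:x0 + side]
        writer.write(cv2.resize(crop, (size, size), interpolation=cv2.INTER_AREA))
    cap.release()
    writer.release()


# Hypothetical usage:
# crop_and_resize("raw/clip_0001.mp4", "processed/clip_0001_512.mp4")
```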
Year | Title | Code | Project | Keywords |
---|---|---|---|---|
2024 | [Audio2Photoreal] From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations | Code | Project | Photoreal |
2024 | [Animate Anyone] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation | Code | Project | 🔥Animate (Alibaba; drives the viral "Subject Three" dance) |
2024 | [3DGAN] What You See Is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs | - | Project | 🔥NVIDIA |
2024 | [LivePortrait] LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control | Code | Project | 🔥Kuaishou |
Metrics | Paper | Link |
---|---|---|
PSNR (peak signal-to-noise ratio) | - | |
SSIM (structural similarity index measure) | Image Quality Assessment: From Error Visibility to Structural Similarity | |
CPBD (cumulative probability of blur detection) | A No-Reference Image Blur Metric Based on the Cumulative Probability of Blur Detection | |
LPIPS (learned perceptual image patch similarity) | The Unreasonable Effectiveness of Deep Features as a Perceptual Metric | paper |
NIQE (natural image quality evaluator) | Making a ‘Completely Blind’ Image Quality Analyzer | paper |
FID (Fréchet inception distance) | GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium | |
LMD (landmark distance error) | Lip Movements Generation at a Glance | |
LRA (lip-reading accuracy) | Talking Face Generation by Conditional Recurrent Adversarial Network | paper |
WER (word error rate) | LipNet: End-to-End Sentence-Level Lipreading | |
LSE-D (lip sync error - distance) | Out of Time: Automated Lip Sync in the Wild | |
LSE-C (lip sync error - confidence) | Out of Time: Automated Lip Sync in the Wild | |
ACD (average content distance) | FaceNet: A Unified Embedding for Face Recognition and Clustering | |
CSIM (cosine similarity) | ArcFace: Additive Angular Margin Loss for Deep Face Recognition | |
EAR (eye aspect ratio) | Real-Time Eye Blink Detection Using Facial Landmarks (Computer Vision Winter Workshop) | |
ESD (emotion similarity distance) | What Comprises a Good Talking-Head Video Generation?: A Survey and Benchmark | |
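As a quick illustration of how the simplest metrics in the table above are computed, here is a minimal Python sketch (assuming NumPy and scikit-image) for PSNR, SSIM, and CSIM. It is not any paper's official evaluation code, and the identity embeddings for CSIM are hypothetical placeholders for the output of a face-recognition network such as an ArcFace-style model, which is not included here.

```python
# A minimal sketch (not official evaluation code) of three metrics from the
# table above, assuming aligned uint8 RGB frames. The CSIM embeddings are
# hypothetical placeholders for the output of a face-recognition network.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def psnr(generated: np.ndarray, reference: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB; higher is better."""
    return peak_signal_noise_ratio(reference, generated, data_range=255)


def ssim(generated: np.ndarray, reference: np.ndarray) -> float:
    """Structural similarity in [-1, 1]; channel_axis=-1 handles HxWx3 input."""
    return structural_similarity(reference, generated, channel_axis=-1, data_range=255)


def csim(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity of two identity embeddings; higher means the
    generated face better preserves the reference identity."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))


if __name__ == "__main__":
    # Toy check: a frame compared against a lightly perturbed copy of itself.
    rng = np.random.default_rng(0)
    ref = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
    noise = rng.integers(-10, 11, size=ref.shape)
    gen = np.clip(ref.astype(np.int16) + noise, 0, 255).astype(np.uint8)
    print(f"PSNR: {psnr(gen, ref):.2f} dB  SSIM: {ssim(gen, ref):.4f}")
```

Metrics such as FID, LPIPS, or LSE-D/LSE-C additionally require pretrained networks (Inception, AlexNet/VGG, SyncNet), so they are best computed with the reference implementations linked above.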
Tool/Resource | Description |
---|---|
LUCIA | Development of an MPEG-4 Talking Head Engine. 💻
Yepic Studio | Create and dub talking head-style videos in minutes without expensive equipment. 🎥 |
Mel McGee's Talkbots | A complete multi-browser, multi-platform talking head application in SVG suitable for web sites or as an avatar. 🗣️ |
face3D_chung | Create 3D character avatar head objects with texture from a single photo for your games. 🎮 |
CrazyTalk | Exciting features for 3D head creation and automation. 🤪 |
Verbatim AI - Product Information, Latest Updates, and Reviews 2023 | A simple yet powerful API to generate AI "talking head" videos in near real-time with Verbatim AI. Add interest, intrigue, and dynamism to your chat bots! (🔧👄) |
Best Open Source BASIC 3D Modeling Software | Includes talk3D_chung, a small example using obj models created with face3D_chung, and speak3D_chung_dll, a dll to load and display face3D_chung talking avatars. (🛠️🎭) |
DVDStyler / Discussion / Help: ffmpeg-vbr or internal | Forum thread noting that talking-head footage receives an unnecessarily high bitrate in DVDStyler. (🛠️👄)
12 Best AI Video Generators to Use in 2023 (Free and Paid) | Whether you're an entrepreneur, small business owner, or run a large company, AI video generators make it super easy to create high-quality videos from scratch. (🔧🎥)
Presentation Title | Description |
---|---|
Few-Shot Adversarial Learning of Realistic Neural Talking Head Models | Presentation reviewing the few-shot adversarial learning of realistic neural talking head models. |
Nethania Michelle's Character | PPT: Presentation discussing the improvement of a 3D talking head for use as an avatar in a virtual meeting room.
Presenting you: Top tips on presenting with Prezi Video – Prezi | Article providing top tips for presenting with Prezi Video. |
Research Presentation | PPT: Resident Research Presentation Slide Deck. |
Adding narration to your presentation (using Prezi Video) – Prezi | Learn how to add narration to your Prezi presentation with Prezi Video. |
Website | Description |
---|---|
arXiv | Provides preprints in various academic fields, serving as an important platform for accessing the latest research findings. |
CVF Open Access | The Computer Vision Foundation's open-access platform, offering open-access papers from top conferences such as CVPR, ICCV, ECCV, and more. |
Papers with Code | A platform that aggregates research papers with accompanying code implementations, making it convenient to find the latest research findings and their corresponding implementations. |
ICCV - International Conference on Computer Vision | The International Conference on Computer Vision, gathering the latest research findings in the field of computer vision. |
ECCV - European Conference on Computer Vision | The European Conference on Computer Vision, providing the latest research results and related information in the field of computer vision. |
CVPR - Conference on Computer Vision and Pattern Recognition | The Conference on Computer Vision and Pattern Recognition, one of the top conferences in computer vision, showcasing numerous important research findings. |