<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description"
content="">
<meta name="keywords" content="MS-UFAD">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>MS-UFAD</title>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});
</script>
<script type="text/javascript"
src="http://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-PYVRSFMDRL"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'G-PYVRSFMDRL');
</script>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/favicon_transparent.png">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
<style>
.container {
position: relative;
}
.text-block {
position: relative;
top: 0px;
right: 0px;
margin-left: 5px;
width: 80%;
text-align: center;
border-radius:10px 10px 0px 0px;
border: 1px solid #787878;
background-color: #787878;
color: white;
padding-left: 0px;
padding-right: 0px;
padding-top: 3px;
padding-bottom: 3px;
}
.center-text {
width: 800px; /* width of the text block */
margin: 0 auto; /* auto left/right margins center the block */
text-align: left; /* left-align the text */
}
</style>
</head>
<body>
<section class="center-text">
<div style="margin-bottom:-80px" class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-2 publication-title">MS-UFAD: A Large-Scale Dataset for Real-world Unified Face Attack Detection with Text Descriptions</h1>
<div class="column has-text-centered">
<!-- Code Link. -->
<span class="link-block">
<a href=""
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Dataset (Coming Soon)</span>
</a>
</span>
<!-- Dataset Link. -->
<!--
<span class="link-block">
<a href=""
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="far fa-images"></i>
</span>
<span>Data</span>
</a>
-->
</div>
</div>
</div>
</div>
</div>
</section>
<!--
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">TL;DR</h2>
<center>
<div class="content has-text-justified" style='width:80%'>
<p>
TBD.
</p>
</div>
</center>
</div>
</div>
</div>
</section>
-->
<section class="center-text">
<section class="columns is-vcentered interpolation-panel" width=100%>
<div class="container is-max-desktop">
<br>
<br>
<center><h2 class="title is-3">Dataset Overview</h2></center><br>
<p class="content has-text-justified">
As shown in <strong>Table 1</strong>, our MS-UFAD dataset includes 5,000 subjects from 3 ethnic groups (Asian, African, and European), covering ages from 16 to 80 years old, with a gender ratio close to 1:1. The
collection settings include 5 types of lighting conditions (normal, strong, dim, side light, and back light), and scenes include both indoor and outdoor environments. The capturing devices cover 30 different models of smartphones. The attack data generation process is as follows:
</p>
<!-- Visual Effects. -->
<div class="column">
<div class="content">
<p class="content has-text-justified" style='width:80%'>
<ul>
<li><p class="content has-text-justified" style='width:100%'>Video Replay Presentation Attack: Each subject’s video is displayed on a high-definition screen, and then the video on the screen is recaptured with a smartphone. During capturing, the angle is adjusted appropriately to avoid moire patterns, reflections, and artifacts.</p></li>
<li><p class="content has-text-justified" style='width:100%'>Photo Presentation Attack: For each subject, a frame is extracted from its video, printed on high-definition photo paper, and then recaptured with a smartphone. </p></li>
<li><p class="content has-text-justified" style='width:100%'>Adversarial Example Attack: For each subject’s video, 10 different adversarial attack algorithms are used to perform untargeted attacks (so that the face comparison model ArcFace can no longer match the identity), yielding adversarial examples for each video; a minimal sketch of this step is given after this list.</p></li>
<li><p class="content has-text-justified" style='width:100%'>Deepfake Attack via Face Swapping: For each subject’s video and another randomly selected subject’s video, face swapping is performed using them as source and target videos respectively, including 12 different face swapping algorithms.</p></li>
<li><p class="content has-text-justified" style='width:100%'>Deepfake Attack via Reenactment: A frame randomly extracted from each subject’s video serves as the source image, and a video from another randomly selected subject serves as the driving video for face reenactment, covering 14 reenactment algorithms.</p></li>
<li><p class="content has-text-justified" style='width:100%'>Deepfake Attack via Editing: For each subject’s video, 4 different face editing algorithms are used for digital forgery, including changing expressions, hairstyles, and styles.</p></li>
<li><p class="content has-text-justified" style='width:100%'>Deepfake Attack via Talking Head Generation: For each subject, a frame or video is used as the source, and audio as the driving force, with 10 different digital human generation algorithms used to create talking head videos.</p></li>
</ul>
</p>
</div>
</div>
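<p class="content has-text-justified">
As a minimal sketch of the adversarial-example step above, the snippet below illustrates an untargeted FGSM-style perturbation that pushes a face embedding away from its clean value. It is an illustration only, not the exact MS-UFAD generation pipeline; the embedding model argument and the tensor layout are hypothetical placeholders standing in for whatever ArcFace implementation is used.
</p>
<pre><code class="language-python">
# Hypothetical sketch of an untargeted adversarial perturbation against a face
# embedding model (FGSM-style). Illustration only, not the MS-UFAD pipeline.
import torch
import torch.nn.functional as F

def untargeted_fgsm(frame, embed_model, epsilon=4.0 / 255.0):
    """Push the embedding of `frame` away from its clean embedding."""
    x = frame.clone().detach().requires_grad_(True)   # (1, 3, H, W), values in [0, 1]
    with torch.no_grad():
        clean_emb = embed_model(frame)                 # embedding of the clean face
    emb = embed_model(x)
    # Untargeted objective: reduce cosine similarity to the clean embedding,
    # so the face comparison model can no longer match the identity.
    loss = F.cosine_similarity(emb, clean_emb).mean()
    loss.backward()
    x_adv = x - epsilon * x.grad.sign()                # one signed gradient step
    return x_adv.clamp(0.0, 1.0).detach()
</code></pre>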
<br>
<p class="content has-text-justified">
To simulate the data collection and usage pipeline of real-world face recognition systems, we applied three different types of compression to the original face videos and attack videos: H.264 compression at CRF 23 (c23) and CRF 30 (c30), and HTML5 image compression (h5). These operations typically remove or weaken various forgery clues, increasing the difficulty of attack detection. Following the video compression protocol in [71], we used ffmpeg to compress the live videos and all attack videos, obtaining the c23 and c30 quality levels. We used the HTML5 canvas element to compress images with a compression ratio of 0.5, obtaining the h5 subset.
</p>
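<p class="content has-text-justified">
For concreteness, the sketch below shows how the c23 and c30 subsets could be produced with ffmpeg driven from Python; the exact flags and directory layout used for MS-UFAD may differ, and the h5 subset is produced in the browser with the HTML5 canvas element (compression ratio 0.5) rather than with this script.
</p>
<pre><code class="language-python">
# Illustrative sketch of the H.264 re-compression step (c23 / c30 subsets).
# The actual flags and directory layout used for MS-UFAD may differ.
import subprocess
from pathlib import Path

def compress(src: Path, dst: Path, crf: int):
    """Re-encode one video with libx264 at the given constant rate factor."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-c:v", "libx264", "-crf", str(crf), str(dst)],
        check=True,
    )

for video in Path("raw_videos").glob("*.mp4"):          # hypothetical input folder
    compress(video, Path("c23") / video.name, crf=23)    # c23 subset
    compress(video, Path("c30") / video.name, crf=30)    # c30 subset
</code></pre>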
<br>
</div>
</section>
</section>
<section class="section">
<section class="columns is-vcentered interpolation-panel" width=100%>
<div class="hero-body" style='margin-top:-25px;margin-bottom:-25px'>
<center><h2 class="title is-3">Table 1: The MS-UFAD Dataset</h2></center><br>
<center><h5 class="title is-3">Our dataset contains 5,000 IDs; each ID covers 52 face attack methods across 3 major categories, at 4 different quality levels, with fine-grained text descriptions at the sample level. [Keys: I=Image, V=Video]</h5></center><br>
<div class="hero-body" style='margin-top:-25px;margin-bottom:-25px'>
<center>
<img src="https://ailab.zkj.com/shtjs/4.png"
class="interpolation-image" width=100%/></center>
<br>
</div>
</div>
</section>
</section>
<section class="section">
<section class="columns is-vcentered interpolation-panel" width=100%>
<div class="hero-body" style='margin-top:-25px;margin-bottom:-25px'>
<center><h2 class="title is-3">Table 2: Deepfake Methods</h2></center><br>
<div class="hero-body" style='margin-top:-25px;margin-bottom:-25px'>
<center>
<img src="https://ailab.zkj.com/shtjs/5.png"
class="interpolation-image" width=100%/></center>
</div>
</div>
</section>
</section>
<section class="section">
<section class="columns is-vcentered interpolation-panel" width=100%>
<div class="container is-max-desktop">
<center><h2 class="title is-3">Details of the Implemented Deepfake Methods</h2></center><br>
<p class="content has-text-justified">
We employed 40 different deepfake methods, as shown in the table above. Below, we provide a brief introduction to each method.
</p>
<!-- Visual Effects. -->
<div class="column">
<div class="content">
<p class="content has-text-justified" style='width:100%'>
<ul>
<li><p class="content has-text-justified" style='width:100%'><strong>FSGAN:</strong>FSGAN performs quite well in face swapping, effectively preserving the identity features of the source face while transferring facial and expression information to the target image. It introduces a novel face reconstruction method based on Recurrent Neural Networks (RNNs), which can adjust for changes in pose and expression and can be applied to single images or video sequences. For video sequences, a continuous interpolation method based on ghosting, Delaunay triangulation, and barycentric coordinates is introduced. FSGAN contains many different sub-networks suitable for more general face swapping tasks. However, the authors did not specify how to train such a large network. Nevertheless, one can take inspiration from the ideas presented in the article or apply parts of the network to achieve good results. In 2022, FSGANv2 was proposed, offering a facial swapping solution that does not require subject-specific training, achieving significant changes in pose and expression through iterative deep learning, suitable for single images or video sequences. It uses a facial blending network to maintain target skin color and lighting conditions while dealing with occluded areas. However, for comparative effectiveness, we implemented the method using the official FSGAN code available at: https://github.com/NVlabs/imaginaire. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>PBE:</strong>Most face swapping methods utilize GANs; however, this paper employs a diffusion model for editing. We use it for face swapping, where PBE achieves this by utilizing self-supervised training to disentangle and reorganize source images and templates. Simple methods, though, can lead to noticeable blending artifacts. The authors thoroughly analyze this issue and propose content bottlenecks and strong augmentation to avoid simple copy-and-paste solutions of template images. To ensure the controllability of the editing process, the authors designed arbitrary-shaped masks for template images and used classifier-free guidance to increase similarity with the template images. The entire framework requires only a single forward pass of the diffusion model without any iterative optimization. It performs well on in-the-wild data and achieves high-fidelity controlled editing. However, after swapping, there may occasionally be other artifacts. We employ the official implementation code: https://github.com/Fantasy-Studio/Paint-by-Example. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>SimSwap:</strong>SimSwap is a current state-of-the-art (SOTA) face swapping method, introducing a new approach to address the lack of generalization to arbitrary identities. It transfers identity information at the feature level through an ID injection module while using a weak feature matching loss to preserve the attributes of the target face. This method overcomes challenges in generalization capabilities and attribute preservation found in traditional face swapping methods, enabling arbitrary source-to-target face swapping while maintaining attributes like expressions and gaze directions. SimSwap offers single-frame and video face swapping modes and supports specific person swapping. Many face swapping tools use it as their internal swapping algorithm. However, it failed to preserve facial expressions and gaze directions, possibly because the ID extraction was not clean. We use the official implementation code: https://github.com/neuralchen/SimSwap. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>InSightFace:</strong>InSightFace is a powerful face algorithm that has been continuously updated and iterated in recent years, serving as a tool used within many models. Some use it as their internal basic face swapping method, while others leverage its face detection functionality. Now available as a package, InSightFace can be used conveniently and quickly. It seamlessly blends the features of the source face while maintaining the target facial identity characteristics. We use the official implementation code: https://github.com/deepinsight/insightface. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>SAFA:</strong> Building upon FOMM, SAFA improves to tackle challenges caused by complex scene structures in facial animation tasks. This method constructs specific geometric structures to simulate different components of facial images, aiming to solve potential improper distortions and occlusions in generated images. At the core of the method is the use of a 3D Morphable Model (3DMM) to simulate the face, multiple affine transformations to simulate other foreground components like hair and beard, and an identity transformation for the background. The geometric embedding of 3DMM not only helps generate realistic structures for the driving scene but also better perceives occluded areas in the generated images. Incorporating 3DMM has led to a qualitative leap in maintaining facial rigidity during expression-driven animations, meaning less facial deformation and simultaneous driving and swapping capabilities. We use the official implementation code: https://github.com/qiulin-w/safa. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>UniFace:</strong>Face reenactment and swapping are two highly related tasks, but current research works almost always consider each task separately. In this work, the authors propose an efficient end-to-end unified framework to perform these two tasks. Unlike current methods that directly utilize pre-trained face structure prior networks to obtain facial attributes and identity information, the authors use a self-supervised approach to decouple and obtain attribute information represented as vectors and identity information represented as feature maps. The method does not rely on any facial priors during inference, making it perform better in extreme environments. Additionally, current methods do not fully consider the intrinsic connection between the two tasks, leading to reduced performance under a unified framework. The authors meticulously designed attribute and face transfer modules and efficiently combined corresponding modules according to the relationship between the two tasks to enhance the performance of each task. Experiments prove that the unified structure achieves more advanced results in both face reenactment and swapping tasks. We use the official implementation code: https://github.com/xc-csc101/UniFace. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>MobileSwap:</strong>Advanced face swapping methods have achieved remarkable results. However, most of these methods involve many parameters and computations, making them challenging to apply in real-time applications or deploy on edge devices like smartphones. In this work, a lightweight Identity-aware Dynamic Network (IDN) is proposed that dynamically adjusts model parameters through identity information, for arbitrary face swapping. The provided IDN contains only 0.50M parameters and requires 0.33G FLOPs per frame, thus enabling real-time video face swapping on mobile devices. Furthermore, a stable training method based on knowledge distillation is introduced, and a loss reweighting module is adopted to achieve better comprehensive results. We use the official implementation code: https://github.com/Seanseattle/MobileFaceSwap. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>e4s:</strong> e4s rethinks face swapping from the perspective of fine-grained face editing, termed "edit swapping." Moreover, e4s can handle facial occlusions with an average mask. The core is a novel Region GAN Inversion (RGI) method that explicitly disentangles shape and texture, allowing face swapping in the latent space of StyleGAN. Based on disentanglement, face swapping is reformulated as a simplified style and mask swapping problem. Recently, e4s has been improved, mainly addressing potential lighting discrepancies when transferring the source face's skin to the target image, ensuring that the swapped face maintains target lighting conditions while preserving the source skin. It also repairs post-swap faces, mainly potential mismatch areas that may occur during the mask swapping process, refining facial shapes. The latest generation speed is very slow, and it has lost continuity for video face swapping. We use the official implementation code: https://github.com/e4s2024/E4S2024. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>FaceDancer:</strong>FaceDancer is a novel single-stage face swapping technique with real-time generative performance, utilizing Adaptive Feature Fusion Attention (AFFA) and Interpretative Feature Similarity Regularization (IFSR) modules. It achieves high-fidelity identity transfer without additional face segmentation processes while preserving target facial attributes such as expression, pose, and lighting. IFSR leverages a pre-trained identity encoder to preserve key facial attributes, while the AFFA module learns gated features for adaptive fusion of attributes and identity information. Experiments on various datasets show that FaceDancer outperforms existing methods in identity preservation and pose maintenance. It can reach 200ms on A100. We use the official implementation code: https://github.com/felixrosberg/FaceDancer. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>AFS:</strong>AFS is a novel, high-fidelity, and straightforward face swapping method that explicitly decomposes the intermediate latent space W+ of a pre-trained StyleGAN into "identity" and "style" subspaces. In W+, a latent code is the sum of an "identity" code and a "style" code from the corresponding subspaces. Through this decomposition, face swapping becomes simple arithmetic in W+, namely the sum of the source "identity" code and the target "style" code (a toy sketch of this arithmetic is given after this list). This makes AFS more intuitive and elegant than other face swapping methods. We use the official implementation code: https://github.com/truongvu2000nd/afs. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>FaceShifter:</strong>FaceShifter is a SOTA face swapping algorithm from 2020, mainly addressing the occlusion challenge encountered in face swapping. It proposes a two-stage framework for high-fidelity and occlusion-aware face swapping. AEI-Net achieves comprehensive integration of target attributes through an adaptive embedding integration network, addressing inconsistencies in lighting and shape; HEAR-Net handles occlusions in a self-supervised manner without manual annotations. Compared to existing methods, FaceShifter excels in fidelity and identity preservation. Not only is FaceShifter a powerful and practical tool, but it also provides an excellent platform for researchers and developers to explore face transformation technologies. We use the official implementation code: https://github.com/taotaonice/FaceShifter. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>DiffSwap:</strong>DiffSwap is a 3D-aware conditional repair technique that uses a diffusion model for high-fidelity controllable face swapping. With identity features and facial landmarks guided conditional repair, it solves the shape retention problem inherent in traditional methods. The proposed midpoint estimation method requires only two steps to achieve identity constraints, improving the quality of the swap. The introduced Mask D is a technique to indicate which facial features need to be swapped. Mask D can identify and label different areas of the face, allowing for precise control over the face swapping process. Users can select specific areas to swap, such as eyes, nose, mouth, etc., achieving personalized face swapping effects. This technology has potential applications in many fields, such as movie special effects, video games, and virtual reality. Its reasoning speed is very slow. We use the official implementation code: https://github.com/wl-zhao/DiffSwap. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>FOMM:</strong>FOMM is a pioneering work in reenactment algorithms, and many current algorithms are improvements based upon it, appearing in many public datasets. FOMM uses a self-supervised formula to separate appearance and motion information. To support complex movements, it employs a representation method consisting of a set of learned keypoints and their local affine transformations. The generator network models occlusions that occur during the target motion process and combines the appearance extracted from the source image with the motion extracted from the driving video. We use the official implementation code: https://github.com/AliaksandrSiarohin/first-order-model. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>FS_vid2vid:</strong>vid2vid is a deep learning-based video generation method that trains a neural network model to transform input static images or a small number of video clips into continuous videos. Compared to traditional video generation methods, vid2vid has higher generation quality and consumes fewer computational resources. This makes it of great practical value in many application scenarios. FS_vid2vid is an improved method based on vid2vid technology. Its main idea is to synthesize new videos using a small number of video clips. This method greatly improves the diversity and flexibility of generated videos while maintaining high-quality generation, making it play a larger role in many application scenarios. We use the official implementation code: https://github.com/NVlabs/few-shot-vid2vid. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>FADM:</strong>This paper is the first to use a diffusion model for face driving, proposing a facial animation framework based on an Attribute-Guided Diffusion Model (FADM). Previous GAN-based methods often produce unnatural distortions and artifacts due to complex motion deformations, while diffusion models have outstanding modeling capabilities. The main component is the Attribute-Guided Conditional Network (AGCN), which adaptively combines animation features and 3D facial reconstruction results, integrating appearance and motion conditions into the diffusion process, correcting unnatural artifacts and distortions, and enriching high-fidelity facial details through iterative diffusion refinement. We use the official implementation code: https://github.com/zengbohan0217/FADM. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>FNeVR:</strong>Face reenactment still faces two major challenges: maintaining identity features and generating realistic images, mainly due to complex motion deformations and facial detail modeling. To solve these two problems, FNeVR designed a 3D Face Volume Rendering (FVR) module to enhance facial details in image rendering. The carefully designed architecture extracts 3D information, then introduces an orthogonal adaptive ray sampling module for efficient rendering, and designs a lightweight pose editor, allowing FNeVR to edit facial poses in a simple and effective way. We use the official implementation code: https://github.com/zengbohan0217/FNeVR. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>TPSMM:</strong>TPSMM mainly solves the problem of large angles, i.e., cases where there is a significant gap in pose between the source and driving images. It proposes a thin-plate spline motion estimation method to generate more flexible optical flow, thus warping the feature map of the source image into the feature domain of the driving image. To more realistically restore missing areas, multi-resolution occlusion masks are used for more effective feature fusion. Additional auxiliary loss functions are designed to ensure clear division of labor among network modules and encourage the network to generate high-quality images. The method has been successfully applied to animating various objects such as faces, bodies, and cartoons, showing great potential in handling unseen operations and demonstrating strong generalization capabilities. We use the official implementation code: https://github.com/yoyo-nb/thin-plate-spline-motion-model. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>LIA:</strong>The common method for face reenactment is to use structural representations extracted from driving videos (such as keypoints or regions), which is crucial for transferring motion from driving videos to static images. However, these methods may fail when there is a significant difference in appearance between the source image and the driving video. Additionally, extracting structural information requires extra modules, increasing the complexity of the animation model. LIA proposes using a self-supervised autoencoder to animate images without structural representations, mainly by simplifying the image animation process through linear navigation in latent space. The motion in the generated video is constructed by linear displacement in the latent space, learned by learning a set of orthogonal motion directions and using their linear combinations to represent any displacement in the latent space. LIA's idea is excellent, treating the motion between the source and driving images as a transformation in high-dimensional space, and performing arbitrary combinations in this high-dimensional space. We use the official implementation code: https://github.com/wyhsirius/LIA. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>DaGAN:</strong>This work proposes a depth-aware generative adversarial network (GAN) that learns a depth estimation network in a self-supervised manner by training on facial videos without requiring additional 3D data as input. Thus, it can generate reliable facial depth maps for both source and driving images, producing higher quality talking head videos by capturing accurate facial 3D structures. The focus of this method is on combining 3D geometric information with deep learning to improve the quality and accuracy of generated talking head videos, especially when dealing with complex backgrounds and noise information. We use the official implementation code: https://github.com/harlanhong/cvpr2022-dagan. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>LivePortrait:</strong>LivePortrait is the first performance-considerate face reenactment model, with a significant speed increase in generation compared to diffusion-based methods, reaching 12.8ms, suitable for industrial needs. The paper expands the training data to 69 million high-quality frames, adopts a hybrid image-video training strategy, upgrades the network architecture, designs better motion transformations and optimization objectives, and proposes compact implicit keypoints to effectively represent mixed shapes. It also designs splicing and two redirection modules to enhance controllability. We use the official implementation code: https://github.com/KwaiVGI/LivePortrait. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>MCNet:</strong>MCNet, by the same author as DaGAN, is an improvement based on DaGAN. Intense and complex actions in the driving video cause blurring in the generated results because the static source image cannot provide sufficient appearance information for occluded areas or subtle facial expression changes, resulting in severe artifacts and significantly reducing generation quality. To solve this problem, a network module was designed to learn a unified spatial facial memory bank from all training samples, which can provide a rich facial structure and appearance priors to compensate for the generation of distorted source facial features. An effective query mechanism based on implicit identity representation is proposed, learned from discrete keypoints of source images. It greatly facilitates retrieving more relevant information from the repository for compensation. Testing shows that the more frontal the input single image of the face, the better the effect. Within a certain range of facial rotation angles, the synthesis of facial expressions and movements is very good. The larger the facial rotation angle, the greater the deformation. At the same time, the closer the face in the driving video is to the face in the target image (e.g., eye size, mouth size), the better. If the driving video has small eyes and the target image has large eyes, the synthesized target image may not close the eyes completely. We use the official implementation code: https://github.com/harlanhong/iccv2023-mcnet. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>HyperReenact:</strong>Existing face reenactment methods are prone to visual artifacts, especially when there are significant head pose changes, or expensive fine-tuning is required to maintain source identity features. This paper leverages the realistic generative capabilities and decoupled attributes of the pre-trained StyleGAN2 generator. It first converts real images to their latent space and then uses a hypernetwork for source identity feature refinement and facial pose redirection, thus eliminating artifacts without external editing methods. The method operates under a one-time setup, allowing cross-subject reenactment without fine-tuning for specific subjects. But for unseen data, sometimes the decoupling effect is not very effective, and the generated video noise is relatively high. We use the official implementation code: https://github.com/stelabou/hyperreenact. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>FRST:</strong> FRST is a current SOTA method that achieves cross-subject reenactment by transferring head movements and facial expressions from the driving video to the appearance of the source image. First, a Transformer-based encoder computes the latent representation of the source image. Then, using a Transformer-based decoder combined with keypoints and facial expression vectors from the driving frames, it predicts the output color of query pixels. It learns the latent representation of the source character in a self-supervised manner, separating its appearance, head pose, and facial expressions, suitable for cross-subject reenactment. It can be naturally extended to multiple source images, adapting to the facial dynamics of specific characters. We use the official implementation code: https://github.com/andrerochow/fsrt. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>AniPortrait:</strong>AniPortrait is an audio- and reference-portrait-based animation generation framework that processes in two stages (Audio2Lmk and Lmk2Video), extracting 3D intermediate representations from audio and projecting them into a series of 2D facial keypoints. Then, using a robust diffusion model combined with a motion module, it transforms the keypoint sequence into realistic and temporally consistent portrait animations. Experimental results show that AniPortrait has advantages in facial naturalness, pose diversity, and visual quality, thus providing an enhanced perceptual experience. Additionally, it shows considerable potential in flexibility and controllability, effectively applicable in fields like facial action editing or face reenactment. However, it relies on high-quality 3D data and has GPU compatibility issues. We use the official implementation code: https://github.com/Zejun-Yang/AniPortrait. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>MRFA:</strong>Existing methods mostly use a prior-based motion model (e.g., the local affine motion model or the local thin-plate-spline motion model), which can capture broad facial movements, but often produce artifacts in small movements in local areas (such as lips and eyes) because these methods have limited modeling capabilities for more refined facial movements. The authors designed a new unsupervised face animation method that learns both coarse and fine movements. First, it constructs a structure-related volume based on keypoint features of the source and driving images, then generates tiny facial movements from low to high resolution, and finally combines the learned motion refinement with coarse movements to generate a new image. We use the official implementation code: https://github.com/JialeTao/MRFA. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>DPE:</strong>Head poses and facial expressions are always intertwined and transferred simultaneously in facial actions, which limits the direct application of this method in video portrait editing, as video portrait editing may only require modifying expressions while keeping the pose unchanged. One challenge in decoupling poses and expressions is the lack of paired data, such as data with the same pose but different expressions. Although some methods attempt explicit decoupling using 3D Morphable Models (3DMMs), due to the limited number of Blendshapes, 3DMMs are not accurate enough in capturing facial details, which affects the quality of action transfer. This paper introduces a novel self-supervised decoupling framework that can decouple poses and expressions without the need for 3DMMs and paired data. The framework consists of a facial encoding module, a pose generator, and an expression generator. The facial encoding module projects the face into a latent space, where the pose actions and expression actions can be decoupled, and pose or expression transfers can be conveniently performed in the latent space through addition. The two generators then render the modified latent codes into images. Furthermore, to ensure decoupling, this paper proposes a bidirectional cycle training strategy with carefully designed constraints. We use the official implementation code: https://github.com/Carlyx/DPE. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>OneShot:</strong>This paper proposes a neural talking head video synthesis model for video conferencing. The model learns to synthesize talking head videos using a source image containing the target person's appearance and a driving video that determines the output motion. The paper encodes motion based on a novel keypoint representation, where specific identity and motion-related information are decomposed in an unsupervised manner. Extensive experiments demonstrate that the model outperforms competing methods on benchmark datasets. The compact keypoint representation enables video conferencing systems to achieve commercial H.264 standard visual quality while only using a tenth of the bandwidth, as the entire video does not need to be transmitted over the network; only the first frame image and subsequent 3D keypoint representations for each frame are required. Additionally, the keypoint representation allows users to rotate the head during the synthesis process, which is useful for simulating a face-to-face video conferencing experience. We use the official implementation code: https://github.com/zhanglonghao1992/One-Shot_Free-View_Neural_Talking_Head_Synthesis. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>MakeItTalk:</strong>MakeItTalk brings faces or cartoon characters to life. It first extracts the facial landmarks from the image using existing methods, then extracts the content information and speaker information from the input speech, and uses these two features to obtain the corresponding changes in facial landmarks for the lips, head, and facial expressions. Finally, it combines these with the original image to produce a video. The main advantage is the separation of content information and speaker information in the speech, which allows for better lip-sync and head movements and facial expressions that are more consistent with the speaker themselves. We use the official implementation code: https://github.com/yzhou359/MakeItTalk. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>Portrait4D:</strong>Existing one-shot 4D head synthesis methods typically rely on monocular videos and learn with the aid of 3DMM reconstruction, which is highly challenging and limits their ability to perform reasonable 4D head synthesis. This paper proposes a method to learn one-shot 4D head synthesis through large-scale synthetic data. The key is to first learn a part-based 4D generative model from monocular images through adversarial learning to synthesize multi-view images with different identities and full motion as training data; then, using a transformer-based animatable tri-plane reconstructor, the method learns 4D head reconstruction using synthetic data. A novel learning strategy is implemented to enhance the generalization ability to real images by disentangling the learning processes of 3D reconstruction and reenactment. However, the generated video characters have unclear tooth jitter, and for blinking actions, there may be erroneous enlargement of the face. We use the official implementation code: https://yudeng.github.io/Portrait4D. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>SadTalker:</strong>Current methods for generating speaking persons have problems such as unnatural head movements, distorted facial expressions, and identity modifications. These issues mainly stem from the learning of coupled 2D motion fields, while the explicit use of 3D information also leads to stiff expressions and incoherent videos. Therefore, this paper proposes SadTalker, which generates 3D motion coefficients (head pose, expressions) from audio using a new 3D facial renderer to generate head movements. To learn authentic motion coefficients, the researchers model the connection between audio and different types of motion coefficients separately, learning accurate facial expressions from audio through distilled coefficients and 3D rendered faces; a conditional VAE, PoseVAE, is designed to synthesize different styles of head movements. Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoints space of facial rendering and combined to synthesize the final video. Experiments show that this method achieves state-of-the-art performance in motion synchronization and video quality. However, the issue of unclear teeth remains unresolved. We use the official implementation code: https://github.com/winfredy/sadtalker. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>EchoMimic:</strong>Previous methods only used audio or facial keypoints to drive images into videos, and while they could produce satisfactory results, certain issues persisted. For example, methods driven solely by audio might sometimes be unstable due to the relatively weak audio signal, while methods driven solely by facial keypoints, although more stable in driving, could lead to unnatural results due to over-control of facial keypoint information. EchoMimic uses both audio and facial landmarks for training. By implementing a novel training strategy, EchoMimic can generate portrait videos not only from audio and facial landmarks separately but also from a combination of audio and selected facial landmarks. We use the official implementation code: https://github.com/BadToBest/EchoMimic. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>Audio2Head:</strong>Previously, neural network-based algorithms enabled the lip movements in facial videos to match speech relatively well, but issues remained, such as unnatural head movements, incoherent videos, and numerous artifacts. To address these issues, Audio2Head models head movements separately, proposing a spatially encoded neural network for the natural prediction of head movement sequences. To model the motion of the entire image related to speech, Audio2Head uses a speech precursor to generate a dense motion field for the entire image, which then guides image synthesis. This technology can quickly respond to audio input, achieving real-time head movement generation, and only requires a single image of a person to adapt to various audio inputs. We use the official implementation code: https://github.com/wangsuzhen/Audio2Head. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>VideoReTalking:</strong>The lips generated by current general-purpose talking-head models are blurry, and emotional editing is not supported. Inspired by Wav2Lip, the authors propose a new model that drives lip synthesis with audio. Wav2Lip takes two sets of five frames as input: one set is the ground truth with the lower half masked, and the other is five random frames from the original video (different from the ground truth); this masked-input construction is sketched after this list. The masked frames are the ones for which lips must be generated from audio, while the random frames provide pose references for lip generation. The authors note that the model is highly sensitive to the pose reference frames, because the lip information they contain leaks into the model as prior knowledge; if random frames are used directly as pose references, the generated result is often out of sync with the audio. Therefore, the authors neutralize the facial expressions of the random frames before feeding them to the model as pose references. Following this idea, frames modified to show happiness or sadness can also be used as pose references, naturally producing talking videos with the corresponding emotions. VideoReTalking has high practical value, for example for re-synchronizing the lips of an existing video to new audio. We use the official implementation code: https://github.com/vinthony/video-retalking. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>Hallo:</strong>This paper is based on an SD scheme that can generate visually appealing, temporally coherent, and complex facial animation audio-driven videos. It is an end-to-end solution, where the authors introduce a hierarchical audio-driven visual synthesis module that uses a cross-attention mechanism to establish correspondences between audio and visual features (lips, expressions, and poses) to improve the alignment between audio input and visual output, including lip movements, expressions, and postural actions. The proposed hierarchical audio-driven visual synthesis provides adaptive control over the diversity of expressions and postures, achieving more effective personalized customization to accommodate different identities. The training also uses some tracks to enhance the effects. We use the official implementation code: https://github.com/fudan-generative-vision/hallo. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>Real3D-Portrait:</strong>In recent years, with the continuous advancement of single-image-driven virtual human technology, lip accuracy and image quality have been steadily improving; Real3D-Portrait for the first time achieves single-image-driven virtual human video synthesis supporting large-scale pose movements with advanced single-image 3D reconstruction technology, further unlocking the motion freedom of single-image-driven virtual humans. Its feature of reconstructing 3D avatars also gives it the potential to be applied in spatial visual products. It is foreseeable that, with the continuous iteration and popularization of technology, virtual humans will appear in various application scenarios such as intelligent assistants, virtual reality, and video conferencing. With Real3D-Portrait, single-image-driven virtual human algorithms are expected to make speakers 'move' more realistically in 2D/3D scenes. However, Real3D-Portrait is not without flaws at this stage; possibly due to a small data volume and sample quality issues, the model sometimes struggles to produce clear and accurate results for areas obscured in the input image, such as teeth and side faces. We use the official implementation code: https://github.com/yerfor/Real3DPortrait. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>Wav2Lip:</strong>Wav2Lip is a classic talking person model, its greatest contribution being the provision of a lip-sync discriminator, which offers new stringent evaluation benchmarks and metrics to accurately measure lip-sync in unrestricted videos. Many products are currently using wav2Lip as their underlying algorithm, enhancing facial and dental clarity by adding super-resolution models (such as GFPGAN). We use the official implementation code: https://github.com/Rudrabha/Wav2Lip. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>Controlnet:</strong>Controlnet is a popular explicit conditional control method currently in widespread use. ControlNet is a neural network architecture designed to add spatial conditional control to large, pre-trained text-to-image diffusion models. It locks down large diffusion models that are ready for production and reuses these models' deeply and robustly encoded layers, trained on billions of images, as a powerful backbone for learning diverse conditional controls. We use the official implementation code: https://github.com/lllyasviel/controlnet. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>DeltaEdit:</strong>Previous editing methods had one flaw: for different text prompts, they had to undergo different optimization processes, which was inflexible during training or inference and did not generalize well to any other unseen text. The authors believe that the key to mitigating this problem is to accurately establish the relationship between the text feature space and the StyleGAN latent space within one model. Therefore, a method for image editing in the StyleGAN space is proposed, conditioned on the corresponding embeddings in the CLIP image space, without any text supervision. This method involves randomly selecting two images from a training image dataset, extracting their CLIP image embeddings and the pre-trained StyleGAN model to extract latent codes in the S space. Then, the extracted latent codes are used to predict the manipulation direction. The paper also proposes a solution that directly uses image embeddings to construct pseudo-text conditions based on the aligned features of the joint CLIP image-text space. We use the official implementation code: https://github.com/yueming6568/deltaedit. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>Face2Diffusion:</strong>Previous diffusion-based editing methods aimed to insert specific facial features from images into pre-trained text-to-image diffusion models. However, past methods still faced challenges in maintaining identity similarity and editability, as they were overfitted to training samples. This paper proposes a Face2Diffusion (F2D) method for high editability facial personalization. The core idea behind F2D is to remove identity-irrelevant information from the training process to prevent overfitting issues and enhance the editability of encoded faces. F2D includes three novel components: first, a multi-scale identity encoder provides well-separated identity features while maintaining the benefits of multi-scale information, thus improving camera pose diversity; second, expression guidance separates facial expressions from identity, enhancing controllability of facial expressions; third, category-guided denoising regularization encourages the model to learn how to denoise faces, thereby improving the text alignment of the background. We use the official implementation code: https://github.com/mapooon/face2diffusion. </p></li>
<li><p class="content has-text-justified" style='width:100%'><strong>PhotoMaker:</strong>Recently, text-to-image generation technology has made significant progress in synthesizing realistic human photos that can be conditionally generated based on given text prompts. However, existing personalized generation methods cannot simultaneously meet the requirements of high efficiency, high identity (ID) fidelity, and flexible text controllability. This paper is an efficient personalized text-to-image generation method, which primarily encodes any number of input ID images into a stacked ID embedding to preserve ID information. This embedding, as a unified ID representation, not only encapsulates the features of the same input ID comprehensively but also accommodates the features of different IDs for subsequent integration. This paves the way for more interesting and practically valuable applications. Moreover, to drive the training of PhotoMaker, we propose an ID-oriented data construction pipeline to assemble training data. Nourished by the dataset constructed through the proposed pipeline, our PhotoMaker outperforms test-time fine-tuning methods in ID preservation capability, while also providing significant speed boosts, high-quality generation results, powerful generalization capabilities, and a wide range of applications. We use the official implementation code: https://github.com/TencentARC/PhotoMaker. </p></li>
</ul>
</p>
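<p class="content has-text-justified">
To make the W+ arithmetic described for AFS above concrete, the toy sketch below spells out the operation. The encoder, decomposition, and generator callables are hypothetical placeholders for illustration, not functions from the AFS code base.
</p>
<pre><code class="language-python">
# Toy illustration of the latent arithmetic described for AFS above.
# encode_w_plus, identity_part, style_part and generator are hypothetical
# placeholders, not the AFS repository's actual API.

def afs_style_swap(source_img, target_img,
                   encode_w_plus, identity_part, style_part, generator):
    w_source = encode_w_plus(source_img)       # latent code in W+ (e.g. 18 x 512)
    w_target = encode_w_plus(target_img)
    # Face swapping as the sum of the source "identity" code
    # and the target "style" code in the decomposed W+ space.
    w_swapped = identity_part(w_source) + style_part(w_target)
    return generator(w_swapped)                # decode the swapped face image
</code></pre>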
</div>
</div>
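<p class="content has-text-justified">
The masked lower-half input described for Wav2Lip and VideoReTalking above can be illustrated as follows. The five-frame, six-channel tensor layout is an assumption for illustration, not the preprocessing code of either repository.
</p>
<pre><code class="language-python">
# Illustration of the masked lower-half input described for Wav2Lip /
# VideoReTalking above. Tensor layout assumed for illustration only.
import torch

def build_masked_input(gt_frames, ref_frames):
    """gt_frames, ref_frames: (5, 3, H, W) aligned face crops in [0, 1]."""
    masked = gt_frames.clone()
    h = masked.shape[-2]
    masked[..., h // 2:, :] = 0.0     # zero out the lower half (mouth region)
    # The generator sees the masked ground-truth frames plus unmasked
    # reference frames that only provide pose information.
    return torch.cat([masked, ref_frames], dim=1)   # (5, 6, H, W)
</code></pre>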
</br>
</div>
</section>
</section>
<section class="section">
<section class="columns is-vcentered interpolation-panel" width=100%>
<div class="container is-max-desktop">
<center><h2 class="title is-3">References</h2></center><br>
<!-- Visual Effects. -->
<div class="column">
<div class="content">
<p class="content has-text-justified" style='width:100%'>
<ul>
<p>
[1] Yuval Nirkin, Yosi Keller, and Tal Hassner. Fsgan: Subject agnostic face swapping and reenactment. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7184–7193, 2019.<br>
[2] Li L, Bao J, Yang H, et al. Faceshifter: Towards high fidelity and occlusion aware face swapping[J]. arXiv preprint arXiv:1912.13457, 2019.<br>
[3] Renwang Chen, Xuanhong Chen, Bingbing Ni, and Yanhao Ge. Simswap: An efficient framework for high fidelity face swapping. In Proceedings of the 28th ACM international conference on multimedia, pages 2003–2011, 2020.<br>
[4] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. ICCV 2015.<br>
[5] Wang Q, Zhang L, Li B. Safa: Structure aware face animation[C]//2021 International Conference on 3D Vision (3DV). IEEE, 2021: 679-688.<br>
[6] Chao Xu, Jiangning Zhang, Yue Han, Guanzhong Tian, Xianfang Zeng, Ying Tai, Yabiao Wang, Chengjie Wang, and Yong Liu. Designing one unified framework for high-fidelity face reenactment and swapping. In European conference on computer vision, pages 54–71. Springer, 2022.<br>
[7] Zhiliang Xu, Zhibin Hong, Changxing Ding, Zhen Zhu, Junyu Han, Jingtuo Liu, and Errui Ding. Mobilefaceswap: A lightweight framework for video face swapping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2973–2981, 2022.<br>
[8] Zhian Liu, Maomao Li, Yong Zhang, Cairong Wang, Qi Zhang, Jue Wang, and Yongwei Nie. Fine-grained face swapping via regional gan inversion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8578–8587, 2023.<br>
[9] Felix Rosberg, Eren Erdal Aksoy, Fernando Alonso-Fernandez, and Cristofer Englund. Facedancer: Pose-and occlusion-aware high fidelity face swapping. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3454–3463, 2023.<br>
[10] Truong Vu, Kien Do, Khang Nguyen, Khoat Than: Face Swapping as A Simple Arithmetic Operation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023. <br>
[11] Yang B, Gu S, Zhang B, et al. Paint by example: Exemplar-based image editing with diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 18381-18391.<br>
[12] Zhao W, Rao Y, Shi W, et al. Diffswap: High-fidelity and controllable face swapping via 3d-aware masked diffusion[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 8568-8577.<br>
[13] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in neural information processing systems, 32, 2019.<br>
[14] Wang T C, Liu M Y, Tao A, et al. Few-shot video-to-video synthesis[J]. Advances in Neural Information Processing Systems, 2019.<br>
[15] Zeng B, Liu X, Gao S, et al. Face animation with an attribute-guided diffusion model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 628-637.<br>
[16] Zeng B, Liu B, Li H, et al. FNeVR: Neural volume rendering for face animation[J]. Advances in Neural Information Processing Systems, 2022, 35: 22451-22462.<br>
[17] Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10039–10049, 2021.<br>
[18] Guo J, Zhang D, Liu X, et al. LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control[J]. arXiv preprint arXiv:2407.03168, 2024.<br>
[19] Rochow A, Schwarz M, Behnke S. FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance Head-pose and Facial Expression Features[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 7716-7726.<br>
[20] Wei H, Yang Z, Wang Z. Aniportrait: Audio-driven synthesis of photorealistic portrait animation[J]. arXiv preprint arXiv:2403.17694, 2024.<br>
[21] Tao J, Gu S, Li W, et al. Learning motion refinement for unsupervised face animation[J]. Advances in Neural Information Processing Systems, 2024, 36.<br>
[22] Pang Y, Zhang Y, Quan W, et al. Dpe: Disentanglement of pose and expression for general video portrait editing[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 427-436.<br>
[23] Zhou Y, Han X, Shechtman E, et al. MakeItTalk: speaker-aware talking-head animation[J]. ACM Transactions on Graphics (TOG), 2020, 39(6): 1-15.<br>
[24] Deng Y, Wang D, Ren X, et al. Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7119–7130, 2024.<br>
[25] Zhang W, Cun X, Wang X, et al. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8652–8661, 2023.<br>
[26] Chen Z, Cao J, Chen Z, et al. EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions. arXiv preprint arXiv:2407.08136, 2024.<br>
[27] Wang S, Li L, Ding Y, et al. Audio2head: Audio-driven one-shot talking-head generation with natural head motion. arXiv preprint arXiv:2107.09293, 2021.<br>
[28] Cheng K, Cun X, Zhang Y, et al. Videoretalking: Audio-based lip synchronization for talking head video editing in the wild. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.<br>
[29] Xu M, Li H, Su Q, et al. Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation. arXiv preprint arXiv:2406.08801, 2024.<br>
[30] Ye Z, Zhong T, Ren Y, et al. Real3d-portrait: One-shot realistic 3d talking portrait synthesis. arXiv preprint arXiv:2401.08503, 2024.<br>
[31] Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. Depth-aware generative adversarial network for talking head video generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3397–3406, 2022. <br>
[32] Zhang L, Rao A, Agrawala M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.<br>
[33] Lyu Y, Lin T, Li F, et al. Deltaedit: Exploring text-free training for text-driven image manipulation. arXiv preprint arXiv:2303.06285, 2023.<br>
[34] Shiohara K, Yamasaki T. Face2Diffusion for Fast and Editable Face Personalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6850–6859, 2024.<br>
[35] Li Z, Cao M, Wang X, et al. Photomaker: Customizing realistic human photos via stacked id embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8640–8650, 2024.<br>
[36] KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM international conference on multimedia, pages 484–492, 2020.<br>
[37] Fa-Ting Hong and Dan Xu. Implicit identity representation conditioned memory compensation network for talking head video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23062–23072, 2023.<br>
[38] Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, Ioannis Patras, and Georgios Tzimiropoulos. Hyperreenact: one-shot reenactment via jointly learning to refine and retarget faces. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7149–7159, 2023.<br>
[39] Jian Zhao and Hui Zhang. Thin-plate spline motion model for image animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3657–3666, 2022.<br>
[40] Yaohui Wang, Di Yang, Francois Bremond, and Antitza Dantcheva. Latent image animator: Learning to animate images via latent space navigation. In Proceedings of the International Conference on Learning Representations, 2022.<br>
[41] D. Deb, X. Liu, and A. K. Jain, “Unified detection of digital and physical face attacks,” in CVPR, 2021.<br>
[42] Y. Liu, J. Stehouwer, A. Jourabloo, and X. Liu, “Deep tree learning for zero-shot face anti-spoofing,” in CVPR, 2019.<br>
[43] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.<br>
[44] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in ICLR, 2018.<br>
[45] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: a simple and accurate method to fool deep neural networks,” in CVPR, 2016.<br>
[46] D. Deb, J. Zhang, and A. K. Jain, “Advfaces: Adversarial face synthesis,” in IJCB, 2020.<br>
[47] A. Dabouei, S. Soleymani, J. Dawson, and N. Nasrabadi, “Fast geometrically-perturbed adversarial faces,” in WACV, 2019, pp. 1979–1988.<br>
[48] H. Qiu, C. Xiao, L. Yang, X. Yan, H. Lee, and B. Li, “Semanticadv: Generating adversarial examples via attribute-conditional image editing,” in ECCV, 2020.<br>
[49] “Faceswap,” 2020, accessed: 2020-05-10. [Online]. Available: https://github.com/MarekKowalski/FaceSwap<br>
[50] P. Korshunov and S. Marcel, “Deepfakes: A new threat to face recognition? assessment and detection,” arXiv preprint arXiv:1812.08685, 2018.<br>
[51] J. Thies et al., “Face2face: Real-time face capture and reenactment of rgb videos,” in CVPR, 2016.<br>
[52] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in CVPR, 2018.<br>
[53] M. Liu, Y. Ding, M. Xia, X. Liu, E. Ding, W. Zuo, and S. Wen, “Stgan: A unified selective transfer network for arbitrary image attribute editing,” in CVPR, 2019.<br>
[54] T. Karras et al., “Analyzing and improving the image quality of stylegan,” in CVPR, 2020, pp. 8110–8119.<br>
[55] H. Fang, A. Liu, H. Yuan, J. Zheng, D. Zeng, Y. Liu, J. Deng, S. Escalera, X. Liu, J. Wan, and Z. Lei, “Unified physical-digital face attack detection,” 2024. [Online]. Available: https://arxiv.org/abs/2401.17699.<br>
[56] A. Liu et al., “Casia-surf cefa: A benchmark for multi-modal cross-ethnicity face anti-spoofing,” in WACV, 2021, pp. 1179–1187.<br>
[57] R. Duan et al., “Advdrop: Adversarial attack to dnns by dropping information,” 2021.<br>
[58] J. Rony et al., “Augmented lagrangian adversarial attacks,” in ICCV, 2021, pp. 7738–7747.<br>
[59] Y. Wang et al., “Demiguise attack: Crafting invisible semantic adversarial perturbations with perceptual similarity,” in IJCAI, 2021.<br>
[60] J. Zou et al., “Making adversarial examples more transferable and indistinguishable,” in AAAI, vol. 36, 2022, pp. 3662–3670.<br>
[61] C. W. Yan et al., “Ila-da: Improving transferability of intermediate level attack with data augmentation,” in ICLR, 2022.<br>
[62] C. Luo et al., “Frequency-driven imperceptible adversarial attack on semantic similarity,” in CVPR, 2022, pp. 15315–15324.<br>
[63] F.-T. Hong et al., “Depth-aware generative adversarial network for talking head video generation,” in CVPR, 2022, pp. 3397–3406.<br>
[64] M. Pintor et al., “Fast minimum-norm adversarial attacks through adaptive norm constraints,” in NeurIPS, 2021.<br>
[65] C. Xie et al., “Improving transferability of adversarial examples with input diversity,” in CVPR, 2019.<br>
[66] Y. Dong et al., “Boosting adversarial attacks with momentum,” in CVPR, 2018.<br>
[67] Y. Dong, T. Pang, H. Su, and J. Zhu, “Evading defenses to transferable adversarial examples by translation-invariant attacks,” in CVPR, June 2019.<br>
[68] X. Yang et al., “Robfr: Benchmarking adversarial robustness on face recognition,” in CVPR, 2021.<br>
[69] J. Byun et al., “Improving the transferability of targeted adversarial examples through object-based diverse input,” in CVPR, 2022.<br>
[70] Z. Qin, Y. Fan, Y. Liu, L. Shen, Y. Zhang, J. Wang, and B. Wu, “Boosting the transferability of adversarial attacks with reverse adversarial perturbation,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.<br>
[71] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, “Faceforensics++: Learning to detect manipulated facial images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1–11.<br>
</p>
</ul>
</p>
</div>
</div>
<br>
</div>
</div>
</section>
</body>
</html>