ProbTalk3D

Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE

(Accepted at ACM SIGGRAPH MIG 2024)

Utrecht University, The Netherlands

Abstract

Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. This is mainly due to the lack of emotionally rich facial animation data and algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of models are deterministic, meaning that given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial for generating diverse and emotionally rich facial animations. In this paper, we propose ProbTalk3D, a non-deterministic neural network approach for emotion controllable speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and the emotionally rich facial animation dataset 3DMEAD. We provide an extensive comparative analysis of our model against recent 3D facial animation synthesis approaches, evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground truth data for subjective evaluation. To our knowledge, this is the first non-deterministic 3D facial animation synthesis method that incorporates a rich emotion dataset and emotion control with emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic, and non-deterministic models. We recommend watching the supplementary video for quality judgement.

Video


Methodology

ProbTalk3D employs a two-stage training approach: first, it learns a motion prior with a VQ-VAE-based autoencoder structure; second, it trains a transformer-based neural network for facial motion synthesis conditioned on speech and style, leveraging the motion prior acquired in the first stage.
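As a rough outline of this schedule, the PyTorch-style sketch below trains the motion prior first and then the audio-conditioned mapping with the prior frozen. The module interfaces (motion_ae, audio_mapper and their methods) and the losses are hypothetical placeholders for illustration, not the released code.

import torch
import torch.nn.functional as F

def train_two_stage(motion_loader, paired_loader, motion_ae, audio_mapper, epochs=100):
    # Stage 1: learn the VQ-VAE motion prior from motion data alone
    opt1 = torch.optim.AdamW(motion_ae.parameters(), lr=1e-4)
    for _ in range(epochs):
        for motion in motion_loader:
            recon, vq_loss = motion_ae(motion)            # reconstruction through the codebook
            loss = F.mse_loss(recon, motion) + vq_loss
            opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: freeze the prior, learn the audio- and style-conditioned mapping into its latent space
    for p in motion_ae.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.AdamW(audio_mapper.parameters(), lr=1e-4)
    for _ in range(epochs):
        for audio, motion, style in paired_loader:
            z = audio_mapper(audio, style)                # predicted motion latents
            pred = motion_ae.decode(motion_ae.quantize(z))  # decode via the frozen prior
            loss = F.mse_loss(pred, motion)
            opt2.zero_grad(); loss.backward(); opt2.step()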

Stage-1 Training

In the first stage, the transformer-based Motion Autoencoder is trained solely on motion data. The encoder maps the motion to a latent vector. Subsequently, a discrete codebook is learned through vector quantization to model the data distribution. The decoder is trained to reconstruct the input from the quantized vector. This process effectively learns a motion prior capable of encoding motion input into a meaningful latent representation.
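The core of this stage is the vector quantization step that snaps encoder outputs onto a learned codebook. Below is a minimal PyTorch sketch of such a quantizer with a straight-through gradient and a commitment loss; the class name, codebook size, and dimensions are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=256, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment loss weight

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) latents from the motion encoder
        flat = z_e.reshape(-1, z_e.shape[-1])                       # (B*T, D)
        # squared Euclidean distance to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))               # (B*T, K)
        idx = dist.argmin(dim=1)                                    # nearest code per frame
        z_q = self.codebook(idx).view_as(z_e)                       # quantized latents
        # codebook and commitment losses; straight-through estimator for the encoder
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx.view(z_e.shape[:-1]), vq_loss

During stage-1 training, a reconstruction loss between the decoder output and the input motion is added to vq_loss, so the encoder, decoder, and codebook are optimized jointly.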

Stage-2 Training

In the second stage, we incorporate an audio encoder and train the model on paired audio-motion data. The Audio Encoder is transformer-based and includes a pre-trained HuBERT model. During this stage, the parameters of the Motion Autoencoder are frozen. Given an audio input, HuBERT extracts audio features, which are then mapped to the motion latent space. This mapping is conditioned on the Style Vector, which allows the model to learn different speaking styles, emotional expressions, and emotion intensities.
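A sketch of how such a mapping could be assembled is shown below, assuming PyTorch and the Hugging Face transformers implementation of HuBERT. The module name, layer sizes, the shared style-embedding table, and the way the style vector is injected are assumptions for illustration; only the overall idea (HuBERT features mapped into the frozen motion latent space, conditioned on identity, emotion, and intensity) follows the description above.

import torch
import torch.nn as nn
from transformers import HubertModel  # pre-trained speech encoder

class AudioToMotionLatent(nn.Module):
    def __init__(self, latent_dim=128, num_style_tokens=64):
        super().__init__()
        self.hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.proj = nn.Linear(self.hubert.config.hidden_size, latent_dim)
        # one shared table for subject, emotion, and intensity tokens (indices pre-offset)
        self.style_embed = nn.Embedding(num_style_tokens, latent_dim)
        layer = nn.TransformerDecoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)

    def forward(self, waveform, style_ids):
        # waveform: (batch, samples) raw 16 kHz audio; style_ids: (batch, 3) token indices
        audio_feat = self.proj(self.hubert(waveform).last_hidden_state)  # (B, T, D)
        style = self.style_embed(style_ids).sum(dim=1, keepdim=True)     # (B, 1, D) style vector
        # condition the audio features on the style vector and map to motion latents
        return self.decoder(tgt=audio_feat, memory=audio_feat + style)   # (B, T, D)

The predicted latents are then quantized with the stage-1 codebook and decoded by the frozen motion decoder, so only the audio-to-latent mapping is updated in this stage.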

Inference

During inference, the Motion Encoder is excluded, and the remaining components are used to generate facial animation from audio input. To introduce non-determinism into the model's outputs, we employ stochastic sampling in the quantization process. Specifically, we transform the distances between input vectors and codebook embeddings into probabilities, enabling us to sample codebook embeddings according to these probabilities.
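A minimal sketch of this stochastic quantization is given below, assuming PyTorch; the softmax over negative distances and the temperature parameter are illustrative assumptions.

import torch

def sample_codes(z_e, codebook, temperature=1.0):
    # z_e: (batch, time, dim) latents predicted from audio
    # codebook: (num_codes, dim) embeddings learned in stage 1
    flat = z_e.reshape(-1, z_e.shape[-1])                       # (B*T, D)
    dist = torch.cdist(flat, codebook)                          # distance to every code, (B*T, K)
    probs = torch.softmax(-dist / temperature, dim=-1)          # closer codes get higher probability
    idx = torch.multinomial(probs, num_samples=1).squeeze(-1)   # draw a code instead of taking argmin
    z_q = codebook[idx].view_as(z_e)                            # quantized latents for the decoder
    return z_q, idx.view(z_e.shape[:-1])

Because a different code sequence can be drawn on every run, the same audio and style input can yield multiple plausible facial animations.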

BibTeX

      
@inproceedings{Probtalk3D_Wu_MIG24,
  author    = {Wu, Sichun and Haque, Kazi Injamamul and Yumak, Zerrin},
  title     = {ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE},
  booktitle = {The 17th ACM SIGGRAPH Conference on Motion, Interaction, and Games (MIG '24), November 21--23, 2024, Arlington, VA, USA},
  year      = {2024},
  location  = {Arlington, VA, USA},
  numpages  = {12},
  url       = {https://doi.org/10.1145/3677388.3696320},
  doi       = {10.1145/3677388.3696320},
  publisher = {ACM},
  address   = {New York, NY, USA}
}