Deploying Speech-Driven 3D Facial Animation in Unreal Engine for Production-Ready Digital Humans

Accepted to the ACM SIGGRAPH 2026 Posters Program

Utrecht University, The Netherlands

Abstract

Speech-driven 3D facial animation research has shown promising results, but most methods rely on representations that are not compatible with production pipelines. In this work, we present a deployable system that bridges this gap by enabling speech-driven 3D facial animation directly in Unreal Engine (UE) using ARKit-compatible representations. We construct the 3DMEAD-ARKit dataset by converting the MEAD corpus into blendshape sequences using MediaPipe, and retrain FaceDiffuser and ProbTalk3D-X to generate stochastic and emotion-controllable animations. We further develop a modular UE plugin with a Python backend that supports model selection and parameter control. In a perceptual user study, we compare our results against two existing commercial tools: Epic Games’ MetaHuman speech-driven animator and NVIDIA Audio2Face. The results highlight the importance of comparing academic and commercial pipelines. We recommend watching the supplementary video, and we plan to give live demonstrations at the conference.

Video

The supplementary video contains a teaser and a practical demonstration of the AutoFaceARKit plugin for Unreal Engine 5 generating facial animations with speech-driven models.


Dataset Processing Pipeline

We built the standardized, FACS-based 3DMEAD-ARKit dataset by extracting frame-wise blendshape sequences from the MEAD video corpus using MediaPipe. This dataset natively enables both emotion and intensity control during speech-driven generation.

Step 1: Original MEAD video frame

Step 2: MediaPipe facial landmark tracking

Step 3: Resulting ARKit ground-truth animation

Figure: Dataset processing pipeline. From left to right: Original MEAD video frame, MediaPipe facial landmark tracking, and the resulting ground-truth animation rendered onto an ARKit-compatible digital human.
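
The frame-wise assembly step can be illustrated with a minimal sketch. It assumes the tracker emits one name-to-score dictionary per video frame; the helper `frames_to_sequence`, the four-name subset of ARKit blendshapes, and the clamping and zero-fill rules are illustrative assumptions, not the pipeline used to build 3DMEAD-ARKit.

```python
from typing import Dict, List, Sequence

# Illustrative subset of ARKit blendshape names (assumption: a real
# pipeline would list the full set in a fixed order).
BLENDSHAPE_NAMES: Sequence[str] = (
    "jawOpen",
    "mouthSmileLeft",
    "mouthSmileRight",
    "eyeBlinkLeft",
)


def frames_to_sequence(
    frames: List[Dict[str, float]],
    names: Sequence[str] = BLENDSHAPE_NAMES,
) -> List[List[float]]:
    """Turn per-frame {name: score} dicts into fixed-order rows.

    Hypothetical helper: missing coefficients default to 0.0 and
    every score is clamped to the range [0, 1].
    """
    sequence = []
    for frame in frames:
        row = [min(1.0, max(0.0, float(frame.get(name, 0.0)))) for name in names]
        sequence.append(row)
    return sequence


# Two made-up frames standing in for tracker output.
tracked = [
    {"jawOpen": 0.25, "mouthSmileLeft": 0.5},
    {"jawOpen": 1.2, "eyeBlinkLeft": -0.1},
]
sequence = frames_to_sequence(tracked)
```

Keeping a fixed column order means every frame has the same layout, so a sequence can be stored or fed to a model as a frames-by-coefficients array.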


System Architecture Overview

Our modular Unreal Engine 5 plugin features a local Python backend. The frontend interface captures user audio and style parameters; the local server runs model inference to generate CSV data; and the engine converts this data into Level Sequences via LiveLinkFaceImporter (Epic Games) for real-time character playback.

Figure: System overview. (1) In the frontend interface, the user selects the model, the input audio (existing or live recording), the conditioning style (i.e., speaking style, emotion, intensity), and a digital human character. (2) The data is passed to the backend process, which generates the animation data. (3) After inference, the data is sent to the engine, which creates the animation, applies it to the selected character, and saves it in the animation library. Saved animations can also be retargeted to other compatible characters.
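
The hand-off from the backend to the engine goes through CSV data. The sketch below shows one way per-frame blendshape coefficients could be serialized, assuming a leading `Timecode` column in `HH:MM:SS:FF` form and a constant frame rate. The function `frames_to_csv`, the column layout, and the number formatting are assumptions for illustration; the exact layout expected by the importer is not specified here.

```python
import csv
import io
from typing import List, Sequence


def frames_to_csv(rows: List[List[float]], names: Sequence[str], fps: int = 30) -> str:
    """Serialize blendshape rows to CSV text with a leading timecode column.

    Hypothetical helper: the timecode is HH:MM:SS:FF at a constant
    frame rate, and each coefficient is written with four decimals.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Timecode", *names])
    for index, row in enumerate(rows):
        if len(row) != len(names):
            raise ValueError(
                f"frame {index} has {len(row)} values, expected {len(names)}"
            )
        # Split the frame index into whole seconds and a frame remainder.
        seconds, frame = divmod(index, fps)
        minutes, seconds = divmod(seconds, 60)
        hours, minutes = divmod(minutes, 60)
        timecode = f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frame:02d}"
        writer.writerow([timecode, *(f"{value:.4f}" for value in row)])
    return buffer.getvalue()


csv_text = frames_to_csv(
    [[0.25, 0.5], [1.0, 0.0]],
    ["jawOpen", "mouthSmileLeft"],
    fps=30,
)
```

Writing to an in-memory buffer keeps the sketch free of file paths; a backend could just as well write the same text to disk for the engine to import.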


Perceptual User Study Evaluation

We conducted two surveys via Prolific, each with 30+ participants and no participant overlap between the two. Participants rated Lip-Sync, Realism, and Expressiveness on 7-point Likert scales, and we analyzed the ratings with within-subject repeated-measures ANOVA (RM-ANOVA). Both surveys benchmark our models directly against commercial tools from Epic Games and NVIDIA using MetaHuman character models. Experiment 1 uses 12 audio clips from the dataset test set and includes an emotion recognition task, while Experiment 2 uses 8 neutral in-the-wild audio clips.

Figure: Questionnaire layout. Interface and survey layout presented to participants during the evaluation. This template displays the full setup used for Experiment 1, which included the emotion recognition task. For Experiment 2, the setup was identical but the final emotion classification question was omitted.
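
For readers unfamiliar with RM-ANOVA, the sketch below computes the F statistic of a one-way repeated-measures design, where every participant rates every condition. It is a textbook illustration under simplifying assumptions (one factor, complete data, no sphericity correction, no p-value) and not the analysis script used for the study; the function name `rm_anova_f` and the toy ratings are made up.

```python
from typing import List, Tuple


def rm_anova_f(ratings: List[List[float]]) -> Tuple[float, int, int]:
    """One-way repeated-measures ANOVA.

    ratings[i][j] is the score participant i gave to condition j.
    Returns (F, df_conditions, df_error).
    """
    n = len(ratings)      # number of participants
    k = len(ratings[0])   # number of conditions
    if any(len(row) != k for row in ratings):
        raise ValueError("every participant must rate every condition")

    grand = sum(sum(row) for row in ratings) / (n * k)
    cond_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    subj_means = [sum(row) / k for row in ratings]

    # Partition total variability into condition, participant, and error parts.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (k - 1) * (n - 1)
    f_value = (ss_cond / df_cond) / (ss_error / df_error)
    return f_value, df_cond, df_error


# Toy data: 3 participants rating 3 conditions.
toy_ratings = [[1, 2, 3], [2, 2, 5], [3, 5, 4]]
f_value, df_cond, df_error = rm_anova_f(toy_ratings)
```

Removing the between-participant sum of squares from the error term is what distinguishes the within-subject design from a between-subject ANOVA: stable differences in how generously each participant rates do not count as noise.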

BibTeX

@inproceedings{ABusacchi26_ueplugin,
        author = {Busacchi, Alessandro and Haque, Kazi Injamamul and Yumak, Zerrin},
        title = {Deploying Speech-Driven 3D Facial Animation in Unreal Engine for Production-Ready Digital Humans},
        booktitle = {Special Interest Group on Computer Graphics and Interactive Techniques Conference Posters (SIGGRAPH Posters '26), July 19--23, 2026, Los Angeles, CA, USA},
        year = {2026},
        location = {Los Angeles, CA, USA},
        numpages = {3},
        url = {https://doi.org/10.1145/3799825.3818695},
        doi = {10.1145/3799825.3818695},
        publisher = {ACM},
        address = {New York, NY, USA}
}