Date of Award
12-5-2024
Publication Type
Thesis
Degree Name
M.Sc.
Department
Computer Science
Keywords
3D Morphable Models (3DMM); Diffusion Models; Emotionally Expressive Avatars; Lip Sync and Facial Dynamics; Multimodal Synthesis
Supervisor
Boubakeur Boufama
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
In pursuit of advancing the capabilities of talking head generation systems, this thesis proposes a comprehensive architecture for synthesizing emotionally expressive digital avatars from audio input and a single driver image. The architecture takes a multimodal approach, combining image codec encoders for facial feature extraction, text encoders for linguistic and emotional content analysis, and audio codec encoders with lip regressors to align speech with lip movements. At its core is a diffusion model that integrates these diverse inputs to generate a cohesive and emotionally resonant visual output. This research aims to observe and quantify how accurately the proposed architecture replicates a range of human emotions. By employing a dynamic emotion changer, the architecture is tested for its ability to adapt expressions in real time, reflecting subtle changes in the emotional undertones of speech. The evaluation focuses not on surpassing existing models but on analyzing the practical application and potential advancements this architecture offers. The outcome of this investigation is expected to contribute significant insights into the viability of such a system for real-world implementation and to set a benchmark for future innovations in the field.
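The conditioning flow the abstract describes (image, text, and audio encoders feeding a diffusion denoiser, with a lip regressor riding on the audio branch) can be summarized in a minimal sketch. The following PyTorch mock-up is illustrative only: every module size, layer choice, the toy noise schedule, and names such as ImageEncoder, lip_regressor, and Denoiser are assumptions made for this example, not the thesis's actual networks.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 128     # assumed shared embedding width (illustrative)
LATENT = 64   # assumed size of the face latent being denoised (illustrative)

class ImageEncoder(nn.Module):
    # Stand-in for the image codec encoder (facial features from the driver image).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, EMB))
    def forward(self, img):                  # img: (B, 3, H, W)
        return self.net(img)                 # -> (B, EMB)

class TextEncoder(nn.Module):
    # Stand-in for the text encoder (linguistic and emotional content).
    def __init__(self, vocab=1000):
        super().__init__()
        self.emb = nn.Embedding(vocab, EMB)
    def forward(self, tokens):               # tokens: (B, L)
        return self.emb(tokens).mean(dim=1)  # -> (B, EMB)

class AudioEncoder(nn.Module):
    # Stand-in for the audio codec encoder plus a lip regressor mapping
    # per-frame speech features to lip-related coefficients (e.g. 3DMM).
    def __init__(self):
        super().__init__()
        self.enc = nn.GRU(input_size=80, hidden_size=EMB, batch_first=True)
        self.lip_regressor = nn.Linear(EMB, 20)  # 20 is an assumed coefficient count
    def forward(self, mel):                  # mel: (B, T, 80) log-mel frames
        feats, _ = self.enc(mel)
        lips = self.lip_regressor(feats)     # per-frame lip parameters, (B, T, 20)
        return feats.mean(dim=1), lips       # pooled audio embedding, lip track

class Denoiser(nn.Module):
    # Toy conditional denoiser standing in for the diffusion backbone:
    # predicts the noise added to the face latent, given all conditions.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT + 3 * EMB + 1, 256), nn.ReLU(),
            nn.Linear(256, LATENT))
    def forward(self, z_t, cond, t):         # z_t: (B, LATENT), cond: (B, 3*EMB)
        t = t.float().unsqueeze(-1) / 1000.0 # crude timestep embedding
        return self.net(torch.cat([z_t, cond, t], dim=-1))

# One illustrative training step with random stand-in data.
B = 2
img = torch.randn(B, 3, 64, 64)
mel = torch.randn(B, 50, 80)
tokens = torch.randint(0, 1000, (B, 12))

img_enc, txt_enc, aud_enc, denoiser = ImageEncoder(), TextEncoder(), AudioEncoder(), Denoiser()
a_emb, lips = aud_enc(mel)
cond = torch.cat([img_enc(img), txt_enc(tokens), a_emb], dim=-1)

z0 = torch.randn(B, LATENT)                      # "clean" face latent (toy)
t = torch.randint(0, 1000, (B,))
noise = torch.randn_like(z0)
alpha = 1.0 - t.float().unsqueeze(-1) / 1000.0   # toy linear noise schedule
z_t = alpha.sqrt() * z0 + (1.0 - alpha).sqrt() * noise
loss = F.mse_loss(denoiser(z_t, cond, t), noise) # standard epsilon-prediction loss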
Recommended Citation
Bharadwaj, Rajath Devadatta, "Emo TalkGen: Evaluating Multimodal Synthesis for Emotionally Expressive Talking Head Generation using Diffusion Models, Dual Decoders and 3DMM" (2024). Electronic Theses and Dissertations. 9615.
https://scholar.uwindsor.ca/etd/9615