Dynamic Multimodal Fusion of LLM-Driven Context Embeddings and Deep Vision Encoders for Real-Time Affective Computing
Abstract
The proposed framework integrates Large Language Model (LLM)-based contextual semantic embeddings with deep convolutional and transformer-based vision encoders to enhance real-time emotion recognition and affect analysis. Unlike traditional unimodal or static fusion approaches, the framework employs adaptive, attention-driven fusion that dynamically weights textual context and visual cues according to environmental and conversational relevance. The LLM module captures rich contextual semantics from speech transcripts, dialogue history, and situational metadata, while the deep vision encoder extracts fine-grained facial expressions and micro-emotional patterns from video streams. A temporal alignment strategy keeps the multimodal representations synchronized during learning, improving robustness when data are noisy or partially missing. Experimental evaluations on benchmark affective computing datasets show higher accuracy and F1-score, and lower latency, than conventional early- and late-fusion models. The architecture supports scalable, low-latency deployment in applications such as intelligent tutoring systems, human–computer interaction, mental health monitoring, and adaptive social robotics, advancing context-aware, emotionally intelligent AI systems.
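To make the adaptive fusion idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of attention-driven fusion that assigns per-sample weights to an LLM-derived context embedding and a vision-encoder embedding before emotion classification. All module names, dimensions, and the gating formulation are assumptions chosen for clarity.

```python
# Illustrative sketch only: input-dependent weighting of a text-context
# embedding and a visual embedding, then classification over emotion labels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveAttentionFusion(nn.Module):
    """Fuses text-context and visual embeddings with learned, input-dependent weights."""

    def __init__(self, text_dim=768, vision_dim=512, fused_dim=256, num_emotions=7):
        super().__init__()
        # Project both modalities into a shared space.
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.vision_proj = nn.Linear(vision_dim, fused_dim)
        # Small attention network that scores each modality per sample.
        self.attn_scorer = nn.Linear(fused_dim, 1)
        self.classifier = nn.Linear(fused_dim, num_emotions)

    def forward(self, text_emb, vision_emb):
        # text_emb: (batch, text_dim), vision_emb: (batch, vision_dim)
        t = torch.tanh(self.text_proj(text_emb))      # (batch, fused_dim)
        v = torch.tanh(self.vision_proj(vision_emb))  # (batch, fused_dim)
        stacked = torch.stack([t, v], dim=1)          # (batch, 2, fused_dim)
        # Per-sample modality weights; the softmax lets textual context or
        # visual cues dominate depending on the input (e.g., an occluded face
        # shifts weight toward the transcript embedding).
        scores = self.attn_scorer(stacked)            # (batch, 2, 1)
        weights = F.softmax(scores, dim=1)            # (batch, 2, 1)
        fused = (weights * stacked).sum(dim=1)        # (batch, fused_dim)
        return self.classifier(fused), weights.squeeze(-1)


if __name__ == "__main__":
    model = AdaptiveAttentionFusion()
    text_emb = torch.randn(4, 768)    # e.g., pooled LLM transcript embedding
    vision_emb = torch.randn(4, 512)  # e.g., pooled facial-expression features
    logits, modality_weights = model(text_emb, vision_emb)
    print(logits.shape, modality_weights.shape)  # (4, 7) and (4, 2)
```

In this sketch the modality weights are recomputed for every input, which is the property the abstract attributes to the fusion stage; the returned weights can also be logged to inspect how the balance between context and vision shifts across conditions.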