A Unified Framework for Multimodal Emotion Recognition: Leveraging Text, Audio, and Visual Data for Enhanced Emotional Understanding


Sanjeeva Rao Sanku, B. Sandhya

Abstract

Emotion recognition from multimodal data (e.g., video, audio, and text) is a demanding and significant research field with numerous applications. This research rigorously explores model-level fusion to find the best multimodal combination of audio and visual cues for emotion identification, and it proposes novel feature-extractor networks for both audio and video data. We present a comprehensive approach to multimodal emotion recognition, using state-of-the-art feature extraction methods tailored to each modality. For text data, we implement the Assimilated N-gram Approach (ANA) to effectively capture contextual information. Audio features are extracted using Mel-Frequency Cepstral Coefficients (MFCC), which are well suited to capturing the spectral characteristics of speech. Visual features are derived with SqueezeNet, a deep learning architecture optimized for efficient and informative visual representation. To integrate the extracted features from the text, audio, and visual modalities, we propose a multimodal data fusion strategy that combines information across modalities, thereby enriching the overall representation of emotional cues. In the classification stage, we employ a Capsule Network (CapsNet), a neural architecture adept at capturing hierarchical relationships and spatial hierarchies within data, making it well suited to complex multimodal inputs. To further optimize the CapsNet classifier, we tune its hyperparameters with the Sand Cat Swarm Optimization (SCSO) algorithm, a metaheuristic inspired by the hunting behavior of sand cats that iteratively updates candidate solutions to converge toward optimal configurations. On the Multimodal EmotionLines Dataset (MELD), our approach achieved an accuracy of 98.91%, precision of 98.83%, recall of 99.04%, and F-measure of 98.94%. These results highlight the effectiveness of our multimodal framework for emotion recognition.
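The paper's Assimilated N-gram Approach (ANA) is its own contribution and is not specified in the abstract; as a hedged stand-in, the sketch below shows a standard TF-IDF n-gram text featurizer with scikit-learn, which captures local context the way any n-gram method does. The example utterances are illustrative only.

```python
# Standard n-gram text features (a stand-in, NOT the paper's ANA method).
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
X_text = vectorizer.fit_transform([
    "I can't believe you did that!",      # illustrative MELD-style utterances
    "That's wonderful news, congrats!",
])
print(X_text.shape)  # (2, n_features): one sparse n-gram vector per utterance
```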
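For the audio modality, a minimal sketch of MFCC extraction is shown below, assuming the librosa library (the abstract does not name a toolkit) and mean/std pooling over frames to obtain one vector per utterance; both choices are illustrative assumptions.

```python
# Minimal MFCC feature extraction sketch (librosa is an assumed toolkit).
import numpy as np
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load an utterance and return a fixed-size MFCC feature vector."""
    y, sr = librosa.load(wav_path, sr=16000)                  # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, frames)
    # Pool over time (mean + std) so every utterance yields the same dimensionality.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```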
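For the visual modality, one plausible reading of "features derived with SqueezeNet" is global average pooling over the network's convolutional feature map; the sketch below assumes a pretrained torchvision SqueezeNet 1.1 and is not necessarily the authors' exact configuration.

```python
# Frame-level visual features from a pretrained SqueezeNet (torchvision assumed).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

squeezenet = models.squeezenet1_1(weights=models.SqueezeNet1_1_Weights.DEFAULT)
squeezenet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_visual(frame: Image.Image) -> torch.Tensor:
    x = preprocess(frame).unsqueeze(0)        # (1, 3, 224, 224)
    fmap = squeezenet.features(x)             # (1, 512, 13, 13) conv feature map
    return fmap.mean(dim=(2, 3)).squeeze(0)   # global average pool -> (512,)
```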
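The fusion strategy itself is not detailed in the abstract; concatenation of per-modality vectors, shown below with L2 normalization so no single feature scale dominates, is one common feature-level choice and is assumed here purely for illustration.

```python
# Feature-level fusion sketch: concatenation is an assumed (common) strategy.
import numpy as np

def fuse(text_vec: np.ndarray, audio_vec: np.ndarray,
         visual_vec: np.ndarray) -> np.ndarray:
    # L2-normalize each modality before concatenating the three vectors.
    norm = lambda v: v / (np.linalg.norm(v) + 1e-8)
    return np.concatenate([norm(text_vec), norm(audio_vec), norm(visual_vec)])
```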
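The defining non-linearity of capsule networks is the "squash" function from Sabour et al. (2017), which shrinks short vectors toward zero and long vectors toward unit length so a capsule's length can be read as a probability. A standard PyTorch formulation follows; the full CapsNet classifier used in the paper is not reproduced here.

```python
# Capsule "squash" non-linearity: v = (|s|^2 / (1 + |s|^2)) * s / |s|.
import torch

def squash(s: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)   # squared length per capsule
    scale = sq_norm / (1.0 + sq_norm)               # maps length into [0, 1)
    return scale * s / torch.sqrt(sq_norm + eps)    # preserve direction
```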
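Finally, a hedged sketch of an SCSO-style hyperparameter search loop is given below. It follows the general structure of Sand Cat Swarm Optimization (a decaying sensitivity parameter switches each agent between exploration and exploitation), but the update rules are simplified and the `fitness` callable (e.g., validation error of the CapsNet for a given hyperparameter vector) is a placeholder, not the paper's implementation.

```python
# Simplified SCSO-style metaheuristic loop (illustrative, not the paper's code).
import numpy as np

def scso(fitness, lb, ub, n_agents=20, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = len(lb)
    pos = rng.uniform(lb, ub, size=(n_agents, dim))       # candidate solutions
    scores = np.array([fitness(p) for p in pos])
    i_best = scores.argmin()
    best, best_score = pos[i_best].copy(), scores[i_best]

    for t in range(n_iter):
        rG = 2.0 - 2.0 * t / n_iter                       # sensitivity decays 2 -> 0
        for i in range(n_agents):
            r = rG * rng.random()
            R = 2.0 * rG * rng.random() - rG              # phase-switch control
            theta = rng.uniform(0.0, 2.0 * np.pi)
            if abs(R) <= 1.0:                             # exploitation near best
                rnd = np.abs(rng.random(dim) * best - pos[i])
                pos[i] = best - r * rnd * np.cos(theta)
            else:                                         # exploration step
                j = rng.integers(n_agents)
                pos[i] = r * (pos[j] - rng.random(dim) * pos[i])
            pos[i] = np.clip(pos[i], lb, ub)
            s = fitness(pos[i])
            if s < best_score:
                best, best_score = pos[i].copy(), s
    return best, best_score
```

In a tuning setting, `lb` and `ub` would bound the hyperparameters being searched (e.g., learning rate and the number of routing iterations), with `fitness` returning the validation loss of a CapsNet trained under that configuration.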
