Multimodal Deep Learning Framework for Intelligent Video Querying and Graph Recognition in Educational Content Analysis

Y. N. Thakare, J. Rangole, R. U. Shekokar, P. K. Agrawal, U. Wankhade, S. Kulsum, J. Sharma, H. Agrawal, A. Bonde

Abstract

Video content has become increasingly complex, which makes it difficult to extract and analyze information spread across multiple modalities. Traditional search mechanisms cannot comprehensively process and interpret such multimedia resources. VIDIWISE addresses these challenges by integrating several technologies into a single pipeline: the ResNet18 deep learning model for visual feature extraction, Whisper for audio transcription, and machine learning techniques for keyframe detection and text recognition. Using computer vision algorithms, the system performs OCR on video frames, clusters visual features, and generates a comprehensive, searchable transcript that combines textual and visual information. By extracting and correlating information across modalities within one unified approach, the proposed system changes how video content is analyzed. Through processing techniques such as feature clustering, timestamp-based segmentation, and semantic analysis, VIDIWISE enables precise and efficient retrieval of video content.
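The abstract does not specify which clustering method, keyframe rule, or merging scheme VIDIWISE uses, so the following is only a minimal sketch of how such a pipeline could fit together, not the authors' implementation. It assumes per-frame feature vectors have already been computed (for example, by ResNet18), that speech has been transcribed into timestamped segments (Whisper produces segments with start and end times), and that OCR text is available for each keyframe. It swaps in plain k-means with deterministic farthest-point initialisation, picks the frame nearest each centroid as a keyframe, attaches OCR text to the speech segment whose time window contains it, and performs a simple substring search. All names (`Segment`, `select_keyframes`, `merge_transcript`, `search`) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """A timestamped transcript segment (times in seconds)."""
    start: float
    end: float
    text: str


def kmeans(vectors, k, iters=20):
    """Plain k-means over lists of floats; returns (centroids, assignments)."""
    # Deterministic farthest-point ("maximin") initialisation.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        centroids.append(list(far))
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each frame joins its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def select_keyframes(features, k):
    """Hypothetical rule: the frame closest to each cluster centroid is a keyframe."""
    centroids, assign = kmeans(features, k)
    keyframes = []
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            keyframes.append(min(members, key=lambda i: math.dist(features[i], centroid)))
    return sorted(keyframes)


def merge_transcript(segments, ocr_by_time):
    """Attach OCR text to the speech segment whose [start, end) window contains it."""
    merged = []
    for seg in segments:
        ocr = [text for t, text in sorted(ocr_by_time.items()) if seg.start <= t < seg.end]
        text = seg.text if not ocr else f"{seg.text} [on screen: {'; '.join(ocr)}]"
        merged.append(Segment(seg.start, seg.end, text))
    return merged


def search(merged, query):
    """Case-insensitive substring search; returns matching (start, end) windows."""
    q = query.lower()
    return [(s.start, s.end) for s in merged if q in s.text.lower()]


if __name__ == "__main__":
    # Toy 2-D "embeddings" for six frames sampled at one frame per second.
    features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
                [10.0, 10.0], [10.1, 9.9], [9.8, 10.0]]
    keyframes = select_keyframes(features, k=2)  # [0, 3]
    # Stand-in OCR results, keyed by keyframe timestamp in seconds.
    ocr = {float(i): t for i, t in zip(keyframes, ["Bar chart: enrolment by year", "y = 2x + 1"])}
    segments = [Segment(0.0, 2.5, "Today we look at enrolment trends."),
                Segment(2.5, 6.0, "Next, a linear function.")]
    merged = merge_transcript(segments, ocr)
    print(search(merged, "bar chart"))  # [(0.0, 2.5)]
```

Searching for on-screen text such as "bar chart" returns the time window of the spoken segment during which it appeared, which illustrates in toy form how combining transcript and OCR text makes visual content retrievable by timestamp. A real system would likely replace the substring match with the semantic analysis the abstract mentions.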
