SwinVQA: A Transformer Framework for Medical Visual Question Answering

Keyur Vaitha, Jay Korat, Anand Akbari, Chinmay Raut

Abstract

The integration of advanced computer vision and natural language processing techniques in medical image analysis presents significant challenges due to the complexity and high stakes of diagnostic interpretation. This paper introduces SwinVQA, a novel medical visual question answering framework that leverages the hierarchical architecture of Swin Transformers to address limitations of existing systems. By employing shifted window partitioning and patch merging, SwinVQA efficiently processes high-resolution medical images while capturing both the local detail and the global context essential for accurate diagnosis. We implement a cross-modal attention mechanism that aligns visual features with clinical queries, enhancing the model’s reasoning capabilities. The framework is evaluated on the established medical VQA datasets SLAKE and VQA-RAD, demonstrating improved performance across question types and imaging modalities. Additionally, we introduce beam search for answer generation, producing more contextually appropriate and diagnostically accurate responses. Experimental results show that SwinVQA significantly outperforms baseline models in both computational efficiency and diagnostic accuracy. This research advances AI-assisted medical image analysis by providing a robust and clinically relevant solution that bridges the gap between technological capability and practical healthcare application.
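
To illustrate the cross-modal fusion step described in the abstract, the following is a minimal PyTorch-style sketch in which clinical question tokens attend over visual patch tokens produced by a Swin Transformer backbone. The module name, dimensions, and residual fusion are illustrative assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    # Illustrative sketch: question tokens (queries) attend over visual patch
    # tokens (keys/values) from a Swin backbone; all names/sizes are assumptions.
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question_tokens, visual_tokens):
        # question_tokens: (B, Lq, dim) from a text encoder
        # visual_tokens:   (B, Lv, dim) from the Swin visual encoder
        attended, _ = self.attn(question_tokens, visual_tokens, visual_tokens)
        # Residual fusion keeps the original question representation while
        # injecting image context relevant to the query.
        return self.norm(question_tokens + attended)

# Toy usage with random tensors standing in for encoder outputs.
fusion = CrossModalAttention()
q = torch.randn(2, 16, 768)   # 16 question tokens
v = torch.randn(2, 49, 768)   # 49 visual tokens (e.g. a 7x7 Swin feature grid)
out = fusion(q, v)            # (2, 16, 768): question tokens enriched with visual context

Answer generation would then decode over such fused representations; with beam search, a standard decoder keeps the top-k partial hypotheses at each step and returns the highest-scoring completed answer rather than a purely greedy choice.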
