Optimized Image Quality based Hybrid Deep Learning Framework for Enhanced Image Captioning

Mehzabeen Kaur, Harpreet Kaur

Abstract

Image captioning is a complex interdisciplinary task that connects computer vision and natural language processing to generate meaningful textual descriptions of visual content. The quality of the input image, however, plays a crucial role in determining the accuracy, fluency, and semantic relevance of the generated captions: low-resolution or noisy images often lead to incomplete or inaccurate descriptions, limiting the effectiveness of captioning systems. This paper introduces a Hybrid Deep Learning Framework for Image Captioning with Optimized Image Quality, which integrates advanced image enhancement techniques with a robust caption generation model. In the proposed method, input images are first processed with a hybrid enhancement strategy that combines histogram equalization with adaptive filtering, improving contrast, clarity, and detail preservation. The quality-enhanced images are then passed through a deep learning pipeline that employs Convolutional Neural Networks (CNNs) for visual feature extraction and Long Short-Term Memory (LSTM) networks for sequential caption generation. Extensive experiments on benchmark datasets demonstrate that the proposed framework outperforms baseline image captioning systems across multiple evaluation metrics, including accuracy, precision, recall, F1-score, BLEU, METEOR, and CIDEr. The results indicate that the enhancement stage significantly improves semantic alignment between an image and its caption, producing more descriptive and contextually accurate outputs. By addressing the limitations imposed by low-quality images, this research highlights the potential of combining image optimization with deep learning to advance the performance and applicability of modern image captioning systems.
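As a rough illustration of the enhancement stage described in the abstract, the sketch below implements global histogram equalization followed by an adaptive filter in plain NumPy. The abstract does not specify which adaptive filter the framework uses, so this sketch assumes the classical adaptive local noise-reduction filter (smooth flat regions, preserve high-variance detail); the function names and the `noise_var` parameter are illustrative, not taken from the paper.

```python
import numpy as np

def histogram_equalize(img):
    """Spread intensities of a uint8 grayscale image across 0-255 via its CDF."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    # Map each intensity level through the normalized CDF (standard lookup table).
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255), 0, 255)
    return lut.astype(np.uint8)[img]

def adaptive_mean_filter(img, k=3, noise_var=100.0):
    """Adaptive local noise-reduction filter: full smoothing where the local
    variance is near the assumed noise variance, little smoothing at edges."""
    img = img.astype(np.float64)
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            win = padded[i:i + k, j:j + k]
            local_mean = win.mean()
            local_var = win.var()
            # Correction factor shrinks toward 0 in detailed (high-variance) areas.
            ratio = min(noise_var / local_var, 1.0) if local_var > 0 else 1.0
            out[i, j] = img[i, j] - ratio * (img[i, j] - local_mean)
    return np.clip(out, 0, 255).astype(np.uint8)

def enhance(img):
    """Hybrid enhancement: contrast stretch first, then adaptive denoising."""
    return adaptive_mean_filter(histogram_equalize(img))
```

The enhanced image would then be fed to the CNN encoder in place of the raw input; in practice the equalization step is often applied per-channel or on the luminance channel of a color image.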

Article Details

Section
Articles

References

Vinyals, O., Toshev, A., Bengio, S., & Erhan, D., “Show and Tell: A Neural Image Caption Generator,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Xu, K., Ba, J., Kiros, R., et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” International Conference on Machine Learning (ICML), 2015.

Dong, C., Loy, C.C., He, K., & Tang, X., “Learning a Deep Convolutional Network for Image Super-Resolution,” European Conference on Computer Vision (ECCV), Springer, pp. 184–199, 2014.

Abiodun, A., et al., “Brightness Preserving Bi-Histogram Equalization,” 1996.

Bevilacqua, M., Roumy, A., Guillemot, C., & Alberi-Morel, M.L., “Low-Complexity Single-Image Super-Resolution Based on Nonnegative Neighbor Embedding,” 2012.

Chang, X., Yu, Y.-L., Yang, Y., & Xing, E.P., “Semantic Pooling for Complex Event Analysis in Untrimmed Videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8):1617–1632, 2017.

Heidari, A.A., et al., “Harris Hawks Optimization: Algorithm and Applications,” Future Generation Computer Systems, 97:849–872, 2019.

Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., & Chua, T.-S., “SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6298–6306, 2017.

Fang, H., Gupta, S., Iandola, F., Srivastava, R.K., Deng, L., Dollar, P., Gao, J., He, X., Mitchell, M., Platt, J.C., et al., “From Captions to Visual Concepts and Back,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1473–1482, 2015.

He, K., Zhang, X., Ren, S., & Sun, J., “Deep Residual Learning for Image Recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

Irani, M., & Peleg, S., “Improving Resolution by Image Registration,” CVGIP: Graphical Models and Image Processing, 53(3):231–239, 1991.

Jia, X., Gavves, E., Fernando, B., & Tuytelaars, T., “Guiding Long-Short Term Memory for Image Caption Generation,” IEEE International Conference on Computer Vision (ICCV), pp. 2407–2415, 2015.

Kim, J., Lee, J.K., & Lee, K.M., “Accurate Image Super-Resolution Using Very Deep Convolutional Networks,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1646–1654, 2016.

Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., & Berg, T.L., “Babytalk: Understanding and Generating Simple Image Descriptions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903, 2013.

Lu, J., Xiong, C., Parikh, D., & Socher, R., “Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Li, G., Zhu, L., Liu, P., & Yang, Y., “Entangled Transformer for Image Captioning,” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8928–8937, 2019.

Fedus, W., Goodfellow, I., & Dai, A.M., “MaskGAN: Better Text Generation via Filling in the ______,” arXiv preprint arXiv:1801.07736, 2018.

Das, S., Jain, L., & Das, A., “Deep Learning for Military Image Captioning,” 2018 21st International Conference on Information Fusion (FUSION), pp. 2165–2171, doi: 10.23919/ICIF.2018.8455321, 2018.

Omri, M., et al., “Modeling of Hyperparameter Tuned Deep Learning Model for Automated Image Captioning,” Mathematics, 10(3):288, 2022.

Gong, Y., Wang, L., Hodosh, M., Hockenmaier, J., & Lazebnik, S., “Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections,” European Conference on Computer Vision (ECCV), Springer, pp. 529–545, 2014.

Simonyan, K., & Zisserman, A., “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.

Young, P., Hodosh, M., & Hockenmaier, J., “Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics,” Journal of Artificial Intelligence Research, 47:853–899, 2013.

Kiros, R., Salakhutdinov, R., & Zemel, R.S., “Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models,” Workshop on Neural Information Processing Systems (NIPS), 2014.

Donahue, J., et al., “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.