Image Captioning Using Convolutional Neural Networks with Attention Mechanism

Document Type: Computer Article

Authors

Hamedan

Abstract

Image captioning is the task of generating descriptive text for an image or photograph. Producing an accurate description involves two main steps:
  • Object identification: the objects in the image must first be correctly recognized, including their specific attributes and the relationships between them.
  • Sentence generation: once the objects are identified, grammatically and semantically correct sentences are produced to describe the image.
In this research, an encoder-decoder architecture is employed to produce textual descriptions. The proposed model consists of three components:
  • Encoder (ResNet): a ResNet network serves as the encoder, extracting visual features from the input image.
  • Decoder (convolutional network): in the decoding stage, a four-layer convolutional neural network (CNN) generates the description within the language model.
  • Attention mechanism: to enrich the representation of image features and capture the relationships between objects, an attention mechanism is used, allowing the model to focus on both the input image and the language model.
The performance of the proposed model is evaluated on the MSCOCO and Flickr datasets. Experimental results show that the proposed architecture outperforms state-of-the-art methods on the BLEU-1 and METEOR metrics, while also requiring less training time.
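The attention step described above can be sketched as a weighted pooling over the encoder's spatial feature grid, driven by the decoder's current state. The following is a minimal, illustrative NumPy sketch, not the authors' implementation: the function name `attend`, the 7×7 grid size, and the 8-dimensional features are assumptions for the example.

```python
import numpy as np

def attend(features, query):
    """Scaled dot-product attention over image features.

    features: (L, D) array of encoder (e.g. ResNet) feature vectors,
              one per spatial location.
    query:    (D,) current decoder state.
    Returns the attended context vector and the attention weights.
    """
    # similarity of each location to the decoder state
    scores = features @ query / np.sqrt(features.shape[1])   # (L,)
    # softmax normalization (subtract max for numerical stability)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                 # (L,), sums to 1
    # weighted summary of the image features
    context = weights @ features                             # (D,)
    return context, weights

# toy example: 49 locations (a 7x7 feature grid), 8-dim features
rng = np.random.default_rng(0)
feats = rng.normal(size=(49, 8))
q = rng.normal(size=8)
ctx, w = attend(feats, q)
```

At each decoding step the decoder would recompute `w`, so the context vector shifts toward the image regions most relevant to the word being generated.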



Articles in Press, Accepted Manuscript
Available Online from 26 May 2025
  • Receive Date: 25 February 2024
  • Revise Date: 06 December 2024
  • Accept Date: 12 January 2025