Automatic Image Captioning Based on Convolutional Neural Networks with an Attention Mechanism

Amiri, Fatemeh; Ahmadi, Fereshteh

doi:10.22075/jme.2025.33378.2629

Article type: Research article

Authors

Department of Computer Engineering, Hamedan University of Technology, Hamedan, Iran

Abstract

Image captioning refers to the process of assigning textual descriptions to images or photographs. To caption an image, the objects in it, their attributes, and the relationships among them must first be recognized correctly, and then grammatically and semantically correct sentences must be generated. This study uses an encoder-decoder architecture to generate textual descriptions. The proposed model uses a ResNet network as the encoder to extract the visual features of the image. In the decoder, a four-layer convolutional network is used to generate descriptions within the language model. To represent the extracted image features more effectively and to capture the relationships between objects, an attention mechanism is used that can attend to both the input image and the language model. The performance of the proposed model is evaluated on the MSCOCO and Flickr datasets. Experimental results show that the proposed architecture outperforms recent studies on the BLEU-1 and METEOR metrics, while its training time is reduced compared with those studies.
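To make the attention step concrete, the following is a minimal sketch of one attention computation over image regions. It is not the authors' implementation: the dot-product scoring, the function names `softmax` and `attend`, and the toy feature vectors are all assumptions chosen for illustration, since the abstract does not specify how attention scores are computed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, regions):
    """Return (weights, context) for one decoder state over image regions.

    query   -- decoder feature vector for the current word position
    regions -- encoder feature vectors, one per image region
    Dot-product scoring is an assumption; the paper may score differently.
    """
    # One relevance score per region.
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the region features.
    dim = len(regions[0])
    context = [
        sum(w * region[i] for w, region in zip(weights, regions))
        for i in range(dim)
    ]
    return weights, context

# Toy example (hypothetical values): three regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = attend(query, regions)
```

In this toy case the first and third regions receive equal, larger weights because they align with the query, so the context vector is dominated by their features.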

Subjects

Computer Engineering

Article title [English]

Image Captioning Using Convolutional Neural Networks with Attention Mechanism

Authors [English]

Fatemeh Amiri
Fereshteh Ahmadi

Department of Computer Engineering, Hamedan University of Technology, Hamedan, Iran

Abstract [English]

Image captioning is the process of assigning descriptive text to images or photographs. Producing an accurate description involves two steps: 1) Object identification: the objects in the image are identified correctly, including their specific features and the relationships between them. 2) Sentence generation: grammatically and semantically correct sentences are generated to describe the image. In this research, an encoder-decoder architecture is employed to produce textual descriptions. The proposed model consists of the following three components: 1) Encoder (ResNet): a ResNet network serves as the encoder, extracting visual features from the input image. 2) Decoder (convolutional network): in the decoder, a four-layer convolutional neural network (CNN) generates descriptions within the language model. 3) Attention mechanism: to enhance the representation of image features and capture object relationships, an attention mechanism is used that lets the model attend to both the input image and the language model. The performance of the proposed model is evaluated on the MSCOCO and Flickr datasets. Experimental results show that the proposed architecture outperforms recent state-of-the-art methods on the BLEU-1 and METEOR measures while also requiring less training time.
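For readers unfamiliar with the BLEU-1 measure mentioned above, here is a minimal sentence-level sketch: clipped unigram precision multiplied by a brevity penalty. The function name `bleu1`, the lowercase whitespace tokenization, and the example captions are assumptions for illustration. Published results are normally computed at corpus level with a standard evaluation toolkit, so this is not the evaluation code used in the article.

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty.

    candidate  -- generated caption (string)
    references -- list of human captions (strings)
    Lowercase whitespace tokenization is an assumption of this sketch.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0

    # Highest count of each word in any single reference.
    max_ref = Counter()
    for ref in refs:
        for tok, n in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], n)

    # Clip each candidate word count by its highest reference count.
    clipped = sum(min(n, max_ref[tok]) for tok, n in Counter(cand).items())
    precision = clipped / len(cand)

    # Brevity penalty against the reference length closest to the candidate
    # (ties go to the shorter reference).
    ref_len = min((len(r) for r in refs),
                  key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision

# Hypothetical captions, not taken from MSCOCO or Flickr.
score = bleu1("a dog runs on the grass",
              ["a dog is running on the grass", "the dog runs across a field"])
```

Clipping keeps a caption that repeats one common word from scoring well, and the brevity penalty lowers the score of captions shorter than the closest reference.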

Keywords [English]

Image captioning
Convolutional neural network
ResNet
Attention mechanism


Volume 23, Issue 82
Mehr 1404 (September-October 2025)
Pages 11-21
