Image Captioning Using Convolutional Neural Networks with Attention Mechanism

Document Type: Computer Article

Authors

Hamedan

Abstract

Image captioning is the task of generating descriptive text for an image or photograph. Producing an accurate description involves two main steps:
  • Object identification: the objects in the image must first be correctly recognized, including their specific attributes and the relationships between them.
  • Sentence generation: once the objects are identified, grammatically and semantically correct sentences are produced to describe the image.
In this research, an encoder-decoder architecture is employed to produce textual descriptions. The proposed model consists of three components:
  • Encoder (ResNet): a ResNet network serves as the encoder, extracting visual features from the input image.
  • Decoder (convolutional network): in the decoding stage, a four-layer convolutional neural network (CNN) generates the description within the language model.
  • Attention mechanism: to enrich the representation of image features and capture the relationships between objects, an attention mechanism is used, allowing the model to focus on both the input image and the language model.
The performance of the proposed model is evaluated on the MSCOCO and Flickr datasets. Experimental results show that the proposed architecture outperforms state-of-the-art methods on the BLEU-1 and METEOR metrics, while also requiring less training time.
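The attention step described above can be sketched as a weighted pooling over the encoder's spatial feature grid, driven by the decoder's current state. The following is a minimal, illustrative NumPy sketch, not the authors' implementation: the function name `attend`, the 7×7 grid size, and the 8-dimensional features are assumptions for the example.

```python
import numpy as np

def attend(features, query):
    """Scaled dot-product attention over image features.

    features: (L, D) array of encoder (e.g. ResNet) feature vectors,
              one per spatial location.
    query:    (D,) current decoder state.
    Returns the attended context vector and the attention weights.
    """
    # similarity of each location to the decoder state
    scores = features @ query / np.sqrt(features.shape[1])   # (L,)
    # softmax normalization (subtract max for numerical stability)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                 # (L,), sums to 1
    # weighted summary of the image features
    context = weights @ features                             # (D,)
    return context, weights

# toy example: 49 locations (a 7x7 feature grid), 8-dim features
rng = np.random.default_rng(0)
feats = rng.normal(size=(49, 8))
q = rng.normal(size=8)
ctx, w = attend(feats, q)
```

At each decoding step the decoder would recompute `w`, so the context vector shifts toward the image regions most relevant to the word being generated.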



Articles in Press, Accepted Manuscript
Available Online from 26 May 2025
  • Receive Date: 25 February 2024
  • Revise Date: 06 December 2024
  • Accept Date: 12 January 2025