[1] Bernardi, Raffaella, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. "Automatic description generation from images: A survey of models, datasets, and evaluation measures." Journal of Artificial Intelligence Research 55 (2016): 409-442.
[2] Bai, Shuang, and Shan An. "A survey on automatic image caption generation." Neurocomputing 311 (2018): 291-304.
[3] Ordonez, Vicente, Girish Kulkarni, and Tamara Berg. "Im2text: Describing images using 1 million captioned photographs." Advances in Neural Information Processing Systems 24 (2011).
[4] Yang, Yezhou, Ching Teo, Hal Daumé III, and Yiannis Aloimonos. "Corpus-guided sentence generation of natural images." In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 444-454. 2011.
[5] Kumar, Akshi, and Shivali Goel. "A survey of evolution of image captioning techniques." International Journal of Hybrid Intelligent Systems 14, no. 3 (2017): 123-139.
[6] Mao, Junhua, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. "Deep captioning with multimodal recurrent neural networks (m-RNN)." In International Conference on Learning Representations (ICLR), 2015.
[7] Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. "Show, attend and tell: Neural image caption generation with visual attention." In International Conference on Machine Learning, pp. 2048-2057. 2015.
[8] Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156-3164. 2015.
[9] Chen, Xinlei, and C. Lawrence Zitnick. "Mind's eye: A recurrent visual representation for image caption generation." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2422-2431. 2015.
[10] Li, Ruifan, Haoyu Liang, Yihui Shi, Fangxiang Feng, and Xiaojie Wang. "Dual-CNN: A convolutional language decoder for paragraph image captioning." Neurocomputing 396 (2020): 92-101.
[11] Aneja, Jyoti, Aditya Deshpande, and Alexander G. Schwing. "Convolutional image captioning." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5561-5570. 2018.
[12] Lin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. "Microsoft COCO: Common objects in context." In European Conference on Computer Vision, pp. 740-755. Cham: Springer International Publishing, 2014.
[13] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: A method for automatic evaluation of machine translation." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318. 2002.
[14] Banerjee, Satanjeev, and Alon Lavie. "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments." In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65-72. 2005.
[15] Vedantam, Ramakrishna, C. Lawrence Zitnick, and Devi Parikh. "CIDEr: Consensus-based image description evaluation." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566-4575. 2015.
[16] Lin, Chin-Yew. "ROUGE: A package for automatic evaluation of summaries." In Text Summarization Branches Out, pp. 74-81. 2004.
[17] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. 2016.
[18] Chen, Long, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. "SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5659-5667. 2017.
[19] Wu, Qi, Chunhua Shen, Peng Wang, Anthony Dick, and Anton Van Den Hengel. "Image captioning and visual question answering based on attributes and external knowledge." IEEE Transactions on Pattern Analysis and Machine Intelligence 40, no. 6 (2017): 1367-1381.
[20] Zhang, Li, Flood Sung, Feng Liu, Tao Xiang, Shaogang Gong, Yongxin Yang, and Timothy M. Hospedales. "Actor-critic sequence training for image captioning." arXiv preprint arXiv:1706.09601 (2017).
[21] Venugopalan, Subhashini, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, and Kate Saenko. "Captioning images with diverse objects." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1170-1178. 2017.
[22] Rennie, Steven J., Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. "Self-critical sequence training for image captioning." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1179-1195. 2017.
[23] Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
[24] Lu, Jiasen, Caiming Xiong, Devi Parikh, and Richard Socher. "Knowing when to look: Adaptive attention via a visual sentinel for image captioning." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 375-383. 2017.
[25] Plummer, Bryan A., Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. "Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models." In Proceedings of the IEEE International Conference on Computer Vision, pp. 2641-2649. 2015.
[26] Sattari, Zahra Famil, Hassan Khotanlou, and Elham Alighardash. "Improving image captioning with local attention mechanism." In 2022 9th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS), pp. 1-5. IEEE, 2022.
[27] Yamashita, Rikiya, Mizuho Nishio, Richard Kinh Gian Do, and Kaori Togashi. "Convolutional neural networks: an overview and application in radiology." Insights into Imaging 9, no. 4 (2018): 611-629.
[28] Sattari, Zahra Famil, Hassan Khotanlou, and Elham Alighardash. "Improving image captioning with local attention mechanism." In 2022 9th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS), pp. 1-5. IEEE, 2022.
[29] Ding, Songtao, Shiru Qu, Yuling Xi, and Shaohua Wan. "Stimulus-driven and concept-driven analysis for image caption generation." Neurocomputing 398 (2020): 520-530.