Thai-IC: Thai Image Captioning based on CNN-RNN Architecture

Lawankorn Mookdarsanit, Pakpoom Mookdarsanit

Abstract


News is increasingly presented as an image with a short description (or caption) and quickly shared on social media. Most short captions, in many languages (e.g., English, Indonesian, Myanmar, Chinese, Arabic), are still written manually by humans. Instead of relying on human labor, the visual objects within an image carry enough information to generate the caption automatically, a task called image captioning. Thai image captioning (Thai-IC) is a new problem in Thai natural language processing (Thai-NLP) that requires a model to understand the image. This paper proposes an end-to-end deep learning model to generate Thai image captions. The model consists of an encoding stage based on a convolutional neural network (CNN) and a decoding stage based on a recurrent neural network (RNN). The 16-layer Visual Geometry Group network (VGGNet-16) is used as the CNN encoder to extract visual features from an image, and the visual features are fed to a long short-term memory (LSTM) network as the RNN decoder to generate Thai captions. A Thai captioning corpus of 10,732 images is constructed from primary and secondary data. Thai-IC is evaluated with the Bilingual Evaluation Understudy (BLEU) metric under 10-fold cross-validation.
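For readers who want a concrete picture of the encoder-decoder design summarised above, the sketch below shows one common way such a CNN-RNN captioner is assembled in Python with TensorFlow/Keras and scored with NLTK's BLEU. It is an illustrative sketch, not the authors' implementation: the merge-style fusion of image and text features, every hyperparameter (VOCAB_SIZE, MAX_LEN, embedding and LSTM sizes), and the example Thai tokens are assumptions made for demonstration only.

```python
# A minimal sketch of a CNN-RNN captioning pipeline like the one described
# in the abstract (NOT the authors' code). The merge-style decoder and all
# hyperparameters below are illustrative assumptions.
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16
from nltk.translate.bleu_score import sentence_bleu

VOCAB_SIZE = 8000   # assumed size of the Thai caption vocabulary
MAX_LEN = 30        # assumed maximum caption length (in Thai tokens)
EMBED_DIM = 256     # assumed word/image embedding size
LSTM_UNITS = 256    # assumed LSTM hidden size

# CNN encoder: VGGNet-16 pretrained on ImageNet; the 4096-d "fc2" activation
# serves as the visual feature of an input image.
vgg = VGG16(weights="imagenet")
cnn_encoder = Model(inputs=vgg.input, outputs=vgg.get_layer("fc2").output)
cnn_encoder.trainable = False

# RNN decoder: an LSTM language model over Thai word tokens, conditioned on
# the visual feature, that predicts the next word of the caption.
image_feature = layers.Input(shape=(4096,), name="image_feature")
img_vec = layers.Dense(EMBED_DIM, activation="relu")(
    layers.Dropout(0.5)(image_feature))

caption_tokens = layers.Input(shape=(MAX_LEN,), dtype="int32", name="caption_tokens")
word_embed = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_tokens)
lstm_out = layers.LSTM(LSTM_UNITS)(layers.Dropout(0.5)(word_embed))

merged = layers.add([img_vec, lstm_out])            # fuse image and text
merged = layers.Dense(LSTM_UNITS, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

rnn_decoder = Model(inputs=[image_feature, caption_tokens], outputs=next_word)
rnn_decoder.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# BLEU evaluation of a generated Thai caption against a reference caption.
reference = [["เด็ก", "กำลัง", "เล่น", "ฟุตบอล", "ใน", "สนาม"]]   # tokenised ground truth
candidate = ["เด็ก", "กำลัง", "เล่น", "ฟุตบอล"]                    # tokenised model output
bleu1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
print(f"BLEU-1 = {bleu1:.3f}")
```

In practice such a decoder is trained on pairs of inputs (visual feature, partial caption) and next-word targets built from the tokenised Thai corpus, and full captions are produced at inference time by greedy or beam-search decoding before BLEU is averaged over the 10-fold splits; the paper's exact training and decoding choices may differ from this sketch.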


References


K. Shuster, S. Humeau, H. Hu, A. Bordes and J. Weston, "Engaging Image Captioning via Personality," The 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 12508-12518.

I. Laina, C. Rupprecht and N. Navab, "Towards Unsupervised Image Captioning With Shared Multimodal Embeddings," The 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, 2019, pp. 7413-7423.

“Live Traffic Reports from JS100 Radio Added to Nostra Map App,” [Online]. Available: https://www.nationthailand.com/business/30228564 [Accessed: 8 July 2020].

L. Soimart and P. Mookdarsanit, “Name with GPS Auto-tagging of Thai-tourist Attractions from An Image,” The 2017 Technology Innovation Management and Engineering Science International Conference, Nakhon Pathom, Thailand, 2017, pp. 211-217.

P. Mookdarsanit and L. Mookdarsanit, “Contextual Image Classification towards Metadata Annotation of Thai-tourist Attractions,” in ITMSoc Transactions on Information Technology Management, vol.3, no.1, pp. 32-40, 2018.

A. Olaode and G. Naghdy, "Review of the application of machine learning to the automatic semantic annotation of images," in IET Image Processing, vol. 13, no. 8, pp. 1232-1245, 2019.

S. Li, Z. Tao, K. Li and Y. Fu, "Visual to Text: Survey of Image and Video Captioning," in IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 3, no. 4, pp. 297-312, 2019.

MD. Z. Hossain, F. Sohel, M. F. Shiratuddin and H. Laga, "A Comprehensive Survey of Deep Learning for Image Captioning," in ACM Computing Surveys, vol. 51, no. 6, pp. 1-36, 2019.

A. A. Nugraha, A. Arifianto and Suyanto, "Generating Image Description on Indonesian Language using Convolutional Neural Network and Gated Recurrent Unit," The 2019 International Conference on Information and Communication Technology (ICoICT), Kuala Lumpur, Malaysia, 2019, pp. 1-6.

S. P. P. Aung, W. P. Pa and T. L. Nwe, "Automatic Myanmar Image Captioning using CNN and LSTM-based Language Model," The 2020 Joint Workshop on Spoken Language Technologies for Under-resourced languages and Collaboration and Computing for Under-Resourced Languages (SLTU-CCURL), Marseille, France, 2020, pp. 139-143.

C. Zhang, Y. Dai, Y. Cheng, Z. Jia and K. Hirota, "Recurrent Attention LSTM Model for Image Chinese Caption Generation," The 2018 Joint 10th International Conference on Soft Computing and Intelligent Systems (SCIS) and 19th International Symposium on Advanced Intelligent Systems (ISIS), Toyama, Japan, 2018, pp. 808-813.

J. Zakraoui, S. Elloumi, J. M. Alja'am and S. Ben Yahia, "Improving Arabic Text to Image Mapping Using a Robust Machine Learning Technique," in IEEE Access, vol. 7, pp. 18772-18782, 2019.

H. Thaweesak Koanantakool, T. Karoonboonyanan and C. Wutiwiwatchai, "Computers and the Thai Language," in IEEE Annals of the History of Computing, vol. 31, no. 1, pp. 46-61, 2009.

C. Tapsai, P. Meesad and H. Unger, "An Overview on the Development of Thai Natural Language Processing," in Information Technology Journal, vol. 15, no. 2, pp. 45-52, 2019.

M. Jotisakulratana, N. Koomgun, R. Virojrid and B. Udomsaph, "Applying International Patent Classification (IPC) to strategic planning processes of an R&D organization: The case of NECTEC, Thailand," The PICMET '13: Technology Management in the IT-Driven Services (PICMET), San Jose, CA, 2013, pp. 1913-1918.

C. Wutiwiwatchai, V. Chunwijitra, S. Chunwijitra, P. Sertsi, S. Kasuriya, P. Chootrakool and K. Thangthai, "The NECTEC 2015 Thai Open-Domain Automatic Speech Recognition System," International Symposium on Natural Language Processing (SNLP), Ayutthaya, Thailand, 2016, pp. 124-136.

“AI for Thai,” [Online]. Available: https://www.nectec.or.th/research/research-project/aiforthai-digitaltransformation.html [Accessed: 8 July 2020]. [in Thai].

“VAJA Text-to-Speech Engine,” [Online]. Available: https://www.nectec.or.th/innovation/innovation-mobile-application/vaja.html [Accessed: 8 July 2020]. [in Thai].

“Thailand’s first AI journalist named ‘Suthichai AI’ unveiled,” [Online]. Available: https://www.thaipbsworld.com/thailands-first-ai-journalist-named-suthichai-ai-unveiled/ [Accessed: 8 July 2020].

C. Tanprasert and S. Sae-Tang, "Thai type style recognition," The 1999 IEEE International Symposium on Circuits and Systems (ISCAS), Orlando, FL, 1999, pp. 336-339.

P. Mookdarsanit and L. Mookdarsanit, "ThaiWrittenNet: Thai Handwritten Script Recognition Using Deep Neural Networks," in Azerbaijan Journal of High Performance Computing, vol. 3, no. 1, pp. 75-93.

T. Emsawas and B. Kijsirikul, "Thai printed character recognition using long short-term memory and vertical component shifting," The 2016 Pacific Rim International Conference on Artificial Intelligence, Phuket, Thailand, 2016, pp. 106-115.

C. Tanprasert and T. Koanantakool, "Thai OCR: a neural network application," The 1996 Digital Processing Applications (TENCON), Perth, WA, Australia, 1996, pp. 90-95.

P. Mookdarsanit and L. Mookdarsanit, “An Automatic Image Tagging of Thai Dance’s Gestures,” Joint Conference on ACTIS & NCOBA, Ayutthaya, Thailand, 2018, pp. 76-80.

P. Mookdarsanit and L. Mookdarsanit, “A Content-based Image Retrieval of Muay-Thai Folklores by Salient Region Matching,” in International Journal of Applied Computer Technology and Information Systems, vol.7, no.2, pp.21-26, 2018.

P. Mookdarsanit and M. Rattanasiriwongwut, “GPS Determination of Thai-temple Arts from a Single Photo,” The 11th International Conference on Applied Computer Technology and Information Systems, Bangkok, Thailand, 2017, pp. 42-47.

P. Mookdarsanit and M. Rattanasiriwongwut, “MONTEAN Framework: A Magnificent Outstanding Native-Thai and Ecclesiastical Art Network,” in International Journal of Applied Computer Technology and Information Systems, vol.6, no.2, pp.17-22, 2017.

L. Mookdarsanit, “The Intelligent Genuine Validation beyond Online Buddhist Amulet Market,” in International Journal of Applied Computer and Information Systems, vol. 9, no.2, pp. 7-11, 2020.

L. Mookdarsanit and P. Mookdarsanit, “SiamFishNet: The Deep Investigation of Siamese Fighting Fishes,” in International Journal of Applied Computer Technology and Information Systems, vol.8, no.2, pp. 40-46, 2019.

L. Soimart and P. Mookdarsanit, “Ingredients estimation and recommendation of Thai-foods,” in SNRU Journal of Science and Technology, vol.9, no.2, pp.509-520, 2017.

P. Mookdarsanit and L. Mookdarsanit, “Name and Recipe Estimation of Thai-desserts beyond Image Tagging,” in Kasembundit Engineering Journal, vol.8, Special Issue, pp.193-203, 2018.

P. Khuphiran, S. Kajkamhaeng and C. Chantrapornchai, "Thai Scene Graph Generation from Images and Applications," The 2019 International Computer Science and Engineering Conference (ICSEC), Phuket, Thailand, 2019, pp. 361-365.

D. Liu, M. Bober and J. Kittler, “Visual Semantic Information Pursuit: A Survey,” in arXiv:1903.05434, 2019.

K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," The 3rd International Conference on Learning Representations, San Diego, CA, 2015.

J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko and T. Darrell, "Long-Term Recurrent Convolutional Networks for Visual Recognition and Description," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 677-691, 2017.

V. Mullachery and V. Motwani, “Image Captioning,” in arXiv: 1805.09137, 2018.

“Large Scale Visual Recognition Challenge 2013 (ILSVRC2013),” [Online]. Available: http://image-net.org/challenges/LSVRC/2013/ [Accessed: 8 July 2020].

J. Deng, W. Dong, R. Socher, L. Li, Kai Li and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," The 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 2009, pp. 248-255.

“Flickr8K,” [Online]. Available: https://www.kaggle.com/shadabhussain/flickr8k [Accessed: 2 February 2020].

A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," The 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 3128-3137.

“vistec-AI/thai2nmt,” [Online]. Available: https://github.com/vistec-AI/thai2nmt?fbclid=IwAR2CRfPnlEpEykrBV3h62JrOuUBPnH2tUswI9Vf1x-gCkxeVXogM2pPNlsk [Accessed: 21 July 2020].

