PDF Extraction Based on Lexical Analysis for Thai Texts

Santipong Thaiprayoon, Alisa Kongthon, Choochart Haruechaiyasak

Abstract


Today, one of the most widely used digital document file formats is the Portable Document Format (PDF). The main advantage of PDF is the ability to create and share documents across platforms with different operating systems and hardware environments. Although many tools exist for generating PDF files from text documents, there is no standard tool for converting PDF files into text with 100% accuracy. The errors are mainly caused by the misplacement of some characters in the resulting text. For the Thai language, the problem is intensified by the complex lexeme structure, i.e., character composition, of Thai words. In this paper, we first survey PDF extraction tools that are suitable for the Thai language. To further improve the quality of the extracted text, we propose an approach called PDF-PP (Thai PDF Post Processor), which performs text cleansing based on lexical analysis. Experimental results on a large corpus show that the proposed Thai PDF-PP improves the accuracy of extracted text to as much as 99.78%.
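
To make the idea of character misplacement concrete, the following is a minimal Python sketch of one kind of cleansing rule a post-processor might apply to extracted Thai text. It is an illustration only and not the PDF-PP method itself: the function name `clean_thai` and the two rules (collapsing a repeated combining mark, and moving a tone mark that was extracted before an above/below vowel to its position after that vowel) are assumptions chosen for the example.

```python
import re

# Assumed character classes for this sketch (Unicode Thai block):
# above/below vowels: mai han-akat (U+0E31) and sara i .. sara uu (U+0E34-U+0E39)
_VOWEL = "\u0e31\u0e34-\u0e39"
# tone marks: mai ek .. mai chattawa (U+0E48-U+0E4B)
_TONE = "\u0e48-\u0e4b"

# A tone mark immediately followed by an above/below vowel is out of order.
_TONE_BEFORE_VOWEL = re.compile(f"([{_TONE}])([{_VOWEL}])")
# The same combining mark repeated two or more times in a row.
_REPEATED_MARK = re.compile(f"([{_VOWEL}{_TONE}])\\1+")


def clean_thai(text: str) -> str:
    """Apply two illustrative (hypothetical) cleansing rules to Thai text."""
    # Rule 1: collapse runs of an identical combining mark to a single mark.
    text = _REPEATED_MARK.sub(r"\1", text)
    # Rule 2: reorder "tone mark + vowel" into "vowel + tone mark".
    return _TONE_BEFORE_VOWEL.sub(r"\2\1", text)


if __name__ == "__main__":
    # tho thahan + mai ek + sara ii (misordered) becomes tho thahan + sara ii + mai ek
    broken = "\u0e17\u0e48\u0e35"
    print(clean_thai(broken) == "\u0e17\u0e35\u0e48")
```

A practical post-processor would need a much larger rule set derived from the lexical structure of Thai words; the two rules here only show the general shape of such a cleansing step.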

