Date of Award

2014

Degree Type

Dissertation

Degree Name

Ph.D.

Department

Electrical and Computer Engineering

First Advisor

Sid-Ahmed, Maher

Keywords

Arabic OCR, OCR, Page Segmentation, Persian OCR, Skew Correction, Subword Recognition

Rights

CC BY-NC-ND 4.0

Abstract

Texts are an important representation of language. Due to the volume of texts generated and the historical value of some documents, it is imperative to use computers to read generated texts, and make them editable and searchable. This task, however, is not trivial. Recreating human perception capabilities in artificial systems like documents is one of the major goals of pattern recognition research. After decades of research and improvements in computing capabilities, humans' ability to read typed or handwritten text is hardly matched by machine intelligence. Although, classical applications of Optical Character Recognition (OCR) like reading machine-printed addresses in a mail sorting machine is considered solved, more complex scripts or handwritten texts push the limits of the existing technology. Moreover, many of the existing OCR systems are language dependent. Therefore, improvements in OCR technologies have been uneven across different languages. Especially, for Persian, there has been limited research. Despite the need to process many Persian historical documents or use of OCR in variety of applications, few Persian OCR systems work with good recognition rate. Consequently, the task of automatically reading Persian typed documents with close-to-human performance is still an open problem and the main focus of this dissertation. In this dissertation, after a literature survey of the existing technology, we propose new techniques in the two important preprocessing steps in any OCR system: Skew detection and Page segmentation. Then, rather than the usual practice of character segmentation, we propose segmentation of Persian documents into sub-words. The choice of sub-word segmentation is to avoid the challenges of segmenting highly cursive Persian texts to isolated characters. For feature extraction, we will propose a hybrid scheme between three commonly used methods and finally use a nonparametric classification method. A large number of papers and patents advertise recognition rates near 100%. Such claims give the impression that automation problems seem to have been solved. Although OCR is widely used, its accuracy today is still far from a child's reading skills. Failure of some real applications show that performance problems still exist on composite and degraded documents and that there is still room for progress.

Share

COinS