Show simple item record

dc.contributor.advisor: Gutierrez-Osuna, Ricardo
dc.creator: Gupta, Anshul
dc.date.accessioned: 2019-01-18T19:23:19Z
dc.date.available: 2019-01-18T19:23:19Z
dc.date.created: 2015-08
dc.date.issued: 2015-08-05
dc.date.submitted: August 2015
dc.identifier.uri: https://hdl.handle.net/1969.1/174199
dc.description.abstract: Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. Issues include noisy backgrounds and faded text due to aging, border and marginal noise, bleed-through, skewing, and warping, as well as irregular fonts and page layouts. As a result, OCR tools often produce a large number of spurious bounding boxes (BBs) in addition to those that correspond to words in the document. To improve the OCR output, in this thesis we develop machine-learning methods to assess the quality of historical documents and to tag documents with their page problems in the EEBO/ECCO collections (45 million pages available through the Early Modern OCR Project at Texas A&M University). We present an iterative classification algorithm that automatically labels BBs as text or noise based on their spatial distribution and geometry. The approach uses a rule-based classifier to generate initial text/noise labels for each BB, followed by an iterative classifier that refines the initial labels by incorporating information local to each BB, its spatial location, shape, and size. When evaluated on a dataset containing over 72,000 manually labeled BBs from 159 historical documents, the algorithm classifies BBs with 0.95 precision and 0.96 recall. Further evaluation on a collection of 6,775 documents with ground-truth transcriptions shows that the algorithm can also be used to predict document quality (0.7 correlation) and to improve OCR transcriptions in 85% of the cases. This thesis also aims at generating font metadata for historical documents. Knowledge of the font can help an OCR system produce very accurate text transcriptions, but obtaining font information for 45 million documents is a daunting task. We present an active-learning-based font identification system that classifies document images by font. In active learning, a learner queries a human for labels on the examples it finds most informative. We capture the characteristics of the fonts using word-image features related to character width, angled strokes, and Zernike moments. To extract page-level features, we use a bag-of-words feature (BoF) model. A font classification model trained using BoF and active learning requires only 443 labeled instances to achieve 89.3% test accuracy. [en]
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Historical Documents [en]
dc.subject: EMOP [en]
dc.subject: Machine Learning [en]
dc.subject: Quality Assessment [en]
dc.subject: Active Learning [en]
dc.subject: Font identification [en]
dc.subject: Bag-of-words [en]
dc.subject: Diagnostics [en]
dc.title: Assessment of OCR Quality and Font Identification in Historical Documents [en]
dc.type: Thesis [en]
thesis.degree.department: College of Engineering [en]
thesis.degree.discipline: Computer Engineering [en]
thesis.degree.grantor: Texas A & M University [en]
thesis.degree.name: Master of Science [en]
thesis.degree.level: Masters [en]
dc.contributor.committeeMember: Furuta, Richard
dc.contributor.committeeMember: Mandell, Laura
dc.type.material: text [en]
dc.date.updated: 2019-01-18T19:23:19Z
local.etdauthor.orcid: 0000-0002-8991-8429
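
Illustrative note on the abstract: the two-stage labeling of bounding boxes (a rule-based first pass, then iterative refinement from local context) can be sketched roughly as below. This is not the thesis's algorithm. The `Box` type, the geometric thresholds in `rule_label`, the neighborhood `radius`, and the 75%/25% voting rule in `refine` are all hypothetical choices made only to show the overall shape of such a procedure.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Box:
    """Hypothetical bounding box: top-left corner plus width and height (pixels)."""
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def rule_label(box, min_h=8, max_h=60, min_aspect=0.2, max_aspect=15):
    """First pass: True means 'text'. All thresholds are assumed values."""
    if box.w <= 0 or box.h <= 0:
        return False
    aspect = box.w / box.h
    return min_h <= box.h <= max_h and min_aspect <= aspect <= max_aspect


def refine(boxes, labels, radius=150.0, max_iters=10):
    """Second pass: relabel each box from the labels of nearby boxes.

    A box flips only when at least 75% (or at most 25%) of its neighbors
    within `radius` are text; isolated boxes keep their current label.
    Updates are synchronous and stop once the labels no longer change.
    """
    labels = list(labels)
    for _ in range(max_iters):
        updated = []
        for i, box in enumerate(boxes):
            cx, cy = box.center
            votes = [
                labels[j]
                for j, other in enumerate(boxes)
                if j != i
                and hypot(other.center[0] - cx, other.center[1] - cy) <= radius
            ]
            if not votes:
                updated.append(labels[i])
                continue
            text_fraction = sum(votes) / len(votes)
            if text_fraction >= 0.75:
                updated.append(True)
            elif text_fraction <= 0.25:
                updated.append(False)
            else:
                updated.append(labels[i])
        if updated == labels:
            break
        labels = updated
    return labels


# Made-up page: three word-sized boxes on one line, one overly tall box
# on the same line, and one isolated speck far away.
boxes = [
    Box(100, 100, 50, 20),
    Box(160, 100, 60, 20),
    Box(230, 100, 40, 20),
    Box(290, 100, 45, 70),
    Box(900, 900, 3, 3),
]
initial = [rule_label(b) for b in boxes]
refined = refine(boxes, initial)
```

In this toy input the tall box fails the geometric rule but is relabeled as text because its neighbors are text, while the isolated speck stays noise.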

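
Illustrative note on the abstract: the active-learning loop it describes, in which the learner asks a human to label the examples it finds most informative, might look roughly like the following pool-based sketch. It is not the thesis's system. The nearest-centroid classifier, the margin-based uncertainty measure, the `oracle` callback standing in for the human annotator, and the two-dimensional synthetic "page features" are all assumptions chosen to keep the example self-contained.

```python
import random


def centroids(labeled):
    """Mean feature vector per label, from (features, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, value in enumerate(x):
            acc[k] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def predict(cents, x):
    """Assign x to the label of the nearest centroid."""
    return min(cents, key=lambda y: dist(cents[y], x))


def margin(cents, x):
    """Gap between the two nearest centroids; a small gap means 'uncertain'.

    Requires at least two labels to be present in the labeled set.
    """
    d = sorted(dist(c, x) for c in cents.values())
    return d[1] - d[0]


def active_learn(pool, oracle, seed_set, budget):
    """Query the oracle `budget` times, each time on the least certain example."""
    labeled = list(seed_set)
    pool = list(pool)
    for _ in range(budget):
        if not pool:
            break
        cents = centroids(labeled)
        query = min(pool, key=lambda x: margin(cents, x))
        pool.remove(query)
        labeled.append((query, oracle(query)))
    return centroids(labeled), labeled


# Synthetic stand-in for page-level features of two fonts (hypothetical data).
rng = random.Random(0)


def sample(cx, cy, n):
    return [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)) for _ in range(n)]


truth = {p: "font_A" for p in sample(0, 0, 30)}
truth.update({p: "font_B" for p in sample(10, 10, 30)})
seed_set = [((0.0, 0.0), "font_A"), ((10.0, 10.0), "font_B")]

# The dictionary lookup plays the role of the human annotator.
model, labeled = active_learn(list(truth), truth.__getitem__, seed_set, budget=5)

held_out = [(p, "font_A") for p in sample(0, 0, 20)]
held_out += [(p, "font_B") for p in sample(10, 10, 20)]
accuracy = sum(predict(model, x) == y for x, y in held_out) / len(held_out)
```

The design point is only that labels are requested where the current model is least sure, so the labeling budget is spent on the examples most likely to change the decision boundary.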
