dc.contributor.advisor: Liu, Jyh-Charn
dc.creator: Wang, Zelun
dc.date.accessioned: 2021-05-06T22:58:32Z
dc.date.available: 2022-12-01T08:18:42Z
dc.date.created: 2020-12
dc.date.issued: 2020-10-20
dc.date.submitted: December 2020
dc.identifier.uri: https://hdl.handle.net/1969.1/192894
dc.description.abstract: In the era of digitization, the vast volume of scientific publications has become readily accessible to readers. With the help of information retrieval technologies, a reader can conveniently locate an existing publication by typing only a few keywords into a search engine. However, existing technologies cannot be directly applied to the contents of many scientific publications. This is due to the limitations of PDF, the de facto standard format for scientific publications today. Being a layout-based graphical format, PDF does not offer easy access to its fine-grained contents. In this dissertation, we introduce a PDF content extraction and recognition system to bridge the gap. The system focuses on extracting the crucial elements of scientific publications, including text, math expressions, figures, and tables, which carry most of the technical substance. In building the system, we investigated four specific problems. Firstly, we designed a set of algorithms to locate math expressions (MEs) in PDF documents, where they are often blended into the body text. These algorithms calculate the ME likelihood of each PDF object based on the PDF font information and reduce fragmented detections using a bigram regularization model. In addition to the algorithm development, we released a new dataset for the research community. Secondly, we proposed a deep neural network to recognize math expressions and produce their LaTeX markup. We used an encoder-decoder neural architecture in which the encoder takes images as inputs and the decoder generates LaTeX tokens as outputs. We also designed a sequence-level objective function to train the neural network in an end-to-end fashion, which effectively enforced the grammar-level correctness of the predicted LaTeX sequences. Thirdly, we developed the PDF2LaTeX OCR system, which recognizes entire PDF pages of mixed text and MEs. In the backend, we implemented machine learning algorithms to segment and label the contents, and applied the neural translators to convert page images into their LaTeX sources. Finally, we integrated the PDF2LaTeX system with two existing figure and table extraction tools, which enables the system to process a much wider range of scientific documents. For demonstration, we developed a graphical user interface that lets readers conveniently interact with the contents of PDF pages. [en]
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: PDF [en]
dc.subject: Math expressions [en]
dc.subject: Machine learning [en]
dc.title: Content Extraction and Recognition in Scientific Publications [en]
dc.type: Thesis [en]
thesis.degree.department: Computer Science and Engineering [en]
thesis.degree.discipline: Computer Science [en]
thesis.degree.grantor: Texas A&M University [en]
thesis.degree.name: Doctor of Philosophy [en]
thesis.degree.level: Doctoral [en]
dc.contributor.committeeMember: Shipman, Frank
dc.contributor.committeeMember: Huang, Ruihong
dc.contributor.committeeMember: Ferris, Thomas
dc.type.material: text [en]
dc.date.updated: 2021-05-06T22:58:32Z
local.embargo.terms: 2022-12-01
local.etdauthor.orcid: 0000-0002-1882-2526

