Content Extraction and Recognition in Scientific Publications

Wang, Zelun

Date

2020-10-20

Abstract

In the era of digitization, the vast volume of scientific publications has become readily accessible to readers. With the help of information retrieval technologies, a reader can conveniently locate an existing publication by typing only a few keywords into a search engine. However, existing technologies cannot be directly applied to the contents of many scientific publications. This is due to the limitations of the PDF format, the de facto standard for scientific publications. Being a layout-based graphical format, PDF does not offer easy access to its fine-grained contents. In this dissertation, we introduce a PDF content extraction and recognition system to bridge this gap. The system focuses on extracting the crucial elements of scientific publications, including text, math expressions, figures, and tables, which carry most of the technical substance. We investigated four specific problems. First, we designed a set of algorithms to locate math expressions (MEs) in PDF documents, where they are often blended into the body text. These algorithms calculate the ME likelihood of each PDF object based on the PDF font information and reduce fragmented detections using a bigram regularization model. In addition to the algorithm development, we released a new dataset for the research community. Second, we proposed a deep neural network that recognizes math expressions and produces their LaTeX markup. We used an encoder-decoder neural architecture in which the encoder takes images as inputs and the decoder generates LaTeX tokens as outputs. We also designed a sequence-level objective function to train the neural network in an end-to-end fashion, which effectively enforced the grammar-level correctness of the predicted LaTeX sequences. Third, we developed the PDF2LaTeX OCR system, which recognizes entire PDF pages of mixed text and MEs. In the backend, we implemented machine learning algorithms to segment and label the contents, and applied the neural translators to convert page images into their LaTeX sources. Finally, we integrated the PDF2LaTeX system with two existing figure and table extraction tools, which enables the system to process a much wider range of scientific documents. For demonstration, we developed a graphical user interface for readers to conveniently interact with the contents on PDF pages.
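
As a rough illustration of how a bigram model can reduce fragmented ME detections, the sketch below smooths per-object ME likelihoods with a two-state Viterbi decoder. It is not the dissertation's algorithm, which the abstract does not specify. The function name `smooth_labels`, the `stay_prob` parameter, and the use of a single "stay in the same state" probability as the bigram term are all assumptions made for this example.

```python
import math


def smooth_labels(me_likelihoods, stay_prob=0.8):
    """Label each object as text (0) or math (1) with Viterbi decoding.

    me_likelihoods: per-object probability of being part of an ME
        (hypothetical input; in practice it might come from font cues).
    stay_prob: assumed bigram probability that two neighbouring objects
        share the same label.
    """
    if not me_likelihoods:
        return []

    eps = 1e-9  # guards against log(0)
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)

    def emission(p, state):
        q = p if state == 1 else 1.0 - p
        return math.log(max(q, eps))

    # scores[s] is the best log-score of any label path ending in state s.
    scores = [emission(me_likelihoods[0], s) for s in (0, 1)]
    backpointers = []

    for p in me_likelihoods[1:]:
        new_scores, pointers = [], []
        for s in (0, 1):
            candidates = [
                scores[prev] + (log_stay if prev == s else log_switch)
                for prev in (0, 1)
            ]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            new_scores.append(candidates[best_prev] + emission(p, s))
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = 0 if scores[0] >= scores[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(smooth_labels([0.1, 0.9, 0.4, 0.9, 0.1]))
```

With these assumed settings, thresholding each likelihood at 0.5 independently would split the middle three objects into two separate detections. The bigram term makes label switches costly, so the low-scoring middle object is absorbed into one contiguous math span.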

Citation

Wang, Zelun (2020). Content Extraction and Recognition in Scientific Publications. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/192894.