MECA: Mathematical Expression Based Post Publication Content Analysis

Wang, Xing

dc.contributor.advisor	Liu, Jyh-Charn
dc.creator	Wang, Xing
dc.date.accessioned	2019-01-23T19:58:08Z
dc.date.available	2020-12-01T07:33:10Z
dc.date.created	2018-12
dc.date.issued	2018-11-29
dc.date.submitted	December 2018
dc.identifier.uri	https://hdl.handle.net/1969.1/174449
dc.description.abstract	Mathematical expressions (MEs) are critical abstractions in technical publications. Although the volume of technical publications grows every year, few ME-centric applications have been developed, owing to the steep gap between the typesetting data in post-publication digital documents and their high-level technical semantics. As publication accelerates, word-based information analysis technologies are inadequate for helping users discover, organize, and interrelate technical work efficiently and effectively. This dissertation presents a modeling framework and associated algorithms, called the mathematical-centered post-publication content analysis (MECA) system, which addresses several critical issues in building a layered solution architecture for the recovery of high-level technical information. Overall, MECA consists of four layers of modeling work, starting with the extraction of MEs from Portable Document Format (PDF) files. Specifically, a weakly supervised sequential typesetting Bayesian model is developed, using a concise font-value-based feature space to infer whether each space-separated rendering unit is an ME or a word. A Markov Random Field (MRF) model is designed to merge and correct the MEs identified from the rendering units, since large MEs are otherwise prone to fragmentation. At the next layer, MECA aims at the recovery of ME semantics. The first step is ME layout analysis, which disambiguates layout structures using a Content-Constrained Spatial (CCS) global inference model to overcome local errors. It achieves high accuracy at low computing cost by using a parametric lognormal model for the feature distribution of typographic systems. The ME layout is then parsed into ME semantics with a three-phase processing workflow that overcomes a variety of semantic ambiguities.
In the first phase, the ME layout is linearized into a token sequence, from which the abstract syntax tree (AST) is constructed in the second phase using a probabilistic context-free grammar. In the third phase, tree rewriting transforms the AST into ME objects. Building upon the two layers of ME extraction and semantics modeling work, we next explore one of the bonding relationships between words and MEs: ME declarations, where the words and MEs are respectively the qualitative and quantitative (QuQn) descriptors of technical concepts. Conventional low-level PoS tagging and parsing tools perform poorly on this type of mixed word-ME (MWM) sentence. We therefore develop an MWM processing toolkit. A semi-automated, weakly supervised framework is employed to mine declaration templates from a large amount of unlabeled data so that the templates can be used to detect ME declarations. On the basis of the three low-level content extraction and prediction solutions, the MECA system can extract MEs, interpret their mathematical semantics, and identify their bonding declaration words. By analyzing the dependencies among these elements in a paper, we can construct a QuQn map, which essentially represents the reasoning flow of the paper. Three case studies are conducted for QuQn map applications: differential content comparison of papers, publication trend generation, and interactive mathematical learning. Outcomes from these studies suggest that MECA is a highly practical content analysis technology based on a theoretically sound framework. Much remains to be expanded and improved upon for the next generation of deep content analysis solutions.	en
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.subject	mathematical expression	en
dc.subject	information extraction	en
dc.subject	parsing	en
dc.subject	declaration	en
dc.subject	QuQn map	en
dc.title	MECA: Mathematical Expression Based Post Publication Content Analysis	en
dc.type	Thesis	en
thesis.degree.department	Computer Science and Engineering	en
thesis.degree.discipline	Computer Science	en
thesis.degree.grantor	Texas A & M University	en
thesis.degree.name	Doctor of Philosophy	en
thesis.degree.level	Doctoral	en
dc.contributor.committeeMember	Ioerger, Thomas
dc.contributor.committeeMember	Huang, Ruihong
dc.contributor.committeeMember	Duffield, Nick
dc.type.material	text	en
dc.date.updated	2019-01-23T19:58:09Z
local.embargo.terms	2020-12-01
local.etdauthor.orcid	0000-0002-0950-0816

Files in this item

Name: WANG-DISSERTATION-2018.pdf
Size: 7.187Mb
Format: PDF


This item appears in the following Collection(s)

Electronic Theses, Dissertations, and Records of Study (2002– )
Texas A&M University Theses, Dissertations, and Records of Study (2002– )
