Show simple item record

dc.contributor.advisorLiu, Jyh-Charn
dc.creatorWang, Xing
dc.date.accessioned2019-01-23T19:58:08Z
dc.date.available2020-12-01T07:33:10Z
dc.date.created2018-12
dc.date.issued2018-11-29
dc.date.submittedDecember 2018
dc.identifier.urihttps://hdl.handle.net/1969.1/174449
dc.description.abstractMathematical expressions (ME) are critical abstractions for technical publications. While the sheer volume of technical publications grows in time, few ME centric applications have been developed due to the steep gap between the typesetting data in post-publication digital documents and the high-level technical semantics. With the acceleration of the technical publications every year, word-based information analysis technologies are inadequate to enable users in discovery, organizing, and interrelating technical work efficiently and effectively. This dissertation presents a modeling framework and the associated algorithms, called the mathematical-centered post-publication content analysis (MECA) system to address several critical issues to build a layered solution architecture for recovery of high-level technical information. Overall, MECA is consisted of four layers of modeling work, starting from the extraction of MEs from Portable Document Format (PDF) files. Specifically, a weakly-supervised sequential typesetting Bayesian model is developed by using a concise font-value based feature space for Bayesian inference of ME vs. words for the rendering units separated by space. A Markov Random Field (MRF) model is designed to merge and correct the MEs identified from the rendering units, which are otherwise prone to fragmentation of large MEs. At the next layer, MECA aims at the recovery of ME semantics. The first step is the ME layout analysis to disambiguate layout structures based on a Content-Constrained Spatial (CCS) global inference model to overcome local errors. It achieves high accuracy at low computing cost by a parametric lognormal model for the feature distribution of typographic systems. The ME layout is parsed into ME semantics with a three-phase processing workflow to overcome a variety of semantic ambiguities. In the first phase, the ME layout is linearized into a token sequence, upon which the abstract syntax tree (AST) is constructed in the second phase using probabilistic context-free grammar. Tree rewriting will transform the AST into ME objects in the third phase. Built upon the two layers of ME extraction and semantics modeling work, next we explore one of the bonding relationships between words and MEs: ME declarations, where the words and MEs are respectively the qualitative and quantitative (QuQn) descriptors of technical concepts. Conventional low-level PoS tagging and parsing tools have poor performance in the processing of this type of mixed word-ME (MWM) sentences. As such, we develop an MWM processing toolkit. A semi-automated weakly-supervised framework is employed for mining of declaration templates from a large amount of unlabeled data so that the templates can be used for the detection of ME declarations. On the basis of the three low-level content extraction and prediction solutions, the MECA system can extract MEs, interpret their mathematical semantics, and identify their bonding declaration words. By analyzing the dependency among these elements in a paper, we can construct a QuQn map, which essentially represents the reasoning flow of a paper. Three case studies are conducted for QuQn map applications: differential content comparison of papers, publication trend generation, and interactive mathematical learning. Outcomes from these studies suggest that MECA is a highly practical content analysis technology based on a theoretically sound framework. Much more can be expanded and improved upon for the next generation of deep content analysis solutions.en
dc.format.mimetypeapplication/pdf
dc.language.isoen
dc.subjectmathematical expressionen
dc.subjectinformation extractionen
dc.subjectparsingen
dc.subjectdeclarationen
dc.subjectQuQn mapen
dc.titleMECA: Mathematical Expression Based Post Publication Content Analysisen
dc.typeThesisen
thesis.degree.departmentComputer Science and Engineeringen
thesis.degree.disciplineComputer Scienceen
thesis.degree.grantorTexas A & M Universityen
thesis.degree.nameDoctor of Philosophyen
thesis.degree.levelDoctoralen
dc.contributor.committeeMemberIoerger, Thomas
dc.contributor.committeeMemberHuang, Ruihong
dc.contributor.committeeMemberDuffield, Nick
dc.type.materialtexten
dc.date.updated2019-01-23T19:58:09Z
local.embargo.terms2020-12-01
local.etdauthor.orcid0000-0002-0950-0816


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record