Modeling of Reasoning Flows in Scientific Publications

Lin, Jason

Date

2020-04-16

Author

Lin, Jason

Abstract

Mathematical language plays an essential role in conceptualizing the technical content of scientific publications. It combines words, symbols, and rules to build any sophisticated technical discussion. Existing technologies can recognize mathematical objects (MOs) in digital documents and can use MOs and keywords to locate relevant resources. However, very few applications have succeeded at computer-based content analysis, because the boundaries and semantics of technical content are obscured. In this dissertation, we introduce the concept of the reasoning block (RB) to mimic the divide-and-conquer strategy of the human writing and reading process. The RB model develops foundational MO-based solutions to the challenge of reversing the original linear description of a paper back into its logical, non-linear structure. A system model requires annotations of both constraint expressions and textual declarations to improve the mapping between problem settings and physical semantics. These two components highlight the information readers need in order to understand the system model proposed in a paper. Reliable indicators such as mathematical symbols, stop words, and punctuation are used as features to distinguish constraint expressions from all other MOs. We investigate both a greedy approach based on local optima and a probabilistic approach based on Bayes' theorem. Mining the textual declarations of MOs requires overcoming the challenges of tagging, chunking, and pairing in sentences that mix words and MOs (MWM). We propose a second-order hidden Markov model for tagging MWM sentences and a frequent pattern mining toolkit for chunking them. The final pairing of MOs with their declarations depends on three layers of information (spatial, semantic, and syntactic) about the intermediate tokens that connect them.
Finally, these analytical products are integrated to transform each publication into a hierarchical structure, the MO reasoning (MOR) graph, which consists of RBs arranged in logical flows. Redundant MOs and their dependencies are removed, keeping only the minimum information required to cover all relations among MOs and words. The MOR graph serves as the technical essence of a publication and is used to discover new forms of document fingerprint based on the different writing styles of various domains.
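
As an illustration of the probabilistic approach mentioned in the abstract, the following is a minimal sketch of a naive Bayes classifier that separates constraint expressions from other MOs using symbol, stop-word, and punctuation cues. The abstract does not give the actual feature set, model, or training data, so the symbol list, cue words, function names, and toy samples here are all assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical cue lists: illustrative assumptions, not the dissertation's features.
RELATION_SYMBOLS = ("<", ">", "=", "\\leq", "\\geq", "\\in")
CUE_WORDS = ("where", "for", "with", "subject")

def extract_features(mo, prev_word, next_char):
    """Turn an MO and its immediate context into three binary features."""
    return (
        "has_relation=" + str(any(s in mo for s in RELATION_SYMBOLS)),
        "prev_is_cue=" + str(prev_word.lower() in CUE_WORDS),
        "followed_by_punct=" + str(next_char in (",", ".", ";")),
    )

def train(samples):
    """Count labels and per-label feature occurrences from (features, label) pairs."""
    label_counts = Counter()
    feature_counts = {}
    for feats, label in samples:
        label_counts[label] += 1
        feature_counts.setdefault(label, Counter()).update(feats)
    return label_counts, feature_counts

def classify(feats, label_counts, feature_counts):
    """Pick the label maximizing log P(label) + sum of log P(feature | label)."""
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        for f in feats:
            # Add-one smoothing; each binary feature has two possible values.
            score += math.log((feature_counts[label][f] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (made up for this sketch).
samples = [
    (extract_features("x \\leq 5", "where", ","), "constraint"),
    (extract_features("0 < t < T", "for", "."), "constraint"),
    (extract_features("f(x)", "function", " "), "other"),
    (extract_features("\\alpha", "parameter", " "), "other"),
]
label_counts, feature_counts = train(samples)

print(classify(extract_features("y \\geq 0", "with", ";"), label_counts, feature_counts))
print(classify(extract_features("g", "the", " "), label_counts, feature_counts))
```

The independence assumption between features keeps the sketch short. A real system would need a far richer feature set and an annotated corpus, neither of which is described in this record.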

Citation

Lin, Jason (2020). Modeling of Reasoning Flows in Scientific Publications. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/191745.