Exploiting Semantics from Widely Available Ontologies to Aid the Model Building Process

Janpuangtong, Sasin

Date

2019-11-25

Abstract

This dissertation attempts to address the changing needs of data science and analytics: making it easier to produce accurate models, and thereby opening up opportunities and perspectives for novices to make sense of existing data. This work aims to incorporate the semantics of data in addressing classical machine learning problems, which is one way to tame the deluge of data. The increased availability of data and the existence of easy-to-use procedures for regression and classification in commodity software allow anyone to search for correlations among a large set of variables with scant regard for their meaning. Consequently, people tend to use data indiscriminately, leading to the practice of data dredging. It is easy to use sophisticated tools to produce specious models, which generalize poorly and may lead to wrong conclusions. Despite much effort having been placed on advancing learning algorithms, current tools do little to shield people from using data in a semantically lax fashion. By examining the entire model building process and supplying semantic information derived from high-level knowledge in the form of an ontology, the machine can assist in exercising discretion to help the model builder avoid the pitfalls of data dredging. This work introduces a metric, called conceptual distance, to incorporate semantic information into the model building process. The conceptual distance is shown to be practically computable from large-scale existing ontologies. This metric is exploited in feature selection to enable a machine to take the semantics of features into consideration when choosing them to build a model. Experiments with ontologies and real-world datasets show that this metric performs comparably to traditional data-driven measures in selecting a feature subset, despite using only the labels of features, not their associated measurements.
Further, a new end-to-end model building process is developed that uses the conceptual distance as a guide to explore an ontological structure and retrieve relevant features automatically, making it convenient for a novice to build a semantically pertinent model. Experiments show that the proposed model building process can help a user produce a model with performance comparable to that of a model built by a domain expert. This work offers a tool to help the common man battle the hazard of data dredging that comes from the indiscriminate use of data. The tool yields models that generalize better and are easy to interpret, leading to better decisions or implications.

Citation

Janpuangtong, Sasin (2019). Exploiting Semantics from Widely Available Ontologies to Aid the Model Building Process. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/188752.