dc.contributor.advisor: Moreno-Centeno, Erick
dc.creator: Sung, Kisuk
dc.date.accessioned: 2016-09-16T14:21:49Z
dc.date.available: 2018-08-01T05:57:27Z
dc.date.created: 2016-08
dc.date.issued: 2016-07-14
dc.date.submitted: August 2016
dc.identifier.uri: https://hdl.handle.net/1969.1/157815
dc.description.abstract: Data mining is an analytic process for discovering systematic relationships between variables and for finding patterns in data. Using those findings, data mining can create predictive models (e.g., target variable forecasting, label classification) or identify different groups within data (e.g., clustering). The principal objective of this dissertation is to develop data mining algorithms that outperform conventional data mining techniques on problems in the social and healthcare sciences. Toward this objective, this dissertation develops two data mining techniques, each of which addresses the limitations of a conventional data mining technique when applied in these contexts. The first part (Part I) of this dissertation addresses the problem of identifying important factors that promote or hinder population growth. Previous studies of this problem included variables (input factors) without considering the statistical dependence among them; as a result, most previous studies exhibit multicollinearity among the input variables. We propose a novel methodology that, even in the presence of multicollinearity among input factors, is able to (1) identify significant factors affecting population growth and (2) rank these factors according to their level of influence on population growth. To measure the level of influence of each input factor on population growth, the proposed method combines decision tree clustering and Cohen’s d index. We applied the proposed method to a real county-level United States dataset and determined the level of influence of an extensive list of input factors on population growth. Among other findings, we show that poverty ratio is a highly important factor for population growth, whereas no previous study found poverty ratio to be a significant factor because of its strong linear relationship with other input factors.
The second part (Part II) of this dissertation proposes a classification method for imbalanced data, that is, data in which the majority class has significantly more instances than the minority class. The specific problem addressed is that conventional classification methods detect the minority class poorly on imbalanced datasets, since they tend to classify the vast majority of the test instances as majority instances. To address this problem, we developed a guided undersampling method that combines two instance-selecting techniques (ensemble outlier filtering and normalized-cut sampling) in order to obtain a clean and well-represented subset of the original training instances. Our proposed imbalanced-data classification method uses the guided undersampling method to select the training data and then applies support vector machines to the sampled data in order to construct the classification model (i.e., decide the final class boundary). Our computational results show that the proposed imbalanced-data classification method outperforms several state-of-the-art imbalanced-data classification methods, including cost-sensitive, sampling, and synthetic data generation approaches, on eleven open datasets, most of them related to healthcare sciences. [en]
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Population Growth [en]
dc.subject: Multicollinearity [en]
dc.subject: Decision Tree [en]
dc.subject: Cohen’s d [en]
dc.subject: Guided Undersampling [en]
dc.subject: Imbalanced Data Classification [en]
dc.subject: Ensemble Outlier Filtering [en]
dc.subject: Normalized-cut Sampling [en]
dc.title: New Data Mining Techniques for Social and Healthcare Sciences [en]
dc.type: Thesis [en]
thesis.degree.department: Industrial and Systems Engineering [en]
thesis.degree.discipline: Industrial Engineering [en]
thesis.degree.grantor: Texas A & M University [en]
thesis.degree.name: Doctor of Philosophy [en]
thesis.degree.level: Doctoral [en]
dc.contributor.committeeMember: Ding, Yu
dc.contributor.committeeMember: Matarrita-Cascante, David
dc.contributor.committeeMember: Yates, Justin
dc.type.material: text [en]
dc.date.updated: 2016-09-16T14:21:49Z
local.embargo.terms: 2018-08-01
local.etdauthor.orcid: 0000-0001-8493-983X
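
The abstract names Cohen’s d as the measure of each factor's influence in Part I. As a reading aid, the sketch below computes the standard pooled-standard-deviation form of Cohen’s d for two groups of observations. It is a minimal illustration only: the function name `cohens_d` and the sample values are assumptions made here, and the record does not say how the dissertation pairs the index with decision tree clustering, so none of that is reproduced.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups (pooled-SD form).

    Hypothetical helper for illustration; not taken from the dissertation.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")

    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")

    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up values: group means differ by 1 and the pooled SD is 2.
effect = cohens_d([2, 4, 6], [1, 3, 5])
print(effect)
```

A larger absolute value means the two groups are further apart relative to their spread; the sign only reflects the order in which the groups are passed.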

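The problem stated for Part II, that a classifier can label nearly everything as the majority class and still look accurate, can be shown with a few lines of arithmetic. The sketch below uses made-up labels (95 majority, 5 minority) and plain random undersampling as a simple stand-in; it is not the guided undersampling method of the dissertation, whose ensemble outlier filtering and normalized-cut sampling steps are not detailed in this record. All function names here are assumptions for illustration.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` instances that were predicted as `positive`."""
    total = sum(1 for t in y_true if t == positive)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return hits / total if total else 0.0


def random_undersample(labels, majority, seed=0):
    """Keep every minority index plus an equal-sized random sample of majority indices.

    Plain random undersampling, used here only as a baseline illustration.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = rng.sample(majority_idx, min(len(majority_idx), len(minority_idx)))
    return sorted(minority_idx + kept)


# Made-up imbalanced labels: 0 is the majority class, 1 the minority class.
labels = [0] * 95 + [1] * 5
always_majority = [0] * len(labels)  # a classifier that never predicts the minority

print(accuracy(labels, always_majority))   # high accuracy despite being useless
print(recall(labels, always_majority, 1))  # no minority instance is detected

balanced_idx = random_undersample(labels, majority=0)
print(len(balanced_idx), sum(labels[i] for i in balanced_idx))
```

Random undersampling balances the classes but discards majority instances blindly, which is the kind of limitation a guided selection of training instances is meant to address.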
