dc.contributor.advisor: Castell-Perez, Elena
dc.creator: Stanosheck, Jacob Alan
dc.date.accessioned: 2023-09-18T17:11:35Z
dc.date.created: 2022-12
dc.date.issued: 2022-12-12
dc.date.submitted: December 2022
dc.identifier.uri: https://hdl.handle.net/1969.1/198729
dc.description.abstract: Large-scale foodborne illness outbreaks may occur when there is cross-contamination among fresh produce during the washing stage of processing. This study evaluated the effectiveness of three oversampling methods, namely the synthetic minority oversampling technique (SMOTE), adaptive synthetic oversampling technique (ADASYN), and safe-level synthetic minority oversampling technique (SLSMOTE), together with several preprocessing methods (linear-based feature selection, normalization and standardization of data, and outlier detection with replacement by the mean value) on wash water data. These methods were used to improve machine learning model accuracy, sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve (AUC) when predicting the presence of Escherichia coli MG1655 in lab-collected spinach wash water. Data from two previous studies were used as a base dataset from which the oversampling methods (SMOTE, ADASYN, and SLSMOTE) generated more balanced datasets. Each of the oversampled datasets was then used to train three machine learning (ML) models: random forest, support vector machine (SVM), and binomial logistic regression. All data were centered and standardized based on the training dataset for each oversampling method, and outliers in the training data were replaced with feature mean values; the models were also tested without these preprocessing steps. The models' hyperparameters were tuned based on a subset of testing data using 10-fold cross-validation. Next, the ML models were trained using the whole datasets and tested for accuracy on newly collected water samples from spinach wash water. Several methods were used to wash spinach leaves, the wash water was collected, and the presence of Escherichia coli MG1655 was established. These experimental data were used to evaluate the effectiveness of the oversampling methods by comparing accuracy, specificity, sensitivity, and AUC.
SMOTE and ADASYN showed the best performance of the oversampling methods. The SMOTE random forest model that did not use preprocessed data achieved an accuracy of 90.0%, sensitivity of 93.8%, specificity of 87.5%, and AUC of 98.2%. The ADASYN model showed an accuracy of 86.5%, sensitivity of 87.5%, specificity of 83.3%, and AUC of 92.4%. The SVM and random forest models showed a significant improvement (p < 0.05) using SMOTE and ADASYN compared to the non-oversampled versions of the same models when no data preprocessing was used. Preprocessing of the data significantly improved (p < 0.05) accuracy and specificity for the binomial logistic regression model but significantly decreased (p < 0.05) accuracy and specificity for the SVM and random forest models. The most important physicochemical feature for predicting the presence of Escherichia coli MG1655 in the lab-collected spinach wash water was the water conductivity. The random forest model demonstrated the highest accuracy and AUC compared to the SVM and binomial logistic regression models.
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Machine learning
dc.subject: Oversampling
dc.subject: E. coli
dc.subject: SVM
dc.subject: Random forest
dc.subject: logistic regression
dc.title: Prediction of Escherichia coli (E. coli MG1655) Contamination in Fresh Produce Wash Water Using Machine Learning Models and Oversampling Methods for Model Training Data
dc.type: Thesis
thesis.degree.department: Biological and Agricultural Engineering
thesis.degree.discipline: Biological and Agricultural Engineering
thesis.degree.grantor: Texas A&M University
thesis.degree.name: Master of Science
thesis.degree.level: Masters
dc.contributor.committeeMember: Moreira, Rosana
dc.contributor.committeeMember: Castillo, Alejandro
dc.contributor.committeeMember: King, Maria
dc.type.material: text
dc.date.updated: 2023-09-18T17:11:35Z
local.embargo.terms: 2024-12-01
local.embargo.lift: 2024-12-01
local.etdauthor.orcid: 0000-0003-4569-4520
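
The abstract in this record names SMOTE oversampling and reports accuracy, sensitivity, and specificity, but the record does not say which software or parameter settings the thesis used. As an illustration only, the sketch below shows the general idea behind SMOTE (interpolating between a minority-class sample and one of its nearest minority-class neighbours) and how the three reported metrics are derived from a confusion matrix. The function names `smote` and `confusion_metrics`, the neighbour count `k`, and the seed are assumptions made for this example and are not taken from the thesis.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples from a list of minority-class points.

    Illustrative sketch of the SMOTE idea, not the thesis implementation:
    each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the remaining minority samples by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of truly positive samples that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of truly negative samples correctly cleared.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }
```

For example, calling `smote(points, n_new=10, k=2)` on four made-up two-feature points returns ten new points that all lie between existing minority samples. In a workflow like the one the abstract describes, oversampling would be applied to the training data only, so that the newly collected test samples remain untouched.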

