dc.description.abstract | In this research, we developed a new deep-learning method for detecting and removing outliers from point-based data sets. We focused on tying the outlier detection procedure to the model-building process. Because outliers and inliers behave differently, we used model complexity as an indicator of outliers in a data set. In this context, the “complexity” of a model means the weight of non-zero edges in the model. This includes features of a model such as the number of layers and the number of nodes per layer.
Our proposed method of using model complexity to detect outliers consists of three steps. First, a model of low complexity (few layers or few nodes per layer) is built and trained on the data set, and its predicted value for each instance is recorded. Second, multiple neural network models with differing numbers of layers or nodes per layer are built, and the group of models sharing a specific number of layers that has the best average performance on the data set is identified. Performance in this context means the classification accuracy or the mean squared error of the models. Third, within that group, the model with the highest number of nodes per layer is selected, and its prediction for each instance is compared with the prediction of the low-complexity model from the first step. Instances for which the two models give different predictions are then labeled as outliers and removed.
Two factors must be noted when using this method. First, the lower the correlation between the attributes and the output values in a data set, the fewer outliers the method will detect. Second, the larger and more complex a data set becomes (for example, by having many attributes), the fewer outliers the method will find. | en |
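The selection and comparison steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the models have already been trained elsewhere, so it starts from their performance scores and per-instance predictions. The function names `pick_high_complexity_config` and `flag_outliers`, the `tol` parameter for regression outputs, and all numbers in the example are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pick_high_complexity_config(scores, higher_is_better=True):
    """Pick the (n_layers, n_nodes) configuration used for comparison.

    `scores` maps (n_layers, n_nodes) -> a performance value, e.g. accuracy
    (higher is better) or mean squared error (lower is better).
    Models are grouped by number of layers; the group with the best average
    score is chosen, and within it the largest number of nodes per layer.
    """
    by_depth = defaultdict(list)
    for (n_layers, _n_nodes), score in scores.items():
        by_depth[n_layers].append(score)
    avg = {depth: mean(vals) for depth, vals in by_depth.items()}
    chooser = max if higher_is_better else min
    best_depth = chooser(avg, key=avg.get)
    best_nodes = max(n for (d, n) in scores if d == best_depth)
    return best_depth, best_nodes


def flag_outliers(low_preds, high_preds, tol=None):
    """Return indices where the two models' predictions differ.

    With tol=None, predictions are compared for exact equality (class labels).
    With a numeric tol (an assumption for regression outputs), an instance is
    flagged when the absolute difference exceeds tol.
    """
    if len(low_preds) != len(high_preds):
        raise ValueError("prediction lists must have the same length")
    flagged = []
    for i, (a, b) in enumerate(zip(low_preds, high_preds)):
        differs = (a != b) if tol is None else (abs(a - b) > tol)
        if differs:
            flagged.append(i)
    return flagged


# Hypothetical accuracies for six trained models, keyed by (layers, nodes).
scores = {
    (1, 4): 0.80, (1, 16): 0.82,
    (2, 4): 0.88, (2, 16): 0.90,
    (3, 4): 0.85, (3, 16): 0.87,
}
best_config = pick_high_complexity_config(scores)

# Hypothetical class predictions of the low- and high-complexity models.
low_preds = [0, 1, 1, 0, 1]
high_preds = [0, 1, 0, 0, 0]
outlier_indices = flag_outliers(low_preds, high_preds)
```

In this made-up example the two-layer group has the best average accuracy, so the two-layer model with 16 nodes per layer is compared against the low-complexity model, and the instances at indices 2 and 4 are flagged for removal.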