Robust Model-free Variable Screening, Double-parallel Monte Carlo and Average Bayesian Information Criterion
Abstract
Big data analysis and high-dimensional data analysis are two popular and challenging topics in current statistical research. They bring many opportunities as well as many challenges. For big data, traditional methods are generally not efficient enough in either computing time or memory. For high-dimensional data, most traditional methods cannot be implemented, let alone maintain desirable properties such as consistency.
In this dissertation, three new strategies are proposed to address these issues. HZSIS is a robust model-free variable screening method that possesses the sure screening property in the ultrahigh-dimensional setting. It is based on the nonparanormal transformation and the Henze-Zirkler test. The numerical results indicate that, compared to existing methods, the proposed method is more robust to data generated from heavy-tailed distributions and/or complex models with interaction variables.
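To illustrate the transformation step only, the sketch below maps a variable to approximate standard-normal scores through its ranks. This is one common rank-based estimate of a nonparanormal transformation; the abstract does not specify the estimator used in HZSIS, so the `rank / (n + 1)` scaling and the function name `normal_scores` are assumptions, and the Henze-Zirkler statistic itself is not implemented here.

```python
from statistics import NormalDist

def normal_scores(values):
    """Map values to approximate standard-normal scores via their ranks.

    Assumes no ties. Dividing by n + 1 (an assumed convention) keeps the
    empirical CDF strictly inside (0, 1) so the normal quantile is finite.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices in ascending order
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = std_normal.inv_cdf(rank / (n + 1))
    return scores

x = [0.3, -1.2, 2.5, 0.9, -0.4]
scores = normal_scores(x)
```

Because the scores depend on the data only through ranks, any strictly increasing transformation of `x`, including one that produces heavy tails, yields the same scores. This is one plausible reason a screening statistic computed after such a transformation would be less sensitive to heavy-tailed marginals.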
Double Parallel Monte Carlo is a simple, practical and efficient MCMC algorithm for Bayesian analysis of big data. The proposed algorithm divides the big dataset into smaller subsets and provides a simple method to aggregate the subset posteriors to approximate the full data posterior. To further speed up computation, the proposed algorithm employs the population stochastic approximation Monte Carlo (Pop-SAMC) algorithm, a parallel MCMC algorithm, to simulate from each subset posterior. Since the proposed algorithm involves two levels of parallelism, data parallelism and simulation parallelism, it is called “Double Parallel Monte Carlo”. The validity of the proposed algorithm is justified both mathematically and numerically.
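The divide-and-aggregate idea can be sketched on a toy problem: estimating a Gaussian mean with known standard deviation under a flat prior. The sketch is not the dissertation's algorithm. Subset posteriors are sampled exactly rather than with Pop-SAMC, the subsets are processed sequentially rather than in parallel, and the aggregation rule is a size-weighted average of draws in the style of consensus Monte Carlo, swapped in because the abstract does not state the aggregation method. All function names are hypothetical.

```python
import random
import statistics

def subset_posterior_draws(subset, sigma, n_draws, rng):
    """Draw from the posterior of a Gaussian mean given one subset
    (flat prior, known sigma): N(subset mean, sigma^2 / subset size)."""
    centre = statistics.fmean(subset)
    scale = sigma / len(subset) ** 0.5
    return [rng.gauss(centre, scale) for _ in range(n_draws)]

def aggregate(draws_per_subset, sizes):
    """Combine subset draws, draw by draw, with weights proportional to
    subset size (an assumed rule, not the dissertation's)."""
    total = sum(sizes)
    n_draws = len(draws_per_subset[0])
    return [
        sum(n * draws[i] for n, draws in zip(sizes, draws_per_subset)) / total
        for i in range(n_draws)
    ]

rng = random.Random(0)
sigma = 2.0
data = [rng.gauss(1.5, sigma) for _ in range(4000)]
subsets = [data[i::4] for i in range(4)]  # split the data into 4 subsets
draws = [subset_posterior_draws(s, sigma, 2000, rng) for s in subsets]
combined = aggregate(draws, [len(s) for s in subsets])
```

In this toy setting the combined draws follow the full data posterior N(mean of all data, sigma^2 / N) exactly, which makes the sketch easy to check. For non-Gaussian posteriors a weighted average of this kind is only an approximation.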
Average Bayesian Information Criterion (ABIC) and its high-dimensional variant, Average Extended Bayesian Information Criterion (AEBIC), provide an innovative way to use posterior samples to conduct model selection. The consistency of this method is established for the high-dimensional generalized linear model under some sparsity and regularity conditions. The numerical results also indicate that, when the sample size is large enough, this method can accurately select the smallest true model with high probability.
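As a rough illustration of scoring models with posterior samples, the sketch below averages the log-likelihood over posterior draws and adds a BIC-type penalty. The abstract does not give the ABIC formula, so the form used here, -2 × (mean log-likelihood over draws) + (number of parameters) × log(n), is an assumption, as are the function names and the toy comparison between a fixed-mean and a free-mean Gaussian model.

```python
import math
import random
import statistics

def log_likelihood(data, mu):
    """Gaussian log-likelihood with unit variance."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - mu) ** 2 for y in data)

def averaged_bic(data, posterior_draws, n_params):
    """ASSUMED form: -2 * (mean log-likelihood over draws) + n_params * log(n)."""
    mean_ll = statistics.fmean(log_likelihood(data, mu) for mu in posterior_draws)
    return -2.0 * mean_ll + n_params * math.log(len(data))

rng = random.Random(1)
data = [rng.gauss(0.8, 1.0) for _ in range(200)]

# Model 0 fixes mu = 0 (no free parameter), so its "posterior" is a point mass.
score_fixed = averaged_bic(data, [0.0], 0)

# Model 1 has a free mean; under a flat prior its posterior is N(ybar, 1/n).
ybar = statistics.fmean(data)
draws = [rng.gauss(ybar, 1 / math.sqrt(len(data))) for _ in range(500)]
score_free = averaged_bic(data, draws, 1)

# The model with the smaller score is preferred.
best = "free mean" if score_free < score_fixed else "fixed mean"
```

Here the data are generated with a nonzero mean, so the free-mean model receives the lower score despite its penalty term.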
Subject
Variable selection
variable screening
ultrahigh dimensional data analysis
big data
parallel computing
MCMC
Citation
Xue, Jingnan (2017). Robust Model-free Variable Screening, Double-parallel Monte Carlo and Average Bayesian Information Criterion. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/187253.