Multivariate Analysis Applied to the California Health Interview Survey
Objective: Determine whether principal components analysis and multiple correspondence analysis are suitable dimension reduction techniques for the California Health Interview Survey, and identify which health risk behaviors, mental health factors, and demographic factors cluster together under k-medians clustering.
Background: Clustering and multivariate analysis techniques can characterize populations and sub-populations by grouping individuals according to their similarity to one another. These exploratory techniques, while widely accepted as valid within the scientific community, are less popular than other statistical methods and have not been applied in some settings where they could be useful. The UCLA Center for Health Policy Research’s annual California Health Interview Survey (CHIS) dataset is one such case where multivariate techniques could provide new insight. The survey contains information on thousands of randomly sampled Californians regarding health, income, and demographics, among other factors. This research project attempts to determine whether principal components analysis and multiple correspondence analysis are suitable dimension reduction techniques when applied to the CHIS dataset, and to describe in greater detail, both quantitatively and qualitatively, the differences and similarities between the health characteristics of California residents.
Methods: This study used data from 21,055 individuals interviewed by telephone for the 2016 California Health Interview Survey, the largest state-wide health survey in the U.S. Principal components analysis and multiple correspondence analysis were conducted to assess their usefulness when applied to health survey data. Concurrently, Gower k-medians clustering was used to identify distinct groupings of California residents. I then performed a chi-squared test to determine which variables were the most statistically significant in forming these clusters.
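To illustrate the dissimilarity measure that Gower-based clustering relies on, the following is a minimal sketch of the Gower distance between two mixed-type records. It is not the study's implementation: the function name `gower_distance`, the three example respondents, and their fields (age, smoker, insured) are hypothetical, and the clustering step itself is not shown. Numeric features contribute their absolute difference scaled by the observed range; categorical features contribute 0 on a match and 1 on a mismatch.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def gower_distance(
    a: Sequence[Value],
    b: Sequence[Value],
    ranges: Sequence[Optional[float]],
) -> float:
    """Gower dissimilarity between two records.

    ranges[i] holds the observed range of feature i when it is numeric,
    or None when feature i is categorical.
    """
    total = 0.0
    for x, y, feature_range in zip(a, b, ranges):
        if feature_range is None:
            # Categorical feature: simple mismatch indicator.
            total += 0.0 if x == y else 1.0
        elif feature_range > 0:
            # Numeric feature: absolute difference scaled to [0, 1].
            total += abs(x - y) / feature_range
        # A numeric feature with zero range contributes nothing.
    return total / len(ranges)


# Hypothetical respondents: (age, smoker, insured). Not CHIS data.
respondents = [
    (25.0, "no", "yes"),
    (65.0, "yes", "yes"),
    (45.0, "no", "no"),
]
ages = [r[0] for r in respondents]
ranges = [max(ages) - min(ages), None, None]

d01 = gower_distance(respondents[0], respondents[1], ranges)
d02 = gower_distance(respondents[0], respondents[2], ranges)
```

Because every feature contributes a value between 0 and 1, numeric and categorical survey items can be combined in one distance without a single large-scale variable dominating the result.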
Results: Principal components analysis reduced the initial 118 variables considered to 30, with the largest component explaining only 10.44% of the total variation in the data, suggesting that this technique is ill-suited to the CHIS. Multiple correspondence analysis, however, reduced the 88 categorical variables to 5, with the largest component accounting for 62.27% of the variation in the data. By applying Gower k-medians, I produced 3 distinct clusters of survey respondents and determined that access to specialized medical care is the most strongly clustered characteristic.
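As a rough illustration of how a chi-squared test can rank variables by their association with cluster membership, the sketch below computes the Pearson chi-squared statistic for a cluster-by-response contingency table. The function name `chi_squared_statistic`, the variable names, and all counts are hypothetical and are not taken from the CHIS results; the sketch assumes every row and column total is nonzero and does not compute a p-value.

```python
from typing import Dict, List, Sequence


def chi_squared_statistic(table: Sequence[Sequence[float]]) -> float:
    """Pearson chi-squared statistic for a contingency table.

    Rows are clusters and columns are response categories.
    Assumes every row total and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if cluster and response were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts for 3 clusters x (yes, no) responses.
tables: Dict[str, List[List[int]]] = {
    "hypothetical_variable_a": [[30, 10], [20, 20], [10, 30]],
    "hypothetical_variable_b": [[20, 20], [20, 20], [20, 20]],
}

statistics = {name: chi_squared_statistic(t) for name, t in tables.items()}

# Variables ordered from most to least associated with cluster membership.
ranking = sorted(statistics, key=statistics.get, reverse=True)
```

A larger statistic indicates that the response distribution differs more across clusters than independence would predict; comparing statistics directly in this way is only meaningful when the tables share the same dimensions and sample size.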
Ross, Aaron (2019). Multivariate Analysis Applied to the California Health Interview Survey. Undergraduate Research Scholars Program. Available electronically from