Robust Distributed Learning of Functional Data From Simulators through Data Sketching
Date
2024-05-01
Abstract
Realistic simulations are crucial for comprehending complex systems in climate and environmental studies. Yet, running sophisticated computational models across a wide range of input settings often overwhelms large computer systems. Statistical surrogate models, or emulators, play a vital role in efficiently exploring the simulator input space. Functional data models involving Gaussian processes (GPs) and their computationally efficient variants have become standard tools for achieving this goal. The conventional centralized processing of such models requires substantial computational and storage resources at the central server. To counter this, emerging distributed Bayesian learning frameworks partition raw data into shards and distribute computations of these shards across machines. While this strategy mitigates data storage costs and improves computation within each machine, concerns arise regarding the sensitivity of distributed inference to shard selection.
Motivated by the concept of data sketching in the literature, this article proposes an innovative alternative. Instead of creating data shards, our approach employs multiple random matrices to construct multiple random linear projections, or "random sketches," of the complete dataset. Posterior inference on functional data models is performed using random sketches on various machines in parallel. These individual inferences are then combined across machines at a central server. By aggregating inference across diverse random matrices, our approach proves resilient to the selection of data sketches, leading to the development of a novel robust distributed Bayesian learning approach.
An important advantage of our approach is its ability to maintain the privacy of sampling units, as the inference is based on random data sketches that do not allow the recovery of raw data.
We illustrate the significance of our approach through various simulated data examples in the realm of Bayesian distributed learning techniques. Finally, we demonstrate the performance of our proposed approach as an emulator with surrogates of the Sea, Lake, and Overland Surges from Hurricanes (SLOSH) simulator, a simulator of choice for government agencies.
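The sketch-then-combine workflow described in the abstract can be illustrated with a toy example. What follows is a minimal sketch under stated assumptions, not the authors' method: it substitutes a conjugate Bayesian linear regression for the functional GP model, assumes Gaussian sketching matrices with N(0, 1/n) entries, and combines the per-machine results by simply averaging posterior means. The function names and all parameter values are hypothetical.

```python
import numpy as np

def sketch_posterior(X, y, m, rng, noise_var=0.25, prior_var=10.0):
    """Posterior for a conjugate Bayesian linear regression fitted to one
    random sketch of (X, y). Stand-in model, not the paper's functional model."""
    n, p = X.shape
    # Assumed sketching matrix: m x n with i.i.d. N(0, 1/n) entries.
    S = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))
    Xs, ys = S @ X, S @ y  # a worker only ever sees these m sketched rows
    # Gaussian prior N(0, prior_var * I) on the coefficients.
    precision = Xs.T @ Xs / noise_var + np.eye(p) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ (Xs.T @ ys) / noise_var
    return mean, cov

def combine_means(posteriors):
    """Assumed combination rule: average the per-sketch posterior means."""
    return np.mean([mean for mean, _ in posteriors], axis=0)

# Toy data standing in for simulator output.
rng = np.random.default_rng(0)
n, p, m, n_sketches = 2000, 3, 200, 8
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Each loop iteration plays the role of one machine working on its own sketch.
posteriors = [sketch_posterior(X, y, m, rng) for _ in range(n_sketches)]
beta_hat = combine_means(posteriors)
```

Each worker receives only the m sketched rows rather than the n raw observations, which is the sense in which the abstract describes inference on sketches instead of on raw data. Averaging over several independent sketching matrices reduces the dependence of the result on any single random projection.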
Department
Statistics
Citation
Andros, Jacob; Guhaniyogi, Rajarshi; Francom, Devin; Pasqualini, Donatella (2024). Robust Distributed Learning of Functional Data From Simulators through Data Sketching. Available electronically from https://hdl.handle.net/1969.1/200959.