Title: High Dimensional Statistical Models: Applications to Climate
Author: Chatterjee, Soumyadeep
Type: Thesis or Dissertation
Date issued: September 2015
Date available: 2015-11-09
URI: https://hdl.handle.net/11299/175549
Description: University of Minnesota Ph.D. dissertation. September 2015. Major: Computer Science. Advisor: Arindam Banerjee. 1 computer file (PDF); ix, 103 pages.
Keywords: Climate Science; Machine Learning; Regression; Regularization; Statistical Modeling
Language: en

Abstract:

Recent years have seen enormous growth in the collection and curation of datasets in various domains, often involving thousands or even millions of variables. Examples include social networking websites, geophysical sensor networks, cancer genomics, climate science, and many more. In many applications, it is of prime interest to understand the dependencies between variables, so that predictive models can be designed from knowledge of those dependencies. However, traditional statistical methods, such as least squares regression, are often inapplicable for such tasks, since the available sample size is much smaller than the problem dimensionality. We therefore require new models and methods for statistical data analysis that provide provable estimation guarantees even in such high dimensional scenarios, together with efficient implementation and optimization routines. Statistical models satisfying both criteria will be important for solving prediction problems in many scientific domains.

High dimensional statistical models have attracted interest from both the theoretical and applied machine learning communities in recent years. Of particular interest are parametric models, which consider estimation of coefficient vectors when the sample size is much smaller than the dimensionality of the problem. Although most existing work focuses on analyzing sparse regression methods with $\ell_1$ norm regularizers, there exist other ``structured'' norm regularizers that encode more interesting structure in the sparsity induced on the estimated regression coefficients. In the first part of this thesis, we conduct a theoretical study of such structured regression methods. First, we prove statistical consistency of regression with the hierarchical tree-structured norm regularizer known as hiLasso. Second, we formulate a generalization of the popular Dantzig Selector for sparse linear regression to any norm regularizer, called the Generalized Dantzig Selector, and provide statistical consistency guarantees for its estimates. Further, we provide the first known results on non-asymptotic rates of consistency for the recently proposed $k$-support norm regularizer. Finally, we show that in the presence of measurement errors in covariates, the tools used to prove consistency in the noiseless setting are inadequate for establishing statistical consistency.
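For concreteness, the Generalized Dantzig Selector can be written in the following schematic form (the notation here is assumed for illustration, not quoted from the thesis): given a design matrix $X$, response $y$, a norm $R(\cdot)$, and its dual norm $R^{*}(\cdot)$, the estimator solves
\[
\hat{\theta} \;=\; \operatorname*{arg\,min}_{\theta}\; R(\theta) \quad \text{subject to} \quad R^{*}\big(X^{\top}(y - X\theta)\big) \;\le\; \lambda ,
\]
which recovers the classical Dantzig Selector when $R$ is the $\ell_1$ norm, since the dual of $\ell_1$ is $\ell_\infty$.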
In the second part of the thesis, we consider the application of regularized regression methods to statistical modeling problems in climate science. First, we apply the Sparse Group Lasso, a special case of hiLasso, to predictive modeling of land climate variables from measurements of atmospheric variables over oceans. Extensive experiments illustrate that structured sparse regression provides both better performance and more interpretable models than unregularized regression, and even than unstructured sparse regression methods. Second, we apply regularized regression methods to discovering stable factors for predictive modeling in climate. Specifically, we consider the problem of determining the dominant factors influencing winter precipitation over the Great Lakes region of the US. Using a sparse linear regression method, followed by random permutation tests, we mine stable sets of predictive features from a pool of possible predictors (a sketch of this procedure appears below). Some of the stable factors discovered through this process are shown to relate to known physical processes influencing precipitation over the Great Lakes.
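A minimal sketch of that mining step, assuming a standardized predictor matrix and using the ordinary Lasso with a permutation-based null threshold (the function name, the fixed regularization level, and the quantile rule are illustrative assumptions, not the thesis's exact procedure):

import numpy as np
from sklearn.linear_model import Lasso

def stable_features(X, y, alpha=0.1, n_perm=1000, q=0.95, seed=0):
    """Hypothetical sketch: keep features whose sparse-regression
    coefficients survive a random-permutation null.

    X : (n_samples, n_features) standardized predictor pool
    y : (n_samples,) response, e.g., winter precipitation
    """
    rng = np.random.default_rng(seed)
    # Sparse linear regression on the real data.
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    # Null distribution: largest |coefficient| when y is permuted,
    # destroying any real predictor-response relationship.
    null_max = np.empty(n_perm)
    for b in range(n_perm):
        fit = Lasso(alpha=alpha).fit(X, rng.permutation(y))
        null_max[b] = np.max(np.abs(fit.coef_))
    # Keep features exceeding the q-quantile of the null maxima.
    return np.flatnonzero(np.abs(coef) > np.quantile(null_max, q))

Thresholding against the maximum null coefficient guards against spurious selections across the whole predictor pool; repeating the fit over resampled data would give a stability-selection variant closer in spirit to mining "stable sets" of features.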