Browsing by Subject "Regression"
Now showing 1 - 7 of 7
Item: Algorithms for Semisupervised learning on graphs (2018-12) Flores, Mauricio
Laplacian regularization has been used extensively in a wide variety of semi-supervised learning tasks over the past fifteen years. In recent years, limitations of Laplacian regularization have been exposed, leading to the development of a general class of Lp-based Laplacian regularization models. We propose novel algorithms to solve the resulting optimization problem as the amount of unlabeled data increases to infinity while the amount of labeled data remains fixed and very small. We explore a practical application to recommender systems.

Item: Dimension reduction and prediction in large p regressions (2009-05) Adragni, Kofi Placid
A high-dimensional regression setting is considered with p predictors X = (X1, ..., Xp)^T and a response Y. The interest is in large p, possibly much larger than n, the number of observations. Three novel methodologies based on Principal Fitted Components models (PFC; Cook, 2007) are presented: (1) Screening by PFC (SPFC) for variable screening when p is excessively large, (2) Prediction by PFC (PPFC), and (3) Sparse PFC (SpPFC) for variable selection. SPFC uses a test statistic to detect all predictors marginally related to the outcome. We show that SPFC subsumes the Sure Independence Screening of Fan and Lv (2008). PPFC is a novel methodology for prediction in regression where p can be as large as or larger than n. PPFC assumes that X|Y has a normal distribution and applies to continuous response variables regardless of their distribution. It yields better predictive accuracy than current leading methods. We adapt Sparse Principal Components Analysis (Zou et al., 2006) to the PFC model to develop SpPFC. SpPFC performs variable selection as well as forward linear model methods such as the lasso (Tibshirani, 1996); moreover, it encompasses cases where the distribution of Y|X is non-normal or the predictors and the response are not linearly related.

Item: High Dimensional Statistical Models: Applications to Climate (2015-09) Chatterjee, Soumyadeep
Recent years have seen enormous growth in the collection and curation of datasets in various domains, often involving thousands or even millions of variables. Examples include social networking websites, geophysical sensor networks, cancer genomics, climate science, and many more. In many applications, it is of prime interest to understand the dependencies between variables, so that predictive models may be designed from knowledge of such dependencies. However, traditional statistical methods, such as least squares regression, are often inapplicable for such tasks, since the available sample size is much smaller than the problem dimensionality. We therefore require new models and methods for statistical data analysis that provide provable estimation guarantees even in such high dimensional scenarios. Further, we also require that such models admit efficient implementation and optimization routines. Statistical models that satisfy both criteria will be important for solving prediction problems in many scientific domains. High dimensional statistical models have attracted interest from both the theoretical and applied machine learning communities in recent years. Of particular interest are parametric models, which consider estimation of coefficient vectors in the scenario where the sample size is much smaller than the dimensionality of the problem. Although most existing work focuses on analyzing sparse regression methods using L1 norm regularizers, there exist other "structured" norm regularizers that encode more interesting structure in the sparsity induced on the estimated regression coefficients.
In the first part of this thesis, we conduct a theoretical study of such structured regression methods. First, we prove statistical consistency of regression with a hierarchical tree-structured norm regularizer known as hiLasso. Second, we formulate a generalization of the popular Dantzig Selector for sparse linear regression to any norm regularizer, called the Generalized Dantzig Selector, and provide statistical consistency guarantees for estimation. Further, we provide the first known results on non-asymptotic rates of consistency for the recently proposed k-support norm regularizer. Finally, we show that in the presence of measurement errors in covariates, the tools we use for proving consistency in the noiseless setting are inadequate for proving statistical consistency. In the second part of the thesis, we consider the application of regularized regression methods to statistical modeling problems in climate science. First, we consider the application of Sparse Group Lasso, a special case of hiLasso, for predictive modeling of land climate variables from measurements of atmospheric variables over oceans. Extensive experiments illustrate that structured sparse regression provides both better performance and more interpretable models than unregularized regression and even unstructured sparse regression methods. Second, we consider the application of regularized regression methods for discovering stable factors for predictive modeling in climate. Specifically, we consider the problem of determining the dominant factors influencing winter precipitation over the Great Lakes region of the US. Using a sparse linear regression method, followed by random permutation tests, we mine stable sets of predictive features from a pool of possible predictors. Some of the stable factors discovered through this process are shown to relate to known physical processes influencing precipitation over the Great Lakes.

Item: Multiple Regression in Industrial Organizational Psychology: Relative Importance and Model Sensitivity (2018-01) Semmel, Sarah
When evaluating research findings, it is important to examine what statistical methods were used to reach and support the stated conclusions. Regression is a common analysis in the Industrial/Organizational psychology literature, and researchers have debated how to interpret the standardized optimal weights produced by ordinary least squares (OLS) regression. Multiple methods for determining the relative importance of predictors in a regression model have been proposed, along with a variety of definitions of what is meant by predictor importance. Conversely, it has been shown that by slightly decreasing the model R2 obtained through OLS multiple regression, an infinite number of alternative weight vectors can be produced, calling into question the meaning of OLS weights when the alternative weights diverge from them. Articles published from 2003-2014 in the Journal of Applied Psychology, Academy of Management Journal, and Psychological Science that used OLS regression were reviewed. Regression was found to be used to answer questions on a wide variety of topics and to be interpreted in a multitude of ways in the I/O psychology and general psychology literature. The study found that different relative importance analyses can lead to different conclusions about which predictors are most important. Examining alternative weight vectors further calls into question conclusions drawn from optimal weights: for the majority of studies examined, alternative weight vectors were found that produced a different rank ordering of predictors with only a small loss in model fit. The findings in this paper highlight and reinforce the need for Industrial/Organizational psychologists to turn a critical eye on the interpretation of regression analyses, especially regression weights, in reaching substantive conclusions.

Item: Reverse engineering biological networks: computational approaches for modeling biological systems from perturbation data (2013-09) Kim, Yungil
A fundamental goal of systems biology is to construct molecule-level models that explain and predict cellular- or organism-level properties. A popular approach to this problem, enabled by recent developments in genomic technologies, is to make precise perturbations of an organism's genome, take measurements of some phenotype of interest, and use these data to "reverse engineer" a model of the underlying network. Even with the increasingly massive datasets produced by such approaches, this task is challenging because of the complexity of biological systems, our limited knowledge of them, and the fact that the collected data are often noisy and biased. In this thesis, we developed computational approaches for making inferences about biological systems from perturbation data in two different settings: (1) in yeast, where a genome-wide approach was taken to make second-order perturbations across millions of mutants, covering most of the genome, but with measurement of only a gross cellular phenotype (cell fitness), and (2) in a model plant system, where a focused approach was used to generate up to fourth-order perturbations over a small number of genes, and more detailed phenotypic and dynamic state measurements were collected. These two settings demand different computational strategies, but we demonstrate that in both cases we were able to gain specific, mechanistic insights about the biological systems through modeling.
More specifically, in the yeast setting, we developed statistical approaches for integrating data from double-perturbation experiments with data capturing physical interactions between proteins. This method revealed the highly organized, modular structure of the yeast genome and uncovered surprising patterns of genetic suppression, which challenge the existing dogma in the genetic interaction community. In the model plant setting, we developed both a Bayesian network approach and a regularized regression strategy for integrating perturbations, dynamic gene expression levels, and measurements of plant immunity against bacterial pathogens after genetic perturbation. The models resulting from both methods successfully predicted dynamic gene expression and immune response to perturbations and captured similar biological mechanisms and network properties. The models also highlighted specific network motifs responsible for the emergent properties of robustness and tunability of the plant immune system, which are the basis for plants' ability to withstand attacks from diverse and fast-evolving pathogens. More broadly, our studies provide several guidelines regarding both the experimental design and the computational approaches necessary for inferring models of complex systems from combinatorial mutant analysis.

Item: A study of the impact educational setting has on academic proficiency of American Indian students as measured by the Minnesota comprehensive assessment (2013-01) Hillstrom, Rev PM Crowley
The Minnesota Department of Education has collected Minnesota Comprehensive Assessments (MCA) results for every American Indian student who has taken the tests. This information has been made available so that communities and parents can assess how their districts, schools, and students are performing based upon MCA proficiency criteria. Prior to this study, there had been no known studies on the impact of educational setting (Urban: Minneapolis/St. Paul; Metro: seven-county metro area; Out State: greater Minnesota; and Bureau of Indian Education [BIE] schools) on mathematics and/or reading proficiency as measured by the MCAs for American Indian students in the state of Minnesota. The research population for this study included all American Indian students in the state of Minnesota, grades 3-11, who participated in the MCAs between 2007 and 2010. This study incorporated multiple variables, using empirical data from the four educational settings (Urban, Metro, Out State, and BIE) and two academic subjects (mathematics and reading). The analysis used three regression models (linear, non-linear, and logistic), which provided statistical information regarding the relationship between educational setting and proficiency as measured by the MCAs. The results of this research supported the theory that educational setting does have an impact on MCA proficiency for American Indian students in the state of Minnesota between 2007 and 2010.

Item: Sufficient dimension reduction and variable selection (2010-12) Chen, Xin
Sufficient dimension reduction (SDR) in regression was first introduced by Cook (2004). It reduces the dimension of the predictor space without loss of information and is very helpful when the number of predictors is large, alleviating the "curse of dimensionality" for many statistical methods. In this thesis, we study the properties of a dimension reduction method named "continuum regression"; we propose a unified method, coordinate-independent sparse estimation (CISE), that can simultaneously achieve sparse sufficient dimension reduction and screen out irrelevant and redundant variables efficiently; and we introduce a new dimension reduction method called "principal envelope models".
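Several of the abstracts above lean on standard techniques that are easy to sketch. The Laplacian regularization that the Flores thesis builds on is classically solved by the "harmonic" solution on a graph: minimize f^T L f subject to f matching the known labels, which reduces to one linear solve for the unlabeled nodes. The sketch below is a minimal NumPy illustration on a toy 5-node path graph, not the thesis's Lp-based algorithms; the graph and labels are invented for the example.

```python
import numpy as np

# Toy adjacency matrix of a 5-node path graph (illustrative data).
W = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
L = np.diag(W.sum(axis=1)) - W   # graph Laplacian L = D - W

labeled = [0, 4]                 # nodes with known labels
unlabeled = [1, 2, 3]
y = np.array([0.0, 1.0])         # labels at nodes 0 and 4

# Harmonic solution: partition L and solve L_uu f_u = -L_ul y,
# the minimizer of f^T L f with the labeled entries held fixed.
L_uu = L[np.ix_(unlabeled, unlabeled)]
L_ul = L[np.ix_(unlabeled, labeled)]
f_u = np.linalg.solve(L_uu, -L_ul @ y)
print(f_u)   # interpolates smoothly between the two labeled endpoints
```

On a path graph the harmonic solution is a linear interpolation between the labeled endpoints, which is exactly the smoothness that Laplacian regularization rewards.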
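The lasso (Tibshirani, 1996), cited as a baseline in the Adragni and Chatterjee abstracts, can be implemented in a few lines of coordinate descent with soft-thresholding. This is a generic sketch on synthetic data (only the first two of ten predictors truly affect the response), not code from either thesis.

```python
import numpy as np

# Synthetic sparse-regression problem: 10 predictors, 2 relevant.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]     # partial residual
            rho = X[:, j] @ r / n
            # Soft-threshold the univariate least-squares update.
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

b_hat = lasso_cd(X, y, lam=0.1)
selected = np.nonzero(np.abs(b_hat) > 1e-6)[0]
print(selected)   # the L1 penalty zeroes out the irrelevant predictors
```

The L1 penalty sets the eight irrelevant coefficients exactly to zero, which is the variable-selection behavior the abstracts compare against.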
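The central point of the Semmel abstract, that weight vectors quite different from the OLS solution can fit nearly as well when predictors are correlated, is easy to demonstrate numerically. The sketch below uses invented data with two correlated predictors and simply swaps the two fitted weights; it illustrates the phenomenon, not the paper's actual procedure for generating alternative weight vectors.

```python
import numpy as np

# Two strongly correlated predictors (illustrative data).
rng = np.random.default_rng(1)
n = 500
z = rng.standard_normal(n)
X = np.column_stack([z + 0.3 * rng.standard_normal(n),
                     z + 0.3 * rng.standard_normal(n)])
y = X @ np.array([0.6, 0.4]) + rng.standard_normal(n)

def r2(b):
    """R-squared of the fit with weight vector b."""
    resid = y - X @ b
    tss = (y - y.mean()) @ (y - y.mean())
    return 1 - resid @ resid / tss

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
b_alt = b_ols[::-1]   # swap the two weights: the rank order flips
print(r2(b_ols), r2(b_alt))   # the two fits are nearly identical
```

Because the predictors carry largely the same information, reversing which one gets the larger weight costs almost nothing in R2, so the "most important predictor" conclusion is fragile, which is the abstract's warning.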