Browsing by Subject "High dimensional data"
Now showing 1 - 4 of 4
Item: Methodologies and Algorithms on Some Non-convex Penalized Models for Ultra High Dimensional Data (2016-06)
Author: Peng, Bo

In recent years, penalized models have gained considerable importance in dealing with variable selection and estimation problems in high dimensional settings. Among the candidates, the l1-penalized, or LASSO, model remains popular across diverse fields, with sophisticated methodology and mature algorithms. However, as promising alternatives to the LASSO, non-convex penalized methods, such as the smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP) methods, produce asymptotically unbiased shrinkage estimates and offer attractive advantages over the LASSO. In this thesis, we propose complete methodology and theory for multiple non-convex penalized models. The proposed theoretical framework covers the estimators' error bounds, oracle property, and variable selection behavior. Instead of the usual least squares models, we focus on quantile regression and support vector machines (SVMs) to explore heterogeneity and binary classification. Although we demonstrate that the current local linear approximation (LLA) optimization algorithm possesses the theoretical properties needed to achieve the oracle estimator in two iterations, computation is highly challenging when p is large, due to the non-smoothness of the loss function and the non-convexity of the penalty function. Hence, we also explore the potential of coordinate descent algorithms for fitting the selected models, establishing convergence properties and demonstrating significant speedups over current approaches.
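For reference, the SCAD and MCP penalties contrasted with the LASSO above have standard closed forms (Fan and Li, 2001; Zhang, 2010). This is a minimal plain-Python sketch, not the thesis's own code, using the conventional default tuning parameters:

```python
def lasso_penalty(t, lam):
    """L1 (LASSO) penalty: grows linearly forever, so large
    coefficients are shrunk by a constant amount (bias)."""
    return lam * abs(t)

def scad_penalty(t, lam, a=3.7):
    """Smoothly clipped absolute deviation penalty (Fan & Li, 2001):
    linear near zero, quadratic transition, then flat beyond a*lam."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return lam**2 * (a + 1) / 2

def mcp_penalty(t, lam, gamma=3.0):
    """Minimax concave penalty (Zhang, 2010): linear slope lam at zero,
    tapering to flat beyond gamma*lam."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t**2 / (2 * gamma)
    return gamma * lam**2 / 2
```

Both non-convex penalties flatten out for large coefficients, which is why they avoid the LASSO's constant shrinkage bias and can yield asymptotically unbiased estimates, at the cost of the non-convexity that makes optimization hard.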
Simulated and real data analyses are carried out to examine the performance of non-convex penalized models and to illustrate the computational speed advantage of our algorithm.

Item: Network selection, information filtering and scalable computation (2014-03)
Author: Ye, Changqing

This dissertation explores two application scenarios of sparsity-pursuit methods on large scale data sets. The first scenario is classification and regression for high dimensional structured data, where predictors correspond to nodes of a given directed graph. This arises, for instance, in identifying disease genes for Parkinson's disease from a network of candidate genes. In such a situation, the directed graph describes dependencies among the genes, where the directions of edges represent certain causal effects. Key to high-dimensional structured classification and regression is how to utilize the dependencies among predictors specified by the directions of the graph. In this dissertation, we develop a novel method that fully accounts for such dependencies, formulated through certain nonlinear constraints. We apply the proposed method to two applications: feature selection in large margin binary classification and in linear regression. We implement the proposed method through difference-of-convex programming for the cost function and constraints. Finally, theoretical and numerical analyses suggest that the proposed method achieves the desired objectives. An application to disease gene identification is presented.

The second application scenario is personalized information filtering, which extracts the information specifically relevant to a user, predicting his or her preference over a large number of items based on the opinions of like-minded users or on item content. This problem is cast into the framework of regression and classification, where we introduce novel partial latent models to integrate additional user-specific and content-specific predictors for higher predictive accuracy.
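The abstract above does not spell out the form of the nonlinear constraints in the network-selection method. Purely as a hypothetical illustration of what "respecting directed dependencies among predictors" can mean, one simple rule admits a predictor into the model only when all of its graph parents are admitted (the function name and edge encoding here are invented for this sketch):

```python
def respects_graph(selected, edges):
    """Check that a candidate set of selected predictors respects a
    directed dependency graph: a node may be selected only when every
    one of its parents (the tail of each incoming edge) is selected too.
    `edges` is a list of (parent, child) pairs; `selected` is a set of
    node names."""
    selected = set(selected)
    return all(parent in selected
               for parent, child in edges
               if child in selected)
```

For a candidate-gene chain g1 -> g2 -> g3, the set {g1, g2} is admissible, while {g3} alone is not, since its upstream genes are absent. The actual method in the dissertation optimizes over such structured sets via difference-of-convex programming rather than enumerating them.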
In particular, we factorize a user-over-item preference matrix into a product of two matrices, one representing users' preferences and the other items' preference profiles. We then propose a likelihood method to seek the sparsest latent factorization, from a class of over-complete factorizations, possibly with a high percentage of missing values. This promotes additional sparsity beyond rank reduction. Computationally, we design methods based on a "decomposition and combination" strategy, breaking large-scale optimization into many small subproblems that are solved in a recursive and parallel manner. On this basis, we implement the proposed methods through multi-platform shared-memory parallel programming, and through Mahout, a library for scalable machine learning and data mining, for MapReduce computation. For example, our methods scale to a dataset of three billion observations on a single machine with sufficient memory, with good run times. Both theoretical and numerical investigations show that the proposed methods exhibit significant improvements in accuracy over state-of-the-art scalable methods.

Item: Statistical Methods for Large Complex Datasets (2016-05)
Author: Datta, Abhirup

Modern technological advancements have enabled massive-scale collection, processing and storage of information, triggering the onset of the 'big data' era, in which every two days we now create as much data as we did in the entire twentieth century. This thesis aims at developing novel statistical methods that can efficiently analyze a variety of large complex datasets. Under the umbrella theme of big data modeling, we present statistical methods for two different classes of large complex datasets. The first half of the thesis focuses on the 'large n' problem for large spatial or spatio-temporal datasets, where observations exhibit strong dependencies across space and time.
In the second half of the thesis we present methods for high-dimensional regression in the 'large p, small n' setting for datasets that contain measurement errors or change points.

Item: Statistical Methods for Variable Selection in Causal Inference (2018-07)
Author: Koch, Brandon Lee

Estimating the causal effect of a binary intervention or action (referred to as a "treatment") on a continuous outcome is often an investigator's primary goal. Randomized trials are ideal for estimating causal effects because randomization eliminates selection bias in treatment assignment. However, randomized trials are not always ethically or practically possible, and observational data must be used to estimate the causal effect of treatment. Unbiased estimation of causal effects with observational data requires adjustment for confounding variables that are related to both the outcome and treatment assignment. Adjusting for all measured covariates in a study protects against bias, but including covariates unrelated to the outcome may increase the variability of the estimated causal effect. Standard variable selection techniques aim to maximize the predictive ability of a model for the outcome and are used to decrease the variability of the estimated causal effect, but they ignore covariate associations with treatment and may fail to adjust for important confounders that are only weakly associated with the outcome. We propose two approaches for estimating causal effects that simultaneously consider models for both the outcome and treatment assignment. The first approach is a variable selection technique for identifying confounders and predictors of outcome, using an adaptive group lasso approach that simultaneously performs coefficient selection, regularization, and estimation across the treatment and outcome models.
In the second approach, two methods are proposed that simultaneously model the outcome and treatment assignment using a Bayesian formulation with spike-and-slab priors on each covariate coefficient: the Spike and Slab Causal Estimator (SSCE) aims to minimize the bias of the causal effect estimator, while Bilevel SSCE (BSSCE) aims to minimize its mean squared error. We also propose TEHTrees, a new method that combines matching and conditional inference trees to characterize treatment effect heterogeneity. One of its main virtues is that, by employing formal hypothesis testing procedures in constructing the tree, TEHTrees preserves the Type I error rate.
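The dual-model point above, that screening on outcome association alone can drop confounders only weakly tied to the outcome, can be illustrated with a deliberately crude marginal-correlation screen. This is a toy stand-in, not the adaptive group lasso or spike-and-slab machinery the thesis develops; the function names and threshold are invented for the sketch:

```python
import math

def corr(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_covariates(X, treatment, outcome, thresh=0.3):
    """Keep a covariate (column of X, given as a list of columns) when it
    is marginally associated with the outcome OR with treatment
    assignment. An outcome-only screen would use just the first check and
    could miss confounders whose outcome association is weak."""
    keep = []
    for j, col in enumerate(X):
        if (abs(corr(col, outcome)) >= thresh
                or abs(corr(col, treatment)) >= thresh):
            keep.append(j)
    return keep
```

In a small synthetic example with one covariate tied to the outcome, one tied only to treatment assignment, and one pure-noise column, the dual screen retains the first two and drops the noise; a screen based on outcome correlation alone would behave differently whenever a confounder's outcome signal falls below the threshold.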