Author: Forero, Pedro Andrés
Date accessioned: 2012-07-20
Date available: 2012-07-20
Date issued: 2012-05
URI: https://hdl.handle.net/11299/128511

University of Minnesota Ph.D. dissertation. May 2012. Major: Electrical Engineering. Advisor: Professor Georgios B. Giannakis. 1 computer file (PDF); xiii, 113 pages, appendix A.

Abstract:

The last decade has been marked by the advent of networked systems able to gather tremendous amounts of data. Data collection projects and digital services such as Google Books, Internet marketing, and social networking sites testify to this trend. To fully exploit the potential benefits hidden in large collections of data, this thesis argues that more emphasis must be placed on data processing. Statistical learning approaches are needed that uncover the "right" information within the data while coping with their complexities. Dealing with vast amounts of data, possibly distributed across multiple locations and often contaminated with outliers (inconsistent data) and missing entries, poses formidable processing challenges. This thesis takes a step toward overcoming these challenges by proposing novel problem formulations and by capitalizing on contemporary tools from optimization and compressive sampling.

Power-limited networked systems deployed for data acquisition can extend their service life by collaboratively processing data in situ rather than transmitting all data back to a centralized processing unit. With this premise in mind, the viability of a fully distributed framework for clustering and classification is explored. Capitalizing on the idea of consensus, algorithms are developed whose performance guarantees match those of a centralized algorithm with access to all network data. Due to their wide applicability and popularity, focus is placed on developing distributed alternatives to support vector machines, K-means, and expectation-maximization algorithms.

Managing the quality of data poses a major challenge.
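The consensus idea underlying such distributed algorithms can be illustrated with a minimal sketch of average consensus, the basic primitive by which networked nodes agree on a network-wide quantity without a fusion center. This is an illustrative toy, not the thesis's algorithms; the function name, step size, and ring topology are all assumptions chosen for the example.

```python
import numpy as np

def consensus_average(values, neighbors, step=0.2, iters=200):
    """Average consensus: each node repeatedly mixes its local value
    with its neighbors' values. For a connected graph and a small
    enough step size, every node converges to the global mean."""
    x = np.asarray(values, dtype=float).copy()
    for _ in range(iters):
        x_new = x.copy()
        for i, nbrs in neighbors.items():
            # move toward agreement with neighboring nodes
            x_new[i] += step * sum(x[j] - x[i] for j in nbrs)
        x = x_new
    return x

# 4-node ring network, each node holding one local measurement
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
local = [1.0, 2.0, 3.0, 6.0]
est = consensus_average(local, neighbors)
# every node's estimate ends near the global mean 3.0
```

Exchanging only neighbor-to-neighbor messages is what lets power-limited nodes avoid shipping all raw data to a central unit; distributed learning algorithms embed iterations of this kind inside their optimization loops.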
Outliers are hard to identify, especially in high-dimensional data. Their presence can be due to faulty sensors, malicious sources, model mismatch, or rarely seen events. In all cases, ill-handled outliers can deteriorate the performance of any information processing and management scheme. Robust clustering algorithms are developed that rely on a data model explicitly capturing outliers. This outlier-aware model translates the rare occurrences of outliers into sparsity of pertinent outlier variables, thereby establishing a neat link between clustering and the area of compressive sampling. A similar outlier-aware model is used to derive robust versions of multidimensional scaling algorithms for high-dimensional data visualization. In this context, a robust multidimensional scaling algorithm able to cope with a common structured outlier contamination is also developed.

Using data with missing entries is also challenging. Missing data can occur due to faulty sensors, privacy concerns, and limited measurement budgets. Specifically, the prediction of a dynamical process evolving on a network from observations at a few nodes is explored. Here, tools from semi-supervised learning and dictionary learning are leveraged to develop batch and online topology- and data-driven prediction algorithms able to cope with missing data.

Language: en-US
Keywords: Dictionary learning; Distributed learning; Network prediction; Robust clustering; Robust multidimensional scaling; Sensor networks; Electrical Engineering
Title: Distributed and robust techniques for statistical learning
Type: Thesis or Dissertation