Distributed and robust techniques for statistical learning.

Persistent link to this item

https://hdl.handle.net/11299/128511

Title

Distributed and robust techniques for statistical learning.

Published Date

2012-05

Type

Thesis or Dissertation

Abstract

The last decade has been marked by the advent of networked systems able to gather tremendous amounts of data. Testimony to this trend are data-collection projects and digital services such as Google Books, Internet marketing, and social networking sites. To fully exploit the potential benefits hidden in large collections of data, this thesis argues that more emphasis must be placed on data processing. Statistical learning approaches are needed that can uncover the “right” information within the data while coping with their complexities. Dealing with vast amounts of data, possibly distributed across multiple locations and often contaminated with outliers (inconsistent data) and missing entries, poses formidable processing challenges. This thesis takes a step toward overcoming these challenges by proposing novel problem formulations and capitalizing on contemporary tools from optimization and compressive sampling.

Power-limited networked systems deployed for data acquisition can extend their service life by collaboratively processing data in situ rather than transmitting all data back to a centralized processing unit. With this premise in mind, the viability of a fully distributed framework for clustering and classification is explored. Capitalizing on the idea of consensus, algorithms are developed whose performance guarantees are equivalent to those of a centralized algorithm with access to all network data. Due to their wide applicability and popularity, focus is placed on developing distributed alternatives to support vector machines, K-means, and expectation-maximization algorithms.

Managing the quality of data poses a major challenge. Outliers are hard to identify, especially in high-dimensional data. The presence of outliers can be due to faulty sensors, malicious sources, model mismatch, or rarely seen events. In all cases, ill-handled outliers can deteriorate the performance of any information processing and management scheme. Robust clustering algorithms are developed that rely on a data model explicitly capturing outliers. The outlier-aware data model translates the rare occurrences of outliers in the data to sparsity of pertinent outlier variables, thereby establishing a neat link between clustering and the area of compressive sampling. A similar outlier-aware model is used to derive robust versions of multidimensional scaling algorithms for high-dimensional data visualization. In this context, a robust multidimensional scaling algorithm able to cope with a common form of structured outlier contamination is also developed.

Using data with missing entries is also challenging. Missing data can occur due to faulty sensors, privacy concerns, and limited measurement budgets. Specifically, prediction of a dynamical process evolving on a network based on observations at a few nodes is explored. Here, tools from semi-supervised learning and dictionary learning are leveraged to develop batch and online topology- and data-driven prediction algorithms able to cope with missing data.
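
The consensus-based distributed algorithms summarized in the abstract rest on a simple primitive: nodes agree on a network-wide quantity by exchanging values only with their immediate neighbors. The Python sketch below is not taken from the dissertation; it is a minimal illustration of average consensus under assumed Metropolis mixing weights, with the names metropolis_weights and average_consensus chosen here purely for illustration.

# Minimal sketch of average consensus, the building block behind
# consensus-based distributed learning. Each node repeatedly averages its
# local value with those of its neighbors; no node ever sees the full data.
import numpy as np

def metropolis_weights(adj):
    """Doubly stochastic mixing matrix from a 0/1 adjacency matrix."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def average_consensus(local_values, adj, iters=200):
    """Each row of local_values holds one node's local statistic."""
    W = metropolis_weights(adj)
    x = np.asarray(local_values, dtype=float)
    for _ in range(iters):
        x = W @ x          # one round of neighbor-only exchanges
    return x               # every row ends up close to the global average

if __name__ == "__main__":
    # Four nodes on a ring; each holds the mean of its own local data.
    ring = np.array([[0, 1, 0, 1],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [1, 0, 1, 0]])
    local_means = np.array([[1.0], [2.0], [3.0], [4.0]])
    print(average_consensus(local_means, ring))  # all entries close to 2.5

Running sufficient statistics (for example, per-cluster sums and counts in K-means) through such a consensus step is one common way to match a centralized update without shipping raw data to a fusion center.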
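
The outlier-aware model described in the abstract ties robust clustering to compressive sampling by attaching a sparse outlier variable to each datum. The sketch below is an assumed, simplified rendition of that idea rather than the dissertation's formulation: a K-means-style objective augmented with one outlier vector per point and a group-lasso penalty, minimized by block-coordinate descent. The name robust_kmeans and the parameter lam are chosen here for illustration.

# Assumed sketch: each point is modeled as (centroid of its cluster) + o_n,
# and the penalty lam * sum_n ||o_n||_2 keeps most o_n exactly zero, so the
# nonzero o_n flag outliers. Alternates cluster assignments, centroid
# updates, and group soft-thresholding of the residuals.
import numpy as np

def robust_kmeans(X, k, lam, iters=50, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    C = X[rng.choice(n, size=k, replace=False)]   # random initial centroids
    O = np.zeros((n, d))                          # outlier variables
    for _ in range(iters):
        # 1) assign each "cleaned" point x_n - o_n to its nearest centroid
        D = np.linalg.norm((X - O)[:, None, :] - C[None, :, :], axis=2)
        z = D.argmin(axis=1)
        # 2) update centroids from the cleaned points assigned to them
        for j in range(k):
            if np.any(z == j):
                C[j] = (X - O)[z == j].mean(axis=0)
        # 3) group soft-threshold residuals: o_n = r_n * max(1 - lam/||r_n||, 0)
        R = X - C[z]
        norms = np.maximum(np.linalg.norm(R, axis=1, keepdims=True), 1e-12)
        O = np.maximum(1.0 - lam / norms, 0.0) * R
    return C, z, O

if __name__ == "__main__":
    rng = np.random.default_rng()
    X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                   rng.normal(5.0, 0.3, size=(50, 2)),
                   [[20.0, -20.0]]])              # one gross outlier
    C, z, O = robust_kmeans(X, k=2, lam=1.0)
    flagged = np.where(np.linalg.norm(O, axis=1) > 0)[0]
    print("flagged as outliers:", flagged)        # typically only the last point

Tuning lam trades off how aggressively points are declared outliers; as lam grows, all outlier vectors collapse to zero and ordinary K-means is recovered.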

Description

University of Minnesota Ph.D. dissertation. May 2012. Major: Electrical Engineering. Advisor: Professor Georgios B. Giannakis. 1 computer file (PDF); xiii, 113 pages, appendix A.

Suggested citation

Forero, Pedro Andrés. (2012). Distributed and robust techniques for statistical learning. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/128511.
