Distributed and robust techniques for statistical learning.
2012-05
Type
Thesis or Dissertation
Abstract
The last decade has been marked by the advent of networked systems able to gather tremendous
amounts of data. Evidence of this trend includes data collection projects and digital services such
as Google Books, Internet marketing, and social networking sites. To fully exploit the potential
benefits hidden in large collections of data, this thesis argues that more emphasis must be placed on
data processing. Statistical learning approaches are needed that uncover the “right” information within
the data while coping with their complexities. Dealing with vast amounts of data,
possibly distributed across multiple locations and often contaminated with outliers (inconsistent
data) and missing entries, poses formidable processing challenges. This thesis takes a step
toward overcoming these challenges by proposing novel problem formulations and
capitalizing on contemporary tools from optimization and compressive sampling.
Power-limited networked systems deployed for data acquisition can extend their service life
by collaboratively processing data in situ rather than transmitting all data back to a centralized processing
unit. With this premise in mind, the viability of a fully distributed framework for clustering
and classification is explored. Capitalizing on the idea of consensus, algorithms with performance
guarantees equivalent to the ones achieved by a centralized algorithm having access to all network
data are developed. Due to their wide applicability and popularity, focus is placed on developing
alternatives for support vector machines, K-means and expectation-maximization algorithms.
Managing the quality of data poses a major challenge. Outliers are hard to identify, especially
in high dimensional data. The presence of outliers can be due to faulty sensors, malicious sources,
model mismatch, or rarely seen events. In all cases, ill-handled outliers can deteriorate the performance
of any information processing and management scheme. Robust clustering algorithms relying on
a data model that explicitly captures outliers are developed. The outlier-aware data
model translates the rare occurrences of outliers in data into sparsity of the pertinent outlier variables,
thereby establishing a neat link between clustering and the area of compressive sampling. A similar
outlier-aware model is used to derive robust versions of multidimensional scaling algorithms for
high-dimensional data visualization. In this context, a robust multidimensional scaling algorithm
able to cope with a common structured outlier contamination is also developed. Using data with
missing entries is also challenging. Missing data can occur due to faulty sensors, privacy concerns,
and limited measurement budgets. Specifically, prediction of a dynamical process evolving on a
network based on observations at a few nodes is explored. Here, tools from semi-supervised learning
and dictionary learning are leveraged to develop batch and online topology- and data-driven
prediction algorithms able to cope with missing data.
Description
University of Minnesota Ph.D. dissertation. May 2012. Major: Electrical Engineering. Advisor: Professor Georgios B. Giannakis. 1 computer file (PDF); xiii, 113 pages, appendix A.
Suggested citation
Forero, Pedro Andrés. (2012). Distributed and robust techniques for statistical learning. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/128511.