Browsing by Subject "dimensionality reduction"
Now showing 1 - 6 of 6
Item: Clustering in a High-Dimensional Space Using Hypergraph Models (1997)
Han, Eui-Hong; Karypis, George; Kumar, Vipin; Mobasher, Bamshad
Clustering of data in a high-dimensional space is of great interest in many data mining applications. Most traditional algorithms, such as K-means or AutoClass, fail to produce meaningful clusters in such data sets even when they are used with well-known dimensionality reduction techniques such as Principal Component Analysis and Latent Semantic Indexing. In this paper, we propose a method for clustering data in a high-dimensional space based on a hypergraph model. The hypergraph model maps the relationships present in the original high-dimensional data into a hypergraph. A hyperedge represents a relationship (affinity) among a subset of the data items, and the weight of the hyperedge reflects the strength of this affinity. A hypergraph partitioning algorithm is used to find a partitioning of the vertices such that the corresponding data items in each partition are highly related and the weight of the hyperedges cut by the partitioning is minimized. We present results of experiments on three different data sets: S&P500 stock data for the period 1994-1996, protein coding data, and Web document data. Wherever applicable, we compared our results with those of the AutoClass and K-means clustering algorithms on the original data as well as on the reduced-dimensionality data obtained via Principal Component Analysis or Latent Semantic Indexing. These experiments demonstrate that our approach is applicable and effective in a wide range of domains. More specifically, our approach performed much better than traditional schemes for high-dimensional data sets in terms of cluster quality and runtime. It was also able to filter noise out of the clusters very effectively without compromising their quality.
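The abstract names the pipeline but not a specific partitioner (the authors' related work uses a dedicated multilevel one). As a rough, hedged sketch only, the Python below approximates the same cut-minimization objective by expanding weighted hyperedges into an ordinary graph and applying spectral bisection; the function name and toy hyperedges are illustrative, not from the paper.

```python
import numpy as np

def cluster_by_hyperedges(n_items, hyperedges, weights):
    """Two-way partition of items connected by weighted hyperedges.

    hyperedges : list of tuples of item indices (each an affinity group,
                 assumed to have at least two members)
    weights    : hyperedge weights (strength of each affinity)
    """
    # Clique expansion: spread each hyperedge's weight over the
    # pairwise edges among its members.
    A = np.zeros((n_items, n_items))
    for edge, w in zip(hyperedges, weights):
        edge = list(edge)
        share = w / (len(edge) - 1)          # normalize by edge size
        for i in edge:
            for j in edge:
                if i != j:
                    A[i, j] += share
    # Spectral bisection: the signs of the Fiedler vector give a
    # low-cut-weight split of the vertices.
    L = np.diag(A.sum(axis=1)) - A           # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # ascending eigenvalues
    fiedler = eigvecs[:, 1]                  # second-smallest eigenvector
    return fiedler >= 0                      # boolean cluster labels

# Toy usage: two affinity groups sharing one item.
labels = cluster_by_hyperedges(
    5, hyperedges=[(0, 1, 2), (2, 3, 4)], weights=[1.0, 1.0])
print(labels)
```

The sign of the Fiedler vector yields only a two-way split; the paper's actual algorithm produces multi-way partitions and, per the abstract, also filters out poorly connected (noise) items.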
Item: Online Censoring for Large-Scale Regressions and Dynamical Processes with Application to Big Data (2015-07)
Bermperidis, Dimitrios
In an age of exponentially increasing data availability, performing inference tasks by utilizing the available information in its entirety is not always an affordable option. In this context, the present thesis introduces methods for rendering large-scale linear regression and the tracking of dynamic processes affordable by processing a reduced number of data. The proposed algorithms utilize interval censoring of observations in order to judiciously discard those deemed to have a relatively small contribution toward enhancing estimation or tracking accuracy. For linear regression, two groups of first- and second-order iterative algorithms are proposed: the first focuses on reducing data storage and transmission costs, while the second is tailored to reducing the overall problem complexity. Leveraging principles of stochastic approximation, the introduced methods entail simple, closed-form updates, enjoy provable convergence guarantees, and afford online processing of the data. As for the tracking of dynamical processes, two distinct methods are put forth for reducing the number of data involved per time step. The first builds on preprocessing the data for dimensionality reduction using low-complexity random projections, while the second performs censoring for data-adaptive measurement selection. Simulations on real and synthetic data compare the proposed methods with competing alternatives and corroborate their efficacy in terms of estimation accuracy versus complexity reduction.
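The censoring rule itself is easy to picture. Below is a minimal sketch assuming a least-mean-squares flavor of the first-order updates (the thesis's exact recursions may differ): an observation triggers an update only when its innovation exceeds a threshold, so computation scales with the informative samples alone. The function name, threshold, and step size are illustrative.

```python
import numpy as np

def censored_online_lms(X, y, tau=1.0, mu=0.01):
    """Online least-mean-squares that censors uninformative samples.

    A sample (x_t, y_t) is skipped when its innovation |y_t - x_t' w|
    falls below the censoring threshold tau, so only observations that
    carry enough new information are processed.
    """
    w = np.zeros(X.shape[1])
    n_used = 0
    for x_t, y_t in zip(X, y):
        innovation = y_t - x_t @ w
        if abs(innovation) >= tau:           # censoring rule
            w += mu * innovation * x_t       # standard LMS update
            n_used += 1
    return w, n_used

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 10))
w_true = rng.standard_normal(10)
y = X @ w_true + 0.1 * rng.standard_normal(5000)
w_hat, n_used = censored_online_lms(X, y, tau=0.3)
print(n_used, np.linalg.norm(w_hat - w_true))
```

As the estimate improves, innovations shrink and ever fewer samples pass the threshold, which is the intuition behind the claimed complexity reduction.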
Item: Scalable Learning Adaptive to Unknown Dynamics and Graphs (2019-06)
Shen, Yanning
With the scale of information growing every day, the key challenges in machine learning include the high dimensionality and sheer volume of feature vectors, which may consist of real and categorical data, as well as the speed and typically streaming format of data acquisition, which may also entail outliers and misses. The latter may be present either unintentionally or intentionally, in order to cope with scalability, privacy, and adversarial behavior. These challenges provide ample opportunities for algorithmic and analytical innovations in online and nonlinear subspace learning approaches. Among the available nonlinear learning tools, those based on kernels have well-documented merits. However, most rely on a preselected kernel, whose prudent choice presumes task-specific prior information that is generally not available. It is also known that kernel-based methods do not scale well with the size or dimensionality of the data at hand. Besides data science, the urgent need for scalable tools is a core issue in network science, which has recently emerged as a means of collectively understanding the behavior of complex interconnected entities. The rich spectrum of application domains comprises communication, social, financial, gene-regulatory, brain, and power networks, to name a few. Prominent tasks in all network science applications are topology identification and inference of nodal processes evolving over graphs. Most contemporary graph-driven inference approaches rely on linear and static models that are simple and tractable, but also presume that the nodal processes are directly observable. To cope with these challenges, the present thesis first introduces a novel online categorical subspace learning approach to track the latent structure of categorical data "on the fly." Leveraging the random feature approximation, it then develops an adaptive online multi-kernel learning approach (termed AdaRaker) that accounts not only for data-driven learning of the kernel combination, but also for unknown dynamics. Performance analysis is provided in terms of both static and dynamic regret to quantify the accuracy of the novel online function approximation. In addition, the thesis introduces a kernel-based topology identification approach that can account for nonlinear dependencies among nodes and across time. To cope with nodal processes that may not be directly observable in certain applications, tensor-based algorithms that leverage piecewise stationary statistics of nodal processes are developed, and pertinent identifiability conditions are established. To facilitate real-time operation and inference over time-varying networks, an adaptive tensor-decomposition-based scheme is put forth to track the topologies of time-varying networks. Last but not least, the present thesis offers a unifying framework to deal with various learning tasks over possibly dynamic networks; these tasks include dimensionality reduction, classification, and clustering. Tests on both synthetic and real datasets from the aforementioned application domains are carried out to showcase the effectiveness of the novel algorithms throughout.
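The "random feature approximation" mentioned above is a standard device for making kernel methods scale. The sketch below shows it for a single Gaussian kernel combined with an online update; AdaRaker itself additionally learns a weighted combination of several kernels, which is not reproduced here. Names and parameters are illustrative.

```python
import numpy as np

def random_fourier_features(X, n_feat=100, sigma=1.0, seed=0):
    """Map X to random Fourier features approximating a Gaussian kernel.

    With z as defined here, z(x) @ z(y) ~= exp(-||x - y||^2 / (2 sigma^2)),
    so a linear model on z(x) mimics kernel learning at O(n_feat) cost.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_feat)) / sigma   # spectral samples
    b = rng.uniform(0, 2 * np.pi, n_feat)          # random phases
    return np.sqrt(2.0 / n_feat) * np.cos(X @ W + b)

# Online (streaming) kernel regression via SGD in the feature space.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (2000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(2000)  # nonlinear target
Z = random_fourier_features(X, n_feat=200)
theta = np.zeros(Z.shape[1])
for z_t, y_t in zip(Z, y):                             # one pass, online
    theta += 0.1 * (y_t - z_t @ theta) * z_t
print(np.mean((Z @ theta - y) ** 2))                   # training MSE
```

Because the feature map is fixed and finite-dimensional, each update costs O(n_feat) regardless of how many samples have streamed by, which is what makes the approach suitable for the online setting described above.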
Item: Structured Learning with Parsimony in Measurements and Computations: Theory, Algorithms, and Applications (2018-07)
Li, Xingguo
In modern "Big Data" applications, structured learning is the most widely employed methodology. Within this paradigm, the fundamental challenge lies in developing practical, effective algorithmic inference methods. Often (e.g., in deep learning), successful heuristic approaches exist, but theoretical studies lag far behind, limiting understanding and potential improvements. In other settings (e.g., recommender systems), provably effective algorithmic methods exist, but the sheer size of the datasets can limit their applicability. This twofold challenge motivates this work on developing new analytical and algorithmic methods for structured learning, with a particular focus on parsimony in measurements and computations, i.e., methods requiring low storage and computational costs. Toward this end, we investigate the theoretical properties of models and algorithms that significantly improve measurement and computation requirements. In particular, we first develop randomized approaches for dimensionality reduction on matrix and tensor data, which allow accurate estimation and inference using significantly smaller data sizes that depend only on the intrinsic dimension (e.g., the rank of the matrix/tensor) rather than the ambient ones (a minimal randomized-sketching example appears after this listing). Our next effort is to study iterative algorithms for solving high-dimensional learning problems, covering both convex and nonconvex optimization. Using contemporary analysis techniques, we demonstrate iteration-complexity guarantees analogous to those of the low-dimensional cases. In addition, we explore the landscape of nonconvex optimization problems that exhibit computational advantages over their convex counterparts and characterize their properties in theory from a general point of view.

Item: A Study of Dimensionality Reduction Techniques and its Analysis on Climate Data (2015-10)
Kumar, Arjun
Dimensionality reduction is a significant problem across a wide variety of domains such as pattern recognition, data compression, image segmentation, and clustering. Different methods exploit different features in the data to reduce dimensionality. Principal Component Analysis is one such method, exploiting the variance in the data to embed it onto a lower-dimensional space called the principal component space. These are linear techniques that can be expressed in the form B = TX, where T is the transformation matrix that maps the data matrix X to the reduced-dimensionality representation B (a PCA example in this form appears after this listing). Other linear techniques explored are Factor Analysis and Dictionary Learning. In many problems the observations are high-dimensional, but we may have reason to believe that they lie near a lower-dimensional manifold. In other words, we may believe that the high-dimensional data are multiple, indirect measurements of an underlying source, which typically cannot be directly measured. Learning a suitable low-dimensional manifold from high-dimensional data is essentially the same as learning this underlying source. Techniques such as ISOMAP, Locally Linear Embedding, and Laplacian Eigenmaps (LEMs), among many others, try to embed high-dimensional observations from the nonlinear space onto a low-dimensional manifold. We explore these methods in comparative studies, along with their applications in the domain of climate science.

Item: Unsupervised Learning of Latent Structure from Linear and Nonlinear Measurements (2019-06)
Yang, Bo
The past few decades have seen a rapid expansion of our digital world. While early dwellers of the Internet exchanged simple text messages via email, modern citizens of the digital world conduct a much richer set of activities online: entertainment, banking, booking restaurants and hotels, just to name a few. In our digitally enriched lives, we not only enjoy great convenience and efficiency, but also leave behind massive amounts of data that offer ample opportunities for improving these digital services and creating new ones. Meanwhile, technical advancements have facilitated the emergence of new sensors and networks that can measure, exchange, and log data about real-world events. These technologies have been applied to many different scenarios, including environmental monitoring, advanced manufacturing, healthcare, and scientific research in physics, chemistry, biotechnology, and social science, to name a few. Leveraging the abundant data, learning-based and data-driven methods have become a dominant paradigm across different areas, with data analytics driving many of the recent developments. However, the massive amount of data also brings considerable challenges for analytics. Among them, the collected data are often high-dimensional, with the true knowledge and signal of interest hidden underneath. It is of great importance to reduce the data dimension and transform the data into the right space. In some cases, the data are generated from certain generative models that are identifiable, making it possible to reduce the data back to the original space. In addition, we are often interested in performing some analysis on the data after dimensionality reduction (DR), and it is helpful to be mindful of these subsequent analysis steps when performing DR, as latent structures can serve as a valuable prior. Based on this reasoning, we develop two methods, one for the linear generative model case and the other for the nonlinear case. In a related setting, we study parameter estimation under unknown nonlinear distortion. In this case, the unknown nonlinearity in the measurements poses a severe challenge. In practice, various mechanisms can introduce nonlinearity into the measured data. To combat this challenge, we put forth a nonlinear mixture model that is well-grounded in real-world applications. We show that this model is in fact identifiable up to some trivial indeterminacy. We develop an efficient algorithm to recover the latent parameters of this model, and confirm the effectiveness of our theory and algorithm via numerical experiments.
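Two short, hedged sketches close out the listing. First, for the randomized dimensionality reduction described in the Li thesis above: a minimal randomized SVD, in which the matrix is compressed with a random test matrix before factoring, so the cost tracks the intrinsic rank rather than the ambient size. The thesis's own algorithms may differ; this is the textbook randomized range-finder, with illustrative names throughout.

```python
import numpy as np

def randomized_svd(A, rank, n_oversample=10, seed=0):
    """Approximate rank-`rank` SVD of A via a random sketch.

    Only a (rank + n_oversample)-column sketch of A is ever factored,
    so the cost depends on the intrinsic rank, not the ambient size.
    """
    rng = np.random.default_rng(seed)
    k = rank + n_oversample
    Omega = rng.standard_normal((A.shape[1], k))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                 # range approximation
    B = Q.T @ A                                    # small projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]

# Toy usage: a 2000 x 1000 matrix with intrinsic rank 5.
rng = np.random.default_rng(2)
A = rng.standard_normal((2000, 5)) @ rng.standard_normal((5, 1000))
U, s, Vt = randomized_svd(A, rank=5)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))  # ~0
```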
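Second, to make the linear model B = TX from the Kumar study concrete: a minimal PCA written in exactly that form. The symbols B, T, and X follow the abstract's notation; the function name and toy data are illustrative.

```python
import numpy as np

def pca_transform(X, k):
    """PCA in the B = TX form: rows of T are the top-k principal axes.

    X : (d, n) data matrix, one observation per column.
    Returns T (k, d) and the reduced representation B = T X (k, n).
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # center each feature
    C = Xc @ Xc.T / (X.shape[1] - 1)         # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues
    T = eigvecs[:, ::-1][:, :k].T            # top-k directions of variance
    return T, T @ Xc

# Toy usage: project 10-dimensional observations onto 2 components.
rng = np.random.default_rng(3)
X = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 500))
T, B = pca_transform(X, k=2)
print(T.shape, B.shape)   # (2, 10) (2, 500)
```

Factor Analysis and Dictionary Learning, also mentioned in that abstract, fit the same B = TX template but choose T by different criteria than maximal variance.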