Browsing by Subject "Clustering"

Now showing 1 - 11 of 11

Clustering Methods for Correlated Data
(2022-08) Becker, Andrew
Hierarchical Clustering is one of the most popular unsupervised clustering methods.Using a simple agglomerative algorithm, it iteratively combines similar clusters together forming cohesive groups of observations. This work focuses on Hierarchical Clustering and how it may be adapted to accommodate correlated observations. Chapter 2 investigates how to develop a statistical framework for Hierarchical Clustering so we may derive statistical properties from the clustering method. In Chapter 3, a new method, Hierarchical Cohesion Clustering is proposed. This method is a modification of the traditional methods which aims to accommodate correlated observations. This approach explores how repeated measurements may be preprocessed into intermediate clusters to improve clustering outcomes. The method is applied to a sequence-based time use dataset about how people spend their time throughout the day. In Chapter 4, we focus on how to incorporate spatial adjacency data when clustering. We continue to investigate Hierarchical Clustering methods, with a special focus on Hierarchical Cohesion Clustering. Applying the collection of methods to COVID-19 case rate data within counties, a comparison of the methods is performed with summaries of their respective strengths and weaknesses. Spatial simulations are included to better determine each approach’s efficacy and when certain approaches are preferable.
Computer Vision Algorithms for Yield Mapping in Specialty Farms
(2019-11) Roy, Pravakar
Precision farming and phenotyping technologies have the potential to drastically transform the agricultural landscape. For commodity crops such as maize, wheat and soy recurring farming tasks such as seeding, weeding, irrigation, fertilization, application of pesticides, harvesting, and storage are in the process of being completely automated. Specialty crops (tree fruit, flowers, vegetables, and nuts) are excellent candidates for similar automation as they have high monetary value, high management cost and high variability in growth. An important capability for both precision agriculture and phenotyping is yield mapping. Yield mapping for tree fruit is challenging because it involves solving multiple computer vision (fruit detection, counting, recovering underlying 3D geometry for tracking fruit across different frames in continuously changing illumination) as well as planning problems (path planning for covering all fruit, picking fruit). The main goal of this dissertation is to develop computer vision and deep learning algorithms for yield mapping in specialty farms. The dissertation is divided into three parts. The first part is dedicated to developing practical solutions for yield mapping in specialty farms. We present solutions for fruit detection, counting, recovering the underlying scene geometry and fruit tracking. We integrate these individual solutions in a modular manner and create a flexible framework for complete yield estimation. Additionally, we perform an extensive experimental evaluation of the developed system and sub-components. Our algorithms successfully predict 97% of the ground truth yield and outperform all existing state-of-the-art methods. Some of these efforts are now in the process of being commercialized. In the second part of the dissertation, we study a problem where a manipulator equipped with a camera, mounted on a ground robot is charged with accurately counting fruit by taking a minimum number of views. We present a method for efficiently enumerating combinatorially distinct world models most likely to generate the captured views. These are incorporated into single and multi-step planners for accurate fruit counting. We evaluate these planners in simulation as well as with experiments on a real robot. In the third part, we study the problem of realistic synthetic data generation for training deep neural networks. We present a method that jointly translates the synthetic images and their underlying semantics to the domain of the real data so that an adversarial discriminator (a deep neural network) cannot distinguish between the real and synthetic data. This method enables us to stylize the synthetic data to any fruit, lighting condition and environment. It can be applied to a wide variety of domain transfer tasks beyond fruit detection and counting (e.g from Grand Theft Auto (GTA) to Cityscapes for autonomous driving). Additionally, it enables us to perform image to image translation with significant changes in underlying geometry (e.g circles to triangles, sheep to giraffe, etc). These results in this dissertation together present a complete yield monitoring system for specialty crops, view planning strategies for accurate fruit counting and a framework for generating realistic synthetic data. These methods together push the state-of-the-art and take us one step closer toward building a sustainable infrastructure for intelligent integrated farm management.
Development and evaluation of methodologies for the classification of ecological communities.
(2008-12) Wan, Haibo
I developed and evaluated methodologies for studying ecological communities. The first component of my work, Chapter 2, develops a scientific framework, referred to as the nutshell philosophy, for classifying aquatic habitats in the St. Croix National Scenic Riverway. The "nutshell" is a compact philosophy for practitioners, including two parts: (1) a classification framework using a small set of environmental predictor variables; and (2) test of the classification using biological data. My classification resulted in two units: segments and reaches. Segments have a dimension of 15+ km and are delineated by major tributary outlets and channel slopes. Reaches are nested within the segments, with a dimension of 2+ km and are delineated by changes in stream substrate. A data set of mussel communities was clustered to test the classification. The clusters were consistent with the segments. The "nutshell" philosophy was validated in this case. The second component of my work, Chapter 3, introduces a method investigating stream substrate with underwater videos. With it a desired large area can be covered, and the resulting videos can be archived and used for quantitative analysis. The third component of my dissertation, Chapters 4 and 5, relate to clustering methods. Chapter 4 investigates impacts of selected cluster number on the accuracy of clustering algorithms. I clustered data sets of known structure with three different methods, varying the cluster number; clustering accuracy was evaluated using the Rand statistic. The Elbow phenomenon was typically found for the response of the Rand statistic with the cluster number. I described the Rand curves with an analytical relationship and proposed a threshold slope of 0.001 to locate the optimal cluster number. Chapter 5 evaluates performance of an emerging clustering method, Self-Organizing Map (SOM), with the more traditional methods of K-mean and Unweighted-Pair-Group-Method-using-Arithmetic-Averages (UPGMA). The SOM method was similar to the K-means in performance, and it was the best clustering method overall because of its additional and exceptional visualization feature. Although the UPGMA method also has visualization feature and worked well with data of low complexity, its performance decreases substantially with data of medium or high complexity.
Dictionary learning and sparse coding for unsupervised clustering
(University of Minnesota. Institute for Mathematics and Its Applications, 2009-09) Sprechmann, Pablo; Sapiro, Guillermo
Efficient inference algorithms for some probabilistic graphical models
(2014-02) Fu, Qiang
The probabilistic graphical model framework provides an essential tool to reason coherently from limited and noisy observations. The framework has been used in an enormous range of application domains, which include: natural language processing, computer vision, bioinformatic, robot navigation and many more. We propose several inference algorithms for some probabilistic graphical models. For Bayesian network graphical models, we focus on the problem of overlapping clustering, where a data point is allowed to belong to multiple clusters. We present an overlapping clustering algo- rithm based on multiplicative mixture models. We analyze a general setting where each component of the multiplicative mixture is from an exponential family, and present an efficient alternating maximization algorithm to learn the model and infer overlap- ping clusters. We also propose a Bayesian Overlapping Subspace Clustering (BOSC) model which is a hierarchical generative model for matrices with potentially overlapping uniform sub-block structures. The BOSC model can also handle matrices with missing entries. We propose an EM-style algorithm based on approximate inference using Gibbs sampling and parameter estimation using coordinate descent for the BOSC model. We propose an EM-style algorithm based on approximate inference using Gibbs sampling and parameter estimation using coordinate descent for the BOSC model. We also consider Markov random field graphical models and address the problem of maximum a posteriori (MAP) inference. We first show that the drought detection problem from the climate science domain can be formulated as a MAP inference problem and propose an automatic drought detection problem. We then present a parallel MAP inference algorithm called Bethe-ADMM based on two ideas: tree-decomposition of the graph and the alternating direction method of multipliers (ADMM). However, unlike the standard ADMM, we use an inexact ADMM augmented with a Bethe-divergence based proximal function, which makes each subproblem in ADMM easy to solve in parallel using the sum-product algorithm. We rigorously prove global convergence of Bethe-ADMM. The proposed algorithm is extensively evaluated on both synthetic and real datasets to illustrate its effectiveness. Further, the parallel Bethe-ADMM is shown to scale almost linearly with increasing number of cores.
Experimental investigations on the preferential concentration and clustering phenomena of wall-bounded gas-solid ﬂows
(2021-05) Fong, Kee Onn
The transport of solid particles by fluid flows are ubiquitous in many environmental, biomedical and industrial processes, yet a full understanding of the dynamics of these processes remain elusive. In this thesis, we present experimental investigations of two different phenomena that arises in particle-laden flows: the preferential concentration of particles in turbulent flows, and the clustering of particles in dense gas-solid flows. In the former case, we investigate inertial particles dispersed in a turbulent downward flow through a vertical channel. The working fluid is air laden with size-selected glass microspheres, having Stokes numbers St = O(10) and O(100) when based on the Kolmogorov and viscous time scales, respectively. Cases at friction Reynolds numbers 235 and 335 and solid volume fractions 3E-6 and 5E-5 are considered. Between the more dilute and denser cases, substantial differences are observed in all measured statistics e.g. the particle concentration profile, mean velocity profile and the velocity fluctuation levels; consistent with a scenario in which the increase in volume fraction from O(1E-6) to O(1E-5) triggers two-way and local four-way coupling effects. An analysis of the spatial distributions of particle positions and velocities in the higher volume fraction cases also reveals different behavior in the core and near-wall regions. In the channel core, dense clusters form that travel faster than the less concentrated particles; whereas in the near-wall region the particles arrange in highly elongated streaks that are associated with negative streamwise velocity fluctuations. In the denser regime, we present experimental observations on the velocities and spatial distribution of particles in a three-dimensional, gas-solid riser with particle volume fractions approaching 1%. The setup consists of a vertical square channel in which air flows upwards against falling 212 um glass spheres. We use a backlighting technique and high-speed imaging to quantify the spatial and temporally resolved particle concentration and velocity fields. By controlling the particle feed rate and the flow rate of the fluidizing air, volume fractions and bulk flow Reynolds number are adjusted independently. Results show that, in the present range of parameters, clustering of particles appear beyond a critical volume fraction regardless of fluidization velocities, influencing the mean and r.m.s. statistics and strongly modulating the two-point correlation statistics. Space-time autocorrelation analysis reveals the convection of structures in the velocity and concentration fields, and the fluctuations of velocities and concentrations are well-described by the classic gradient diffusion hypothesis. Particle-resolved measurements reveal that particles in the riser have a sub-Poissonian spatial distribution, and their streamwise velocity fluctuations are correlated in the streamwise direction. This indicates significant hydrodynamic interaction between the particles, especially in the direction of gravity.
Inference using Geometry and Density Information in Manifold Data
(2023-08) Bera, Sabyasachi
Clustering is the task of grouping a dataset so that data in the same group (called acluster) are more similar in some sense to each other than to those in other groups. While diferent notions of clustering exist in literature, it is commonly understood that data which are "close" to each other (geometric proximity) should be in the same cluster and clusters should capture the concentration pattern (high density regions) in the data. In many applications, especially when the data is from a topological manifold, we are required to capture both geometry and density information from the data simultaneously in order to cluster them in a meaningful way. We introduce g-distance, a data driven density sensitive distance, and explore its theoretical properties, geometry and usefulness in clustering applications under several data generating models. We derive the convergence limit of longest leg path distance (LLPD), a purely density based limiting form of g-distance. We compare several distances, for example, Euclidean distance, g-distance, LLPD, in clustering and manifold learning applications under several data generating models. Finally, as an application of high-dimensional learning and manifold learning, we develop a technique for record linkage on high-dimensional data using sparse principal components.
Integrated analysis of genomic data for inferring gene regulatory networks.
(2009-04) Zare Sangederazi, Hossein
As genomic technology and sequencing projects continue to advance, more emphasis needs to be put on data analysis, while addressing the issue of how best to extract information from diverse data sets. For example, functional annotation of new genes can no longer depends only on sequence analysis, but requires integration of additional sources of information including phylogeny, gene expression, protein interaction, metabolic and regulatory networks. Therefore, new biological discoveries will depend strongly on our ability to combine these diverse data sets. We demonstrate how information from gene expression, regulatory sequence patterns and location data can be combined to discover regulatory modules and to construct gene transcriptional regulatory networks. In the context of modeling regulatory sequences, we propose a higher order probabilistic model to efficiently discriminate between the binding sites of a transcription factor and non-specific DNA sequences. Moreover, a model-based algorithm is developed, which integrates gene expression data, modeled by mixtures of Gaussian, with the regulatory sequence patterns for clustering of functionally related genes. For the construction of the gene regulatory network, we introduce the concept of Gene-Regulon association in contrast to Gene-Gene interaction. Unlike Gene-Gene interaction methods, where the mRNA levels of the regulators play the important role, Gene-Regulon methods rely on the activity profiles of the transcription factors. These activity profiles, in the absence of their direct measurements, are estimated concurrently via a computational model. We develop a model selection algorithm, which is capable of capturing the activity profile of a transcription factor from the transcriptional activity of its target genes. In addition, we present a data driven approach based on nonlinear kernel embedding for capturing the nonlinear correlation and geometric connectivity pattern in gene expression data. We apply these methods for integrating gene expression and interaction data to construct a network of transcriptional regulation in Escherichia coli (E. coli).
Searching, Clustering and Regression on non-Euclidean Spaces
(2015-08) Wang, Xu
This dissertation considers three common tasks (e.g., searching, clustering, regression) over Riemannian spaces. The first task considers the problem of efficiently deciding which of a database of subspaces is most similar to a given input query. Motivated by applications in recognition, image retrieval and optimization, there has been significant recent interest in this problem. Current approaches to this problem have poor scaling in high dimensions, and may not guarantee sublinear query complexity. We present a new approach to approximate nearest subspace search, based on a simple, new locality sensitive hash for subspaces. For the second task, we advocates a novel framework for segmenting a dataset in a Riemannian manifold into clusters lying around low-dimensional submanifolds. This clustering problem is useful for applications such as action identification, dynamic texture clustering, brain fiber segmentation, and clustering of deformed images. The proposed clustering algorithm constructs an affinity matrix by exploiting the geometry and then applies spectral clustering. Theoretical guarantees are established for a variant of the algorithm. To avoid complication, these guarantees assume that the submanifolds are geodesic. Extensive validation on synthetic and real data demonstrates the resiliency of the proposed method against deviations from the theoretical model as well as its superior performance over state-of-the-art techniques. In the third task, we proposes a novel framework for manifold-valued regression and establishes its consistency as well as its contraction rate for a particular setting. Our setting assumes a predictor with values in the interval [0,1] and response with values in a compact Riemannian manifold. This setting is useful for applications such as modeling dynamic scenes or shape deformations, where the visual scene or the deformed objects can be modeled by a manifold. The proposed framework uses the heat kernel on manifolds as an averaging procedure. It directly generalizes the use of the Gaussian kernel in vector-valued regression problems. In order to avoid explicit dependence on estimates of the heat kernel, we follow a Bayesian setting, where Brownian motion induces a prior distribution on the space of continuous functions. We study the posterior consistency and contraction rate of the discrete and continuous Brownian motion priors.
Understand the Similarity of Internet Service Providers via Peer-to-Peer User Interest Analysis
(2019-06) Joshi, Prateek
Internet traffic continues to exhibit exponential growth in the past few years. This forces Internet service providers(ISPs) to continuously invest in infrastructure upgrades and deploy traffic management techniques, such as caching and locality, to fulfill the increasing user demand. To help ISPs better manage their infrastructures, it is important to compare and understand the similarity of their user interests. However, such a comparison is challenging because the ISP data is hard to obtain, not to mention the related modeling and analysis issues. In this thesis, we aim to understand the ISP similarity through an extensive analysis of Peer-to-Peer(P2P) user interest. To collect the P2P dataset, we develop a tool to automatically download BitTorrent's meta-info(torrent) files on the Internet. This tool also helps us to collect important peer and content information in these BitTorrent swarms without uploading any copyrighted files. As a result, we successfully obtained 16,697 active peers from 1,721 torrents in 1,097 unique Autonomous Systems(ASes). After that, we adopt the classic statistical and clustering approaches to compare their different user interests. Our research for the first time shows the existence of cloud users in such real-world content distribution systems as BitTorrent. The model analysis further indicates that we can adopt similar traffic management approaches (e.g., caching similar contents) across geographically closer ASes.
Using asymmetry in the spectral clustering of trajectories.
(2011-07) Atev, Stefan Emilov
Spatial trajectories, for example those of vehicles passing through a traffic intersection, are of interest in many data collection applications since they capture a lot of semantic information in a fairly compact representation. Data mining, or unsupervised learning from sets of trajectories can be challenging since intuitive notions of trajectory similarity are hard to encode rigorously. Many similarity measures for trajectories that are needed for tasks such as clustering fail to satisfy basic metric properties like the triangle inequality, or even symmetry. While in some simple practical applications such measures have been quite successful, the violation of basic properties poses unique challenges for more advanced methods such as spectral clustering. We show how the asymmetry of a trajectory similarity measure can be exploited when clustering a set of trajectories. The asymmetry is used both indirectly, in a traditional spectral clustering method, and directly, by developing a spectral clustering method that can handle asymmetric affinity matrices natively without requiring an artificial symmetrization step to be performed, thus avoiding the attendant loss of information entailed by that process. We propose a modification of the Hausdorff distance for comparing trajectories, which we first symmetrize in a non-standard fashion inspired by a local scaling approach, and then further show that the distance can be used directly without prior symmetrization. We perform a variety of experiments, with a focus on vehicle trajectories. A novel automated tracking method is developed to provide experimental data.

University Digital Conservancy

Browse by Subject

Browsing by Subject "Clustering"