Browsing by Subject "clustering"
Now showing 1 - 12 of 12
Item: Clustering Based on Association Rule Hypergraphs (1997)
Han, Euihong; Karypis, George; Kumar, Vipin; Mobasher, Bamshad

Traditional clustering algorithms, used in data mining for transactional databases, are mainly concerned with grouping transactions, but they do not generally provide an adequate mechanism for grouping items found within these transactions. Item clustering, on the other hand, can be useful in many data mining applications. We propose a new method for clustering related items in transactional databases that is based on partitioning an association rule hypergraph, where each association rule defines a hyperedge. We also discuss some of the applications of item clustering, such as the discovery of meta-rules among item clusters, and clustering of transactions. We evaluated our scheme experimentally on data from a number of domains and, wherever applicable, compared it with AutoClass. In our experiment with stock-market data, our clustering scheme is able to successfully group stocks that belong to the same industry group. In the experiment with congressional voting data, this method is quite effective in finding clusters of transactions that correspond to either Democrat or Republican voting patterns. We found clusters of segments of protein-coding sequences from a protein-coding database that share the same functionality and are thus very valuable to biologists for determining the functionality of new proteins. We also found clusters of related words in documents retrieved from the World Wide Web (a common and important application in information retrieval). These experiments demonstrate that our approach holds promise in a wide range of domains, and is much faster than traditional clustering algorithms such as AutoClass.

Item: Clustering in a High-Dimensional Space Using Hypergraph Models (1997)
Han, Eui-Hong; Karypis, George; Kumar, Vipin; Mobasher, Bamshad

Clustering of data in a large dimensional space is of great interest in many data mining applications. Most traditional algorithms, such as K-means or AutoClass, fail to produce meaningful clusters in such data sets, even when they are used with well-known dimensionality reduction techniques such as Principal Component Analysis and Latent Semantic Indexing. In this paper, we propose a method for clustering data in a high-dimensional space based on a hypergraph model. The hypergraph model maps the relationships present in the original high-dimensional data into a hypergraph. A hyperedge represents a relationship (affinity) among subsets of data, and the weight of the hyperedge reflects the strength of this affinity. A hypergraph partitioning algorithm is used to find a partitioning of the vertices such that the corresponding data items in each partition are highly related and the weight of the hyperedges cut by the partitioning is minimized. We present results of experiments on three different data sets: S&P 500 stock data for the period 1994-1996, protein coding data, and Web document data. Wherever applicable, we compared our results with those of AutoClass and the K-means clustering algorithm, on the original data as well as on the reduced-dimensionality data obtained via Principal Component Analysis or Latent Semantic Indexing. These experiments demonstrate that our approach is applicable and effective in a wide range of domains. More specifically, our approach performed much better than traditional schemes for high-dimensional data sets in terms of cluster quality and runtime. Our approach was also able to filter out noise data from the clusters very effectively without compromising the quality of the clusters.
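The hypergraph model shared by these two papers is easy to prototype: items become vertices, each association rule contributes a hyperedge over the items it mentions, and clustering reduces to finding a partition that cuts little hyperedge weight. The sketch below is a minimal, hypothetical illustration; the rules and weights are made up, and a brute-force search stands in for the multilevel hypergraph partitioner used in the actual work.

```python
from itertools import combinations

# Toy association rules: (itemset, confidence). Each rule becomes one
# hyperedge over the items it mentions; confidence is the edge weight.
# (Hypothetical data; the papers mine these from transactions first.)
rules = [
    ({"bread", "butter", "milk"}, 0.9),
    ({"bread", "butter"}, 0.8),
    ({"beer", "chips"}, 0.7),
    ({"beer", "chips", "salsa"}, 0.6),
]

vertices = sorted(set().union(*(items for items, _ in rules)))

def cut_weight(part_a):
    """Total weight of hyperedges spanning both sides of the partition."""
    return sum(w for items, w in rules
               if items & part_a and items - part_a)

# Brute-force 2-way partition (fine for a toy example; the papers use a
# multilevel partitioner for real data).
best = min((frozenset(c) for r in range(1, len(vertices))
            for c in combinations(vertices, r)),
           key=cut_weight)
print("cluster 1:", sorted(best))
print("cluster 2:", sorted(set(vertices) - best))
```

On this toy input the zero-cut partition separates the breakfast items from the snack items, mirroring the industry-group and voting-bloc clusters the abstracts report.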
Item: Hyperdimensional Computing based Classification and Clustering: Applications to Neuropsychiatric Disorders (2023-12)
Ge, Lulu

Since its introduction in 1988, hyperdimensional computing (HDC), also referred to as vector symbolic architecture (VSA), has attracted significant attention. Using hypervectors as unique data points, this brain-inspired computational paradigm represents, transforms, and interprets data effectively. So far, the potential of HDC has been demonstrated: comparable performance to traditional machine learning techniques, high noise immunity, massive parallelism, high energy efficiency, fast learning/inference speed, one/few-shot learning ability, etc. In spite of HDC's wide range of potential applications, relatively few studies have been conducted to demonstrate its applicability. To this end, this dissertation focuses on the application of HDC to neuropsychiatric disorders: (a) seizure detection and prediction, (b) brain graph classification, and (c) transcranial magnetic stimulation (TMS) treatment analysis. We also develop novel clustering algorithms using HDC that are more robust than the existing HDCluster algorithm.

To detect and predict seizures, intracranial electroencephalography (iEEG) data are analyzed using HDC-based local binary pattern (LBP) and power spectral density (PSD) encoding. Our study examines the effectiveness of utilizing all features as well as a small number of selected features. Our results indicate that HDC can be used for seizure detection, where PSD encoding is superior to LBP encoding. We observe that even three features suffice to detect seizures with PSD encoding. However, to pave the way for seizure prediction using HDC, more efficient features must be explored.

For the classification of brain graphs, data from functional magnetic resonance imaging (fMRI) are analyzed. Brain graphs describe the functional brain connectome under varying brain states and are generated from fMRI data collected at rest and during tasks. The brain graph structure is assumed to vary from task to task and from task to no task. Participants are asked to execute emotional and gambling tasks, but no tasks are assigned during resting periods. GrapHD, an HDC-based graph representation initially developed for object detection, is herein expanded for application to brain graph classification. Experimental results demonstrate that GrapHD encoding is capable of classifying brain graphs for three binary classification problems: emotion vs. gambling, emotion vs. no-task, and gambling vs. no-task. Furthermore, GrapHD requires fewer memory resources than extant HDC-based encoding approaches.

In terms of clustering, HDCluster, an HDC-based clustering algorithm, was proposed in 2019. Originally designed to mimic traditional k-means, HDCluster exhibits higher clustering performance across versatile datasets. However, we have identified that the performance of HDCluster may be significantly influenced by the random seed used to generate the seed hypervectors. To mitigate the impact of this random seed, we propose more robust HDC-based clustering algorithms, designed to outperform HDCluster. Experimental results substantiate that our HDC-based algorithms are more robust and capable of achieving higher clustering performance than HDCluster.

In the analysis of TMS treatment, we conduct two specific tasks. One is to identify clinical trajectory patterns for patients who suffer from major depressive disorder (MDD) (clustering). The other is to predict MDD severity using 34 measured cognitive variables (classification). For clustering, we propose a novel HDC-based algorithm that adapts HDCluster to determine the number of clusters for a system of clinical trajectories. For classification, we utilize two HDC-based encoding algorithms and examine the impact of using either all features or selected features. Experimental results indicate that our HDC algorithm mirrors the clustering pattern of the classical algorithm. Additionally, the HDC-based classifier effectively predicts clinical response.
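For readers unfamiliar with HDC, the core operations such encodings build on can be sketched in a few lines: random bipolar hypervectors act as quasi-orthogonal codes, records are bundled by superposition, and similarity is a normalized dot product. This is an illustrative sketch under those standard assumptions, not the dissertation's actual encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality

# Each symbol gets a random bipolar hypervector; in high dimension these
# codes are quasi-orthogonal, so distinct symbols barely interfere.
codebook = {s: rng.choice([-1, 1], size=D) for s in "ABCD"}

def bundle(symbols):
    """Superpose symbol hypervectors; sign() keeps the result bipolar."""
    return np.sign(sum(codebook[s] for s in symbols))

def similarity(x, y):
    """Normalized dot product: near 1 for similar records, near 0 otherwise."""
    return float(x @ y) / D

rec1 = bundle("ABC")
rec2 = bundle("ABD")   # shares two of three symbols with rec1
print(similarity(rec1, rec2))            # clearly positive
print(similarity(rec1, codebook["D"]))   # near zero
```

Clustering schemes in the HDCluster family then run k-means-style assignment and update steps directly on such hypervectors, which is why the random seed behind the seed hypervectors can matter.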
Item: Implications of clustering in cell type annotation with single cell RNAseq data (2022-07)
Cepela, Jason

Single cell RNA-sequencing (scRNAseq) provides the high-resolution data necessary to investigate rare cell populations contributing to the treatment resistance commonly observed in many forms of cancer. Generally, the first step in this investigation is understanding the cellular makeup of a tissue sample by annotating cells based on their RNA expression profile. In this study, we compare cluster-based cell type annotation methods with approaches that annotate cells individually without employing clustering. Cell type frequencies are identified across 22 ovarian cancer tumor samples sequenced using 10x Genomics single cell RNA sequencing. These approaches are compared to identify biases, enable the identification of rare cell populations, and ultimately allow the correlation of cellular profiles to clinical response, with the long-term goal of using these data to tailor treatment options and improve patient outcomes.

Item: The INCLude (InterNodal Complete Linkage) Hierarchical Clustering Method (2015-02)
Olsen, David

The goal of this project was to develop a general, complete-linkage hierarchical clustering method that 1) substantially improves upon the accuracy of the standard complete-linkage method and 2) can be fully automated or used with minimal operator supervision. For the first part of the project, a new complete-linkage hierarchical clustering method was developed. The INCLude (InterNodal Complete Linkage) hierarchical clustering method unwinds the assumptions that underlie the standard complete-linkage method. Further, evaluating pairs of data points for linkage is decoupled from constructing cluster sets, and cluster sets are constructed de novo instead of by updating previously constructed cluster sets. Thus, it is possible to construct only the cluster sets for select, possibly noncontiguous levels of an (n(n-1)/2 + 1)-level hierarchical sequence. However, by unwinding the assumptions that underlie the standard complete-linkage method, the size of a hierarchical sequence reverts from n levels to n(n-1)/2 + 1 levels, and the time complexity to construct cluster sets becomes O(n^4). For the second part of the project, a means was developed for finding meaningful levels of an (n(n-1)/2 + 1)-level hierarchical sequence prior to performing a cluster analysis. The means includes constructing at least one distance graph, which is visually examined for features that correlate with meaningful levels of the corresponding hierarchical sequence. Thus, it is possible to know which cluster sets to construct and to construct only those cluster sets. This reduces the time complexity to construct cluster sets from O(n^4) to O(l n^2), where l is the number of meaningful levels. The second part also looked at how increasing the dimensionality of the data points helps reveal inherent structure in noisy data, which is necessary for finding meaningful levels. The third part of the project resolved how to mathematically capture the graphical relationships that underlie the above-described features and integrate the means into the new clustering method. By doing so, the new method becomes self-contained and incurs almost no extra cost to determine which cluster sets should be constructed and which should not. Empirical results from nine experiments are included.
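For contrast with INCLude, the standard complete-linkage baseline it improves upon is available off the shelf. The snippet below (synthetic data, SciPy's agglomerative implementation) builds the usual n-level merge hierarchy that the abstract's (n(n-1)/2 + 1)-level sequence generalizes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Standard complete-linkage clustering on two synthetic blobs, for
# contrast with INCLude (which constructs cluster sets de novo at
# selected levels rather than merging agglomeratively).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(3, 0.3, (20, 2))])

Z = linkage(X, method="complete")              # (n-1) x 4 merge history
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels))                     # [0, 20, 20]: two clusters
```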
Item: Laboratory Investigation of Disperse Multiphase Turbulent Flows: Dilute & Dense Distributions of Inertial Particles Settling in Air (2020-05)
Petersen, Alec

Turbulent multiphase flows are found throughout our universe, all over Earth, and in many man-made systems. Despite surrounding us, their dynamics are still in many ways obscure and require further study. These chaotic systems are, however, quite complicated to simulate or to explore experimentally. In this thesis, we present our laboratory investigation of particle-laden turbulent flows in air. We first focus on the statistical dynamics of dilute multiphase turbulence. Utilizing a zero-mean-flow air turbulence chamber, we drop size-selected solid particles and study their dynamics with particle imaging and tracking velocimetry at multiple resolutions. The carrier flow is simultaneously measured by particle image velocimetry of suspended tracers, allowing the characterization of the interplay between the dispersed and continuous phases. The turbulence Reynolds number based on the Taylor microscale ranges from 200 to 500, while the particle Stokes number based on the Kolmogorov scale varies between O(1) and O(10). Clustering is confirmed to be most intense for Stokes ≈ 1, but it extends over larger scales for heavier particles. Individual clusters form a hierarchy of self-similar, fractal-like objects, preferentially aligned with gravity, with sizes that can reach the integral scale of the turbulence. Remarkably, the settling velocity of Stokes ≈ 1 particles can be several times larger than the still-air terminal velocity, and the clusters can fall even faster. This is caused by downward fluid fluctuations preferentially sweeping the particles, and we propose that this mechanism is influenced by both large and small scales of the turbulence. The particle-fluid slip velocities show large variance, and both the instantaneous particle Reynolds number and drag coefficient can greatly differ from their nominal values. Finally, for sufficient loadings, the particles generally augment the small-scale fluid velocity fluctuations, which however may account for a limited fraction of the turbulent kinetic energy.

We also investigate denser particle-laden flows, specifically plumes driven by the downward buoyancy of inertial particles. With similar tools, we conduct two experiments: one to capture the particle-phase behavior and another to measure the ambient air velocity. Our first focus is on the assumption of self-similarity, which, unlike for single-phase plumes, is not trivial. We also characterize the mean plume properties observed, the particle-phase velocity and the plume spread, comparing their evolution with axial distance from the plume source. From our measurements of the ambient air flow, we calculate the entrainment velocity into the particle-laden plumes and, using the time-averaged value, estimate the entrainment coefficient along the plume. We find a relatively stable entrainment rate, as expected in the assumption used to formulate many integral plume models. Lastly, we compared our experimental results to single- and multiphase plume models with the same initial conditions as our experiments. Our multiphase plume model, inspired by the work of Liu (2003) and Lai et al. (2016), described our velocity measurements well, whereas single-phase models were completely unequipped to do so.
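The particle response scales in this abstract follow from Stokes drag, with response time tau_p = rho_p * d^2 / (18 * mu), Stokes number St = tau_p / tau_eta, and still-air terminal velocity v_t = tau_p * g. A back-of-the-envelope check with illustrative values (not the experiment's actual parameters) looks like this:

```python
# Hypothetical numbers chosen only to land in the regime the abstract
# describes (St between O(1) and O(10)).
rho_p   = 2500.0   # particle density, kg/m^3 (e.g., glass)
d       = 50e-6    # particle diameter, m
mu      = 1.8e-5   # dynamic viscosity of air, Pa*s
tau_eta = 5e-3     # Kolmogorov time scale of the turbulence, s
g       = 9.81     # gravitational acceleration, m/s^2

tau_p = rho_p * d**2 / (18 * mu)   # Stokes response time
St    = tau_p / tau_eta            # Stokes number
v_t   = tau_p * g                  # still-air terminal velocity
print(f"tau_p = {tau_p:.2e} s, St = {St:.2f}, v_t = {v_t:.3f} m/s")
```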
Item: Large-scale Clustering using Random Sketching and Validation (2015-08)
Traganitis, Panagiotis

The advent of high-speed Internet, modern devices, and global connectivity has introduced the world to massive amounts of data that are generated, communicated, and processed daily. Extracting meaningful information from this humongous volume of data is becoming increasingly challenging even for high-performance and cloud computing platforms. While critically important in a gamut of applications, clustering is computationally expensive when tasked with high-volume, high-dimensional data. To render such a critical task affordable for data-intensive settings, this thesis introduces a clustering framework named random sketching and validation (SkeVa). This framework builds upon and markedly broadens the scope of the random sample and consensus (RANSAC) ideas that have been used successfully for robust regression. Four main algorithms are introduced, which enable clustering of high-dimensional data, as well as subspace clustering for data generated by unions of subspaces and clustering of large-scale networks. Extensive numerical tests compare the SkeVa algorithms to their state-of-the-art counterparts and showcase the potential of the SkeVa framework.
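The sketch-and-validate idea is RANSAC-like: cluster a small random sketch of the data, score the result on an independent validation draw, repeat a few times, and keep the best model. Below is a minimal, hypothetical rendering with k-means as the inner clusterer; the thesis's four SkeVa algorithms are considerably more refined.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (10_000, 2)) for m in (0, 4, 8)])

best_score, best_model = -np.inf, None
for _ in range(10):
    # Sketch: cluster a small random subset instead of the full data.
    sketch = X[rng.choice(len(X), size=200, replace=False)]
    model = KMeans(n_clusters=3, n_init=10).fit(sketch)
    # Validate: score on an independent random draw (sklearn's score()
    # is the negated k-means objective, so higher is better).
    val = X[rng.choice(len(X), size=200, replace=False)]
    score = model.score(val)
    if score > best_score:
        best_score, best_model = score, model

labels = best_model.predict(X)  # label the full data set only once
print(np.bincount(labels))      # three roughly equal clusters
```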
Item: Predicting Therapy Adherence: A Machine Learning Approach (2021-12)
Lima Diniz Araujo, Matheus

Ensuring adherence to medical therapy has been an open problem in health care from the Hippocratic Oath (400 BC) to modern medicine. In an ideal world, people would follow their doctor's recommendations: they would stick to their diet to lose weight, take their medication on time, and use their electronic health devices as recommended. But the planned routine is rarely followed, causing a financial burden on the order of billions of dollars for the national healthcare system and many billions more worldwide. A key mechanism to revert a tendency toward non-adherence is early personalized intervention targeting the key factors of the undesired behavior, but this task is not trivial. After starting their therapy, individuals face an unpredictable series of life events that may impact their willingness to keep up with the necessary therapy routine. Only recently have we gained the ability to passively collect individual-level therapy data as patients progress in their treatments using digital devices. In this thesis, we propose various machine-learning strategies that leverage the data collected at the early stages of medical therapies to predict future adherence and recommend early, accurate interventions that align with each individual's desired outcomes. We focus most of the analysis on two sleep apnea therapies, Continuous Positive Airway Pressure (CPAP) and Upper-Airway Stimulation (UAS), but to reinforce the generality of our methods we also show how they can be applied to growth-hormone therapy management.

Item: Robust Fragmentation: A Data-Driven Approach to Decision-Making Under Distributional Ambiguity (2016-07)
Moulton, Jeffrey

Decision makers often must consider many different possible future scenarios when they make a decision. A manager must choose inventory levels to maximize profit when demand is unknown, investors choose a portfolio to maximize gain in the face of uncertain stock price fluctuations, and service providers must pick facility locations to minimize distances traveled when the location of the next customer is random. When uncertainty is involved, the desired outcome will be affected by factors and variables outside the decision maker's control and knowledge at the time of the decision. Such factors of uncertainty are hard to account for, especially when exact information about the uncertainty is unavailable. It may be difficult to account for all possibilities or to estimate how likely each scenario is. There may be an infinite number of scenarios, too many for the human brain to contemplate. Generally, a precise distribution of these factors is unavailable. The unknown data, which can be quantified as a vector of parameters, follow an unknown distribution, leading to the term distributional ambiguity. The decision maker may only have a sample of past events to guide them in balancing the costs and benefits of every possibility. Such limited historical data can help to provide guidance, but how best to use it? One would desire a decision strategy that efficiently uses the data and performs well in terms of the expectation, but is also robust to possible noise, changing conditions, and small sample size.

To this end, we develop robust fragmentation (RF), a data-driven approach to decision-making under distributional ambiguity. We consider a stochastic programming framework, where the decision maker aims to optimize an expected outcome. However, we assume that the governing distribution of the random parameters affecting the outcome is unknown; only a historical sample of past realizations of such parameters is available. Our goal is to leverage the historical data to effectively and robustly approximate the true problem. This is done by fragmenting the support of the data into pieces to dissect the structure. We reduce the data by replacing it with summary statistics by region. These parameters are used to construct an ambiguity set for the distribution, consisting of all distributions that would satisfy the same regional reduction. We thereby reduce the problem size and avoid overfitting to the training sample. Our approach allows for two types of ambiguity: ambiguity in support and ambiguity in probability. After constructing the ambiguity set, we consider the worst-case expectation to provide distributional robustness. Constraining the distribution regionally allows us to capture detailed information about the distribution and keeps the approach from being too conservative. The ambiguity may be tailored to the structure, amount, and reliability of the data. Robust fragmentation is a generalization and extension of several classical and newer approaches to approximating a stochastic program, including the sample average approximation (SAA), moments-based distributionally robust optimization (MDRO), and distributionally robust optimization based on probabilities of individual events.
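The fragmentation-and-reduction step RF performs before any optimization can be made concrete: cluster the historical sample into regions, then summarize each region by its probability mass and first moment, which define the regional ambiguity set. The sketch below shows only this reduction on toy bimodal data, with k-means standing in for the strategic clustering heuristics discussed; the worst-case expectation problem built on top of it is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy bimodal historical sample (hypothetical data).
rng = np.random.default_rng(0)
sample = np.vstack([rng.normal(-2, 0.5, (60, 2)),
                    rng.normal(+3, 0.8, (40, 2))])

k = 2  # number of fragments, ideally matched to the modality
regions = KMeans(n_clusters=k, n_init=10).fit_predict(sample)

# Reduce each region to summary statistics; distributions agreeing with
# these regional statistics would form the ambiguity set.
for r in range(k):
    pts = sample[regions == r]
    prob = len(pts) / len(sample)   # regional probability mass
    mean = pts.mean(axis=0)         # regional first moment
    print(f"region {r}: p = {prob:.2f}, mean = {mean.round(2)}")
```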
We demonstrate how RF conceptually fits into the greater picture and provides an adaptive way of balancing the benefits and drawbacks of each kind of approach. We make comparisons both theoretically and through numerical experiments. We outline a procedure for breaking up the data and formulating the RF optimization problem for polytopal, ellipsoidal, and other types of fragmentations. We prove convergence results to demonstrate the consistency of our approach. RF can handle overlapping regions in the fragmentation by adapting the way statistics are computed. Our algorithm for fragmenting the data relies heavily on strategic clustering of the data sample to dissect the modality. We discuss methods and heuristics for implementing our approach, and we extend and compare different clustering methods in terms of how well they serve as a preliminary step to our procedure. We consider applications in management science and financial mathematics and perform numerical experiments to demonstrate how the approach fits into existing frameworks and to evaluate its performance. It turns out that the new approach has a clear advantage when the data is multimodal and the training sample is small, but not too small relative to the number of modes. Additionally, we illustrate some extensions to the general model. The black swan approach involves adding an extra layer of robustness to account for rare and unexpected events of large magnitude and consequence. Two-stage and multistage stochastic programs extend the framework to decisions that must be made in stages, where after each stage random occurrences influence the parameters of the next stage. We derive formulations for these extended models and investigate the production-transportation problem as an application.

Item: Scalable and Ensemble Learning for Big Data (2019-05)
Traganitis, Panagiotis

The turn of the decade has trademarked society and computing research with a "data deluge." As the number of smart, highly accurate, and Internet-capable devices increases, so does the amount of data that is generated and collected. While this sheer amount of data has the potential to enable high-quality inference and mining of information, it introduces numerous challenges in processing and pattern analysis, since available statistical inference and machine learning approaches do not necessarily scale well with the number of data points and their dimensionality. In addition to the challenges related to scalability, the data gathered are often noisy, dynamic, contaminated by outliers, or corrupted to specifically inhibit the inference task. Moreover, many machine learning approaches have been shown to be susceptible to adversarial attacks. At the same time, the cost of cloud and distributed computing is rapidly declining. Therefore, there is a pressing need for statistical inference and machine learning tools that are robust to attacks and scale with the volume and dimensionality of the data by efficiently harnessing the available computational resources. This thesis is centered on analytical and algorithmic foundations that aim to enable statistical inference and data analytics from large volumes of high-dimensional data. The vision is to establish a comprehensive framework based on state-of-the-art machine learning, optimization, and statistical inference tools to enable truly large-scale inference, which can tap the available (possibly distributed) computational resources and be resilient to adversarial attacks. The ultimate goal is to both analytically and numerically demonstrate how valuable insights from signal processing can lead to markedly improved and accelerated learning tools. To this end, the present thesis investigates two main research thrusts: i) large-scale subspace clustering; and ii) unsupervised ensemble learning. These research thrusts introduce novel algorithms that aim to tackle the issues of large-scale learning. The potential of the proposed algorithms is showcased by rigorous theoretical results and extensive numerical tests.
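The unsupervised ensemble learning thrust can be previewed with the simplest possible aggregation rule: fuse several annotators' outputs by majority vote, with no ground truth available. This toy example is only a baseline; the thesis develops far more principled estimators of annotator reliability.

```python
import numpy as np

# Hypothetical binary labels from three annotators (rows) on five
# data points (columns), with no access to the true labels.
votes = np.array([
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
])

# Majority vote per column; ties would fall to label 0 here.
fused = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)
print(fused)  # -> [0 1 1 0 1]
```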
Item: Scalable GPU Algorithms for Similarity Query Processing and Clustering on Trajectory Data (2019-11)
Mustafa, Hamza

With the increasing prevalence of location sensor devices like GPS, it has become possible to collect large datasets of a special type of spatio-temporal data called trajectory data. A trajectory is a discrete sequence of positions that a moving object occupies in space as time passes. Such large datasets enable researchers to study the behavior of the objects describing these movements by issuing spatial queries such as trajectory similarity queries and trajectory clustering. Top-K trajectory similarity queries retrieve the K most similar trajectories to a given query trajectory. This query has applications in many areas, such as urban planning, ecology, and social networking; however, it is computationally expensive. In this work, we introduce a new parallel top-K trajectory similarity query technique for GPUs, FastTopK, to deal with these challenges. Our experiments on two large real-life datasets showed that FastTopK produces on average 107.96X smaller candidate result sets and 3.36X faster query execution times than the existing state-of-the-art technique, TKSimGPU.

The second type of trajectory query covered in this work is trajectory clustering. One important algorithm for clustering is DBSCAN, which is especially useful for finding clusters of arbitrary shapes. As opposed to other clustering techniques, like K-means, it does not require the number of clusters to be specified as an input parameter, and it is highly robust to outliers. However, DBSCAN has a worst-case quadratic time complexity that makes it difficult to handle large dataset sizes. To address this problem, several works have been proposed that exploit the massive parallelism of GPUs for DBSCAN clustering of point data. Nonetheless, none of these works have been experimentally compared against each other, and none have been extended to cluster trajectory data. In this thesis, we review the existing GPU algorithms for DBSCAN clustering on point data and conduct the first experimental study comparing these GPU algorithms on three real-world datasets to identify the best-performing algorithm. Our results show that G-DBSCAN is the fastest, being up to 969X faster than CPU DBSCAN on 128K points, while CUDA-DClust is the best-performing GPU algorithm in terms of combined execution time and memory requirements, running 53X faster than CPU DBSCAN while taking up to 166X less memory than G-DBSCAN. Lastly, we build on our analysis of the existing GPU-based DBSCAN clustering algorithms on point data to develop a new GPU-based trajectory clustering algorithm, GTRACLUS. Our experiments on two real-world datasets show that the proposed GTRACLUS algorithm makes trajectory clustering more than 20X faster than the CPU-based TRACLUS.
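As a CPU reference point for the behavior described above, scikit-learn's DBSCAN shows the two properties the thesis relies on: no cluster count is given up front, and outliers come back labeled -1. The data here are hypothetical; the thesis's GPU algorithms accelerate exactly this computation and its trajectory-segment variant.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (100, 2)),    # two dense blobs...
               rng.normal(3, 0.2, (100, 2)),
               rng.uniform(-2, 5, (10, 2))])    # ...plus scattered noise

# eps is the neighborhood radius; min_samples the density threshold.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}))
print("noise points:", int(np.sum(labels == -1)))
```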
Item: Unsupervised Learning of Latent Structure from Linear and Nonlinear Measurements (2019-06)
Yang, Bo

The past few decades have seen a rapid expansion of our digital world. While early dwellers of the Internet exchanged simple text messages via email, modern citizens of the digital world conduct a much richer set of activities online: entertainment, banking, booking restaurants and hotels, just to name a few. In our digitally enriched lives, we not only enjoy great convenience and efficiency, but also leave behind massive amounts of data that offer ample opportunities for improving these digital services and creating new ones. Meanwhile, technical advancements have facilitated the emergence of new sensors and networks that can measure, exchange, and log data about real-world events. These technologies have been applied to many different scenarios, including environmental monitoring, advanced manufacturing, healthcare, and scientific research in physics, chemistry, biotechnology, and social science. Leveraging the abundant data, learning-based and data-driven methods have become a dominating paradigm across different areas, with data analytics driving many of the recent developments.

However, the massive amount of data also brings considerable challenges for analytics. Among them, the collected data are often high-dimensional, with the true knowledge and signal of interest hidden underneath. It is of great importance to reduce the data dimension and transform the data into the right space. In some cases, the data are generated from certain generative models that are identifiable, making it possible to reduce the data back to the original space. In addition, we are often interested in performing some analysis on the data after dimensionality reduction (DR), and it is helpful to be mindful of these subsequent analysis steps when performing DR, as latent structures can serve as a valuable prior. Based on this reasoning, we develop two methods, one for the linear generative model case and one for the nonlinear case. In a related setting, we study parameter estimation under unknown nonlinear distortion. In this case, the unknown nonlinearity in the measurements poses a severe challenge. In practice, various mechanisms can introduce nonlinearity into the measured data. To combat this challenge, we put forth a nonlinear mixture model, which is well grounded in real-world applications. We show that this model is in fact identifiable up to some trivial indeterminacy. We develop an efficient algorithm to recover the latent parameters of this model and confirm the effectiveness of our theory and algorithm via numerical experiments.
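The "reduce, then analyze" pipeline described above takes its simplest linear form as PCA followed by clustering in the reduced space. The hypothetical sketch below stands in for that baseline; the thesis itself targets identifiable linear and nonlinear generative models rather than plain PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic linear generative model: 2-D latent clusters observed
# through an unknown 50-dimensional linear mixing, plus noise.
rng = np.random.default_rng(0)
W = rng.normal(size=(50, 2))                                   # mixing
Z = np.vstack([rng.normal(m, 0.3, (100, 2)) for m in (0, 4)])  # latent
X = Z @ W.T + 0.05 * rng.normal(size=(200, 50))                # observed

# Reduce to 2 dimensions, then cluster in the reduced space.
X_red = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_red)
print(np.bincount(labels))   # two roughly balanced clusters
```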