Browsing by Subject "Data Mining"
Now showing 1 - 11 of 11
Item Anomaly detection for symbolic sequences and time series data (2009-09) Chandola, Varun

This thesis deals with the problem of anomaly detection for sequence data. Anomaly detection has been a widely researched problem in several application domains such as system health management, intrusion detection, health care, bio-informatics, fraud detection, and mechanical fault detection. Traditional anomaly detection techniques analyze each data instance (as a univariate or multivariate record) independently and ignore the sequential aspect of the data. Often, anomalies in sequences can be detected only by analyzing data instances together as a sequence, and hence cannot be detected by traditional anomaly detection techniques. The problem of anomaly detection for sequence data is a rich area of research for two main reasons. First, sequences can be of different types, e.g., symbolic sequences, time series data, etc., and each type of sequence poses a unique set of problems. Second, anomalies in sequences can be defined in multiple ways, and hence there are different problem formulations. In this thesis we focus on solving one particular problem formulation called semi-supervised anomaly detection. We study the problem separately for symbolic sequences, univariate time series data, and multivariate time series data. The state of the art on anomaly detection for sequences is limited and fragmented across application domains. For symbolic sequences, several techniques have been proposed within specific domains, but it is not well understood how a technique developed for one domain would perform in a completely different domain. For univariate time series data, limited techniques exist and are only evaluated for specific domains, while for multivariate time series data, anomaly detection research is relatively untouched. This thesis has two key goals.
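A minimal sketch of the semi-supervised formulation above for symbolic sequences: train a model on normal sequences only, then flag test sequences the model finds unlikely. The first-order Markov scorer below is an illustrative assumption, not one of the thesis's actual techniques.

```python
import math
from collections import defaultdict

def train_markov(normal_seqs, alphabet, alpha=1.0):
    """Laplace-smoothed first-order transition model estimated from
    normal (non-anomalous) training sequences only."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in normal_seqs:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values()) + alpha * len(alphabet)
        probs[a] = {b: (counts[a][b] + alpha) / total for b in alphabet}
    return probs

def anomaly_score(seq, probs):
    """Average negative log-likelihood of the sequence's transitions;
    higher means more anomalous relative to the normal data."""
    nll = [-math.log(probs[a][b]) for a, b in zip(seq, seq[1:])]
    return sum(nll) / len(nll)

model = train_markov(["ababab", "abababab", "bababa"], alphabet="ab")
# a sequence that matches the normal alternation scores lower than one that does not
assert anomaly_score("ababab", model) < anomaly_score("aaaaaa", model)
```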
The first goal is to develop novel anomaly detection techniques for different types of sequences that perform better than existing techniques across a variety of application domains. The second goal is to identify the best anomaly detection technique for a given application domain. By realizing the first goal we develop a suite of anomaly detection techniques for a domain scientist to choose from, while the second goal helps the scientist choose the technique best suited for the task. To achieve the first goal we develop several novel anomaly detection techniques for univariate symbolic sequences, univariate time series data, and multivariate time series data. We provide extensive experimental evaluation of the proposed techniques on data sets collected across diverse domains and generated from data generators, also developed as part of this thesis. We show how the proposed techniques can be used to detect anomalies that translate to critical events in domains such as aircraft safety, intrusion detection, and patient health management. The techniques proposed in this thesis are shown to outperform existing techniques on many data sets. The technique proposed for multivariate time series data is one of the very first anomaly detection techniques that can detect complex anomalies in such data. To achieve the second goal, we study the relationship between anomaly detection techniques and the nature of the data on which they are applied. A novel analysis framework, Reference Based Analysis (RBA), is proposed that can map a given data set (of any type) into a multivariate continuous space with respect to a reference data set.
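The reference-based mapping can be sketched generically: represent each instance by its similarity to a fixed reference set, turning arbitrary data (here, symbolic sequences) into continuous feature vectors. The LCS similarity and the tiny reference set are illustrative assumptions, not the thesis's definitions.

```python
def rba_map(instances, reference, sim):
    """Map each instance to a point in R^len(reference): coordinate j is
    its similarity to the j-th reference instance."""
    return [[sim(s, r) for r in reference] for s in instances]

def lcs_sim(s, t):
    """Normalized longest-common-subsequence similarity between two
    symbolic sequences (one possible similarity choice)."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if s[i] == t[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n)

reference = ["abcd", "aabb"]
points = rba_map(["abcd", "dcba"], reference, lcs_sim)
# each symbolic sequence is now a 2-dimensional continuous vector,
# usable with traditional (vector-based) anomaly detection techniques
```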
We apply the RBA framework not only to visualize and understand complex data types, such as multivariate categorical data and symbolic sequence data, but also to extract data-driven features from symbolic sequences, which, when used with traditional anomaly detection techniques, are shown to consistently outperform state-of-the-art anomaly detection techniques for these complex data types. Two novel techniques for symbolic sequences are proposed using the RBA framework which perform better than the best existing technique on each data set.

Item Developing a Predictive Model for Hospital-Acquired Catheter-Associated Urinary Tract Infections Using Electronic Health Records and Nurse Staffing Data (2016-08) Park, Jung In

There are a number of clinical guidelines and studies about hospital-acquired catheter-associated urinary tract infections (CAUTIs), but the rate of CAUTI occurrence is still rising. Hospitals are focusing on preventing hospital-acquired CAUTI, as the Centers for Medicare and Medicaid Services (CMS) no longer provides payment for hospital-acquired infections. There is a need to explore additional factors associated with hospital-acquired CAUTI and develop a predictive model to detect patients at high risk. This study developed a predictive model for hospital-acquired CAUTIs using electronic health records (EHRs) and nurse staffing data from multiple data sources. Research using large amounts of data could provide additional knowledge about hospital-acquired CAUTI. The first aim of the study was to create a quality, de-identified dataset combining multiple data sources for machine learning tasks. To address this aim, three datasets were combined into a single dataset; after integration, the data were cleaned and prepared for analysis. The second aim of the study was to develop and evaluate predictive models to find the best predictive model for hospital-acquired CAUTI.
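Such a three-way model comparison can be sketched with off-the-shelf classifiers on a synthetic, imbalanced stand-in dataset (the real study's de-identified EHR and staffing features are not reproduced here, and which family wins depends entirely on the data at hand):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data as a stand-in for the (rare) CAUTI outcome.
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9],
                           random_state=0)
models = {
    "DT": DecisionTreeClassifier(max_depth=4, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}
# Cross-validated AUC, a sensible metric for an imbalanced outcome.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
print(scores)
```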
For the second aim, three predictive models were created using the following data mining methods: decision trees (DT), logistic regression (LR), and support vector machines (SVM). The models were evaluated and the DT model was determined to be the best predictive model for hospital-acquired CAUTI. The findings from this study present factors associated with hospital-acquired CAUTI. The results demonstrated that female gender, older age (≥56), Charlson comorbidity index score ≥ 3, longer length of stay, glucose lab result > 200 mg/dl, presence of a rationale for continued catheter use, a higher percentage of direct-care RNs with an associate’s degree in nursing, fewer total nursing hours per patient day, and a lower percentage of direct-care RNs with specialty nursing certification were related to CAUTI occurrence. Implications for future research include the use of different analytic software to investigate detailed results for the LR model, adding more factors associated with CAUTI to the modeling, using a larger sample with more patients with CAUTI, and patient outcomes research using nursing-sensitive indicators. This study has important implications for nursing practice. According to the results, nurse specialty certification, nurse education at the baccalaureate level or higher, and more nursing hours per patient day were associated with better patient outcomes. Therefore, considerable efforts are needed to promote nurse specialty certification and higher levels of nursing education, as well as an adequate supply of nursing workforce.

Item Finding Integrative Biomarkers from Biomedical Datasets: An application to Clinical and Genomic Data (2015-08) Dey, Sanjoy

Human diseases, such as cancer, diabetes and schizophrenia, are inherently complex and governed by the interplay of various underlying factors ranging from genetic and genomic influences to environmental effects.
Recent advancements in high-throughput data collection technologies in bioinformatics have resulted in a dramatic increase in diverse data sets that can provide information about such disease-related factors. These types of data include DNA microarrays providing cellular information, Single Nucleotide Polymorphisms (SNPs) providing genetic information, metabolomics data in terms of proteins and other metabolites, structural and functional brain data from magnetic resonance imaging (MRI), and electronic health records (EHRs) containing copious information about histo-pathological, demographic, and environmental factors. Despite their richness, each of these datasets only provides information about a part of the complex biological mechanism behind human diseases. Thus, effective integration of the partial information in these genomic and clinical data can help reveal disease complexities in greater detail by generating new data-driven hypotheses beyond the traditional hypotheses about biomarkers. In particular, integrative biomarkers, i.e., patterns of features that are predictive of disease and that go beyond the simple biomarkers derived from a single dataset, can lead to a customized and more effective approach to improving healthcare. This thesis focuses on addressing the key issues related to integrative biomarkers by developing new data mining approaches. One very important issue of biomarker discovery is that the models have to be easily interpretable, i.e., integrative models have to be not only predictive of the disease but also interpretable enough that domain experts can infer useful knowledge from the obtained patterns. In one such effort to make models interpretable, domain information about disease relationships was used as prior knowledge during model development. In addition, a novel metric called I-score was proposed, using the medical literature, to quantify the interpretability of the obtained patterns.
Another key issue of integrative biomarker discovery is that there may be many potential relationships present among diverse datasets. For example, a very important type of relationship in biomarker discovery is interaction: biomarkers spanning multiple datasets whose combined features are more indicative of disease than their individual constituent factors. In particular, the individual effect of each type of factor on disease predisposition can be small and thus remain undetected by most disease association techniques performed on individual datasets. Different types of relationships are explored and an association-analysis-based framework is proposed to discover them. The proposed framework is especially effective for discovering higher-order relationships, which cannot be found by the existing prominent integrative approaches to biomarker discovery. When applied to real datasets comprising three different types of data from schizophrenic and normal subjects, this approach yielded significant integrated biomarkers that are biologically relevant. Disease heterogeneity creates further issues for integrative biomarker discovery: biomarkers obtained from clinicogenomic studies may not apply to all patients to the same degree, i.e., a disease may consist of multiple subtypes, each occurring in a different subpopulation. Some potential reasons for disease heterogeneity are different pathways playing different roles in the same disease, and confounding factors such as age, ethnicity and race, or genetic predisposition, which can be available in rich EHR data. Most biomarker discovery techniques use full-space model development, i.e., they assess the performance of biomarkers on all patients without finding the distinct subpopulations. In this thesis, more customized models were built depending on patients' characteristics to handle disease heterogeneity.
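A toy illustration of the interaction notion above, on synthetic data: two features that are individually uninformative about the outcome but perfectly predictive in combination, which is exactly the kind of signal that single-dataset screening misses.

```python
import itertools

# XOR-style outcome: y depends on both features jointly, neither alone.
data = [(a, b, a ^ b)
        for a, b in itertools.product([0, 1], repeat=2)
        for _ in range(50)]

def accuracy(predict):
    return sum(predict(a, b) == y for a, b, y in data) / len(data)

# Best single-feature rule: threshold on one feature alone.
best_single = max(
    accuracy(lambda a, b, f=f, v=v: int((a if f == 0 else b) == v))
    for f in (0, 1) for v in (0, 1)
)
joint = accuracy(lambda a, b: a ^ b)
print(best_single, joint)  # 0.5 1.0: the interaction carries all the signal
```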
In summary, several data mining techniques developed in this thesis advance the state of the art in the integration of diverse biomedical datasets. Moreover, their applications on large-scale EHRs yield significant discoveries, which can ultimately lead to new data-driven hypotheses for inferring meaningful information about complex disease mechanisms.

Item A GUI For Defining Inductive Logic Programming Tasks For Novice Users (2017-03) Basak, Priyankana

Inductive logic programming, which involves learning a solution to a problem where data is more naturally viewed as multiple tables with relationships between the tables, is an extremely powerful learning method. But these methods have suffered from the fact that very few implementations are written in languages other than Prolog and that describing such problems is difficult. To describe an inductive logic programming problem, the user needs to designate many tables and relationships, and often provide some knowledge about the relationships in order for the techniques to work well. The goal of this thesis is to develop a Java-based Graphical User Interface (GUI) for novice users that will allow them to define ILP problems by connecting to an existing database and to define such a problem in an understandable way, perhaps with the assistance of data exploration techniques from the GUI.

Item Hypergraph Analytics: Modeling Higher-Order Structures And Probabilities (2020-05) Sharma, Ankit

Data structured in the form of overlapping or non-overlapping sets are found in a variety of domains, sometimes explicitly but often subtly. For example, teams, which are of prime importance in industry and social science studies, are “sets of individuals”; “item sets” in pattern mining of customer transactions are sets; and for various types of analysis in language studies a sentence can be considered a “set or bag of words”.
Although building models and inference algorithms for structured data has been an essential task in the fields of machine learning and statistics, research on “set-like” data remains less explored. Relationships between pairs of elements can be modeled as edges in a graph. However, for modeling relationships that involve all members of a set, hyperedges in a hypergraph are more natural representations. Hypergraphs are a less well-known graph-theoretic structure than graphs. Because of their popularity, graphs have been employed prolifically to model data of all kinds, with little attention given to whether the data is naturally generated as dyadic interactions or not. We think that much data is even deliberately converted to a graph for the sake of fitting it into a graph-based model, destroying precious information present when it was originally generated. This thesis describes analyzing complex group-structured data from domains like social networks, customer transaction data, and general categorical data through the lens of hypergraphs. To do so, we propose the Hypergraph Analytics Framework, under which we are interested in three higher-level questions pertaining to hypergraph modeling. First, how to model higher-order hypergraph information, and what kinds of lower-order approximations are available or sufficient depending upon the problem at hand. This question is addressed across the thesis as we employ different hypergraph models contingent upon the problem at hand. Second, we are interested in understanding what kinds of inferences are possible over the hypergraph structure and what kinds of probabilities can be learned. For this, we dissect the problem of hypergraph inference into various hyperedge prediction sub-problems and develop inference methods for each of them.
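A minimal sketch of the representational point above: a hypergraph stored as a list of hyperedges, and its clique expansion, a common lower-order (graph) approximation that keeps only pairwise co-membership. The example hyperedges are illustrative.

```python
from itertools import combinations

# A hypergraph as a list of hyperedges, each a frozenset of vertices.
hyperedges = [frozenset(e)
              for e in ({"a", "b", "c"}, {"b", "c", "d"}, {"d", "e"})]

def clique_expansion(hyperedges):
    """Graph approximation: connect every pair of vertices that share
    at least one hyperedge."""
    edges = set()
    for e in hyperedges:
        edges.update(frozenset(p) for p in combinations(sorted(e), 2))
    return edges

graph = clique_expansion(hyperedges)
# The triple {a, b, c} becomes three pairwise edges, indistinguishable from
# three separate dyadic interactions: the information loss described above.
assert len(graph) == 6 and frozenset({"a", "c"}) in graph
```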
We develop inference methods for both cross-sectional analysis, which does not take time information about group interactions into account, and longitudinal analysis, which leverages temporal data. We also develop separate methods for conducting inference over observed and unobserved regions of the hypergraph structure. This variety of inference mechanisms on hypergraph structure together constitutes the first part of the thesis, which we refer to as Spatial Analysis within our Hypergraph Analytics framework. Lastly, we are interested in learning what kinds of compression algorithms are possible for hypergraphs and how effective these techniques are. Here we develop techniques to compress the hypergraph topology to a lower-dimensional latent space, chiefly considering hyperedge compression, or hyperedge embeddings. We examine two different embedding approaches. The first is an algebraic approach, which leverages the relationship between hypergraphs and symmetric higher-order tensors; symmetric tensor decomposition techniques are then developed to learn embeddings. The second is a neural-network-based solution, which employs auto-encoders regularized by the hypergraph structure. Together, these approaches constitute the second part of the thesis, which we refer to as Spectral Analysis within the proposed Hypergraph Analytics framework.

Item Machine learning algorithms for spatio-temporal data mining (2008-12) Vatsavai, Ranga Raju

Remote sensing, which provides inexpensive, synoptic-scale data with multi-temporal coverage, has proven to be very useful in land cover mapping, environmental monitoring, forest and crop inventory, urban studies, and natural and man-made object recognition. Thematic information extracted from remote sensing imagery is also useful in a variety of spatio-temporal applications.
However, increasing spatial, spectral, and temporal resolutions invalidate several assumptions made by traditional classification methods. In this thesis we addressed four specific problems, namely, small training samples, multisource data, aggregate classes, and spatial autocorrelation. We developed a novel semi-supervised learning algorithm to address the small-training-sample problem. A common assumption made in previous work is that the labeled and unlabeled training samples are drawn from the same mixture model. However, in practice we observed that the numbers of mixture components for labeled and unlabeled training samples differ significantly. Our adaptive semi-supervised algorithm overcomes this important limitation by eliminating unlabeled samples from additional components through a matching process. Multisource data classification is addressed through a combination of knowledge-based and semi-supervised approaches. We solved the aggregate class classification problem by relaxing the unimodal assumption. We developed a novel semi-supervised algorithm to address the spatial autocorrelation problem. Experimental evaluation on remote sensing imagery showed the efficacy of our novel methods over conventional approaches. Together, our research delivered significant improvements in thematic information extraction from remote sensing imagery.

Item Mining dynamic relationships from spatio-temporal datasets: an application to brain fMRI data (2014-05) Atluri, Gowtham

Spatio-temporal datasets are being widely collected in several domains such as climate science, neuroscience, sociology, and transportation. These data sets offer tremendous opportunities to address imminent problems facing our society such as climate change, dementia, traffic congestion, and crime. One example of a spatio-temporal dataset, and the focus of this dissertation, is Functional Magnetic Resonance Imaging (fMRI) data.
fMRI captures the activity at all locations in the brain at regular time intervals. Using this data one can investigate the processes in the brain that relate to human psychological functions such as cognition and decision making, or physiological functions such as sensory perception and motor skills. Above all, one can advance the diagnosis and treatment procedures for mental disorders. The focus of this thesis is to study dynamic relationships between brain regions using fMRI data. Existing work in neuroscience has predominantly treated the relationships among brain regions as stationary, but there is growing evidence in this community that the relationships between brain regions are transient. In the time series data mining community, transient relationships have been studied and shown to be useful for various tasks such as clustering and classification of time series data. In this work we focused on discovering combinations of brain regions that exhibit high similarity in their activity time series in small intervals. We proposed an efficient approach that can discover all such combinations exhaustively, and demonstrated its effectiveness on synthetic and real-world data sets. We applied our approach to fMRI data collected in different settings on different groups of people and studied the reliability and replicability of the combinations we discover. Reliability is the degree to which a combination discovered using fMRI scans from a population can be found again using a different set of scans on the same population. Replicability is the degree to which a combination discovered using scans from one set of subjects can be discovered again using scans from a different set of subjects. These two factors reflect the generality of the combinations we discover. Our results suggest that the combinations we discover are indeed reliable and replicable.
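The core operation behind such transient-similarity analysis can be sketched as a sliding-window correlation between two time series; the window length and the 0.9 threshold are illustrative choices, not the thesis's parameters.

```python
import math

def windowed_corr(x, y, w):
    """Pearson correlation of x and y over every length-w window."""
    out = []
    for i in range(len(x) - w + 1):
        xs, ys = x[i:i + w], y[i:i + w]
        mx, my = sum(xs) / w, sum(ys) / w
        cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        vx = sum((a - mx) ** 2 for a in xs)
        vy = sum((b - my) ** 2 for b in ys)
        out.append(cov / math.sqrt(vx * vy) if vx and vy else 0.0)
    return out

# Two series that track each other early on, then diverge.
x = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
high = [i for i, c in enumerate(windowed_corr(x, y, 4)) if c > 0.9]
print(high)  # [0, 1]: the transient interval of high similarity
```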
This indicates the validity of the combinations and suggests that underlying neuronal principles drive them. We also investigated the utility of the combinations in studying differences between healthy and schizophrenic subjects. Existing work in estimating transient relationships among time series typically uses sliding time windows of a fixed length that are shifted from one end to the other using a fixed step size. This approach does not directly identify the intervals in which a pair of time series exhibit similarity. We proposed another computational approach to discover the time intervals where a given pair of time series are highly similar, and showed on synthetic datasets that it is both efficient and effective. Using this approach we provided a characterization of the transient nature of a relationship between time series and showed its utility in identifying task-related transient connectivity in fMRI data collected while a subject is resting and while involved in a task. In summary, the computational approaches proposed in this thesis advance the state of the art in time series data mining, and the extensive evaluations performed on multiple fMRI datasets demonstrate the validity of the findings and provide novel hypotheses that can be systematically studied to advance the state of the art in neuroscience.

Item Nurturing tagging communities (2009-03) Sen, Shilad Wieland

Member contributions power many online communities. Users have uploaded billions of images to flickr, bookmarked millions of pages on del.icio.us, and authored millions of encyclopedia articles at Wikipedia. Tags --- member-contributed words or phrases that describe items --- have emerged as a powerful method for searching, organizing, and making sense of these vast corpora. In this thesis we explore the dynamics, challenges, and possibilities of tagging systems.
We study the way in which factors influencing an individual user's choice of tags can affect the evolution of community tags as a whole. Like other community-maintained systems, tagging systems can suffer from low-quality contributions, so we study interfaces and algorithms that can differentiate between low-quality and high-quality tags. Finally, we explore tagommenders, tag-based recommendation algorithms that combine the flexibility of tags with the automation of recommender systems. We base our explorations on tagging activity in the MovieLens movie recommendation system, analyzing tagging behavior, user studies, and surveys covering 97,000 tags and 3,600 users. Our results provide insight into the dynamics of existing tagging communities, and suggest mechanisms that address challenges of, and provide extensions to, tagging systems.

Item Predictive modeling using dimensionality reduction and dependency structures (2011-07) Agovic, Amrudin

As a result of recent technological advances, the availability of collected high-dimensional data has exploded in various fields such as text mining, computational biology, health care and climate sciences. While modeling such data, two problems are frequently faced. High-dimensional data is inherently difficult to deal with; the challenges associated with modeling it are commonly referred to as the "curse of dimensionality." As the number of dimensions increases, the number of data points necessary to learn a model increases exponentially. A second and even more difficult problem arises when the observed data exhibits intricate dependencies which cannot be neglected. The assumption that observations are independently and identically distributed (i.i.d.) is very widely used in Machine Learning and Data Mining. Moving away from this simplifying assumption, with the goal of modeling more intricate dependencies, is a challenge and the main focus of this thesis.
In dealing with high-dimensional data, dimensionality reduction methods have proven very useful. Successful applications of non-probabilistic approaches include Anomaly Detection, Face Detection, Pose Estimation, and Clustering, while probabilistic approaches have been used in domains such as Visualization, Image Retrieval, and Topic Modeling. When it comes to modeling intricate dependencies, the i.i.d. assumption is seldom abandoned, and as a result of this simplifying assumption relevant dependencies tend to be broken. The goal of this work is to address the challenges of dealing with high-dimensional data while capturing intricate dependencies in the context of predictive modeling. In particular, we consider concepts from both non-probabilistic and probabilistic dimensionality reduction approaches.

Item Toward Automating and Systematizing the Use of Domain Knowledge in Feature Selection (2015-08) Groves, William

Constructing prediction models for real-world domains often involves practical complexities that must be addressed to achieve good prediction results. Often, there are too many sources of data (features). Limiting the set of features in the prediction model is essential for good performance, but prediction accuracy may be degraded by the inadvertent removal of relevant features. The problem is even more acute in situations where the number of training instances is limited, as limited sample size and domain complexity are often attributes of real-world problems. This thesis explores the practical challenges of building regression models in large multivariate time-series domains with known relationships between variables. Further, we explore the conventional wisdom related to preparing datasets for model calibration in machine learning, and discuss best practices for learning time-varying concepts from data. The core contribution of this work is a novel wrapper-based feature selection framework called Developer-Guided Feature Selection (DGFS).
It systematically incorporates domain knowledge for domains characterized by a large number of observable features. The observable features may be related to each other by logical, temporal, or spatial relationships, some of which are known to the model developer a priori. The approach relies on limited domain-specific knowledge but can replace or improve upon more elaborate domain-specific models and fully automated feature selection for many applications. As a wrapper-based approach, DGFS can augment existing multivariate techniques used in high-dimensional domains to produce improved modeling results, particularly in situations where the volume of training data is limited. We demonstrate the viability of our method in several complex domains (natural and synthetic) that have significant temporal aspects and many observable features.

Item User classification in Online communities. (2012-08) Pal, Aditya
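The wrapper idea underlying DGFS (described in the Groves item above) can be sketched as a greedy forward-selection loop that scores candidate feature subsets with the wrapped model. This is a generic sketch only: the scoring function is a hypothetical stand-in, and the thesis's domain-knowledge guidance is not shown.

```python
def forward_select(features, score, max_k):
    """Greedily add the feature whose inclusion most improves the wrapped
    model's score; stop when no candidate improves it."""
    selected, best = [], score([])
    while len(selected) < max_k:
        candidates = [f for f in features if f not in selected]
        gains = {f: score(selected + [f]) for f in candidates}
        f_best = max(gains, key=gains.get)
        if gains[f_best] <= best:
            break  # no candidate improves the wrapped model's score
        selected.append(f_best)
        best = gains[f_best]
    return selected, best

# Hypothetical score: the model "works" only when both 'x' and 'y' are
# present, with a small per-feature complexity penalty.
def toy_score(feats):
    return 0.5 + 0.2 * ('x' in feats) + 0.2 * ('y' in feats) - 0.05 * len(feats)

sel, sc = forward_select(['x', 'y', 'z'], toy_score, max_k=3)
assert set(sel) == {'x', 'y'}  # 'z' is correctly rejected as unhelpful
```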