Browsing by Author "Chandola, Varun"
Now showing 1 - 11 of 11
Results Per Page
Sort Options
Item A Comparative Evaluation of Anomaly Detection Techniques for Sequence Data(2008-07-07) Chandola, Varun; Mithal, Varun; Kumar, VipinAnomaly detection has traditionally dealt with record or transaction type data sets. But in many real domains, data naturally occurs as sequences, and therefore the desire of studying anomaly detection techniques in sequential data sets. The problem of detecting anomalies in sequence data sets is related to but different from the traditional anomaly detection problem, because the nature of data and anomalies are different than those found in record data sets. While there are many surveys and comparative evaluations for traditional anomaly detection, similar studies are not done for sequence anomaly detection. We investigate a broad spectrum of anomaly detection techniques for symbolic sequences, proposed in diverse application domains. Our hypothesis is that symbolic sequences from different domains have distinct characteristics in terms of the nature of sequences as well as the nature of anomalies which makes it important to investigate how different techniques behave for different types of sequence data. Such a study is critical to understand the relative strengths and weaknesses of different techniques. Our paper is one such attempt where we have comparatively evaluated 7 anomaly detection techniques on 10 public data sets, collected from three diverse application domains. To gain further understanding in the performance of the techniques, we present a novel way to generate sequence data with desired characteristics. The results on the artificially generated data sets help us in experimentally verifying our hypothesis regarding different techniques.Item A Multi-Step Framework for Detecting Attack Scenarios(2006-02-21) Shaneck, Mark; Chandola, Varun; Liu, Haiyang; Choi, Changho; Simon, Gyorgy; Eilertson, Eric; Kim, Yongdae; Zhang, Zhi-Li; Srivastava, Jaideep; Kumar, VipinWith growing dependence upon interconnected networks, defending these networks against intrusions is becoming increasingly important. In the case of attacks that are composed of multiple steps, detecting the entire attack scenario is of vital importance. In this paper, we propose an analysis framework that is able to detect these scenarios with little predefined information. The core of the system is the decomposition of the analysis into two steps: first detecting a few events in the attack with high confidence, and second, expanding from these events to determine the remainder of the events in the scenario. Our experiments show that we can accurately identify the majority of the steps contained within the attack scenario with relatively few false positives. Our framework can handle sophisticated attacks that are highly distributed, try to avoid standard pre-defined attack patterns, use cover traffic or "noisy" attacks to distract analysts and draw attention away from the true attack, and attempt to avoid detection by signature-based schemes through the use of novel exploits or mutation engines.Item Anomaly Detection for Discrete Sequences: A Survey(2009-05-22) Chandola, Varun; Banerjee, Arindam; Kumar, VipinThis survey attempts to provide a comprehensive and structured overview of the existing research for the problem of detecting anomalies in discrete sequences. The aim is to provide a global understanding of the sequence anomaly detection problem and how techniques proposed for different domains relate to each other. Our specific contributions are as follows: We identify three distinct formulations of the anomaly detection problem, and review techniques from many disparate and disconnected domains that address each of these formulations. Within each problem formulation, we group techniques into categories based on the nature of the underlying algorithm. For each category, we provide a basic anomaly detection technique, and show how the existing techniques are variants of the basic technique. This approach shows how different techniques within a category are related or different from each other. Our categorization reveals new variants and combinations that have not been investigated before for anomaly detection. We also provide a discussion of relative strengths and weaknesses of different techniques. We show how techniques developed for one problem formulation can be adapted to solve a different formulation; thereby providing several novel adaptations to solve the different problem formulations. We highlight the applicability of the techniques that handle discrete sequences to other related areas such as online anomaly detection and time series anomaly detection.Item Anomaly detection for symbolic sequences and time series data(2009-09) Chandola, VarunThis thesis deals with the problem of anomaly detection for sequence data. Anomaly detection has been a widely researched problem in several application domains such as system health management, intrusion detection, health-care, bio-informatics, fraud detection, and mechanical fault detection. Traditional anomaly detection techniques analyze each data instance (as a univariate or multivariate record) independently, and ignore the sequential aspect of the data. Often, anomalies in sequences can be detected only by analyzing data instances together as a sequence, and hence cannot detected by traditional anomaly detection techniques. The problem of anomaly detection for sequence data is a rich area of research because of two main reasons. First, sequences can be of different types, e.g., symbolic sequences, time series data, etc., and each type of sequence poses unique set of problems. Second, anomalies in sequences can be defined in multiple ways and hence there are different problem formulations. In this thesis we focus on solving one particular problem formulation called semi-supervised anomaly detection. We study the problem separately for symbolic sequences, univariate time series data, and multivariate time series data. The state of art on anomaly detection for sequences is limited and fragmented across application domains. For symbolic sequences, several techniques have been proposed within specific domains, but it is not well-understood as to how a technique developed for one domain would perform in a completely different domain. For univariate time series data, limited techniques exist, and are only evaluated for specific domains, while for multivariate time series data, anomaly detection research is relatively untouched. This thesis has two key goals. First goal is to develop novel anomaly detection techniques for different types of sequences which perform better than existing techniques across a variety of application domains. The second goal is to identify the best anomaly detection technique for a given application domain. By realizing the first goal we develop a suite of anomaly detection techniques for a domain scientist to choose from, while the second goal will help the scientist to choose the technique best suited for the task. To achieve the first goal we develop several novel anomaly detection techniques for univariate symbolic sequences, univariate time series data, and multivariate time series data. We provide extensive experimental evaluation of the proposed techniques on data sets collected across diverse domains and generated from data generators, also developed as part of this thesis. We show how the proposed techniques can be used to detect anomalies which translate to critical events in domains such as aircraft safety, intrusion detection, and patient health management. The techniques proposed in this thesis are shown to outperform existing techniques on many data sets. The technique proposed for multivariate time series data is one of the very first anomaly detection technique that can detect complex anomalies in such data. To achieve the second goal, we study the relationship between anomaly detection techniques and the nature of the data on which they are applied. A novel analysis framework, Reference Based Analysis (RBA), is proposed that can map a given data set (of any type) into a multivariate continuous space with respect to a reference data set. We apply the RBA framework to not only visualize and understand complex data types, such as multivariate categorical data and symbolic sequence data, but also to extract data driven features from symbolic sequences, which when used with traditional anomaly detection techniques are shown to consistently outperform the state of art anomaly detection techniques for these complex data types. Two novel techniques for symbolic sequences are proposed using the RBA framework which perform better than the best technique for each different data set.Item Anomaly Detection: A Survey(2007-08-15) Chandola, Varun; Banerjee, Arindam; Kumar, VipinAnomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.Item Detecting Anomalies in a Time Series Database(2009-02-05) Chandola, Varun; Cheboli, Deepthi; Kumar, VipinWe present a comprehensive evaluation of a large number of semi-supervised anomaly detection techniques for time series data. Some of these are existing techniques and some are adaptations that have never been tried before. For example, we adapt the window based discord detection technique to solve this problem. We also investigate several techniques that detect anomalies in discrete sequences, by discretizing the time series data. We evaluate these techniques on a large variety of data sets obtained from a broad spectrum of application domains. The data sets have different characteristics in terms of the nature of normal time series and the nature of anomalous time series. We evaluate the techniques on different metrics, such as accuracy in detecting the anomalous time series, sensitivity to parameters, and computational complexity, and provide useful insights regarding the effectiveness of different techniques based on the experimental evaluation.Item MINDS: Architecture & Design(2006-07-14) Chandola, Varun; Eilertson, Eric; Ertoz, Levent; Simon, Gyorgy; Kumar, VipinThis chapter provides an overview of the Minnesota Intrusion Detection System (MINDS), which uses a suite of data mining based algorithms to address different aspects of cyber security. The various components of MINDS such as the scan detector, anomaly detector and the profiling module detect different types of attacks and intrusions on a computer network. The scan detector aims at detecting scans which are the precursors to any network attack. The anomaly detection algorithm is very effective in detecting behavioral anomalies in the network traffic which typically translate to malicious activities such as denial-of-service (DoS) traffic, worms, policy violations and inside abuse. The profiling module helps a network analyst to understand the characteristics of the network traffic and detect any deviations from the normal profile. Our analysis shows that the intrusions detected by MINDS are complementary to those of traditional signature based systems, such as SNORT, which implies that they both can be combined to increase overall attack coverage. MINDS has shown great operational success in detecting network intrusions in two live deployments at the University of Minnesota and as a part of the Interrogator architecture at the US Army Research Labs Center for Intrusion Monitoring and Protection (ARL-CIMP).Item Similarity Measures for Categorical Data--A Comparative Study(2007-10-15) Chandola, Varun; Boriah, Shyam; Kumar, VipinMeasuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates others for all types of problems, some measures are able to have consistently high performance.Item Summarization - Compressing Data into an Informative Representation Report(2005-06-08) Chandola, Varun; Kumar, VipinSummarization is an important problem in many domains involving large datasets. Summarization can be essentially viewed as transformation of data into a concise yet meaningful representation which could be used for efficient storage or manual inspection. In this paper, we formulate the problem of summarization of a large dataset of transactions as an optimization problem involving two objective functions - compaction gain and information loss. We propose metrics to characterize the output of any summarization algorithm. We propose data mining techniques to obtain a summary for a given set of transactions while optimizing these two objective functions. We illustrate one application of summarization in the field of network data where we show how our technique can be effectively used to summarize network traffic into a meaningful representation. We first present a direct application of a standard clustering scheme to generate summaries. We then show how this could be significantly improved by using a multi-step approach which involves generating candidate summaries for a dataset using association analysis and then choosing a subset of these candidates as the summary with the desired compaction and information content. We present results of experiments conducted on real and artificial datasets to demonstrate the effectiveness of our techniques.Item Understanding Anomaly Detection Techniques for Symbolic Sequences(2009-01-05) Chandola, Varun; Mithal, Varun; Kumar, VipinWe present a comparative evaluation of a large number of anomaly detection techniques on a variety of publicly available as well as artificially generated data sets. Many of these are existing techniques while some are slight variants and/or adaptations of traditional anomaly detection techniques to sequence data. The specific contributions of this paper are as follows: (i). This evaluation facilitates understanding of the relative strengths and weaknesses of different techniques. Through careful experimentation, we illustrate that the performance of different techniques is dependent on the nature of sequences, and the nature of anomalies in the sequences. No one technique outperforms all others. For most techniques we also identify some data sets on which they perform very well, and some on which they perform poorly. (ii). We investigate variants that have not been tried before. For example, we evaluate a k-nearest neighbor based technique that performs better than a clustering based technique that was proposed for sequences. Also, we propose FSA-z, a variant of an existing Finite State Automaton (FSA) based technique, which performs consistently superior to the original FSA based technique. (iii). We propose a novel way of generating artificial sequence data sets to evaluate anomaly detection techniques. (iv). We characterize the nature of normal and anomalous test sequences, and associate the performance of each technique to one or more of such characteristics.Item Understanding Categorical Similarity Measures for Outlier Detection(2008-03-04) Chandola, Varun; Boriah, Shyam; Kumar, VipinCategorical attributes are present is many data sets that are analyzed using KDD techniques. A recent empirical study of 14 different data driven categorical similarity measures in the context of outlier detection showed that these measures have widely different performances when applied to several different publicly available data sets. As a next step in understanding the relation between the performance of a similarity measure and the nature of the data, we present an analysis framework to help give insights as to which similarity measure is better suited for what type of data. In this paper we present a framework for modeling categorical data sets with a desired set of characteristics. We also propose a set of separability statistics for a categorical data set that can be used to understand the performance of a similarity measure for outlier detection. In addition, we present three techniques to estimate the proposed separability statistics from a given categorical data set. We experimentally evaluate the different similarity measures in the context of outlier detection, and show how the performance of a similarity measure is related to the various data characteristics.