Browsing by Author "Kumar, Vipin"

Now showing 1 - 20 of 108

A Cautionary Note on Decadal Sea Level Pressure Predictions from GCMs
(2018-02-06) Liess, Stefan; Snyder, Peter K.; Kumar, Arjun; Kumar, Vipin
Decadal prediction of sea level pressure (SLP) plays an important role in regional climate prediction, because it shows changes in atmospheric behavior on time scales that are relevant for policy makers. These changes consist of a combination of externally forced and internally driven climate system characteristics. A comparison of SLP trends in a subset of seven Coupled Model Intercomparison Project (CMIP) phase 5 general circulation models (GCM) during the satellite-era to their CMIP3 counterparts reveals an unrealistically strong forecast skill in CMIP3 models for trend predictions for 2001-2011 when using the 1979-2000 period to train the forecast. Boreal-winter SLP trends over five high-, mid-, and low-latitude zones were calculated over a two-decade initialization period for each ensemble member and then ranked based on their performance relative to observations in all five zones over the same time period. The same method is used to rank the ensemble members during the following decade. In CMIP3, 17 out of 38 ensemble members retain their rank in the 2001-2011 hindcast period and 3 retain the neighboring rank. However, these numbers are much lower in more recent CMIP5 decadal predictions over a similar period with the same number of ensembles. The conclusion to consider the forecast skill in CMIP3 predictions during the 2001-2011 as unrealistic is corroborated by comparisons to earlier periods from the 1960s to the 1980s in both CMIP3 and CMIP5 simulations. Thus, although the 2001-2011 CMIP3 predictions show statistically significant forecast skill, this skill should be treated as a spurious result that is unlikely to be reproduced by newer more accurate GCMs.
A Comparative Evaluation of Anomaly Detection Techniques for Sequence Data
(2008-07-07) Chandola, Varun; Mithal, Varun; Kumar, Vipin
Anomaly detection has traditionally dealt with record or transaction type data sets. But in many real domains, data naturally occurs as sequences, and therefore the desire of studying anomaly detection techniques in sequential data sets. The problem of detecting anomalies in sequence data sets is related to but different from the traditional anomaly detection problem, because the nature of data and anomalies are different than those found in record data sets. While there are many surveys and comparative evaluations for traditional anomaly detection, similar studies are not done for sequence anomaly detection. We investigate a broad spectrum of anomaly detection techniques for symbolic sequences, proposed in diverse application domains. Our hypothesis is that symbolic sequences from different domains have distinct characteristics in terms of the nature of sequences as well as the nature of anomalies which makes it important to investigate how different techniques behave for different types of sequence data. Such a study is critical to understand the relative strengths and weaknesses of different techniques. Our paper is one such attempt where we have comparatively evaluated 7 anomaly detection techniques on 10 public data sets, collected from three diverse application domains. To gain further understanding in the performance of the techniques, we present a novel way to generate sequence data with desired characteristics. The results on the artificially generated data sets help us in experimentally verifying our hypothesis regarding different techniques.
A Comparison of Document Clustering Techniques
(2000-05-23) Steinbach, Michael; Karypis, George; Kumar, Vipin
This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant of K-means, "bisecting" K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data.
A Computationally Efficient and Statistically Powerful Framework for Searching High-order Epistasis with Systematic Pruning and Gene-set Constraints
(2010-06-21) Fang, Gang; Haznadar, Majda; Wang, Wen; Steinbach, Michael; Van Ness, Brian; Kumar, Vipin
This paper has not yet been submitted.
A Data Mining Framework for Forest Fire Mapping
(2012-03-29) Karpatne, Anuj; Chen, Xi; Chamber, Yashu; Mithal, Varun; Lau, Michael; Steinhaeuser, Karsten; Boriah, Shyam; Steinbach, Michael; Kumar, Vipin
Forests are an important natural resource that support economic activity and play a significant role in regulating the climate and the carbon cycle, yet forest ecosystems are increasingly threatened by fires caused by a range of natural and anthropogenic factors. Mapping these fires, which can range in size from less than an acre to hundreds of thousands of acres, is an important task for supporting climate and carbon cycle studies as well as informing forest management. There are two primary approaches to fire mapping: field and aerial-based surveys, which are costly and limited in their extent; and remote sensing-based approaches, which are more cost-effective but pose several interesting methodological and algorithmic challenges. In this paper, we introduce a new framework for mapping forest fires based on satellite observations. Specifically, we develop spatio-temporal data mining methods for Moderate Resolution Imaging Spectroradiometer (MODIS) data to generate a history of forest fires. A systematic comparison with alternate approaches across diverse geographic regions demonstrates that our algorithmic paradigm is able to overcome some of the limitations in both data and methods employed by other prior efforts.
A Framework for Discovering Co-location Patterns in Data Sets with Extended Spatial Objects
(2003-09-22) Xiong, Hui; Shekhar, Shashi; Huang, Yan; Kumar, Vipin; Ma, Xiaobin; Soung Yoo, Jin
Co-location patterns are subsets of spatial features (e.g. freeways, frontage roads) usually located together in geographic space. Recent literature has provided a transaction-free approach to discover co-location patterns over spatial point data sets to avoid potential loss of proximity relationship information in partitioning continuous geographic space into transactions. This paper provides a more general transaction-free approach to mine data sets with extended spatial objects, e.g. line-strings and polygons. Key challenges include modeling of neighborhood and relationships among extended spatial objects as well as controlling of related geometric computation costs. Based on a buffer-based definition of neighborhoods, a new model of finding co-location patterns over extended spatial objects has been proposed. Furthermore, this paper presents two pruning approaches, namely a prevalence-based pruning approach and a geometric filter-and-refine approach. Experimental evaluation with a real data set (the roadmap of Minneapolis and St.~Paul metropolitan area) shows that the geometric filter-and-refine approach can speed up the prevalence-based pruning approach by a factor of 30 to 40. Finally, the extended co-location mining algorithm proposed in this paper has been used to select most challenging field test routes for a novel GPS-based approach to accessing road user charges.
A Hierarchical Clustering Algorithm Using Dynamic Modeling
(1999-03-09) Karypis, George; Han, Euihong; Kumar, Vipin
Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized. Existing clustering algorithms, such as K-means, PAM, CLARANS, DBSCAN, CURE, and ROCK are designed to find clusters that fit some static models. These algorithms can breakdown if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of clusters. Furthermore, most of these algorithms breakdown when the data consists of clusters that are of diverse shapes, densities, and sizes. In this paper, we present a novel hierarchical clustering algorithm called Chameleon that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the inter-connectivity and closeness (proximity) between two clusters are high relative to the internal inter-connectivity of the clusters and closenessof items within the clusters. The merging process using the dynamic model presented in this paper facilitates discovery of natural and homogeneous clusters. The methodology of dynamic modeling of clusters used in Chameleon is applicable to all types of data as long as a similarity matrix can be constructed. We demonstrate the effectiveness of Chameleon in a number of data sets that contain points in 2D space, and contain clusters of different shapes, densities, sizes, noise, and artifacts. Experimental results on these data sets show that Chameleon can discover natural clusters that many existing state-of-the art clustering algorithms fail to find.
A Multi-Step Framework for Detecting Attack Scenarios
(2006-02-21) Shaneck, Mark; Chandola, Varun; Liu, Haiyang; Choi, Changho; Simon, Gyorgy; Eilertson, Eric; Kim, Yongdae; Zhang, Zhi-Li; Srivastava, Jaideep; Kumar, Vipin
With growing dependence upon interconnected networks, defending these networks against intrusions is becoming increasingly important. In the case of attacks that are composed of multiple steps, detecting the entire attack scenario is of vital importance. In this paper, we propose an analysis framework that is able to detect these scenarios with little predefined information. The core of the system is the decomposition of the analysis into two steps: first detecting a few events in the attack with high confidence, and second, expanding from these events to determine the remainder of the events in the scenario. Our experiments show that we can accurately identify the majority of the steps contained within the attack scenario with relatively few false positives. Our framework can handle sophisticated attacks that are highly distributed, try to avoid standard pre-defined attack patterns, use cover traffic or "noisy" attacks to distract analysts and draw attention away from the true attack, and attempt to avoid detection by signature-based schemes through the use of novel exploits or mutation engines.
A New Algorithm for Multi-objective Graph Partitioning
(1999-02-15) Kumar, Vipin; Karypis, George; Schloegel, Kirk
Recently, a number of graph partitioning applications have emerged with addtional requirements that the traditional graph partitioning model alone cannot effectively handle. One such class of problems is those in which multiple objectves, each of which can be modeled as a sum of weights of the edges of a graph, must be simultaneously optimized. This class of problems must be solved utilizing a multi-objective graph partitioning algorithm. In this paper, we describe a new algorithm for computing partitionings for multi-objective graphs. We explain how this scheme is able to handle the class of problems in which the objectives represent similar quantities, as well as, the class of problems in which the objectives represent dissimilar quantitites. We show that our multi-objective graph partitioning algorithm is better able to compute partitionings based on a user-supplied preference vector than partitioning with respect to a single objective only. We also show that by modifyig the input preference vector, the multi-objective graph partitioning algorithm is able to gracefully tradeoff increases in one or more objectives for decreases in ther objectives. This gives the user a fine-tuned control of the tradeoffs among the objectives.
A New Teleconnection : The Australian Southern Oscillation
(2012-09-21) Kumar, Arjun; Liess, Stefan; Kawale, Jaya; Ormsby, Dominick; Steinhaeuser, Karsten; Kumar, Vipin
A possibly new teleconnection has been discovered off the east coast of Australia in the region around Tasman sea and Southern Ocean. Found in pressure anomalies using a novel graph based approach called shared reciprocal nearest neighbors, this dipole appears in reanalysis datasets such as NCEP, JRA, ERA and MERRA. The HadSLP2 observation data shows the new dipole, despite of limited observations in the Tasman Sea. Tests are performed in order to understand the uniqueness of the dipole and its relationship to existing well known phenomena. The dipole index is correlated with known dipole indices such as the SO (Southern Oscillation), AAO (Antarctic Oscillation) with which it shares a marginally higher correlation of less than 0.4 and other northern teleconnections with which it is shown to have a poor relationship. We limit further analysis with only the AAO and SO indices as these are spatially close, have a higher correlation with the new index and tend to influence it in one or more seasons. Seasonal analysis is done to look at the variation in strength as well as its influence on other variables such as TAS (Temperature at Surface), OLR (Outgoing Longwave Radiation), Precipitation etc. We also look at composite maps and do significance tests to determine the significant regions in these maps. We also determine regions that are influenced by the new dipole index alone and are not influenced by other dipoles namely the SO and AAO by looking at difference maps. We discover the dipole at different geopotential heights - 700 hPa, 500 hPa and 50 hPa (Sea Level Pressure is 1013 hPa)- and determine if the dipole is a sea surface phenomenon such as the SO or an upper atmospheric phenomenon such as the AAO. Our tests have shown that we may indeed be looking at a new phenomenon and further tests are being conducted to confirm that.
A Novel Error-Tolerant Frequent Itemset Model for Binary and Real-Valued Data
(2009-10-12) Gupta, Rohit; Rao, Navneet; Kumar, Vipin
Frequent pattern mining has been successfully applied to a broad range of applications, however, it has two major drawbacks, which limits its applicability to several domains. First, as the traditional 'exact' model of frequent pattern mining uses a strict definition of support, it limits the recovery of frequent itemset patterns in real-life data sets where the patterns may be fragmented due to random noise/errors. Second, as traditional frequent pattern mining algorithms works with only binary or boolean attributes, it requires transformation of real-valued attributes to binary attributes, which often results in loss of information. As many of the real-life data sets are both noisy and real-valued in nature, past approaches have tried to independently address these issues and there is no systematic approach that addresses both of these issues together. In this paper, we propose a novel Error-Tolerant Frequent Itemset (ETFI) model for binary as well as real-valued data. We also propose a bottom-up pattern mining algorithm to sequentially discover all ETFIs from both types of data sets. To illustrate the efficacy of our proposed ETFI approach, we use two real-valued S.Cerevisiae microarray gene-expression data sets and evaluate the patterns obtained in terms of their functional coherence as evaluated using the GO-based functional enrichment analysis. Our results clearly demonstrate the importance of directly accounting for errors/noise in the data. Finally, the statistical significance of the discovered ETFIs as estimated by using two randomization tests, reveal that discovered ETFIs are indeed biologically meaningful and are neither obtained by random chance nor capture random structure in the data.
A pattern mining based integrative framework for biomarker discovery
(2012-02-10) Dey, Sanjoy; Atluri, Gowtham; Steinbach, Michael; MacDonald, Angus; Lim, Kelvin; Kumar, Vipin
Recent advancement in high throughput data collection technologies has resulted in the availability of diverse biomedical datasets that capture complementary information pertaining to the biological processes in an organism. Biomarkers that are discovered by integrating these datasets obtained from a case-control studies have the potential to elucidate the biological mechanisms behind complex human diseases. In this paper we define an interaction-type integrative biomarker as one whose features together can explain the disease, but not individually. In this paper, we propose a pattern mining based integrative framework (PAMIN) to discover an interaction-type integrative biomarkers from diverse case control datasets. PAMIN first finds patterns form individual datasets to capture the available information separately and then combines these patterns to find integrated patterns (IPs) consisting of variables from multiple datasets. We further use several interestingness measures to characterize the IPs into specific categories. Using synthetic data we compare the IPs found using our approach with those of CCA and discriminative-CCA (dCCA). Our results indicate that PAMIN can discover interaction type patterns that competing approaches like CCA and discriminative-CCA cannot find. Using real datasets we also show that PAMIN discovers a large number of statistically significant IPs than the competing approaches.
A Study of Time Series Noise Reduction Techniques in the Context of Land Cover Change Detection
(2011-08-12) Chen, Xi; Mithal, Varun; VangalaReddy, Sruthi; Brugere, Ivan; Boriah, Shyam; Kumar, Vipin
Remote sensing data sets frequently suffer from noise due to atmospheric conditions and instrument issues. This noise negatively affects the usability of these data sets and therefore noise reduction techniques are frequently used to reduce the impact of noise. A well-known remote sensing data set, MODIS Enhance Vegetation Index (EVI), measures the amount of vegetation (based on surface reflectance) observed from satellite. This data set has been used for land cover change detection, in both regional-scale and global-scale studies. Many noise reduction techniques have seen proposed in the remote sensing literature but comparative studies to understand relative performance of these techniques are scarce. Furthermore, the existing comparative studies typically evaluate a small number of techniques on a specific geographical region. Therefore, little is known about the global applicability of these techniques. In addition, time series based land cover change detection algorithms are known to be negatively impacted by the presence of noise. This paper investigates the interrelations of regional noise characteristics, change detection algorithms, and noise reduction methods. The methods for noise reduction are applied in three different geographic regions and through comparison we outline the noise characteristics relevant to the performance of land cover change detection.
A Unified Algorithm for Load-Balancing Adaptive Scientific Simulations
(2000-05-22) Schloegel, Kirk; Karypis, George; Kumar, Vipin
Adaptive scientific simulations require that periodic repartitioning occur dynamically throughout the course of the simulation. The computed repartitionings should minimize both the inter-processor communications incurred during the iterative mesh-based computation and the data redistribution costs required to balance the load. Recently developed schemes for computing repartitionings provide the user with only a limited control of the tradeoffs among these objectives. This paper describes a new Unified Repartitioning Algorithm that can gracefully tradeoff one objective for the other dependent upon a user-defined parameter describing the relative costs of these objectives. We show that the Unified Repartitioning Algorithm is able to minimize the precise overheads associated with repartitioning as well as or better than other repartitioning schemes for a variety of problems, regardless of the relative costs of performing inter-processor communication and data redistribution. Our experiments show that the Unified Repartitioning Algorithm is extremely fast and scalable to very large problems.
A Universal Formulation of Sequential Patterns
(1999-05-18) Joshi, Mahesh; Karypis, George; Kumar, Vipin
This report outlines a more general formulation of sequential patterns, which unifies the generalized patterns proposed by Srikant and Agarwal [SA96] and episode discovery approach taken by Manilla et al [MTV97]. We show that just by varying the values of timing constraint parameters and counting methods, our formulation can be made identical to either one of these. Furthermore, our approach defines several other couting methods which could be suitable for various applications. The algorithm usd to discover these universal sequential patterns is based on a modification of the GSP algorithm proposed in [SA96]. Some of these modifications are made to take care of the newly introduced timing constraints and patttern restrictions, whereas some modifications are made for performance reasons. In the end, we present an application, which illustrates the deficiencies of current approaches that can be overcome by the proposed universal formulation
An approach for global monitoring of surface water extent variations using MODIS data
(2016-08-29) Khandelwal, Ankush; Karpatne, Anuj; Marlier, Miriam E.; Kim, Jongyoun; Lettenmaier, Dennis P.; Kumar, Vipin
Freshwater resources are among the most basic requirements of human society. Nonetheless, global information about the space-time variations of the area of freshwater bodies, and the water stored in them, is surprisingly limited. We introduce a MODIS-based algorithm to map the global areal extent of surface water bodies at 500m spatial resolution at nominal eight-day intervals from 2000 to 2015. We demonstrate the algorithm construction and performance for five reservoirs on four continents with different shapes. The algorithm performs well compared to satellite radar altimetry and in situ height measurements, and in comparison with surface area estimates based on higher resolution Landsat data. We further present a summary of our global scale results over 69 reservoirs for which altimetry measurements are available, and show that our surface area estimates match well with relative height variations and show significant improvements over previous estimates. One of the main reasons for these improvements is a novel post-processing technique that makes use of imperfect labels produced by supervised classification approaches on multiple dates to estimate the elevation structure of locations that is used to enhance the quality and completeness of imperfect labels. However, the approach is still challenged in regions with frequent cloud cover, snow and ice coverage, or complicated geometries that require finer spatial resolution remote sensing data. The surface area estimates we describe here are publically available.
An Efficient, Scalable, Parallel Classifer for Data Mining
(1997) Srivastava, Anurag; Singh, Vineet; Han, Eui-Hong; Kumar, Vipin
Classification is an important data mining problem. Recently, there has been significant interest in classification using training datasets that are large enough that they do not fit in main memory and need to be disk-resident. Although training data can be reduced by sampling, it has been shown that it can be advantageous to use the entire training dataset since that can increase accuracy. Most current algorithms are unsui:table for large disk-resident datasets because their space and time complexities (including I/0) are prohibitive. A recent algorithm called SPRINT promises to alleviate some of the data size restrictions. We present a new algorithm called SPEC that provides similar accuracy, reduces I/0, reduces memory requirements, and improves scalability (time and space) on both sequential and parallel computers. We provide some theoretical results as well as experimental results on the IBM SP2.
Anomaly Detection for Discrete Sequences: A Survey
(2009-05-22) Chandola, Varun; Banerjee, Arindam; Kumar, Vipin
This survey attempts to provide a comprehensive and structured overview of the existing research for the problem of detecting anomalies in discrete sequences. The aim is to provide a global understanding of the sequence anomaly detection problem and how techniques proposed for different domains relate to each other. Our specific contributions are as follows: We identify three distinct formulations of the anomaly detection problem, and review techniques from many disparate and disconnected domains that address each of these formulations. Within each problem formulation, we group techniques into categories based on the nature of the underlying algorithm. For each category, we provide a basic anomaly detection technique, and show how the existing techniques are variants of the basic technique. This approach shows how different techniques within a category are related or different from each other. Our categorization reveals new variants and combinations that have not been investigated before for anomaly detection. We also provide a discussion of relative strengths and weaknesses of different techniques. We show how techniques developed for one problem formulation can be adapted to solve a different formulation; thereby providing several novel adaptations to solve the different problem formulations. We highlight the applicability of the techniques that handle discrete sequences to other related areas such as online anomaly detection and time series anomaly detection.
Anomaly Detection: A Survey
(2007-08-15) Chandola, Varun; Banerjee, Arindam; Kumar, Vipin
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Association Analysis for Real-valued Data: Definitions and Application to Microarray Data
(2008-03-03) Pandey, Gaurav; Atluri, Gowtham; Steinbach, Michael; Myers, Chad L.; Kumar, Vipin
The discovery of biclusters, which denote groups of items that show coherent values across a subset of all the transactions in a data set, is an important type of analysis performed on real-valued data sets in several domains, such as biology. Several algorithms have been proposed to find different types of biclusters in such data sets. However, the search schemes used by these algorithms are unable to search the space of all possible biclusters exhaustively. Pattern mining algorithms in association analysis also essentially produce biclusters as their result, since the patterns consist of items that are supported by a subset of all the transactions. However, a major limitation of the numerous techniques developed in association analysis is that they are only able to analyze data sets that are constituted of binary and/or categorical variables, and their application to real-valued data sets often involves some lossy transformation such as discretization or binarization of the attributes. In this paper, we propose a novel association analysis framework for exhaustively and efficiently mining range support patterns from such a data set. On one hand, this framework reduces the loss of information incurred by binarization- and discretization-based approaches, and on the other, it enables the exhaustive discovery of coherent biclusters. We compared the performance of our framework with two standard biclustering algorithms through the evaluation of the functional coherence on patterns/biclusters derived from microarray data. These experiments show that the real-valued patterns discovered by our framework are better enriched by small biologically interesting functional classes. We also demonstrate the complementarity between our framework and the commonly used biclustering algorithm ISA, using specific examples of patterns that are found and functions that are covered by the former but not the latter. The source code and data sets used in this paper are available at http://www.cs.umn.edu/vk/gaurav/rap.

University Digital Conservancy

Browse by Author

Browsing by Author "Kumar, Vipin"