Browsing by Author "Kuramochi, Michihiro"
Now showing 1 - 11 of 11
- Results Per Page
- Sort Options
Item An Efficient Algorithm for Discovering Frequent Subgraphs(2002-06-25) Kuramochi, Michihiro; Karypis, GeorgeOver the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are being increasingly applied to non-traditional domains, existing frequent pattern discovery approach cannot be used. This is because the transaction framework that isassumed by these algorithms cannot be used to effectively model the datasets in these domains. An alternate way of modeling the objects in these datasets is to represent them using graphs. Within that model, the problem of finding frequent patterns becomes that of discovering subgraphs that occur frequently over the entire set of graphs. In thispaper we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph databases. We experimentally evaluate the performance of FSG using a variety of real and synthetic datasets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in datasets containing over 100,000 graph transactions and scales linearly with respect to the size of the database.Item Automated Approaches for Classifying Structures(2002-06-26) Deshpande, Mukund; Kuramochi, Michihiro; Karypis, GeorgeIn this paper we study the problem of classifying chemical compound datasets. We present an algorithm that first mines the chemical compound dataset to discover discriminating sub-structures; these discriminating sub-structures are used as features to build a powerful classifier. The advantage of our classification technique is that it requires very little domain knowledge and can easily handle large chemical datasets. We evaluated the performance of our classifier on two widely available chemical compound datasets and have found it to give good results.Item Discovering Frequent Geometric Subgraphs(2004-10-21) Kuramochi, Michihiro; Karypis, GeorgeData mining-based analysis methods are increasingly being applied to datasets derived from science and engineering domains that model various physical phenomena and objects. In many of these datasets, a key requirement for their effective analysis is the ability to capture the relational and geometric characteristics of the underlying entities and objects. Geometric graphs, by modeling the various physical entities and their relationships with vertices and edges, provide a natural method to represent such datasets. In this paper we present gFSG, a computationally efficient algorithm for finding frequent patterns corresponding to geometric subgraphs in a large collection of geometric graphs. gFSG is able to discover geometric subgraphs that can be rotation, scaling, and translation invariant, and it can accommodate inherent errors on the coordinates of the vertices. We evaluated its performance using a large database of over 20,000 chemical structures, and our results show that it requires relatively little time, can accommodate low support values, and scales linearly with the number of transactions.Item Discovering Geometric Frequent Subgraphs(2002-06-17) Kuramochi, Michihiro; Karypis, GeorgeAs data mining techniques are being increasingly applied tonon-traditional domains, existing approaches for finding frequent itemsets cannot be used as they cannot model the requirement of these domains. An alternate way of modeling the objects in these data sets, is to use a graph to model the database objects. Within that model, the problem of finding frequent patterns becomes that of discoveringsubgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm for finding frequent geometric subgraphs in a large collection of geometric graphs. Our algorithm is able to discover geometric subgraphs that can be translation, rotation, and scaling invariant, and it can accommodate inherent errors on the coordinates of the vertices. We evaluated the performance of the algorithm using a large database of over 20,000 real two-dimensional chemical structures, and our experimental results show that our algorithms requires relatively little time, can accommodate low support values, and scales linearly on the number of transactions.Item Finding Frequent Patterns in a Large Sparse Graph(2003-09-25) Kuramochi, Michihiro; Karypis, GeorgeThis paper presents two algorithms based on the horizontal and vertical pattern discovery paradigms that find the connected subgraphs that have a sufficient number of edge-disjoint embeddings in a single large undirected labeled sparse graph. These algorithms use three different methods to determine the number of the edge-disjoint embeddings of a subgraph that are based on approximate and exact maximum independent set computations and use it to prune infrequentsubgraphs. Experimental evaluation on real datasets from various domains show that both algorithms achieve good performance, scale well to sparse input graphs with more than 100,000 vertices and around 200,000 edges, and significantly outperform previously developed algorithms.Item Frequent Sub-Structure-Based Approaches for Classifying Chemical Compounds(2003-03-10) Deshpande, Mukund; Kuramochi, Michihiro; Karypis, GeorgeIn this paper we study the problem of classifying chemical compound datasets. We present a sub-structure-based classification algorithm that decouples the sub-structure discovery process from the classification model constructionand uses frequent subgraph discovery algorithms to find all topological and geometric sub-structures present in the dataset. The advantage of our approach is that during classification model construction, all relevant sub-structures are available allowing the classifier to intelligently select the most discriminating ones. The computational scalability is ensured by the use ofhighly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Our experimental evaluation on eight different classification problems shows that our approach is computationally scalable and outperforms existing schemes by 10% to 35%, on the average.Item Frequent Subgraph Discovery(2001-07-01) Kuramochi, Michihiro; Karypis, GeorgeOver the years, frequent itemset discovery algorithms have been used to solve various interesting problems. As data mining techniques are being increasingly applied to non-traditional domains, existing approaches for finding frequent itemsets cannot be used as they cannot model the requirement of these domains. An alternate way of modelingthe objects in these data sets, is to use a graph to model the database objects. Within that model, the problem of finding frequent patterns becomes that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm for finding all frequent subgraphs in large graph databases. We evaluated the performance of the algorithm by experiments with synthetic datasets as well as a chemical compound dataset. The empirical results show that our algorithm scales linearly with the number of input transactions and it is able to discover frequent subgraphs from a set of graph transactions reasonably fast, even though we have to deal with computationally hard problems such as canonical labeling of graphs and subgraph isomorphism which are not necessary for traditional frequent itemset discovery.Item Gene Classification using Expression Profiles: A Feasibility Study(2001-07-23) Kuramochi, Michihiro; Karypis, GeorgeAs various genome sequencing projects have already been completed or are near completion, genome researchers are shifting their focus from structural genomics to functional genomics. Functional genomics represents the next phase, that expands the biological investigation to studying the functionality of genes of a single organism as well asstudying and correlating the functionality of genes across many different organisms. Recently developed methods for monitoring genome-wide mRNA expression changes hold the promise of allowing us to inexpensively gain insights into the function of unknown genes. In this paper we focus on evaluating the feasibility of using supervised machine learning methods for determining the function of genes basedsolely on their expression profiles. We experimentally evaluate the performance of traditional classification algorithms such as support vector machines and k-nearest neighbors on the yeast genome, and present new approaches for classification that improve the overall recall with moderate reductions in precision. Our experiments showthat the accuracies achieved for different classes variesdramatically. In analyzing these results we show that the achieved accuracy is highly dependent on whether or not the genes of that class were significantly active during the various experimental conditions, suggesting that gene expression profiles can become a viable alternative to sequence similarity searches provided that the genes are observed under a wide range of experimental conditions.Item GREW - A Scalable Frequent Subgraph Discovery Algorithm(2004-06-22) Kuramochi, Michihiro; Karypis, GeorgeExisting algorithms that mine graph datasets to discover patterns corresponding to frequently occurring subgraphs can operate efficiently on graphs that are sparse, contain a large number of relatively small connected components, have vertices with low and bounded degrees, and contain well-labeled vertices and edges. However, there are a number of applications that lead to graphs that do not share these characteristics, for which these algorithms highly become unscalable. In this paper we propose a heuristic algorithm called GREW to overcome the limitations of existing complete or heuristic frequent subgraph discovery algorithms. GREW is designed to operate on a large graph and to find patterns corresponding to connected subgraphs that have a large number of vertex-disjoint embeddings. Our experimental evaluation shows that GREW is efficient, can scale to very large graphs, and find non-trivial patterns that cover large portions of the input graph and the lattice of frequent patterns.Item PAFI: A Pattern Finding Toolkit(2003-07-07) Seno, Masakazu; Kuramochi, Michihiro; Karypis, GeorgeN/AItem Promoter Prediction of Prokaryotes(2001-07-23) Kuramochi, Michihiro; Deshpande, Mukund; Karypis, George; Zhang, Qing; Kapur, VivekThe availability of computational methods to identify and define the precise structure and location of promoters in prokaryotic genomes will provide a critical first step towards understanding the mechanisms by which genes are organized and regulated. We examine three different methods for promoter identification, two of which are adopted from related work and the other is a novel approach based on feature extraction. By the results of a set of experiments we evaluated prediction accuracy for identifying promoter regions fromnon-coding regions.