Browsing by Author "Wang, Jianyong"
Now showing 1 - 4 of 4
- Results Per Page
- Sort Options
Item BAMBOO:Accelerating Closed Itemset Mining by Deeply Pushing the Length-Decreasing Support Constraint(2003-09-29) Wang, Jianyong; Karypis, GeorgePrevious study has shown that mining frequent patterns with length-decreasing support constraint is very helpful in removing some uninteresting patterns based on the observation that short patterns will tend to be interesting if they have a high support, whereas long patterns can still be very interesting even if their support is relatively low. However, a large number of non-closed(i.e., redundant) patterns can still not be filtered out by simply applying the length-decreasing support constraint. As a result, a more desirable pattern discovery task could be mining closed patterns under the length-decreasing support constraint. In this paper we study how to push deeply the length-decreasing support constraint into closed itemset mining, which is a particularly challenging problem due to the fact that the downward-closure property cannot be used toprune the search space. Therefore, we have proposed several pruning methods and optimization techniques to enhance the closed itemset mining algorithm, and developed an efficient algorithm, BAMBOO. Extensive performance study based on various length-decreasing support constraints and datasets with different characteristics has shown that BAMBOO not only generates more concise result set, but also runs orders of magnitude faster than several recently developed pattern discovery algorithms. In addition, BAMBOO also shows very good scalability in terms of the database size.Item Efficient Closed Pattern Mining in the Presence of Tough Block Constraints(2003-11-25) Gade, Krishna; Wang, Jianyong; Karypis, GeorgeIn recent years, various constrained frequent pattern mining problem formulations and associated algorithms have been developed that enable the user to specify various itemset-based constraints that better capture the underlying application requirements and characteristics. In this paper we introduce a new class of {em block} constraints that determine the significance of an itemset pattern by considering the dense block that is formed by the pattern's items and its associated set of transactions. Block constraints provide a natural framework by which a number of important problems can be specified and make it possible to solve numerous problems on binary and real-valued datasets. However, developing computationally efficient algorithms to find these block constraints poses a number of challenges as unlike the different itemset-based constraints studied earlier, these block constraints are {em tough} as they are neither anti-monotone, monotone, nor convertible. To overcome this problem, we introduce a new class of pruning methods that can be used to significantly reduce the overall search space and make it possible to develop computationally efficient block constraint mining algorithms. We present an algorithm called cbminer that takes advantage of these pruning methods to develop an algorithm for finding the closed itemsets that satisfy the block constraints. Our extensive performance study shows that cbminer generates more concise result set and can be order(s) of magnitude faster than the traditional frequent closed itemset mining algorithms.Item HARMONY: Efficiently Mining the Best Rules for Classification(2004-09-16) Wang, Jianyong; Karypis, GeorgeMany studies have shown that rule-based classification algorithms perform well in classifying categorical and sparse high-dimensional databases. However, a fundamental limitation with many rule-based classifiers is that they find the classification rules in a coarse-grained manner. They usually use heuristic methods to prune the search space, and select the rules based on the sequential database covering paradigm. Thus, the so-mined rules may not be the globally best rules for some instances in the training database. To make worse, these algorithms fail to fully exploit some more effective search space pruning methods in order to scale to large databases. In this paper we propose a new classifier, HARMONY, which directly mines the final set of classification rules. HARMONY uses an instance-centric rule-generation approach in the sense that it can assure for each training instance, one of the highest-confidence rules covering this instance is included in the result set, which helps a lot in achieving high classification accuracy. By introducing several novel search strategies and pruning methods into the traditional frequent itemset mining framework, HARMONY also has high efficiency and good scalability. Our thorough performance study with some large text and categorical databases has shown that HARMONY outperforms many well-known classifiers in terms of both accuracy and efficiency, and scales well w.r.t. the database size.Item SUMMARY: Efficiently Summarizing Transactions for Clustering(2004-06-22) Wang, Jianyong; Karypis, GeorgeFrequent itemset mining was initially proposed and has been studied extensively in the context of association rule mining. In recent years, several studies have also extended its application to the transaction (or document) classification and clustering. However, most of the frequent-itemset based clustering algorithms need to first mine a large intermediate set of frequent itemsets in order to identify a subset of the most promising ones that can be used for clustering. In this paper, we study how to directly find a subset of high quality frequent itemsets that can be used as a concise summary of the transaction database and to cluster the categorical data. By exploring some properties of the subset of itemsets that we are interested in, we proposed several search space pruning methods and designed an efficient algorithm called SUMMARY. Our empirical results have shown that SUMMARY runs very fast even when the minimum support is extremely low and scales very well with respect to the database size, and surprisingly, as a pure frequent itemset mining algorithm it is very effective in clustering the categorical data and summarizing the dense transaction databases.