Browsing by Author "Zhao, Ying"

Now showing 1 - 11 of 11

A Polynomial Time Approximation Scheme for Rectlinear Steiner Mininum Tree Construction in the Presence of Obstacles
(2002-04-18) Liu, Jian; Zhao, Ying
One problem in VLSI physical designs is to route multi-terminal nets in the presence of obstacles. This paper presents a polynomial time approximation scheme for construction of a rectilinear minimum tree in the presence of obstacles. Given any m rectanglar obstacles, n nodes andepislon>0, the scheme find a (1+epislon)-approximation to the optimum solution in time of n^(O(1/epsilon)), providing an alternative of previous heuristics.Key words:Rectilinear Steiner Minimum Tree in presence of obstacles, VLSI routing, PTAS, Guillitone cut, approximation algorithm.
Clustering in Life Sciences
(2002-04-22) Zhao, Ying; Karypis, George
Clustering is the task of organizing a set of objects into meaningful groups. These groups can be disjoint, overlapping, or organized in some hierarchical fashion. The key element of clustering is the notion that the discovered groups are meaningful. This definition is intentionally vague, as what constitutes meaningful is to a large extent, application dependent. In some applications this maytranslate to groups in which the pairwise similarity between their objects is maximized, and the pairwise similarity between objects of different groups is minimized. In some other applications this may translate to groups that contain objects that share some keycharacteristics, even though their overall similarity is not the highest. Clustering is an exploratory tool for analyzing large datasets, and has been used extensively in numerous application areas.The primary goal of this chapter is to provide an overview of the various issues involved in clustering large datasets, describe the merits and underlying assumptions of some of the commonly used clustering approaches, and provide insights on how to cluster datasets arising in various areas within life-sciences.
Comparison of Agglomerative and Partitional Document Clustering Algorithms
(2002-04-17) Zhao, Ying; Karypis, George
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters,and in greatly improving the retrieval performance either via cluster-driven dimensionality reduction, term-weighting, or query expansion. This ever-increasing importance of document clustering and the expanded range of its applications led to the development of a number of novel algorithms and new clustering criterion functions, especially in the context of partitional clustering.The focus of this paper is to experimentally evaluate the performance of seven different global criterion functions in the context of agglomerative clustering algorithms and compare the clustering results of agglomerative algorithms and partitional algorithms for each one of the criterion functions. Our experimental evaluation shows that for every criterion function, partitional algorithms always lead to better clustering results than agglomerative algorithms, which suggests that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance.
Criterion Functions for Document Clustering: Experiments and Analysis
(2001-11-29) Zhao, Ying; Karypis, George
In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize this information with the ultimate goal of helping them to find what they are looking for. Fast and high-quality document clustering algorithms play an important role towards this goal as they have been shown to provide both an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters as well as to greatlyimprove the retrieval performance either via cluster-driven dimensionality reduction, term-weighting, or query expansion. This ever-increasing importance of document clustering and the expanded range of its applications led to the development of a number of new and novel algorithms with different complexity-quality trade-offs. Among them, a class of clustering algorithms that have relatively low computational requirements are those that treat the clustering problem as an optimization process which seeks to maximize or minimize a particular {em clustering criterion function} defined over the entire clustering solution. The focus of this paper is to evaluate the performance of different criterion functions for the problem of clustering documents. Our study involves a total of eight different criterion functions, three of which are introduced inthis paper and five that have been proposed in the past. Our evaluation consists ofboth a comprehensive experimental evaluation involving fifteen different datasets, as well as an analysis of the characteristics of the various criterionfunctions and their effect on the clusters they produce. Our experimental results show that there are a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion function lead to the best overall results. Our theoretical analysis of the criterion function shows that their relative performance depends on (i) the degree to which they cancorrectly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
Effective Document Clustering for Large Heterogeneous Law Firm Collections
(2005-04-22) Conrad, Jack G.; Zhao, Ying; Al-Kofahi, Khalid; Karypis, George
Computational resources for research in legal environments have historically implied remote access to large databases of legal documents such as case law, statutes, law reviews and administrative materials. Today, by contrast, there exists enormous growth in lawyers electronic work product within these environments, specifically within law firms. Along with this growth has come the need for accelerated knowledge managementautomated assistance in organizing, analyzing, retrieving and presenting this content in a useful and distributed manner. In cases where a relevant legal taxonomy is available, together with representative labeled data, automated text classification tools can be applied. In the absence of these resources, document clustering offers an alternative approach to organizing collections, and an adjunct to search. To explore this approach further, we have conducted sets of successively more complex clustering experiments using primary and secondary law documents as well as actual law firm data. Tests were run to determine the efficiency and effectiveness of a number of essential clustering functions. After examining the performance of traditional or hard clustering applications, we investigate soft clustering (multiple cluster assignments) as well as hierarchical clustering. We show how these latter clustering approaches are effective, in terms of both internal and external quality measures, and useful to legal researchers. Moreover, such techniques can ultimately assist in the automatic or semi-automatic generation of taxonomies for subsequent use by classification programs.
Evaluation of Hierarchical Clustering Algorithms for Document Datasets
(2002-06-03) Zhao, Ying; Karypis, George
Fast and high-quality document clustering algorithms play animportant role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections. The focus of this paper is to evaluate different hierarchical clustering algorithms and toward this goal we compared variouspartitional and agglomerative approaches. Our experimentalevaluation showed that partitional algorithms always lead tobetter clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance.We also present a new class of clustering algorithms called {em constrained agglomerative algorithms} that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.
Hierarchical Clustering Algorithms for Document Datasets
(2003-06-23) Zhao, Ying; Karypis, George
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, clustering algorithms that build meaningfulhierarchies out of large document collections are ideal tools for their interactive visualization and exploration as they provide data-views that are consistent, predictable, and at different levels of granularity. This paper focuses on document clustering algorithms that build such hierarchical solutions and (i) presents a comprehensive study of partitional and agglomerative algorithms that use different criterion functions and merging schemes, and (ii) presents a new class of clusteringalgorithms called {em constrained agglomerative algorithms}, which combine features from both partitional and agglomerative approaches that allows them to reduce the early-stage errors made by agglomerative methods and hence improve the quality of clustering solutions. The experimental evaluation shows that, contrary to the common belief, partitional algorithms always lead to better solutions than agglomerative algorithms; making them ideal for clustering large document collections due to not only their relatively low computational requirements, butalso higher clustering quality. Furthermore, the constrained agglomerative methods consistently lead to better solutions than agglomerative methods alone and for many cases they outperform partitional methods, as well.
Improve Precategorized Collection Retrieval by Using Supervised Term Weighting Schemes
(2001-12-10) Zhao, Ying; Karypis, George
The emergence of the world-wide-web has led to an increased interest in methods for searching for information. A key characteristic of many of the online document collections is that the documents have predefined category information, for example, the variety of scientific articles accessible via digital libraries (e.g., ACM, IEEE, etc.), medicalarticles, news-wires, and various directories (e.g., Yahoo, OpenDirectory Project, etc.). However, most previous information retrieval systems have not taken the pre-existing category information into account. In this paper, we proposed weight adjustment schemes based upon the category information in the vector-space model, which are able to select the most content specific and discriminating features. Our experimental results on TREC data sets show that the pre-existing category information does provideadditional beneficial information to improve retrieval.The proposed weight adjustment schemes perform better than the vector-space model with the inverse document frequency (IDF) weighting scheme when queries are less specific. The proposed weighting schemes can also benefit retrieval when clusters are used as an approximation to categories.
Prediction of Contact Maps Using Support Vector Machines
(2002-12-09) Zhao, Ying; Karypis, George
Contact map prediction is of great interests for its application in fold recognition and protein 3D structure determination. In particular, we focusd on predicting non-local interactions in this paper. We employed Support Vector Machines (SVMs) as the machine learning tool and incorporated AAindex to extract correlated mutation analysis (CMA) and sequence profile (SP) features.In addition, we evaluated the effectiveness of different features for various fold classes.On average, our predictor achieved an prediction accuracy of $0.2238$ with an improvement over a random predictor of a factor $11.7$, which is better than reported studies. Our study showed that predicted secondary structure features play an important roles for the proteins containing beta structures. Models based on secondary structure features and CMA features produce different sets of predictions. Our study also suggests that models learned separately for different protein fold families may achieve better performance than a unified model.
Soft Clustering Criterion Functions for Partitional Document Clustering
(2004-05-26) Zhao, Ying; Karypis, George
Recently published studies have shown that partitional clustering algorithms that optimize certain criterion functions, which measure key aspects of inter- and intra-cluster similarity, are very effective in producing hard clustering solutions for document datasets and outperform traditional partitional and agglomerative algorithms. In this paper we study the extent to which these criterion functions can be modified to include soft membership functions and whether or not the resulting soft clustering algorithms can further improve the clustering solutions. Specifically, we focus on four of these hard criterion functions, derive their soft-clustering extensions, present a comprehensive experimental evaluation involving twelve different datasets, and analyze their overall characteristics. Our results show that introducing softness into the criterion functions tends to lead to better clustering results for most datasets.
Topic-driven Clustering for Document Datasets
(2005-04-22) Zhao, Ying; Karypis, George
In this paper, we define the problem of topic-driven clustering, which organizes a document collection according to a given set of topics (either from domain experts, or as a requirement satisfying users' needs). We propose three topic-driven schemes that consider the similarity between the document to its topic and the relationship among the documents within the same cluster and from different clusters simultaneously. We present the experimental results of the proposed topic-driven schemes on five datasets. Our experimental results show that the proposed topic-driven schemes are efficient and effective with topic prototypes of different levels of specificity.

University Digital Conservancy

Browse by Author

Browsing by Author "Zhao, Ying"