Browsing by Author "Shan, Hanhuai"
Item: Bayesian Cluster Ensembles (2008-10-14)
Authors: Wang, Hongjun; Shan, Hanhuai; Banerjee, Arindam

Cluster ensembles provide a framework for combining multiple base clusterings of a dataset to generate a stable and robust consensus clustering. There are important variants of the basic cluster ensemble problem, notably cluster ensembles with missing values, as well as row-distributed or column-distributed cluster ensembles. Existing cluster ensemble algorithms are applicable to only a small subset of these variants. In this paper, we propose Bayesian Cluster Ensembles (BCE), a mixed-membership model for learning cluster ensembles that is applicable to all the primary variants of the problem. We propose two methods for learning a Bayesian cluster ensemble, based respectively on variational approximation and on Gibbs sampling. We compare BCE extensively with several other cluster ensemble algorithms and demonstrate that BCE is not only versatile in its applicability but also mostly outperforms the other algorithms in stability and accuracy.

Item: Bayesian Co-clustering (2008-07-08)
Authors: Shan, Hanhuai; Banerjee, Arindam

In recent years, co-clustering has emerged as a powerful data mining tool for analyzing dyadic data connecting two entities. However, almost all existing co-clustering techniques are partitional, allowing individual rows and columns of a data matrix to belong to only one cluster. Several current applications, such as recommendation systems and market basket analysis, can substantially benefit from a mixed membership of rows and columns. In this paper, we present Bayesian co-clustering (BCC) models, which allow mixed membership in row and column clusters. BCC maintains separate Dirichlet priors over the mixed memberships of rows and of columns, and assumes each observation to be generated by an exponential family distribution corresponding to its row and column clusters. We propose a fast variational algorithm for inference and parameter estimation. The model naturally handles sparse matrices, since inference is based only on the non-missing entries. In addition to finding co-cluster structure in observations, the model outputs a low-dimensional co-embedding and accurately predicts missing values in the original matrix. We demonstrate the efficacy of the model through experiments on both simulated and real data.

Item: Generalized Probabilistic Matrix Factorizations for Collaborative Filtering (2010-09-30)
Authors: Shan, Hanhuai; Banerjee, Arindam

Probabilistic matrix factorization (PMF) has shown great promise in collaborative filtering. In this paper, we consider several variants and generalizations of the PMF framework, inspired by three broad questions: Are the prior distributions used in existing PMF models suitable, or can one get better predictive performance with different priors? Are there suitable extensions to leverage side information for prediction? Are there benefits to taking into account row and column biases, e.g., that a critical user gives low ratings or a popular movie gets high ratings in movie recommendation systems? We develop new families of PMF models to address these questions, along with efficient approximate inference algorithms for learning and prediction. Through extensive experiments on movie recommendation datasets, we illustrate that simpler models directly capturing correlations among latent factors can outperform existing PMF models, that side information can benefit prediction accuracy, and that accounting for row/column biases leads to improvements in predictive performance.
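To make the row/column-bias question in the abstract above concrete, here is a minimal sketch of a Gaussian matrix factorization with user and item bias terms, fit by stochastic gradient descent on the observed entries. It is an illustrative MAP-style analogue, not the paper's models or inference algorithms; the function name, hyperparameters, and initialization are all assumptions.

```python
import numpy as np

def pmf_with_biases(R, mask, rank=10, lam=0.1, lr=0.005, epochs=50, seed=0):
    """R: (n_users, n_items) rating matrix; mask: boolean, True where observed."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = 0.1 * rng.standard_normal((n, rank))   # user latent factors
    V = 0.1 * rng.standard_normal((m, rank))   # item latent factors
    bu = np.zeros(n)                           # row biases (e.g., a critical user)
    bi = np.zeros(m)                           # column biases (e.g., a popular movie)
    mu = R[mask].mean()                        # global mean of observed ratings
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = R[i, j] - (mu + bu[i] + bi[j] + U[i] @ V[j])
            ui = U[i].copy()                   # cache before the in-place update
            # SGD steps; lam plays the role of a Gaussian prior precision (MAP).
            U[i] += lr * (err * V[j] - lam * ui)
            V[j] += lr * (err * ui - lam * V[j])
            bu[i] += lr * (err - lam * bu[i])
            bi[j] += lr * (err - lam * bi[j])
    return mu, bu, bi, U, V
```

A missing entry (i, j) is then predicted as mu + bu[i] + bi[j] + U[i] @ V[j], so the bias terms absorb per-user and per-item rating tendencies before the latent factors model the residual structure.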
Item: Mixed-Membership Naive Bayes Models (2009-01-16)
Authors: Shan, Hanhuai; Banerjee, Arindam

In recent years, mixture models have found widespread use in discovering latent cluster structure in data. A popular special case of finite mixture models is the naive Bayes model, where the probability of a feature vector factorizes over the features for any given component of the mixture. Despite their popularity, naive Bayes models suffer from two important restrictions: first, they have no natural mechanism for handling sparsity, where each data point may have only a few observed features; and second, they do not allow objects to be generated from different latent clusters with varying degrees (i.e., mixed memberships) in the generative process. In this paper, we first introduce marginal naive Bayes (MNB) models, which generalize naive Bayes models to handle sparsity by marginalizing over all missing features. More importantly, we propose mixed-membership naive Bayes (MMNB) models, which generalize (marginal) naive Bayes models to allow mixed memberships in the generative process. MMNB models can be viewed as a natural generalization of latent Dirichlet allocation (LDA) with the ability to handle heterogeneous and possibly sparse feature vectors. We propose two variational inference algorithms to learn MMNB models from data. While the first exactly follows the corresponding ideas for LDA, the second uses far fewer variational parameters, leading to a much faster algorithm with smaller time and space requirements. An application of the same idea in the context of topic modeling leads to a new Fast LDA algorithm. The efficacy of the proposed mixed-membership models and of the fast variational inference algorithms is demonstrated by extensive experiments on a wide variety of datasets.

Item: Probabilistic models for multi-relational data analysis (2012-06)
Author: Shan, Hanhuai

With the widespread application of data mining technologies to real-life problems, there has been an increasing realization that real data are usually multi-relational, capturing a variety of relations among objects in the same or different entities. For example, in movie recommender systems, the movie rating matrix captures the relation between movies and users, the social network captures the relation among users, and the casts of the movies capture the relation between movies and actors/actresses. Multi-relational data analysis on such data includes two important tasks: (1) discovering multi-relational clusters across multiple entities, i.e., multi-relational clustering; and (2) predicting missing entries, i.e., multi-relational missing value prediction. Clustering and missing value prediction give us a better understanding of the data and help us with decision making. For example, clusters of users and movies, together with whether each user cluster likes each movie cluster, provide a high-level overview of movie rating data, and the prediction of missing ratings helps us decide whether to recommend the movies to the corresponding users. Moreover, it is particularly meaningful to perform clustering and missing value prediction in the multi-relational setting, since doing so combines multiple sources of information effectively, which usually outperforms algorithms run on a single source of data alone. We develop probabilistic models for multi-relational data analysis because of their advantage in incorporating prior knowledge from multiple sources through prior distributions, and their modularity in combining multiple models through shared latent variables. Through experiments on a variety of datasets, such as movie recommendation data and ecological data on plant traits, we show that multi-relational clustering and missing value prediction have superior performance compared to algorithms that use a single data source only.
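The "shared latent variables" idea in the thesis abstract above can be illustrated with a small non-probabilistic analogue: a partially observed user-movie rating matrix and a user-user social matrix are factorized jointly, with a single user factor matrix shared between them so that each relation informs the other. This is only a sketch of the coupling idea under simplifying assumptions (squared loss, full-gradient steps, illustrative names and hyperparameters), not the thesis's probabilistic models.

```python
import numpy as np

def collective_factorization(R, R_mask, S, rank=10, lam=0.1, lr=0.01,
                             epochs=200, seed=0):
    """R: (n_users, n_movies) ratings, with any finite placeholder (e.g., 0)
    at missing entries; R_mask: boolean, True where R is observed;
    S: (n_users, n_users) social relation matrix."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = R.shape
    U = 0.1 * rng.standard_normal((n_users, rank))   # user factors, SHARED by both relations
    V = 0.1 * rng.standard_normal((n_movies, rank))  # movie factors (ratings only)
    W = 0.1 * rng.standard_normal((n_users, rank))   # social-context factors (network only)
    for _ in range(epochs):
        E_r = R_mask * (R - U @ V.T)   # rating error, observed entries only
        E_s = S - U @ W.T              # social-relation error
        # Gradient steps on the summed squared losses; U gets signal from BOTH relations.
        gU = E_r @ V + E_s @ W - lam * U
        gV = E_r.T @ U - lam * V
        gW = E_s.T @ U - lam * W
        U += lr * gU
        V += lr * gV
        W += lr * gW
    return U, V, W
```

Because U appears in both reconstruction terms, users with few ratings can still be placed well in the latent space via their social ties, which is the intuition behind combining multiple relations rather than factorizing each matrix alone.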
Item: Probabilistic Tensor Factorization for Tensor Completion (2011-10-28)
Authors: Shan, Hanhuai; Banerjee, Arindam; Natarajan, Ramesh

Multi-way tensor datasets emerge naturally in a variety of domains, such as recommendation systems, bioinformatics, and retail data analysis. The data in these domains usually contain a large number of missing entries. Therefore, many applications in these domains aim at missing value prediction, which boils down to a tensor completion problem. While tensor factorization algorithms can be a potentially powerful approach to tensor completion, most existing methods have the following limitations: first, some tensor factorization algorithms are unsuitable for tensor completion because they cannot work with incomplete tensors; second, deterministic tensor factorization algorithms can only generate point estimates for the missing entries, whereas in some cases it is desirable to obtain multiple-imputation datasets that are more representative of the joint variability of the predicted missing values. We therefore propose probabilistic tensor factorization algorithms, which are naturally applicable to incomplete tensors and provide both point estimates and multiple imputations for the missing entries. In this paper, we focus mainly on applications to retail sales datasets, but the framework and algorithms are applicable to other domains as well. Through extensive experiments on real-world retail sales data, we show that our models are competitive with state-of-the-art algorithms in both prediction accuracy and running time.
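As a concrete illustration of factorization-based tensor completion, the sketch below fits a rank-R CP decomposition to a 3-way tensor using only the observed entries, with an L2 penalty standing in for a Gaussian prior. It produces MAP-style point estimates only; a fully probabilistic treatment, as the abstract describes, would additionally sample the factors to produce multiple imputations. The function name and hyperparameters are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def cp_complete(T, mask, rank=5, lam=0.05, lr=0.01, epochs=100, seed=0):
    """T: (I, J, K) data tensor; mask: boolean, True where T is observed."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = 0.1 * rng.standard_normal((I, rank))
    B = 0.1 * rng.standard_normal((J, rank))
    C = 0.1 * rng.standard_normal((K, rank))
    obs = np.argwhere(mask)                    # indices of observed entries
    for _ in range(epochs):
        for i, j, k in obs:
            # CP model: T[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r]
            err = T[i, j, k] - np.sum(A[i] * B[j] * C[k])
            a, b, c = A[i].copy(), B[j].copy(), C[k].copy()
            # SGD on squared error; lam acts like a Gaussian prior (MAP).
            A[i] += lr * (err * b * c - lam * a)
            B[j] += lr * (err * a * c - lam * b)
            C[k] += lr * (err * a * b - lam * c)
    # Point estimates for every entry, including the missing ones:
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```

Because the loss is summed over observed entries only, the method works directly with incomplete tensors, which is the first limitation of deterministic factorization methods that the abstract calls out.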