Browsing by Author "Banerjee, Arindam"
Now showing 1 - 19 of 19
Item: A Social Query Model for Distributed Search (2008-05-27)
Banerjee, Arindam; Basu, Sugato
Decentralized search by routing queries over a network is fast emerging as an important research problem, with potential applications in social search as well as peer-to-peer networks. In this paper, we introduce a novel Social Query Model (SQM) for decentralized search, which factors in realistic elements such as expertise levels and response rates of nodes, and has the PageRank model and certain Markov Decision Processes as special cases. In the context of the model, we establish the existence of a query routing policy that is simultaneously optimal for all nodes, in that no subset of nodes will jointly have any incentive to use a different local routing policy. For computing the optimal policy, we present an efficient distributed approximation algorithm that is almost linear in the number of edges in the network. Extensive experiments on both simulated random graphs and real small-world networks demonstrate the potential of our model and the effectiveness of the proposed routing algorithm.

Item: A Unified View of Graph-based Semi-Supervised Learning: Label Propagation, Graph-Cuts, and Embeddings (2009-05-12)
Agovic, Amrudin; Banerjee, Arindam
Recent years have seen a growing number of graph-based semi-supervised learning methods. While the literature currently contains several of these methods, their relationships with one another and with other graph-based data analysis algorithms remain unclear. In this paper, we present a unified view of graph-based semi-supervised learning. Our framework unifies three important and seemingly unrelated approaches to semi-supervised learning, viz., label propagation, graph-cuts, and manifold embeddings. We show that most existing label propagation methods solve a special case of a generalized label propagation (GLP) formulation, which is a constrained quadratic program involving a graph Laplacian. Different methods arise simply based on the choice of the Laplacian and the nature of the constraints. Further, we show that semi-supervised graph-cut problems can also be viewed and solved as special cases of the GLP formulation. In addition, we show that semi-supervised non-linear manifold embedding methods also solve variants of the GLP problem, and we propose a novel family of semi-supervised algorithms based on existing embedding methods. Finally, we present a comprehensive empirical performance evaluation of the existing label propagation methods as well as the new ones derived from manifold embedding. The new family of embedding-based label propagation methods is found to be competitive on several datasets.
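
As a concrete illustration of the GLP view above, the following is a minimal sketch of one classic diffusion-style label propagation iteration; the choice of the symmetrically normalized Laplacian and the parameter alpha are illustrative assumptions, not the paper's code.

```python
import numpy as np

def label_propagation(W, Y, alpha=0.99, tol=1e-6, max_iter=1000):
    """Diffusion-style label propagation: iterate
    F <- alpha * S @ F + (1 - alpha) * Y, where S = D^{-1/2} W D^{-1/2}
    and I - S is a normalized graph Laplacian; the fixed point minimizes
    a Laplacian-regularized quadratic objective of the GLP form.
    W: (n, n) symmetric affinity matrix.
    Y: (n, c) one-hot labels for labeled nodes, zero rows otherwise."""
    d = np.maximum(W.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    F = Y.astype(float).copy()
    for _ in range(max_iter):
        F_new = alpha * (S @ F) + (1 - alpha) * Y
        if np.linalg.norm(F_new - F) < tol:
            return F_new.argmax(axis=1)
        F = F_new
    return F.argmax(axis=1)  # predicted class per node
```
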
Item: Anomaly Detection for Discrete Sequences: A Survey (2009-05-22)
Chandola, Varun; Banerjee, Arindam; Kumar, Vipin
This survey attempts to provide a comprehensive and structured overview of the existing research on the problem of detecting anomalies in discrete sequences. The aim is to provide a global understanding of the sequence anomaly detection problem and how techniques proposed for different domains relate to each other. Our specific contributions are as follows: We identify three distinct formulations of the anomaly detection problem, and review techniques from many disparate and disconnected domains that address each of these formulations. Within each problem formulation, we group techniques into categories based on the nature of the underlying algorithm. For each category, we provide a basic anomaly detection technique and show how the existing techniques are variants of it. This approach shows how different techniques within a category are related to or differ from each other. Our categorization reveals new variants and combinations that have not been investigated before for anomaly detection. We also provide a discussion of the relative strengths and weaknesses of different techniques. We show how techniques developed for one problem formulation can be adapted to solve a different formulation, thereby providing several novel adaptations to solve the different problem formulations. We highlight the applicability of the techniques that handle discrete sequences to other related areas, such as online anomaly detection and time series anomaly detection.

Item: Anomaly Detection: A Survey (2007-08-15)
Chandola, Varun; Banerjee, Arindam; Kumar, Vipin
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques, since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
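
For a flavor of one basic technique family covered by both surveys, here is a sketch of window-based scoring for discrete sequences: slide a length-k window over a test sequence and score it by how rarely its windows occur in normal training data. The window length k and the event names are illustrative assumptions.

```python
from collections import Counter

def train_window_model(normal_sequences, k=3):
    """Count all length-k windows (k-grams) seen in normal sequences."""
    counts = Counter()
    for seq in normal_sequences:
        for i in range(len(seq) - k + 1):
            counts[tuple(seq[i:i + k])] += 1
    return counts

def anomaly_score(seq, counts, k=3):
    """Fraction of the test sequence's windows never seen in training;
    higher means more anomalous."""
    windows = [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not windows:
        return 0.0
    unseen = sum(1 for w in windows if counts[w] == 0)
    return unseen / len(windows)

# Example with system-call-like event sequences:
normal = [["open", "read", "write", "close"], ["open", "read", "close"]]
model = train_window_model(normal, k=2)
print(anomaly_score(["open", "exec", "exec", "close"], model, k=2))  # 1.0
```
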
Item: Bayesian Cluster Ensembles (2008-10-14)
Wang, Hongjun; Shan, Hanhuai; Banerjee, Arindam
Cluster ensembles provide a framework for combining multiple base clusterings of a dataset to generate a stable and robust consensus clustering. There are important variants of the basic cluster ensemble problem, notably including cluster ensembles with missing values, as well as row-distributed or column-distributed cluster ensembles. Existing cluster ensemble algorithms are applicable only to a small subset of these variants. In this paper, we propose Bayesian Cluster Ensembles (BCE), a mixed-membership model for learning cluster ensembles that is applicable to all the primary variants of the problem. We propose two methods, based respectively on variational approximation and Gibbs sampling, for learning a Bayesian cluster ensemble. We compare BCE extensively with several other cluster ensemble algorithms, and demonstrate that BCE is not only versatile in terms of its applicability but also mostly outperforms the other algorithms in terms of stability and accuracy.

Item: Bayesian Co-clustering (2008-07-08)
Shan, Hanhuai; Banerjee, Arindam
In recent years, co-clustering has emerged as a powerful data mining tool that can analyze dyadic data connecting two entities. However, almost all existing co-clustering techniques are partitional, and allow individual rows and columns of a data matrix to belong to only one cluster. Several current applications, such as recommendation systems and market basket analysis, can substantially benefit from a mixed membership of rows and columns. In this paper, we present Bayesian co-clustering (BCC) models, which allow mixed membership in row and column clusters. BCC maintains separate Dirichlet priors for rows and columns over the mixed membership, and assumes each observation to be generated by an exponential family distribution corresponding to its row and column clusters. We propose a fast variational algorithm for inference and parameter estimation. The model is designed to naturally handle sparse matrices, as inference is based only on the non-missing entries. In addition to finding co-cluster structure in observations, the model outputs a low-dimensional co-embedding and accurately predicts missing values in the original matrix. We demonstrate the efficacy of the model through experiments on both simulated and real data.

Item: Common Component Analysis for Multiple Covariance Matrices (2010-08-04)
Wang, Huahua; Banerjee, Arindam; Boley, Daniel
We consider the problem of finding a suitable common low-dimensional subspace for accurately representing a given set of covariance matrices. When the set contains only one covariance matrix, the subspace is given by Principal Component Analysis (PCA). For multiple covariance matrices, we term the problem Common Component Analysis (CCA). While CCA can be posed as a tensor decomposition problem, standard approaches to tensor decomposition have two critical issues: (i) tensor decomposition methods are iterative and rely on the initialization, and a bad initialization may lead to poor local optima; (ii) for a given level of approximation error, one does not know how to choose a suitable low dimensionality. In this paper, we present a detailed analysis of CCA that yields an effective initialization and iterative algorithms for the problem. The proposed methodology has provable approximation guarantees w.r.t. the global optimum, and also allows one to choose the dimensionality for a given level of approximation error. We also establish conditions under which the methodology will obtain the global optimum. We illustrate the effectiveness of the proposed method through extensive experiments on synthetic data as well as two real stock market datasets, where major financial events can be visualized in low dimensions.

Item: Conditionally Positive Definite Kernels and Infinitely Divisible Distributions (2008-10-28)
Banerjee, Arindam; Srivastava, Nisheeth
We give a precise characterization of two important classes of conditionally positive definite (CPD) kernels in terms of integral transforms of infinitely divisible distributions. In particular, we show that for any stationary CPD kernel A(x,y) = f(x-y), f is the log-characteristic function of a uniquely determined infinitely divisible distribution; further, for any additive CPD kernel A(x,y) = g(x+y), g is the log-moment generating function of a uniquely determined infinitely divisible distribution. The results strengthen the connections between CPD kernels and infinitely divisible distributions.
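
A compact restatement of the two characterizations, writing the characteristic and moment generating functions in their standard forms (the report's precise regularity conditions are omitted; X denotes an infinitely divisible random vector uniquely determined by the kernel in each case):

```latex
\begin{align*}
  \text{stationary: } A(x,y) &= f(x-y), &
    f(t) &= \log \mathbb{E}\big[e^{\,i\langle t, X\rangle}\big]
    \quad \text{(log-characteristic function)},\\
  \text{additive: } A(x,y) &= g(x+y), &
    g(t) &= \log \mathbb{E}\big[e^{\,\langle t, X\rangle}\big]
    \quad \text{(log-moment generating function)}.
\end{align*}
```
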
Item: Generalized Probabilistic Matrix Factorizations for Collaborative Filtering (2010-09-30)
Shan, Hanhuai; Banerjee, Arindam
Probabilistic matrix factorization (PMF) has shown great promise in collaborative filtering. In this paper, we consider several variants and generalizations of the PMF framework, inspired by three broad questions: Are the prior distributions used in existing PMF models suitable, or can one get better predictive performance with different priors? Are there suitable extensions to leverage side information for prediction? Are there benefits to taking into account row and column biases, e.g., a critical user gives low ratings, and a popular movie gets high ratings in movie recommendation systems? We develop new families of PMF models to address these questions, along with efficient approximate inference algorithms for learning and prediction. Through extensive experiments on movie recommendation datasets, we illustrate that simpler models directly capturing correlations among latent factors can outperform existing PMF models, that side information can benefit prediction accuracy, and that accounting for row/column biases leads to improvements in predictive performance.

Item: If You are Happy and Know It ... Tweet (2012-06-18)
Taheri, Amir Asiaee; Tepper, Mariano; Banerjee, Arindam; Sapiro, Guillermo
Extracting sentiment from Twitter data is one of the fundamental problems in social media analytics. The length constraint of Twitter, an average of about six words per message, renders determining the positive or negative sense of a tweet difficult even for a human judge. In this work we present a general framework for single-tweet (in contrast with batches of tweets) sentiment analysis, which consists of three steps: extracting tweets about a desired target, separating tweets with sentiment, and discriminating positive and negative tweets. We study the performance of a number of classical and new machine learning algorithms for classification in each step. We also show that the intrinsic sparsity of tweets enables us to perform classification with their low-dimensional random projections without losing accuracy. In addition, we present weighted variants of all employed algorithms, which exploit available labeling uncertainty in order to further improve classification accuracy. Finally, we show that aggregating per-tweet sentiment analysis results improves the accuracies and makes our approach a good candidate for batch tweet sentiment analysis.

Item: Jensen-Bregman LogDet Divergence for Efficient Similarity Computations on Positive Definite Tensors (2012-05-02)
Cherian, Anoop; Sra, Suvrit; Banerjee, Arindam
Covariance matrices provide an easy platform for fusing multiple features compactly, and as a result have found immense success in several computer vision applications, including activity recognition, visual surveillance, and diffusion tensor imaging. An important task in all of these applications is to compute the distance between covariance matrices using a (dis)similarity function, for which the natural choice is the Riemannian metric corresponding to the manifold inhabited by these matrices. As this Riemannian manifold is not flat, the dissimilarities should take into account the curvature of the manifold. As a result, such distance computations tend to slow down, especially when the matrix dimensions are large or gradients are required. Further, suitability of the metric for efficient nearest neighbor retrieval is an important requirement in the era of big data analytics. To alleviate these difficulties, this paper proposes a novel dissimilarity measure for covariances, the Jensen-Bregman LogDet Divergence (JBLD). This divergence enjoys several desirable theoretical properties while at the same time being computationally less demanding than standard measures. To address the problem of efficient nearest neighbor retrieval on large covariance datasets, we propose a metric tree framework using k-means clustering on JBLD. We demonstrate the superior performance of JBLD on covariance datasets from several computer vision applications.
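
The divergence itself is simple to state: for symmetric positive definite X and Y, JBLD(X, Y) = logdet((X + Y)/2) - (1/2) logdet(XY), which needs only Cholesky factorizations rather than the eigendecompositions or matrix logarithms required by the Riemannian metric. A minimal numpy sketch:

```python
import numpy as np

def logdet(A):
    """Numerically stable log-determinant of an SPD matrix via Cholesky."""
    L = np.linalg.cholesky(A)
    return 2.0 * np.sum(np.log(np.diag(L)))

def jbld(X, Y):
    """Jensen-Bregman LogDet divergence between SPD matrices X and Y:
    J(X, Y) = logdet((X + Y) / 2) - (1/2) * logdet(X @ Y).
    Symmetric in its arguments; uses logdet(XY) = logdet(X) + logdet(Y)."""
    return logdet((X + Y) / 2.0) - 0.5 * (logdet(X) + logdet(Y))
```
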
Item: MAP Inference on Million Node Graphical Models: KL-divergence based Alternating Directions Method (2012-02-27)
Fu, Qiang; Wang, Huahua; Banerjee, Arindam; Liess, Stefan; Snyder, Peter K.
Motivated by a problem in large scale climate data analysis, we consider the problem of maximum a posteriori (MAP) inference in graphical models with millions of nodes. While progress has been made in recent years, existing MAP inference algorithms are inherently sequential and hence do not scale well. In this paper, we present a parallel MAP inference algorithm called KL-ADM based on two ideas: tree-decomposition of a graph, and the alternating directions method (ADM). However, unlike standard ADM, we use an inexact ADM augmented with a Kullback-Leibler (KL) divergence based regularization. This unusual modification leads to an efficient iterative algorithm while avoiding double loops. We rigorously prove global convergence of KL-ADM. We illustrate the effectiveness of KL-ADM through extensive experiments on large synthetic and real datasets. Applied to real-world precipitation data, it identifies all major droughts of the last century.

Item: Meta Algorithms for Portfolio Selection (2010-09-20)
Das, Puja; Banerjee, Arindam
We consider the problem of sequential portfolio selection in the stock market. There are theoretically well-grounded algorithms for the problem, such as Universal Portfolio (UP), Exponentiated Gradient (EG), and Online Newton Step (ONS). Such algorithms enjoy the property of being universal, i.e., having low regret with respect to the best constant rebalanced portfolio. However, the practical performance of such popular algorithms is sobering compared to heuristics such as Anticor, which have no theoretical guarantees but perform surprisingly well in practice. Motivated by such discrepancies, in this paper we focus on designing meta algorithms for portfolio selection that can leverage the best of both worlds. Such algorithms work with a pool of base algorithms and use online learning to redistribute wealth among the base algorithms. We develop two meta-algorithms: MA_EG, which uses online gradient descent following EG, and MA_ONS, which uses online Newton step following ONS. If one of the base algorithms is universal, it follows that the meta-algorithm is universal. Through extensive experiments on two real stock market datasets, we show that the meta-algorithms are competitive and often better than the best base algorithm, including heuristics, while maintaining the guarantee of being a universal algorithm.
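
For reference, the classical EG update that MA_EG lifts to the meta level reweights each asset multiplicatively by its contribution to the day's growth and renormalizes; in the meta-algorithm, each base algorithm plays the role of a virtual asset. A sketch, where eta is an illustrative learning rate rather than a value from the paper:

```python
import numpy as np

def eg_update(w, x, eta=0.05):
    """One Exponentiated Gradient (EG) portfolio update.
    w: current portfolio weights (nonnegative, summing to 1).
    x: the day's price relatives (closing/opening price ratios)."""
    w_new = w * np.exp(eta * x / np.dot(w, x))
    return w_new / w_new.sum()

# Meta-algorithm view: treat each base algorithm's daily wealth growth
# as the price relative of a virtual asset, and run the same update to
# redistribute wealth among the base algorithms.
```
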
Item: Mixed-Membership Naive Bayes Models (2009-01-16)
Shan, Hanhuai; Banerjee, Arindam
In recent years, mixture models have found widespread usage in discovering latent cluster structure from data. A popular special case of finite mixture models are naive Bayes models, where the probability of a feature vector factorizes over the features for any given component of the mixture. Despite their popularity, naive Bayes models suffer from two important restrictions: first, they do not have a natural mechanism for handling sparsity, where each data point may have only a few observed features; and second, they do not allow objects to be generated from different latent clusters with varying degrees (i.e., mixed memberships) in the generative process. In this paper, we first introduce marginal naive Bayes (MNB) models, which generalize naive Bayes models to handle sparsity by marginalizing over all missing features. More importantly, we propose mixed-membership naive Bayes (MMNB) models, which generalize (marginal) naive Bayes models to allow for mixed memberships in the generative process. MMNB models can be viewed as a natural generalization of latent Dirichlet allocation (LDA) with the ability to handle heterogeneous and possibly sparse feature vectors. We propose two variational inference algorithms to learn MMNB models from data. While the first exactly follows the corresponding ideas for LDA, the second uses far fewer variational parameters, leading to a much faster algorithm with smaller time and space requirements. An application of the same idea in the context of topic modeling leads to a new Fast LDA algorithm. The efficacy of the proposed mixed-membership models and the fast variational inference algorithms is demonstrated by extensive experiments on a wide variety of different datasets.

Item: Online Quadratically Constrained Convex Optimization with Applications to Risk Adjusted Portfolio Selection (2012-02-28)
Das, Puja; Banerjee, Arindam
While online convex optimization has emerged as a powerful large scale optimization approach, much of the existing literature assumes a simple way to project onto a given feasible set. The assumption is often not true, and the projection step usually becomes the key computational bottleneck. Motivated by applications in risk-adjusted portfolio selection, in this paper we consider online quadratically constrained convex optimization problems, where the feasible set involves intersections of ellipsoids. We show that regret guarantees for the online problem can be achieved by solving a suitable quadratically constrained quadratic program (QCQP) at each step, and present an efficient algorithm for solving QCQPs based on the alternating directions method. We then specialize the general framework to risk-adjusted portfolio selection. Through extensive experiments on two real-world stock datasets, our proposed algorithm RAMP is shown to significantly outperform existing approaches at any given risk level, and to match the performance of the best heuristics, which do not accommodate risk constraints.
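
To see why the projection step is a bottleneck, consider even the simplest case of a single centered ellipsoid {w : w^T A w <= c}: the Euclidean projection already requires a one-dimensional search on the KKT multiplier. The sketch below is that standard baseline, not the paper's ADM-based QCQP solver or its RAMP algorithm; the names and tolerances are illustrative.

```python
import numpy as np

def project_onto_ellipsoid(z, A, c, tol=1e-9):
    """Euclidean projection of z onto {w : w^T A w <= c}, with A SPD.
    The KKT conditions give w = (I + lam * A)^{-1} z for some lam >= 0;
    we find lam by bisection so the constraint holds with equality."""
    if z @ A @ z <= c:
        return z.copy()  # already feasible, no projection needed
    evals, Q = np.linalg.eigh(A)
    y = Q.T @ z
    def constraint(lam):  # value of w^T A w at the KKT point for lam
        return np.sum(evals * y**2 / (1.0 + lam * evals)**2)
    lo, hi = 0.0, 1.0
    while constraint(hi) > c:  # grow the bracket until feasible
        hi *= 2.0
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if constraint(mid) > c else (lo, mid)
    lam = 0.5 * (lo + hi)
    return Q @ (y / (1.0 + lam * evals))
```
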
Item: Probabilistic Tensor Factorization for Tensor Completion (2011-10-28)
Shan, Hanhuai; Banerjee, Arindam; Natarajan, Ramesh
Multi-way tensor datasets emerge naturally in a variety of domains, such as recommendation systems, bioinformatics, and retail data analysis. The data in these domains usually contains a large number of missing entries. Therefore, many applications in those domains aim at missing value prediction, which boils down to a tensor completion problem. While tensor factorization algorithms can be a potentially powerful approach to tensor completion, most existing methods have the following limitations: first, some tensor factorization algorithms are unsuitable for tensor completion since they cannot work with incomplete tensors; second, deterministic tensor factorization algorithms can only generate point estimates for the missing entries, while in some cases it is desirable to obtain multiple-imputation datasets that are more representative of the joint variability of the predicted missing values. We therefore propose probabilistic tensor factorization algorithms, which are naturally applicable to incomplete tensors and provide both point estimates and multiple imputations for the missing entries. In this paper, we mainly focus on applications to retail sales datasets, but the framework and algorithms are applicable to other domains as well. Through extensive experiments on real-world retail sales data, we show that our models are competitive with state-of-the-art algorithms, both in prediction accuracy and running time.
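
As a point of contrast for the probabilistic approach, here is a minimal deterministic CP-style completion sketch: it fits factors by stochastic gradient descent on the observed entries only, and hence yields point estimates but no multiple imputations. The rank, step size, and regularization are illustrative assumptions.

```python
import numpy as np

def cp_complete(shape, entries, rank=5, lr=0.01, reg=0.1, epochs=50, seed=0):
    """Rank-R CP model for a 3-way tensor, X[i,j,k] ~ sum_r U[i,r]*V[j,r]*W[k,r],
    fit by SGD with L2 regularization on the observed entries only.
    `entries` is a list of ((i, j, k), value) pairs.
    A probabilistic treatment would instead place priors on U, V, W and
    infer posteriors; this deterministic sketch gives point estimates."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(shape[0], rank))
    V = rng.normal(scale=0.1, size=(shape[1], rank))
    W = rng.normal(scale=0.1, size=(shape[2], rank))
    for _ in range(epochs):
        for (i, j, k), v in entries:
            err = v - np.sum(U[i] * V[j] * W[k])
            gU = err * (V[j] * W[k]) - reg * U[i]
            gV = err * (U[i] * W[k]) - reg * V[j]
            gW = err * (U[i] * V[j]) - reg * W[k]
            U[i] += lr * gU; V[j] += lr * gV; W[k] += lr * gW
    return U, V, W  # predict a missing entry as np.sum(U[i] * V[j] * W[k])
```
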
Item: Social Topic Models for Community Extraction (2008-02-11)
Pathak, Nishith; DeLong, Colin; Erickson, Kendrick; Banerjee, Arindam
With social interaction playing an increasingly important role in the online world, the capability to extract latent communities based on such interactions is becoming vital for a wide variety of applications. However, existing literature on community extraction has largely focused on methods based on the link structure of a given social network. Such link-based methods ignore the content of social interactions, which may be crucial for accurate and meaningful community extraction. In this paper, we present a Bayesian generative model for community extraction that naturally incorporates both the link and content information present in the social network. The model assumes that actors in a community communicate on topics of mutual interest, and the topics of communication, in turn, determine the communities. Further, the model naturally allows actors to belong to multiple communities. The model is instantiated in the context of an email network, and a Gibbs sampling algorithm is presented for inference. Through extensive experiments and visualization on the Enron email corpus, we demonstrate that the model is able to extract well-connected and topically meaningful communities. Additionally, the model extracts relevant topics that can be mapped back to corresponding real-life events involving Enron.

Item: StochColor: Stochastic Coloring based Graph Partitioning (2010-04-30)
Pathak, Nishith; Banerjee, Arindam; Srivastava, Jaideep
Graph partitioning is a classical problem in computer science. Most existing algorithms are heuristic, spectral, or stochastic flow based methods. In this paper, a novel technique for graph partitioning is presented. The proposed algorithm, called StochColor, extracts partitions from the most likely state of a stochastic graph coloring process. Empirical results show that StochColor is comparable to, or significantly better than, state-of-the-art spectral clustering and stochastic flow based methods across a variety of applications.

Item: Stock Portfolio Selection Using Two-tiered Lazy Updates (2014-04-16)
Cook, Alexander; Johnson, Nicholas; Banerjee, Arindam
People make and lose vast sums of money every day on stock exchanges around the world. This research focused on developing a computer algorithm to build profitable portfolios while taking into account the transaction costs associated with trading stocks. The theory behind our algorithm is based on a subset of machine learning called online learning. Online learning makes updated decisions as new information is provided. In our case, a new decision is made each day on which stocks to buy or sell, based on transaction costs and the previous day's stock performance. The lazy update part of our algorithm seeks to minimize the quantity of trading, since trading incurs transaction costs. Our algorithm builds on prior work, dynamically learns which sectors to invest in, and takes into account risk, which has not been considered before in this line of work. Our Online Lazy Updates algorithm runs at a low level, choosing stocks within a sector, and at a high level, choosing the best sectors to invest in. We successfully establish our ability to be profitable with transaction costs on real-world datasets.
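
The "lazy" idea in a toy form: keep the learner's target portfolio, but trade only when the held portfolio has drifted enough that rebalancing is worth its transaction cost. This is an illustrative sketch, not the paper's algorithm; the threshold and cost_rate are assumptions, and the two-tier version would run such updates both across sectors and within each chosen sector.

```python
import numpy as np

def lazy_rebalance(w_held, w_target, cost_rate=0.001, threshold=0.02):
    """Thresholded ('lazy') rebalancing step.
    w_held: current holdings; w_target: the online learner's target
    weights (both nonnegative, summing to 1). Returns new holdings and
    the transaction cost paid, assuming cost proportional to turnover."""
    drift = np.abs(w_held - w_target).sum()  # total turnover if we trade
    if drift < threshold:
        return w_held, 0.0  # lazy: skip trading, incur no cost
    return w_target.copy(), cost_rate * drift
```
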