Browsing by Author "Hsu, Kuo-Wei"
Now showing 1 - 3 of 3
Item I/O-Scalable Bregman Co-clustering and Its Application to the Analysis of Social Media (2011-10-10)
Hsu, Kuo-Wei; Srivastava, Jaideep
Adoption of social media has experienced explosive growth in recent years, and this trend appears likely to continue. A natural consequence has been the creation of vast quantities of data by social media applications, and hence increased interest from the database community. This data also provides unique opportunities to understand sociological and psychological aspects, human interaction, and media production/consumption, and hence the growth of areas such as user modeling, behavior analysis, and social network analysis, which together are being labeled the emerging area of Computational Social Science (CSS) [37, 59]. These new types of data analysis are leading to the introduction of new computational techniques, e.g., p* modeling, ERGMs [62], and co-clustering [6]. This paper focuses on a scalable implementation of the Bregman co-clustering algorithm and its application to social media analysis. The Bregman co-clustering algorithm performs two-way clustering and is theoretically scalable; we discuss an OLAP-based implementation that achieves this scalability in practice. Principally, we demonstrate how the aggregations required by the algorithm map naturally to summary statistics computed by an OLAP engine and stored in data cubes. Our OLAP-based implementation of the algorithm can handle large-scale datasets, i.e., datasets too large for main-memory-based implementations. Further, we explore the suitability of the relational model for social media data; specifically, we argue that data cubes and the star schema are well suited to managing it. Our research is a step toward connecting three areas of growing interest to the research community: databases, data mining, and social media analysis.
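To make the OLAP mapping concrete, here is a minimal Python sketch of the idea this abstract describes: the row-cluster, column-cluster, and co-cluster aggregates that each Bregman co-clustering iteration reads are exactly GROUP BY-style summary statistics, which an OLAP engine could precompute and serve from a data cube instead of holding the full matrix in memory. The data, cluster assignments, and names below are illustrative assumptions, not the paper's implementation; dictionary group-bys stand in for cube queries over a star schema.

```python
from collections import defaultdict

# Toy interaction facts (row_id, col_id, value), e.g. user-by-item activity
# counts from a social media log; purely illustrative data.
facts = [
    ("u1", "post_a", 3.0), ("u1", "post_b", 1.0),
    ("u2", "post_a", 2.0), ("u2", "post_c", 5.0),
    ("u3", "post_b", 4.0), ("u3", "post_c", 1.0),
]

# Hypothetical cluster assignments at some iteration of the algorithm.
row_cluster = {"u1": 0, "u2": 0, "u3": 1}
col_cluster = {"post_a": 0, "post_b": 1, "post_c": 1}

def summarize(facts):
    """Compute (sum, count) per row cluster, per column cluster, and per
    co-cluster block: the summary statistics a co-clustering step needs.
    Each one is a GROUP BY an OLAP engine can answer from a cube."""
    by_row = defaultdict(lambda: [0.0, 0])
    by_col = defaultdict(lambda: [0.0, 0])
    by_block = defaultdict(lambda: [0.0, 0])
    for r, c, v in facts:
        for key, table in ((row_cluster[r], by_row),
                           (col_cluster[c], by_col),
                           ((row_cluster[r], col_cluster[c]), by_block)):
            table[key][0] += v
            table[key][1] += 1
    return by_row, by_col, by_block

by_row, by_col, by_block = summarize(facts)
for (rc, cc), (s, n) in sorted(by_block.items()):
    print(f"co-cluster ({rc}, {cc}): sum={s}, mean={s / n:.2f}")
```

Because only these compact summaries are needed per iteration, the raw matrix can stay on disk, which is the sense in which the implementation scales beyond main memory.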
Item Mapping Multi-Layer Bayesian LDA to Massively Parallel Supercomputers (2011-10-10)
Hsu, Kuo-Wei; Lin, Ching-Yung; Srivastava, Jaideep
LDA, short for Latent Dirichlet Allocation, is a hierarchical Bayesian model for content analysis. LDA has seen a wide variety of applications, but it also presents computational challenges because iterative computation of approximate inference is required. Recently, an approach based on Gibbs sampling and MPI was proposed to address these challenges; this report presents work that maps it to a massively parallel supercomputer, Blue Gene. The work enhances runtime performance by exploiting special hardware features of Blue Gene, such as its dual floating-point unit, and by applying general programming/compiling techniques such as loop unfolding. Results from an empirical evaluation on a real-world large-scale dataset indicate the following: First, the dual floating-point unit contributes a significant performance gain, and it should therefore be considered in the design of processors for computationally intensive machine learning applications. Second, although it is a simple technique that most compilers already support, loop unfolding improves performance even further; a sketch of the transformation appears after the final item below. Since loop unfolding is general enough to be applied on other platforms, the report suggests that compilers should perform it in a more intelligent manner.

Item Unsupervised Learning Based Distributed Detection of Global Anomalies (2008-07-18)
Zhou, Junlin; Lazarevic, Aleksandar; Hsu, Kuo-Wei; Srivastava, Jaideep
Anomaly detection has recently become an important problem in many industrial and financial applications. Very often, the databases from which anomalies have to be found are located at multiple local sites and cannot be merged, due to privacy concerns or communication overhead. In this paper, a novel general framework for distributed anomaly detection is proposed. The proposed method consists of three steps: (i) building local models for distributed data sources with unsupervised anomaly detection methods, (ii) transforming the local models into uniform models, and (iii) reusing the learned models on new data and combining their results, taking into account both their quality and diversity, to detect anomalies from a global view. In experiments on several large synthetic and real-life datasets, the proposed distributed method achieved prediction performance comparable to, or even slightly better than, the global anomaly detection algorithm applied to the dataset obtained by merging all distributed datasets.
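As promised in the LDA item above, here is a sketch of loop unfolding itself. The report's optimization targets compiled code on Blue Gene; this Python version (all names hypothetical) only illustrates the transformation, applied to the kind of per-topic inner loop that collapsed Gibbs sampling for LDA executes over and over.

```python
# Illustrative only: `weight(k)` stands in for the unnormalized probability
# of assigning the current word to topic k in a collapsed Gibbs sampler.

def cumulative_weights(num_topics, weight):
    """Baseline loop: one iteration, and one loop-control step, per topic."""
    total, out = 0.0, []
    for k in range(num_topics):
        total += weight(k)
        out.append(total)
    return out

def cumulative_weights_unfolded(num_topics, weight):
    """Same computation with the body unfolded four-fold: fewer loop-control
    steps per topic, which is the saving compiler unrolling buys."""
    total, out = 0.0, []
    k = 0
    while k + 4 <= num_topics:
        total += weight(k);     out.append(total)
        total += weight(k + 1); out.append(total)
        total += weight(k + 2); out.append(total)
        total += weight(k + 3); out.append(total)
        k += 4
    while k < num_topics:       # remainder loop when num_topics % 4 != 0
        total += weight(k)
        out.append(total)
        k += 1
    return out

w = lambda k: 1.0 / (k + 1)     # hypothetical topic weights
assert cumulative_weights(10, w) == cumulative_weights_unfolded(10, w)
```

Both versions perform the additions in the same order, so the results match exactly; only the loop overhead changes, which is why the report treats unfolding as a portable, compiler-friendly optimization.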
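The three-step framework in the anomaly-detection abstract above can likewise be sketched with simple stand-ins: a z-score detector as the local unsupervised model, a rank-based transform onto a uniform [0, 1] scale, and a quality-weighted average as the combination (diversity weighting omitted for brevity). All model choices and names here are assumptions, not the paper's actual methods.

```python
import statistics

def local_model(site_data):
    """Step (i): build an unsupervised local model at each site; a plain
    z-score detector stands in for the paper's local methods."""
    mu = statistics.mean(site_data)
    sigma = statistics.pstdev(site_data) or 1.0   # guard against zero spread
    return lambda x: abs(x - mu) / sigma

def to_uniform(score, reference_scores):
    """Step (ii): map a raw local score onto a uniform [0, 1] scale (the
    fraction of the site's own scores it meets or exceeds), so scores from
    differently scaled local models become comparable."""
    return sum(s <= score for s in reference_scores) / len(reference_scores)

def global_score(x, models, references, quality):
    """Step (iii): reuse every learned model on the new point and combine
    the uniform scores, weighted by an assumed per-model quality estimate."""
    return sum(q * to_uniform(m(x), ref)
               for m, ref, q in zip(models, references, quality)) / sum(quality)

# Two sites whose raw data never leaves them; only models and score
# summaries travel, which is the privacy/communication motivation.
sites = [[1.0, 1.2, 0.9, 1.1, 5.0], [2.0, 2.1, 1.9, 2.2, 2.0]]
models = [local_model(d) for d in sites]
references = [[m(x) for x in d] for m, d in zip(models, sites)]
quality = [1.0, 1.0]                  # e.g. weights from local validation

print(global_score(9.0, models, references, quality))  # high: globally anomalous
print(global_score(1.5, models, references, quality))  # lower: closer to normal
```

The uniform transform in step (ii) is what lets scores from heterogeneous local detectors be averaged meaningfully in step (iii), mirroring the abstract's "uniform models" idea.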