Browsing by Author "Anastasiu, David C."

Big Data and Recommender Systems (2016-09-12)
Anastasiu, David C.; Christakopoulou, Evangelia; Smith, Shaden; Sharma, Mohit; Karypis, George
Recommender systems are ubiquitous in today's marketplace and have great commercial importance, as evidenced by the large number of companies that sell recommender systems solutions. Successful recommender systems use past product purchase and satisfaction data to make high-quality personalized recommendations. The vast amounts of data available to recommender systems today force a total re-evaluation of the methods used to compute recommendations. In this paper, we provide an overview of recommender systems in the era of Big Data. We highlight prevailing recommendation algorithms and how they have been adapted to operate in parallel and distributed computing environments. Within the recommender systems context, we focus our discussion on two specific challenges: how to scale up finding nearest neighbors and how to scale latent factor recommendation methods.

Big Data Frequent Pattern Mining (2014-07-09)
Anastasiu, David C.; Iverson, Jeremy; Smith, Shaden; Karypis, George
Frequent pattern mining is an essential data mining task, with the goal of discovering knowledge in the form of repeated patterns. Many efficient pattern mining algorithms have been discovered in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called "Big Data". Scalable parallel algorithms hold the key to solving the problem in this context. In this chapter, we review recent advances in parallel frequent pattern mining, analyzing them through the Big Data lens. We identify three areas as challenges to designing parallel frequent pattern mining algorithms: memory scalability, work partitioning, and load balancing.
With these challenges as a frame of reference, we extract and describe key algorithmic design patterns from the wealth of research conducted in this domain.

Document Clustering: The Next Frontier (2013-02-04)
Anastasiu, David C.; Tagarelli, Andrea; Karypis, George
The proliferation of documents, both on the Web and in private systems, makes knowledge discovery in document collections arduous. Clustering has long been recognized as a useful tool for the task. It groups like items together, maximizing intra-cluster similarity and inter-cluster distance. Clustering can provide insight into the make-up of a document collection and is often used as the initial step in data analysis. While most document clustering research to date has focused on moderate-length, single-topic documents, real-life collections are often made up of very short or long documents. Short documents do not contain enough text to accurately compute similarities. Long documents often span multiple topics that general document similarity measures do not take into account. In this paper, we first give an overview of general-purpose document clustering, and then focus on recent advancements in the next frontier in document clustering: long and short documents.

PL2AP: Fast Parallel Cosine Similarity Search (2015-11-16)
Anastasiu, David C.; Karypis, George
Solving the AllPairs similarity search problem entails finding all pairs of vectors in a high-dimensional sparse dataset that have a similarity value higher than a given threshold. The output from this problem is a crucial component in many real-world applications, such as clustering, online advertising, recommender systems, near-duplicate document detection, and query refinement. A number of serial algorithms have been proposed that solve the problem by pruning many of the possible similarity candidates for each query object, after accessing only a few of their non-zero values.
The pruning process results in unpredictable memory access patterns that can reduce search efficiency. In this context, we introduce pL2AP, which efficiently solves the AllPairs cosine similarity search problem in a multi-core environment. Our method uses a number of cache-tiling optimizations, combined with fine-grained dynamically balanced parallel tasks, to solve the problem 1.5x to 232x faster than existing parallel baselines on datasets with hundreds of millions of non-zeros.

Understanding Computer Usage Evolution (2014-10-10)
Anastasiu, David C.; Rashid, Al M.; Tagarelli, Andrea; Karypis, George
The proliferation of computing devices in recent years has dramatically changed the way people work, play, communicate, and access information. The personal computer (PC) now has to compete with smartphones, tablets, and other devices for tasks for which it used to be the default device. Understanding how PC usage evolves over time can help provide the best overall user experience for current customers, can help determine when they need brand new systems vs. upgraded components, and can inform future product design to better anticipate user needs. In this paper, we introduce a method for the analysis of users' computer usage evolution. Our algorithm, Orion, segments the application-level usage of different users into a sequence of prototypical usage patterns shared among users, referred to as protos. Using an iterative process, protos are automatically derived from the segmentation, and an optimal segmentation is determined from the protos by a dynamic programming algorithm. To ensure that the segmentation is robust, constraints on the length and the number of segments are utilized. We show the validity of our method by analyzing a dataset consisting of over 28K users whose PC usage covers approximately 1M weeks.
Our results show that different groups of users exhibit different usage patterns, the usage patterns of nearly 50% of the users change over time, and more than 20% of the users undergo multiple changes. Moreover, many of the differences in the usage patterns and their changes appear to correlate with various user-specific information, such as their geographic location and/or the type of computer system that they have. To show the versatility of Orion, we present additional results from an analysis of 57K grocery store orders of nearly 1000 users.