Browsing by Author "Heintz, Benjamin"
Now showing 1 - 7 of 7
Item MESH: A Flexible Distributed Hypergraph Processing System (2017-10-16)
Heintz, Benjamin; Singh, Shivangi; Hong, Rankyung; Khandelwal, Guarav; Tesdahl, Corey; Chandra, Abhishek
With the rapid growth of large online social networks, the ability to analyze large-scale social structure and behavior has become critically important, and this has led to the development of several scalable graph processing systems. In reality, however, social interaction takes place not just between pairs of individuals as in the graph model, but rather in the context of multi-user groups. Research has shown that such group dynamics can be better modeled through a more general hypergraph model, resulting in the need to build scalable hypergraph processing systems. In this paper, we present MESH, a flexible distributed framework for scalable hypergraph processing. MESH provides an easy-to-use and expressive application programming interface that naturally extends the “think like a vertex” model common to many popular graph processing systems. Our framework provides a flexible implementation based on an underlying graph processing system, and enables different design choices for the key implementation issues of hypergraph representation and partitioning. We implement MESH on top of the popular GraphX graph processing framework in Apache Spark. Using a variety of real datasets, we experimentally demonstrate that MESH provides flexibility based on data and application characteristics, examine its scaling with cluster size, and show that it is competitive in performance to HyperX, another hypergraph processing system based on Spark, showing that simplicity and flexibility need not come at the cost of performance.

Item MESH: A Flexible Distributed Hypergraph Processing System (2019-03-31)
Heintz, Benjamin; Hong, Rankyung; Singh, Shivangi; Khandelwal, Guarav; Tesdahl, Corey; Chandra, Abhishek
With the rapid growth of large online social networks, the ability to analyze large-scale social structure and behavior has become critically important, and this has led to the development of several scalable graph processing systems. In reality, however, social interaction takes place not only between pairs of individuals as in the graph model, but rather in the context of multi-user groups. Research has shown that such group dynamics can be better modeled through a more general hypergraph model, resulting in the need to build scalable hypergraph processing systems. In this paper, we present MESH, a flexible distributed framework for scalable hypergraph processing. MESH provides an easy-to-use and expressive application programming interface that naturally extends the “think like a vertex” model common to many popular graph processing systems. Our framework provides a flexible implementation based on an underlying graph processing system, and enables different design choices for the key implementation issues of partitioning a hypergraph representation. We implement MESH on top of the popular GraphX graph processing framework in Apache Spark. Using a variety of real datasets and experiments conducted on a local 8-node cluster as well as a 65-node Amazon AWS testbed, we demonstrate that MESH provides flexibility based on data and application characteristics, as well as scalability with cluster size. We further show that it is competitive in performance to HyperX, another hypergraph processing system based on Spark, while providing a much simpler implementation (requiring about 5X fewer lines of code), thus showing that simplicity and flexibility need not come at the cost of performance.
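The “think like a vertex” generalization described in the MESH abstracts above can be pictured with a small sketch. The Python below is purely illustrative and is not the actual MESH or GraphX API; the names HyperGraph, vertex_program, and hyperedge_program are invented for this example. It alternates a hyperedge-centric step (each hyperedge aggregates the states of its member vertices) with a vertex-centric step (each vertex combines the states of the hyperedges it belongs to), one natural way to extend a vertex-centric model to hypergraphs.

```python
# Hypothetical sketch of a vertex/hyperedge-program style of computation.
# Not the MESH API; names and structure are assumptions for illustration only.

from collections import defaultdict

class HyperGraph:
    def __init__(self, hyperedges):
        # hyperedges: dict mapping hyperedge id -> iterable of member vertex ids
        self.hyperedges = {e: set(vs) for e, vs in hyperedges.items()}
        self.memberships = defaultdict(set)   # vertex id -> set of hyperedge ids
        for e, vs in self.hyperedges.items():
            for v in vs:
                self.memberships[v].add(e)

    def run(self, vertex_program, hyperedge_program, num_iters, init):
        """Alternate hyperedge-centric and vertex-centric updates."""
        v_state = {v: init for v in self.memberships}
        for _ in range(num_iters):
            # Each hyperedge aggregates the states of its member vertices.
            e_state = {e: hyperedge_program([v_state[v] for v in vs])
                       for e, vs in self.hyperedges.items()}
            # Each vertex combines the states of the hyperedges it belongs to.
            v_state = {v: vertex_program([e_state[e] for e in es])
                       for v, es in self.memberships.items()}
        return v_state

# Example: each vertex sums the sizes of the groups (hyperedges) it belongs to.
hg = HyperGraph({"chat": ["alice", "bob"], "forum": ["bob", "carol", "dave"]})
sizes = hg.run(vertex_program=sum, hyperedge_program=len, num_iters=1, init=0)
print(sizes)   # e.g. {'alice': 2, 'bob': 5, 'carol': 3, 'dave': 3}
```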
Item MESH: A Flexible Distributed Hypergraph Processing System (2016-12-05)
Heintz, Benjamin; Singh, Shivangi; Tesdahl, Corey; Chandra, Abhishek
With the rapid growth of large online social networks, the ability to analyze large-scale social structure and behavior has become critically important, and this has led to the development of several scalable graph processing systems. In reality, however, social interaction takes place not just between pairs of individuals as in the graph model, but rather in the context of multi-user groups. Research has shown that such group dynamics can be better modeled through a more general hypergraph model, resulting in the need to build scalable hypergraph processing systems. In this paper, we present MESH, a flexible distributed framework for scalable hypergraph processing. MESH provides an easy-to-use and expressive application programming interface that naturally extends the "think like a vertex" model common to many popular graph processing systems. Our framework provides a flexible implementation that enables different design choices for the key implementation issues of hypergraph representation and partitioning. We implement MESH on top of the popular GraphX graph processing framework in Apache Spark. Using a variety of real datasets, we experimentally demonstrate that MESH provides flexibility based on data and application characteristics, and is competitive in performance to HyperX, another hypergraph processing system based on Spark, showing that simplicity and flexibility need not come at the cost of performance.

Item Optimizing Grouped Aggregation in Geo-Distributed Streaming Analytics (2015-01-26)
Heintz, Benjamin; Chandra, Abhishek; Sitaraman, Ramesh K.
Large quantities of data are generated continuously over time and from disparate sources such as users, devices, and sensors located around the globe. This results in the need for efficient geo-distributed streaming analytics to extract timely information. A typical analytics service in these settings uses a simple hub-and-spoke model, comprising a single central data warehouse and multiple edges connected by a wide-area network (WAN). A key decision for a geo-distributed streaming service is how much of the computation should be performed at the edge versus the center. In this paper, we examine this question in the context of windowed grouped aggregation, an important and widely used primitive in streaming queries. Our work is focused on designing aggregation algorithms to optimize two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness (the delay in getting the result). Toward this end, we present a family of optimal offline algorithms that jointly minimize both staleness and traffic. Using this as a foundation, we develop practical online aggregation algorithms based on the observation that grouped aggregation can be modeled as a caching problem where the cache size varies over time. This key insight allows us to exploit well known caching techniques in our design of online aggregation algorithms. We demonstrate the practicality of these algorithms through an implementation in Apache Storm, deployed on the PlanetLab testbed. The results of our experiments, driven by workloads derived from traces of a popular web analytics service offered by a large commercial CDN, show that our online aggregation algorithms perform close to the optimal algorithms for a variety of system configurations, stream arrival rates, and query types.
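One way to picture the caching analogy in the grouped-aggregation abstract above: an edge node keeps a bounded set of per-group partial aggregates and ships an aggregate to the center only when it is evicted, so the cache size governs the traffic/staleness tradeoff. The Python below is an illustrative toy, not the paper's algorithm or its Apache Storm implementation; EdgeAggregator and its interface are hypothetical.

```python
# Toy edge-side aggregation with flush-on-eviction (illustrative only).
# A larger cache batches more updates per key (less WAN traffic, more staleness);
# a smaller cache flushes sooner (more traffic, less staleness).

from collections import OrderedDict

class EdgeAggregator:
    def __init__(self, cache_size, send):
        self.cache_size = cache_size   # could be varied over time by an online policy
        self.cache = OrderedDict()     # group key -> partial aggregate (a count here)
        self.send = send               # callable that ships (key, aggregate) to the center

    def update(self, key, value=1):
        self.cache[key] = self.cache.pop(key, 0) + value    # LRU-style reordering
        while len(self.cache) > self.cache_size:
            old_key, aggregate = self.cache.popitem(last=False)
            self.send(old_key, aggregate)                    # WAN transmission

    def flush(self):
        while self.cache:
            self.send(*self.cache.popitem(last=False))

wan_messages = []
edge = EdgeAggregator(cache_size=2, send=lambda k, v: wan_messages.append((k, v)))
for key in ["page_a", "page_b", "page_a", "page_c", "page_b"]:
    edge.update(key)
edge.flush()
print(wan_messages)   # fewer WAN messages than the five raw events
```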
Item Optimizing MapReduce for Highly Distributed Environments (2012-02-13)
Heintz, Benjamin; Chandra, Abhishek; Sitaraman, Ramesh K.
MapReduce, the popular programming paradigm for large-scale data processing, has traditionally been deployed over tightly-coupled clusters where the data is already locally available. The assumption that the data and compute resources are available in a single central location, however, no longer holds for many emerging applications in commercial, scientific and social networking domains, where the data is generated in a geographically distributed manner. Further, the computational resources needed for carrying out the data analysis may be distributed across multiple data centers or community resources such as Grids. In this paper, we develop a modeling framework to capture MapReduce execution in a highly distributed environment comprising distributed data sources and distributed computational resources. This framework is flexible enough to capture several design choices and performance optimizations for MapReduce execution. We propose a model-driven optimization that has two key features: (i) it is end-to-end as opposed to myopic optimizations that may only make locally optimal but globally suboptimal decisions, and (ii) it can control multiple MapReduce phases to achieve low runtime, as opposed to single-phase optimizations that may control only individual phases. Our model results show that our optimization can provide nearly 82% and 64% reduction in execution time over myopic and single-phase optimizations, respectively. We have modified Hadoop to implement our model outputs, and using three different MapReduce applications over an 8-node emulated PlanetLab testbed, we show that our optimized Hadoop execution plan achieves 31-41% reduction in runtime over a vanilla Hadoop execution. Our model-driven optimization also provides several insights into the choice of techniques and execution parameters based on application and platform characteristics.

Item Optimizing Timeliness, Accuracy, and Cost in Geo-Distributed Data-Intensive Computing Systems (2016-12)
Heintz, Benjamin
Big Data touches every aspect of our lives, from the way we spend our free time to the way we make scientific discoveries. Netflix streamed more than 42 billion hours of video in 2015, and in the process recorded massive volumes of data to inform video recommendations and plan investments in new content. The CERN Large Hadron Collider produces enough data to fill more than one billion DVDs every week, and this data has led to the discovery of the Higgs boson particle. Such large scale computing is challenging because no one machine is capable of ingesting, storing, or processing all of the data. Instead, applications require distributed systems comprising many machines working in concert. Adding to the challenge, many data streams originate from geographically distributed sources. Scientific sensors such as LIGO span multiple sites and generate data too massive to process at any one location. The machines that analyze these data are also geo-distributed; for example Netflix and Facebook users span the globe, and so do the machines used to analyze their behavior. Many applications need to process geo-distributed data on geo-distributed systems with low latency. A key challenge in achieving this requirement is determining where to carry out the computation. For applications that process unbounded data streams, two performance metrics are critical: WAN traffic and staleness (i.e., delay in receiving results). To optimize these metrics, a system must determine when to communicate results from distributed resources to a central data warehouse. As an additional challenge, constrained WAN bandwidth often renders exact computation infeasible. Fortunately, many applications can tolerate inaccuracy, albeit with diverse preferences. To support diverse applications, systems must determine what partial results to communicate in order to achieve the desired staleness-error tradeoff. This thesis presents answers to these three questions--where to compute, when to communicate, and what partial results to communicate--in two contexts: batch computing, where the complete input data set is available prior to computation; and stream computing, where input data are continuously generated. We also explore the challenges facing emerging programming models and execution engines that unify stream and batch computing.
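The distinction between end-to-end and myopic (per-phase) optimization drawn in the MapReduce abstract above can be made concrete with a toy placement model. The sketch below is hypothetical and far simpler than the paper's framework (two phases, one reduce task, transfer time only); the sources, sites, bandwidths, and selectivity are invented numbers used solely to show why jointly choosing map and reduce placement can beat greedy per-phase choices.

```python
# Toy end-to-end placement model (assumption-laden illustration, not the paper's model
# or its Hadoop modifications): enumerate where each source's map work runs and where
# the reduce runs, then pick the plan with the lowest end-to-end transfer time.

sources = {"eu": 40.0, "us": 10.0}                   # GB of input at each data source
sites = ["eu-dc", "us-dc"]                           # candidate compute sites
bw = {("eu", "eu-dc"): 10.0, ("eu", "us-dc"): 1.0,   # GB/s on source-to-site links
      ("us", "eu-dc"): 1.0,  ("us", "us-dc"): 10.0,
      ("eu-dc", "us-dc"): 2.0, ("us-dc", "eu-dc"): 2.0}  # inter-site links
MAP_SELECTIVITY = 0.2                                # fraction of input surviving the map phase

def plan_cost(map_site, reduce_site):
    """End-to-end time: parallel push of inputs, then shuffle of map output."""
    push = max(sources[s] / bw[(s, map_site[s])] for s in sources)
    shuffle = sum(sources[s] * MAP_SELECTIVITY / bw[(map_site[s], reduce_site)]
                  for s in sources if map_site[s] != reduce_site)
    return push + shuffle

plans = [({"eu": m_eu, "us": m_us}, r)
         for m_eu in sites for m_us in sites for r in sites]
best = min(plans, key=lambda plan: plan_cost(*plan))
print("best plan:", best, "end-to-end time:", round(plan_cost(*best), 2), "s")
```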
Item Trading Timeliness and Accuracy in Geo-Distributed Streaming Analytics (2016-03-03)
Heintz, Benjamin; Chandra, Abhishek; Sitaraman, Ramesh K.
Many applications must ingest rapid streams of data and produce analytics results in near-real-time. Whether the input streams represent sensor data from smart homes, user interaction logs from streaming video clients, or server logs from a content delivery network (CDN), it is common for such streams to originate from geographically distributed sources. The typical infrastructure for processing these geo-distributed streams follows a hub-and-spoke model, where several edge resources perform partial computation before forwarding results over a wide-area network (WAN) to a central location for final processing. Due to limited WAN bandwidth, it is not always possible to produce exact results in near-real-time. When this is the case, applications must either sacrifice timeliness by allowing delayed---and in turn stale---results, or sacrifice accuracy by allowing some error in final results. In this paper, we focus on windowed grouped aggregation, an important and widely used primitive in streaming analytics, and we study the tradeoff between the key metrics of staleness and error. We present optimal offline algorithms for minimizing staleness under an error constraint and for minimizing error under a staleness constraint. Using these offline algorithms as references, we present practical online algorithms for effectively trading off timeliness and accuracy in the face of bandwidth limitations. Using a workload derived from a web analytics service offered by a large commercial CDN, we demonstrate the effectiveness of our techniques through a trace-driven simulation. Our results show that our proposed algorithms outperform several baseline algorithms for a range of error and staleness bounds, for a variety of aggregation functions under different network bandwidth constraints.
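As a rough illustration of trading accuracy for timeliness under a bandwidth cap, in the spirit of the abstract above (this is not the paper's offline or online algorithm): each window, send only the k group aggregates that have drifted most from the values last reported to the center, and account for the rest as residual error until a later window. The function name flush_top_k and the message budget k are assumptions made for this sketch.

```python
# Minimal illustration of a staleness/error tradeoff under a per-window message budget.
# Hypothetical policy: report the k most-drifted groups; the rest carry bounded error.

from collections import Counter

def flush_top_k(running_counts, reported, k):
    """Send at most k (key, value) updates; return them and the residual error."""
    drift = {key: count - reported.get(key, 0) for key, count in running_counts.items()}
    chosen = sorted(drift, key=lambda key: abs(drift[key]), reverse=True)[:k]
    updates = [(key, running_counts[key]) for key in chosen]
    for key, value in updates:
        reported[key] = value                       # center now has the exact value
    error = sum(abs(d) for key, d in drift.items() if key not in chosen)
    return updates, error

reported = {}            # the center's view of each group's aggregate
totals = Counter()       # the edge's running per-group counts
windows = [["a", "a", "b", "c"], ["a", "b", "b", "b", "d"]]
for events in windows:
    totals.update(events)
    updates, error = flush_top_k(totals, reported, k=2)
    print("sent:", updates, "residual error:", error)
```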