Browsing by Author "Chandra, Abhishek"
Now showing 1 - 20 of 32
Item: Accessibility-based Resource Selection in Loosely-coupled Distributed Systems (2007-11-20). Kim, Jinoh; Chandra, Abhishek; Weissman, Jon.
Large-scale distributed systems provide an attractive scalable infrastructure for network applications. However, the loosely-coupled nature of this environment can make data access unpredictable and, in the limit, unavailable. Availability is normally characterized as a binary property, yes or no, often with an associated probability. However, availability conveys little in terms of expected data access performance. Using availability alone, jobs may suffer intolerable response times, or even fail to complete, due to poor data access. We introduce the notion of accessibility, a more general concept, to capture both availability and performance. An increasing number of data-intensive applications require not only considerations of node computation power but also accessibility for adequate job allocations. For instance, selecting a node with intolerably slow connections can offset any benefit of running on a fast node. In this paper, we present accessibility-aware resource selection techniques by which it is possible to choose nodes that will have efficient data access to remote data sources. We find that the local data access observations collected from a node's neighbors are sufficient to characterize accessibility for that node. We then present resource selection heuristics guided by this principle, and show that they significantly outperform standard techniques. We also investigate the impact of churn, in which nodes change their status of participation such that they lose their memory of prior observations.
Despite this level of unreliability, we show that the suggested techniques yield good results.

Item: Awan: Locality-aware Resource Manager for Geo-distributed Data-intensive Applications (2016-03-10). Jonathan, Albert; Chandra, Abhishek; Weissman, Jon.
Today, many organizations need to operate on data that is distributed around the globe. This is inevitable due to the nature of data generated in different locations, such as video feeds from distributed cameras, log files from distributed servers, and many others. Although centralized cloud platforms have been widely used for data-intensive applications, such systems are not suitable for processing geo-distributed data due to high data transfer overheads. An alternative approach is to use an Edge Cloud, which reduces the network cost of transferring data by distributing its computations globally. While the Edge Cloud is attractive for geo-distributed data-intensive applications, extending existing cluster computing frameworks to a wide-area environment must account for locality. We propose Awan: a new locality-aware resource manager for geo-distributed data-intensive applications. Awan allows resource sharing between multiple computing frameworks while enabling high-locality scheduling within each framework. Our experiments with the Nebula Edge Cloud on PlanetLab show that Awan achieves up to a 28% increase in locality scheduling, which reduces the average job turnaround time by approximately 18% compared to existing cluster management mechanisms.

Item: Awan: Locality-aware Resource Manager for Geo-distributed Data-intensive Applications (2015-11-18). Jonathan, Albert; Chandra, Abhishek; Weissman, Jon.
Today, many organizations need to operate on data that is distributed around the globe. This is inevitable due to the nature of data generated in different locations, such as video feeds from distributed cameras, log files from distributed servers, and many others.
Although centralized cloud platforms have been widely used for data-intensive applications, such systems are not suitable for processing geo-distributed data due to high data transfer overheads. An alternative approach is to use an Edge Cloud, which reduces the network cost of transferring data by distributing its computations globally. While the Edge Cloud is attractive for geo-distributed data-intensive applications, extending existing cluster computing frameworks to a wide-area environment must account for locality. We propose Awan: a new locality-aware resource manager for geo-distributed data-intensive applications. Awan allows resource sharing between multiple computing frameworks while enabling high-locality scheduling within each framework. Our experiments with the Nebula Edge Cloud on PlanetLab show that Awan achieves up to a 28% increase in locality scheduling, which reduces the average job turnaround time by approximately 20% compared to existing cluster management mechanisms.

Item: dSENSE: Data-driven Stochastic Energy Management for Wireless Sensor Platforms (2005-05-09). Liu, Haiyang; Chandra, Abhishek; Srivastava, Jaideep.
Wireless sensor networks are being widely deployed for providing physical measurements to diverse applications. Energy is a precious resource in such networks, as nodes in wireless sensor platforms are typically powered by batteries with limited power and high replacement cost. This paper presents dSENSE: a data-driven approach for energy management in sensor platforms. dSENSE is a node-level power management approach that utilizes knowledge of the underlying data streams as well as application data quality requirements to conserve energy on a sensor node. dSENSE employs sense-on-change, a sampling strategy that aggressively conserves power by reducing sensing activity on the sensor node. Unlike most existing energy management techniques, this strategy enables explicit control of the sensor along with the CPU and the radio.
Our approach uses an efficient statistical data stream model to predict future sensor readings. These predictions are coupled with a stochastic scheduling algorithm to dynamically control the operating modes of the sensor node components. Using experimental results obtained on PowerTOSSIM with a real-world data trace, we demonstrate that our approach reduces energy consumption by 29-36% while providing strong statistical guarantees on data quality.

Item: Dynamic Outsourcing Mobile Computation to the Cloud (2011-03-14). Mei, Chonglei; Shimek, James; Wang, Chenyu; Chandra, Abhishek; Weissman, Jon.
Mobile devices are becoming the universal interface to online services and cloud computing applications. Since mobile phones have limited computing power and battery life, there is a potential to migrate computation-intensive application components to external computing resources. The Cloud is an attractive platform for offloading due to elastic resource provisioning and the ability to support large-scale service deployment. In this paper, we discuss the potential for offloading mobile computation for different computation patterns and analyze their tradeoffs. Experiments show that we can achieve a 27-fold speedup for an image manipulation application and a 1.47-fold speedup for a face recognition application. In addition, this outsourcing model can also result in power savings depending on the computation pattern, offloading configuration, and execution environment.

Item: Early Experience with Mobile Computation Outsourcing (2010-08-25). Ramakrishnan, Siddharth; Reutiman, Robert; Iverson, Harlan; Lian, Shimin; Chandra, Abhishek; Weissman, Jon.
End-user mobile devices such as smartphones, PDAs, and tablets offer the promise of anywhere, anytime computing and communication.
Users increasingly expect their mobile device to support the same activities (and the same performance) as their desktop counterparts: seamless multitasking, ubiquitous data access, social networking, game playing, and real work, all while on the go. To support mobility, these devices are typically constrained in terms of their compute power, energy, and network bandwidth. Supporting the full array of desired desktop applications and behavior while retaining the flexibility of the mobile device cannot always be accomplished using local resources alone. To achieve this vision, additional resources must be easily harnessed on demand. In this paper, we describe opportunities for mobile outsourcing using external proxy resources. We present several application scenarios and preliminary results.

Item: Exploiting Heterogeneity for Collective Data Downloading in Volunteer-based Networks (2006-11-29). Kim, Jinoh; Chandra, Abhishek; Weissman, Jon.
Scientific computing is being increasingly deployed over volunteer-based distributed computing environments consisting of idle resources on donated user machines. A fundamental challenge in these environments is the dissemination of data to the computation nodes, with the successful completion of jobs being driven by the efficiency of collective data download across compute nodes, not only the individual download times. This paper considers the use of a data network consisting of data distributed across a set of data servers, and focuses on the server selection problem: how do individual nodes select a server for downloading data so as to minimize the communication makespan, the maximal download time for a data file? Through experiments conducted on a Pastry network running on PlanetLab, we demonstrate that nodes in a volunteer-based network are heterogeneous in terms of several metrics, such as bandwidth, load, and capacity, which impact their download behavior.
We propose new server selection heuristics that incorporate these metrics, and demonstrate that these heuristics outperform traditional proximity-based server selection, reducing average makespans by at least 30%. We further show that incorporating information about download concurrency avoids overloading servers and improves performance by about 17-43% over heuristics that consider only proximity and bandwidth.

Item: Exploiting Spatio-Temporal Tradeoffs for Energy Efficient MapReduce in the Cloud (2010-04-07). Cardosa, Michael; Singh, Aameek; Pucha, Himabindu; Chandra, Abhishek.
MapReduce is a distributed computing paradigm that is being widely used for building large-scale data processing applications like content indexing, data mining, and log file analysis. When offered in the cloud, MapReduce allows users to construct their own virtualized clusters using virtual machines (VMs) managed by the cloud service provider. However, to maintain low costs for such cloud services, cloud operators must optimize the energy consumption of these applications. In this paper, we describe a unique spatio-temporal tradeoff for achieving energy efficiency for MapReduce jobs in such virtualized environments. The tradeoff includes efficient spatial fitting of VMs on servers to achieve high utilization of machine resources, as well as balanced temporal fitting of servers with VMs having similar runtimes to ensure that a server runs at high utilization throughout its uptime. To study this tradeoff, we propose a set of metrics that quantify the different sources of resource wastage. We then propose VM placement algorithms that explicitly incorporate these spatio-temporal tradeoffs, combining a recipe placement algorithm for spatial fitting with a temporal binning algorithm for time balancing.
We also propose an incremental time balancing algorithm (ITB) that can improve energy efficiency even further by transparently increasing the cluster size for MapReduce jobs, while improving their performance at the same time. Our simulation results show that our spatio-temporal placement algorithms achieve energy savings of 20-35% over existing spatially-efficient placement techniques, and come within 12% of a baseline lower-bound algorithm. Further, the ITB algorithm achieves additional savings of up to 15% over the spatio-temporal algorithms by reducing job runtimes by 5-35%.

Item: Exploring the Throughput-Fairness Tradeoff of Deadline Scheduling in Heterogeneous Computing Environments (2008-01-31). Sundaram, Vasumathi; Chandra, Abhishek; Weissman, Jon.
The scalability and computing power of large-scale computational platforms that harness processing cycles from distributed nodes have made them attractive for hosting compute-intensive, time-critical applications. Many of these applications are composed of computational tasks that require specific deadlines to be met for successful completion. In scheduling such tasks, replication becomes necessary due to the heterogeneity and dynamism inherent in these computational platforms. In this paper, we show that combining redundant scheduling with deadline-based scheduling in these systems leads to a fundamental tradeoff between throughput and fairness. We propose a new scheduling algorithm called Limited Resource Earliest Deadline (LRED) that couples redundant scheduling with deadline-driven scheduling in a flexible way, using a simple tunable parameter to exploit this tradeoff.
Our evaluation of LRED using trace-driven and synthetic simulations shows that LRED provides a powerful mechanism to achieve the desired throughput or fairness under high loads and in low-timeliness environments, where these tradeoffs are most critical.

Item: Extracting the Textual and Temporal Structure of Supercomputing Logs (2009-06-01). Jain, Sourabh; Singh, Inderpreet; Chandra, Abhishek; Zhang, Zhi-Li; Bronevetsky, Greg.
Supercomputers are prone to frequent faults that adversely affect their performance, reliability, and functionality. System logs collected on these systems are a valuable source of information about their operational status and health. However, their massive size, complexity, and lack of a standard format make it difficult to automatically extract information that can be used to improve system management. In this work we propose a novel method to succinctly represent the contents of supercomputing logs, using textual clustering to automatically find the syntactic structures of log messages. This information is used to automatically classify messages into semantic groups via an online clustering algorithm. Further, we describe a methodology for using the temporal proximity between groups of log messages to identify correlated events in the system. We apply our proposed methods to two large, publicly available supercomputing logs and show that our technique achieves nearly perfect accuracy for online log classification and extracts meaningful structural and temporal message patterns that can be used to improve the accuracy of other log analysis techniques.

Item: Failure Classification and Inference in Large-Scale Systems: A Systematic Study of Failures in PlanetLab (2008-04-24). Jain, Sourabh; Prinja, Rohini; Chandra, Abhishek; Zhang, Zhi-Li.
Large-scale distributed systems are prone to frequent failures, which can be caused by a variety of factors related to network, hardware, and software problems.
Any downtime due to failures, whatever the cause, can lead to large disruptions and huge losses. Identifying the location and cause of a failure is critical for the reliability and availability of such systems. However, identifying the actual cause of failures in such systems is a challenging task due to their large scale and the variety of failure causes. In this work, we try to understand failures in a large-scale system through a two-step methodology: (i) classifying failures based on their statistical properties, and (ii) using additional monitoring data to explain these failures. We illustrate our methodology through a systematic study of failures in PlanetLab over a 3-month period. Our results show that most of the failures that required restarting a node were of small size and lasted for long durations. We also found that incorporating geographic information into our analysis enabled us to find site-wise correlated failures. We were further able to explain some failures by using error-message information collected by the monitoring nodes, and some short-lived failures by transient CPU overloads on machines.

Item: Heterogeneity-Aware Workload Distribution in Donation Based Grids (2005-12-28). Trivedi, Rahul; Chandra, Abhishek; Weissman, Jon.
This paper aims to explore the opportunities in porting a high-throughput Grid computing middleware to a high-performance, service-oriented environment. It exposes the limitations of the Grid computing middleware when operating in such a performance-sensitive environment and presents ways of overcoming these limitations. We focus on exploiting the heterogeneity of Grid resources to meet the performance requirements of services, and present several approaches to work distribution that deal with this heterogeneity. We present a heuristic for finding the optimal decomposition of work and present algorithms for each of the approaches, which we evaluate on a live testbed.
The results validate the heuristic and compare the performance of the different workload distribution strategies. Our results indicate that a significant improvement in performance can be achieved by making the Grid computing middleware aware of the heterogeneity in the underlying infrastructure. The results also provide some useful insights into choosing a work distribution policy depending on the status of the Grid computing environment.

Item: HiDRA: Statistical Multi-dimensional Resource Discovery for Large-scale Systems (2008-12-05). Cardosa, Michael; Chandra, Abhishek.
Resource discovery enables applications deployed in heterogeneous large-scale distributed systems to find resources that meet their execution requirements. In particular, most applications need resource requirements to be satisfied simultaneously for multiple resources (such as CPU, memory, and network bandwidth). Due to the inherent dynamism in many large-scale systems, caused by factors such as load variations, network congestion, and churn, providing statistical guarantees on such resource requirements is important to avoid application failures and overheads. However, existing resource discovery techniques either provide statistical guarantees only for individual resources, or take a static or memoryless approach to meeting resource requirements along multiple dimensions. In this paper, we present HiDRA, a resource discovery technique providing statistical guarantees for resource requirements spanning multiple dimensions simultaneously. Our technique takes advantage of the multivariate normal distribution for the probabilistic modeling of resource capacity over multiple dimensions. Through analysis of PlanetLab traces, we show that HiDRA performs nearly as well as a fully-informed algorithm, showing better precision and achieving recall within 3% of such an algorithm.
We have also deployed HiDRA on a 307-machine PlanetLab testbed, and our live experiments on this testbed demonstrate that HiDRA is a feasible, low-overhead approach to statistical resource discovery in a distributed system.

Item: Hierarchical Scheduling for Symmetric Multiprocessors (2006-05-22). Chandra, Abhishek; Shenoy, Prashant.
Hierarchical scheduling has been proposed as a technique to achieve aggregate resource partitioning among related groups of threads and applications in uniprocessor and packet scheduling environments. Existing hierarchical schedulers are not easily extensible to multiprocessor environments because (i) they do not incorporate the inherent parallelism of a multiprocessor system while partitioning resources, and (ii) they can result in unbounded unfairness or starvation if applied to a multiprocessor system in a naive manner. In this paper, we present a novel hierarchical scheduling algorithm designed specifically for multiprocessor environments that overcomes the limitations of existing algorithms in several ways. We present a generalized weight feasibility constraint that specifies the limit on the achievable CPU bandwidth partitioning in a multiprocessor hierarchical framework, and propose a hierarchical weight readjustment algorithm designed to transparently satisfy this feasibility constraint. We then present hierarchical multiprocessor scheduling (H-SMP): a hierarchical CPU scheduling algorithm designed for a symmetric multiprocessor (SMP) platform. The novelty of this algorithm lies in its combination of space- and time-multiplexing to achieve the desired bandwidth partitioning among the nodes of the hierarchical scheduling tree. The algorithm is also characterized by its ability to incorporate existing proportional-share algorithms as auxiliary schedulers to achieve efficient hierarchical CPU partitioning.
We evaluate the properties of this algorithm using hierarchical surplus fair scheduling (H-SFS): an instantiation of H-SMP that employs surplus fair scheduling (SFS) as an auxiliary algorithm. This evaluation is carried out through a simulation study, which shows that H-SFS provides better fairness properties in multiprocessor environments compared to existing algorithms and their naive extensions.

Item: Hosting Services on the Grid: Challenges and Opportunities (2005-07-06). Chandra, Abhishek; Trivedi, Rahul; Weissman, Jon.
In this paper, we present the challenges to service hosting on the Grid using a measurement study on a prototype Grid testbed. For this experimental study, we deployed the bioinformatics service BLAST from NCBI on PlanetLab, a set of widely distributed nodes, managed by the BOINC middleware. Our results indicate that the stateless nature of BOINC presents three major challenges to service hosting on the Grid: the need to deal with substantial computation and communication heterogeneity, the need to handle tight data and computation coupling, and the necessity of handling distinct response-sensitive service requests. We present experimental results that illuminate these challenges and discuss possible remedies to make Grids viable for hosting emerging services.

Item: MESH: A Flexible Distributed Hypergraph Processing System (2017-10-16). Heintz, Benjamin; Singh, Shivangi; Hong, Rankyung; Khandelwal, Guarav; Tesdahl, Corey; Chandra, Abhishek.
With the rapid growth of large online social networks, the ability to analyze large-scale social structure and behavior has become critically important, and this has led to the development of several scalable graph processing systems. In reality, however, social interaction takes place not just between pairs of individuals, as in the graph model, but rather in the context of multi-user groups.
Research has shown that such group dynamics can be better modeled through a more general hypergraph model, resulting in the need to build scalable hypergraph processing systems. In this paper, we present MESH, a flexible distributed framework for scalable hypergraph processing. MESH provides an easy-to-use and expressive application programming interface that naturally extends the "think like a vertex" model common to many popular graph processing systems. Our framework provides a flexible implementation based on an underlying graph processing system, and enables different design choices for the key implementation issues of hypergraph representation and partitioning. We implement MESH on top of the popular GraphX graph processing framework in Apache Spark. Using a variety of real datasets, we experimentally demonstrate that MESH provides flexibility based on data and application characteristics, examine its scaling with cluster size, and show that it is competitive in performance with HyperX, another hypergraph processing system based on Spark, demonstrating that simplicity and flexibility need not come at the cost of performance.

Item: MESH: A Flexible Distributed Hypergraph Processing System (2019-03-31). Heintz, Benjamin; Hong, Rankyung; Singh, Shivangi; Khandelwal, Guarav; Tesdahl, Corey; Chandra, Abhishek.
With the rapid growth of large online social networks, the ability to analyze large-scale social structure and behavior has become critically important, and this has led to the development of several scalable graph processing systems. In reality, however, social interaction takes place not only between pairs of individuals, as in the graph model, but rather in the context of multi-user groups. Research has shown that such group dynamics can be better modeled through a more general hypergraph model, resulting in the need to build scalable hypergraph processing systems. In this paper, we present MESH, a flexible distributed framework for scalable hypergraph processing.
MESH provides an easy-to-use and expressive application programming interface that naturally extends the "think like a vertex" model common to many popular graph processing systems. Our framework provides a flexible implementation based on an underlying graph processing system, and enables different design choices for the key implementation issues of hypergraph representation and partitioning. We implement MESH on top of the popular GraphX graph processing framework in Apache Spark. Using a variety of real datasets and experiments conducted on a local 8-node cluster as well as a 65-node Amazon AWS testbed, we demonstrate that MESH provides flexibility based on data and application characteristics, as well as scalability with cluster size. We further show that it is competitive in performance with HyperX, another hypergraph processing system based on Spark, while providing a much simpler implementation (requiring about 5X fewer lines of code), thus showing that simplicity and flexibility need not come at the cost of performance.

Item: MESH: A Flexible Distributed Hypergraph Processing System (2016-12-05). Heintz, Benjamin; Singh, Shivangi; Tesdahl, Corey; Chandra, Abhishek.
With the rapid growth of large online social networks, the ability to analyze large-scale social structure and behavior has become critically important, and this has led to the development of several scalable graph processing systems. In reality, however, social interaction takes place not just between pairs of individuals, as in the graph model, but rather in the context of multi-user groups. Research has shown that such group dynamics can be better modeled through a more general hypergraph model, resulting in the need to build scalable hypergraph processing systems. In this paper, we present MESH, a flexible distributed framework for scalable hypergraph processing.
MESH provides an easy-to-use and expressive application programming interface that naturally extends the "think like a vertex" model common to many popular graph processing systems. Our framework provides a flexible implementation that enables different design choices for the key implementation issues of hypergraph representation and partitioning. We implement MESH on top of the popular GraphX graph processing framework in Apache Spark. Using a variety of real datasets, we experimentally demonstrate that MESH provides flexibility based on data and application characteristics, and is competitive in performance with HyperX, another hypergraph processing system based on Spark, showing that simplicity and flexibility need not come at the cost of performance.

Item: Mobilizing the Cloud: Enabling Multi-User Mobile Outsourcing in the Cloud (2011-11-21). Mei, Chonglei; Taylor, Daniel; Wang, Chenyu; Chandra, Abhishek; Weissman, Jon.
Mobile devices, such as smartphones and tablets, are becoming the universal interface to online services and applications. However, such devices have limited computational power and battery life, which limits their ability to execute rich, resource-intensive applications. Mobile computation outsourcing to external resources has been proposed as a technique to alleviate this problem. Most existing work on mobile outsourcing has focused either on single-application optimization, or on outsourcing to fixed, local resources, under the assumption that wide-area latency is prohibitively high. In this paper, we present the design and implementation of an Android/Amazon EC2-based mobile application outsourcing platform, leveraging the cloud for scalability, elasticity, and multi-user code/data sharing. Using this platform, we empirically demonstrate that the cloud is not only feasible but desirable as an offloading platform for latency-tolerant applications despite wide-area latencies.
Our platform is designed to dynamically scale to support a large number of concurrent mobile users by utilizing the elastic provisioning capabilities of the cloud, as well as by allowing reuse of common code components across multiple users. Additionally, we have developed techniques for detecting data sharing across multiple applications, and propose novel scheduling algorithms that exploit such data sharing for better scalability and user performance. Experiments with our offloading platform show that our proposed techniques and algorithms substantially improve application performance, while achieving high efficiency in terms of resource and network usage.

Item: Nebula: Data Intensive Computing over Widely Distributed Voluntary Resources (2013-03-14). Ryden, Mathew; Chandra, Abhishek; Weissman, Jon.
Centralized cloud infrastructures have become the de facto platform for data-intensive computing today. However, they suffer from inefficient data mobility due to the centralization of cloud resources, and hence are highly unsuited for dispersed-data-intensive applications, where the data may be spread across multiple geographical locations. In this paper, we present Nebula: a dispersed cloud infrastructure that uses voluntary edge resources for both computation and data storage. We describe the lightweight Nebula architecture, which enables distributed data-intensive computing through a number of optimizations, including location-aware data and computation placement, replication, and recovery. We evaluate Nebula's performance on an emulated volunteer platform spanning over 50 PlanetLab nodes distributed across Europe, and show how a common data-intensive computing framework, MapReduce, can be easily deployed and run on Nebula. We show that Nebula MapReduce is robust to a wide array of failures and substantially outperforms other wide-area versions based on emulated BOINC.
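A recurring idea across several of the items above (Awan's locality-aware scheduling, Nebula's location-aware data and computation placement, and the server selection work) is choosing the compute node that minimizes the expected cost of moving input data. The following is a minimal illustrative sketch of that idea only; it is not code from any of the listed systems, and the function names, data layout, and bandwidth figures are hypothetical.

```python
# Illustrative sketch of locality-aware placement: pick the node that
# minimizes estimated data-transfer time for a task's inputs.
# All names and numbers here are hypothetical, not from Awan or Nebula.

def transfer_time(task, node, bandwidth):
    """Estimated seconds to move all of a task's input data to `node`.

    `task["inputs"]` is a list of (data_location, size_in_MB) pairs;
    `bandwidth[(loc, node)]` is the measured MB/s between loc and node.
    """
    return sum(size / bandwidth[(loc, node)] for loc, size in task["inputs"])

def place(task, nodes, bandwidth):
    """Greedy placement: choose the node with the cheapest data access."""
    return min(nodes, key=lambda n: transfer_time(task, n, bandwidth))

# Example: a task whose 100 MB input lives at site "eu-1". The EU node
# sees 50 MB/s to that site (2 s transfer); the US node sees 5 MB/s (20 s).
bandwidth = {("eu-1", "eu-node"): 50.0, ("eu-1", "us-node"): 5.0}
task = {"inputs": [("eu-1", 100.0)]}
print(place(task, ["eu-node", "us-node"], bandwidth))  # prints: eu-node
```

In a real wide-area system the bandwidth estimates would come from ongoing measurements (and, as the accessibility paper argues, can even be borrowed from a node's neighbors), and the scheduler would also weigh node compute capacity and load, but the core selection step reduces to a minimization like the one above.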