Browsing by Author "Cardosa, Michael"
Now showing 1 - 4 of 4
Item: Exploiting Spatio-Temporal Tradeoffs for Energy Efficient MapReduce in the Cloud (2010-04-07)
Cardosa, Michael; Singh, Aameek; Pucha, Himabindu; Chandra, Abhishek

MapReduce is a distributed computing paradigm widely used for building large-scale data processing applications such as content indexing, data mining, and log file analysis. When MapReduce is offered in the cloud, users can construct their own virtualized MapReduce clusters using virtual machines (VMs) managed by the cloud service provider. However, to maintain low costs for such cloud services, cloud operators must optimize the energy consumption of these applications. In this paper, we describe a unique spatio-temporal tradeoff for achieving energy efficiency for MapReduce jobs in such virtualized environments. The tradeoff includes efficient spatial fitting of VMs on servers to achieve high utilization of machine resources, as well as balanced temporal fitting of servers with VMs having similar runtimes, to ensure that a server runs at high utilization throughout its uptime. To study this tradeoff, we propose a set of metrics that quantify the different sources of resource wastage. We then propose VM placement algorithms that explicitly incorporate these spatio-temporal tradeoffs by combining a recipe placement algorithm for spatial fitting with a temporal binning algorithm for time balancing. We also propose an incremental time balancing (ITB) algorithm that improves energy efficiency even further by transparently increasing the cluster size for MapReduce jobs, while improving their performance at the same time. Our simulation results show that our spatio-temporal placement algorithms achieve energy savings of 20-35% over existing spatially-efficient placement techniques, and come within 12% of a baseline lower-bound algorithm. Further, the ITB algorithm achieves additional savings of up to 15% over the spatio-temporal algorithms by reducing job runtimes by 5-35%.
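The listing's abstracts include no code, but the temporal-binning idea in the item above is concrete enough to sketch. Below is a minimal, hypothetical Python illustration (all class and function names are mine, not the paper's): VMs are first grouped by similar estimated runtime, and each group is then packed onto servers by first-fit decreasing on CPU demand, so that co-located VMs start and finish at roughly the same time.

```python
from dataclasses import dataclass, field

@dataclass
class VM:
    cpu: float      # normalized CPU demand (0-1)
    runtime: float  # estimated job runtime in minutes

@dataclass
class Server:
    capacity: float = 1.0
    vms: list = field(default_factory=list)

    def fits(self, vm):
        return sum(v.cpu for v in self.vms) + vm.cpu <= self.capacity

def place(vms, bin_width=30.0):
    """Temporal binning + spatial first-fit-decreasing.

    VMs are grouped by similar runtime (temporal fit); each group is
    then packed onto servers by first-fit-decreasing on CPU demand
    (spatial fit), so a server's VMs finish close together and the
    server stays highly utilized for its whole uptime.
    """
    bins = {}
    for vm in vms:
        bins.setdefault(int(vm.runtime // bin_width), []).append(vm)

    servers = []
    for _, group in sorted(bins.items()):
        group.sort(key=lambda v: v.cpu, reverse=True)
        open_servers = []  # servers reserved for this runtime bin only
        for vm in group:
            target = next((s for s in open_servers if s.fits(vm)), None)
            if target is None:
                target = Server()
                open_servers.append(target)
            target.vms.append(vm)
        servers.extend(open_servers)
    return servers
```

A server hosting VMs from a single runtime bin avoids the temporal wastage the abstract describes, where one long-running VM keeps an otherwise idle server powered on.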
Item: HiDRA: Statistical Multi-dimensional Resource Discovery for Large-scale Systems (2008-12-05)
Cardosa, Michael; Chandra, Abhishek

Resource discovery enables applications deployed in heterogeneous large-scale distributed systems to find resources that meet their execution requirements. In particular, most applications need resource requirements to be satisfied simultaneously for multiple resources (such as CPU, memory, and network bandwidth). Due to the inherent dynamism in many large-scale systems, caused by factors such as load variations, network congestion, and churn, providing statistical guarantees on such resource requirements is important to avoid application failures and overheads. However, existing resource discovery techniques either provide statistical guarantees only for individual resources, or take a static or memoryless approach to meeting resource requirements along multiple dimensions. In this paper, we present HiDRA, a resource discovery technique that provides statistical guarantees for resource requirements spanning multiple dimensions simultaneously. Our technique takes advantage of the multivariate normal distribution for the probabilistic modeling of resource capacity over multiple dimensions. Through analysis of PlanetLab traces, we show that HiDRA performs nearly as well as a fully-informed algorithm, showing better precision and achieving recall within 3% of such an algorithm. We have also deployed HiDRA on a 307-machine PlanetLab testbed, and our live experiments on this testbed demonstrate that HiDRA is a feasible, low-overhead approach to statistical resource discovery in a distributed system. (An illustrative sketch of the multivariate-normal check appears after this listing.)

Item: Resource Bundles: Using Aggregation for Statistical Wide-Area Resource Discovery and Allocation (2007-11-20)
Cardosa, Michael; Chandra, Abhishek

Resource discovery is an important process for finding suitable nodes that satisfy application requirements in large, loosely-coupled distributed systems. Besides inter-node heterogeneity, many of these systems also show a high degree of intra-node dynamism, so selecting nodes based only on their recently observed resource capacities can lead to poor deployment decisions, resulting in application failures or migration overheads. However, most existing resource discovery mechanisms rely only on recent observations to achieve scalability in large systems. In this paper, we propose the notion of a resource bundle - a representative resource usage distribution for a group of nodes with similar resource usage patterns - that employs two complementary techniques to overcome the limitations of existing techniques: resource usage histograms to provide statistical guarantees for resource capacities, and clustering-based resource aggregation to achieve scalability. Using trace-driven simulations and data analysis of a month-long PlanetLab trace, we show that resource bundles are able to provide high accuracy for statistical resource discovery (up to 56% better precision than using only recent values), while achieving high scalability (up to 55% fewer messages than a non-aggregation algorithm). We also show that resource bundles are ideally suited for identifying group-level characteristics such as finding load hotspots and estimating total group capacity (within 8% of actual values). (A sketch of the histogram-plus-clustering idea appears after this listing.)

Item: STEAMEngine: Driving MapReduce Provisioning in the Cloud (2010-09-28)
Cardosa, Michael; Narang, Piyush; Chandra, Abhishek; Pucha, Himabindu; Singh, Aameek

MapReduce has gained popularity as a distributed data analysis paradigm, particularly in the cloud, where MapReduce jobs are run on virtual clusters. Provisioning MapReduce jobs in the cloud is an important problem for optimizing several user-side as well as provider-side metrics, such as runtime, cost, throughput, energy, and load. In this paper, we present a provisioning framework called STEAMEngine that consists of provisioning algorithms to optimize these metrics through a set of common building blocks. These building blocks enable spatio-temporal tradeoffs unique to MapReduce provisioning: along with a job's resource requirements (the spatial component), its runtime (the temporal component) is a critical element for any resource provisioning algorithm. We also describe two novel provisioning algorithms - a user-driven performance optimization and a provider-driven energy optimization - that leverage these building blocks. Our experimental results based on an Amazon EC2 cluster and a local 6-node Xen/Hadoop cluster show the benefits of STEAMEngine through improvements in performance and energy via the use of these algorithms and building blocks. (A sketch of a what-if provisioning building block appears after this listing.)
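HiDRA's central idea, per its abstract, is to model a node's available capacity across resource dimensions as a multivariate normal distribution and ask whether the node meets a multi-dimensional requirement with a given probability. Here is a minimal sketch of that check using SciPy; the function name and interface are assumptions for illustration, not HiDRA's actual API.

```python
import numpy as np
from scipy.stats import multivariate_normal

def meets_requirement(samples, requirement, confidence=0.95):
    """Statistical multi-dimensional admission check in the spirit of HiDRA.

    samples:     (n, d) history of a node's available capacity, one row
                 per observation, one column per resource (CPU, memory, ...).
    requirement: length-d vector of minimum capacities the application needs.

    Fits a multivariate normal to the samples and checks whether the node
    exceeds the requirement in every dimension simultaneously with the
    requested probability. P(X >= r in all dims) equals the CDF of -X
    evaluated at -r, since -X is normal with mean -mu and the same cov.
    """
    samples = np.asarray(samples, dtype=float)
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    p = multivariate_normal(mean=-mu, cov=cov).cdf(-np.asarray(requirement))
    return p >= confidence, p

# Example: 200 observations of (CPU, memory) headroom on one node.
rng = np.random.default_rng(0)
hist = rng.multivariate_normal([0.6, 0.7], [[0.01, 0.004], [0.004, 0.02]], 200)
ok, prob = meets_requirement(hist, requirement=[0.4, 0.5])
print(ok, round(prob, 3))
```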
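For Resource Bundles, the two complementary techniques are per-node resource usage histograms (for statistical guarantees) and clustering-based aggregation (for scalability). Below is a rough sketch under the assumptions that capacities are normalized to [0, 1] and that k-means stands in for the clustering step; the paper's actual clustering method and interfaces may differ, and all names here are hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def node_histogram(usage, bins=10):
    """Normalized histogram of one node's available-capacity samples in [0, 1]."""
    h, _ = np.histogram(usage, bins=bins, range=(0.0, 1.0))
    return h / h.sum()

def make_bundles(per_node_usage, k=4, bins=10):
    """Cluster nodes with similar usage histograms into 'bundles'.

    Each bundle is summarized by its centroid histogram, which stands in
    for all member nodes during discovery, reducing query traffic from
    per-node to per-bundle.
    """
    hists = np.array([node_histogram(u, bins) for u in per_node_usage])
    centroids, labels = kmeans2(hists, k, minit='++')
    return centroids, labels

def bundle_satisfies(centroid, min_capacity, confidence, bins=10):
    """Conservative check: P(available capacity >= min_capacity) >= confidence,
    counting only histogram bins that lie entirely above the threshold."""
    left_edges = np.linspace(0.0, 1.0, bins + 1)[:-1]
    return centroid[left_edges >= min_capacity].sum() >= confidence

# Example: 50 nodes, each with 500 capacity samples from skewed distributions.
rng = np.random.default_rng(1)
usage = [rng.beta(a, 5.0 - a, 500) for a in rng.uniform(1.0, 4.0, 50)]
centroids, labels = make_bundles(usage)
print(bundle_satisfies(centroids[0], min_capacity=0.3, confidence=0.9))
```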
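For STEAMEngine, the abstract describes common building blocks that provisioning algorithms compose, with job runtime as the key temporal input. The sketch below is a deliberately crude stand-in, not the paper's method: an Amdahl-style what-if runtime estimate plays the role of a job-profiling building block, and a provider-driven policy picks the lowest-energy cluster size that still meets a deadline.

```python
def estimate_runtime(base_runtime, base_nodes, nodes, serial_fraction=0.05):
    """What-if runtime estimate for resizing a MapReduce cluster.

    A crude Amdahl-style scaling model standing in for a job-profiling
    building block: profile a job once at base_nodes, then predict its
    runtime at other cluster sizes. serial_fraction is the assumed
    non-parallelizable share of the job.
    """
    parallel = base_runtime * (1.0 - serial_fraction) * base_nodes / nodes
    return base_runtime * serial_fraction + parallel

def pick_cluster_size(base_runtime, base_nodes, candidates, deadline,
                      power_per_node=1.0):
    """Provider-driven choice: the lowest-energy cluster size whose estimated
    runtime meets the deadline; energy is approximated as node-time."""
    feasible = []
    for n in candidates:
        t = estimate_runtime(base_runtime, base_nodes, n)
        if t <= deadline:
            feasible.append((t * n * power_per_node, n, t))
    return min(feasible) if feasible else None

# Example: a job profiled at 60 minutes on 4 nodes, with a 30-minute deadline.
print(pick_cluster_size(60.0, 4, candidates=[4, 8, 12, 16], deadline=30.0))
```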