Optimizing Timeliness, Accuracy, and Cost in Geo-Distributed Data-Intensive Computing Systems

Published Date

2016-12

Type

Thesis or Dissertation

Abstract

Big Data touches every aspect of our lives, from the way we spend our free time to the way we make scientific discoveries. Netflix streamed more than 42 billion hours of video in 2015, and in the process recorded massive volumes of data to inform video recommendations and plan investments in new content. The CERN Large Hadron Collider produces enough data to fill more than one billion DVDs every week, and this data has led to the discovery of the Higgs boson particle. Such large-scale computing is challenging because no one machine is capable of ingesting, storing, or processing all of the data. Instead, applications require distributed systems comprising many machines working in concert. Adding to the challenge, many data streams originate from geographically distributed sources. Scientific sensors such as LIGO span multiple sites and generate data too massive to process at any one location. The machines that analyze these data are also geo-distributed; for example, Netflix and Facebook users span the globe, and so do the machines used to analyze their behavior. Many applications need to process geo-distributed data on geo-distributed systems with low latency. A key challenge in achieving this requirement is determining where to carry out the computation. For applications that process unbounded data streams, two performance metrics are critical: WAN traffic and staleness (i.e., delay in receiving results). To optimize these metrics, a system must determine when to communicate results from distributed resources to a central data warehouse. As an additional challenge, constrained WAN bandwidth often renders exact computation infeasible. Fortunately, many applications can tolerate inaccuracy, albeit with diverse preferences. To support diverse applications, systems must determine what partial results to communicate in order to achieve the desired staleness-error tradeoff.
This thesis presents answers to these three questions--where to compute, when to communicate, and what partial results to communicate--in two contexts: batch computing, where the complete input data set is available prior to computation; and stream computing, where input data are continuously generated. We also explore the challenges facing emerging programming models and execution engines that unify stream and batch computing.

Description

University of Minnesota Ph.D. dissertation. December 2016. Major: Computer Science. Advisor: Abhishek Chandra. 1 computer file (PDF); xv, 194 pages.

Suggested citation

Heintz, Benjamin. (2016). Optimizing Timeliness, Accuracy, and Cost in Geo-Distributed Data-Intensive Computing Systems. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/185171.
