Optimizing Timeliness, Accuracy, and Cost in Geo-Distributed Data-Intensive Computing Systems
2016-12
Loading...
View/Download File
Persistent link to this item
Statistics
View StatisticsJournal Title
Journal ISSN
Volume Title
Title
Optimizing Timeliness, Accuracy, and Cost in Geo-Distributed Data-Intensive Computing Systems
Alternative title
Authors
Published Date
2016-12
Publisher
Type
Thesis or Dissertation
Abstract
Big Data touches every aspect of our lives, from the way we spend our free time to the way we make scientific discoveries. Netflix streamed more than 42 billion hours of video in 2015, and in the process recorded massive volumes of data to inform video recommendations and plan investments in new content. The CERN Large Hadron Collider produces enough data to fill more than one billion DVDs every week, and this data has led to the discovery of the Higgs boson particle. Such large scale computing is challenging because no one machine is capable of ingesting, storing, or processing all of the data. Instead, applications require distributed systems comprising many machines working in concert. Adding to the challenge, many data streams originate from geographically distributed sources. Scientific sensors such as LIGO span multiple sites and generate data too massive to process at any one location. The machines that analyze these data are also geo-distributed; for example Netflix and Facebook users span the globe, and so do the machines used to analyze their behavior. Many applications need to process geo-distributed data on geo-distributed systems with low latency. A key challenge in achieving this requirement is determining where to carry out the computation. For applications that process unbounded data streams, two performance metrics are critical: WAN traffic and staleness (i.e., delay in receiving results). To optimize these metrics, a system must determine when to communicate results from distributed resources to a central data warehouse. As an additional challenge, constrained WAN bandwidth often renders exact computation infeasible. Fortunately, many applications can tolerate inaccuracy, albeit with diverse preferences. To support diverse applications, systems must determine what partial results to communicate in order to achieve the desired staleness-error tradeoff. This thesis presents answers to these three questions--where to compute, when to communicate, and what partial results to communicate--in two contexts: batch computing, where the complete input data set is available prior to computation; and stream computing, where input data are continuously generated. We also explore the challenges facing emerging programming models and execution engines that unify stream and batch computing.
Description
University of Minnesota Ph.D. dissertation. December 2016. Major: Computer Science. Advisor: Abhishek Chandra. 1 computer file (PDF); xv, 194 pages.
Related to
Replaces
License
Collections
Series/Report Number
Funding information
Isbn identifier
Doi identifier
Previously Published Citation
Other identifiers
Suggested citation
Heintz, Benjamin. (2016). Optimizing Timeliness, Accuracy, and Cost in Geo-Distributed Data-Intensive Computing Systems. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/185171.
Content distributed via the University Digital Conservancy may be subject to additional license and use restrictions applied by the depositor. By using these files, users agree to the Terms of Use. Materials in the UDC may contain content that is disturbing and/or harmful. For more information, please see our statement on harmful content in digital repositories.