Optimizing Aggregation and Join Queries in Geo-Distributed Data Analytics

Loading...
Thumbnail Image

Persistent link to this item

Statistics
View Statistics

Journal Title

Journal ISSN

Volume Title

Title

Optimizing Aggregation and Join Queries in Geo-Distributed Data Analytics

Published Date

2022-03

Publisher

Type

Thesis or Dissertation

Abstract

Large-scale data analytics services require collection and analysis of data from end-user applications and devices distributed around the globe. These services are increasingly deployed on a geographically distributed infrastructure comprising a multi-tier topology of edge servers and cloud data centers (DCs). Such geo-distributed analytics (GDA) involves data transfer over the wide area network (WAN) links connecting the various processing sites (edges and DCs). These WAN links are highly constrained and heterogeneous in nature, making the data transfer over the WAN slow and costly. Additionally, the edge nodes can also be constrained in terms of compute capacity. While the prior work on GDA has tried to address these challenges to some degree, this thesis identifies and solves a number of unidentified challenges associated with two fundamental operations in any GDA system: data aggregation and relational joins. Real-time aggregation and processing of geo-distributed data streams continuously over time often has two competing requirements: first, the results be available at the center within a certain acceptable delay bound and second, the WAN traffic needs to be minimized due to constrained and expensive WAN bandwidth. This delay-traffic tradeoff forms a fundamental component of streaming analytics. This thesis proposes a Time-To-Live (TTL-) based aggregation model which provides a theoretical basis for understanding the aforementioned delay-traffic tradeoff. The TTL-based aggregation model is then utilized to solve a variety of optimization problems such as jointly minimizing the delay and traffic costs, minimizing delay subject to a traffic bound and minimizing traffic subject to a delay bound in the context of hub-and-spoke like edge-cloud infrastructure where multiple edges are connected to a central cloud data center. Next, this thesis also proposes aggregation networks to efficiently perform continuous aggregation over a general multi-tiered distributed edge-cloud infrastructure. In doing so, it identifies a number of less studied tradeoffs such as tradeoff between traffic and traffic cost. The identified tradeoffs are then utilized to propose AggNet, a cost-aware system for minimizing traffic cost across aggregation networks while satisfying the resource constraints in the network as well as the delay sensitivity of the streaming aggregation queries. Computing joins in a geo-distributed setting remains a challenging problem, as joins often form the most heavyweight component in an analytics query, both in terms of compute and data shuffle over the WAN. This thesis first looks at queries comprising both join and aggregation operators. It proposes AggFirstJoin, an approach to minimize the cost of geo-distributed joins using a theoretically sound query transformation technique. The optimization approach takes a combined view of the join and aggregation operations which are often part of the same query, and pushes (a transformed) aggregation before join so as to produce the same results as the original query. The query transformation technique is further augmented with a WAN-aware task placement and a Bloom filtering approach to further reduce query execution time and WAN usage respectively. Next, this thesis studies queries with join operators on their own. Computing exact results for such queries is much more challenging since there are no aggregation operators which could have reduced the data shuffle over WAN. Hence, this thesis proposes a geo-distributed join sampling approach which can efficiently generate random samples from geo-distributed tables in order to finally produce a random sample of the joined result. All of the proposed techniques in this thesis are implemented on top of popular data analytics engines such as Apache Spark and Apache Flink. Evaluations are carried out using both real and synthetic traces on a real geo-distributed testbed on AWS as well as an emulated test-bed. The proposed techniques show remarkable improvements over the existing state-of-the-art.

Description

University of Minnesota Ph.D. dissertation. 2022. Major: Computer Science. Advisor: Abhishek Chandra. 1 computer file (PDF); 171 pages.

Related to

Replaces

License

Collections

Series/Report Number

Funding information

Isbn identifier

Doi identifier

Previously Published Citation

Suggested citation

Kumar, Dhruv. (2022). Optimizing Aggregation and Join Queries in Geo-Distributed Data Analytics. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/227914.

Content distributed via the University Digital Conservancy may be subject to additional license and use restrictions applied by the depositor. By using these files, users agree to the Terms of Use. Materials in the UDC may contain content that is disturbing and/or harmful. For more information, please see our statement on harmful content in digital repositories.