Optimizing Aggregation and Join Queries in Geo-Distributed Data Analytics

Kumar, Dhruv2022-06-082022-06-082022-03https://hdl.handle.net/11299/227914University of Minnesota Ph.D. dissertation. 2022. Major: Computer Science. Advisor: Abhishek Chandra. 1 computer file (PDF); 171 pages.Large-scale data analytics services require collection and analysis of data from end-user applications and devices distributed around the globe. These services are increasingly deployed on a geographically distributed infrastructure comprising a multi-tier topology of edge servers and cloud data centers (DCs). Such geo-distributed analytics (GDA) involves data transfer over the wide area network (WAN) links connecting the various processing sites (edges and DCs). These WAN links are highly constrained and heterogeneous in nature, making the data transfer over the WAN slow and costly. Additionally, the edge nodes can also be constrained in terms of compute capacity. While the prior work on GDA has tried to address these challenges to some degree, this thesis identifies and solves a number of unidentified challenges associated with two fundamental operations in any GDA system: data aggregation and relational joins. Real-time aggregation and processing of geo-distributed data streams continuously over time often has two competing requirements: first, the results be available at the center within a certain acceptable delay bound and second, the WAN traffic needs to be minimized due to constrained and expensive WAN bandwidth. This delay-traffic tradeoff forms a fundamental component of streaming analytics. This thesis proposes a Time-To-Live (TTL-) based aggregation model which provides a theoretical basis for understanding the aforementioned delay-traffic tradeoff. The TTL-based aggregation model is then utilized to solve a variety of optimization problems such as jointly minimizing the delay and traffic costs, minimizing delay subject to a traffic bound and minimizing traffic subject to a delay bound in the context of hub-and-spoke like edge-cloud infrastructure where multiple edges are connected to a central cloud data center. Next, this thesis also proposes aggregation networks to efficiently perform continuous aggregation over a general multi-tiered distributed edge-cloud infrastructure. In doing so, it identifies a number of less studied tradeoffs such as tradeoff between traffic and traffic cost. The identified tradeoffs are then utilized to propose AggNet, a cost-aware system for minimizing traffic cost across aggregation networks while satisfying the resource constraints in the network as well as the delay sensitivity of the streaming aggregation queries. Computing joins in a geo-distributed setting remains a challenging problem, as joins often form the most heavyweight component in an analytics query, both in terms of compute and data shuffle over the WAN. This thesis first looks at queries comprising both join and aggregation operators. It proposes AggFirstJoin, an approach to minimize the cost of geo-distributed joins using a theoretically sound query transformation technique. The optimization approach takes a combined view of the join and aggregation operations which are often part of the same query, and pushes (a transformed) aggregation before join so as to produce the same results as the original query. The query transformation technique is further augmented with a WAN-aware task placement and a Bloom filtering approach to further reduce query execution time and WAN usage respectively. Next, this thesis studies queries with join operators on their own. Computing exact results for such queries is much more challenging since there are no aggregation operators which could have reduced the data shuffle over WAN. Hence, this thesis proposes a geo-distributed join sampling approach which can efficiently generate random samples from geo-distributed tables in order to finally produce a random sample of the joined result. All of the proposed techniques in this thesis are implemented on top of popular data analytics engines such as Apache Spark and Apache Flink. Evaluations are carried out using both real and synthetic traces on a real geo-distributed testbed on AWS as well as an emulated test-bed. The proposed techniques show remarkable improvements over the existing state-of-the-art.enAggregationCloudEdgeGeo-distributed systemsRelational joinsStream ProcessingOptimizing Aggregation and Join Queries in Geo-Distributed Data AnalyticsThesis or Dissertation