Kim, Jinoh
2010-03-17
2010-03-17
2010-02
https://hdl.handle.net/11299/59575
University of Minnesota Ph.D. dissertation. February 2010. Major: Computer Science. Advisors: Prof. Jon B. Weissman, Prof. Abhishek Chandra. 1 computer file (PDF); x, 129 pages. Ill. (some col.)
Large-scale distributed systems provide an attractive scalable infrastructure for network applications. However, the loosely-coupled nature of this environment can make data access unpredictable, and in the limit, unavailable. This thesis strives to provide predictability in data access for data-intensive computing in large-scale computational infrastructures. A key requirement for achieving predictability in data access is the ability to estimate network performance for data transfer so that computation tasks can take advantage of the estimation in their deployment or data source selection. This thesis develops a framework called OPEN (Overlay Passive Estimation of Network Performance) for scalable network performance estimation. OPEN provides an estimation of end-to-end accessibility for applications by utilizing past measurements without the use of explicit probing. Unlike existing passive approaches, OPEN is not restricted to pairwise or a single network in utilizing historical information; instead, it shares measurements between nodes without any restrictions. As a result, it achieves n^2 estimations by O(n) measurements. In addition, this thesis considers data dissemination in two specific environments. First, we consider a parallel data access environment in which multiple replicated servers can be utilized to download a single data file in parallel. To improve both performance and fault tolerance, we present a new parallel data retrieval algorithm and explore a broad set of resource selection heuristics. Second, we consider collective data access in applications for which group performance is more important than individual performance.
In this work, we employ communication makespan as a group performance metric and propose server selection heuristics to maximize collective performance.
en-US
Data dissemination
Data-intensive computing
Distributed computing
Distributed systems
High performance computing
Network performance
Computer Science
Data dissemination for distributed computing.
Thesis or Dissertation