Browsing by Author "Kim, Jinoh"
Item A Security-Enabled Grid System for Distributed Data Mining (2007-01-10) Kim, Seonho; Kim, Jinoh; Weissman, Jon
In this paper, we present a Grid system that enables distributed data mining, exploration and sharing to address issues described above which involve distributed data analysis and multiple ownerships. The system addresses the three main requirements of distributed data mining on Grid: 1) exploitation of geographically and organizationally distributed computing resources to solve data-intensive data mining problems, 2) ensuring the security and privacy of sensitive data, 3) supporting seamless data/computing resource sharing. We present the system architecture, specification of the component services, security considerations, a prototype of the system, and performance evaluation.

Item Accessibility-based Resource Selection in Loosely-coupled Distributed Systems (2007-11-20) Kim, Jinoh; Chandra, Abhishek; Weissman, Jon
Large-scale distributed systems provide an attractive scalable infrastructure for network applications. However, the loosely-coupled nature of this environment can make data access unpredictable, and in the limit, unavailable. Availability is normally characterized as a binary property, yes or no, often with an associated probability. However, availability conveys little in terms of expected data access performance. Using availability alone, jobs may suffer intolerable response time, or even fail to complete, due to poor data access. We introduce the notion of accessibility, a more general concept, to capture both availability and performance. An increasing number of data-intensive applications require not only considerations of node computation power but also accessibility for adequate job allocations. For instance, selecting a node with intolerably slow connections can offset any benefit to running on a fast node.
In this paper, we present accessibility-aware resource selection techniques by which it is possible to choose nodes that will have efficient data access to remote data sources. We show that the local data access observations collected from a node's neighbors are sufficient to characterize accessibility for that node. We then present resource selection heuristics guided by this principle, and show that they significantly outperform standard techniques. We also investigate the impact of churn in which nodes change their status of participation such that they lose their memory of prior observations. Despite this level of unreliability, we show that the suggested techniques yield good results.

Item Data dissemination for distributed computing. (2010-02) Kim, Jinoh
Large-scale distributed systems provide an attractive scalable infrastructure for network applications. However, the loosely-coupled nature of this environment can make data access unpredictable, and in the limit, unavailable. This thesis strives to provide predictability in data access for data-intensive computing in large-scale computational infrastructures. A key requirement for achieving predictability in data access is the ability to estimate network performance for data transfer so that computation tasks can take advantage of the estimation in their deployment or data source selection. This thesis develops a framework called OPEN (Overlay Passive Estimation of Network Performance) for scalable network performance estimation. OPEN provides an estimation of end-to-end accessibility for applications by utilizing past measurements without the use of explicit probing. Unlike existing passive approaches, OPEN is not restricted to pairwise or a single network in utilizing historical information; instead, it shares measurements between nodes without any restrictions. As a result, it achieves n² estimations by O(n) measurements. In addition, this thesis considers data dissemination in two specific environments.
First, we consider a parallel data access environment in which multiple replicated servers can be utilized to download a single data file in parallel. To improve both performance and fault tolerance, we present a new parallel data retrieval algorithm and explore a broad set of resource selection heuristics. Second, we consider collective data access in applications for which group performance is more important than individual performance. In this work, we employ communication makespan as a group performance metric and propose server selection heuristics to maximize collective performance.

Item Exploiting Heterogeneity for Collective Data Downloading in Volunteer-based Networks (2006-11-29) Kim, Jinoh; Chandra, Abhishek; Weissman, Jon
Scientific computing is being increasingly deployed over volunteer-based distributed computing environments consisting of idle resources on donated user machines. A fundamental challenge in these environments is the dissemination of data to the computation nodes, with the successful completion of jobs being driven by the efficiency of collective data download across compute nodes, and not only the individual download times. This paper considers the use of a data network consisting of data distributed across a set of data servers, and focuses on the server selection problem: how do individual nodes select a server for downloading data to minimize the communication makespan - the maximal download time for a data file. Through experiments conducted on a Pastry network running on PlanetLab, we demonstrate that nodes in a volunteer-based network are heterogeneous in terms of several metrics, such as bandwidth, load, and capacity, which impact their download behavior. We propose new server selection heuristics that incorporate these metrics, and demonstrate that these heuristics outperform traditional proximity-based server selection, reducing average makespans by at least 30%.
We further show that incorporating information about download concurrency avoids overloading servers, and improves performance by about 17-43% over heuristics considering only proximity and bandwidth.

Item OPEN: Passive Network Performance Estimation for Data-intensive Applications (2008-11-24) Kim, Jinoh; Chandra, Abhishek; Weissman, Jon
Distributed computing applications are increasingly utilizing distributed data sources. However, the unpredictable cost of data access in large-scale computing infrastructures can lead to severe performance bottlenecks. Providing predictability in data access is thus essential to accommodate the large set of newly emerging large-scale, data-intensive computing applications. In this regard, accurate estimation of network performance is crucial to meeting the performance goals of such applications. Passive estimation based on past measurements is attractive for its relatively small overhead compared to relying on explicit probing. In this paper, we take a passive approach for network performance estimation. Our approach is different from existing passive techniques that rely either on past direct measurements of pairs of nodes or on topological similarities. Instead, we exploit secondhand measurements collected by other nodes without any topological restrictions. OPEN (Overlay Passive Estimation of Network performance) is a scalable framework providing end-to-end network performance estimation based on secondhand measurements. Using actual downloading traces collected for 10 months in PlanetLab, we show that OPEN provides low-overhead, accurate estimation for replica and resource selection problems common to distributed computing. Results from our simulation study show that OPEN significantly outperforms selection techniques based on statistical pairwise estimations as well as random and latency-based selections in diverse experimental settings.
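
Illustrative note on the communication makespan metric mentioned in the collective data downloading entries above: the abstracts define makespan as the maximal download time for a data file across nodes, and contrast proximity-based server selection with heuristics that also account for bandwidth and download concurrency. The Python sketch below is only a toy model of that contrast, not the authors' heuristics. The server names, bandwidths, latencies, file size, and the assumption that a server's bandwidth is split equally among its concurrent downloads are all hypothetical, and latency is ignored when computing download time.

```python
from collections import Counter

# Hypothetical inputs (not from the papers): file size in MB,
# per-server bandwidth in MB/s, and per-node latency to each server in ms.
FILE_SIZE_MB = 100.0
BANDWIDTH = {"s1": 10.0, "s2": 8.0}
LATENCY = {
    "n1": {"s1": 10, "s2": 50},
    "n2": {"s1": 20, "s2": 40},
    "n3": {"s1": 15, "s2": 30},
    "n4": {"s1": 25, "s2": 35},
}


def download_times(assignment):
    """Per-node download time in seconds.

    Assumption: a server's bandwidth is shared equally among all nodes
    downloading from it at the same time.
    """
    load = Counter(assignment.values())
    return {
        node: FILE_SIZE_MB / (BANDWIDTH[server] / load[server])
        for node, server in assignment.items()
    }


def makespan(assignment):
    """Communication makespan: the maximal download time over all nodes."""
    return max(download_times(assignment).values())


def select_by_proximity():
    """Each node picks its lowest-latency server, ignoring server load."""
    return {node: min(lat, key=lat.get) for node, lat in LATENCY.items()}


def select_concurrency_aware():
    """Greedy toy heuristic: each node in turn picks the server that would
    give it the largest bandwidth share given the downloads assigned so far."""
    assignment = {}
    load = Counter()
    for node in LATENCY:
        best = max(BANDWIDTH, key=lambda s: BANDWIDTH[s] / (load[s] + 1))
        assignment[node] = best
        load[best] += 1
    return assignment


if __name__ == "__main__":
    print("proximity makespan (s):", makespan(select_by_proximity()))
    print("concurrency-aware makespan (s):", makespan(select_concurrency_aware()))
```

In this toy setup every node is closest to the same server, so proximity-based selection concentrates all downloads on it, while the concurrency-aware variant spreads them across both servers and lowers the makespan. The numbers say nothing about the gains reported in the abstracts.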