Browsing by Subject "Data Management"
Now showing 1 - 5 of 5
Item: Data Management Needs Assessment - Surveys in CLA, AHC, CSE, and CFANS (2015)
Hofelich Mohr, Alicia; Bishoff, Josh; Johnston, Lisa R; Braun, Steven; Storino, Christine; Bishoff, Carolyn
Researchers' data management needs were assessed at four colleges within the University of Minnesota: the College of Liberal Arts (CLA), the Academic Health Center (AHC), the College of Science and Engineering (CSE), and the College of Food, Agriculture, and Natural Resource Sciences (CFANS). The initial survey was designed in CLA and featured a branched design that presented researchers with one of two versions of the questions based on how respondents described the products of their scholarship: as "data" or as "research materials". The survey was then customized for the other colleges, adding or editing questions based on feedback from disciplinary experts while maintaining comparability across surveys. The surveys were run between September 2013 and February 2015.

Item: Efficient Data Management and Processing in Big Data Applications (2017-05)
Cao, Xiang
In today's Big Data applications, huge amounts of data are being generated. With this rapid growth in data volume, data management and processing become essential, and it is important to design efficient approaches for both. In this thesis, data management and processing are investigated for Big Data applications. Key-value stores (KVS) are widely used in many Big Data applications because they provide flexible and efficient performance. Recently, a new Ethernet-accessed disk drive for key-value pairs, called the "Kinetic Drive", was developed by Seagate. It can reduce management complexity, especially in large-scale deployments. It is important to manage key-value pairs and store them on Kinetic Drives in an organized way. In this thesis, we present data allocation schemes for a large-scale key-value store system built on Kinetic Drives: we investigate key indexing schemes, allocate data to drives accordingly, and propose efficient approaches to migrate data among drives. It is also necessary to manage huge numbers of key-value pairs in a way that supports attribute searches by users, so we design a large-scale searchable key-value store system based on Kinetic Drives, investigate an indexing scheme that maps data to the drives, and propose a key generation approach that reflects the metadata of the actual data and supports users' attribute search requests. Nowadays, MapReduce has become a very popular framework for processing data in many applications, and data shuffling usually accounts for a large portion of the total running time of MapReduce jobs. In recent years, scale-up computing architectures for MapReduce jobs have been developed: with multi-processor, multi-core designs connected via NUMAlink and large shared memories, NUMA architectures provide a powerful scale-up computing capability. In this thesis, we focus on optimizing the data shuffling phase of the MapReduce framework on NUMA machines. We exploit the varying bandwidth capacities of the NUMAlink(s) among different memory locations to fully utilize the network, investigate the NUMAlink topology, and propose a topology-aware reducer placement algorithm to speed up the data shuffling phase. We extend our approach to a larger computing environment with multiple NUMA machines.
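A minimal sketch of the key-indexing idea (illustrative Python only, not the allocation scheme from the thesis; the class name, method, and drive addresses are hypothetical): hash each key to select the drive that owns it, so a request can be routed to a drive without a central lookup table.

    import hashlib

    class KineticDriveAllocator:
        """Toy key-to-drive allocator; a hypothetical illustration of a key
        indexing scheme, not the allocation scheme proposed in the thesis."""

        def __init__(self, drive_addresses):
            # Each entry is the network address of one Ethernet-accessed drive.
            self.drives = list(drive_addresses)

        def drive_for_key(self, key: str) -> str:
            # Hash the key and map it to one drive; a real scheme could use
            # range partitioning or consistent hashing so that adding or
            # removing drives migrates only a bounded share of the keys.
            digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
            return self.drives[int(digest, 16) % len(self.drives)]

    # Usage: route a put/get request to the drive that owns the key.
    allocator = KineticDriveAllocator(["10.0.0.11:8123", "10.0.0.12:8123"])
    print(allocator.drive_for_key("sensor/2017/05/temperature"))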
Item: Improving Data Management and Data Movement Efficiency in Hybrid Storage Systems (2017-07)
Ge, Xiongzi
In the big data era, the large volumes of data being continuously generated drive the emergence of high-performance, large-capacity storage systems. To reduce the total cost of ownership, storage systems are built in a more composite way with many different types of emerging storage technologies and devices, including Storage Class Memory (SCM), Solid State Drives (SSD), Shingled Magnetic Recording (SMR), Hard Disk Drives (HDD), and even off-premise cloud storage. To make better use of each type of storage, industry has provided multi-tier storage that dynamically places hot data in the faster tiers and cold data in the slower tiers. Data movement happens both among devices within a single system and between devices connected via various networks. Toward improving data management and data movement efficiency in such hybrid storage systems, this work makes the following contributions. To bridge the giant semantic gap between applications and modern storage systems, passing a small piece of useful information (I/O access hints) from the upper layers to the block storage layer can greatly improve application performance or ease data management in heterogeneous storage systems. We present and develop a generic and flexible framework, called HintStor, to execute and evaluate various I/O access hints on heterogeneous storage systems with only minor modifications to the kernel and applications. The design of HintStor contains a new application/user-level interface, a file system plugin, and a block storage data manager. With HintStor, storage systems composed of various storage devices can perform pre-devised data placement, space reallocation, and data migration policies assisted by the added access hints. Each storage device/technology has its own price-performance tradeoffs and idiosyncrasies with respect to the workload characteristics it best supports. To explore internal access patterns and thus efficiently place data on storage systems with fully connected differential pools (each pool consists of storage devices of a particular type, and data can move from any device to any other device rather than tier by tier), we propose a chunk-level, storage-aware workload analyzer framework, abbreviated ChewAnalyzer. With ChewAnalyzer, the storage manager can adequately distribute and move data chunks across the different storage pools. To reduce the duplicate content transferred between local storage devices and devices in remote data centers, an inline Network Redundancy Elimination (NRE) process with a Content-Defined Chunking (CDC) policy can achieve a higher Redundancy Elimination (RE) ratio but may incur considerably higher computational cost than fixed-size chunking. We build an inline NRE appliance that incorporates an improved FPGA-based scheme to speed up CDC processing. To efficiently utilize the hardware resources, the whole NRE process is handled by a Virtualized NRE (VNRE) controller. The uniqueness of the VNRE we developed lies in its ability to exploit the redundancy patterns of different TCP flows and customize the chunking process to achieve a higher RE ratio.
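A minimal sketch of Content-Defined Chunking (illustrative Python only, assuming a simple byte-sum rolling hash and assumed size thresholds; the thesis's FPGA-accelerated design is not described at this level of detail in the abstract): a chunk boundary is declared wherever a rolling hash over the most recent bytes matches a fixed pattern, so an insertion early in a stream does not shift every later boundary the way fixed-size chunking would.

    def content_defined_chunks(data: bytes, window=48, mask=0x0FFF,
                               min_size=2048, max_size=65536):
        # Toy content-defined chunking: cut a chunk when a rolling hash over
        # the last `window` bytes matches the boundary pattern, subject to
        # minimum and maximum chunk sizes.
        chunks, start, rolling = [], 0, 0
        for i, byte in enumerate(data):
            rolling += byte
            if i - start >= window:
                rolling -= data[i - window]   # keep the sum over the window only
            size = i - start + 1
            boundary = size >= min_size and (rolling & mask) == mask
            if boundary or size >= max_size:
                chunks.append(data[start:i + 1])
                start, rolling = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])       # trailing partial chunk
        return chunks

    # Redundancy elimination would then fingerprint each chunk (e.g. SHA-256)
    # and skip transferring chunks the remote side already stores.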
Item: Reimagining the institutional repository as an open data archive (7th International Digital Curation Conference, 2011-12)
Johnston, Lisa R
Institutional repositories (IRs) have sprung up in academic institutions over the last decade to provide archival storage and dissemination services for locally authored digital scholarship, primarily in the form of the traditional peer-reviewed article. However, the implementation of IRs has not rapidly changed the landscape of scholarly communication as expected (1), and, without institutional deposit mandates, many remain underused for their primary purpose. Today, a shift is occurring in academia that has signaled an increased need for the stewardship of digital research data, for example, the expectation by federal funding agencies that researchers share their data and plan for preservation and long-term access. The IR gives academic libraries a ready opportunity to assist researchers with digital data preservation using their established repository services, particularly where national and disciplinary data centers are not available. At the University of Minnesota, our IR is undergoing a replatforming shift from DSpace to Fedora software. This poster describes the policy decisions, user-needs assessments, and technical infrastructure plans for reimagining the IR to meet data archiving needs across campus.

Item: Scalable Spatial Predictive Query Processing for Moving Objects (2015-08)
Hendawi, Abdeltawab
A fundamental category of location-based services relies on predictive queries, which consider the anticipated future locations of users. Predictive queries have attracted researchers' attention because they are widely used in several applications, including traffic management, routing, location-based advertising, and ride sharing. This thesis presents a generic and scalable system for predictive query processing over moving objects, e.g., vehicles. The proposed system provides two frameworks for two different environments: (1) the Panda framework for Euclidean space, and (2) the iRoad framework for road networks. Inside the iRoad framework, a novel data structure named the Predictive Tree (P-Tree) is proposed to index the anticipated future locations of objects on road networks. Unlike previous work on predictive queries, the proposed system aims to: (a) support long-term as well as short-term prediction, (b) scale up to large numbers of moving objects, and (c) efficiently support different types of predictive queries, e.g., predictive range, KNN, and aggregate queries.
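A minimal sketch of what a predictive range query answers (illustrative Python using simple linear extrapolation in Euclidean space; this is not the Panda or iRoad indexing approach, and all names below are hypothetical): report the objects whose predicted position after a given time horizon falls inside a query rectangle.

    from dataclasses import dataclass

    @dataclass
    class MovingObject:
        oid: int
        x: float    # current position
        y: float
        vx: float   # current velocity (units per second)
        vy: float

    def predictive_range_query(objects, rect, t_seconds):
        # Toy predictive range query: extrapolate each object's position
        # t_seconds into the future and test it against the rectangle
        # rect = (x_min, y_min, x_max, y_max). A scalable system would use a
        # predictive index instead of scanning every object.
        x_min, y_min, x_max, y_max = rect
        hits = []
        for obj in objects:
            fx = obj.x + obj.vx * t_seconds   # predicted future x
            fy = obj.y + obj.vy * t_seconds   # predicted future y
            if x_min <= fx <= x_max and y_min <= fy <= y_max:
                hits.append(obj.oid)
        return hits

    # Usage: which vehicles are predicted to be inside the region in 60 seconds?
    fleet = [MovingObject(1, 0.0, 0.0, 1.0, 0.5),
             MovingObject(2, 50.0, 50.0, -2.0, 0.0)]
    print(predictive_range_query(fleet, (40.0, 20.0, 80.0, 60.0), 60.0))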