Author: Cao, Xiang
Issued: 2017-05
Available: 2017-07-18
URI: https://hdl.handle.net/11299/188863
Description: University of Minnesota Ph.D. dissertation. May 2017. Major: Computer Science. Advisor: David Du. 1 computer file (PDF); x, 95 pages.

Abstract: In today's Big Data applications, huge amounts of data are being generated. With this rapid growth, efficient approaches to managing and processing data become essential. In this thesis, data management and processing are investigated for Big Data applications.

Key-value stores (KVS) are widely used in many Big Data applications because of their flexible and efficient performance. Recently, Seagate developed a new Ethernet-accessed disk drive for key-value pairs called the "Kinetic Drive". It can reduce management complexity, especially in large-scale deployments, but the key-value pairs must still be stored on Kinetic Drives in an organized way. In this thesis, we present data allocation schemes for a large-scale key-value store system built on Kinetic Drives. We investigate key indexing schemes, allocate data to drives accordingly, and propose efficient approaches to migrate data among drives. It is also necessary to manage huge numbers of key-value pairs while supporting attribute search for users. In this thesis, we design a large-scale searchable key-value store system based on Kinetic Drives. We investigate an indexing scheme that maps data to the drives and propose a key generation approach that reflects the metadata of the actual data and supports users' attribute search requests.

MapReduce has become a very popular framework for processing data in many applications. Data shuffling usually accounts for a large portion of the total running time of MapReduce jobs. In recent years, scale-up computing architectures for MapReduce jobs have been developed. With a multi-processor, multi-core design connected via NUMAlink and large shared memory, the NUMA architecture provides powerful scale-up computing capability. In this thesis, we focus on optimizing the data shuffling phase of the MapReduce framework on a NUMA machine. We exploit the varying bandwidth capacities of the NUMAlink connections among different memory locations to fully utilize the network. We investigate the NUMAlink topology and propose a topology-aware reducer placement algorithm to speed up the data shuffling phase, and we extend the approach to a larger computing environment with multiple NUMA machines.

Language: en
Keywords: Big Data; Data Management; Data Processing
Title: Efficient Data Management and Processing in Big Data Applications
Type: Thesis or Dissertation
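
The abstract mentions mapping key-value pairs to Kinetic Drives through a key indexing scheme, but the record gives no algorithmic detail. Purely as an illustration, the sketch below shows one generic way such a mapping could work, a key-range table looked up by binary search; the drive addresses, boundary keys, and function name are hypothetical and are not taken from the thesis.

```python
import bisect

# Hypothetical illustration only: route each key to a Kinetic Drive by key range.
# The boundary keys and drive addresses below are made up for the example.
RANGE_BOUNDARIES = ["g", "n", "t"]                          # keys separating the ranges
DRIVES = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]   # one drive per key range

def drive_for_key(key: str) -> str:
    """Return the (hypothetical) drive responsible for `key` via binary search."""
    return DRIVES[bisect.bisect_right(RANGE_BOUNDARIES, key)]

if __name__ == "__main__":
    for key in ["apple", "kinetic", "seagate", "zebra"]:
        print(f"{key} -> {drive_for_key(key)}")
```

Running the script prints the drive each sample key would be routed to; in a real deployment such a table would be maintained by the allocation and migration schemes the thesis describes.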