Browsing by Subject "big data"
Now showing 1 - 6 of 6
Item: A Framework for Nonconvex Robust Subspace Recovery (2018-07). Maunu, Tyler.
This thesis consists of three works from my Ph.D. research. The first part is an overview of the problem of Robust Subspace Recovery (RSR). The next parts present two algorithms for solving this problem along with supporting mathematical analysis. Robust subspace recovery involves finding an underlying low-dimensional subspace in a dataset that is possibly corrupted with outliers. While this problem is easy to state, it has been difficult to develop optimal algorithms due to its underlying nonconvexity. We give a comprehensive review of the algorithms developed for RSR in the first chapter of this thesis. After this, we discuss our proposed solutions to this problem. The first proposed algorithm, which we refer to as Fast Median Subspace (FMS), is designed to robustly determine the underlying subspace of such datasets while having lower computational complexity than existing accurate methods. We prove convergence of the FMS iterates to a stationary point. Further, under two special models of data, FMS converges to a point that is near the global minimum with overwhelming probability. Under these models, we show that the iteration complexity is globally sublinear and locally r-linear. For one of the models, these results hold for any fixed fraction of outliers (less than 1). Numerical experiments on synthetic and real data demonstrate its competitive speed and accuracy. Our second proposed algorithm involves geodesic gradient descent on the Grassmannian manifold. In the accompanying mathematical analysis, we prove that an underlying subspace is the only stationary point and local minimizer in a specified neighborhood if a deterministic condition holds for a dataset. We further show that if the deterministic condition is satisfied, the geodesic gradient descent method over the Grassmannian manifold can exactly recover the underlying subspace with proper initialization. Proper initialization by principal component analysis is guaranteed under a similar stability condition. Under slightly stronger assumptions, the gradient descent method with an adaptive step size scheme achieves linear convergence. The practicality of the deterministic condition is demonstrated on some statistical models of data, and the method achieves almost state-of-the-art recovery guarantees on the Haystack Model. We show that our gradient method can exactly recover the underlying subspace for any fixed fraction of outliers (less than 1) provided that the sample size is large enough.
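The FMS iteration summarized above can be read as an iteratively reweighted PCA loop that downweights points far from the current subspace estimate. The following Python sketch is an editor's illustration under that reading, not the thesis implementation; the function name, the regularization constant eps, and the fixed iteration count are assumptions.

```python
import numpy as np

def fms_sketch(X, d, n_iter=50, eps=1e-10):
    """Illustrative FMS-style iteration: estimate a d-dimensional subspace of
    the rows of X by iteratively reweighted PCA, downweighting points that lie
    far from the current subspace (a robust, median-like objective)."""
    # Initialize with ordinary PCA (top-d right singular vectors).
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:d].T                              # D x d orthonormal basis
    for _ in range(n_iter):
        resid = X - (X @ V) @ V.T             # residuals to the current subspace
        dist = np.linalg.norm(resid, axis=1)  # distance of each point
        w = 1.0 / np.maximum(dist, eps)       # far points get small weights
        _, _, Vt = np.linalg.svd(X * np.sqrt(w)[:, None], full_matrices=False)
        V = Vt[:d].T                          # PCA of the reweighted data
    return V
```

On synthetic data drawn from a low-dimensional subspace plus outliers, a loop of this form typically moves the estimate toward the inlier subspace even when plain PCA is pulled away by the outliers, which is the behavior the thesis analyzes rigorously.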
Item: High-Performance and Cost-Effective Storage Systems for Supporting Big Data Applications (2020-07). Cao, Zhichao.
Coming into the 21st century, innovation worldwide is shifting from being IT-driven (information technology-driven) to DT-driven (data technology-driven). With the fast development of social media, e-business, the Internet of Things (IoT), and millions of applications, an extremely large amount of data is created every day. We are in a big data era. Storage systems are the keystone to achieving data persistence, reliability, availability, and flexibility in this era. Because of the diversity of applications that generate and use extremely large amounts of data, the design space of storage systems for big data remains wide open. Fast but expensive storage devices can deliver high performance, but storing an extremely large volume of data on fast devices alone incurs very high costs. Therefore, designing and developing high-performance and cost-effective storage systems for big data applications has a significant social and economic impact. In this thesis, we mainly focus on improving the performance and cost-effectiveness of different storage systems for big data applications.

First, file systems are widely used to store the data generated by big data applications. As data scale and performance requirements increase, designing and implementing a file system with good performance and high cost-effectiveness is urgent but challenging. We propose and develop a tier-aware file system with data deduplication to satisfy these requirements and address the challenges. A fast but expensive storage tier and a slow tier with much larger capacity are managed by a single file system. Hot files are migrated to the fast tier to ensure high performance, and cold files are migrated to the slow tier to lower storage costs. Moreover, data deduplication is applied to eliminate content-level redundancy in both tiers, saving storage space and reducing data migration overhead.

Second, because of the extremely large scale of storage systems that support big data applications, hardware failures, system errors, software failures, and even natural disasters happen more frequently and can cause serious social and economic damage, so improving the performance of recovering data from backups is important for limiting losses. Two research studies in this thesis improve the restore performance of deduplicated data in backup systems from different perspectives; the main objective is to reduce storage reads during restore. In the first study, two different cache designs are integrated to restore data chunks with different localities, and the memory boundary between the two caches is adaptively adjusted based on variations in data locality. This effectively improves cache hits and lowers the number of required storage reads. In the second study, to further address the data chunk fragmentation that causes storage reads the caches cannot avoid, a look-back-window-assisted data chunk rewrite scheme is designed to store fragmented data chunks together during deduplication. With a very small space overhead, the rewrite scheme converts the multiple storage reads of fragmented data chunks into a single storage read, reducing the reads that caching alone cannot eliminate.
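To make the look-back window rewrite concrete, here is a hypothetical Python sketch of the per-chunk decision a deduplication pass might make; the container abstraction, the window size, and all names are the editor's assumptions rather than the thesis design.

```python
from collections import deque

class LookbackRewriter:
    """Illustrative look-back-window rewrite: when a duplicate chunk's existing
    container has not been referenced within the recent window, the chunk is
    treated as fragmented and rewritten into the currently open container, so a
    later restore can fetch it with the mostly sequential reads it already issues."""

    def __init__(self, window_size=64):
        # IDs of the containers referenced by the most recent chunks.
        self.window = deque(maxlen=window_size)

    def should_rewrite(self, existing_container_id, open_container_id):
        fragmented = existing_container_id not in self.window
        # Record the container this chunk will actually be read from at restore time.
        self.window.append(open_container_id if fragmented else existing_container_id)
        return fragmented
```

The trade-off sketched here matches the abstract's description: a small amount of duplicated data for fragmented chunks in exchange for turning several scattered container reads into one.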
Third, in big data infrastructure, compute and storage clusters are disaggregated to achieve high availability, flexibility, and cost-effectiveness. However, disaggregation also produces a huge amount of network traffic between the storage and compute clusters, which leads to potential performance penalties. To investigate these issues, we conduct a comprehensive study of the performance of HBase, a widely used distributed key-value store for big data applications, in a compute-storage disaggregated infrastructure. To address the resulting penalties, we propose an in-storage-computing-based architecture that offloads some I/O-intensive modules from the compute clusters to the storage clusters, effectively reducing network traffic. These observations and explorations can help other big data applications relieve similar performance penalties in the new infrastructure.

Finally, designing and optimizing storage systems for big data applications requires a deep understanding of real-world workloads, yet workload characterization studies are limited because collecting and analyzing real-world workloads in big data infrastructure is challenging. To bridge the gap, we select three large-scale big data applications at Facebook that use RocksDB as their persistent key-value storage engine and characterize, model, and benchmark their RocksDB key-value workloads. To the best of our knowledge, this is the first study to characterize the workloads of persistent key-value stores in real big data systems. We provide deep insights into the workload characteristics and the correlations between storage system behaviors and big data application queries, and we show methodologies and technologies for making better tradeoffs between performance and cost-effectiveness in storage systems supporting big data applications. Lastly, we investigate the limitations of existing benchmarks and propose a new benchmark that better simulates both application queries and storage behaviors.
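As one flavor of what modeling and benchmarking a key-value workload involves, the sketch below generates a synthetic Get/Put/Seek mix over a skewed (Zipf-like) key popularity and replays it against a plain in-memory dict standing in for the store. The operation ratios, the Zipf exponent, and all names are illustrative assumptions by the editor, not measurements from the study.

```python
import collections
import random
from itertools import accumulate

def synthetic_kv_workload(num_ops=10_000, num_keys=10_000,
                          mix=(0.7, 0.2, 0.1), zipf_s=1.2, seed=0):
    """Yield (op, key) pairs with skewed key popularity and a fixed
    Get/Put/Seek ratio -- a toy stand-in for a modeled key-value workload."""
    rng = random.Random(seed)
    # Zipf-like weights: a few hot keys receive most of the traffic,
    # mirroring the locality commonly seen in production traces.
    cum_weights = list(accumulate(1.0 / rank ** zipf_s
                                  for rank in range(1, num_keys + 1)))
    keys = [f"key{idx:08d}" for idx in range(num_keys)]
    for _ in range(num_ops):
        op = rng.choices(("get", "put", "seek"), weights=mix)[0]
        key = rng.choices(keys, cum_weights=cum_weights)[0]
        yield op, key

# Replay the trace against a dict that stands in for the key-value store.
store, op_counts = {}, collections.Counter()
for op, key in synthetic_kv_workload():
    op_counts[op] += 1
    if op == "put":
        store[key] = b"value"
    elif op == "get":
        _ = store.get(key)
    # "seek" (range scan) is left as a no-op in this toy replay.
print(dict(op_counts), "keys written:", len(store))
```

A fuller benchmark would also have to reproduce properties such as key and value size distributions and temporal locality; the abstract's point is that replaying application queries realistically is what lets a benchmark reproduce storage behavior as well.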
Item: Journalism in an Era of Big Data: Cases, Concepts, and Critiques (Digital Journalism, 2015). Lewis, Seth C.
“Journalism in an era of big data” is a way of seeing journalism as interpolated through the conceptual and methodological approaches of computation and quantification. It is about both the ideation and implementation of computational and mathematical mindsets and skill sets in newswork, as well as the necessary deconstruction and critique of such approaches. Taking such a wide-angle view of this phenomenon, including both practice and philosophy within this conversation, means attending to the social/cultural dynamics of computation and quantification, such as the grassroots groups that are seeking to bring pro-social “hacking” into journalism (Lewis and Usher 2013, 2014), as well as to the material/technological characteristics of these developments. It means recognizing that algorithms and related computational tools and techniques “are neither entirely material, nor are they entirely human—they are hybrid, composed of both human intentionality and material obduracy” (Anderson 2013, 1016). As such, we need a set of perspectives that highlight the distinct and interrelated roles of social actors and technological actants at this emerging intersection of journalism (Lewis and Westlund 2014a). To trace the broad outline of journalism in an era of big data, we need (1) empirical cases that describe and explain such developments, whether at the micro (local) or macro (institutional) levels of analysis; (2) conceptual frameworks for organizing, interpreting, and ultimately theorizing about such developments; and (3) critical perspectives that call into question taken-for-granted norms and assumptions. This special issue takes up this three-part emphasis on cases, concepts, and critiques.

Item: Large-scale Clustering using Random Sketching and Validation (2015-08). Traganitis, Panagiotis.
The advent of high-speed Internet, modern devices, and global connectivity has introduced the world to massive amounts of data that are generated, communicated, and processed daily. Extracting meaningful information from this enormous volume of data is becoming increasingly challenging even for high-performance and cloud computing platforms. While critically important in a gamut of applications, clustering is computationally expensive when tasked with high-volume, high-dimensional data. To render such a critical task affordable for data-intensive settings, this thesis introduces a clustering framework named random sketching and validation (SkeVa). This framework builds upon and markedly broadens the scope of random sample and consensus (RANSAC) ideas that have been used successfully for robust regression. Four main algorithms are introduced, which enable clustering of high-dimensional data, subspace clustering for data generated by unions of subspaces, and clustering of large-scale networks. Extensive numerical tests compare the SkeVa algorithms to their state-of-the-art counterparts and showcase the potential of the SkeVa framework.

Item: Learning Healthcare System enabled by Real-time Knowledge Extraction from Text data (2019-07). Kaggal, Vinod.
There is a critical void in clinical informatics ecosystems when it comes to transforming the information captured in the Electronic Health Record (EHR) into actionable knowledge. Incorporating this knowledge into clinical practice through informatics-based analytical tools is critical to delivering optimal clinical care and leads us toward an effective Learning Healthcare System (LHS). A robust infrastructure plays a critical role in enabling such clinical informatics ecosystems; it must guarantee the ability to manage data volume, velocity, variety, and veracity. This thesis work accomplishes: (i) a data model that supports building a robust analytics framework to automatically compute the knowledge within the EHR; (ii) infrastructure to scale up analytics and knowledge delivery; and (iii) clinical and research projects that use this infrastructure for near-real-time analysis of text data, deriving intuitive clinical inferences from patients' multi-dimensional data.

Item: Standing out in a networked communication context: Toward a network contingency model of public attention (new media & society, 2020). Saffer, Adam J.
Social media can offer strategic communicators cost-effective opportunities to reach millions of individuals. In practice, however, it can be difficult to be heard in these crowded digital spaces. This study takes a strategic network perspective and draws from recent research in network science to propose the network contingency model of public attention. The model argues that in the networked, social-mediated environment, an organization’s ability to attract public attention on social media is contingent on its ability to fit its network position to the network structure of the communication context. To test the model, we combine data mining, social network analysis, and machine-learning techniques to analyze a large-scale Twitter discussion network. The results of our analysis of the Twitter discussion around the refugee crisis in 2016 suggest that in high core-periphery network contexts, “star” positions were most influential, whereas in low core-periphery network contexts a “community” strategy was crucial to attracting public attention.
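As a toy illustration of the network measurements behind a model like this, the sketch below builds a small mention network with networkx and contrasts a rough core-periphery signal (k-core membership) with per-node degree centrality, the kind of quantity used to identify “star” positions. The edge list, the summary statistic, and the interpretation are the editor's assumptions; the study's actual pipeline combined data mining, social network analysis, and machine learning at a far larger scale.

```python
import networkx as nx

# Hypothetical mention edges: (source user, mentioned user).
edges = [
    ("org", "a"), ("org", "b"), ("c", "org"), ("d", "org"), ("e", "org"),
    ("f", "g"), ("g", "h"), ("h", "f"),   # a small peripheral community
]

G = nx.Graph()
G.add_edges_from(edges)

centrality = nx.degree_centrality(G)   # "star" potential of each position
core = nx.core_number(G)               # rough core vs. periphery signal

# A crude core-periphery summary: share of nodes sitting in the densest k-core.
max_k = max(core.values())
core_share = sum(1 for k in core.values() if k == max_k) / G.number_of_nodes()

print(f"densest-core share: {core_share:.2f}")
print("top positions by degree centrality:",
      sorted(centrality, key=centrality.get, reverse=True)[:3])
```

In a measurement like this, a strongly core-periphery network would concentrate attention around a few central positions, while a flatter structure would reward embedding in a community, which is the contingency the model describes.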