Browsing by Subject "Data mining"
Now showing 1 - 20 of 25

Item: Approximate search on massive spatiotemporal datasets (2012-08). Brugere, Ivan.
Efficient time series similarity search is a fundamental operation for data exploration and analysis. While previous work has focused on indexing progressively larger datasets and has proposed data structures with efficient exact search algorithms, we motivate the need for approximate query methods that can be used in interactive exploration and as fast data analysis subroutines on large spatiotemporal datasets. This thesis formulates a simple approximate range query problem for time series data and proposes a method that aims to quickly retrieve a small number of high-quality results from the exact search result set. We formulate an anytime framework, giving the user the flexibility to obtain query results at arbitrary cost, where larger runtime incrementally improves the search results. We propose an evaluation strategy for each query framework when the false dismissal class is very large relative to the query result set, and we investigate the performance of indexing novel classes of time series subsequences.
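
As a rough illustration of the anytime idea in this abstract, the sketch below returns range-query matches incrementally and stops when a user-supplied budget runs out. The mean-based candidate ordering and the Euclidean distance are assumptions of this sketch, not the thesis's method.

```python
import numpy as np

def anytime_range_query(query, candidates, radius, budget):
    """Yield (index, distance) for candidates within `radius` of `query`,
    stopping once the iteration budget is exhausted. Anytime behavior:
    a larger budget can only grow the result set."""
    # Cheap heuristic ordering so promising candidates tend to surface early.
    order = sorted(range(len(candidates)),
                   key=lambda i: abs(candidates[i].mean() - query.mean()))
    for step, i in enumerate(order):
        if step >= budget:  # budget exhausted: return what was found so far
            return
        dist = np.linalg.norm(candidates[i] - query)
        if dist <= radius:
            yield i, dist
```

Running the generator again with a larger budget only adds results, which is the anytime property the abstract describes.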

Item: A data collection, storage, and analysis framework for network security (2007-10). Eilertson, Eric.
As the number, severity, and sophistication of computer network attacks increase, network administrators have an increasingly difficult time identifying and cleaning up compromised computers. In this thesis, some of the areas where existing tools and techniques are deficient are identified, and possible solutions are proposed and evaluated on synthetic as well as real networks. This thesis has four major contributions. The first is a lightweight semi-stateful network data capture module. The second is a framework for storing and accessing raw packet information as well as meta-information, such as network sessions. The third is a set of analysis routines for identifying computer network attacks and computers that have been successfully compromised. The fourth is a framework for iteratively building and analyzing the communication patterns of networked computers. This allows security analysts and researchers to identify compromised computers, as well as perform forensic analysis to answer questions like "What computer compromised this computer?", "When did the compromise occur?", "How did the compromise happen?", and "What data was stolen or modified?"
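
The fourth contribution, iteratively building communication patterns, can be pictured with a toy flow-aggregation sketch like the one below. The record layout and the fan-out heuristic are hypothetical, chosen only to illustrate the kind of analysis the thesis automates.

```python
from collections import defaultdict

def build_comm_graph(flows):
    """Aggregate (src, dst, n_bytes) flow records into a weighted
    host-to-host communication graph."""
    graph = defaultdict(lambda: {"flows": 0, "bytes": 0})
    for src, dst, n_bytes in flows:
        graph[(src, dst)]["flows"] += 1
        graph[(src, dst)]["bytes"] += n_bytes
    return graph

def high_fan_out_hosts(graph, threshold=100):
    """Hosts contacting unusually many distinct peers, a common scan signature."""
    peers = defaultdict(set)
    for src, dst in graph:
        peers[src].add(dst)
    return [h for h, s in peers.items() if len(s) >= threshold]
```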

Item: Data Mining of Traffic Video Sequences (University of Minnesota Center for Transportation Studies, 2009-09). Joshi, Ajay J.; Papanikolopoulos, Nikolaos.
Automatically analyzing video data is extremely important for applications such as monitoring and data collection in transportation scenarios. Machine learning techniques are often employed to achieve these goals of mining traffic video to find interesting events. Typically, learning-based methods require a significant amount of training data provided via human annotation. For instance, to provide training, a user can give the system images of a certain vehicle along with the respective annotation. The system then learns how to identify vehicles in the future; however, such systems usually need large amounts of training data and thereby cumbersome human effort. In this research, we propose a method for active learning in which the system interactively queries the human for annotation on the most informative instances. In this way, learning can be accomplished with less user effort without compromising performance. Our system is also computationally efficient, making it feasible for real data mining tasks on traffic video sequences.
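
Uncertainty sampling is the textbook instantiation of this query-the-most-informative-instance loop. The sketch below (logistic regression, probability closest to 0.5) is an assumed stand-in for the paper's actual selection criterion.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_pool, oracle, n_seed=10, n_queries=50):
    """Iteratively query labels for the instances the classifier is least
    certain about. `oracle(i)` plays the role of the human annotator;
    the seed set is assumed to contain both classes."""
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), n_seed, replace=False))
    y = {i: oracle(i) for i in labeled}
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_queries):
        clf.fit(X_pool[labeled], [y[i] for i in labeled])
        proba = clf.predict_proba(X_pool)[:, 1]
        uncertainty = -np.abs(proba - 0.5)   # highest near the decision boundary
        uncertainty[labeled] = -np.inf       # skip already-labeled instances
        i = int(np.argmax(uncertainty))
        y[i] = oracle(i)                     # ask the human for this label only
        labeled.append(i)
    return clf
```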

Item: Data mining techniques for enhancing protein function prediction (2010-04). Pandey, Gaurav.
Proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is crucial for obtaining a basic understanding of the cellular processes operating in an organism, as well as for important applications in biotechnology, such as the development of new drugs, better crops, and synthetic biochemicals such as biofuels. Recent revolutions in biotechnology have given us numerous high-throughput experimental technologies that generate very useful data, such as gene expression and protein interaction data, which provide high-resolution snapshots of complex cellular processes and a novel avenue for understanding their underlying mechanisms. In particular, several computational approaches based on the principle of Guilt by Association (GBA) have been proposed, in which the function(s) of a protein are inferred from those of other proteins that are "associated" with it in these data sets. In this thesis, we have developed several novel methods for improving the performance of these approaches by making use of the unutilized and under-utilized information in genomic data sets, as well as their associated knowledge bases. In particular, we have developed pre-processing methods for handling data quality issues with gene expression (microarray) data sets and protein interaction networks that aim to enhance the utility of these data sets for protein function prediction. We have also developed a method for incorporating the inter-relationships between functional classes, as captured by the ontologies in the Gene Ontology, into classification-based protein function prediction algorithms, which enabled us to improve the quality of predictions made for several functional classes, particularly those with very few member proteins (rare classes). Finally, we have developed a novel association analysis-based biclustering algorithm to address two major challenges with traditional biclustering algorithms, namely an exhaustive search of all valid biclusters satisfying the definition specified by the algorithm, and the ability to search for small biclusters. This algorithm makes it possible to discover smaller biclusters that are more significantly enriched with specific GO terms than those produced by traditional biclustering algorithms. Overall, the methods proposed in this thesis are expected to help uncover the functions of several unannotated proteins (or genes), as shown by specific examples cited in some of the chapters. To conclude, we also suggest several opportunities for further progress on the very important problem of protein function prediction.
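
In its most basic form, the Guilt-by-Association principle reduces to a weighted neighbor vote over an interaction network, as in the sketch below. The toy network, annotations, and top-k cutoff are hypothetical.

```python
from collections import Counter

def gba_predict(network, annotations, protein, top_k=3):
    """Guilt-by-Association baseline: score each candidate function by the
    weighted number of annotated interaction partners carrying it."""
    votes = Counter()
    for neighbor, weight in network.get(protein, []):
        for function in annotations.get(neighbor, []):
            votes[function] += weight
    return [f for f, _ in votes.most_common(top_k)]

# Hypothetical toy data: a tiny interaction network and known annotations.
network = {"P1": [("P2", 1.0), ("P3", 0.5)]}
annotations = {"P2": ["kinase activity"],
               "P3": ["kinase activity", "transport"]}
print(gba_predict(network, annotations, "P1"))  # ['kinase activity', 'transport']
```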

Item: Data Science for Mining Patterns in Spatial Events (2019-05). Tang, Xun.
There has been an explosive growth of spatial data over the last decades thanks to the popularity of location-based services (e.g., Google Maps), affordable devices (e.g., mobile phones with GPS receivers), and the fast development of data transfer and storage technologies. This significant growth, as well as the emergence of new technologies, emphasizes the need for automated discovery of spatial patterns that can facilitate applications in fields such as mechanical engineering, transportation engineering, and public safety. This thesis investigates novel and societally important patterns in various types of large-scale spatial events, such as spatial point events, spatio-temporal point events, and spatio-temporal linear events. Multiple novel approaches are proposed to address the statistical, computational, and mathematical challenges posed by different patterns. Specifically, a neighbor node filter and a shortest-tree pruning algorithm are developed to discover linear hotspots on shortest paths, a bi-directional fragment-multi-graph traversal is proposed for discovering linear hotspots on all simple paths, and an apriori-graph traversal algorithm is proposed for detecting spatio-temporal intersection patterns. Extensive theoretical and experimental analyses show that the proposed approaches not only achieve substantial computational efficiency but also guarantee mathematical properties such as solution correctness and completeness. Case studies using real-world datasets demonstrate that the proposed approaches identify valuable patterns that are not detected by current state-of-the-art methods.

Item: Identifying the combined effect of shared autonomous vehicles and congestion pricing on regional job accessibility (Journal of Transport and Land Use, 2020). Zhong, Shaopeng; Cheng, Rong; Li, Xufeng; Wang, Zhong; Jiang, Yu.
Most of the existing research on shared autonomous vehicles (SAVs) and road congestion pricing has studied the short-term impact on traffic flow. These studies focused on the influences on mobility and ignored the long-term impacts on regional job accessibility. Given this, from the perspective of land use and transportation integration, this study explored the long-term effects of SAVs and cordon-based congestion pricing on regional land use, transportation, and job accessibility. The contributions of this study can be summarized in three respects. First, to the best of the authors' knowledge, this study is the first attempt to identify the long-term impact of the combination of these two technologies on regional job accessibility. Second, in contrast to previous research methodology, this study adopted a land use and transportation integrated model (the TRANUS model) together with scenario planning to ensure the comprehensiveness and validity of the results. Third, this study analyzed the spatial heterogeneity of the impact of the combination of the two technologies on regional job accessibility in areas with different built-environment attributes. To this end, this study quantitatively classified traffic analysis zones (TAZs) using data mining techniques, namely factor analysis and clustering analysis. Results showed that the introduction of SAVs will contribute to job and population development in the charging zone and reduce the negative effect of road congestion pricing. From the perspective of reducing the average travel time between TAZs, the best strategy would be to implement SAVs alone, followed by integrated use of SAVs and road congestion pricing, while the worst strategy would be to implement the cordon-based congestion pricing policy alone. By comparison, from the perspective of improving regional job accessibility, the effect of introducing SAVs was better than that of road congestion pricing, while the combination of the two technologies was not superior to implementing SAVs alone.
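
The TAZ classification step this abstract mentions (factor analysis followed by clustering) can be sketched with standard tools. The feature set, factor count, and cluster count below are assumptions for illustration, not the study's settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

def classify_tazs(features, n_factors=3, n_clusters=4):
    """Reduce built-environment attributes with factor analysis,
    then group TAZs into types with k-means."""
    X = StandardScaler().fit_transform(features)           # z-score each attribute
    factors = FactorAnalysis(n_components=n_factors).fit_transform(X)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(factors)

# Hypothetical input: one row per TAZ (density, land-use mix, transit access, ...).
rng = np.random.default_rng(1)
labels = classify_tazs(rng.normal(size=(100, 6)))
print(labels[:10])
```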

Item: The impact of data fragmentation on high-throughput clinical phenotyping (2012-02). Wei, Weiqi.
Subject selection is essential and has become the rate-limiting step for harvesting knowledge to advance healthcare through clinical research. Present manual approaches inhibit researchers from conducting deep and broad studies and drawing confident conclusions. High-throughput clinical phenotyping (HTCP), a recently proposed approach, leverages the machine-processable content of electronic medical records (EMRs) for this otherwise inefficient process, making subject selection scalable and practical. However, the ability to capture a patient's medical data is often limited by the data fragmentation problems that commonly exist within current EMR systems, i.e., different data types (structured vs. unstructured), heterogeneous data sources (a single medical center vs. multiple healthcare centers), and various time frames (short vs. long). The role of data fragmentation in HTCP remains unknown. In this dissertation, by taking advantage of the REP patient-record-linkage system and the richness of EMR data at Mayo Clinic, I provide a multidimensional and thorough demonstration of how data fragmentation affects HTCP. The predominant message this dissertation delivers to the health informatics field can be summarized as follows: data fragmentation in EMRs has a remarkable influence on HTCP. This risk should be carefully considered and mitigated by clinical researchers for the secondary and meaningful use of EMRs, especially when developing or executing an HTCP algorithm for subject selection.

Item: Integrating Physics into Machine Learning for Monitoring Scientific Systems (2020-07). Jia, Xiaowei.
Machine learning (ML) has transformed all aspects of our life, including how we make decisions, entertain ourselves, and interact with each other. The power of ML models lies in their ability to automatically extract useful patterns from complex data. Given the well-known success of ML in commercial domains, there is an increasing interest in using ML models to advance scientific discovery. However, direct application of "black-box" ML models has met with limited success in scientific domains, given that the data available for many scientific problems is far smaller than what is needed to effectively train advanced ML models. Moreover, in the absence of adequate information about the physical mechanisms of real-world processes, ML approaches are prone to false discoveries of patterns which look deceptively good on training data but cannot generalize to unseen scenarios. This thesis introduces a new generation of machine learning approaches which leverage accumulated scientific knowledge to solve problems of great scientific and societal relevance. We investigate multiple ways in which physical knowledge can be used in the design of ML models for effectively capturing underlying physical processes that are evolving and interacting at multiple scales. We also introduce new optimization strategies for ML models so that they can achieve higher accuracy with limited data while also preserving correctness from a physical perspective. We describe our technical innovations and show how they help address real-world challenges by focusing on applications from two disciplines: aquatic science and monitoring crops at scale. We first introduce a physics-guided machine learning framework, which explores a deep coupling of ML methods with scientific knowledge. We show this approach can significantly outperform state-of-the-art physics-based models and machine learning models in monitoring lake systems and river networks using limited training data, while also maintaining consistency with known physical laws. We also greatly advance existing deep learning methods so that they can learn patterns from real-world data of greater complexity. These techniques have shown considerable success in detecting primary crops in the US and tree crop plantations in Southeast Asia.
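
One common way to couple ML with physical knowledge, and a reasonable mental model for the framework described here, is to add a physics-consistency penalty to the training loss. The sketch below is a generic pattern, not the thesis's exact formulation; `physics_residual` and `lam` are assumed placeholders.

```python
import torch

def physics_guided_loss(model, x, y_obs, physics_residual, lam=0.1):
    """Supervised loss plus a penalty on violations of a known physical
    constraint (e.g., an energy-balance residual), so the model stays
    physically consistent even when labels are scarce."""
    y_pred = model(x)
    data_loss = torch.mean((y_pred - y_obs) ** 2)              # fit observations
    phys_loss = torch.mean(physics_residual(x, y_pred) ** 2)   # penalize violations
    return data_loss + lam * phys_loss
```

Because the physics term is evaluated on predictions rather than labels, it can also be computed on unlabeled inputs, which is one reason such hybrid losses help in low-data regimes.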

Item: Intelligent tagging systems: machine learning for novel systems (2012-04). Vig, Jesse.
The Web puts a vast repository of information at users' fingertips, but the size and complexity of this information space can easily overwhelm users. Recommender systems and tagging systems represent two very different approaches to addressing this information overload. Recommender systems use machine learning and statistical models to automatically retrieve the items of most interest to a particular user. Tagging systems leverage the community's collective knowledge to help users explore the information space themselves. While both approaches can be very effective, each has limitations. Recommender systems require little effort from users, but they leave users with little control over the recommendation process. Tagging systems put control in the hands of the user, but, because tags are applied by humans, they often suffer from tag sparsity. This thesis explores intelligent tagging systems that combine the machine intelligence of recommender systems with the user control and comprehensibility of tagging systems. We first present Tagsplanations, tag-based explanations that help users understand why an item was recommended to them. We then introduce the Tag Genome, a novel data structure that uses machine learning to augment human judgments of the relationships between tags and items. Next we discuss Movie Tuner, a conversational recommender system based on the Tag Genome that enables users to provide multifaceted feedback using tags. For each system, we outline the design space of the problem and discuss our design decisions. We evaluate each system using both offline analyses and field studies involving thousands of users from MovieLens, a movie recommender system that also supports tagging of movies. Finally, we draw conclusions for the broader space of related applications.

Item: Large scale optimization for machine learning (2014-12). Wang, Huahua.
Over the last several decades, a tremendous range of tools has been developed in machine learning, from statistical models to scalable algorithms and from learning strategies to various tasks, with far-reaching influence on applications ranging from image and speech recognition to recommender systems, and from bioinformatics to robotics. Entering the era of big data, large scale machine learning tools become increasingly important for training big models on big data. Since machine learning problems are fundamentally empirical risk minimization problems, large scale optimization plays a key role in building a large scale machine learning system. However, scaling optimization algorithms like stochastic gradient descent (SGD) in a distributed system raises issues such as synchronization, since these algorithms were not designed for that purpose. Synchronization is required because consistency must be guaranteed, i.e., the parameters on different machines should be the same. Synchronization leads to blocking computation and performance degradation of a distributed system; without blocking, overwriting may happen and consistency cannot be guaranteed. Moreover, SGD may not be suitable for constrained optimization problems. To address these issues, we develop several novel optimization algorithms suitable for distributed systems in two settings: unconstrained optimization and equality-constrained optimization. First, building on SGD in the unconstrained setting, we propose online randomized block coordinate descent, which randomly updates some parameters using some samples and thus tolerates the overwriting in SGD. Second, instead of striving to maintain consistency at each iteration in the unconstrained setting, we turn to equality-constrained optimization, which guarantees eventual consistency, i.e., the parameters on different machines are not the same at each iteration but will be the same eventually. The equality-constrained setting also covers cases where SGD cannot be applied. The alternating direction method of multipliers (ADMM) provides a suitable framework for equality-constrained optimization but raises its own issues: (1) it does not provide a systematic way to solve subproblems; (2) it requires solving all subproblems and synchronization; (3) it is a batch method which cannot process data online. For the first issue, we propose Bregman ADMM, which provides a unified framework for solving subproblems efficiently. For the second issue, we propose the parallel direction method of multipliers (PDMM), which randomly picks some subproblems to solve and performs asynchronous aggregation. Finally, we introduce online ADMM so that the algorithm can process partial data at each iteration. To validate the effectiveness and scalability of the proposed algorithms, we apply them to a variety of applications, including sparse structure learning and maximum a posteriori (MAP) inference in probabilistic graphical models, and online dictionary learning. We also implement the proposed methods on various architectures, including hundreds to thousands of CPU cores in clusters and GPUs. Experimental results show that the proposed methods scale gracefully with the number of cores and perform better than state-of-the-art methods.
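
For readers unfamiliar with ADMM, the textbook scaled-form iteration for the lasso shows the alternation this abstract builds on; the thesis's Bregman ADMM, PDMM, and online ADMM generalize this basic scheme, which is shown here only as background.

```python
import numpy as np

def admm_lasso(A, b, lam=0.1, rho=1.0, n_iter=100):
    """Scaled-form ADMM for the lasso:
    minimize 0.5*||Ax - b||^2 + lam*||z||_1  subject to  x = z."""
    m, n = A.shape
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA_rhoI = A.T @ A + rho * np.eye(n)   # factor reused by every x-update
    Atb = A.T @ b
    for _ in range(n_iter):
        x = np.linalg.solve(AtA_rhoI, Atb + rho * (z - u))   # quadratic subproblem
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)  # soft-thresholding
        u = u + x - z                                          # dual (consensus) update
    return z
```

The x-update and z-update are the separable "subproblems" the abstract refers to; distributing them across machines is what creates the synchronization and consistency questions the thesis addresses.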

Item: Machine learning methods for recommender systems (2015-02). Kabbur, Santosh.
This thesis focuses on machine learning and data mining methods for problems in the area of recommender systems. The presented methods represent a set of computational techniques that produce recommendations of items which are interesting to the target users. These recommendations are made from a large collection of such items by learning preferences from their interactions with the users. This thesis addresses the two primary tasks in recommender systems, namely top-N recommendation and rating prediction. The following methods are developed. (i) An item-based method (FISM) for generating top-N recommendations that learns the item-item similarity matrix as the product of two low-dimensional latent factor matrices. These matrices are learned using a structural equation modeling approach, wherein the value being estimated is not used for its own estimation. Since the effectiveness of existing top-N recommendation methods decreases as the sparsity of the datasets increases, FISM is developed to alleviate the problem of data sparsity. (ii) A new user modeling approach (MPCF) that models the user's preference as a combination of global and local preference components. Using this user modeling approach, two different methods are proposed based on the manner in which the global and local preference components interact. In the first approach, the global component models the user's common strong preferences on a subset of item features, while the local component models the tradeoffs the user is willing to make on the rest of the item features. In the second approach, the global component models the user's common overall preferences on all the item features, and the local component models the different tradeoffs the user has on all the item features, thereby helping to fine-tune the global preferences. An additional advantage of MPCF is that the user's global preferences are estimated by taking into account all the observations, so it can handle sparse data effectively. (iii) A new method called ClustMF, which is designed to combine the benefits of neighborhood models and latent factor models in a computationally efficient manner. The benefits of latent factor models are obtained by modeling the users and items as in standard MF-based methods, and the benefits of neighborhood models are brought into the model by introducing biases at the cluster level. That is, the biases for users are modeled at the item cluster level and the biases for items are modeled at the user cluster level. The item-cluster user biases model the baseline score of the user for items similar to the active item, and similarly, the user-cluster item biases model the baseline score of the item from users similar to the active user.
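
The core FISM idea, scoring an item against a user's rated items through a factored similarity matrix that is never formed explicitly, can be sketched as follows. The bias terms, the normalization exponent alpha, and the toy factors are illustrative assumptions.

```python
import numpy as np

def fism_score(user_items, item, P, Q, bu=0.0, bi=0.0, alpha=0.5):
    """FISM-style prediction: sim(j, i) is the inner product of latent
    factors P[j] and Q[i], so the full item-item matrix never needs storing."""
    rated = [j for j in user_items if j != item]   # exclude the target item itself
    if not rated:
        return bu + bi
    agg = P[rated].sum(axis=0) / (len(rated) ** alpha)  # sparsity-normalized sum
    return bu + bi + float(agg @ Q[item])

# Hypothetical factors for 5 items in a rank-2 latent space.
rng = np.random.default_rng(0)
P, Q = rng.normal(size=(5, 2)), rng.normal(size=(5, 2))
print(fism_score({0, 2, 3}, item=1, P=P, Q=Q))
```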

Item: Mining high-dimensional bioprocess and gene expression data for enhanced process performance (2012-07). Le, Huong Thi Ngoc.
Over the past few decades, recombinant protein therapeutics produced in cultured mammalian cells have fundamentally transformed modern medicine and improved millions of patients' lives. The drastic increase in product concentration and in the number of products approved by the US Food and Drug Administration (FDA) has been attributed largely to the relentless efforts of the entire pharmaceutical community on multiple technological fronts. The remarkable advances of high-throughput genomic and process analytical tools in recent years have allowed us to extensively characterize almost all steps along a typical cell culture process. The massive amount of data generated by these technologies harbors vital information about the process, yet presents substantial challenges due to its exceptionally high dimensionality. This thesis research applied advanced multivariate approaches to explore these data sets and comprehend profound cellular changes during various development and manufacturing stages. Through mining a large set of manufacturing data, we uncovered a "memory" effect, suggesting that the final outcome of a production culture is primarily affected by the early seed culture. Several parameters related to lactate metabolism and cell growth were identified as having a pivotal influence on process performance. Furthermore, transcriptome analysis of cells undergoing selection and amplification was performed using multiple statistical, clustering, and functional analysis methods. Profound transcriptional changes were discerned, upon which a combined hyper-productivity gene set involving cell cycle control, signaling, and protein processing and secretion was derived. These differentially expressed genes present promising targets for cellular modulation to enhance process performance. We further developed a novel genetic tool to engineer the expression dynamics of these genes. A large number of genes with time-dynamic expression trends were identified through mining time-series transcriptome data. The promoters of these genes offer effective means to drive the expression profiles of the targets in a dynamic manner. The systems approaches outlined in this research thus hold promise to deepen our understanding of process characteristics and open new avenues for process improvement.

Item: Mining relationships in spatio-temporal datasets (2013-01). Kawale, Jaya.

Item: MnROAD Data Mining, Evaluation and Quantification – Phase I (Minnesota Department of Transportation Research Services Section, 2010-07). Barnes, Randal J.
A data filtering system for the MnROAD temperature database was designed and implemented. Fourteen inter-dependent quantitative tests were developed to identify and flag erroneous, questionable, or exceptional data. Four of the tests identify missing and intermittent data streams. Three of the tests analyze the time series from individual sensors and identify outliers. Three of the tests compare data streams from similar sensors; "similar" implies identical pavement type, general location, and sensor depth. The remaining four tests are summary tests that identify periods of unreliable data. The specific analysis and quantitative results are based upon the 471,178,324 data records from 1,313 thermocouple sensors in 48 MnROAD test cells, collected from 1 January 1996 through October 2007. The considered test cells include both hot mix asphalt and Portland cement concrete sections from both the Mainline and the Low Volume Road. The majority of the sensors performed very well: 714 of the 1,282 operational sensors produced reliable data more than 99 percent of the time. Only 18 of the 1,282 operational sensors produced reliable data less than 50 percent of the time, and only 31 of the original 1,313 sensors were wholly non-operational. A wide variety of statistical tables and graphical representations were produced in digital format for the considered data. Although this project focuses on a particular set of data, the concepts and tools developed in this project are designed to be extensible to accommodate the filtering of ongoing and future data collection efforts at MnROAD.
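
To give a flavor of the per-sensor screening such a filtering system performs, the sketch below flags readings that deviate strongly from a robust local baseline. It is a generic analogue, not one of the report's fourteen tests; the window size and threshold are assumptions.

```python
import numpy as np

def flag_outliers(series, window=24, z_thresh=4.0):
    """Flag readings far from the rolling median of their neighborhood,
    using the median absolute deviation (MAD) as a robust spread estimate."""
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for i in range(len(series)):
        lo, hi = max(0, i - window), min(len(series), i + window + 1)
        neighborhood = np.delete(series[lo:hi], i - lo)  # exclude the point itself
        med = np.median(neighborhood)
        mad = np.median(np.abs(neighborhood - med)) + 1e-9
        if abs(series[i] - med) / (1.4826 * mad) > z_thresh:
            flags[i] = True
    return flags
```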

Item: Modeling Spatial and Spatio-temporal Co-occurrence Patterns (2008-07). Celik, Mete.
As the volume of spatial and spatio-temporal data continues to increase significantly, due both to the growth of database archives and to the increasing number and resolution of spatio-temporal sensors, automated and semi-automated pattern analysis becomes more essential. Spatial and spatio-temporal (ST) data analyses have emerged in recent decades to develop an understanding of spatial and spatio-temporal characteristics and patterns. However, in the last decade, the growth in variety and volume of observational data, notably spatial and spatio-temporal data, has outpaced the capabilities of analytical tools and techniques. Major limitations of existing classical data mining models and techniques include the following. First, they do not adequately model the richer temporal semantics of data observations (e.g., co-occurrence patterns of moving objects, emerging and vanishing patterns, multi-scale cascade patterns, periodic patterns). Second, they do not take into account the time dimension of the data observations. Third, they do not provide sufficient interest measures and computationally efficient algorithms to discover spatial and spatio-temporal co-occurrence patterns. These limitations represent critical barriers in several application domains that require analyzing huge datasets. In this dissertation, I addressed these limitations by i) providing a framework to model the rich semantics of ST patterns of data observations by developing a taxonomy of spatial and ST co-occurrence patterns, ii) designing new techniques that take into account the time dimension of the data, and iii) developing new monotonic composite interest measures and scalable algorithms. The proposed approaches reduce manual effort by reducing the plausible set of hypotheses. The major focus is on developing scalable algorithms to mine spatial and ST co-occurrence patterns.

Item: Numerical linear algebra techniques for effective data analysis (2010-09). Chen, Jie.
Data analysis is a process of inspecting and obtaining useful information from data, with the goal of knowledge and scientific discovery. It brings together several disciplines in mathematics and computer science, including statistics, machine learning, databases, data mining, and pattern recognition, to name just a few. A typical challenge in the current era of information technology is the availability of large volumes of data, together with "the curse of dimensionality". From the computational point of view, such a challenge demands efficient algorithms that can scale with the size and the dimension of the data. Numerical linear algebra lays a solid foundation for this task via its rich theory and elegant techniques, and a large number of examples show that it constitutes a crucial ingredient in the process of data analysis. In this thesis, we elaborate on the above viewpoint via four problems, all of which have significant real-world applications. We propose efficient algorithms based on matrix techniques for solving each problem, with guaranteed low computational costs and high-quality results. In the first scenario, a set of so-called Lanczos vectors is used as an alternative to the principal eigenvectors/singular vectors in some processes of reducing the dimension of the data. The Lanczos vectors can be computed inexpensively, and they can be used to preserve the latent information of the data, resulting in a quality as good as that obtained by using eigenvectors/singular vectors. In the second scenario, we consider the construction of a nearest-neighbors graph. Within a divide-and-conquer framework and via the use of the Lanczos procedure, two algorithms are designed with sub-quadratic (and close to linear) costs, far more efficient than existing practical algorithms when the data at hand are of very high dimension. In the third scenario, a matrix blocking algorithm for reordering and finding dense diagonal blocks of a sparse matrix is adapted to identify dense subgraphs of a sparse graph, with broad applications in community detection for social, biological, and information networks. Finally, in the fourth scenario, we visit the classical problem of sampling a very high dimensional Gaussian distribution in statistical data analysis. A technique for computing a function of a matrix times a vector is developed to remedy traditional techniques (such as via the Cholesky factorization of the covariance matrix) that are limited to mega-dimensions in practice. The new technique has the potential for sampling Gaussian distributions in tera/peta-dimensions, as typically required for large-scale simulations and uncertainty quantification.
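
The Lanczos process at the heart of the first two scenarios is short enough to sketch directly: for a symmetric matrix A it builds an orthonormal basis whose leading vectors can stand in for principal eigenvectors at much lower cost. This is the standard three-term recurrence, without the reorthogonalization a production implementation would need.

```python
import numpy as np

def lanczos(A, k, seed=0):
    """Run k steps of the Lanczos process on symmetric A, returning the
    orthonormal basis V and the tridiagonal coefficients (alphas, betas)."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    V = np.zeros((n, k + 1))
    alphas, betas = [], []
    v = rng.normal(size=n)
    V[:, 0] = v / np.linalg.norm(v)
    beta = 0.0
    for j in range(k):
        w = A @ V[:, j] - (beta * V[:, j - 1] if j > 0 else 0.0)
        alpha = w @ V[:, j]
        w -= alpha * V[:, j]          # three-term recurrence
        beta = np.linalg.norm(w)
        alphas.append(alpha); betas.append(beta)
        if beta < 1e-12:              # invariant subspace found: stop early
            break
        V[:, j + 1] = w / beta
    return V[:, :k], np.array(alphas), np.array(betas)
```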

Item: Scalable Kernel Learning, Tensors in Community Identification, and Robust Adversary Detection in Deep Neural Networks (2019-08). Sheikholeslami, Fatemeh.
The presence of ubiquitous sensors continuously recording massive amounts of information has led to unprecedented data collection, whose exploitation is expected to bring about scientific and social advancements in everyday life. Along with the ever-increasing amount of data, incredible progress in the fields of machine learning, pattern recognition, and optimization has also contributed to growing expectations. Such progress, however, has also brought to light certain limitations in state-of-the-art learning machines, manifesting the roadblocks in the research path ahead. For instance, in addition to practical considerations pertaining to non-stationary, noisy, and unsupervised settings, various applications often run on limited memory and stringent computational resources, thus requiring efficient and lightweight algorithms to cope with extreme volumes. Furthermore, characteristics such as the presence of outliers or adversaries, as well as the complex nature of real-world interactions, call for robust algorithms whose performance is resilient to deviations from nominal settings. The present thesis contributes to learning over unsupervised, complex, and adversarial data. Emphasis is laid on concocting online, scalable, and robust algorithms, enabling streaming analytics of sequential measurements based on vector, matrix, and tensor-based views of supervised and unsupervised learning tasks. For online and scalable learning, a novel kernel-based feature extraction framework is put forth, in which limited memory and computational resources are accounted for by maintaining an affordable "budget". Furthermore, complex interactions of real-world networks are analyzed from a community identification point of view, in which a novel tensor-based representation along with provable optimization techniques robustifies state-of-the-art alternatives. Finally, the performance of deep convolutional neural network based image classifiers is investigated when adversaries disturbing the input images are modeled as imperceptible yet carefully crafted perturbations. To this end, a general class of high-performance Bayesian detectors of adversaries is developed. Extensive experimentation on synthetic as well as numerous real datasets demonstrates the effectiveness, interpretability, and scalability of the proposed learning, identification, and detection algorithms. More importantly, the process of design and experimentation sheds light on the behavior of different methods and the peculiarities of real-world data, while at the same time generating new ideas and directions to explore.
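
The idea of a fixed "budget" for kernel methods can be illustrated with random Fourier features, which cap memory and compute by fixing the feature-map size in advance. This standard construction is an assumed stand-in for the thesis's budget-maintenance framework, not its algorithm.

```python
import numpy as np

def rff_features(X, n_features=256, gamma=1.0, seed=0):
    """Random Fourier features approximating the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2) with a fixed-size map, so a
    linear model on the features emulates a kernel machine on a budget."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# After the map, z(x) @ z(y) approximates k(x, y); the cost per sample is
# fixed by n_features regardless of how many samples have streamed by.
```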

Item: Spatial big data analytics for urban informatics (2013-08). Evans, Michael Robert.
Urban Informatics is the practice of using computer technology to support core city functions: planning, governance, and operations. This technology consists of hardware, software, databases, sensors, and communication devices used to develop and sustain more livable and healthier cities. Urban Informatics provides governments with the tools to make data-driven decisions regarding long-term plans, to predict and respond to current and upcoming situations, and even to help with day-to-day tasks such as monitoring water use and waste processing. New and immense location-aware datasets, formally defined in this thesis as Spatial Big Data, are emerging from a variety of sources and can be used to find novel and interesting patterns for use in urban informatics. Spatial big data is the key component driving the emerging field of Urban Informatics at the intersection of people, places, and technology. However, spatial big data presents challenges for existing spatial computing systems to store, process, and analyze such large datasets. With these challenges come new opportunities in many fields of computer science research, such as spatial data mining and spatial database systems. This thesis contains original research on two types of spatial big data, each study focusing on a different aspect of handling spatial big data (storage, processing, and analysis). Below we describe each data type through a real-world problem with challenges, related work, novel algorithmic solutions, and experimental analysis. To address the challenge of analyzing spatial big data, we studied the problem of finding primary corridors in bicycle GPS datasets. Given a set of GPS trajectories on a road network, the goal of the All-Pair Network Trajectory Similarity (APNTS) problem is to calculate the similarity between all trajectories using the Network Hausdorff Distance. This problem is important for a variety of societal applications, such as facilitating greener travel via bicycle corridor identification. The APNTS problem is challenging due to the high cost of computing the exact Network Hausdorff Distance between trajectories in spatial big datasets. Previous work on the APNTS problem takes over 16 hours of computation time on a real-world dataset of bicycle GPS trajectories in Minneapolis, MN. In contrast, this work focuses on a scalable method for the APNTS problem using the idea of row-wise computation, resulting in a computation time of less than 6 minutes on the same dataset. We provide a case study for transportation services using a data-driven approach to identify primary bicycle corridors for public transportation by leveraging emerging GPS trajectory datasets. Experimental results on real-world and synthetic data show a two orders of magnitude improvement over previous work. To address the challenge of storing spatial big data, we studied the problem of storing spatio-temporal networks in spatial database systems. Given a spatio-temporal network and a set of database query operators, the goal of the Storing Spatio-Temporal Networks (SSTN) problem is to produce an efficient data storage method that minimizes disk I/O access costs. Storing and accessing spatio-temporal networks is increasingly important in many societal applications such as transportation management and emergency planning. This problem is challenging due to the strain on traditional adjacency-list representations when storing temporal attribute values, given the sizable increase in the length of the time series. Current approaches for the SSTN problem focus on orthogonal partitioning (e.g., snapshot, longitudinal, etc.), which may produce excessive I/O costs when performing traversal-based spatio-temporal network queries (e.g., route evaluation, arrival time prediction, etc.) due to the desired nodes not being allocated to a common page. We propose a Lagrangian-Connectivity Partitioning (LCP) technique to efficiently store and access spatio-temporal networks that utilizes the interaction between nodes and edges in a network. Experimental evaluation using the Minneapolis, MN road network showed that LCP outperforms traditional orthogonal approaches. The work in this thesis is the first step toward understanding the immense challenges and novel applications of Spatial Big Data Analytics for Urban Informatics. In this thesis, we define spatial big data and propose novel approaches for storing and analyzing two popular spatial big data types: GPS trajectories and spatio-temporal networks. We conclude the thesis by exploring future work in the processing of spatial big data.
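
The Hausdorff distance underlying APNTS is easy to state for point sequences. The sketch below uses SciPy's Euclidean implementation as a stand-in for the Network Hausdorff Distance, which instead measures distances along the road network.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def trajectory_hausdorff(t1, t2):
    """Symmetric Hausdorff distance between two trajectories given as
    (n, 2) arrays of coordinates: the largest distance from any point
    on one trajectory to its nearest point on the other."""
    return max(directed_hausdorff(t1, t2)[0],
               directed_hausdorff(t2, t1)[0])

a = np.array([[0, 0], [1, 0], [2, 0]])
b = np.array([[0, 1], [1, 1], [2, 1]])
print(trajectory_hausdorff(a, b))  # 1.0 for these two parallel tracks
```

Computing this for all trajectory pairs is quadratic in the number of trajectories, which is why the row-wise reformulation in the thesis matters at scale.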

Item: Spatiotemporal Big Data Analytics: Change Footprint Pattern Discovery (2014-05). Zhou, Xun.
Recent years have seen the emergence of many new and valuable datasets such as global climate projections, GPS traces, and tweets. However, this Spatiotemporal Big Data (STBD) poses significant challenges for data analytics due to high data variety and candidate-pattern cardinality. One specific STBD analytics task is change footprint pattern discovery. Given a definition of change and a dataset about a spatiotemporal (ST) phenomenon, ST change footprint pattern discovery is the process of identifying the location and/or time of such changes in the data. This problem is of fundamental significance to a variety of applications such as understanding climate change, public safety, and environmental monitoring. This thesis formally defines the spatiotemporal change footprint as a new pattern family in STBD analytics and examines footprint patterns and related discovery techniques across disciplines via a novel taxonomy. Methods for detecting change footprints have emerged from a diverse set of research areas, ranging from time series analysis and remote sensing to spatial statistics. Existing reviews focus on discovery methods for only one or a few types of change footprints. To facilitate the sharing of insights across disciplines, we conduct a multi-disciplinary review of ST change patterns and their respective discovery methods. We develop a taxonomy of possible ST change footprints and classify our review findings accordingly. This exercise allows us to identify gaps in the research that we consider ripe for exploration, most notably change pattern discovery in vector ST datasets. To address the research gaps identified in the above analysis, this thesis further explores computational solutions to the discovery of two specific change footprint patterns, namely interesting sub-paths (e.g., change intervals) and persistent change windows. Given a spatiotemporal (ST) dataset and a path in its embedding spatiotemporal framework, the goal of the interesting sub-path discovery problem is to identify all interesting sub-paths defined by an interest measure. An important application domain of sub-path discovery is understanding climate change. This thesis formally defines the computational structure of interesting sub-path discovery as a Grid-based Directed Acyclic Graph (G-DAG). We propose a new algorithm, the Row-wise Traversal (after leaf evaluation) with Column Pruning (RTCP), which dramatically reduces the memory cost of G-DAG traversal relative to our earlier approaches while also reducing CPU cost. We also provide theoretical analyses of the correctness, completeness, and computational complexity of the RTCP algorithm. Experimental evaluation on both synthetic and real datasets shows that the RTCP algorithm is always the fastest in computation time among all the proposed algorithms. The thesis finally explores a more complicated change footprint pattern, namely the persistent change window. Given a region comprised of locations that each have a time series, the Persistent Change Windows (PCW) discovery problem aims to find all spatial window and temporal interval pairs that exhibit persistent change of attribute values over time. PCW discovery is important for critical societal applications such as detecting desertification and deforestation and monitoring urban sprawl. The PCW discovery problem is challenging due to the large number of candidate patterns, the lack of monotonicity, and large datasets of detailed resolution and high volume. Previous approaches to ST change footprint discovery have focused on local spatial footprints for persistent change discovery and may not guarantee completeness. In contrast, we propose a space-time window enumeration and pruning (SWEP) approach that considers zonal spatial footprints when finding persistent change patterns. We provide theoretical analysis of SWEP's correctness, completeness, and space-time complexity. We also present a case study on vegetation data that demonstrates the usefulness of the proposed approach. Experimental evaluation on synthetic data shows that the SWEP approach is orders of magnitude faster than the naive approach. The work in this thesis is the first step toward understanding the spatiotemporal change footprint discovery problem, including its formulation, computational challenges and solutions, and applications. In this thesis, we have explored automatic and efficient approaches to discovering raster-based ST change footprints, and applied our techniques to climate data in the context of understanding climate change. We conclude this thesis by exploring potential ST change patterns with new footprints (e.g., geographic-feature-based footprints), alternative computational paradigms (e.g., parallel and distributed STBD analytics), their challenges and solutions, and other future research directions.
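
The combinatorics that make PCW discovery hard are visible even in a naive baseline: the sketch below enumerates every rectangular spatial window of a (rows, cols, time) cube and applies a simple persistence test. The test itself (every consecutive step exceeds a slope threshold in the same direction) is an assumption for illustration; this brute-force enumeration is exactly the candidate explosion that SWEP prunes.

```python
import numpy as np

def persistent_change_windows(cube, min_slope=0.05):
    """Naive baseline: test every rectangular spatial window of a
    (rows, cols, time) array for a persistently rising or falling
    window-averaged series. Cost is O(R^2 * C^2 * T)."""
    rows, cols, _ = cube.shape
    found = []
    for r1 in range(rows):
        for r2 in range(r1 + 1, rows + 1):
            for c1 in range(cols):
                for c2 in range(c1 + 1, cols + 1):
                    series = cube[r1:r2, c1:c2].mean(axis=(0, 1))
                    diffs = np.diff(series)
                    if np.all(diffs > min_slope) or np.all(diffs < -min_slope):
                        found.append((r1, r2, c1, c2))
    return found
```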

Item: A Study of Dimensionality Reduction Techniques and its Analysis on Climate Data (2015-10). Kumar, Arjun.
Dimensionality reduction is a significant problem across a wide variety of domains such as pattern recognition, data compression, image segmentation, and clustering. Different methods exploit different features in the data to reduce dimensionality. Principal Component Analysis is one such method; it exploits the variance in the data to embed the data onto a lower-dimensional space called the principal component space. These are linear techniques, which can be expressed in the form B = TX, where T is the transformation matrix that maps the data matrix X to the reduced-dimensionality representation B. Other linear techniques explored are Factor Analysis and Dictionary Learning. In many problems the observations are high-dimensional, but we may have reason to believe that they lie near a lower-dimensional manifold. In other words, we may believe that high-dimensional data are multiple, indirect measurements of an underlying source, which typically cannot be directly measured. Learning a suitable low-dimensional manifold from high-dimensional data is essentially the same as learning this underlying source. Techniques such as ISOMAP, Locally Linear Embedding, and Laplacian Eigenmaps (LEMs), among many others, try to embed high-dimensional observations from the non-linear space onto a low-dimensional manifold. We explore these methods, making comparative studies, and examine their applications in the domain of climate science.
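
The linear form B = TX can be made concrete with PCA computed via the SVD. In this sketch, rows of X are variables and columns are observations, and T stacks the top-k principal directions; the toy data shape is an assumption.

```python
import numpy as np

def pca_transform(X, k):
    """PCA as a linear map B = T X: T holds the top-k principal
    directions of the (variable-centered) data matrix X."""
    Xc = X - X.mean(axis=1, keepdims=True)          # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :k].T                                  # top-k directions as rows
    B = T @ Xc                                      # reduced representation
    return B, T

# Hypothetical climate-style data: 10 variables observed 200 times.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 200))
B, T = pca_transform(X, k=2)
print(B.shape)  # (2, 200): each observation is now a 2-dimensional point
```

Nonlinear methods such as ISOMAP and Locally Linear Embedding replace the single global map T with embeddings built from local neighborhood structure, which is why they can recover curved manifolds that PCA flattens.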