Scalable data analytics techniques for summarizing spatial network-based observations

Oliver, Dev2014-08-212014-08-212014-05https://hdl.handle.net/11299/165053University of Minnesota Ph.D. dissertation. May 2014. Major: Computer Science. Advisor: Shashi Shekhar. 1 computer file (PDF); ix, 107 pages.Summarizing spatial network-based observations (SSNO) entails finding a compact description or representation of large spatial or spatio-temporal datasets. For example, transportation planners and engineers may need to identify road segments that pose risks for pedestrians and require redesign. However, SSNO is computationally challenging for the following reasons: (1) There may be a large number of k-subsets of connected components in the network, (2) Patterns may not obey the monotonicity property, and (3) There may be a large number of candidates. To address the challenge of the large number of k-subsets of connected components in the network}, we explored the problem of spatial network activity summarization. In spatial network activity summarization (SNAS), we are given a spatial network and a collection of activities (e.g., pedestrian fatality reports, crime reports) and the goal is to find k shortest paths that summarize the activities. SNAS is important for applications where observations occur along linear paths such as roadways, train tracks, etc. Previous work has focused on either geometry or subgraph-based approaches (e.g., only one path), and cannot summarize activities using multiple paths. This work proposes a K-Main Routes (KMR) approach that discovers k shortest paths to summarize activities. To improve performance, KMR uses network Voronoi, divide and conquer, and pruning strategies. Experimental results on synthetic and real data show that KMR with our performance-tuning decisions yields substantial computational savings without reducing summary path coverage. We also present a case study comparing KMR with K-Means on real data.To address the challenge of patterns not obeying the monotonicity property, we studied the problem of geo-referenced time-series summarization. Given a set of regions with activity counts at each time instant (e.g., a listing of countries with number of mass protests or disease cases over time) and a spatial neighbor relation, geo-referenced time-series summarization (GTS) finds k-full trees that maximize activity coverage. GTS has important potential societal applications such as understanding the spread of political unrest, disease, crimes, etc. Previous approaches for spatio-temporal data mining detect anomalous or unusual areas and do not summarize activities. We propose a k-full tree (kFT) approach for GTS which features an algorithmic refinement (i.e., voronoi partition assignment) that leads to computational savings without affecting result quality. Analytical and experimental results show that the algorithmic refinement substantially reduces computational cost. We also present a case study that shows the output of our approach on Arab Spring data. To address the challenge of a large number of candidates, we tackled the problem of significant linear hotspot discovery (SLHD). Given a spatial network and a collection of activities (e.g., pedestrian fatality reports, crime reports), SLHD finds shortest paths in the spatial network where the concentration of activities is unusually high (i.e., statistically significant). SLHD is important for societal applications in transportation safety, public safety, or public health such as finding paths with significant concentrations of accidents, crimes, or diseases. SaTScan may miss many significant paths since a large fraction of the area bounded by circles for activities on a path will be empty. Previous network-based approaches only consider a small fraction of the network and only one significant network component (e.g., path). We propose novel algorithms for discovering statistically significant linear hotspots using the ideas of likelihood pruning, Monte Carlo speedup, and dynamic segmentation. We present a case study comparing the proposed approach with existing techniques on real data. Experimental results show that the proposed algorithm, with our algorithmic refinements, yields substantial computational savings without reducing result quality.The work in this thesis is the first step towards understanding the immense challenges and novel applications of summarizing spatial network-based observations for transportation safety, public safety, disaster response, etc. In this thesis, we have formally modeled spatial and spatio-temporal network datasets and have begun to explore new data analytics techniques (e.g., K-Main Routes) that may have many uses in real-world applications. We conclude this thesis by exploring additional data summarization techniques for spatial networks, their computational challenges, and possible directions for future work.en-USScalable data analytics techniques for summarizing spatial network-based observationsThesis or Dissertation