Browsing by Author "Li, Yan"
Item An Introduction to Spatial Data Mining (2018-08-08) Golmohammadi, Jamal; Xie, Yiqun; Gupta, Jayant; Li, Yan; Cai, Jiannan; Detor, Samantha; Roh, Abigail; Shekhar, Shashi
The goal of spatial data mining is to discover potentially useful, interesting, and non-trivial patterns from spatial datasets. Spatial data mining is important for societal applications in public health, public safety, agriculture, environmental science, climate, etc. For example, in epidemiology, spatial data mining helps to find areas with a high concentration of disease incidents to manage disease outbreaks. Computerized methods are needed to discover spatial patterns since the volume and velocity of spatial data exceed the number of human experts available to analyze it. In addition, spatial data has unique characteristics like spatial autocorrelation and spatial heterogeneity which violate the i.i.d. (Independent and Identically Distributed data samples) assumption of traditional statistics and data mining methods. So, using traditional methods may miss patterns or may yield spurious patterns which are costly (e.g., stigmatization) in spatial applications. Also, there are other intrinsic challenges such as MAUP (Modifiable Areal Unit Problem) as illustrated by a current court case debating gerrymandering in elections. Spatial data mining considers the unique characteristics and challenges of spatial data, along with domain knowledge of the target application, to discover more accurate and interesting patterns. In this article, we discuss tools and computational methods of spatial data mining, focusing on the primary spatial pattern families: hotspot detection, colocation detection, spatial prediction and spatial outlier detection. Hotspot detection methods use domain information to accurately model more active and high-density areas. Colocation detection methods find objects whose instances are in proximity to each other in a location.
Spatial prediction approaches explicitly model the neighborhood relationship of locations to predict target variables from input features. The goal of spatial outlier detection methods is to find data that are different from their neighbors.

Item GeoAI for Emerging Spatial Datasets (2022-05) Li, Yan
Geospatial artificial intelligence (GeoAI) is the generalization of conventional artificial intelligence (AI) to meet the challenges posed by spatial data. Spatial data, i.e., data annotated with spatial information such as locations and shapes, has become increasingly available over the last decade and has transformed lives by providing novel ways of observing the world and of knowing places and the relations between them. For example, large amounts of onboard diagnostics data from vehicles have become available with the popularity of telematics devices equipped with GPS chips, making it possible to monitor vehicles’ real-world performance, which is valuable for domains such as vehicle mechanics, transportation science, and city planning. In many other domains such as smart cities and public health, spatial data has become critical as well. For example, during the Covid-19 pandemic, mobile tracking data from devices with GPS chips has been used as an important means of contact tracing and travel pattern surveying. A McKinsey Digital report estimates that personal spatial data could help save consumers about $600 billion by 2020. Recent years have witnessed significant advances in AI in both academia and industry. Its fast development is powered by big data and high-performance computing platforms that support the development, training, and deployment of AI methods at reasonable cost. Even though spatial data are critical, valuable, and collected at a large scale, and AI techniques have been applied successfully to many problems such as computer vision and natural language processing, spatial data pose great challenges to conventional AI techniques.
The first challenge is the gap between AI techniques and domain knowledge. Conventional AI techniques rarely consider domain knowledge (e.g., physics laws and epidemiology models), making their results hard to interpret and susceptible to violating domain constraints even with large volumes of data. On the other hand, domain knowledge by itself is insufficient due to its reliance on simplifying assumptions that may not approximate complex real-world scenarios well. The other challenges are caused by the properties of spatial data, namely, spatial autocorrelation, spatial heterogeneity, and spatial continuity. Spatial autocorrelation describes the fact that the data samples (e.g., temperature, precipitation) at different spatial locations are correlated with each other and are affected by their geographical neighbors, which violates the common i.i.d. (i.e., independent and identically distributed) assumption underlying many machine learning models. Spatial heterogeneity refers to the fact that the data samples at different spatial locations are different from each other, so there may not be universal models that are applicable globally. Spatial continuity refers to the conflict between the continuity of geographic space and the discrete representation of spatial data. This thesis investigates novel and societally important GeoAI techniques for emerging spatial datasets such as multi-attributed trajectories and categorical point sets. Multiple novel approaches are proposed to address the challenges these datasets pose to conventional AI techniques. Specifically, a Quad-Grid Filter & Refine algorithm is introduced to detect local spatial colocation patterns, which accounts for the spatial heterogeneity property of colocation patterns. The algorithm can detect colocation patterns that may not be prevalent globally but are prevalent in local regions, and it is much more computationally efficient than the baseline algorithm.
Second, the thesis investigates the problem of discovering contrasting spatial colocation patterns that have different prevalence in two groups of spatial datasets. It leverages the domain knowledge that neighborhood relationships between categorical spatial objects may convey important information, and introduces a filter & refine algorithm using the anti-monotone property of a proposed metric to measure the prevalence difference of any colocation pattern in the two groups. Third, the thesis discusses a point-set classification method for multiplexed pathology images. Inspired by the domain assumption that the spatial configuration of cells may vary under different health conditions, this thesis introduces a neural network architecture to capture the spatial configurations of categorical point sets through modeling pairwise relationships. Last, the thesis introduces a physics-guided K-means algorithm to estimate the energy consumption for a vehicle to travel along a path, which combines the physics laws governing vehicle energy consumption with a machine learning model. The thesis also proposes a path-centric path selection algorithm using the proposed energy consumption estimation model, considering the spatial autocorrelation property of the data.

Item An Introduction to Spatial Data Mining (University Consortium for Geographic Information Science, 2020) Golmohammadi, Jamal; Xie, Yiqun; Gupta, Jayant; Farhadloo, Majid; Li, Yan; Cai, Jiannan; Detor, Samantha; Roh, Abigail; Shekhar, Shashi
The goal of spatial data mining is to discover potentially useful, interesting, and non-trivial patterns from spatial datasets (e.g., GPS trajectories of smartphones). Spatial data mining is societally important, with applications in public health, public safety, climate science, etc. For example, in epidemiology, spatial data mining helps to find areas with a high concentration of disease incidents to manage disease outbreaks.
Computational methods are needed to discover spatial patterns since the volume and velocity of spatial data exceed the ability of human experts to analyze it. Spatial data has unique characteristics like spatial autocorrelation and spatial heterogeneity which violate the i.i.d. (Independent and Identically Distributed) assumption of traditional statistics and data mining methods. Therefore, using traditional methods may miss patterns or may yield spurious patterns, which are costly in societal applications. Further, there are additional challenges such as MAUP (Modifiable Areal Unit Problem) as illustrated by a recent court case debating gerrymandering in elections. In this article, we discuss tools and computational methods of spatial data mining, focusing on the primary spatial pattern families: hotspot detection, colocation detection, spatial prediction, and spatial outlier detection. Hotspot detection methods use domain information to accurately model more active and high-density areas. Colocation detection methods find objects whose instances are in proximity to each other in a location. Spatial prediction approaches explicitly model the neighborhood relationship of locations to predict target variables from input features. Finally, spatial outlier detection methods find data that differ from their neighbors. Lastly, we describe future research and trends in spatial data mining.

Item Physics-Guided Anomalous Trajectory Detection: Technical Report (2020) Shrinivasa Nairy, Divya; Adila, Dyah; Li, Yan; Shekhar, Shashi
Given ship trajectory data for a region, this paper proposes a physics-guided approach to detect anomalous trajectories. This problem is important for detection of illegal fishing or cargo transfer, which cause environmental and societal damage. This problem is challenging due to the presence of gaps in trajectories.
Current state-of-the-art approaches either ignore the gaps or fill them using simple linear interpolation, which underestimates the ship’s possible locations during the gap. This paper proposes a novel physics-guided gap-aware anomaly detection test that incorporates physical constraints using a space-time prism. The proposed approach is evaluated with a case study using Marine Cadastre data of ships traversing the Aleutian Islands region of Alaska in October 2017. A trajectory that could have traversed a marine protected area is correctly flagged by the proposed approach for investigation.

Item Structural and chemical characterization data for Ir and Ru metal/metal-oxide thin films showing strain dependence of metal oxidation (2023-04-05) Nair, Sreejith T; Yang, Zhifei; Lee, Dooyong; Guo, Silu; Sadowski, Jerzy T; Johnson, Spencer; Saboor, Abdul; Li, Yan; Zhou, Hua; Comes, Ryan B; Jin, Wencan; Mkhoyan, Andre K; Janotti, Anderson; Jalan, Bharat; nair0074@umn.edu; Nair, Sreejith T; University of Minnesota Jalan MBE Lab
In this work, the authors uncover a previously unexplored effect of substrate-imposed epitaxial strain on the formation energy of a crystalline epitaxial metal oxide thin film, thereby revealing an additional tuning knob to engineer synthesis of oxide thin films of hard-to-oxidize metals.

Item Transdisciplinary Foundations of Geospatial Data Science (2017-12-05) Xie, Yiqun; Eftelioglu, Emre; Ali, Reem Y.; Tang, Xun; Li, Yan; Doshi, Ruhi; Shekhar, Shashi
Recent developments in data mining and machine learning approaches have brought lots of excitement in providing solutions for challenging tasks (e.g., computer vision). However, many approaches have limited interpretability, so their success and failure modes are difficult to understand and their scientific robustness is difficult to evaluate. Thus, there is an urgent need for better understanding of the scientific reasoning behind data mining and machine learning approaches.
This requires taking a transdisciplinary view of data science and recognizing its foundations in mathematics, statistics, and computer science. Focusing on the geospatial domain, we apply this crucial transdisciplinary perspective to five common geospatial techniques (hotspot detection, colocation detection, prediction, outlier detection and teleconnection detection). We also describe challenges and opportunities for future advancement.

Item Understanding COVID-19 Effects on Mobility: A Community-Engaged Approach (2022) Sharma, Arun; Farhadloo, Majid; Li, Yan; Kulkarni, Aditya; Gupta, Jayant; Shekhar, Shashi
Given aggregated mobile device data, the goal is to understand the impact of COVID-19 policy interventions on mobility. This problem is vital due to important societal use cases, such as safely reopening the economy. Challenges include understanding and interpreting questions of interest to policymakers, cross-jurisdictional variability in choice and time of interventions, the large data volume, and unknown sampling bias. The related work has explored the COVID-19 impact on travel distance, time spent at home, and the number of visitors at different points of interest. However, many policymakers are interested in long-duration visits to high-risk business categories and understanding the spatial selection bias to interpret summary reports. We provide an Entity Relationship diagram, system architecture, and implementation to support queries on long-duration visits in addition to fine resolution device count maps to understand spatial bias.
We closely collaborated with policymakers to derive the system requirements and evaluate the system components, the summary reports, and visualizations.

Item Vehicle Emissions Prediction with Physics-Aware AI Models: Technical Report (2020) Panneer Selvam, Harish; Li, Yan; Wang, Pengyue; Northrop, William F; Shekhar, Shashi
Given an on-board diagnostics (OBD) dataset and a physics-based emissions prediction model, this paper aims to develop an accurate and computationally efficient AI (Artificial Intelligence) method that predicts vehicle emissions values. The problem is of societal importance because vehicular emissions lead to climate change and impact human health. This problem is challenging because the OBD data does not contain enough of the parameters needed by high-order physics models. Conversely, related work has shown that low-order physics models have poor predictive accuracy when using available OBD data. This paper uses a divergent window co-occurrence pattern detection method to develop a spatiotemporal variability-aware AI model for predicting emission values from the OBD datasets. We conducted a case study using real-world OBD data from a local public transportation agency. Results show that the proposed AI method has approximately 65% better predictive accuracy than a non-AI low-order physics model and is approximately 35% more accurate than a baseline model.