Browsing by Author "Shekhar, Shashi"
Now showing 1 - 20 of 73

Item: A Comparative Study on Web Prefetching (2001-05-31)
Authors: Bhushan Pandey, Ajay; Vatsavai, Ranga R.; Ma, Xiaobin; Srivastava, Jaideep; Shekhar, Shashi
The growth of the World Wide Web has emphasized the need to reduce user latency. Increasing use of dynamic pages, frequent changes in site structure, and user access patterns on the internet have limited the efficacy of caching techniques and emphasized the need for prefetching. Since prefetching increases bandwidth usage, it is important that the prediction model be highly accurate and computationally feasible. It has been observed that in a web environment, certain sets of pages exhibit stronger correlations than others, a fact which can be used to predict future requests. Previous studies on predictive models are mainly based on pairwise page interactions and TOP-N approaches. In this paper we study a model based on page interactions of higher order, where we exploit set relationships among the pages of a web site. We also compare the performance of this approach with models based on pairwise interaction and the TOP-N approach. We have conducted a comparative study of these models on a real server log and five synthetic logs with varying page frequency distributions to simulate different real-life web sites, and identified dominance zones for each of these models. We find that the model based on higher-order page interaction is more robust and gives competitive performance in a variety of situations.

Item: A Framework for Discovering Co-location Patterns in Data Sets with Extended Spatial Objects (2003-09-22)
Authors: Xiong, Hui; Shekhar, Shashi; Huang, Yan; Kumar, Vipin; Ma, Xiaobin; Soung Yoo, Jin
Co-location patterns are subsets of spatial features (e.g. freeways, frontage roads) usually located together in geographic space. Recent literature has provided a transaction-free approach to discover co-location patterns over spatial point data sets, avoiding the potential loss of proximity relationship information that comes from partitioning continuous geographic space into transactions. This paper provides a more general transaction-free approach to mine data sets with extended spatial objects, e.g. line-strings and polygons. Key challenges include modeling neighborhoods and relationships among extended spatial objects as well as controlling the related geometric computation costs. Based on a buffer-based definition of neighborhoods, a new model for finding co-location patterns over extended spatial objects is proposed. Furthermore, this paper presents two pruning approaches, namely a prevalence-based pruning approach and a geometric filter-and-refine approach. Experimental evaluation with a real data set (the roadmap of the Minneapolis and St. Paul metropolitan area) shows that the geometric filter-and-refine approach can speed up the prevalence-based pruning approach by a factor of 30 to 40. Finally, the extended co-location mining algorithm proposed in this paper has been used to select the most challenging field test routes for a novel GPS-based approach to assessing road user charges.
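
The buffer-based neighborhoods and the geometric filter-and-refine step described in the entry above can be illustrated with a small sketch. The snippet below is a minimal illustration, not the paper's implementation: it assumes the shapely library is available, uses a hypothetical buffer distance, and applies a cheap bounding-box test (filter) before the exact buffer-intersection test (refine).

```python
# A minimal sketch (not the paper's implementation) of a buffer-based
# neighbor test with a geometric filter-and-refine step, using shapely.
from itertools import combinations
from shapely.geometry import LineString

BUFFER_DIST = 10.0  # hypothetical buffer radius defining "neighborhood"

def bboxes_overlap(a, b, dist):
    """Filter step: cheap bounding-box test expanded by the buffer distance."""
    ax0, ay0, ax1, ay1 = a.bounds
    bx0, by0, bx1, by1 = b.bounds
    return not (ax1 + dist < bx0 or bx1 + dist < ax0 or
                ay1 + dist < by0 or by1 + dist < ay0)

def buffer_neighbors(objects, dist=BUFFER_DIST):
    """Return pairs of extended objects whose buffers intersect.
    objects: dict mapping object id -> shapely geometry (e.g. LineString)."""
    neighbors = []
    for (ida, ga), (idb, gb) in combinations(objects.items(), 2):
        if not bboxes_overlap(ga, gb, dist):      # filter: skip distant pairs
            continue
        if ga.buffer(dist).intersects(gb):        # refine: exact buffer test
            neighbors.append((ida, idb))
    return neighbors

# Tiny usage example with two road-like line-strings.
roads = {"freeway_1": LineString([(0, 0), (100, 0)]),
         "frontage_1": LineString([(0, 5), (100, 5)])}
print(buffer_neighbors(roads))   # [('freeway_1', 'frontage_1')]
```
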
Item: A Join-less Approach for Co-location Pattern Mining: A Summary of Results (2005-12-29)
Authors: Yoo, Jin Soung; Shekhar, Shashi; Celik, Mete
Spatial co-location patterns represent the subsets of features whose instances are frequently located together in geographic space. For example, McDonald's and Burger King restaurants are likely co-located on a local business map. Co-location pattern discovery presents challenges since the instances of spatial features are embedded in a continuous space and share a variety of spatial relationships. A large fraction of the computation time is devoted to identifying the instances of co-location patterns. We propose a novel join-less approach for co-location pattern mining, which materializes spatial neighbor relationships with no loss of co-location instances and reduces the computational cost of identifying the instances. The join-less co-location mining algorithm is efficient since it uses an instance-lookup scheme instead of an expensive spatial or instance join operation for identifying co-location instances. We prove the join-less algorithm is correct and complete in finding co-location rules. Experimental evaluations using synthetic and real-world datasets show that the join-less algorithm performs more efficiently than a current join-based algorithm and is scalable on dense spatial datasets.

Item: A Partial Join Approach for Mining Co-location Patterns: A Summary of Results (2005-12-29)
Authors: Yoo, Jin Soung; Shekhar, Shashi
Spatial co-location patterns represent the subsets of events whose instances are frequently located together in geographic space. We identified the computational bottleneck in the execution time of a current co-location mining algorithm: a large fraction of the join-based co-location miner algorithm is devoted to computing joins to identify instances of candidate co-location patterns. We propose a novel partial-join approach for mining co-location patterns efficiently. It transactionizes continuous spatial data while keeping track of the spatial information not modeled by transactions. It uses a transaction-based Apriori algorithm as a building block and adopts the instance-join method for residual instances not identified in transactions. We show that the algorithm is correct and complete in finding all co-location rules which have prevalence and conditional probability above the given thresholds. An experimental evaluation using synthetic datasets and a real dataset shows that our algorithm is computationally more efficient than the join-based algorithm.
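
The instance-lookup idea behind the join-less approach above can be sketched for size-2 co-location patterns. The code below is a simplified, pure-Python illustration under stated assumptions (point instances, a fixed neighbor distance, and the participation index as the prevalence measure); it is not the authors' algorithm, but it shows how materialized neighbor lists replace expensive joins.

```python
# A minimal, pure-Python sketch of the instance-lookup idea behind join-less
# co-location mining (a simplification, not the authors' algorithm). Instances
# are (feature_type, instance_id, x, y); neighbors are materialized once and
# size-2 co-location instances are then found by lookup rather than by joins.
from collections import defaultdict
from itertools import combinations
from math import hypot

def materialize_neighbors(instances, dist):
    """Materialize a neighbor list: instance -> nearby instances."""
    nbrs = defaultdict(list)
    for a, b in combinations(instances, 2):
        if hypot(a[2] - b[2], a[3] - b[3]) <= dist:
            nbrs[a].append(b)
            nbrs[b].append(a)
    return nbrs

def pairwise_participation_index(instances, dist):
    """Participation index of each feature-type pair via neighbor lookup."""
    nbrs = materialize_neighbors(instances, dist)
    count = defaultdict(int)          # instances of each feature type
    participants = defaultdict(set)   # (pair, type) -> participating instance ids
    for inst in instances:
        count[inst[0]] += 1
    for a in instances:
        for b in nbrs[a]:
            if a[0] < b[0]:                       # each unordered pair once
                pair = (a[0], b[0])
                participants[(pair, a[0])].add(a[1])
                participants[(pair, b[0])].add(b[1])
    pi = {}
    for (pair, ftype), ids in participants.items():
        ratio = len(ids) / count[ftype]           # participation ratio
        pi[pair] = min(pi.get(pair, 1.0), ratio)  # PI = min over the two types
    return pi

# Usage: two "M" and two "B" points, each pair within 1 distance unit.
data = [("M", 1, 0, 0), ("M", 2, 5, 5), ("B", 1, 0.5, 0), ("B", 2, 5.2, 5)]
print(pairwise_participation_index(data, dist=1.0))  # {('B', 'M'): 1.0}
```
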
Item: A Spatial Semi-supervised Learning Method for Mining Multi-spectral Remote Sensing Imagery (2004-03-01)
Authors: Vatsavai, Ranga R.; Shekhar, Shashi; Burk, Thomas E.
Supervised learning, which is often used in land cover (thematic) classification of remote sensing imagery, has two limitations: first, these techniques require large amounts of accurate training data to estimate the underlying statistical model parameters; second, the independent and identically distributed (i.i.d.) assumption made by these techniques does not hold for high-resolution satellite images. Recently, semi-supervised learning techniques that utilize large unlabeled training samples in conjunction with small labeled training data have become popular in machine learning, especially in text data mining. These techniques provide a viable solution to the small-training-dataset problem; however, they do not exploit spatial context. In this paper we explore methods that utilize unlabeled samples in supervised learning for classification of multi-spectral remote sensing imagery, while also taking the spatial context into account in the learning process. We extended the classical Expectation-Maximization (EM) technique to model spatial context via Markov Random Fields (MRF). We have conducted several experiments on real data sets, and our classification procedure shows an improvement of 10% in overall classification accuracy. Further studies are necessary to assess the true potential and usefulness of this technique in varying geographic settings.
Keywords: MAP, MLE, EM, spatial context, auto-correlation, MRF, semi-supervised learning, mixture models

Item: A Unified Approach to Spatial Outlier Detection (2001-12-10)
Authors: Shekhar, Shashi; Lu, Chang-tien; Zhang, Pusheng
Spatial outliers represent locations which are significantly different from their neighborhoods even though they may not be significantly different from the entire population. Identification of spatial outliers can lead to the discovery of unexpected, interesting, and implicit knowledge, such as local instability. In this paper, we first provide a general definition of S-outliers for spatial outliers. This definition subsumes the traditional definitions of spatial outliers. Second, we characterize the computation structure of spatial outlier detection methods and present scalable algorithms. Third, we provide a cost model of the proposed algorithms. Finally, we provide experimental evaluations of our algorithms using a Minneapolis-St. Paul (Twin Cities) traffic data set.
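
A minimal sketch of the neighborhood-comparison idea behind the spatial outlier entry above: each location's attribute value is compared with the mean of its neighbors, and locations with extreme standardized differences are flagged. The test statistic and the threshold below are simplifications for illustration, not the paper's exact S-outlier test.

```python
# A minimal sketch of neighborhood-based spatial outlier detection in the
# spirit of the S-outlier definition above (a simplification, not the paper's
# exact test). For each location, compare its attribute value with the mean of
# its neighbors and flag locations whose standardized difference is extreme.
from statistics import mean, pstdev

def spatial_outliers(values, neighbors, theta=2.0):
    """values: location -> attribute value; neighbors: location -> list of
    neighboring locations; theta: z-score threshold (hypothetical default)."""
    diffs = {loc: values[loc] - mean(values[n] for n in nbrs)
             for loc, nbrs in neighbors.items() if nbrs}
    mu, sigma = mean(diffs.values()), pstdev(diffs.values())
    if sigma == 0:
        return []
    return [loc for loc, d in diffs.items() if abs((d - mu) / sigma) > theta]

# Usage: a traffic-like example where station "s3" reports an unusual volume.
volumes = {"s1": 100, "s2": 105, "s3": 400, "s4": 98, "s5": 102}
adjacency = {"s1": ["s2", "s3"], "s2": ["s1", "s3"], "s3": ["s2", "s4"],
             "s4": ["s3", "s5"], "s5": ["s4"]}
print(spatial_outliers(volumes, adjacency, theta=1.5))  # ['s3']
```
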
Item: An Introduction to Spatial Data Mining (2018-08-08)
Authors: Golmohammadi, Jamal; Xie, Yiqun; Gupta, Jayant; Li, Yan; Cai, Jiannan; Detor, Samantha; Roh, Abigail; Shekhar, Shashi
The goal of spatial data mining is to discover potentially useful, interesting, and non-trivial patterns from spatial datasets. Spatial data mining is important for societal applications in public health, public safety, agriculture, environmental science, climate, etc. For example, in epidemiology, spatial data mining helps to find areas with high concentrations of disease incidents in order to manage disease outbreaks. Computerized methods are needed to discover spatial patterns since the volume and velocity of spatial data exceed the number of human experts available to analyze it. In addition, spatial data has unique characteristics like spatial autocorrelation and spatial heterogeneity which violate the i.i.d. (independent and identically distributed samples) assumption of traditional statistics and data mining methods. So, using traditional methods may miss patterns or yield spurious patterns, which are costly (e.g., stigmatization) in spatial applications. There are also other intrinsic challenges such as MAUP (the Modifiable Areal Unit Problem), as illustrated by a current court case debating gerrymandering in elections. Spatial data mining considers the unique characteristics and challenges of spatial data and the domain knowledge of the target application to discover more accurate and interesting patterns. In this article, we discuss tools and computational methods of spatial data mining, focusing on the primary spatial pattern families: hotspot detection, colocation detection, spatial prediction, and spatial outlier detection. Hotspot detection methods use domain information to accurately model more active and high-density areas. Colocation detection methods find objects whose instances are in proximity to each other. Spatial prediction approaches explicitly model the neighborhood relationships of locations to predict target variables from input features. The goal of spatial outlier detection methods is to find data that are different from their neighbors.

Item: Archival of Traffic Data: Phase II (1998-12)
Authors: Shekhar, Shashi; Tan, Xinhong
Traffic centers gather information from traffic sensors at regular intervals, but storing the data for future analysis becomes an issue. This report details work to improve the speed and effectiveness of traffic databases. In this project phase, researchers redesigned the data model based on the previous phase's data model and decreased the storage requirements by one-third. Researchers developed a web-based Graphical User Interface (GUI) for users to specify the query of interest; the outcome of the performance tuning gave users reasonable response times. The beneficiaries of this effective database would include the driving public, traffic engineers, and researchers, who are generally not familiar with the query language used in the database management system. This report also provides detailed reference material, such as benchmark queries, sample data, table schemas, conversion code, and other information.

Item: Capacity Constrained Routing Algorithms for Evacuation Planning: A Summary of Results (2005-05-31)
Authors: Lu, Qingsong; George, Betsy; Shekhar, Shashi
Evacuation planning is critical for numerous important applications, e.g. disaster emergency management and homeland defense preparation. Efficient tools are needed to produce evacuation plans that identify routes and schedules to evacuate affected populations to safety in the event of natural disasters or terrorist attacks. The existing linear programming approach uses time-expanded networks to compute the optimal evacuation plan and requires a user-provided upper bound on evacuation time. It suffers from high computational cost and may not scale up to large transportation networks in urban scenarios. In this paper we present a heuristic algorithm, namely the Capacity Constrained Route Planner (CCRP), which produces sub-optimal solutions for the evacuation planning problem. CCRP models capacity as a time series and uses a capacity constrained routing approach to incorporate route capacity constraints. It addresses the limitations of the linear programming approach by using only the original evacuation network, and it does not require prior knowledge of evacuation time. Performance evaluation on various network configurations shows that the CCRP algorithm produces high-quality solutions and significantly reduces the computational cost compared to the linear programming approach that produces optimal solutions. CCRP is also scalable with respect to the number of evacuees and the size of the network. We also provide a discussion of the formulation of a new optimal algorithm that uses A* search to find the optimal solution for evacuation planning. We prove that the heuristic function used in this A* formulation is monotone and admissible.

Item: Capacity Constrained Routing Algorithms for Evacuation Route Planning (2006-05-04)
Authors: Lu, Qingsong; George, Betsy; Shekhar, Shashi
Evacuation route planning identifies paths in a given transportation network to minimize the time needed to move vulnerable populations to safe destinations. Evacuation route planning is critical for numerous important applications like disaster emergency management and homeland defense preparation. It is computationally challenging because the number of evacuees often far exceeds the capacity, i.e., the number of people that can move along the road segments in a unit of time. Linear Programming (LP) based methods using time-expanded networks can take hours to days of computation for metropolitan-sized problems. In this paper, we propose a new approach, namely a capacity constrained route planner, which models capacity as a time series and generalizes shortest path algorithms to incorporate capacity constraints. We characterize the design space for reducing the computational cost. An analytical cost model and experimental results show that the proposed algorithm is faster than the LP based algorithms and requires less memory. Experimental evaluation using various network configurations also shows that the proposed algorithm produces solutions that are comparable to those produced by LP based algorithms while significantly reducing the computational cost.
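
The core CCRP idea in the two entries above, modeling edge capacity as a time series and routing one group at a time over the quickest path that still has capacity, can be sketched as follows. This is a deliberately simplified illustration (single-unit groups, no waiting at intermediate nodes, a fixed time horizon), not the authors' CCRP algorithm.

```python
# A deliberately simplified, pure-Python sketch of the core CCRP idea:
# edge capacity is modeled as a time series (capacity left per time step),
# a quickest available path is found for one group at a time, and the
# capacities it consumes are reserved before the next group is routed.
import heapq

def quickest_path(graph, cap, sources, dests, start_t, horizon):
    """graph: u -> list of (v, travel_time); cap: (u, v, t) -> units left.
    Returns (arrival_time, path) of an earliest-arrival path with capacity."""
    pq = [(start_t, s, [s]) for s in sources]
    best = {}
    while pq:
        t, u, path = heapq.heappop(pq)
        if u in dests:
            return t, path
        if best.get(u, horizon + 1) <= t:
            continue
        best[u] = t
        for v, tt in graph.get(u, []):
            if t + tt <= horizon and cap.get((u, v, t), 0) > 0:
                heapq.heappush(pq, (t + tt, v, path + [v]))
    return None, None

def ccrp_like(graph, cap, sources, dests, evacuees, horizon=100):
    """Route evacuees one unit at a time, reserving edge capacity over time."""
    plans = []
    for _ in range(evacuees):
        t, path = quickest_path(graph, cap, sources, dests, 0, horizon)
        if path is None:
            break                                  # no capacity left in horizon
        clock = 0
        for u, v in zip(path, path[1:]):           # reserve the used capacity
            cap[(u, v, clock)] -= 1
            clock += dict(graph[u])[v]
        plans.append((t, path))
    return plans

# Usage: two evacuees; capacity of 1 per time step forces the second to detour.
g = {"A": [("B", 1), ("C", 2)], "B": [("D", 1)], "C": [("D", 1)]}
c = {(u, v, t): 1 for u in g for v, _ in g[u] for t in range(10)}
print(ccrp_like(g, c, {"A"}, {"D"}, evacuees=2, horizon=10))
```
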
Item: Cascading spatio-temporal pattern discovery (2011-05-02)
Authors: Mohan, Pradeep; Shekhar, Shashi; Shine, James A.; Rogers, James P.
Given a collection of Boolean spatio-temporal (ST) event types, the cascading spatio-temporal pattern (CSTP) discovery process finds partially ordered subsets of these event types whose instances are located together and occur serially. For example, analysis of crime datasets may reveal frequent occurrence of misdemeanors and drunk driving after and near bar closings on weekends, as well as after and near large gatherings such as football games. Discovering CSTPs from ST datasets is important for application domains such as public safety (e.g. identifying crime attractors and generators) and natural disaster planning (e.g. preparing for hurricanes). However, CSTP discovery presents multiple challenges; three important ones are (1) the exponential cardinality of candidate patterns with respect to the number of event types, (2) the computationally complex ST neighborhood enumeration required to evaluate the interest measure, and (3) the difficulty of balancing computational complexity and statistical interpretation. Current approaches for ST data mining focus on mining totally ordered sequences or unordered subsets. In contrast, our recent work explores partially ordered patterns. Recently, we represented CSTPs as directed acyclic graphs; proposed a new interest measure, the cascade participation index; outlined the general structure of a cascading spatio-temporal pattern miner (CSTPM); evaluated filtering strategies to enhance computational savings using a real-world crime dataset; and proposed a nested-loop based CSTPM to address the challenge posed by the exponential cardinality of candidate patterns. This paper adds to our recent work by offering a new computational insight, namely, that the computational bottleneck for CSTP discovery lies in the interest measure evaluation. With this insight, we propose a new CSTPM based on spatio-temporal partitioning that significantly lowers the cost of interest measure evaluation. Analytical evaluation shows that our new CSTPM is correct and complete. Results from a significant amount of new experimental evaluation with both synthetic and real data show that our new ST-partitioning based CSTPM outperforms the CSTPM from our previous work. We also present a case study that verifies the applicability of the CSTP discovery process.

Item: Cascading Spatio-temporal pattern discovery: A summary of results (2010-01-14)
Authors: Mohan, Pradeep; Shekhar, Shashi; Shine, James A.; Rogers, James P.
Given a collection of Boolean spatio-temporal (ST) event types, the cascading spatio-temporal pattern (CSTP) discovery process finds partially ordered subsets of event types whose instances are located together and occur in stages. For example, analysis of crime datasets may reveal frequent occurrence of misdemeanors and drunk driving after bar closings on weekends and after large gatherings such as football games. Discovering CSTPs from ST datasets is important for application domains such as public safety (e.g. crime attractors and generators) and natural disaster planning (e.g. hurricanes). However, CSTP discovery is challenging for several reasons, including both the lack of computationally efficient, statistically meaningful metrics to quantify interestingness, and the large cardinality of candidate pattern sets, which is exponential in the number of event types. Existing literature for ST data mining focuses on mining totally ordered sequences or unordered subsets. In contrast, this paper models CSTPs as partially ordered subsets of Boolean ST event types. We propose a new CSTP interest measure (the Cascade Participation Index) that is computationally cheap (O(n^2) vs. exponential, where n is the dataset size) as well as statistically meaningful. We propose a novel algorithm exploiting the ST nature of datasets and evaluate filtering strategies to quickly prune uninteresting candidates. We present a case study to find CSTPs from real crime reports and provide a statistical explanation. Experimental results indicate that the proposed multiresolution spatio-temporal (MST) filtering strategy leads to significant savings in computational costs.

Item: Consistency Checking for Euclidean Spatial Constraints: A Dimension Graph Approach (2000-07-11)
Authors: Liu, Xuan; Shekhar, Shashi; Chawla, Sanjay
In this paper, we address the problem of consistency checking for Euclidean spatial constraints. A dimension graph representation is proposed to maintain the Euclidean spatial constraints among objects. The basic idea is to project the spatial constraints onto the X and Y dimensions and construct a dimension graph on each dimension. Using the dimension graph representation, the problem of consistency checking is transformed into a graph cycle detection problem. Consistency checking can then be achieved with O(N+E) time and space complexity, where N is the number of spatial objects and E is the number of spatial predicates in the constraint. The proposed approach to consistency checking for spatial constraints is faster than O(N^2) when the number of predicates is much smaller than N^2 and there are few disjunctions in the spatial constraint. The dimension graph and consistency checking algorithm can be used for points, intervals, and polygons in 2-dimensional space. The algorithm guarantees global consistency.
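
The dimension-graph approach above lends itself to a short sketch: each constraint is projected onto the X and Y dimensions as a precedence edge, and an inconsistency is reported when either dimension graph contains a cycle. The constraint vocabulary used below is illustrative (not the paper's), and equality constraints and disjunctions are not handled.

```python
# A minimal sketch of the dimension-graph idea: project each spatial
# constraint onto the X and Y dimensions as precedence edges, then report an
# inconsistency if either dimension graph contains a cycle.
from collections import defaultdict

def build_dimension_graphs(constraints):
    """constraints: list of (a, relation, b) over object names."""
    gx, gy = defaultdict(set), defaultdict(set)
    for a, rel, b in constraints:
        if rel == "left_of":    gx[a].add(b)   # a.x < b.x
        elif rel == "right_of": gx[b].add(a)
        elif rel == "below":    gy[a].add(b)   # a.y < b.y
        elif rel == "above":    gy[b].add(a)
    return gx, gy

def has_cycle(graph):
    """Iterative DFS cycle detection; O(N + E) time and space."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    for start in list(graph):
        if color[start] != WHITE:
            continue
        stack = [(start, iter(graph[start]))]
        color[start] = GRAY
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                color[node] = BLACK
                stack.pop()
            elif color[nxt] == GRAY:
                return True                     # back edge -> cycle
            elif color[nxt] == WHITE:
                color[nxt] = GRAY
                stack.append((nxt, iter(graph[nxt])))
    return False

def consistent(constraints):
    gx, gy = build_dimension_graphs(constraints)
    return not has_cycle(gx) and not has_cycle(gy)

# Usage: A left of B, B left of C is fine; adding C left of A contradicts on X.
print(consistent([("A", "left_of", "B"), ("B", "left_of", "C")]))        # True
print(consistent([("A", "left_of", "B"), ("B", "left_of", "C"),
                  ("C", "left_of", "A")]))                               # False
```
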
Item: Context Inclusive Function Evaluation: A Case Study with EM-Based Multi-Scale Multi-Granular Image Classification (2008-07-30)
Authors: Gandhi, Vijay; Kang, James; Shekhar, Shashi; Ju, Junchang; Kolaczyk, Eric D.; Gopal, Sucharita
Many statistical queries, such as maximum likelihood estimation, involve finding the best candidate model given a set of candidate models and a quality estimation function. This problem is common in important applications like land-use classification at multiple spatial resolutions from remote sensing raster data. Such a problem is computationally challenging due to the significant computation cost of evaluating the quality estimation function for each candidate model. For example, a recently proposed method for multi-scale, multi-granular classification incurs high computational overhead because it evaluates the function for each candidate model independently before comparison. In contrast, we propose an upper-bound based, context-inclusive approach that reduces computational overhead based on the context, i.e. the value of the quality estimation function for the best candidate model so far. We also prove that an upper bound exists for each candidate model and that the proposed algorithm is correct. Experimental results using land-use classification at multiple spatial resolutions from satellite imagery show that the proposed approach reduces the computational cost significantly.
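
The upper-bound based, context-inclusive evaluation described above can be sketched as a simple pruning loop: a cheap upper bound is compared against the best quality value found so far, and the expensive quality function is evaluated only when the bound cannot rule the candidate out. The toy quality and bound functions below are placeholders, not the EM-based classification models of the paper.

```python
# A minimal sketch of context-inclusive evaluation with upper bounds: the
# expensive quality function is only evaluated fully when a cheap upper bound
# beats the best value seen so far.
def best_model(candidates, upper_bound, full_quality):
    """candidates: iterable of models; upper_bound(m) must be >= full_quality(m)."""
    best, best_val = None, float("-inf")
    for m in candidates:
        if upper_bound(m) <= best_val:   # context: current best prunes m cheaply
            continue
        val = full_quality(m)            # expensive evaluation only when needed
        if val > best_val:
            best, best_val = m, val
    return best, best_val

# Usage with toy models: quality = score - penalty, bound ignores the penalty.
models = [{"name": "coarse", "score": 3.0, "penalty": 0.5},
          {"name": "medium", "score": 5.0, "penalty": 1.0},
          {"name": "fine",   "score": 4.0, "penalty": 3.0}]
ub = lambda m: m["score"]                        # cheap, never underestimates
fq = lambda m: m["score"] - m["penalty"]         # stands in for the costly step
print(best_model(models, ub, fq))                # ({'name': 'medium', ...}, 4.0)
```
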
Item: Contraflow Transportation Network Reconfiguration for Evacuation Route Planning (2006-06-01)
Authors: Shekhar, Shashi; Kim, Sangho
Given a transportation network having source nodes with evacuees and destination nodes, we want to find a contraflow network configuration, i.e., an ideal direction for each edge, to minimize evacuation time. Contraflow is considered a potential remedy to reduce congestion during evacuations in the context of homeland security and natural disasters (e.g., hurricanes). This problem is computationally challenging because of the very large search space and the expensive calculation of evacuation time on a given network. To our knowledge, this paper presents the first macroscopic approaches to contraflow network reconfiguration that incorporate road capacity constraints, multiple sources, congestion factors, and scalability. We formally define the contraflow problem based on graph theory and provide a framework of computational structure to classify our approaches. A Greedy heuristic is designed to produce high-quality solutions with significant performance gains. A Bottleneck Relief heuristic is developed to deal with large numbers of evacuees. We evaluate the proposed approaches both analytically and experimentally using real-world datasets. Experimental results show that our contraflow approaches can reduce evacuation time by 40% or more.

Item: Correlation Analysis of Spatial Time Series Datasets: A Filter-and-Refine Approach (2002-12-03)
Authors: Zhang, Pusheng; Huang, Yan; Shekhar, Shashi; Kumar, Vipin
A spatial time series dataset is a collection of time series, each referencing a location in a common spatial framework. Correlation analysis is often used to identify pairs of interacting elements from the cross product of two spatial time series datasets. However, the computational cost of correlation analysis is very high when the dimension of the time series and the number of locations in the spatial frameworks are large. The key contribution of this paper is the use of spatial autocorrelation among spatially neighboring time series to reduce the computational cost. A filter-and-refine algorithm based on coning, i.e., grouping of locations, is proposed to reduce the cost of correlation analysis over a pair of spatial time series datasets. Cone-level correlation computation can be used to eliminate (filter out) a large number of element pairs whose correlation is clearly below (or above) a given threshold. Element-pair correlation then needs to be computed only for the remaining pairs. Using algebraic cost models and experimental studies with Earth science datasets, we show that the filter-and-refine approach can save a large fraction of the computational cost, particularly when the minimal correlation threshold is high.
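
The coning-based filter-and-refine idea above can be sketched as follows: time series are grouped into cones, a cheap correlation between cone centroid series is computed first, and only promising cone pairs are refined with exact element-pair correlations. Note that the paper derives provable cone-level bounds, whereas the fixed slack tolerance below is a simplifying assumption.

```python
# A minimal sketch of the coning-based filter-and-refine idea: cone-level
# correlations between centroid series filter out cone pairs before exact
# per-pair correlations are computed in the refine step.
from statistics import mean, pstdev

def corr(x, y):
    """Pearson correlation using population standard deviations."""
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

def centroid(series_list):
    """Element-wise mean series of a group (cone) of time series."""
    return [mean(vals) for vals in zip(*series_list)]

def filter_and_refine(cones_a, cones_b, threshold, slack=0.2):
    """cones_*: dict cone_id -> {location_id: time series}. Returns highly
    correlated (location_a, location_b) pairs across the two datasets."""
    results = []
    for ca, series_a in cones_a.items():
        for cb, series_b in cones_b.items():
            # Filter: skip cone pairs whose centroid correlation is clearly low.
            if corr(centroid(series_a.values()), centroid(series_b.values())) \
                    < threshold - slack:
                continue
            # Refine: exact correlation for every element pair in the cone pair.
            for la, sa in series_a.items():
                for lb, sb in series_b.items():
                    if corr(sa, sb) >= threshold:
                        results.append((la, lb))
    return results

# Usage: two small "cones" per dataset with three-step time series each.
temps = {"cone1": {"t1": [1, 2, 3], "t2": [2, 3, 4]},
         "cone2": {"t3": [3, 1, 2]}}
precip = {"coneA": {"p1": [2, 4, 6]}, "coneB": {"p2": [5, 1, 3]}}
print(filter_and_refine(temps, precip, threshold=0.9))
# [('t1', 'p1'), ('t2', 'p1'), ('t3', 'p2')]
```
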
Item: Crime Hotspot Detection: A Computational Perspective (2016-09-01)
Authors: Eftelioglu, Emre; Tang, Xun; Shekhar, Shashi
Given a set of crime locations, a statistically significant crime hotspot is an area where the concentration of crimes inside is significantly higher than outside. The motivation for crime hotspot detection is twofold: detecting crime hotspots to focus the deployment of police enforcement, and predicting the potential residence of a serial criminal. Crime hotspot detection is computationally challenging due to the difficulty of enumerating all potential hotspot areas, selecting an interest measure to compare these with the overall crime intensity, and testing for statistical significance to reduce chance patterns. This chapter focuses on statistically significant crime hotspots. First, the foundations of spatial scan statistics and their application (i.e. SaTScan) to circular hotspot detection are reviewed. Next, ring-shaped hotspot detection is introduced. Third, linear hotspot detection is described, since most crimes occur along a road network. The chapter concludes with future research directions in crime hotspot detection.

Item: Crime pattern analysis: A spatial frequent pattern mining approach (2012-05-10)
Authors: Shekhar, Shashi; Mohan, Pradeep; Oliver, Dev; Zhou, Xun
Crime pattern analysis (CPA) is the process of analytical reasoning facilitated by an understanding of the nature of an underlying spatial framework that generates crime. For example, law enforcement agencies may seek to identify regions of sudden increase in crime activity, namely, crime outbreaks. Many analytical tools facilitate this reasoning process by providing support for techniques such as hotspot analysis. However, in practice, police departments are desirous of scalable tools for existing techniques and of new insights, including the interaction between different crime types. Identifying new insights using scalable tools may help reduce the human effort required in CPA. Formally, given a spatial crime dataset and other information familiar to law enforcement agencies, the CPA process identifies interesting, potentially useful and previously unknown crime patterns. For example, analysis of an urban crime dataset might reveal that downtown bars frequently lead to assaults just after bar closing. However, CPA is challenging due to: (a) the large size of crime datasets, and (b) a potentially large collection of interesting crime patterns. This chapter explores spatial frequent pattern mining (SFPM), a spatial data driven approach for CPA, and describes SFPM in the context of one type of CPA, outbreak detection. We present a case study to discover interesting, useful and non-trivial crime outbreaks in a dataset from Lincoln, NE. A review of emerging trends and new research needs in CPA methods for outbreak detection is also presented.

Item: Data Mining and Visualization of Twin-Cities Traffic Data (2001-03-08)
Authors: Shekhar, Shashi; Lu, Chang-tien; Chawla, Sanjay; Zhang, Pusheng
Data Mining (DM) is the process of extracting implicit, valuable, and interesting information from large sets of data. As huge amounts of data have been stored in traffic and transportation databases, data warehouses, geographic information systems, and other information repositories, data mining is receiving substantial interest from both academia and industry. The Twin-Cities traffic archival stores sensor network measurements collected from the freeway system in the Twin-Cities metropolitan area. In this paper, we construct a traffic data warehousing model which facilitates on-line analytical processing (OLAP), employ current data mining techniques to analyze the Twin-Cities traffic data set, and visualize the discoveries on the highway map. We also discuss some research issues in mining traffic and transportation data.

Item: Detecting Graph-based Spatial Outliers: Algorithms and Applications (2001-03-08)
Authors: Shekhar, Shashi; Lu, Chang-tien; Zhang, Pusheng
Identification of outliers can lead to the discovery of unexpected, interesting, and implicit knowledge. Existing methods are designed for detecting spatial outliers in multidimensional geometric data sets, where a distance metric is available. In this paper, we focus on detecting spatial outliers in graph-structured data sets. We define tests for spatial outliers in graph-structured data sets, analyze the statistical foundation underlying our approach, design a fast algorithm to detect spatial outliers, and provide a cost model for outlier detection procedures. In addition, we provide experimental results from the application of our algorithm on a Minneapolis-St. Paul (Twin Cities) traffic dataset to show its effectiveness and usefulness.
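
Finally, the circular hotspot detection reviewed in the Crime Hotspot Detection entry above can be illustrated with a small sketch that enumerates candidate circles around crime locations and ranks them by an inside/outside density ratio. A full spatial scan statistic (e.g., SaTScan) would use a likelihood ratio and Monte Carlo significance testing, which are omitted here.

```python
# A minimal sketch relating to circular hotspot detection: candidate circles
# centered on crime locations are enumerated and ranked by a simple
# inside/outside density ratio. Significance testing is omitted.
from math import hypot, pi

def circular_hotspot(points, area, radii=(1.0, 2.0)):
    """points: list of (x, y) crime locations; area: study-area size.
    Returns (center, radius, density_ratio) of the best candidate circle."""
    n = len(points)
    best = None
    for cx, cy in points:                     # candidate centers
        for r in radii:                       # candidate radii (assumed fixed)
            inside = sum(1 for x, y in points if hypot(x - cx, y - cy) <= r)
            outside = n - inside
            circle_area = pi * r * r
            if outside == 0 or area <= circle_area:
                continue
            ratio = (inside / circle_area) / (outside / (area - circle_area))
            if best is None or ratio > best[2]:
                best = ((cx, cy), r, ratio)
    return best

# Usage: a cluster near (0, 0) plus scattered points in a 20 x 20 study area.
crimes = [(0, 0), (0.5, 0.2), (-0.3, 0.4), (0.1, -0.5), (8, 9), (15, 3)]
print(circular_hotspot(crimes, area=400.0))
```
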