Browsing by Author "Srivastava, Jaideep"
Now showing 1 - 20 of 48
Item: A Comparative Study on Web Prefetching (2001-05-31)
Bhushan Pandey, Ajay; Vatsavai, Ranga R.; Ma, Xiaobin; Srivastava, Jaideep; Shekhar, Shashi
The growth of the World Wide Web has emphasized the need for reduced user latency. Increasing use of dynamic pages, frequent changes in site structure, and varied user access patterns on the internet have limited the efficacy of caching techniques and emphasized the need for prefetching. Since prefetching increases bandwidth usage, it is important that the prediction model be highly accurate and computationally feasible. It has been observed that in a web environment, certain sets of pages exhibit stronger correlations than others, a fact which can be used to predict future requests. Previous studies on predictive models are mainly based on pairwise interactions of pages and TOP-N approaches. In this paper we study a model based on page interactions of higher order, where we exploit set relationships among the pages of a web site. We also compare the performance of this approach with models based on pairwise interaction and the TOP-N approach. We have conducted a comparative study of these models on a real server log and five synthetic logs with varying page frequency distributions to simulate different real-life web sites, and have identified dominance zones for each of these models. We find that the model based on higher-order page interaction is more robust and gives competitive performance in a variety of situations.

Item: A Hazard Based Approach to User Return Time Prediction (2013-11-18)
Kapoor, Komal; Sun, Mingxuan; Srivastava, Jaideep; Ye, Tao
In the competitive environment of the internet, retaining and growing one's user base is of major concern to most web services. Furthermore, the economic model of many web services is to allow free access to most content and to generate revenue through advertising. This unique model requires securing user time on a site rather than the purchase of goods.
Hence, it is crucially important to create new kinds of metrics and solutions for growth and retention efforts for web services. In this work, we first propose a new retention metric for web services, concentrating on the rate of user return. Secondly, we apply predictive analysis to the proposed retention metric on a service. Finally, we set up a simple yet effective framework to evaluate a multitude of factors that contribute to user return. Specifically, we define the problem of return time prediction for free web services. Our solution is based on Cox's proportional hazards model from survival analysis. The hazard-based approach offers several benefits, including the ability to work with censored data, to model the dynamics in user return rates, and to easily incorporate different types of covariates. We compare the performance of our hazard-based model, in predicting user return time and in categorizing users into buckets based on their predicted return time, against several baseline regression and classification methods, and find the hazard-based approach to far surpass our baselines.

Item: A Multi-Step Framework for Detecting Attack Scenarios (2006-02-21)
Shaneck, Mark; Chandola, Varun; Liu, Haiyang; Choi, Changho; Simon, Gyorgy; Eilertson, Eric; Kim, Yongdae; Zhang, Zhi-Li; Srivastava, Jaideep; Kumar, Vipin
With growing dependence upon interconnected networks, defending these networks against intrusions is becoming increasingly important. In the case of attacks that are composed of multiple steps, detecting the entire attack scenario is of vital importance. In this paper, we propose an analysis framework that is able to detect these scenarios with little predefined information. The core of the system is the decomposition of the analysis into two steps: first detecting a few events in the attack with high confidence, and second, expanding from these events to determine the remainder of the events in the scenario.
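The survival-analysis idea behind "A Hazard Based Approach to User Return Time Prediction" above can be illustrated in miniature. The sketch below fits a constant (exponential) hazard by maximum likelihood rather than the paper's full Cox model, but it shows the key benefit the abstract mentions: censored observations (users who had not yet returned when observation ended) still contribute their time at risk. All names and numbers are hypothetical.

```python
def fit_exponential_hazard(durations, observed):
    # MLE for a constant hazard rate under right-censoring:
    # lambda = (# observed return events) / (total time at risk).
    # Censored users contribute time at risk but no event.
    events = sum(observed)
    total_time = sum(durations)
    return events / total_time

durations = [2.0, 5.0, 1.0, 7.0, 3.0]  # days until return (or last observation)
observed  = [1, 1, 1, 0, 1]            # 0 = censored: user never returned
lam = fit_exponential_hazard(durations, observed)
expected_return = 1.0 / lam            # mean of an Exponential(lambda) variable
```

Under an exponential hazard the predicted return time is simply 1/lambda; the Cox model generalizes this by letting covariates rescale a shared baseline hazard.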
Our experiments show that we can accurately identify the majority of the steps contained within the attack scenario with relatively few false positives. Our framework can handle sophisticated attacks that are highly distributed, try to avoid standard pre-defined attack patterns, use cover traffic or "noisy" attacks to distract analysts and draw attention away from the true attack, and attempt to avoid detection by signature-based schemes through the use of novel exploits or mutation engines.

Item: Churn Prediction in MMORPGs: A Social Influence Based Approach (2009-05-20)
Kawale, Jaya; Pal, Aditya; Srivastava, Jaideep
Massively Multiplayer Online Role Playing Games (MMORPGs) are computer-based games in which players interact with one another in a virtual world. Worldwide revenues for MMORPGs have seen amazing growth in the last few years; by current estimates they are a more than two-billion-dollar industry. This huge revenue potential has attracted several gaming companies to launch online role-playing games. One of the major problems these companies suffer, apart from fierce competition, is erosion of their customer base. Churn is a big problem for gaming companies, as churners impact "word-of-mouth" reports negatively for potential and existing customers, leading to further erosion of the user base. We study the problem of player churn in the popular MMORPG EverQuest II. The problem of churn prediction has been studied extensively in the past in various domains, and social network analysis has recently been applied to it to understand the effects of the strength of social ties and the structure and dynamics of a social network on churn. In this paper, we propose a churn prediction model based on examining social influence among players and their personal engagement in the game. We hypothesize that social influence is a vector quantity, with negative-influence and positive-influence components.
We propose a modified diffusion model to propagate the influence vector through the player's network, representing the social influence on the player from his network. We measure a player's personal engagement based on his activity patterns and use it in the modified diffusion model and in churn prediction. Our method for churn prediction, which combines social influence and player engagement factors, is shown to improve prediction accuracy significantly for our dataset as compared to prediction using the conventional diffusion model or the player engagement factor alone, validating our hypothesis that combining both factors can lead to more accurate churn prediction.

Item: Clinical Variable Relationship Evaluation using Decision Tree Rule Extraction (2014-11-10)
Sarkar, Chandrima; Srivastava, Jaideep
Evaluating associations and relationships between variables is a very challenging and important problem in the domain of medicine and clinical data analysis. Not many classification methods have been tried in the literature to tackle this problem. The decision tree is one of the most useful data mining methods here, as it implicitly performs variable screening or feature selection and requires relatively little effort from users for data preparation. However, a major drawback associated with the use of decision trees for decision making is their lack of interpretability, especially when using tools like Weka. Though decision trees can achieve a high predictive accuracy rate, the reasoning behind how they reach their decisions is not readily available. This problem can be handled easily if the decision tree is utilized by extracting and analyzing its rules. In this paper we present an approach for extracting rules from a decision tree which can be utilized for determining relationships between clinical variables.
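Decision-tree rule extraction of the kind the abstract above describes can be sketched by walking a fitted tree from root to leaves, collecting the split conditions along each path. The sketch below uses scikit-learn's tree internals as a stand-in (the paper itself discusses tools like Weka), and the iris dataset is only a placeholder for clinical variables.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

def extract_rules(model, feature_names):
    # Each root-to-leaf path becomes one rule:
    # (list of conditions, predicted class index).
    t = model.tree_
    rules = []
    def walk(node, conds):
        if t.children_left[node] == -1:  # leaf node
            rules.append((conds, int(t.value[node].argmax())))
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node],  conds + [f"{name} <= {thr:.2f}"])
        walk(t.children_right[node], conds + [f"{name} > {thr:.2f}"])
    walk(0, [])
    return rules

rules = extract_rules(clf, load_iris().feature_names)
```

Each extracted rule reads as "IF cond1 AND cond2 THEN class", which is the kind of compact, analyzable form the paper proposes tabulating for clinical variable relationships.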
Furthermore, we also discuss how these rules can be visualized in a compact and intuitive tabular format that facilitates easy analysis. We conclude that decision tree rule extraction can be considered a powerful analysis tool that facilitates the analysis of clinical variables and their associations.

Item: Concept-Aware Ranking: Teaching an Old Graph New Moves (2006-03-20)
DeLong, Colin; Mane, Sandeep; Srivastava, Jaideep
Ranking algorithms for web graphs, such as PageRank and HITS, are typically utilized for graphs where a node represents a unique URL (webpage) and an edge is an explicitly-defined link between two such nodes. In addition to the link structure itself, additional information about the relationship between nodes may be available. For example, anchor text in a Web graph is likely to provide information about the underlying concepts connecting URLs. In this paper, we propose an extension to the Web graph model that takes into account conceptual information encoded by links, in order to construct a new graph which is sensitive to the conceptual links between nodes. By extracting keywords and recurring phrases from the anchor tag data, a set of concepts is defined. A new definition of a node (one which encodes both a URL and a concept) is then leveraged to create an entirely new Web graph, with edges representing both explicit and implicit conceptual links between nodes. In doing so, inter-concept relationships can be modeled and utilized when using graph ranking algorithms. This improves result accuracy by not only retrieving links which are more authoritative given a user's context, but also by utilizing a larger pool of web pages, limited by concept-space rather than keyword-space.
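One way the (URL, concept) node redefinition described above might be realized is sketched below. Concept extraction here is simply the lowercased anchor text, a crude stand-in for the keyword and phrase extraction the paper performs, and the link data is hypothetical.

```python
from collections import defaultdict

# Hypothetical link data: (source_url, target_url, anchor_text)
links = [
    ("/a", "/b", "machine learning"),
    ("/a", "/c", "machine learning"),
    ("/d", "/b", "statistics"),
]

def build_concept_graph(links):
    # Node = (url, concept). Each anchor joins the (source, concept)
    # node to the (target, concept) node for that anchor's concept,
    # so pages are separated by the concept under which they are cited.
    graph = defaultdict(set)
    for src, dst, anchor in links:
        concept = anchor.lower()  # crude concept extraction
        graph[(src, concept)].add((dst, concept))
    return graph

g = build_concept_graph(links)
```

A standard ranking algorithm run over `g` would then score each page once per concept, which is the property that lets inter-concept relationships influence the ranking.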
This is illustrated using webpages from the University of Minnesota's College of Liberal Arts websites.

Item: Coping with Excessive Memory Requirement Under I/O Bandwidth Congestion (1997)
Won, Youjip; Srivastava, Jaideep
In this paper, we investigate the buffer requirement for retrieving continuous media streams from the disk subsystem. Memory buffer is used to synchronize the asynchronous disk read operation and the synchronous playback operation. In supporting a set of continuous media playbacks, as the aggregate bandwidth required increases, a larger amount of buffer needs to be allocated. This characteristic originates from the increase in cycle length. It is a well-known fact that as disk utilization approaches 100%, the total buffer needed to support the playbacks increases extremely fast. Conservative estimation of disk usage is thus advised in designing a disk subsystem for a continuous media server. In practice, however, aggregate bandwidth may increase over the expected value and consume an excessive amount of buffer memory. The focus of this paper is an algorithm for coping with this excessive buffer requirement under bandwidth congestion. We argue that in a large-scale continuous media server, where the user access pattern is biased and there are frequent request arrivals, it is not necessary to maintain playback directly from the disk for each request when supporting a set of timely interleaved playbacks. We define two mechanisms to service a playback request, namely disk mode and memory mode. In memory mode, a request is supplied with data blocks that were loaded by a preceding request.
We develop an efficient algorithm to determine the optimal service mode for a set of playback requests, minimizing the overall buffer requirement.

Item: Correlation based Feature Selection using Rank aggregation for an Improved Prediction of Potentially Preventable Events (2013-06-12)
Sarkar, Chandrima; Desikan, Prasanna; Srivastava, Jaideep
This paper presents a methodology for developing a novel feature selection model that helps in a more accurate and robust prediction of patients at risk of Potentially Preventable Events (PPEs). PPEs are admissions, readmissions, complications, and emergency department visits that could have been avoided if the patient had been given the appropriate interventions. Various clinical factors and patient health conditions can affect a patient's chance of developing the risk of a PPE. We propose a robust Correlation-based feature selection method using Rank Aggregation (CRA) which helps to identify the key contributing factors for the prediction of PPEs. Unlike existing feature selection techniques that cause bias by using distinct statistical properties of data for feature evaluation, CRA uses rank aggregation, thus reducing this bias. The results indicate that the proposed technique is more robust across a wide range of classifiers and has higher accuracy than other traditional methods.

Item: Coverage based Proxy Placement for Content Distribution over the Internet (2002-11-05)
Varadarajan, Srivatsan; Harinath, Raja; Srivastava, Jaideep; Zhang, Zhi-Li
In an effort to differentiate service quality, service providers have resorted to employing Content Distribution Networks (CDNs) over the Internet. CDNs deploy geographically distributed proxy servers which manage content on behalf of the service provider's servers for better performance and enhanced availability. In this paper we explore the proxy placement problem for content distribution over the Internet.
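The rank-aggregation step at the heart of CRA, described in "Correlation based Feature Selection using Rank aggregation" above, can be sketched with a simple Borda count: each evaluation criterion ranks the features, and a feature's aggregate score sums its positions across criteria. The paper does not specify this exact aggregation scheme, and the rankings below are hypothetical.

```python
def borda_aggregate(rank_lists):
    # Each rank list orders feature indices best-to-worst.
    # A feature in position p of a list of n features scores n - p points.
    n = len(rank_lists[0])
    scores = [0] * n
    for ranking in rank_lists:
        for pos, feat in enumerate(ranking):
            scores[feat] += n - pos
    # Aggregated ranking: features sorted by descending total score
    return sorted(range(n), key=lambda f: -scores[f])

# Hypothetical rankings of 4 features under three different criteria
pearson_rank  = [0, 2, 1, 3]
spearman_rank = [2, 0, 1, 3]
mi_rank       = [0, 2, 3, 1]
agg = borda_aggregate([pearson_rank, spearman_rank, mi_rank])
```

Aggregating over several criteria this way is what reduces the bias of relying on any single statistical property of the data.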
Its goal is to strategically place a number of proxies in the network to optimize certain criteria which improve the performance of the proxies. We motivate and illustrate the various factors and constraints that need to be taken into account for a good placement of proxies over the Internet, which reflect real-world scenarios more accurately and which we claim have hitherto not been completely addressed. We introduce the novel concept of host coverage, characterizing every Autonomous System (AS), and use this stable, coarse-grained measure as a long-term estimate of the load being serviced by the proxy system. We validate its applicability through an Internet study. We then pose an optimal formulation of the proxy placement problem taking into consideration all the relevant factors. We propose a couple of proxy placement algorithms that solve the above problem and analyze their behavior. Finally we present the performance of those algorithms against the optimal solution and other schemes proposed in the literature. We also study the stability of the proposed algorithms through a variety of experiments. Keywords: Proxy Placement, Coverage, Internet, Content Distribution Network (CDN)

Item: Data Mining Based Predictive Models for Overall Health Indices (2010-04-14)
Rajkumar, Ridhima; Shim, Kyong Jin; Srivastava, Jaideep
In this study, we infer health care indices of individuals using their pharmacy medical and prescription claims. Specifically, we focus on the widely used Charlson Index. We use data mining techniques to formulate the problem of classifying the Charlson Index (CI) and build predictive models to predict individual health index scores. First, we present comparative analyses of several classification algorithms. Second, our study shows that certain ensemble algorithms lead to higher prediction accuracy in comparison to base algorithms.
Third, we introduce cost-sensitive learning to the classification algorithms and show that its inclusion leads to improved prediction accuracy. The resulting predictive models can be used to allocate health care resources to individuals. They are expected to help reduce the cost of health care resource allocation and provisioning, thereby allowing countries and communities that cannot afford high health care costs to provide health indices (coverage); to provide individuals with a health index that takes their overall health into consideration, thereby improving the quality of individual health assessment (quality); and to improve the reliability of decision making by focusing on a set of objective criteria for all individuals (reliability).

Item: Discovery of Interesting Usage Patterns from Web Data (1999-05-23)
Cooley, Robert; Tan, Pang-ning; Srivastava, Jaideep
Web Usage Mining is the approach of applying data mining techniques to large Web data repositories in order to extract usage patterns. As with many data mining application domains, the identification of patterns that are considered interesting is a problem that must be solved in addition to simply generating them. A necessary step in identifying interesting results is quantifying what is considered uninteresting in order to form a basis for comparison. Several research efforts have relied on manually generated sets of uninteresting rules. However, manual generation of a comprehensive set of evidence about beliefs for a particular domain is impractical in many cases. Generally, domain knowledge can be used to automatically create evidence for or against a set of beliefs. This paper develops a quantitative model based on support logic for determining the interestingness of discovered patterns. For Web Usage Mining, there are three types of domain information available: usage, content, and structure.
This paper also describes algorithms for using these three types of information to automatically identify interesting knowledge. These algorithms have been incorporated into the Web Site Information Filter (WebSIFT) system, and examples of interesting frequent itemsets automatically discovered from real Web data are presented.

Item: dSENSE: Data-driven Stochastic Energy Management for Wireless Sensor Platforms (2005-05-09)
Liu, Haiyang; Chandra, Abhishek; Srivastava, Jaideep
Wireless sensor networks are being widely deployed for providing physical measurements to diverse applications. Energy is a precious resource in such networks, as nodes in wireless sensor platforms are typically powered by batteries with limited power and high replacement cost. This paper presents dSENSE, a data-driven approach for energy management in sensor platforms. dSENSE is a node-level power management approach that utilizes knowledge of the underlying data streams as well as application data quality requirements to conserve energy on a sensor node. dSENSE employs sense-on-change, a sampling strategy that aggressively conserves power by reducing sensing activity on the sensor node. Unlike most existing energy management techniques, this strategy enables explicit control of the sensor along with the CPU and the radio. Our approach uses an efficient statistical data stream model to predict future sensor readings. These predictions are coupled with a stochastic scheduling algorithm to dynamically control the operating modes of the sensor node components.
Using experimental results obtained on PowerTOSSIM with a real-world data trace, we demonstrate that our approach reduces energy consumption by 29-36% while providing strong statistical guarantees on data quality.

Item: Error Spreading: A Perception-Driven Approach Orthogonal to Error Handling in Continuous Media Streaming (1999-07-21)
Varadarajan, Srivatsan; Ngo, Hung Q.; Srivastava, Jaideep
With the growing popularity of the Internet, there is increasing interest in using it for audio and video transmission. Periodic network overloads, leading to bursty packet losses, have always been a key problem for network researchers. In a long-haul, heterogeneous network like the Internet, handling such errors becomes especially difficult. Perceptual studies of audio and video viewing have shown that bursty losses have the most annoying effect on people, and hence are a critical issue to address for applications such as Internet phone, video conferencing, distance learning, etc. Classical error handling techniques have focused on applications like FTP, and are geared towards ensuring that the transmission is correct, with no attention to timeliness. For isochronous traffic like audio and video, timeliness is a key criterion, and given the high degree of content redundancy, some loss of content is quite acceptable. In this paper we introduce the concept of error spreading, a transformation technique that takes the input sequence of packets (from an audio or video stream) and scrambles its packets before transmission. The packets are unscrambled at the receiving end. The transformation is designed to ensure that bursty losses in the transformed domain get spread over the whole sequence in the original domain. Our error spreading idea handles both the case where the stream has inter-frame dependencies and the case where it does not. Perceptual studies have shown that users are much more tolerant of a uniformly distributed loss of low magnitude.
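The scramble-and-unscramble transformation described above can be sketched with a strided permutation: packets are transmitted in an order where consecutive wire positions map to packets far apart in playback order, so a burst loss turns into scattered single losses. This particular permutation is an illustrative choice, not the paper's exact scheme, and it requires the stride and sequence length to be coprime.

```python
def spread(seq, stride):
    # Transmit packets in strided order so that consecutive losses on
    # the wire map to isolated losses in playback order.
    # Requires gcd(stride, len(seq)) == 1 for `order` to be a permutation.
    n = len(seq)
    order = [(i * stride) % n for i in range(n)]
    return [seq[i] for i in order], order

def unspread(received, order):
    # Invert the permutation at the receiving end.
    out = [None] * len(order)
    for pos, idx in enumerate(order):
        out[idx] = received[pos]
    return out

pkts = list(range(9))
tx, order = spread(pkts, 4)          # wire order: 0,4,8,3,7,2,6,1,5
restored = unspread(tx, order)
```

Losing wire positions 2-4, for example, drops playback packets 8, 3, and 7, which are non-adjacent: exactly the burst-to-uniform conversion the perceptual studies motivate.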
We next describe a continuous media transmission protocol based on this idea. We also show that our protocol can be used as a complement to other error handling protocols. Lastly, we validate its performance through a series of experiments and simulations. Keywords: Multimedia, network bursty error, permutation scheme.

Item: Exploration of Classification Techniques as a Treatment Decision Support Tool for Patients with Uterine Fibroids (2010-04-16)
Campbell, Kevin; Thygeson, Marcus N.; Srivastava, Jaideep; Speedie, Stuart
Uterine fibroids are benign growths in the uterus, for which there are several possible treatment options. Patients and physicians generally approach the decision process based on a combination of the patient's degree of discomfort, patient preferences, and physician practice patterns. In this paper, we examine the use of classification algorithms in combination with meta-learning algorithms as a decision support tool to facilitate more systematic fibroid treatment decisions. A model constructed from both Naive Bayes (with AdaBoost) and J48 (with bagging) algorithms gave the best results and could be a useful tool for patients making this decision.

Item: From Clicks to Bricks: CRM Lessons from E-commerce (2005-10-12)
Mane, Sandeep; Desikan, Prasanna; Srivastava, Jaideep
E-commerce allows a level of closeness in customer-to-store interaction that is far greater than imaginable in the physical world, leading to unprecedented data collection, especially about the 'process of shopping'. The desire to understand individual customers' behavior and psychology at a deeper level by mining this data has led to significant advances in online customer relationship management (e-CRM). Services like real-time recommendations, faster checkouts, and price/feature comparisons of products across different e-stores or brands have increased the general awareness of customers and made them more demanding.
Web mining is the software technology that has made this possible, by providing the means to automatically build sophisticated customer models from Web data collected at online stores. e-CRM has shown significant concrete benefits in customer experience and loyalty, leading to improved sales and profits. Physical stores have taken note of these benefits of e-CRM and are interested in exploring similar possibilities. A key barrier to applying e-CRM techniques to the physical world (p-CRM) has been the inability to collect detailed customer data in the p-CRM world at the same granularity and in as real-time a manner as in the e-CRM world. With new technologies like radio frequency identification (RFID) and handheld devices like personal digital assistants (PDAs) becoming affordable, these technologies are now being used in major stores for inventory management and/or anti-theft purposes. Based on the confluence of these factors, we posit that "given that such detailed knowledge of an individual customer's habits provides insight into his/her preferences and psychology, which can be used to develop a much higher level of trust in the customer-vendor relationship, the time is ripe for revisiting p-CRM to see what lessons learned from e-CRM are applicable." In this paper, we present a concrete proposal on how this can be done, and identify directions for future research.

Item: Grouping Web Page References Into Transactions for Mining World Wide Web Browsing Patterns (1997)
Cooley, Robert; Mobasher, Bamshad; Srivastava, Jaideep

Item: Human Perception of Media and Synchronization Losses (1997)
Wijesekera, Duminda; Srivastava, Jaideep; Foresti, Mark
Perception of multimedia quality, specified by quality of service (QoS) metrics, can be used by system designers to optimize customer satisfaction within resource bounds enforced by general-purpose computing platforms.
Media losses, rate variations, and transient synchronization losses have been speculated to affect human perception of multimedia quality. This paper presents metrics to measure such defects, and the results of a series of user experiments that justify such speculations. The results of the study provide bounds on losses, rate variations, and transient synchronization losses as a function of user satisfaction, in the form of Likert values. It is shown how these results can be used by algorithm designers of the underlying multimedia systems.

Item: I/O efficient computation of First Order Markov Measures for Large and Evolving Graphs (2008-07-21)
Desikan, Prasanna; Srivastava, Jaideep
First-order Markov measures, such as PageRank, have gained significance as relevance measures in domains where data is represented as a graph. The large scale of such graphs in the real world, such as the World Wide Web, has given rise to computational challenges for such measures. In this paper, we address the challenges of computing such first-order Markov measures, considering PageRank as the example. We address two challenging computational scenarios for PageRank: (a) computation for a large single graph at a given time instance, and (b) incremental computation for large evolving graphs. We achieve efficiency by reducing the problem size and reducing the number of iterations to compute. For (a) we bin the nodes into different partitions, and for the subgraph formed by each of these partitions we use the nature of the first-order Markov model to break down the computation. For (b) we propose a method to accommodate the changed edges and nodes into new and existing partitions, and identify the subset of partitions for which recomputation is necessary. For each identified partition we use an incremental approach to compute the measure in an expedited manner.
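For reference, the baseline computation that the partitioned and incremental schemes above accelerate is plain PageRank power iteration, sketched below on a tiny hypothetical graph. This shows only the standard whole-graph iteration, not the paper's partition-wise method.

```python
def pagerank(out_links, d=0.85, iters=50):
    # out_links: node -> list of successor nodes.
    # Standard power iteration with damping factor d.
    nodes = list(out_links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - d) / n for u in nodes}
        for u, succs in out_links.items():
            if not succs:                    # dangling node: spread uniformly
                for v in nodes:
                    nxt[v] += d * pr[u] / n
            else:
                for v in succs:
                    nxt[v] += d * pr[u] / len(succs)
        pr = nxt
    return pr

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
pr = pagerank(graph)
```

Because each iteration touches every edge, reducing the number of nodes per computation (by partitioning) and the number of iterations (by incremental updates) directly attacks the two cost factors.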
Our results show a significant reduction in computation time for our approaches to both these problems.

Item: I/O-Scalable Bregman Co-clustering and Its Application to the Analysis of Social Media (2011-10-10)
Hsu, Kuo-Wei; Srivastava, Jaideep
Adoption of social media has experienced explosive growth in recent years, and this trend appears likely to continue. A natural consequence has been the creation of vast quantities of data generated by social media applications, and hence increased interest from the database community. This data is also providing unique opportunities to understand sociological and psychological aspects, human interaction, and media production/consumption, and hence the growth in areas such as user modeling, behavior analysis, and social network analysis, which together are being labeled the emerging area of Computational Social Science (CSS) [37, 59]. These new types of data analysis are leading to the introduction of new computational techniques, e.g. p* modeling, ERGMs [62], co-clustering [6], etc. This paper focuses on a scalable implementation of the Bregman co-clustering algorithm and its application to social media analysis. The Bregman co-clustering algorithm performs two-way clustering and is theoretically scalable; we discuss an OLAP-based implementation to achieve this goal. Principally, we demonstrate how the aggregations required by the algorithm map naturally to summary statistics computed by an OLAP engine and stored in data cubes. Our OLAP-based implementation of the algorithm is able to handle large-scale datasets, i.e. datasets that are too large for main-memory-based implementations. Further, we explore the suitability of the relational model for modeling social media data. Specifically, we argue that data cubes and the star schema are well suited to managing social media data.
Our research is a step toward connecting three research areas of increasing interest to the community: databases, data mining, and social media analysis.

Item: Identifying Clusters in Marked Spatial Point Processes: A Summary of Results (2006-03-20)
Mane, Sandeep; Kang, James; Shekhar, Shashi; Srivastava, Jaideep; Murray, Carson; Pusey, Anne
Clustering of marked spatial point processes is an important problem in many application domains (e.g. behavioral ecology). Classical clustering approaches handle homogeneous spatial points and hence cannot cluster marked spatial point processes. In this paper, we propose a novel, intuitive approach, the Merge Algorithm, to hierarchically cluster marked spatial point processes. This approach treats all spatial point processes in a dendrogram's sub-tree as a single spatial point process while clustering. The resulting dendrogram for the marked spatial point processes must then be analyzed by a domain expert to identify clusters. To remove the subjective nature of the clusters so identified, we propose a novel statistical method, the Cluster Identification Algorithm, to partition a dendrogram into clusters. This approach identifies (cuts) a dendrogram's sub-tree as a cluster if that sub-tree's intra-subtree similarity is significantly higher than its inter-subtree similarity. Experiments with the Jane Goodall Institute's chimpanzee ecological dataset from Gombe National Park, Tanzania show that our proposed methods identified clusters compatible with those identified by domain experts.
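The shape of the dendrogram-cutting computation above can be caricatured as follows. The paper's Cluster Identification Algorithm uses a statistical intra- vs inter-subtree similarity test; the toy below substitutes a fixed distance cut and single-linkage merging, with each point standing in for a whole marked point process. All data and the threshold are hypothetical.

```python
import math

# Hypothetical 2-D summaries of six marked spatial point processes:
# two tight groups far apart from each other.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

def cluster_by_cut(points, cut):
    # Union-find: merge processes whose pairwise distance is below `cut`.
    # A subtree whose internal distances are far smaller than its
    # distances to the rest of the data survives the cut as one cluster.
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < cut:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

labels = cluster_by_cut(pts, 1.0)
```

With intra-group distances around 0.1 and inter-group distances around 7, any cut between those scales recovers the two groups, which is the intuition the significance test formalizes.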