Computer Science & Engineering (CS&E) Technical Reports
Persistent link for this collection: https://hdl.handle.net/11299/214969
Located in this collection are publications by the department, faculty, students, and visiting members.
Want to submit a Technical Report? See the Technical Report Submission Instructions.
Searching for a technical report
The search bar below supports general keyword search, but it can also be used to locate a report by its title or report number. Use quotation marks around a phrase or title to return reports that use that exact phrase. To search for a specific report number, use the following search string: "Technical Report; [Report Number]" (e.g., "Technical Report; 19-001"). You must include the quotation marks when searching for a report number.
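For example, a small script can assemble and URL-encode such a phrase query before handing it to the repository's search page; the base URL and the name of the query parameter below are assumptions for illustration, not documented parts of the site:

    from urllib.parse import urlencode

    def report_number_query(report_number):
        # Quoted phrase exactly as the search bar expects it, e.g. "Technical Report; 19-001".
        phrase = '"Technical Report; {}"'.format(report_number)
        # Hypothetical search endpoint and parameter name; adjust to the actual site.
        return "https://conservancy.umn.edu/search?" + urlencode({"query": phrase})

    print(report_number_query("19-001"))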
Browse
Browsing Computer Science & Engineering (CS&E) Technical Reports by Title
Now showing 1 - 20 of 749
Item: 3D Reconstruction of Periodic Motion from a Single View (2009-09-24)
Ribnick, Evan
Periodicity has been recognized as an important cue for tasks like activity recognition and gait analysis. However, most existing techniques analyze periodic motions only in image coordinates, making them highly dependent on the viewing angle. In this paper we show that it is possible to reconstruct a periodic trajectory in 3D given only its appearance in image coordinates from a single camera view. We draw a strong analogy between this problem and that of reconstructing an object from multiple views, which allows us to rely on well-known theoretical results from the multi-view geometry domain and obtain significant guarantees regarding the solvability of the estimation problem. We present two different formulations of the problem, along with techniques for performing the reconstruction in both cases, and an algorithm for estimating the period of motion from its image-coordinate trajectory. Experimental results demonstrate the feasibility of the proposed techniques.

Item: A Balanced Term-Weighting Scheme for Effective Document Matching (2001-02-07)
Jung, Yunjae; Park, Haesun; Du, Ding-Zhu
A new weighting scheme for the vector space model is presented to improve retrieval performance for an information retrieval system. In addition, a dimension compression method is introduced to reduce the computational cost of the weighting approach. The main idea of this approach is to consider not only occurring terms but also absent terms when finding similarity patterns among document and query vectors. Using a basic information retrieval development system that we are currently developing, we evaluate the effect of the balanced weighting scheme and compare it with various combinations of weighting schemes in terms of retrieval performance. The experimental results show that the proposed scheme produces recall-precision results similar to those of the cosine measure but, more importantly, enhances retrieval effectiveness. Since the scheme is based on the cosine measure, it is insensitive to weight variance. The results convincingly illustrate that the new approach is effective and applicable.
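As a concrete point of reference for the balanced term-weighting entry above, the short sketch below computes standard cosine matching between tf-idf document and query vectors with scikit-learn; the report's balanced scheme, which additionally accounts for absent terms, is not reproduced here, only the cosine-measure baseline it builds on:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["retrieval of documents", "document matching with term weights", "unrelated text"]
    query = ["term weighting for document matching"]

    vectorizer = TfidfVectorizer()
    doc_vecs = vectorizer.fit_transform(docs)      # documents -> weighted term vectors
    query_vec = vectorizer.transform(query)        # query mapped into the same term space
    print(cosine_similarity(query_vec, doc_vecs))  # higher score = closer match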
Item: A Cautionary Note on Decadal Sea Level Pressure Predictions from GCMs (2018-02-06)
Liess, Stefan; Snyder, Peter K.; Kumar, Arjun; Kumar, Vipin
Decadal prediction of sea level pressure (SLP) plays an important role in regional climate prediction because it shows changes in atmospheric behavior on time scales that are relevant for policy makers. These changes consist of a combination of externally forced and internally driven climate system characteristics. A comparison of SLP trends in a subset of seven Coupled Model Intercomparison Project (CMIP) phase 5 general circulation models (GCMs) during the satellite era to their CMIP3 counterparts reveals an unrealistically strong forecast skill in CMIP3 models for trend predictions for 2001-2011 when the 1979-2000 period is used to train the forecast. Boreal-winter SLP trends over five high-, mid-, and low-latitude zones were calculated over a two-decade initialization period for each ensemble member and then ranked based on their performance relative to observations in all five zones over the same time period. The same method is used to rank the ensemble members during the following decade. In CMIP3, 17 out of 38 ensemble members retain their rank in the 2001-2011 hindcast period and 3 retain the neighboring rank. However, these numbers are much lower in more recent CMIP5 decadal predictions over a similar period with the same number of ensembles. The conclusion that the forecast skill of the CMIP3 predictions during 2001-2011 is unrealistic is corroborated by comparisons to earlier periods, from the 1960s to the 1980s, in both CMIP3 and CMIP5 simulations. Thus, although the 2001-2011 CMIP3 predictions show statistically significant forecast skill, this skill should be treated as a spurious result that is unlikely to be reproduced by newer, more accurate GCMs.

Item: A Code Transformation Framework for Scientific Applications on Structured Grids (2011-09-29)
Lin, Pei-Hung; Jayaraj, Jagan; Woodward, Paul; Yew, Pen-Chung
The combination of expert-tuned code expression and aggressive compiler optimizations is known to deliver the best achievable performance for modern multicore processors. The development and maintenance of these optimized code expressions are never trivial. Tedious and error-prone processes greatly decrease the code developer's willingness to adopt manually tuned optimizations. In this paper, we describe a pre-compilation framework that takes a much more programmable code expression and transforms it into an optimized expression that would be far more difficult to produce manually. The user-directed, source-to-source transformations we implement rely heavily on the knowledge of the domain expert. The transformed output, in its optimized format, together with the optimizations provided by an available compiler, is intended to deliver exceptionally high performance. Three computational fluid dynamics (CFD) applications are chosen to exemplify our strategy. The performance results show an average speedup of 7.3x over straightforward compilation of the input code expression to our pre-compilation framework. Performance of 7.1 Gflops/s/core on the Intel Nehalem processor, 30% of the peak performance, is achieved using our strategy. The framework is seen to be successful for the CFD domain, and we expect that this approach can be extended to cover more scientific computation domains.

Item: A Comparative Evaluation of Anomaly Detection Techniques for Sequence Data (2008-07-07)
Chandola, Varun; Mithal, Varun; Kumar, Vipin
Anomaly detection has traditionally dealt with record or transaction type data sets. In many real domains, however, data naturally occurs as sequences, which motivates the study of anomaly detection techniques for sequential data sets. The problem of detecting anomalies in sequence data sets is related to but different from the traditional anomaly detection problem, because the nature of the data and of the anomalies differs from that found in record data sets. While there are many surveys and comparative evaluations for traditional anomaly detection, similar studies have not been done for sequence anomaly detection. We investigate a broad spectrum of anomaly detection techniques for symbolic sequences, proposed in diverse application domains. Our hypothesis is that symbolic sequences from different domains have distinct characteristics, in terms of both the nature of the sequences and the nature of the anomalies, which makes it important to investigate how different techniques behave for different types of sequence data. Such a study is critical to understanding the relative strengths and weaknesses of different techniques. Our paper is one such attempt: we comparatively evaluate 7 anomaly detection techniques on 10 public data sets collected from three diverse application domains. To gain further understanding of the performance of the techniques, we present a novel way to generate sequence data with desired characteristics. The results on the artificially generated data sets help us experimentally verify our hypothesis regarding the different techniques.
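To illustrate the kind of technique surveyed in the sequence anomaly detection entry above, the sketch below scores a test sequence by how rare its fixed-length windows (k-grams) are relative to a set of normal training sequences; this is a generic window-based detector written for illustration, not one of the seven techniques evaluated in the report:

    from collections import Counter

    def kgram_counts(sequences, k):
        # Count every length-k window observed in the normal training sequences.
        counts = Counter()
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
        return counts

    def anomaly_score(seq, counts, k):
        # Average "rarity" of the test sequence's windows; unseen windows contribute 1.0.
        windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        return sum(1.0 / (1 + counts[w]) for w in windows) / max(len(windows), 1)

    normal = ["abcabcabc", "abcabcabd", "bcabcabca"]
    counts = kgram_counts(normal, k=3)
    print(anomaly_score("abcabcabc", counts, k=3))  # low score: familiar windows
    print(anomaly_score("xyzxyzxyz", counts, k=3))  # high score: unseen windows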
Item: A Comparative Study on Web Prefetching (2001-05-31)
Bhushan Pandey, Ajay; Vatsavai, Ranga R.; Ma, Xiaobin; Srivastava, Jaideep; Shekhar, Shashi
The growth of the World Wide Web has emphasized the need for reduced user latency. Increasing use of dynamic pages, frequent changes in site structure, and user access patterns on the internet have limited the efficacy of caching techniques and emphasized the need for prefetching. Since prefetching increases bandwidth consumption, it is important that the prediction model be highly accurate and computationally feasible. It has been observed that in a web environment certain sets of pages exhibit stronger correlations than others, a fact which can be used to predict future requests. Previous studies on predictive models are mainly based on pairwise interactions of pages and TOP-N approaches. In this paper we study a model based on page interactions of higher order, where we exploit set relationships among the pages of a web site. We also compare the performance of this approach with models based on pairwise interaction and the TOP-N approach. We have conducted a comparative study of these models on a real server log and five synthetic logs with varying page frequency distributions, simulating different real-life web sites, and identified dominance zones for each of these models. We find that the model based on higher-order page interaction is more robust and gives competitive performance in a variety of situations.

Item: A Comparison of Document Clustering Techniques (2000-05-23)
Steinbach, Michael; Karypis, George; Kumar, Vipin
This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering: agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant of K-means, "bisecting" K-means.) Hierarchical clustering is often portrayed as the better-quality clustering approach, but it is limited by its quadratic time complexity. In contrast, K-means and its variants have a time complexity that is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good as or better than the hierarchical approaches we tested, for a variety of cluster evaluation metrics. We propose an explanation for these results based on an analysis of the specifics of the clustering algorithms and the nature of document data.
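A minimal sketch of the bisecting K-means idea discussed in the document clustering entry above, using scikit-learn's KMeans for the two-way splits; bisecting the largest remaining cluster is one common split rule, and the original study's evaluation metrics are not reproduced:

    import numpy as np
    from sklearn.cluster import KMeans

    def bisecting_kmeans(X, n_clusters, random_state=0):
        # Start with all points in one cluster, then repeatedly split the largest
        # remaining cluster in two with standard K-means until n_clusters remain.
        clusters = [np.arange(X.shape[0])]
        while len(clusters) < n_clusters:
            largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
            idx = clusters.pop(largest)
            labels = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(X[idx])
            clusters.append(idx[labels == 0])
            clusters.append(idx[labels == 1])
        return clusters

    X = np.random.RandomState(0).rand(100, 5)   # stand-in for tf-idf document vectors
    parts = bisecting_kmeans(X, n_clusters=4)
    print([len(p) for p in parts])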
Item: A Comparison of Generalized LDA Algorithms for Undersampled Problems (2003-12-11)
Park, Cheonghee; Park, Haesun
Linear Discriminant Analysis (LDA) is a dimension reduction method which finds an optimal linear transformation that maximizes the between-class scatter and minimizes the within-class scatter. In undersampled problems, where the number of samples is smaller than the dimension of the data space, it is difficult to apply LDA due to the singularity of the scatter matrices caused by high dimensionality. In order to make LDA applicable to undersampled problems, several generalizations of LDA have been proposed recently. In this paper, we present the theoretical and algorithmic relationships among several generalized LDA algorithms and compare their computational complexities and performance in text classification and face recognition. Toward a practical dimension reduction method for high-dimensional data, an efficient algorithm is also proposed.

Item: A Comparative Study of Linear and Nonlinear Feature Extraction Methods (2004-11-17)
Park, Cheonghee; Park, Haesun; Pardalos, Panos
Linear Discriminant Analysis (LDA) is a dimension reduction method which finds an optimal linear transformation that maximizes the between-class scatter and minimizes the within-class scatter. However, in undersampled problems, where the number of samples is smaller than the dimension of the data space, it is difficult to apply LDA due to the singularity of the scatter matrices caused by high dimensionality. In order to make LDA applicable, several generalizations of LDA have been proposed. This paper presents theoretical and algorithmic relationships among several generalized LDA algorithms. Utilizing the relationships among them, computationally efficient approaches to these algorithms are proposed. We also present nonlinear extensions of these LDA algorithms. The original data space is mapped to a feature space by an implicit nonlinear mapping through kernel methods. A generalized eigenvalue problem is formulated in the transformed feature space, and the generalized LDA algorithms are applied to solve it. The performance and computational complexity of these linear and nonlinear discriminant analysis algorithms are compared theoretically and experimentally.
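The two LDA entries above both concern the singularity of the scatter matrices when the dimension exceeds the number of samples. One common workaround, shown below with scikit-learn, is to reduce the dimension with PCA before applying LDA; this is only an illustration of the undersampled setting, not necessarily one of the specific generalized LDA algorithms compared in these reports:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.pipeline import make_pipeline

    rng = np.random.RandomState(0)
    X = rng.rand(60, 1000)                  # 60 samples, 1000 features: undersampled
    y = rng.randint(0, 3, size=60)          # three classes

    # PCA keeps fewer components than samples, so the scatter matrices LDA sees
    # in the reduced space are no longer singular.
    model = make_pipeline(PCA(n_components=30), LinearDiscriminantAnalysis(n_components=2))
    Z = model.fit_transform(X, y)
    print(Z.shape)                          # (60, 2) discriminant coordinates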
Item: A Compiler Framework for Recovery Code Generation in General Speculative Optimizations (2003-12-28)
Lin, Jin; Hsu, Wei-Chung; Yew, Pen-Chung; Dz-ching Ju, Roy; Ngai, Tin-fook
Data speculation is an effective way to overcome imprecise alias analysis for many compiler optimizations. A general framework that integrates both control and data speculation using alias profiling and/or compiler heuristic rules has been shown to improve SPEC2000 performance on Itanium systems [16]. However, speculative optimizations require check instructions and recovery code to ensure correct execution when speculation fails. How to generate check instructions and their associated recovery code efficiently and effectively is an issue yet to be well studied. It is also very important to avoid inhibiting later optimizations after the recovery code is generated in the earlier phases. This paper proposes a framework that uses an if-block structure to facilitate check instruction and recovery code generation for general speculative optimizations. It allows speculative and recovery code generated early in the compiler phases to be integrated effectively with the subsequent compiler optimization phases. It also allows multi-level speculation for multi-level pointers and multi-level expression trees to be handled with no additional complexity. The proposed recovery code generation framework has been implemented on the Open Research Compiler (ORC).

Item: A Computationally Efficient and Statistically Powerful Framework for Searching High-order Epistasis with Systematic Pruning and Gene-set Constraints (2010-06-21)
Fang, Gang; Haznadar, Majda; Wang, Wen; Steinbach, Michael; Van Ness, Brian; Kumar, Vipin
This paper has not yet been submitted.

Item: A Data Mining Framework for Forest Fire Mapping (2012-03-29)
Karpatne, Anuj; Chen, Xi; Chamber, Yashu; Mithal, Varun; Lau, Michael; Steinhaeuser, Karsten; Boriah, Shyam; Steinbach, Michael; Kumar, Vipin
Forests are an important natural resource that support economic activity and play a significant role in regulating the climate and the carbon cycle, yet forest ecosystems are increasingly threatened by fires caused by a range of natural and anthropogenic factors. Mapping these fires, which can range in size from less than an acre to hundreds of thousands of acres, is an important task for supporting climate and carbon cycle studies as well as informing forest management. There are two primary approaches to fire mapping: field and aerial-based surveys, which are costly and limited in their extent, and remote sensing-based approaches, which are more cost-effective but pose several interesting methodological and algorithmic challenges. In this paper, we introduce a new framework for mapping forest fires based on satellite observations. Specifically, we develop spatio-temporal data mining methods for Moderate Resolution Imaging Spectroradiometer (MODIS) data to generate a history of forest fires. A systematic comparison with alternate approaches across diverse geographic regions demonstrates that our algorithmic paradigm is able to overcome some of the limitations in both the data and the methods employed by prior efforts.

Item: A Decomposition-Based Approach to Layered Manufacturing (2000-10-10)
Ilinkin, Ivaylo; Janardan, Ravi; Majhi, Jayanth; Schwerdt, Jörg; Smid, Michiel; Sriram, Ram
This paper introduces a new approach for improving the performance and versatility of Layered Manufacturing (LM), an emerging technology that makes it possible to build physical prototypes of 3D parts directly from their CAD models using a relatively small and inexpensive "3D printer" attached to a personal computer. LM provides the designer with an additional level of physical verification that makes it possible to detect and correct design flaws that may otherwise have gone unnoticed in the virtual model. Current LM processes work by viewing the CAD model as a single, monolithic unit. By contrast, the approach proposed here decomposes the model into a small number of pieces by intersecting it with a suitably chosen plane, builds each piece separately using LM, and then glues the pieces together to obtain the physical prototype. This approach allows large models to be built quickly in parallel and also lends itself naturally to applications where the model needs to be built as several pieces, such as in the manufacture of mold halves for injection molding. Furthermore, this approach is very efficient in its use of the so-called support structures generated by the LM process. This paper presents the first provably correct and efficient geometric algorithms to decompose polyhedral models so that the support requirements (support volume and area of contact) are minimized. Algorithms based on the plane-sweep paradigm are first given for convex polyhedra. These algorithms run in O(n log n) time for n-vertex convex polyhedra and work by generating expressions for the support volume and contact area as a function of the height of the sweep plane, and optimizing them during the sweep. Experimental results are given for randomly generated convex polyhedra with up to 200,000 vertices. These algorithms are then generalized to non-convex polyhedra, which are considerably more difficult due to the complex structure of the supports. It is shown that, surprisingly, non-convex polyhedra can be handled by first identifying certain critical facets using a technique called cylindrical decomposition, and then applying the algorithm for convex polyhedra to these critical facets. The resulting algorithms run in O(n^2 log n) time.

Item: A Distributed Approximation for Minimum Connected Dominating Sets in Wireless Networks (2003-05-02)
Min, Manki; Lanka, Raghuram; Huang, Xiao; Du, Ding-Zhu
A wireless ad-hoc network is a wireless, multihop network without an infrastructure. Due to the limited resources of the hosts in the network, resource utilization becomes critical, and as a result the construction of virtual backbones in wireless ad-hoc networks becomes more attractive. In this paper, we present a distributed approximation scheme for minimum connected dominating sets in unit-disk graphs, which can serve as a virtual backbone in the network. Our algorithm has a performance ratio of 7 and a message complexity of O(n log n), which are the best among the schemes in the literature.
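As background for the connected dominating set entry above, the sketch below builds a small connected dominating set with a simple centralized greedy heuristic on a random geometric graph (a stand-in for a unit-disk wireless topology, using networkx); the report's algorithm is distributed and carries a proven performance ratio, which this illustration does not attempt to reproduce:

    import networkx as nx

    def greedy_connected_dominating_set(G):
        # Grow a connected "backbone" until every node is dominated (in the set
        # or adjacent to it). Greedy choice: the frontier node covering the most
        # currently uncovered nodes.
        start = max(G.nodes, key=G.degree)
        cds = {start}
        covered = {start} | set(G.neighbors(start))
        while len(covered) < G.number_of_nodes():
            frontier = {v for u in cds for v in G.neighbors(u)} - cds
            best = max(frontier, key=lambda v: len((set(G.neighbors(v)) | {v}) - covered))
            cds.add(best)
            covered |= set(G.neighbors(best)) | {best}
        return cds

    G = nx.random_geometric_graph(50, 0.3, seed=1)
    G = G.subgraph(max(nx.connected_components(G), key=len)).copy()  # keep it connected
    print(greedy_connected_dominating_set(G))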
Item: A Fast Dimension Reduction Algorithm with Applications on Face Recognition and Text Classification (2003-12-19)
Park, Cheonghee; Park, Haesun
In undersampled problems, where the number of samples is smaller than the dimension of the data space, it is difficult to apply Linear Discriminant Analysis (LDA) due to the singularity of the scatter matrices caused by high dimensionality. We propose a fast dimension reduction method based on a simple modification of Principal Component Analysis (PCA) and the orthogonal decomposition. The proposed algorithm is an efficient way to perform LDA for undersampled problems. Our experimental results in face recognition and text classification demonstrate the effectiveness of the proposed method.

Item: A Framework for Discovering Co-location Patterns in Data Sets with Extended Spatial Objects (2003-09-22)
Xiong, Hui; Shekhar, Shashi; Huang, Yan; Kumar, Vipin; Ma, Xiaobin; Soung Yoo, Jin
Co-location patterns are subsets of spatial features (e.g., freeways, frontage roads) usually located together in geographic space. Recent literature has provided a transaction-free approach to discover co-location patterns over spatial point data sets, avoiding the potential loss of proximity relationship information that comes from partitioning continuous geographic space into transactions. This paper provides a more general transaction-free approach to mine data sets with extended spatial objects, e.g., line-strings and polygons. Key challenges include modeling neighborhoods and relationships among extended spatial objects, as well as controlling the related geometric computation costs. Based on a buffer-based definition of neighborhoods, a new model for finding co-location patterns over extended spatial objects is proposed. Furthermore, this paper presents two pruning approaches, namely a prevalence-based pruning approach and a geometric filter-and-refine approach. Experimental evaluation with a real data set (the roadmap of the Minneapolis and St. Paul metropolitan area) shows that the geometric filter-and-refine approach can speed up the prevalence-based pruning approach by a factor of 30 to 40. Finally, the extended co-location mining algorithm proposed in this paper has been used to select the most challenging field test routes for a novel GPS-based approach to assessing road user charges.
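To make the buffer-based neighborhood idea in the co-location entry above concrete, the sketch below uses shapely to buffer two line-string features and test whether the buffers overlap, one plausible way to define a neighborhood relation between extended objects; it is illustrative only and omits the prevalence measures and pruning steps the report describes:

    from shapely.geometry import LineString

    # Two extended spatial objects, e.g. a freeway segment and a frontage road.
    freeway = LineString([(0.0, 0.0), (10.0, 0.0)])
    frontage_road = LineString([(0.0, 0.4), (10.0, 0.4)])

    def are_neighbors(a, b, buffer_dist=0.25):
        # Buffer-based neighborhood: two objects are neighbors if their buffered
        # footprints intersect.
        return a.buffer(buffer_dist).intersects(b.buffer(buffer_dist))

    print(are_neighbors(freeway, frontage_road))                  # True: buffers overlap
    print(are_neighbors(freeway, LineString([(0, 5), (10, 5)])))  # False: too far apart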
Item: A Framework for Information Visualization Spreadsheets (1999-03-09)
Chi, Ed Huai-hsin
Information has become interactive. Information visualization is the design and creation of interactive graphic depictions of information, combining principles from the disciplines of graphic design, cognitive science, and interactive computer graphics. As the volume and complexity of data increase, users require more powerful visualization tools that allow them to explore large abstract datasets more effectively. This thesis seeks to make information visualization more accessible to potential users by creating a "Visualization Spreadsheet", where each cell can contain an entire set of data represented using interactive graphics. Just as a numeric spreadsheet enables exploration of numbers, a visualization spreadsheet enables exploration of visual forms of information. Unlike numeric spreadsheets, which store only simple data elements and formulas in each cell, a cell in the Visualization Spreadsheet can hold an entire abstract data set, selection criteria, viewing specifications, and other information needed for a full-fledged information visualization. Similarly, intra-cell and inter-cell operations are far more complex, stretching beyond simple arithmetic and string operations to encompass a range of domain-specific operators.

Item: A GA-based Approach for Scheduling Decomposable Data Grid Applications (2004-02-18)
Kim, Seonho; Weissman, Jon
Data Grid technology promises geographically distributed scientists the ability to access and share physically distributed resources such as compute resources, networks, storage, and, most importantly, data collections for large-scale data-intensive problems. The massive size and distributed nature of these datasets pose challenges for data-intensive applications. Scheduling Data Grid applications must consider communication and computation simultaneously to achieve high performance. In many Data Grid applications, data can be decomposed into multiple independent sub-datasets and distributed for parallel execution and analysis. In this paper, we exploit this property and propose a novel Genetic Algorithm based approach that automatically decomposes data onto communication and computation resources. The proposed GA-based scheduler takes advantage of the parallelism of decomposable Data Grid applications to achieve the desired performance level. We evaluate the proposed approach by comparing it with other algorithms. Simulation results show that the proposed GA-based approach is a competitive choice for scheduling large Data Grid applications, in terms of both scheduling overhead and the quality of solutions, as compared to other algorithms.
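For readers unfamiliar with how a genetic algorithm can drive such a scheduler, the sketch below evolves an assignment of independent sub-datasets to compute nodes so as to minimize the makespan; it is a generic toy GA over assumed processing costs, not the scheduler proposed in the report (which also models communication):

    import random

    random.seed(0)
    JOBS = [random.uniform(1, 10) for _ in range(20)]   # processing cost per sub-dataset
    NODES = 4                                           # number of compute nodes

    def makespan(assignment):
        # Completion time of the most loaded node; lower is better.
        loads = [0.0] * NODES
        for cost, node in zip(JOBS, assignment):
            loads[node] += cost
        return max(loads)

    def evolve(pop_size=30, generations=200, mutation_rate=0.1):
        pop = [[random.randrange(NODES) for _ in JOBS] for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=makespan)
            survivors = pop[: pop_size // 2]            # selection: keep the best half
            children = []
            while len(survivors) + len(children) < pop_size:
                a, b = random.sample(survivors, 2)
                cut = random.randrange(1, len(JOBS))    # one-point crossover
                child = a[:cut] + b[cut:]
                for i in range(len(child)):             # mutation: reassign a job
                    if random.random() < mutation_rate:
                        child[i] = random.randrange(NODES)
                children.append(child)
            pop = survivors + children
        return min(pop, key=makespan)

    best = evolve()
    print("best makespan:", round(makespan(best), 2))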
Item: A General Compiler Framework for Data Speculation Using DSCM (2004-03-01)
Dai, Xiaoru; Hsu, Wei-Chung; Yew, Pen-Chung
Getting precise alias information in a language that allows pointers, such as C, is expensive. One reason is that alias analysis must generate conservative (safe) alias information: it assumes possible aliases whenever it cannot prove that there are no aliases. This conservative alias information may greatly affect compiler optimizations. In this paper, we present a general framework that allows speculative (unsafe) alias information to be used in several compiler optimizations, such as redundancy elimination, copy propagation, dead store elimination, and code scheduling. Speculative alias information is not guaranteed to be correct, so run-time verification and recovery code are generated to guarantee the correctness of the optimizations that use it. The key idea of the framework is to use data speculative code motion (DSCM) to move the instruction that will be optimized away to the instruction that causes the optimization. During DSCM, verification and recovery code are generated. After DSCM, the optimization becomes non-speculative and the moved instruction can be optimized away.

Item: A Generalized Framework for Protein Sequence Annotation (2007-10-15)
Rangwala, Huzefa; Kauffman, Christopher; Karypis, George
Over the last decade, several data mining techniques have been developed for determining structural and functional properties of individual protein residues using sequence and sequence-derived information. These protein residue annotation problems are often formulated as either classification or regression problems and solved using a common set of techniques. We develop a generalized protein sequence annotation toolkit (prosat) for solving classification or regression problems using support vector machines. The key characteristic of our method is its effective use of window-based information to capture the local environment of a protein sequence residue. This window information is used with several kernel functions available within our framework. We show the effectiveness of using the previously developed normalized second-order exponential kernel function and experiment with local window-based information at different levels of granularity. We report empirical results on a diverse set of classification and regression problems: prediction of solvent accessibility, secondary structure, local structure alphabet, transmembrane helices, DNA-protein interaction sites, contact order, and regions of disorder are all explored. Our methods show results that are comparable or superior to several state-of-the-art, application-tuned prediction methods for these problems. prosat provides practitioners with an efficient and easy-to-use tool for a wide variety of annotation problems. The results of some of these predictions can be used to assist in solving the overarching 3D structure prediction problem.
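To illustrate the window-based residue annotation setup described in the last entry above, the sketch below one-hot encodes a fixed-size window around each residue and trains an SVM classifier with scikit-learn; the sequence and labels are synthetic, and the report's normalized second-order exponential kernel is not reproduced:

    import numpy as np
    from sklearn.svm import SVC

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_INDEX = {a: i for i, a in enumerate(AMINO_ACIDS)}

    def window_features(sequence, w=3):
        # One-hot encode a window of +/- w residues around each position,
        # padding the sequence ends with zero vectors.
        feats = []
        for i in range(len(sequence)):
            vec = []
            for j in range(i - w, i + w + 1):
                one_hot = [0] * len(AMINO_ACIDS)
                if 0 <= j < len(sequence):
                    one_hot[AA_INDEX[sequence[j]]] = 1
                vec.extend(one_hot)
            feats.append(vec)
        return np.array(feats)

    # Toy per-residue label: 1 if the residue is hydrophobic, a stand-in for a
    # real annotation target such as solvent accessibility.
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    hydrophobic = set("AVILMFWY")
    y = np.array([1 if a in hydrophobic else 0 for a in seq])

    X = window_features(seq)
    clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
    print(clf.score(X, y))   # training accuracy on the toy task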