Browsing by Subject "Computer science"
Now showing 1 - 20 of 46
Item Adaptive tile coding methods for the generalization of value functions in the RL state space. (2012-03) Siginam, Bharat. The performance of a Reinforcement Learning (RL) agent depends on the accuracy of the approximated state value functions. Tile coding (Sutton and Barto, 1998), a function approximation method, generalizes the approximated state value functions for the entire state space using a set of tile features (discrete features based on continuous features). The shape and size of the tiles in this method are decided manually. In this work, we propose various adaptive tile coding methods to automate the decision of the shape and size of the tiles. The proposed adaptive tile coding methods use a random tile generator, the number of states represented by features, the frequencies of observed features, and the difference between the deviations of predicted value functions from Monte Carlo estimates to select the split points. The RL agents developed using these methods are evaluated in three different RL environments: the puddle world problem, the mountain car problem, and the cart pole balance problem. The results obtained are used to evaluate the efficiencies of the proposed adaptive tile coding methods.
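For readers unfamiliar with the baseline technique, below is a minimal sketch of ordinary (non-adaptive) tile coding for a two-dimensional continuous state; the state range, number of tilings, and tile resolution are illustrative assumptions, not values from the thesis. The adaptive methods proposed above replace this fixed tile layout with learned split points.

```python
import numpy as np

def tile_indices(state, low, high, n_tilings=4, tiles_per_dim=8):
    """Return one active tile index per tiling for a 2-D continuous state."""
    state = np.asarray(state, dtype=float)
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    scaled = (state - low) / (high - low)          # normalize each dimension to [0, 1]
    active = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)   # shift each tiling by a fraction of a tile
        coords = np.clip(np.floor((scaled + offset) * tiles_per_dim).astype(int),
                         0, tiles_per_dim - 1)
        active.append(t * tiles_per_dim ** 2 + coords[0] * tiles_per_dim + coords[1])
    return active

# The approximate state value is the sum of the weights of the active tiles.
weights = np.zeros(4 * 8 * 8)
v = sum(weights[i] for i in tile_indices([0.3, -0.7], low=[-1, -1], high=[1, 1]))
```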
Item A biclustering method for extracting keyphrases to describe groups of yeast genes. (2010-08) Kasireddy, Vivek. With the growing capabilities of high-throughput gene methods, one of the critical issues in using these methods is how to interpret the results. For example, it is possible to evaluate all of the genes for yeast (Saccharomyces cerevisiae) at once to see how they react to a particular chemical. As a result of such an experiment, a researcher might get a list of genes that all respond similarly. The question then becomes how to understand what these genes have in common to explain their response. Complicating this issue is the real possibility that there may be more than one explanation. In this work we look at a method for automatically annotating groups of genes with keyphrases (i.e., short groups of words to describe the genes) to help a user understand what the genes might have in common. As part of this process we want to consider how to deal with cases where there is more than one explanation. To address this problem we make use of a biclustering method called SAMBA (Tanay et al., 2002), which was developed to solve a similar problem for genes and measured conditions. We generate keyphrases by considering possible keyphrases as conditions and attempt to bicluster the genes of interest with keyphrases that are strongly associated with subgroups of those genes. We perform experiments using genes associated with known terms to see if our method can extract useful keyphrases and to separate out subgroups of the genes.

Item Computational analysis of genome-scale growth-interaction data in Saccharomyces cerevisiae. (2014-08) VanderSluis, Benjamin James. In just two decades, advances in the experimental mapping and computational analysis of DNA sequences have resulted in complete reference genomes for thousands of different species. We therefore have a nearly complete "parts list" (that is, genes) for each of these organisms, but the task remains to discover the individual function of each of these genes, as well as characterize the organization and evolution of these individual genes into the many sub-systems at work inside the cell. Perturbation analysis is a crucial tool in identifying gene function and genetic relationships. In perturbation analysis, genes are selectively deleted or mutated, and any change in the resulting phenotype—for example, growth rate—can give an indication of gene function. We can then obtain a more complete functional map by systematically changing or combining genetic perturbations, and/or varying the environment under which we observe the phenotype. The focus of this dissertation is the development of computational methods to enable genome-scale perturbation analyses in yeast. We begin the dissertation with a discussion of the first computational analysis of growth rate data for a comprehensive collection of deletion mutants in a wide variety of truly minimal environments. This analysis revealed how sources of nitrogen and carbon in the environment interact to determine growth rate, both in the context of wild-type strains and in the context of individual single mutants. We also discuss comparisons between experimental observation and in silico growth rate predictions, which serve as a benchmark for current constraint-based modeling methods. Secondly, we discuss our efforts to map the complete genetic interaction network in yeast through a comprehensive set of double-mutant experiments. We explore the ability of genetic interactions and high-dimensional interaction profiles to predict gene function, and describe both local and global properties of the genetic interaction network, which may reasonably be expected to be conserved in other organisms, such as humans. Lastly, we describe local properties of the genetic interaction network surrounding genes which have undergone ancient duplication. Using networks derived from both double- and triple-mutant experiments, we explore the consequences of duplication, divergence, and retained common functionality, and speculate about the evolutionary process, and the constraints on that process, which govern the fates of duplicate gene pairs. Functional capabilities of genes are conserved across species to a surprising extent. Determining the functions of the remaining uncharacterized genes in yeast will assist in the functional characterization of the thousands of remaining uncharacterized genes in human. Further, the mapping of the first complete eukaryotic genetic interaction network has direct impact on the study of complex, multi-genic phenotypes, including many human diseases. Meanwhile, the study of genetic interaction network structure yields fundamental insights into the nature of cellular robustness, redundancy, and the evolutionary processes which give rise to them.
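For background on the double-mutant experiments mentioned above: a genetic interaction is commonly scored as the deviation of the measured double-mutant fitness from the multiplicative expectation given the two single mutants. The abstract does not spell out a scoring formula, so the sketch below is a hedged illustration of that standard idea, with hypothetical fitness values.

```python
def interaction_score(f_a, f_b, f_ab):
    """Deviation of double-mutant fitness from the multiplicative expectation.

    f_a, f_b : single-mutant fitness values (wild type = 1.0)
    f_ab     : measured double-mutant fitness
    Negative scores suggest synthetic sick/lethal interactions;
    positive scores suggest alleviating or suppressive interactions.
    """
    return f_ab - f_a * f_b

# Example: two mildly sick single mutants whose double mutant is nearly dead.
print(interaction_score(0.8, 0.9, 0.2))   # -0.52, a strong negative interaction
```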
Item Consumers, editors, and power editors at work: diversity of users in online peer production communities. (2014-09) Panciera, Katherine Anne. Many people rely on open collaboration projects to run their computer (Linux), browse the web (Mozilla Firefox), and get information (Wikipedia). Open content web sites are peer production communities which depend on users to produce content. In this thesis, we analyze three types of users in peer production communities: consumers, contributors, and core contributors. Consumers don't edit or add content, while contributors add some content. Core contributors edit or contribute much more content than others on the site. The three types of users each serve a different role in the community, receive different benefits from the community, and are important to the survival of a community. We look at users in two communities: Wikipedia and Cyclopath. Wikipedia is the largest and most well-known peer production community. The majority of the work in this dissertation is from Cyclopath, a geowiki for bicyclists developed by GroupLens. Since we built Cyclopath, we have access to data that allowed us to delve much deeper into the divide between the three types of users. First, we wanted to understand what the quantitative differences between core contributors and contributors were. On Wikipedia and Cyclopath, core contributors start editing more intensely from their first day on the site. On Cyclopath we were able to look at pre-registration activity and found equivocal evidence for "educational lurking". Building on this quantitative analysis, we turned to qualitative questions. By surveying and interviewing Cyclopath users, we learned what motivates them to participate and what benefits they derive from participating. While consumers and contributors both benefited by receiving routes, contributors were more likely to say they registered to edit. (Registration was not required to edit.) We also found that the Cyclopath core contributors aren't the most dedicated bicyclists, but they are committed to the values of open content. By providing a holistic view of users on Cyclopath and by looking at Wikipedia editors quantitatively, we discovered opportunities for new forms of participation, such as an outlet for subjective comments and annotations, as well as a key to motivating people to contribute objective information, highlighting flaws and easy fixes in the system.
Item Covering polyhedra by motifs with triangular fundamental regions. (2011-09) Pathi, Lakshmi Ramya. In geometry, a “net” of a polyhedron is a two-dimensional figure where all the polygons are joined by edges, which when folded becomes a three-dimensional polyhedron. A “subnet” is a subset of a net which is formed by the faces of the polyhedron. Technically, multiple nets can exist for a polyhedron and different polyhedra can be obtained from a single net. The algorithm designed takes any arbitrary subnet of a polyhedron as an input and maps a triangular motif onto each of the polygon faces of the subnet. Each polygon face is assumed to be convex and will be triangulated from its centroid. The triangles of that triangulation will then be filled in with transformed versions of the motif. Currently, Dr. Dunham's work creates a pattern on a specific polyhedron, while my research aims at mapping a single pattern onto each of the possibly different polygons of a net that can be used to construct any patterned polyhedron.

Item Creating repeating hyperbolic patterns based on regular tessellations. (2012-07) Becker, Christopher D. Repeating patterns have been used in art throughout history. In the middle of the 20th century, the noted Dutch artist M.C. Escher was the first to create repeating hyperbolic patterns that were artistic in nature. These patterns were very tedious to design and draw. Escher did all this work by hand, without the benefit of a computer. This paper discusses how, through the use of a computer program, the creation of repeating hyperbolic patterns is accomplished in a less tedious, more timely manner. The computer program enables a user to load or create a data file that defines the sub-pattern and other information about the design. The program will take that information and generate the repeating pattern for the user. The user is also able to modify the pattern. The computer program allows the user to precisely and quickly create repeating hyperbolic patterns, which will be displayed on the screen. The repeating hyperbolic pattern is also saved as a PostScript file.

Item Creating scalable, efficient and namespace independent routing framework for future networks. (2011-06) Jain, Sourabh. In this thesis we propose VIRO -- a novel and paradigm-shifting approach to network routing and forwarding that is not only highly scalable and robust, but also namespace-independent.
VIRO provides several advantages over existing network routing architectures, including: i) VIRO directly and simultaneously addresses the challenges faced by IP networks as well as those associated with the traditional layer-2 technologies such as Ethernet -- while retaining its "plug-&-play" feature. ii) VIRO provides a uniform convergence layer that integrates and unifies routing and forwarding performed by the traditional layer-2 (data link layer) and layer-3 (network layer), as prescribed by the conventional local-area/wide-area network dichotomy and layered architecture. iii) Perhaps more importantly, VIRO decouples routing from addressing, and thus is namespace-independent. Hence VIRO allows new (global or local) addressing and naming schemes (e.g., HIP or flat-id namespace) to be introduced into networks without the need to modify core router/switch functions, and can easily and flexibly support inter-operability between existing and new addressing schemes/namespaces. In the second part of this thesis, we present the Virtual Ethernet Id Layer, in short VEIL, a practical realization of the VIRO routing protocol for creating large-scale Ethernet networks. VEIL is aimed at simplifying the management of large-scale enterprise networks by requiring minimal manual configuration overhead. It makes it tremendously easy to plug in a new routing node or host device in the network without requiring any manual configuration. It builds on top of a highly scalable and robust routing substrate provided by VIRO, and supports many advanced features such as seamless mobility support, built-in multi-path routing, and fast failure re-routing in case of link/node failures, without requiring any specialized topologies. To demonstrate the feasibility of VEIL, we have built a prototype of VEIL, called veil-click, using the Click Modular Router framework, which can be co-deployed with existing Ethernet switches and does not require any changes to host devices connecting to the network.

Item Database management system support for collaborative filtering recommender systems. (2014-08) Sarwat, Mohamed. Recommender systems help users identify useful, interesting items or content (data) from a considerably large search space. By far, the most popular recommendation technique used is collaborative filtering, which exploits the users' opinions (e.g., movie ratings) and/or purchasing (e.g., watching, reading) history in order to extract a set of interesting items for each user. Database Management Systems (DBMSs) do not provide in-house support for recommendation applications despite their popularity. Existing recommender system architectures either do not employ a DBMS at all or use it only as a data store, while the recommendation logic is implemented entirely outside the database engine. Incorporating the recommendation functionality inside the DBMS kernel is beneficial for the following reasons: (1) Many recommendation algorithms take as input structured data (users, items, and user historical preferences) that could be adequately stored and accessed using a database system. (2) The in-DBMS approach facilitates applying the recommendation functionality and typical database operations (e.g., selection, join) side-by-side. That allows application developers to go beyond traditional recommendation applications, e.g., "Recommend to Alice ten movies", and flexibly define arbitrary recommendation scenarios like "Recommend ten nearby restaurants to Alice" and "Recommend to Bob ten movies watched by his friends".
(3) Once the recommendation functionality lives inside the database kernel, the recommendation application takes advantage of the DBMS's inherent features (e.g., query optimization, materialized views, indexing) provided by the storage manager and query execution engine. This thesis studies the incorporation of the recommendation functionality inside the core engine of a database management system. This is a major departure from existing recommender system architectures that are implemented on top of a database engine using either SQL queries or stored procedures. The on-top approach does not harness the full power of the database engine (i.e., query execution engine, storage manager) since it always generates recommendations first and then performs other database operations. Ideas developed in this thesis are implemented inside RecDB, an open-source recommendation engine built entirely inside PostgreSQL (an open-source relational database system).
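To make the "recommendation plus relational filter" idea above concrete, here is a small, self-contained Python sketch of item-based collaborative filtering with a location predicate applied alongside the recommendation step. It illustrates the concept only; it is not RecDB's SQL syntax or API, and the data and the `is_nearby` predicate are invented for the example.

```python
import numpy as np

# Toy data: rows = users, columns = restaurants (0 = unrated).
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4]], dtype=float)
is_nearby = np.array([True, False, True, True])   # stand-in for a relational "nearby" predicate

def item_similarity(r):
    """Cosine similarity between item (column) rating vectors."""
    norms = np.linalg.norm(r, axis=0) + 1e-9
    return (r.T @ r) / np.outer(norms, norms)

def recommend(user, k=2):
    """Item-based CF score, with the relational filter applied alongside recommendation."""
    sim = item_similarity(ratings)
    rated = ratings[user] > 0
    scores = sim[:, rated] @ ratings[user, rated] / (np.abs(sim[:, rated]).sum(axis=1) + 1e-9)
    scores[rated] = -np.inf        # do not re-recommend items the user already rated
    scores[~is_nearby] = -np.inf   # "Recommend ten nearby restaurants to Alice"
    ranked = np.argsort(scores)[::-1]
    return [int(i) for i in ranked if np.isfinite(scores[i])][:k]

print(recommend(user=0))   # e.g., [2]: the unrated, nearby restaurant
```

The point of the in-DBMS design described above is that a filter like `is_nearby` would be an ordinary relational predicate evaluated by the query engine next to the recommendation operator, rather than a post-processing step in application code.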
Item Designing an algorithm that transforms each pixel back to motif in a fundamental region. (2011-09) Chandarana, Dnyaneshwari Subodh. Current algorithms to create repeating hyperbolic patterns transform the motif about the hyperbolic plane to points in the Poincaré circle model. This is inefficient near the bounding circle, since the entire motif is drawn even though it covers only a few pixels. To avoid this shortcoming, we designed another algorithm that transforms each pixel back to the motif in a fundamental region and then colors the original pixel using a color permutation of the color of the final pixel. This solves the inefficiency problems of the previous algorithms.

Item Designing effective motion visualizations: elucidating scientific information by bringing aesthetics and design to bear on science. (2014-11) Schroeder, David. The visual system is the highest-bandwidth pathway into the human brain, and visualization takes advantage of this pathway to allow users to understand datasets they are interested in. Recent scientific advances have led to the collection of larger and more complicated datasets, leading to new challenges in effectively visualizing these data. The focus of this dissertation is on addressing these challenges and enabling the next generation of visualization systems. We address these challenges through two complementary research thrusts: "Advanced Visualization Practice" and "Visualization Design Tools." In our Advanced Visualization Practice thrust, we take steps to extend the process of interactive visualization to work effectively with complicated multivariate motion datasets. We present brushing and filtering operations that allow users to perform complicated filtering operations in a linked-window visualization while maintaining context in complementary views, including two-dimensional plots, three-dimensional plots, and recorded video. We also present the concept of "trends," or patterns of motions that behave similarly over a period of time, and introduce visualization elements to allow users to examine, interact with, and navigate these trends. These contributions help to implement Shneiderman's information-seeking mantra (Overview first, zoom and filter, then details-on-demand) in the context of collections of motion datasets. During our work in Advanced Visualization Practice, we realized that there was a lack of tools enabling visualization developers to rapidly and controllably create and evaluate these visualizations. We address this deficiency in our Visualization Design Tools thrust, introducing the idea of visualization creation interfaces where users draw directly on top of data in order to effect their desired changes to the current visualization. In an application of this idea to streamline visualizations, we present a sketch-based streamline visualization creation interface, allowing users to create accurate streamline visualizations by simply drawing the lines they want to appear. An underlying algorithm constrains the input to be accurate while still matching the user's intent. In a second application of this idea, we present a Photoshop-style interface, enabling users to create complicated multivariate visualizations without needing to program. A colormap painting and dabbing algorithm allows users to create complicated colormaps by drawing colors on top of a colormap; an algorithm determines the desired locality of the user's input and updates the colormap accordingly. These interfaces show the potential for future interfaces in this direction to expand the visualization design process to include users currently excluded, such as domain scientists and artists. Through these two complementary thrusts, we help to solve problems preventing newer datasets from being fully exploited. Our contributions in Advanced Visualization Practice solve problems that are impeding the visualization of motion datasets. Our contributions in Visualization Design Tools provide a blueprint for the creation of visualization interfaces that can enable all users, instead of just programmers, to contribute directly to the visualization design and creation process. Together, these set the stage for future visualization interfaces to better solve our biggest visualization challenges.

Item Efficient learning in linearly solvable MDP models. (2012-06) Li, Ang. Linearly solvable Markov Decision Process (MDP) models are a powerful subclass of problems with a simple structure that allows the policy to be written directly in terms of the uncontrolled (passive) dynamics of the environment and the goals of the agent. However, there have been no learning algorithms for this class of models. In this research, inspired by Todorov's way of computing the optimal action, we showed how to construct passive dynamics from any transition matrix, use Bayesian updating to estimate the model parameters, and apply approximate and efficient Bayesian exploration to speed learning. In addition, the computational cost of learning was reduced using intermittent Bayesian updating, reducing the frequency of solving the Bellman equation (either the normal form or Todorov's form). We also gave a polynomial theoretical time complexity bound for the convergence of the learning process of our new algorithm, and applied this directly to a linear time bound for the subclass of the reinforcement learning (RL) problem via MDP models with the property that the transition error depends only on the agent itself. Test results for our algorithm in a grid world were presented, comparing our algorithm with the BEB algorithm. The results showed that our algorithm learned more than the BEB algorithm without losing convergence speed, so that the advantage of our algorithm increased as the environment got more complex. We also showed that our algorithm's performance is more stable after convergence.
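For context on the "linearly solvable" structure the abstract relies on (this follows Todorov's published formulation; the notation below is supplied here rather than taken from the thesis), the exponentiated cost-to-go, or desirability, z(i) = exp(-v(i)) satisfies a linear Bellman equation, and the optimal action is simply a reweighting of the passive dynamics p:

```latex
z(i) = e^{-q(i)} \sum_{j} p(j \mid i)\, z(j),
\qquad
u^{*}(j \mid i) = \frac{p(j \mid i)\, z(j)}{\sum_{k} p(k \mid i)\, z(k)}
```

Here q(i) is the state cost and p(j | i) the uncontrolled transition matrix. Because the first relation is linear in z, it can be solved as a linear system or eigenvector problem instead of by iterating a nonlinear Bellman backup, which is what makes model-based learning in this class comparatively cheap once the passive dynamics and costs are estimated.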
Item Energy transfer ray tracing with OptiX. (2012-06) Halverson, Scot. QUIC Energy is an energy modeling system for urban environments. Our research group has developed QUIC Energy as part of a set of GPU-assisted tools with a common goal of increasing knowledge relating urban organization and design to environmental concerns. We hypothesize that it is possible to optimize urban organization, building placement, and material selection to minimize building energy consumption for heating and cooling, as well as to minimize air pollution exposure. Our work focuses on the interactions between urban structures and their surroundings. Using this information, we are able to investigate potential strategies for optimization along a number of variables. With GPU-assisted computations, we are able to rapidly perform large numbers of simulations for our optimization algorithms. The focus of QUIC Energy is on energy transfer in urban environments. It accounts for radiant energy interactions between buildings, a ground layer, participating media including an atmosphere, airborne particulate, and vegetation, and incoming solar radiation. It is capable of modeling heat conditions for urban environments, including surface and volumetric temperatures. QUIC Energy performs its calculations by means of ray tracing methods implemented using NVIDIA's OptiX and CUDA frameworks for GPU-assisted computations. GPU-based ray tracing allows QUIC Energy to rapidly model heat and energy flow in varied environments under a wide range of conditions. QUIC Energy is part of the Green Environmental Urban Simulations for Sustainability (GEnUSiS) project. The goal behind GEnUSiS is to present a set of tools which can be used to optimize urban infrastructure along a number of environmentally focused variables. GEnUSiS is being used to study the interactions of green infrastructure - including parks, green roofs, and environmentally friendly materials - with urban environments over a wide range of scales.

Item Enhancing GPU programmability and correctness through transactional execution. (2015-01) Holey, Anup Purushottam. Graphics Processing Units (GPUs) are becoming increasingly popular not only across various scientific communities, but also as integrated data-parallel accelerators on existing multicore processors. Support for massive fine-grained parallelism in contemporary GPUs provides a tremendous amount of computing power. GPUs support thousands of lightweight threads to deliver high computational throughput. The popularity of GPUs is facilitated by easy-to-adopt programming models such as CUDA and OpenCL that aim to ease programmers' efforts while developing parallel GPU applications. However, designing and implementing correct and efficient GPU programs is still challenging, since programmers must consider interaction between thousands of parallel threads. Therefore, addressing these challenges is essential for improving programmers' productivity as well as software reliability. Towards this end, this dissertation proposes mechanisms for improving the programmability of irregular applications and ensuring the correctness of compute kernels. Some applications possess abundant data-level parallelism, but are unable to take advantage of the GPU's parallelism. They exhibit irregular memory access patterns to the shared data structures. Programming such applications on GPUs requires synchronization mechanisms such as locks, which significantly increase the programming complexity. Coarse-grained locking, where a single lock controls all the shared resources, although it reduces programming effort, can substantially serialize GPU threads.
On the other hand, fine-grained locking, where each data element is protected by an independent lock, although it facilitates maximum parallelism, requires significant programming effort. To overcome these challenges, we propose transactional memory (TM) on the GPU that is able to achieve performance comparable to fine-grained locking, while requiring minimal programming effort. Transactional execution can incur runtime overheads due to activities such as detecting conflicts across thousands of GPU threads and managing a consistent memory state. Thus, in this dissertation we illustrate lightweight TM designs that are capable of scaling to a large number of GPU threads. In our system, programmers simply mark the critical sections in the applications, and the underlying TM support is able to achieve performance comparable to fine-grained locking. Ensuring functional correctness on GPUs that are capable of supporting thousands of concurrent threads is crucial for achieving high performance. However, GPUs provide relatively little guarantee with respect to the coherence and consistency of the memory system. Thus, they are prone to a multitude of concurrency bugs related to inconsistent memory states. Many such bugs manifest as some form of data race condition at runtime. It is critical to identify such race conditions, and mechanisms that aid their detection at runtime can form the basis for powerful tools for enhancing GPU software correctness. However, relatively little attention has been given to exploring such runtime monitors. Most prior works focus on software-based approaches that incur significant overhead. We believe that minimal hardware support can enable efficient data race detection for GPUs. In this dissertation, we propose a hardware-accelerated data race detection mechanism for efficient and accurate data race detection in GPUs. Our evaluation shows that the proposed mechanism can accurately detect data race bugs in GPU programs with moderate runtime overheads.
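As a toy illustration of the bug class targeted above: a data race is a pair of accesses to the same address from different threads, at least one of them a write, with no synchronization ordering them. The sketch below checks a hypothetical access trace on the CPU using barrier epochs; it is not the dissertation's hardware mechanism, only the condition such a mechanism has to recognize.

```python
from collections import namedtuple

# epoch = number of barrier synchronizations a thread has passed when it issues the access
Access = namedtuple("Access", "thread addr is_write epoch")

def find_races(trace):
    """Return pairs of conflicting accesses: same address, different threads,
    at least one write, and not separated by a barrier (same epoch)."""
    races = []
    for i, a in enumerate(trace):
        for b in trace[i + 1:]:
            if (a.addr == b.addr and a.thread != b.thread
                    and (a.is_write or b.is_write) and a.epoch == b.epoch):
                races.append((a, b))
    return races

trace = [Access(thread=0, addr=0x10, is_write=True,  epoch=0),
         Access(thread=1, addr=0x10, is_write=False, epoch=0),   # races with the write above
         Access(thread=1, addr=0x10, is_write=True,  epoch=1)]   # after a barrier: no race
print(find_races(trace))
```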
Item Extending the Hirst and St-Onge measure of semantic relatedness for the Unified Medical Language System. (2012-08) Choudhari, Mugdha.

Item A high performance framework for coupled urban microclimate models. (2014-11) Overby, Matthew Charles. Urban form modifies the microclimate and may trap heat and pollutants. This causes a rise in energy demands to heat and cool building interiors. Mitigating these effects is a growing concern due to the increasing urbanization of major cities. Researchers, urban planners, and city architects rely on sophisticated simulations to investigate how to reduce building and air temperatures. However, the complex interactions between urban form and the microclimate are not well understood. Many factors shape the microclimate, such as solar radiation, atmospheric convection, longwave interaction between nearby buildings, and more. As science evolves, new models are developed and existing ones are improved. More accurate and sophisticated models often impose higher computational overhead. This paper introduces QUIC EnvSim (QES), a scalable, high-performance framework for coupled urban microclimate models. QES allows researchers to develop and modify such models, and provides tools to facilitate input/output communications, model interaction, and the utilization of computational resources for efficient simulations. Common functionality of urban microclimate modeling is optimally handled by the system. By employing Graphics Processing Units (GPUs), simulations within QES can be substantially accelerated. Models for computing view factors, surface temperatures, and radiative exchange between urban materials and vegetation have been implemented and coupled into larger, more sophisticated simulations. These models can be applied to complex domains such as large forests and dense cities. Visualizations, statistics, and analysis tools provide a detailed view of experimental results. Performance increases with additional GPUs and hardware availability. Several diverse examples have been implemented to provide details on utilizing the features of QES for a wide range of applications.

Item Identifying candidate salivary oral cancer biomarkers: accurate protein quantification and analysis on LTQ type mass spectrometers. (2011-05) Onsongo, Getiria Innocent. Cancer is one of the leading causes of death worldwide, accounting for around 13% of all deaths. Oral cancer is one of the more common cancers, occurring more frequently than leukemia, brain, stomach, or ovarian cancer. Unfortunately, the 5-year survival rate for oral cancer has not significantly improved in the past 30 years and remains at approximately 50%, in part due to the lack of reliable diagnostic biomarkers for early detection. It is estimated that, if diagnosed and treated early, survival rates for oral cancer would significantly improve, to between 80% and 90%. We need reliable biomarkers for the diagnosis and early detection of oral cancer. Recent developments in high-throughput proteomics techniques have made it possible to detect and identify low-abundance proteins in complex biological fluids such as saliva. These low-abundance proteins could be a source of the elusive reliable biomarkers needed to improve survival rates for oral cancer. Limiting the widespread use of these proteomics techniques is the lack of an accurate protein relative quantification technique. A typical high-throughput experiment identifies several thousand proteins with several hundred differentially abundant proteins. The cost of validating candidate biomarkers prevents validation of each differentially abundant protein to identify promising candidate biomarkers. We need computational techniques to identify promising candidate biomarkers. This two-part dissertation presents: 1) a new technique for accurate protein relative quantification, implemented in freely available, open-source software (LTQ-iQuant), and 2) relational database operators for analyzing differentially abundant proteins to identify promising candidate biomarkers. Linear ion trap mass spectrometers, such as the hybrid LTQ-Orbitrap, are a popular choice for isobaric-tag-based shotgun proteomics because of their advantages in analyzing complex biological samples. Coupled with orthogonal fractionation techniques, they can be used to detect low-abundance proteins, extending the range for detecting possible biomarkers. Limiting the widespread use of this combination for quantitative proteomics studies is the lack of a technique tailored to LTQ type instruments that accurately reports protein abundance ratios and is implemented in an automated software pipeline. This thesis presents a new technique, implemented in freely available, open-source software, that fulfills this need. A major limitation of existing computational techniques when using high-throughput techniques is results that are too broad to be practically useful.
A lot of the 'potential' disease-specific biomarkers discovered have been found not to be specific to the disease being studied. They either belong to biological categories that change in response to infection or tissue injury, or are proteins whose changes are induced by other stresses such as medication and diet. This thesis extends the relational database engine to enable the use of biological pathways to identify promising candidate biomarkers. Using biological pathways to analyze high-throughput data avoids results that are too broad to be practically useful. Protein differential abundance is often the criterion used to identify candidate biomarkers in high-throughput discovery-based biomarker studies. However, protein quantity by itself might not be the salient marker parameter. Protein function is often dependent on post-translational modifications such as phosphorylation and glycosylation. By only using differential abundance to identify candidate biomarkers, we are limiting our ability to identify reliable biomarkers. We further develop new operators that, in addition to using user-specified pathways, use post-translational modification information to analyze high-throughput data. For the first time, we demonstrate the feasibility of using post-translational modifications with relational database operators to analyze high-throughput proteomics data. Collectively, this work will facilitate the search for reliable biomarkers. LTQ-iQuant will make LTQ instruments and isobaric peptide tagging accessible to more proteomics researchers, providing a new window into complex biological fluids. Relational operators will provide a systematic way of bridging the gap between the unbiased data-driven approach and the hypothesis-driven approach to prioritize candidate biomarkers.

Item Image classification with minimal supervision. (2011-06) Joshi, Ajay Jayant. With growing collections of images and video, it is imperative to have automated techniques for extracting information from visual data. A primary task that lies at the heart of information extraction is image classification, which refers to classifying images or parts of them as belonging to certain categories. Accurate and reliable image classification has diverse applications: web image and video search, content-based image retrieval, medical image analysis, autonomous robotics, gesture-based human-computer interfaces, etc. However, considering the large image variability and typically high-dimensional representations, training predictive models requires substantial amounts of annotated data, often provided through human supervision; supplying such data is expensive and tedious. This training bottleneck is the motivation for the development of robust algorithms that can build powerful predictive models with little training or supervision. In this thesis, we propose new algorithms for learning with data, particularly focusing on active learning. Instead of passively accepting training data, the basic idea in active learning is to select the most informative data samples for the human to annotate. This can lead to extremely efficient allocation of resources, and results in predictive models that require far fewer training samples compared to the passive setting. We first propose an active sample selection criterion for training large multi-class classifiers with hundreds of categories. The criterion is easy to compute, and extends traditional two-class active learning to the multi-class setting.
We then generalize the approach to handle only binary (yes/no) feedback while still performing classification in the multi-class domain. The proposed modality provides substantial interactive simplicity and makes it easy to distribute the training process across many users. Active learning has been studied from two different perspectives: selective sampling from a pool, and query synthesis; both perspectives offer different tradeoffs. We propose a formulation that combines both approaches while leveraging their individual strengths, resulting in a scalable and efficient multi-class active learning scheme. Experimental results show efficient training of classification systems with a pool of a few million images on a single computer. Active learning is intimately related to a large body of previous work on experiment design and optimal sensing; we discuss the similarities and key differences between the two. A new greedy batch-mode sample selection algorithm is proposed that shows substantial benefits over random batch selection when iterative querying cannot be applied. We finally discuss two applications of active selection: i) active learning of compact hash codes for fast image search and classification, and ii) incremental learning of a classifier in a resource-constrained environment to handle changing scene conditions. Throughout the thesis, we focus on thorough experimental validation on a variety of image datasets to analyze the strengths and weaknesses of the proposed methods.
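For context, here is a minimal sketch of pool-based uncertainty sampling, the general pattern that multi-class active selection criteria of this kind build on. The best-versus-second-best margin and the synthetic probabilities are illustrative assumptions, not the thesis's exact criterion or data.

```python
import numpy as np

def margin_uncertainty(probs):
    """Best-versus-second-best margin: a smaller margin means a more uncertain prediction."""
    top_two = np.sort(probs, axis=1)[:, -2:]
    return top_two[:, 1] - top_two[:, 0]

def select_queries(probs, n_queries=5):
    """Pick the unlabeled pool samples the current classifier is least sure about."""
    return np.argsort(margin_uncertainty(probs))[:n_queries]

rng = np.random.default_rng(0)
pool_probs = rng.dirichlet(np.ones(10), size=1000)   # fake class-probability outputs, 10 classes
print(select_queries(pool_probs))                    # indices to send to the human annotator
```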
Item Improving MapReduce performance under widely distributed environments. (2012-06) Wang, Chenyu.

Item Improving results for the 2009 and 2010 INEX focused tasks. (2011-08) Acquilla, Natasha Deepak. Information retrieval systems aim to retrieve precise and relevant information in response to a user's query. In past years, entire documents which were considered to be relevant or highly correlating were returned to users. However, with the growth of the web and large numbers of XML documents, smaller elements or passages can be returned to the user for more precise results. This thesis explains Flex, our system for dynamic element retrieval, wherein XML elements rather than entire documents are retrieved and returned to the user. It also gives an overview of the process of generating highly correlating elements (from a large document collection) for a set of queries. The aim of this thesis is to improve the results for the INEX 2009 and 2010 Ad Hoc Focused Tasks. The Focused Tasks require that each query return a result set of non-overlapping elements. This thesis describes the techniques involved in producing such elements and compares the results produced.

Item Improving results for the INEX 2009 and 2010 relevant in context tasks. (2011-08) Narendravarapu, Reena Rachel. Information retrieval systems focus on retrieving information relevant to the user's query. Many strategies have been developed to retrieve documents. Due to the increase in data across the web, it is very important to retrieve relevant elements at the appropriate level of granularity. Our element retrieval system, Flexible Retrieval (Flex), works with semi-structured documents to retrieve elements at run time. The goal of this thesis is to improve the results of the INEX Ad Hoc 2009 and 2010 Relevant in Context (RiC) tasks. The RiC task returns a set of focused elements that are ordered by document. Snippets are incorporated as part of the 2010 INEX RiC task. Snippets give an overview of the document and hence should be short, with few irrelevant characters, so as not to lose user interest. Appropriate retrieval techniques are developed to accommodate snippets. The Restricted Relevant in Context (RRiC) task, in which a snippet can have a maximum length of 500 characters, is also described.