Browsing by Subject "machine learning"
Now showing 1 - 20 of 44
Item: Aquifer and Stratigraphy Code Prediction Using a Random Forest Classifier: An Exploration of Minnesota’s County Well Index (2021-05). Thielsen, Chris.

We live in an era of big data, brought on by the advent of automatic large-scale data acquisition in many industries. Machine learning can be used to take advantage of large data sets, predicting otherwise unknown information from them. The Minnesota County Well Index (CWI) database contains information about wells and borings in Minnesota. While a plethora of information is recorded in CWI, some objective codes are missing. A random forest classifier is used to predict aquifer and stratigraphy codes in CWI based on the data provided in drillers’ logs, i.e., before the strata are interpreted by a geologist. We find that by learning from the information written down by the well driller, stratigraphic codes can be predicted with an accuracy of 92.15%. There are 2,600,000 strata recorded in CWI; these codes are not only useful in understanding the geologic history of Minnesota, but also directly inform groundwater models.
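The workflow described in this abstract maps driller-log descriptions to stratigraphy codes with a random forest. A minimal scikit-learn sketch of that kind of pipeline; the file name and column names are hypothetical, not CWI's actual schema:

```python
# Sketch: predict stratigraphy codes from driller-log text with a random
# forest. The CSV and column names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

logs = pd.read_csv("cwi_strata.csv")  # hypothetical export of CWI strata records
X = TfidfVectorizer(min_df=5).fit_transform(logs["driller_description"])
y = logs["stratigraphy_code"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```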
Item: Automated Detection And Quantification Of Pain Using Electroencephalography (2018-05). Vijayakumar, Vishal.

Effective assessment strategies are needed to better manage pain. In addition to self-report, an objective pain assessment system to detect, quantify, and track the intensity of pain reduces the uncertainty of treatment outcome and provides a reliable benchmark for longitudinal evaluation of pain therapies. With electroencephalography (EEG) gaining traction as a reliable tool for characterizing brain regions active during pain, this work presents the development of robust and accurate machine learning algorithms on EEG neuroimaging data to detect and quantify tonic thermal pain. To quantify pain, a random forest model was trained using time-frequency wavelet representations of independent components obtained from EEG data. The mean classification accuracy for predicting pain on an independent test subject, on a scale of 1-10, is 89.45%, the highest among existing state-of-the-art quantification algorithms for EEG, demonstrating the potential of this tool to be used clinically to help improve chronic pain treatment. A temporally pain-specific biomarker was developed using EEG microstates to evaluate their specificity to pain compared to rest and two non-rest conditions evoking similar responses. Multifractal analyses on the microstate sequence showed that microstate interactions during pain were significantly more stable across time scales compared to non-painful conditions, but significantly more chaotic compared to resting state. A pain detection algorithm using deep learning techniques was constructed utilizing non-orthogonal temporal dependencies between microstates. Each branch of the deep learning network was trained to differentiate between pain and a non-painful condition to increase the specificity of the final algorithm to pain. The resulting algorithm improved on the state of the art by 14%, scoring 90.67% in terms of specificity to various levels of pain compared to non-painful stimuli. Stacking this deep-learning pain detection algorithm on top of the pain quantification algorithm showed a 10% improvement in F-score over state-of-the-art pain quantification algorithms. This is an encouraging step forward in developing a clinically feasible tool that can detect, record, quantify, and longitudinally compare the intensities of pain in patients, to better aid the development of effective therapies to manage pain.

Item: Clausal Complementation in Nepal Bhasa (2021-11). Zhang, Borui.

This dissertation examines the syntax and lexical semantics of finite verbal dependent clauses in Nepal Bhasa through fieldwork and by creating a shallow parsing model and corpus-based search to test descriptive generalizations. Nepal Bhasa deploys two main syntactic complementation strategies: head-final pre-verbal CPs, which I argue are true complements, and head-initial post-verbal CPs, which I argue are parataxis. Complementation additionally introduces certain syntactic and morphological constraints. Inchoative and perfective morphemes appear in free alternation in some mono-clausal environments, whereas in embedding structures, an embedding predicate with the inchoative suffix is restricted. By annotating a small dataset from open-source Nepal Bhasa data, I train a chunking model by adopting the technique of transfer learning in machine learning, fine-tuning the pre-trained mBERT language model. The preliminary test results show the potential usefulness of NLP tools for effectively building corpora for research on low-resource languages. In particular, this method corroborates my descriptive generalization that the inchoative is restricted on embedding predicates in Nepal Bhasa. An additional search over structural treebank corpora of typologically related languages adds evidence for a cross-linguistic generalization on embedding verb restrictions.
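Fine-tuning mBERT for chunking, as in the Zhang dissertation, amounts to token classification. A minimal sketch with Hugging Face transformers; the sentence, chunk tags, and label set below are toy placeholders, not the dissertation's Nepal Bhasa corpus:

```python
# Sketch: fine-tune multilingual BERT for chunking as token classification.
# The toy tokens, tags, and label set are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-NP", "I-NP", "B-VP", "I-VP"]  # hypothetical chunk tag set
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels))

words = ["wa", "khwopa", "du"]                  # toy word-level tokens
tags = ["B-NP", "I-NP", "B-VP"]                 # toy word-level chunk tags
enc = tok(words, is_split_into_words=True, return_tensors="pt")

# Align word-level tags to subword tokens; special tokens get -100 (ignored).
tag_ids = [-100 if w is None else labels.index(tags[w]) for w in enc.word_ids()]

out = model(**enc, labels=torch.tensor([tag_ids]))
out.loss.backward()                             # one fine-tuning gradient step
torch.optim.AdamW(model.parameters(), lr=2e-5).step()
```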
Item: A combined statistical and machine learning approach for single channel speech enhancement (2015-05). Tseng, Hung-Wei.

In this thesis, we study the single-channel speech enhancement problem, the goal of which is to recover a desired speech signal from a monaural noisy recording. Speech enhancement is a focal issue to study due to its widespread usage in speech-related applications, such as hearing aids, mobile communications, and speech recognition systems. Three speech enhancement algorithms are proposed. In the first algorithm, the Wiener Non-negative Matrix Factorization (WNMF), we combine traditional Wiener filtering and NMF into a single optimization problem. The objective is to minimize the mean square error, similar to Wiener filtering, and the constraints ensure the enhanced speech is sparsely representable by the speech model learned by NMF. WNMF is novel because it utilizes NMF to capture the speech-specific structure while simultaneously leveraging it, thus improving the Wiener filtering. For the second algorithm, we propose a Sparse Gaussian Mixture Model (SGMM) that extends the traditional NMF and the Gaussian model. SGMM better captures the complex structure of speech than the traditional NMF. To control for overrepresentation of SGMM, we impose sparsity in order to ensure that only a few Gaussian models are simultaneously active. Computationally, this is achieved by using an l0-norm in the constraint of the maximum-likelihood (ML) estimation. The contribution of SGMM is in solving the constrained ML estimation, which has a closed-form update even with the non-convex and non-smooth l0-norm constraint. The final algorithm proposed is the Sparse NMF + Deep Neural Network (SNMF-DNN), in which we treat speech enhancement as a supervised regression problem: the goal is to estimate the optimal enhancement gain. SNMF, originally designed for source separation, is used to extract features from the noisy recording. A DNN is subsequently trained to estimate the optimal enhancement gain. Although our system is simple and does not require any sophisticated handcrafted features, we are able to demonstrate a substantial improvement in both intelligibility and enhanced speech quality.
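The WNMF idea couples a Wiener-style gain with NMF models of speech and noise. A minimal sketch of that general recipe, not the thesis's joint optimization: the spectrograms are random stand-ins and the nonnegative decoding is a least-squares-plus-clip shortcut rather than a proper NMF update:

```python
# Sketch: Wiener-style spectral gain built from NMF speech and noise bases.
# Toy random "spectrograms" stand in for real |STFT| magnitude data.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
S_speech = rng.random((200, 257))   # training frames (time x freq), toy
S_noise = rng.random((200, 257))
S_noisy = S_speech + S_noise        # pretend mixture

W_s = NMF(n_components=20, max_iter=500).fit(S_speech).components_
W_n = NMF(n_components=20, max_iter=500).fit(S_noise).components_
B = np.vstack([W_s, W_n])           # stacked spectral bases (40 x 257)

# Activations for the mixture (least squares + clip as a crude shortcut).
A, *_ = np.linalg.lstsq(B.T, S_noisy.T, rcond=None)
A = np.clip(A, 0, None)

speech_part = A[:20].T @ W_s
noise_part = A[20:].T @ W_n
gain = speech_part / (speech_part + noise_part + 1e-8)  # Wiener-like mask
S_enhanced = gain * S_noisy
```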
Item: Computer-aided diagnosis of prostate cancer with multiparametric MRI (2020-07). Leng, Ethan.

Prostate cancer (PCa) is a leading cause of cancer death among men in the U.S. Multiparametric magnetic resonance imaging (mpMRI), a combination of anatomic and functional imaging methods, has demonstrated potential to improve PCa detection. However, radiologic interpretation of mpMRI data is time-consuming and highly dependent on reader expertise. Therefore, a computer-aided detection (CAD) system that could accurately and automatically detect PCa using mpMRI data would provide tremendous clinical utility. This dissertation research focuses on the development and improvement of several components of such a CAD system, with topics that include image registration, quantitative pathology, and model development and evaluation.

Item: Data-Driven Framework for Energy Management in Extended Range Electric Vehicles Used in Package Delivery Applications (2020-08). Wang, Pengyue.

Plug-in Hybrid Electric Vehicles (PHEVs) have the potential to achieve high fuel efficiency and reduce on-road emissions compared to engine-powered vehicles when using well-designed Energy Management Strategies (EMSs). The EMS of PHEVs has been a research focus for many years, and optimal or near-optimal performance has been achieved using control-oriented approaches like Dynamic Programming (DP) and Model Predictive Control (MPC). These approaches require either accurate predictive models of trip information during driving cycles or detailed velocity profiles in advance. However, such detailed information is not feasible to obtain in some real-world applications, like the delivery vehicle application studied in this work. Here, data-driven approaches were developed and tested over real-world trips with the help of two-way Vehicle-to-Cloud (V2C) connectivity. First, the EMS problem was formulated as a probability density estimation problem and solved by Bayesian inference. The Bayesian algorithm elegantly handles conditions where only small amounts of data are available and parameters must be estimated sequentially, which matches the characteristics of the data generated by delivery vehicles. The predicted value of the parameter for the next trip is determined by carefully designed prior information and all the available data of the vehicle so far. The parameter is updated before the delivery tasks using the latest trip information and stays static during the trip. This method was demonstrated on 13 vehicles with 155 real-world delivery trips in total and achieved an average energy efficiency improvement of 8.9% with respect to MPGe (miles per gallon equivalent). For vehicles with sufficient data to represent the characteristics of future delivery trips, the EMS problem was formulated as a sequential decision-making problem under uncertainty and solved by deep reinforcement learning (DRL) algorithms. An intelligent agent was trained by interacting with a simulated environment built from the vehicle model and historical trips. After training and validation, the optimized parameter in the EMS was updated by the trained intelligent agent during the trip. This method was demonstrated on 3 vehicles with 36 real-world delivery trips in total and achieved an average energy efficiency improvement of 20.8% in MPGe. Finally, I investigated three problems that could be encountered when the developed DRL algorithms are deployed in real-world applications: model uncertainty, environment uncertainty, and adversarial attacks. For model uncertainty, an uncertainty-aware DRL agent was developed, enabled by the technique of Bayesian ensembles. Given a state, the agent quantifies the uncertainty about the output action, so that although actions are calculated for all input states, the high uncertainty associated with unfamiliar or novel states is captured. For environment uncertainty, a risk-aware DRL agent was built based on distributional RL algorithms. Instead of making decisions based on expected returns, as standard RL algorithms do, actions were chosen with respect to conditional value at risk, which gives more flexibility to the user and can be adapted to different application scenarios. Lastly, the influence of adversarial attacks on the developed neural network-based DRL agents was quantified. My work shows that to apply DRL agents to real-world transportation systems, adversarial examples in the form of cyber-attacks should be considered carefully.
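The per-trip Bayesian update described above can be illustrated with a conjugate normal-normal model: the parameter used for the next trip is the posterior mean given all trips so far. A minimal sketch; all numbers below are illustrative, not the thesis's prior or data:

```python
# Sketch: sequential Bayesian update of a per-trip EMS parameter.
# Normal likelihood with known noise, normal prior (values illustrative).
mu, tau = 0.5, 0.2        # prior mean/std, e.g. from fleet-level knowledge
sigma = 0.1               # assumed trip-to-trip observation noise

trips = [0.62, 0.58, 0.65, 0.60]   # best parameter inferred after past trips
for x in trips:
    prec = 1 / tau**2 + 1 / sigma**2          # posterior precision
    mu = (mu / tau**2 + x / sigma**2) / prec  # posterior mean
    tau = prec ** -0.5
    print(f"use {mu:.3f} (+/- {tau:.3f}) for the next trip")
```

The posterior after each trip becomes the prior for the next, which matches the abstract's design: the parameter is fixed during a trip and refreshed from accumulated data before the next one.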
Item: Development of Image Analysis Tools to Quantify Potato Tuber Quality Traits (2022-08). Miller, Michael.

Potato is the most popular non-cereal food crop and a major staple crop. Despite the importance of potato, it has seen little yield improvement through breeding over the past century when compared to other crops. One difficulty in potato breeding is the large number of quality traits that must be accounted for in order to create marketable potato varieties. These quality traits are often measured using imprecise, subjective scales. This thesis covers my work in improving the tools available for measuring and breeding for potato tuber quality traits. In Chapter 1, I review the literature relevant to a selection of quality traits and their measurement. I discuss machine learning and its use in identifying more intricate tuber quality traits, as well as efforts to perform genomic selection in autotetraploid potato as a possible application for highly quantitative quality trait data. Chapter 2 covers the mechanics and capabilities of the potato tuber image analysis program TubAR. I compare the quantitative measurements provided by TubAR to human visual scores for analogous traits. In Chapter 3, I discuss efforts to expand the scope of traits that can be measured with image analysis by employing machine learning image classification, using the pressure bruise and skin finish traits.

Item: Development of Interatomic Potentials with Uncertainty Quantification: Applications to Two-dimensional Materials (2019-07). Wen, Mingjian.

Atomistic simulation is a powerful computational tool to investigate materials on the microscopic scale and is widely employed to study a large variety of problems in science and engineering. Empirical interatomic potentials have proven to be an indispensable part of atomistic simulation due to their unrivaled computational efficiency in describing the interactions between atoms, which produce the forces governing atomic motion and deformation. Atomistic simulation with interatomic potentials, however, has historically been viewed as a tool limited to providing only qualitative insight. A key reason is that such simulations have many sources of uncertainty that are difficult to quantify, making it difficult to place confidence intervals on the obtained results. This thesis presents my research on the development of interatomic potentials with the ability to quantify the uncertainty in simulation results. The methods to train interatomic potentials and quantify the uncertainty are demonstrated throughout this thesis on two-dimensional materials and heterostructures, whose low-dimensional nature makes them distinct from their three-dimensional counterparts in many aspects. Both physics-based and machine learning interatomic potentials are developed for MoS2 and multilayer graphene structures. The new potentials accurately model the interactions in these systems, reproducing a number of structural, energetic, elastic, and thermal properties obtained from first-principles calculations and experiments. For physics-based potentials, a method based on Fisher information theory is used to analyze the parametric sensitivity and the uncertainty in material properties obtained from phase averages. We show that the dropout technique can be applied to train neural network potentials and demonstrate how to obtain the predictions and the associated uncertainties of material properties practically and efficiently from such potentials. Putting all these ingredients of my research together, we create an open-source fitting framework to train interatomic potentials, in the hope that it can make the development and deployment of interatomic potentials easier and less error-prone for other researchers.
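The dropout-based uncertainty mentioned in the Wen abstract is commonly realized as Monte Carlo dropout: keep dropout active at inference time and sample repeated forward passes. A toy PyTorch sketch; the descriptor size and architecture are placeholders, not the thesis's network:

```python
# Sketch: Monte Carlo dropout uncertainty for a toy descriptor -> energy
# network. Architecture, sizes, and input are illustrative placeholders.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Dropout(p=0.1),
                    nn.Linear(64, 1))
net.train()                        # keep dropout active at inference time

x = torch.randn(1, 8)              # toy atomic-environment descriptor
with torch.no_grad():
    samples = torch.stack([net(x) for _ in range(100)])
print("energy mean:", samples.mean().item(),
      "| std (uncertainty):", samples.std().item())
```

The spread of the sampled outputs serves as the uncertainty estimate; a large std flags inputs far from the training data.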
Item: Discovering genetic drivers in acute graft-versus-host disease after allogeneic hematopoietic stem cell transplantation (2019-05). Huang, Hu.

Acute graft-versus-host disease (GVHD) is one of the major complications after allogeneic hematopoietic stem cell transplantation (allo-HCT) that causes non-relapse morbidity and mortality. Although the increasing matching rate of the human leukocyte antigen (HLA) genes between donor and recipient (DR) has significantly reduced the risk of GVHD, clinically significant GVHD remains a transplantation challenge, even in HLA-identical transplants. Candidate gene studies and genome-wide association studies have revealed susceptible individual genes and gene pairs from DR pairs that are associated with acute GVHD; however, the roles of genetic disparities between donor and recipient remain to be understood. To identify genetic factors linked to acute GVHD, we investigated the classical HLA and non-HLA genes and conducted a genome-wide clinical outcome association study. Assessment of 4,646 antigen recognition domain (ARD)-matched unrelated donor allo-HCT cases showed that the frequency of mismatches outside the ARD in HLA genes is very low when the DR pairs are matched at the ARD. Due to the low frequency of amino acid mismatches in the non-ARD region and their reportedly weak alloimmune reactions, we suggest that non-ARD sequence mismatches within ARD-matched DR pairs have limited influence on the development of acute GVHD and may not be a primary factor. The genome-wide clinical outcome association study between DR pairs observed multiple autosomal minor histocompatibility antigens (MiHAs) restricted by HLA typing, though their association with acute GVHD outcome was not statistically significant. This result suggests that HLA mismatching outweighs other genetic mismatches as a contributor to acute GVHD risk. In the cases of female donors to male recipients, we identified a significant association of the Y chromosome-specific peptides encoded by PCDH11Y, USP9Y, UTY, and NLGN4Y with the acute GVHD outcome. Additionally, we developed a machine learning-based genetic variant selection algorithm for ultra-high-dimensional transplant genomic studies. The algorithm successfully selected a set of genes from over 1 million genetic variants, all of which have evidence linking them to transplant-related complications. This work offers evidence and guidance for further research in acute GVHD and allo-HCT and provides useful bioinformatics and data mining tools for transplant genomic studies.

Item: Distributed Training with Heterogeneous Data: Bridging Median- and Mean-Based Algorithms (2022-03). Chen, Xiangyi.

Recently, there has been growing interest in the study of median-based algorithms for distributed non-convex optimization. Two prominent examples are signSGD with majority vote, an effective approach for communication reduction via 1-bit compression of the local gradients, and medianSGD, an algorithm recently proposed to ensure robustness against Byzantine workers. The convergence analyses for these algorithms rely critically on the assumption that all the distributed data are drawn iid from the same distribution. However, in applications such as Federated Learning, the data across different nodes or machines can be inherently heterogeneous, which violates such an iid assumption. This work analyzes signSGD and medianSGD in distributed settings with heterogeneous data. We show that these algorithms are non-convergent whenever there is some disparity between the expected median and mean of the local gradients. To overcome this gap, we provide a novel gradient correction mechanism that perturbs the local gradients with noise, which we show can provably close the gap between the mean and median of the gradients. The proposed methods largely preserve the nice properties of these median-based algorithms, such as the low per-iteration communication complexity of signSGD, and further enjoy global convergence to stationary solutions. Our perturbation technique can be of independent interest when one wishes to estimate a mean through a median estimator.
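The mean-median disparity at the heart of the Chen thesis is easy to see numerically, as is the smoothing effect of noise perturbation. A schematic numpy demo of the effect the work exploits, with illustrative numbers rather than the paper's actual analysis:

```python
# Sketch: with heterogeneous workers, the median of local gradients can
# differ from their mean; perturbing with noise shrinks that gap (demo).
import numpy as np

rng = np.random.default_rng(0)
local_grads = np.array([5.0, 0.1, 0.2])   # skewed: one atypical worker
print("mean:", local_grads.mean(), "| median:", np.median(local_grads))

# Median of noise-perturbed gradients, averaged over many rounds.
noisy_medians = [np.median(local_grads + rng.normal(0, 3.0, size=3))
                 for _ in range(20000)]
print("expected perturbed median:", np.mean(noisy_medians))
# The perturbed median moves toward the mean, illustrating why the
# correction helps median-based updates converge under heterogeneity.
```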
Item: Energy Efficient Computing with Time-Based Digital Circuits (2019-05). Everson, Luke.

Advancements in semiconductor technology have given the world economical, abundant, and reliable computing resources, which have enabled countless breakthroughs in science, medicine, and agriculture that have improved the lives of many. Due to physical limits, the rate of these advancements is slowing, while demand for ever more computing horsepower grows. Novel computer architectures that leverage the foundation of conventional systems must become mainstream to continue providing the improved hardware required by engineers, scientists, and governments to innovate. This thesis provides a path forward by introducing multiple time-based computing architectures for a diverse range of applications. Simply put, time-based computing encodes the output of a computation in the time it takes to generate the result. Conventional systems encode this information in voltages across multiple signals; the performance of these systems is tightly coupled to improvements in semiconductor technology. Time-based computing elegantly uses the simplest components from conventional systems to efficiently compute complex results. Two time-based neuromorphic computing platforms, based on a ring oscillator and a digital delay line, are described. An analog-to-digital converter used to record brain activity is designed in the time domain using a beat frequency circuit. A novel path planning architecture, with designs for 2D and 3D routes, is implemented in the time domain. Finally, a machine learning application using time-domain inputs enables improved performance in heart rate prediction and biometric identification, and introduces a new method for using machine learning to predict temporal signal sequences. As these innovative architectures are presented, it will become clear that the way forward will be increasingly enabled by time-based designs.
Item: Evaluation of Pharmacostatistical Model Components Using a Nonlinear Mixed-effect Approach (2022-12). Jaber, Mutaz.

A nonlinear mixed-effect population pharmacokinetic (PPK) approach is a pharmacostatistical concept used to study pharmacokinetic (PK) and/or pharmacodynamic (PD) variability at the population level. PPK quantifies the typical PK and/or PD population parameter values (central tendency measures) and the magnitude of variability among individuals (measures of dispersion). The data used in PPK are collected from either well-controlled clinical trials or routine care (observational studies). In terms of samples collected, this method can handle dense and limited (sparse) data, given a sufficient number of subjects and assuming the recorded samples were withdrawn at times that allow PK/PD parameter estimation. In a nonlinear mixed-effects modeling (NLMEM) approach to PK and PD data, two levels of random effects are generally modeled: between-subject variability (BSV) and residual unexplained variability (RUV). In the study described in chapter 3, the goal was to investigate the extent to which PK and RUV model misspecification, errors in recording dosing and sampling times, and variability in drug content uniformity contribute to the estimated magnitude of RUV and to bias in PK parameters. We found that dose and dosing-time misspecifications have negligible effects on RUV but result in higher bias in PK parameter estimates. Inaccurate documentation of sampling time results in biased RUV, increasing with the magnitude of the perturbations. Combined perturbation scenarios across the studied sources propagate the variability, accumulate in the RUV magnitude, and result in biased PK parameter estimates. This work provides insight into the potential contributions of the many factors that comprise RUV and bias in PK parameters. In chapter 4 we describe a study designed to evaluate the impact of deviations in recorded time in NLMEM settings. An assumption that clinical data are recorded without any error is optimistically made. While some study personnel will record the actual times when there is a deviation, others record the nominal time. Therefore, we investigate including an additional random effect on the independent variable, time, quantitate the bias in estimated parameters, and determine the sensitivity of parameter estimation bias to the magnitude of deviation between actual and recorded times. We report that adding a random quantity to the recorded time reduces the bias and imprecision in PK estimates compared to assuming the recorded time is absolute. In chapter 5, we take a closer look at diagnosing pharmacostatistical models. Specifically, both traditional weighted residuals (WRES) and conditional weighted residuals (CWRES) are common metrics used to graphically evaluate model acceptability in population analyses. Limited by the lower limit of quantification (LLOQ) of analytical techniques, it is not uncommon to have concentrations reported as below the LLOQ (BLQ) in PK studies. Although various approaches have been proposed to accommodate BLQ data, the M3 method currently appears to be the most common. However, NONMEM excludes the calculation of all WRES/CWRES for each subject with BLQ data, due to a concern that the residuals for that subject might be biased. Our aim was to conduct a simulation study investigating the extent to which weighted residual calculations in subjects having some BLQ data might be biased when using the M3 method. We conclude that bias in CWRES and WRES can be detected but is small and unlikely to impact decisions based on weighted residual-based diagnostic plots when the M3 method with MDVRES is used to accommodate BLQ observations in the scenarios we studied. Another important pharmacostatistical component is the structural PK model, and in chapter 6, we evaluate absorption models. Absorption processes are complex, but there are rarely sufficient data to capture the parameters of a mechanistic model. Typically, a single absorption model (e.g., first-order, mixed-order, lag, or distributive delay model) is assumed to apply to all individuals, with the expectation that random effects will accommodate individual differences. However, distinct absorption profiles may coexist in a given dataset. Thus, we propose that individualized absorption models should be considered when multiple absorption profiles are evident in a population analysis. Machine learning is gaining wider attention in clinical pharmacology and pharmacometrics as computational capacity increases. Machine learning methods use statistical algorithms capable of automated learning from existing data to uncover patterns. In chapter 7, we therefore evaluate an exercise in training a deep neural network to automatically prespecify absorption models. Finally, the knowledge provided in this thesis brings us closer to using pharmacometrics methodology to individualize patient care by understanding the sources of variability and embracing the model individualization concept.
Item: Exploring the form and functions of chimpanzee pant-hoots from basic evolutionary principles (2022-06). Desai, Nisarg.

Researchers have studied chimpanzee vocal communication extensively, focusing on evidence of parallels with human language. This approach has been effective in encouraging vocal communication research and providing some insights about the evolution of language. However, it has obscured our understanding of non-human animal communication by motivating researchers to adopt a problematic conceptual framework that uses complex linguistic phenomena as models for simpler primate vocal communication mechanisms. An approach focusing on basic evolutionary principles involves studying the intimate connection between form and function to obtain insights about the biological and evolutionary origins and mechanisms of traits. Such an approach, when employed to study chimpanzee vocalizations, may be more fruitful in revealing the fundamental factors that shape them. This dissertation extends our knowledge of the forms and functions of chimpanzee vocal communication. I first explored different acoustical and statistical analysis methods for describing the form of vocalizations. Next, I studied the connection between the form of a chimpanzee vocalization, the pant-hoot, and its possible functions. Using audio recordings and behavioral data from two chimpanzee communities in Gombe National Park, Tanzania, and one chimpanzee community in Kibale National Park, Uganda, I tested whether the variation in chimpanzee calls is explained primarily by (i) community membership, or (ii) individual traits such as age, rank, and health, and (iii) whether any of these acoustic cues predicted male mating success. Individual traits explained the acoustic variation in pant-hoots better than community membership. Acoustic variation also reflected male mating success. These findings suggest that sexual selection is a key evolutionary force shaping chimpanzee vocalizations.

Item: Fusion of Knowledge: Enhancing AI Reasoning through Language Models and Knowledge Graphs (2024-06). Mavromatis, Konstantinos.

Large Language Models (LLMs) and Knowledge Graphs (KGs) have rapidly emerged as important areas in Artificial Intelligence (AI). LLMs leverage vast amounts of unstructured text to understand and generate natural language. KGs are relational graphs that encode domain expertise and knowledge into explicit semantics. A desideratum of AI is the ability to reason and draw inferences in a rational, sensible way. The present dissertation addresses the following question: How can LLMs and KGs enhance AI reasoning? The core idea of this dissertation is to leverage LLMs as a foundation for understanding and processing natural language, while utilizing KGs to access accurate and domain-specific knowledge. We present our contributions to advancing the capabilities of AI systems along the following dimensions. (1) Faithfulness: We introduce a novel KG retrieval method (GNN-RAG) for grounding LLM reasoning in multi-hop KG facts, alleviating LLM hallucinations when answering complex questions. (2) Effectiveness: We design a powerful graph model (ReaRev) for improved reasoning over KGs in knowledge-intensive tasks, such as Question Answering. (3) Temporal Reasoning: We propose TempoQR, a method that leverages Temporal KGs and allows LMs to handle questions with temporal constraints. (4) Efficiency: We develop a graph-aware distillation framework (GRAD), in which the LM learns to utilize useful graph information while being efficient at inference. (5) Robustness: We present SemPool, a simple graph pooling method that offers robustness when critical information is missing from the KG.
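Grounding an LLM on multi-hop KG facts, as in the Mavromatis dissertation, presupposes retrieving a neighborhood of triples around the question entities. A toy sketch of plain k-hop retrieval feeding a prompt; the triples and question are illustrative, and the actual GNN-RAG method selects reasoning paths with a graph neural network rather than this brute-force expansion:

```python
# Sketch: collect k-hop facts around a question entity from a toy KG and
# hand them to an LLM prompt as grounding context. All data illustrative.
triples = [("Minneapolis", "located_in", "Minnesota"),
           ("Minnesota", "part_of", "USA"),
           ("Minneapolis", "on_river", "Mississippi")]

def neighbors(entity):
    return [t for t in triples if entity in (t[0], t[2])]

def k_hop_facts(seed, k=2):
    frontier, facts = {seed}, set()
    for _ in range(k):
        new = set()
        for e in frontier:
            for h, r, t in neighbors(e):
                facts.add((h, r, t))
                new.update((h, t))
        frontier = new - {seed}
    return facts

context = "\n".join(f"{h} {r} {t}" for h, r, t in k_hop_facts("Minneapolis"))
prompt = f"Facts:\n{context}\n\nQuestion: Which country is Minneapolis in?"
print(prompt)   # the retrieved facts ground the LLM's answer
```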
Item: Integrating Human and Machine Intelligence in Galaxy Morphology Classification Tasks (2018-01). Beck, Melanie.

The large flood of data flowing from observatories presents significant challenges to astronomy and cosmology, challenges that will only be magnified by projects currently under development. Growth in both the volume and velocity of astrophysics data is accelerating: whereas the Sloan Digital Sky Survey (SDSS) has produced 60 terabytes of data in the last decade, the upcoming Large Synoptic Survey Telescope (LSST) plans to register 30 terabytes per night starting in the year 2020. Additionally, the Euclid Mission will acquire imaging for ∼ 5 × 10^7 resolvable galaxies. The field of galaxy evolution faces a particularly challenging future, as complete understanding often cannot be reached without analysis of detailed morphological galaxy features. Historically, morphological analysis has relied on visual classification by astronomers, accessing the human brain's capacity for advanced pattern recognition. However, this accurate but inefficient method falters when confronted with many thousands (or millions) of images. In the SDSS era, efforts to automate morphological classifications of galaxies (e.g., Conselice et al., 2000; Lotz et al., 2004) are reasonably successful and can distinguish between elliptical and disk-dominated galaxies with accuracies of ∼80%. While this is statistically very useful, a key problem with these methods is that they often cannot say which 80% of their samples are accurate. Furthermore, when confronted with the more complex task of identifying key substructure within galaxies, automated classification algorithms begin to fail. The Galaxy Zoo project uses a highly innovative approach to solving the scalability problem of visual classification. Displaying images of SDSS galaxies to volunteers via a simple and engaging web interface, www.galaxyzoo.org asks people to classify images by eye. Within the first year, hundreds of thousands of members of the general public had classified each of the ∼1 million SDSS galaxies an average of 40 times. Galaxy Zoo thus solved the time-efficiency problem of visual classification and improved accuracy by producing a distribution of independent classifications for each galaxy. While crowd-sourced galaxy classifications have proven their worth, challenges remain before this method can be established as a critical and standard component of the data processing pipelines for the next generation of surveys. In particular, though innovative, crowd-sourcing techniques do not have the capacity to handle the data volume and rates expected in the next generation of surveys. Automated algorithms will be delegated to handle the majority of the classification tasks, freeing citizen scientists to contribute their efforts to subtler and more complex assignments. This thesis presents a solution through an integration of visual and automated classifications, preserving the best features of both human and machine. We demonstrate the effectiveness of such a system through a re-analysis of visual galaxy morphology classifications collected during the Galaxy Zoo 2 (GZ2) project. We reprocess the top-level question of the GZ2 decision tree with a Bayesian classification aggregation algorithm dubbed SWAP, originally developed for the Space Warps gravitational lens project. Through a simple binary classification scheme, we increase the classification rate nearly 5-fold, classifying 226,124 galaxies in 92 days of GZ2 project time while reproducing labels derived from GZ2 classification data with 95.7% accuracy. We next combine this with a Random Forest machine learning algorithm that learns on a suite of non-parametric morphology indicators widely used for automated morphologies. We develop a decision engine that delegates tasks between human and machine, and demonstrate that the combined system provides a factor of 11.4 increase in the classification rate, classifying 210,803 galaxies in just 32 days of GZ2 project time with 93.1% accuracy. As the Random Forest algorithm requires minimal computational cost, this result has important implications for galaxy morphology identification tasks in the era of Euclid and other large-scale surveys.
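SWAP-style aggregation treats each volunteer as a noisy classifier with an estimated confusion matrix and updates each galaxy's posterior vote by vote, retiring the galaxy once the posterior crosses a threshold. A minimal binary sketch of that style of update, with toy skill values rather than the GZ2 pipeline itself:

```python
# Sketch: SWAP-style Bayesian aggregation of crowd votes. Each volunteer
# has estimated skills P(vote=k | truth=k); all values here are toys.
def update(prior, vote, p_true_pos, p_true_neg):
    """Posterior P(featured) after one binary vote."""
    if vote == 1:   # volunteer said "featured"
        like_f, like_s = p_true_pos, 1 - p_true_neg
    else:           # volunteer said "smooth"
        like_f, like_s = 1 - p_true_pos, p_true_neg
    num = like_f * prior
    return num / (num + like_s * (1 - prior))

p = 0.5                                       # uninformative prior
votes = [(1, 0.9, 0.8), (1, 0.7, 0.6), (0, 0.6, 0.9)]  # (vote, TPR, TNR)
for vote, tpr, tnr in votes:
    p = update(p, vote, tpr, tnr)
    print(f"P(featured) = {p:.3f}")           # retire past a threshold
```

Skilled volunteers move the posterior sharply, so confident galaxies retire after only a few votes; this is what drives the classification-rate speedups reported above.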
Item: Learning High-Order Relations for Network-Based Phenome-Genome Association Analysis (2019-08). Petegrosso, Raphael.

An organism's phenome is the expression of characteristics from genetic inheritance and interaction with the environment. This includes simple physical appearance and traits, and even complex diseases. In humans, understanding the relationship of such features with genetic markers gives insights into the mechanisms involved in their expression, and can also help in designing targeted therapies and new drugs. In other species, such as plants, correlation of phenotypes with genetic mutations and geoclimatic variables also assists in the understanding of global evolutionary diversity and of important characteristics such as flowering time. In this thesis, we propose to use high-order machine learning methods to help analyze the phenome through associations with biological networks and ontologies. We show that, by combining biological networks with functional annotation of genes, we can extract high-order relations to improve the discovery of new candidate associations between genes and phenotypes. We also propose to detect high-order relations among multiple genomics datasets, geoclimatic features, and interactions among genes, to find a feature representation that can be utilized to successfully predict phenotypes. Experiments using the Arabidopsis thaliana species show that our approach not only contributes an accurate predictive tool, but also provides an intuitive alternative for the analysis of correlation among plant accessions, genetic markers, and geoclimatic variables. Finally, we propose a scalable approach to solve challenges inherent in the use of massive biological networks in phenome analysis. Our low-rank method can be used to process massive networks in parallel computing, enabling large-scale prior knowledge to be incorporated and improving predictive power.

Item: Leveraging Machine Learning Techniques in Power and Transportation Systems (2022-06). Zheng, Xinhu.

With the data explosion and the emerging demand for system modeling and simulation, challenges are growing exponentially in terms of model accuracy and authenticity. Fortunately, the surge of modern machine learning techniques has enabled us to grapple with seemingly unsolvable problems, to overcome computational complexity, and to mine knowledge to guide operational tasks, especially in complex cyber-physical systems such as power and transportation systems. To handle the increasing scale and complexity of power systems, we propose a novel and efficient method to solve Optimal Power Flow (OPF) problems by decomposing the entire system into multiple sub-systems based on automatic regionalization. Meanwhile, by utilizing demonstrations and deep reinforcement learning (DRL), a novel hybrid emergency voltage control method is proposed. Specifically, experts' knowledge is extracted through a behavioral cloning model and novel insights are gained via DRL. The major advances witnessed by leveraging big data in transportation networks bring opportunities to study driving style and car-following models in a data-driven manner. We propose an algorithm that classifies drivers into different driving styles and requires data from only a short observation window. Meanwhile, by exploiting the modeling expressiveness of deep neural networks (DNNs), we propose a DNN-based car-following model that achieves higher simulation accuracy. Accurate understanding of the environment is a prerequisite for ensuring safety in autonomous driving. However, the capabilities of a single vehicle can hardly meet the requirements of a complex driving environment. To cope with these issues, a multi-vehicle and multi-sensor (MVMS) cooperative perception method is introduced to construct a global view of the environment. In addition, to justify the robustness of the perception results, we evaluate the confidence of the perception output and propose a semantic information fusion scheme based on confidence levels.
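Classifying driving style from a short observation window, as in the Zheng thesis, typically starts from summary kinematics of the window. A toy sketch with synthetic data and an off-the-shelf clustering step; the features are illustrative and this is not the thesis's actual algorithm:

```python
# Sketch: group drivers into styles from short-window kinematic features.
# Data are synthetic; the feature choices are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
speed = rng.normal(15, 3, size=(100, 50))   # 100 drivers x 50-sample windows
accel = np.diff(speed, axis=1)              # accelerations within each window

features = np.column_stack([
    speed.mean(axis=1),        # average speed over the window
    accel.std(axis=1),         # harshness of accelerating/braking
    np.abs(accel).max(axis=1), # most aggressive single maneuver
])
styles = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(styles))     # number of drivers assigned to each style
```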
Item: Leveraging Machine Learning Tools To Develop Objective, Interpretable, And Accessible Assessments Of Postural Instability In Parkinson's Disease (2023-04). Herbers, Cara.

Parkinson's disease (PD) is the second most common neurodegenerative disease in the United States, affecting 1 million Americans. PD-related postural instability (PI) is one of the most disabling motor symptoms of PD, since it is associated with increased falls and loss of independence. PI shows little or no response to current PD treatments, the underlying mechanisms are poorly understood, and current clinical assessments are subjective and introduce human error. There is a need for improved diagnostic tools so that clinicians can better characterize, understand, and treat PD-related PI. Several criteria are necessary to address this clinical need: (1) the clinical rating of PI should be quantified objectively, (2) additional postural tasks should be clinically assessed and quantified, and (3) assessments of PI should occur more frequently than a biannual clinical assessment. This project sought to develop two novel approaches to address these criteria. First, deep learning markerless pose estimation was leveraged to assess reactive step length in response to shoulder pull and surface translation perturbations for individuals with and without PD. Reactive step length was altered in PD (significantly for treadmill perturbations, and with an insignificant trend for shoulder pull perturbations) and improved by dopamine replacement therapy. Next, insole plantar pressure sensor data from 111 subjects (44 PD, 67 controls) were collected and used to assess PD-related PI during typical daily balance tasks. Machine learning models were developed that accurately identified PD from young controls (area under the curve (AUC) 0.99 +/- 0.00), PD from age-matched controls (AUC 0.99 +/- 0.01), and PD non-fallers from PD fallers (AUC 0.91 +/- 0.08). Utilizing features from both static and active tasks significantly improved classification performance, and all tasks were useful for separating controls from PD; however, tasks with higher postural threat were preferred for separating PD non-fallers from PD fallers. This work has numerous clinical and translational implications. Notably, (1) simple and accessible quantitative measures can be used to identify PD and individuals with PD who fall, and (2) machine learning models can be leveraged to implement, quantify, and interpret these measures in a clinically useful way.

Item: Machine Learning Techniques for Time Series Regression in Unmonitored Environmental Systems (2023-04). Willard, Jared.

This thesis provides a computer science audience with a review of recently published machine learning techniques for modeling time series in unmonitored environmental systems with no available target data, and further includes three distinct research efforts applying these methods to real-world water resources prediction scenarios. Additionally, we identify several open questions for time series prediction in unmonitored sites, including how to incorporate dynamic inputs and site characteristics, mechanistic understanding, and explainable AI techniques into modern machine learning frameworks. This work is motivated by the vast increase in applications of machine learning models to environmental time series, in particular deep learning models built using the growing availability of high-performance computing resources. It remains difficult to predict environmental variables for which observations are concentrated in a minority of locations while most locations remain unmonitored, and although many machine learning-based approaches have been developed, there is often a lack of comparison between them. The increased attention to environmental prediction topics such as disaster response, water resources management, and climate change reveals a need to compare these approaches and to understand when and where they should be applied in unmonitored environmental prediction scenarios.
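For prediction in unmonitored sites, a standard way to evaluate models is to hold out entire sites rather than random rows, so that no test site leaks observations into training. A minimal sketch using scikit-learn's GroupKFold on synthetic data; the sites, features, and response are all fabricated for illustration:

```python
# Sketch: evaluate "unmonitored" prediction by holding out whole sites
# (GroupKFold), so every test site is unseen during training.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n = 600
site = rng.integers(0, 30, size=n)        # 30 synthetic monitoring sites
X = rng.normal(size=(n, 5))               # drivers + site characteristics
y = 2 * X[:, 0] + 0.1 * site + rng.normal(scale=0.3, size=n)

scores = cross_val_score(
    RandomForestRegressor(n_estimators=200, random_state=0),
    X, y, groups=site, cv=GroupKFold(n_splits=5), scoring="r2")
print("leave-sites-out R^2 per fold:", scores.round(2))
```

Leave-sites-out scores are usually lower than random-split scores, which is exactly the gap between monitored and unmonitored prediction that the thesis addresses.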
Item: Microbial biosynthesis of β-lactone natural products: from mechanisms to machine learning (2020-06). Robinson, Serina.

Natural products with β-lactone (2-oxetanone) rings often have potent antibiotic, antifungal, and antitumor properties. These reactive pharmacophores are known to covalently inhibit enzymes from over 20 different families, including lipases, proteases, and fatty acid synthases. Since the discovery of the first β-lactone natural product, anisatin, in 1952, over 30 compounds with β-lactone moieties have been isolated from bacteria, fungi, plants, and insects. Now, in the post-genomic era, the field of natural product drug discovery is in the midst of a transformation from traditional ‘grind and find’ methods to targeted genome mining approaches. However, genomics-guided discovery of new β-lactone natural products has been hampered by a lack of understanding of the enzymes that catalyze β-lactone ring formation. In 2017, our lab reported the first standalone β-lactone synthetase enzyme, OleC, in a bacterial long-chain hydrocarbon biosynthesis pathway from Xanthomonas campestris. This thesis builds on that initial breakthrough through biochemical characterization of the substrate specificity, kinetics, and mechanism of X. campestris OleC. Using these biochemical data, I trained machine learning classifiers to predict the substrate specificity of β-lactone synthetases and related adenylate-forming enzymes. I developed this into a web-based predictive tool and mapped the biochemical diversity of adenylate-forming enzymes in >50,000 candidate biosynthetic gene clusters across bacterial, plant, and fungal genomes. This global genomic analysis led to my discovery and characterization of the biosynthetic gene cluster for an orphan β-lactone natural product, nocardiolactone. To more broadly investigate enzymatic production of β-lactone compounds, a library of 1,095 distinct enzyme-substrate combinations for the OleA family of enzymes, upstream in the β-lactone biosynthesis pathway, was screened. Overall, this body of work advanced progress towards the discovery of new β-lactone natural products and combinatorial biosynthesis of β-lactone compound libraries.