Browsing by Subject "Biostatistics"
Now showing 1 - 14 of 14
Item Application of Systems Biology Analysis to Hepatic Injury Following Hemorrhagic Shock (2014-05) Determan, Charles Edward Jr
Introduction: This dissertation is focused on the metabolomic and transcriptomic changes that occur as a result of carbohydrate prefeeding during hemorrhagic shock and trauma within the liver of a porcine model. The risk of trauma and hemorrhagic shock continues to be an important issue in both military and civilian sectors. As such, we explored the impact of a prior fed state upon the overall response to hemorrhagic shock and resuscitation. The primary hypotheses were that changes in metabolism at the metabolomic and transcriptomic levels would be dependent upon the fed state. In addition, this thesis explores a more comprehensive analysis of metabolomics datasets to standardize analysis and improve overall consistency.
Materials and Methods: Algorithm comparison was accomplished by applying six commonly used methods to three synthetic datasets of different sample sizes and three openly accessible published datasets. This comparison also incorporated metrics to measure the consistency of identified features (i.e., stability) to provide further confidence in the results. Metabolomics analysis was accomplished with nuclear magnetic resonance (NMR) spectroscopy and Chenomx software to profile and quantify metabolites in liver extracts. The metabolome was subsequently analyzed with partial least squares discriminant analysis (PLS-DA). Transcriptomics analysis was conducted using next-generation sequencing (NGS) technology to perform RNA sequencing (RNA-seq) on mRNA extracts from liver biopsies. The RNA-seq data were processed with standard techniques to generate a count matrix and subsequently analyzed with the Bioconductor package edgeR.
Results: The comparison of algorithms showed that the best-performing algorithm depends on the structure of the dataset (e.g., number of features, number of groups, and sample size).
Analysis of the liver metabolome revealed changes in carbon energy sources, amino acid metabolism, oxidative stress, and membrane maintenance. Transcriptomic analysis revealed changes in carbohydrate metabolism, cytokine-mediated inflammation, cholesterol synthesis, and apoptosis. In addition, there is evidence of increased cytoskeleton reorganization, which may correspond to a shrunken, catabolic state that provides an anti-inflammatory condition to mitigate cellular damage.
Conclusion: The response to hemorrhagic shock and resuscitation is altered with respect to a fasted or carbohydrate-prefed state. Metabolomic and transcriptomic analyses suggest altered metabolic pathways as a result of fed state. Altered carbohydrate metabolism was readily identified, thereby confirming that both methods were successful. Additionally, indications of membrane maintenance that follow cytoskeletal remodeling and cellular shrinkage are potentially reflected by 3-hydroxyisovalerate and sn-glycero-3-phosphocholine. These results provide further evidence for pre-conditioning (e.g., altered diet) and hypertonic resuscitation methods to possibly improve patient outcomes. Further research is required on alternative prefeeding substrates (e.g., protein or lipid) as well as on improving the integration of different systems-level datasets to understand more thoroughly the systemic effects of hemorrhagic shock and resuscitation.

Item Approaches to handling time-varying covariates in survival models. (2011-05) Salkowski, Nicholas J.
Time-varying covariates present special problems in survival analyses. Their measurements are often missing, and their missing status may be related to the survival outcome of interest. This dissertation discusses three approaches to handling time-varying covariates in survival models. First, predictions of event probabilities from a joint model for longitudinal and event time data are compared to predictions from simpler models.
Second, a Bayesian joint modeling approach is used to resolve difficulties relating to inference when measurements of a potentially mediating process are partially missing. Third, many time-varying covariates can be converted into alternative time scales. This dissertation presents an approach to handling vector-valued time scales in semiparametric proportional hazards regression.

Item Bayesian adaptive designs in phase I/II clinical trials (2012-09) Zhong, Wei
Recently, many Bayesian methods have been developed for dose-finding when simultaneously modeling both toxicity and efficacy outcomes in a blended phase I/II fashion. A further challenge arises when the true efficacy data cannot all be obtained quickly after treatment, so that surrogate markers are used instead (e.g., in cancer trials). In this thesis, we first propose a framework to jointly model the probabilities of toxicity, efficacy, and surrogate efficacy given a particular dose. The resulting trivariate algorithm utilizes all the available data at any given time point and can flexibly stop the trial early for either toxicity or efficacy. Our simulation studies demonstrate that our proposed method can successfully improve dose-targeting efficiency and guard against excess toxicity over a variety of true model settings and degrees of surrogacy. Second, we offer a brief catalog of more flexible semiparametric and nonparametric monotone link functions to model the marginal probability of efficacy based on our proposed trivariate binary model. We show via simulation that our flexible link methods can outperform standard parametric CRM approaches in terms of both the probability of correct dose selection and the proportion of patients treated at that dose. Finally, frequentist sample size determination for binary outcome data usually requires initial guesses of the event probabilities, which may lead to a poor estimate of the necessary sample size.
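To make the dose-finding logic above concrete, here is a heavily simplified sketch of a Bayesian dose-selection rule with independent Beta(1, 1) priors per dose. It is a toy stand-in, not the trivariate model proposed in the thesis; the function name, thresholds, and data are all illustrative:

```python
import numpy as np
from scipy import stats

def select_dose(tox, eff, n, tox_cap=0.30, conf=0.70):
    """Return the index of the dose with the highest posterior mean
    efficacy among doses whose posterior probability of an acceptable
    toxicity rate (< tox_cap) exceeds `conf`; None means stop for toxicity.
    Independent Beta(1, 1) priors per dose (a toy rule, not the thesis model)."""
    tox, eff, n = (np.asarray(a, dtype=float) for a in (tox, eff, n))
    # P(true toxicity rate < tox_cap | data) under Beta(1+tox, 1+n-tox)
    p_safe = stats.beta.cdf(tox_cap, 1 + tox, 1 + n - tox)
    mean_eff = (1 + eff) / (2 + n)  # posterior mean efficacy per dose
    idx = np.where(p_safe > conf)[0]
    if idx.size == 0:
        return None
    return int(idx[np.argmax(mean_eff[idx])])

# Three doses, 10 patients each: the highest dose is too toxic,
# so the middle dose wins on efficacy among the admissible doses.
chosen = select_dose(tox=[0, 1, 6], eff=[2, 5, 8], n=[10, 10, 10])
```

The "stop early for toxicity" behavior falls out naturally: if no dose clears the safety bar, the rule returns None.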
We propose a new two-stage Bayesian design with sample size re-estimation at the interim stage. Our design inherits the properties of good interpretation and easy implementation, generalizing an earlier method to a two-sample setting and using a fully Bayesian predictive approach to reduce an overly large initial sample size when necessary. Moreover, our design can be extended to allow patient-level covariates via logistic regression, now adjusting the sample size within each subgroup based on interim analyses. We illustrate the benefits of this approach with a design in non-Hodgkin lymphoma with a simple binary covariate (patient gender), offering an initial step toward within-trial personalized medicine.

Item Bayesian hierarchical joint modeling for longitudinal and survival data. (2011-08) Hatfield, Laura A.
In studying the evolution of a disease and the effects of treatment on it, investigators often collect repeated measures of disease severity (longitudinal data) and measure the time to occurrence of a clinical event (survival data). The development of joint models for such longitudinal and survival data often uses individual-specific latent processes that evolve over time and contribute to both the longitudinal and survival outcomes. Such models allow substantial flexibility to incorporate association across repeated measurements, among multiple longitudinal outcomes, and between longitudinal and survival outcomes. The joint modeling framework has been extended to handle many complexities of real data, but less attention has been paid to the properties of such models. We are interested in the “payoff” of joint modeling, that is, whether using two sources of data simultaneously offers better inference on individual- and population-level characteristics than using them separately. We consider the problem of attributing informational content to the data inputs of joint models by developing analytical and numerical approaches and demonstrating their use.
As a motivating application, we consider a clinical trial for treatment of mesothelioma, a rapidly fatal form of lung cancer. The trial protocol included patient-reported outcome (PRO) collection throughout the treatment phase and followed patients until progression or death to determine progression-free survival times. We develop models that extend the joint modeling framework to accommodate several features of the longitudinal data, including bounded support, excessive zeros, and multiple PROs measured simultaneously. Our approaches produce clinically relevant treatment effect estimates on several aspects of disease simultaneously and yield insights into individual-level variation in disease processes.

Item Bayesian hierarchical modeling for adaptive incorporation of historical information in clinical trials. (2010-08) Hobbs, Brian Paul
Bayesian clinical trial designs offer the possibility of a substantially reduced sample size, increased statistical power, and reductions in cost and ethical hazard. However, when prior and current information conflict, Bayesian methods can lead to higher than expected Type I error, as well as the possibility of a costlier and lengthier trial. We develop several models that allow the commensurability of the information in the historical and current data to determine how much historical information is used. First, we propose methods for univariate Gaussian data and provide an example analysis of data from two successive colon cancer trials that illustrates a linear models extension of our adaptive borrowing approach. Next, we extend the general method to linear and linear mixed models, as well as generalized linear and generalized linear mixed models. We also provide two more sample analyses using the colon cancer data.
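The adaptive borrowing idea just described (letting the commensurability of historical and current data determine how much history is used) can be sketched in a toy conjugate-normal form. This is an illustrative caricature with made-up names, not the models developed in the thesis:

```python
import numpy as np

def borrow_estimate(y_cur, y_hist, tau):
    """Posterior-mean-style estimate of the current-trial mean that
    shrinks toward the historical mean. `tau` plays the role of a
    commensurability parameter: large tau -> the two sources agree,
    so borrow heavily; small tau -> discount the historical data.
    (Illustrative conjugate-normal sketch, not the dissertation's models.)"""
    n_c, n_h = len(y_cur), len(y_hist)
    n_eff = n_h * tau / (1.0 + tau)  # effective historical sample size
    w = n_eff / (n_c + n_eff)
    return (1 - w) * np.mean(y_cur) + w * np.mean(y_hist)

# Conflicting data: current mean 0, historical mean 1.
y_cur, y_hist = np.zeros(10), np.ones(10)
est_strong = borrow_estimate(y_cur, y_hist, tau=100.0)  # borrows heavily
est_weak = borrow_estimate(y_cur, y_hist, tau=0.01)     # nearly ignores history
```

The "effective historical sample size" here is the quantity that the adaptive method controls: it collapses to zero when the two data sources disagree.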
Finally, we consider the effective historical sample size of our adaptive method for the case when historical data are available only for the concurrent control arm, and propose "optimal" use of new patients in the current trial using an adaptive randomization scheme that is balanced with respect to the amount of incorporated historical information. The approach is then demonstrated using data from a trial comparing antiretroviral strategies in HIV-1-infected persons. Throughout the thesis we present simulation studies that compare frequentist operating characteristics and highlight the advantages of our adaptive borrowing methods.

Item Incorporating biological knowledge of genes into microarray data analysis. (2009-04) Tai, Feng
Microarray data analysis has become one of the most active research areas in bioinformatics in the past twenty years. An important application of microarray technology is to reveal relationships between gene expression profiles and various clinical phenotypes. A major characteristic of microarray data analysis is the so-called "large p, small n" problem, which makes parameter estimation difficult. Most of the traditional statistical methods developed in this area aim to overcome this difficulty. The most popular technique is to utilize an L1-norm penalty to introduce sparsity into the model. However, most of these traditional statistical methods for microarray data analysis treat all genes equally, as ordinary covariates. Recent developments in gene functional studies have revealed complicated relationships among genes from a biological perspective. Genes can be categorized into biological functional groups or pathways. Such biological knowledge of genes, along with microarray gene expression profiles, provides information on relationships not only between genes and clinical outcomes but also among the genes themselves. Utilizing such information could potentially improve predictive power and gene selection.
The importance of incorporating biological knowledge into analysis has been increasingly recognized in recent years, and several new methods have been developed. In our study, we focus on incorporating biological information, such as the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, into microarray data analysis for the purpose of prediction. Our first method implements this idea by specifying different L1 penalty terms for different gene functional groups. Our second method models a covariance matrix for the genes by assuming stronger within-group correlations and weaker between-group correlations. The third method models spatial correlations among the genes over a gene network in a Bayesian framework.

Item Network-based mixture models for genomic data. (2009-06) Wei, Peng
A common task in genomic studies is to identify genes satisfying certain conditions, such as differentially expressed genes between normal and tumor tissues or regulatory target genes of a transcription factor (TF). Standard approaches treat all the genes identically and independently a priori and ignore the fact that genes work coordinately in biological processes as dictated by gene networks, leading to inefficient analysis and reduced power. We propose incorporating gene network information as prior biological knowledge into statistical modeling of genomic data to maximize the power for biological discoveries. We propose a spatially correlated mixture model based on latent Gaussian Markov random fields (GMRFs) to smooth gene-specific prior probabilities in a mixture model over a network, assuming that neighboring genes in a network are functionally more similar to each other. In addition, we propose a Bayesian implementation of a discrete Markov random field (DMRF)-based mixture model for incorporating gene network information, and compare its performance with that based on Gaussian Markov random fields.
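The network-smoothing idea above, pulling each gene's prior toward those of its network neighbors, can be sketched as a single GMRF-style smoothing step. This is an illustrative sketch with invented names, not the actual mixture model of the thesis:

```python
import numpy as np

def smooth_prior_logits(z, adj, lam=1.0):
    """One GMRF-flavored smoothing step for gene-specific prior logits z
    over a network with 0/1 adjacency matrix adj: each gene is pulled
    toward the average of its neighbors, with strength lam.
    (Illustrative sketch, not the thesis's mixture model.)"""
    z = np.asarray(z, dtype=float)
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    neighbor_mean = adj @ z / np.maximum(deg, 1.0)
    return (z + lam * deg * neighbor_mean) / (1.0 + lam * deg)

# A 5-gene chain network; only the middle gene has strong prior evidence.
chain = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
smoothed = smooth_prior_logits([0, 0, 10, 0, 0], chain)
```

The effect is exactly the borrowing the abstract describes: the middle gene's evidence is spread to its network neighbors, raising their prior probability of being "interesting."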
We also extend the network-based mixture models to ones that are able to integrate multiple gene networks and diverse types of genomic data, such as protein-DNA binding, gene expression, and DNA sequence data, to accurately identify regulatory target genes of a TF. Applications to high-throughput microarray data, along with simulations, demonstrate the utility of the new methods and the statistical efficiency gains over other methods.

Item New penalized regression approaches to analysis of genetic and genomic data. (2012-08) Kim, Sunkyung

Item On Bayesian hierarchical modelling for large spatial datasets. (2012-03) Guhaniyogi, Rajarshi
We propose a class of fully process-based low-rank spatially varying cross-covariance matrices that produce non-degenerate spatial processes and that effectively capture non-stationary covariances among the multiple outcomes. We provide theoretical and modeling insight into these constructions and elucidate certain implications of some common structural assumptions in building cross-covariance matrices. We also propose a low-rank version of cross-covariance functions using the predictive process class of models, popularly employed in spatial statistics to handle large datasets. The predictive process is obtained by projecting the parent Gaussian process onto a space spanned by a set of basis functions, and an efficient model for choosing those basis functions is proposed. Being a low-rank model, the predictive process often loses spatial information, which can lead to spurious inferences. This thesis quantifies that loss of information and suggests model-based adjustments. The proposed models are validated with carefully designed simulation studies. Finally, they are employed to analyze interesting ecological datasets.
Our framework produces substantive inferential tools, such as maps of non-stationary cross-covariances, that constitute the premise of further mechanistic modeling and that have hitherto not been easily available to environmental scientists and ecologists.

Item Spatiotemporal Gradient Modeling with Applications (2013-07) Quick, Harrison S.
Advances in Geographical Information Systems (GIS) have led to enormous recent growth in spatiotemporal databases and associated statistical modeling, with applications in various scientific disciplines, including environmental monitoring, ecological systems, forestry, hydrology, meteorology, and public health. After inferring on a spatiotemporal process for a given dataset, inferential interest may turn to estimating rates of change, or gradients, over space and time. The primary focus of this thesis is to further develop the methodology required for statistical inference on areally referenced temporal and spatiotemporal gradient processes. We begin by departing from the rather rich literature in space-time modeling by considering the setting where space is discrete but time is continuous. Our major objective here is to carry out inference on gradients of a temporal process in our dataset of monthly county-level asthma hospitalization rates in the state of California, while also accounting for spatial similarities of the temporal process across neighboring counties. In addition to using a more flexible stochastic process embedded within a dynamic Markov random field framework that permits inference on the temporal gradient process, we also develop methods for allowing region-specific variance components, leading to variable smoothing across our spatial regions. We then move to the continuous space, continuous time setting.
Here, we develop, within a flexible spatiotemporal process model setting, a framework to estimate arbitrary directional gradients over space at any given timepoint, temporal derivatives at any given spatial location, and, finally, mixed spatiotemporal gradients that reflect rapid change in spatial gradients over time and vice versa. After illustrating the use of our methodology on a dataset comprising daily PM2.5 concentrations in California, we show how the method can be implemented to analyze highly censored data (e.g., data below detectable limits) and apply these methods to data collected during the cleanup efforts of the Deepwater Horizon (BP) oil spill. Through the use of these methods, we believe researchers can gain significant insight into potentially important spatiotemporally varying risk factors that may as yet be unknown (or at least not accounted for). Furthermore, the gradient process in and of itself can provide valuable information, for instance by being adapted to alert public health officials to dramatically rising pollution levels in a particular region, potentially leading to a reduction in exposure and, ultimately, a reduction in the incidence of poor health outcomes.

Item Statistical methods for gene set based significance analysis. (2011-07) Lee, Sang Mee
Gene set enrichment analysis (GSEA) is a method to identify groups of genes that are statistically more differentially expressed than all other genes across different treatments within a microarray study. Most existing approaches have largely relied on nonparametric methods and require repeated computation on permuted and resampled data to assess the significance of a gene set. In this dissertation, we study parametric approaches to GSEA by formulating the enrichment analysis as a simple model comparison problem. The methods not only gain flexibility in statistical modeling corresponding to biological problems but also achieve computational efficiency.
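The enrichment-as-model-comparison view described above can be illustrated with a toy likelihood ratio test of a mean shift for a gene set's scores relative to the background. This uses a simple shared-variance normal model with simulated data, not the mixture models developed in the dissertation:

```python
import numpy as np
from scipy import stats

def enrichment_lrt(t_set, t_bg):
    """Likelihood ratio test of a mean shift for a gene set's scores
    relative to the background, under a shared-variance normal model.
    A toy instance of enrichment-as-model-comparison; not the
    mixture models developed in the dissertation."""
    x = np.concatenate([t_set, t_bg])

    def profile_loglik(resid):
        s2 = np.mean(resid ** 2)  # variance MLE given the fitted means
        return -0.5 * len(resid) * (np.log(2 * np.pi * s2) + 1)

    ll_null = profile_loglik(x - x.mean())  # one common mean
    ll_alt = profile_loglik(np.concatenate(
        [t_set - np.mean(t_set), t_bg - np.mean(t_bg)]))  # separate means
    lr = 2 * (ll_alt - ll_null)
    return lr, stats.chi2.sf(lr, df=1)  # one extra parameter under the alternative

rng = np.random.default_rng(1)
lr, pval = enrichment_lrt(rng.normal(2.0, 1.0, 30), rng.normal(0.0, 1.0, 200))
```

Unlike permutation-based GSEA, a single analytic test like this needs no resampling, which is the computational advantage the abstract points to.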
First, we propose a likelihood-based approach assuming a finite mixture model for a two-class comparison problem; the analysis is implemented via a likelihood ratio testing approach. In addition, we extend the parametric methods to flexible two-component mixture models for one-sided enrichment analysis, which aims to test for enrichment of up- (or down-) regulation only. We also develop chi-square mixture models that carry the ideas of two-class comparison studies over to multi-category microarray experiments. Applications to gene expression data, along with simulations, demonstrate the computational efficiency and the competitive performance of the proposed methods.

Item Statistical methods for genetics and genomics studies (2008-12) Li, Meijuan
Genomics study: The quality of data from microarray analysis is highly dependent on RNA quality. Because of the lability of RNA, the steps involved in tissue sampling, RNA purification, and RNA storage can lead to RNA degradation; assessment of RNA quality is therefore essential. Existing methods for estimating the quality of RNA on a microarray either suffer from subjectivity or perform inefficiently. To overcome these drawbacks, this dissertation proposes a linear regression method for assessing RNA quality for a hybridized GeneChip. In particular, our approach uses the probe intensities that the Affymetrix software associates with each microarray. The effectiveness of the proposed method, and its improvements over existing methods, are illustrated by applying it to 19 previously published human Affymetrix microarray data sets for which external verification of RNA quality is available.
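A minimal sketch of the probe-intensity regression idea just described: regress probe intensity on probe position within the probe set and read RNA quality off the slope. The function name and data are illustrative, and this simple least-squares version stands in for the dissertation's method:

```python
import numpy as np

def degradation_slope(probe_means):
    """Least-squares slope of mean probe intensity against probe position
    within the probe set (5' -> 3'). Degraded RNA tends to lose signal at
    the 5' end, so a larger positive slope suggests poorer RNA quality.
    A toy stand-in for the regression-on-probe-intensities idea above."""
    pos = np.arange(len(probe_means), dtype=float)
    slope, _ = np.polyfit(pos, probe_means, deg=1)
    return slope

# Intact RNA: roughly flat profile; degraded RNA: rising 5'->3' profile.
flat = degradation_slope([7.1, 7.0, 7.2, 7.1, 7.0])
rising = degradation_slope([5.0, 6.0, 7.0, 8.0, 9.0])
```

Because the slope is a fitted regression coefficient rather than a visual judgment, it gives an objective, per-chip quality score.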
Genetics study: Although population-based association mapping may be subject to bias caused by population stratification, alternative methods that are robust to population stratification, such as family-based linkage analysis, have lower mapping resolution. In this dissertation, we propose association tests for fully observed quantitative traits as well as censored data in structured populations with complex genetic relatedness among the sampled individuals. Our methods correct for continuous population stratification by first deriving population structure variables and kinship matrices from random genetic marker data and then modeling the relationship between trait values, genotypic scores at a candidate marker, and genetic background variables through a semiparametric model, where the error distribution for fully observed data, or the baseline survival function for censored data, is modeled as a mixture of Polya trees centered around a family of parametric distributions. We also propose multivariate Bayesian statistical models with a Gaussian conditional autoregressive (CAR) framework for multi-trait association mapping in structured populations, where the effects attributable to the kinship matrix are modeled via CAR and the population structure variables are included as covariates to adjust for population stratification. We compare our models to existing structured association tests in terms of model fit, false positive rate, power, precision, and accuracy using both real and simulated data sets.

Item Statistical methods for multi-class differential gene expression detection (2011-11) Cao, Xiting
One of the major goals of microarray data analysis is to identify differentially expressed genes.
In cancer studies, RNA is extracted from the tissue samples of cancer patients (case class) and healthy people (control class) to obtain gene expression data, and genes that are differentially expressed between case and control are identified as candidate biomarkers for further study. More often, we encounter situations where gene expression is compared among more than two classes instead of the traditional case/control setup, e.g., multiple disease stages or different experimental conditions. This dissertation addresses the problem of identifying differentially expressed genes in a multi-class comparison setting. To identify differentially expressed genes, it is important to select a test statistic to rank the genes; common approaches summarize each gene's expression into a univariate test statistic and find a critical value for the ranking statistic to determine which genes are differentially expressed. In this dissertation, a univariate test statistic (the moderated F-statistic) is first used as a summary statistic, and its distribution is empirically estimated using maximum likelihood. After that, a multivariate test statistic is proposed as a summary statistic for each gene, and both parametric and nonparametric empirical Bayes approaches are adopted to rank the genes. The performance of the proposed methods is illustrated by extensive simulation studies and application to public microarray datasets. The results show that the proposed methods have better detection power than commonly used approaches when controlling false discovery rates at the same level.

Item Statistical methods in genome sequence analysis. (2011-10) Kong, Xiaoxiao
Mass spectral data alignment study. The first part of this thesis deals with the need to align spectra to correct for mass-to-charge experimental variation in clinical applications of mass spectrometry (MS). Proteomics is the large-scale study of proteins.
The term “proteomics” was first coined in 1997 as an analogy with genomics, the study of genes. Most MS-based proteomic data analysis methods involve a two-step approach: identify peaks first, and then do the alignment and statistical inference on those identified peaks only. However, the peak identification step relies on prior information about the proteins of interest or on a peak detection model, both of which are subject to error. In addition, numerous features such as peak shape and peak width are lost in simple peak detection, and these are informative for correcting mass variation in the alignment step. Here we present a novel Bayesian approach to align the complete spectra. The approach is based on a parametric model which assumes the spectrum and the alignment function are Gaussian processes, with the alignment function constrained to be monotone. We show how to use the expectation-maximization algorithm to find the posterior mode of the set of alignment functions and the mean spectrum for a patient population. After alignment, we conduct tests, controlling for error attributable to multiple comparisons, on the peaks identified from the absolute difference of the mean spectra of two patient populations. Motif discovery study. In the second part of this thesis we show how to reformulate the usual model-based approach to motif detection as a conditional log-linear model, and how this reformulation allows one to use the lasso to build complex dependency structures into the motif probability model in a fashion that is not overparameterized. We illustrate the performance of the approach with a set of simulations and show that it can dramatically outperform existing methods when there is dependence in the motif and is comparable in cases where there is no dependence. By not marginalizing out the parameters that govern the probability distribution of the motif (as is usually done), we can characterize the motif in a more rigorous fashion.
In the final part of the thesis we describe how to incorporate the Bayesian group lasso, the Bayesian adaptive lasso, and the Bayesian group adaptive lasso into conditional log-linear modeling for motif discovery. If an explanatory factor is represented by a group of derived input variables, the lasso tends to select individual derived input variables from the grouped variables, while the group lasso overcomes this difficulty and still performs variable selection at the group level. Also, lasso shrinkage produces biased estimates for large coefficients, while the adaptive lasso overcomes this difficulty and maintains the oracle property. Finally, the group adaptive lasso enjoys the advantages of both the group lasso and the adaptive lasso.
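The group-level selection behavior described above can be seen in the group lasso's proximal operator, which soft-thresholds each group's coefficients as a single block. This is an illustrative sketch of the penalty's mechanics, not the Bayesian implementations in the thesis:

```python
import numpy as np

def prox_group_lasso(beta, groups, lam):
    """Block soft-thresholding, the proximal operator behind the group
    lasso penalty: each group's coefficient vector is shrunk toward zero
    as a block, so variables enter or leave the model group-wise.
    (Illustrative sketch, not the Bayesian implementations in the thesis.)"""
    out = np.array(beta, dtype=float)
    for g in groups:
        norm = np.linalg.norm(out[g])
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out[g] = out[g] * scale  # shrink the whole group by a common factor
    return out

# The small-coefficient group is removed as a whole; the plain lasso
# would instead soft-threshold each coordinate separately.
shrunk = prox_group_lasso([3.0, 4.0, 0.1, 0.2],
                          groups=[[0, 1], [2, 3]], lam=1.0)
```

Here the second group's norm falls below the threshold, so both of its coordinates are zeroed together, while the first group survives and is merely shrunk.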