Statistical methods in genome sequence analysis.

Mass spectral data alignment study. The first part of this thesis addresses the need to align spectra to correct for experimental variation in mass-to-charge ratio in clinical applications of mass spectrometry (MS). Proteomics is the large-scale study of proteins. The term “proteomics” was coined in 1997 by analogy with genomics, the study of genes. Most MS-based proteomic data analysis methods take a two-step approach: identify peaks first, then perform alignment and statistical inference on the identified peaks only. However, the peak identification step relies either on prior information about the proteins of interest or on a peak detection model, both of which are subject to error. In addition, features such as peak shape and peak width are lost in simple peak detection, even though they are informative for correcting mass variation in the alignment step. Here we present a novel Bayesian approach that aligns the complete spectra. The approach is based on a parametric model in which the spectrum and the alignment function are Gaussian processes, with the alignment function constrained to be monotone. We show how to use the expectation-maximization algorithm to find the posterior mode of the set of alignment functions and the mean spectrum for a patient population. After alignment, we test for differences at the peaks identified from the absolute difference between the mean spectra of two patient populations, while controlling for error attributable to multiple comparisons.

Motif discovery study. In the second part of this thesis we show how to reformulate the usual model-based approach to motif detection as a conditional log-linear model, and how this reformulation allows one to use the lasso to build complex dependency structures into the motif probability model without overparameterizing it.
We illustrate the performance of the approach with a set of simulations and show that it can dramatically outperform existing methods when there is dependence in the motif and is comparable when there is none. By not marginalizing out the parameters that govern the probability distribution of the motif (as is usually done), we can characterize the motif more rigorously. In the final part of the thesis we describe how to incorporate the Bayesian group lasso, the Bayesian adaptive lasso, and the Bayesian group adaptive lasso into conditional log-linear modeling for motif discovery. If an explanatory factor is represented by a group of derived input variables, the lasso tends to select individual derived variables from within the group, whereas the group lasso overcomes this difficulty and performs variable selection at the group level. In addition, lasso shrinkage produces biased estimates of large coefficients, whereas the adaptive lasso overcomes this difficulty and maintains the oracle property. Finally, the group adaptive lasso enjoys the advantages of both the group lasso and the adaptive lasso.
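
The contrast between the lasso, group lasso, and adaptive lasso penalties can be illustrated with their standard shrinkage (proximal) operators. The sketch below is not the dissertation's Bayesian formulation or its conditional log-linear model; it is a minimal frequentist illustration. The coefficient values, the grouping, the penalty level `lam`, and the use of the input coefficients as the adaptive lasso's initial estimate are all assumptions made for the example.

```python
import math

def soft_threshold(beta, lam):
    """Lasso step: shrink every coefficient toward zero independently."""
    return [math.copysign(max(abs(b) - lam, 0.0), b) for b in beta]

def group_soft_threshold(beta, groups, lam):
    """Group lasso step: shrink each group's Euclidean norm, so a group
    is kept or dropped as a whole."""
    out = list(beta)
    for idx in groups:
        norm = math.sqrt(sum(beta[i] ** 2 for i in idx))
        scale = max(1.0 - lam / norm, 0.0) if norm > 0 else 0.0
        for i in idx:
            out[i] = scale * beta[i]
    return out

def adaptive_soft_threshold(beta, lam, gamma=1.0):
    """Adaptive lasso step: per-coefficient threshold lam / |b_init|**gamma.
    Assumption: beta itself serves as the initial estimate b_init."""
    out = []
    for b in beta:
        if b == 0.0:
            out.append(0.0)
            continue
        threshold = lam / abs(b) ** gamma
        out.append(math.copysign(max(abs(b) - threshold, 0.0), b))
    return out

# Hypothetical coefficients: the first two belong to one explanatory
# factor, the last two to another.
beta = [3.0, 0.5, -0.2, 0.1]
groups = [[0, 1], [2, 3]]
lam = 1.0

lasso = soft_threshold(beta, lam)
grouped = group_soft_threshold(beta, groups, lam)
adaptive = adaptive_soft_threshold(beta, lam)

print("lasso:         ", lasso)
print("group lasso:   ", grouped)
print("adaptive lasso:", adaptive)
```

With these inputs the plain lasso zeroes the second coefficient while keeping the first, splitting the first group. The group lasso keeps both members of the first group and drops the second group entirely. The adaptive lasso shrinks the large coefficient by less than the plain lasso does, which is the bias reduction referred to above.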

Keywords

Biostatistics

Description

University of Minnesota Ph.D. dissertation. October 2011. Major: Biostatistics. Advisor: Cavan S. Reilly, Ph.D. 1 computer file (PDF); x, 165 pages, appendix A.

Collections

Dissertations

Suggested citation

Kong, Xiaoxiao. (2011). Statistical methods in genome sequence analysis. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/117828.

