Statistical methods in genome sequence analysis.

Kong, Xiaoxiao
2011-11-15
2011-11-15
2011-10
https://hdl.handle.net/11299/117828

University of Minnesota Ph.D. dissertation. October 2011. Major: Biostatistics. Advisor: Cavan S. Reilly, Ph.D. 1 computer file (PDF); x, 165 pages, appendix A.

Mass spectral data alignment study. The first part of this thesis addresses the need to align spectra to correct for experimental variation in mass-to-charge values in clinical applications of mass spectrometry (MS). Proteomics is the large-scale study of proteins. The term “proteomics” was coined in 1997 by analogy with genomics, the study of genes. Most MS-based proteomic data analysis methods take a two-step approach: peaks are identified first, and alignment and statistical inference are then carried out on these identified peaks only. However, the peak identification step relies on prior information about the proteins of interest or on a peak detection model, both of which are subject to error. In addition, features such as peak shape and peak width are lost in simple peak detection, even though they are informative for correcting mass variation in the alignment step. Here we present a novel Bayesian approach to aligning the complete spectra. The approach is based on a parametric model in which the spectrum and the alignment function are Gaussian processes, with the alignment function constrained to be monotone. We show how to use the expectation-maximization algorithm to find the posterior mode of the set of alignment functions and the mean spectrum for a patient population. After alignment, we test for differences at the level of the peaks identified from the absolute difference between the mean spectra of two patient populations, while controlling for error attributable to multiple comparisons.

Motif discovery study.
In the second part of this thesis we show how to reformulate the usual model-based approach to motif detection as a conditional log-linear model, and how this reformulation allows one to use the lasso to build complex dependency structures into the motif probability model without overparameterizing it. We illustrate the performance of the approach with a set of simulations and show that it can dramatically outperform existing methods when there is dependence in the motif and is comparable when there is no dependence. By not marginalizing out the parameters that govern the probability distribution of the motif (as is usually done), we can characterize the motif in a more rigorous fashion.

In the final part of the thesis we describe how to incorporate the Bayesian group lasso, the Bayesian adaptive lasso, and the Bayesian group adaptive lasso into conditional log-linear modeling for motif discovery. If an explanatory factor is represented by a group of derived input variables, the lasso tends to select individual derived input variables from within the group, whereas the group lasso overcomes this difficulty and performs variable selection at the group level. In addition, lasso shrinkage produces biased estimates for large coefficients, whereas the adaptive lasso overcomes this difficulty and maintains the oracle property. Finally, the group adaptive lasso combines the advantages of the group lasso and the adaptive lasso.

en-US
Biostatistics
Thesis or Dissertation
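
For orientation, the penalties contrasted in the final part of the abstract can be written out in their standard (non-Bayesian) textbook forms. This is a sketch in generic notation, not the thesis's own formulation: the coefficient vector $\beta$, the groups $g$ with sizes $p_g$, the tuning parameter $\lambda$, the initial estimate $\tilde\beta$, and the exponent $\gamma$ are all assumed symbols. The Bayesian versions named in the abstract are usually obtained by placing priors on $\beta$ whose negative log density has the corresponding form.

```latex
% Standard penalty forms (assumed notation, not taken from the thesis).
\begin{align*}
  \text{lasso:}
    &\quad \lambda \sum_{j} |\beta_j| \\
  \text{group lasso:}
    &\quad \lambda \sum_{g} \sqrt{p_g}\, \lVert \beta_g \rVert_2 \\
  \text{adaptive lasso:}
    &\quad \lambda \sum_{j} w_j\, |\beta_j| ,
    \qquad w_j = |\tilde\beta_j|^{-\gamma} \\
  \text{group adaptive lasso:}
    &\quad \lambda \sum_{g} w_g\, \lVert \beta_g \rVert_2 ,
    \qquad w_g = \lVert \tilde\beta_g \rVert_2^{-\gamma}
\end{align*}
```

The group norm $\lVert \beta_g \rVert_2$ sets a whole group to zero at once, which gives selection at the group level, while the data-dependent weights $w_j$ or $w_g$ penalize large coefficients less heavily, which is what reduces the bias of the plain lasso.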