Mass spectral data alignment study.
The first part of this thesis deals with the need to align spectra to correct for
mass-to-charge experimental variation in clinical applications of mass spectrometry (MS).
Proteomics is the large-scale study of proteins. The term “proteomics” was first coined
in 1997 to make an analogy with genomics, the study of genes. Most MS-based proteomic
data analysis methods take a two-step approach: they identify peaks first and then perform
alignment and statistical inference on the identified peaks only. However, the
peak identification step relies on prior information on the proteins of interest or a peak
detection model, both of which are subject to error. Also, numerous additional features
such as peak shape and peak width are lost in simple peak detection, even though they are
informative for correcting mass variation in the alignment step. Here we present a novel
Bayesian approach to aligning the complete spectra. The approach is based on a parametric
model that treats the spectrum and the alignment function as Gaussian processes, with
the alignment function constrained to be monotone. We show how to use the
expectation-maximization algorithm to find the posterior mode of the set of alignment functions and the mean
spectrum for a patient population. After alignment, we test for differences between two
patient populations at the level of the peaks identified from the absolute difference of
their mean spectra, while controlling for error attributable to multiple comparisons.
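The monotonicity constraint on the alignment function can be illustrated with a minimal sketch. It is not the Gaussian process model of this thesis: the function names, the exponentiated-increment construction of the warp, and the toy m/z grid are all assumptions made for illustration only. The sketch builds a strictly increasing warp of the m/z axis that fixes both endpoints and resamples a spectrum at the warped positions.

```python
import math

def monotone_warp(grid, raw_increments):
    """Build a strictly increasing warp of `grid` that fixes both endpoints.

    `raw_increments` holds len(grid) - 1 unconstrained real values.
    Exponentiating them gives positive step sizes, so their cumulative
    sum is strictly increasing (illustrative construction, an assumption).
    """
    steps = [math.exp(r) for r in raw_increments]
    total = sum(steps)
    span = grid[-1] - grid[0]
    warped = [grid[0]]
    for s in steps:
        warped.append(warped[-1] + span * s / total)
    return warped

def interpolate(x, xs, ys):
    """Piecewise-linear interpolation of (xs, ys) at x, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])

def align(grid, intensities, raw_increments):
    """Resample a spectrum by reading its intensity at the warped m/z values."""
    warped = monotone_warp(grid, raw_increments)
    return [interpolate(w, grid, intensities) for w in warped]

# Toy example on an evenly spaced grid (hypothetical values).
grid = [1000.0 + i for i in range(6)]
spectrum = [0.0, 1.0, 5.0, 2.0, 0.0, 0.0]
aligned = align(grid, spectrum, [0.3, -0.2, 0.0, 0.5, -0.4])
```

With all raw increments equal on an evenly spaced grid, the warp is the identity and the spectrum is returned unchanged, which is a convenient sanity check.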
Motif discovery study.
In the second part of this thesis we show how to reformulate the usual model-based
approach to motif detection as a conditional log-linear model and how this reformulation
of the problem allows one to use the lasso to build complex dependency structures into
the motif probability model in a fashion that is not overparameterized. We illustrate the
performance of the approach with a set of simulations and show that it can dramatically
outperform existing methods when there is dependence in the motif and performs comparably
when there is no dependence. By not marginalizing out the parameters that
govern the probability distribution of the motif (as is usually done), we can characterize
the motif in a more rigorous fashion.
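As a small illustration of why a log-linear formulation is natural here, the sketch below checks that the log-probability of a site under a position-independent motif model (a position weight matrix) equals a linear score over position-by-base indicator features. Dependence between positions can then be expressed by adding interaction indicators, whose coefficients a lasso penalty may shrink to zero. The feature coding, function names, and the toy matrix are assumptions for illustration and need not match the coding used in this thesis.

```python
import math

ALPHABET = "ACGT"

def one_hot_features(site):
    """Main-effect indicators: one feature per (position, base) pair."""
    return [1.0 if base == b else 0.0 for base in site for b in ALPHABET]

def pairwise_features(site, j, k):
    """Interaction indicators for the joint bases at positions j and k."""
    return [1.0 if (site[j], site[k]) == (a, b) else 0.0
            for a in ALPHABET for b in ALPHABET]

def log_linear_score(features, coefficients):
    """Linear predictor: inner product of features and coefficients."""
    return sum(f * c for f, c in zip(features, coefficients))

def pwm_log_prob(site, pwm):
    """Log-probability of a site when positions are independent."""
    return sum(math.log(pwm[j][base]) for j, base in enumerate(site))

# Hypothetical width-3 position weight matrix (each column sums to 1).
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
# Main-effect coefficients are the log PWM entries, in feature order.
coefficients = [math.log(column[b]) for column in pwm for b in ALPHABET]

site = "ACG"
score = log_linear_score(one_hot_features(site), coefficients)
interaction = pairwise_features(site, 0, 1)  # exactly one indicator is active
```

Here `score` agrees with `pwm_log_prob(site, pwm)`, so the independence model is the special case in which every interaction coefficient is zero.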
In the final part of the thesis we describe how to incorporate the Bayesian group lasso,
the Bayesian adaptive lasso, and the Bayesian group adaptive lasso into conditional log-linear
modeling for motif discovery. If an explanatory factor is represented by a group of
derived input variables, the lasso tends to select individual derived input variables from the grouped variables, whereas the group lasso overcomes this difficulty by performing
variable selection at the group level. Also, lasso shrinkage produces biased estimates
for large coefficients, whereas the adaptive lasso overcomes this difficulty and
maintains the oracle property. Finally, the group adaptive lasso combines the advantages
of the group lasso and the adaptive lasso.
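The differences among the four penalties can be made concrete with a minimal sketch of their standard non-Bayesian forms. The Bayesian versions used in this thesis place priors on the coefficients instead, and those priors are not reproduced here. The function names, the square-root group-size weighting, and the example weights are assumptions for illustration.

```python
import math

def group_norm(beta, group):
    """Euclidean norm of the coefficients indexed by `group`."""
    return math.sqrt(sum(beta[i] ** 2 for i in group))

def lasso_penalty(beta, lam):
    """Sum of absolute values: shrinks every coefficient at the same rate."""
    return lam * sum(abs(b) for b in beta)

def group_lasso_penalty(beta, groups, lam):
    """Sum of group norms: a whole group enters or leaves the model together.

    The sqrt(group size) factor is a common convention, assumed here.
    """
    return lam * sum(math.sqrt(len(g)) * group_norm(beta, g) for g in groups)

def adaptive_lasso_penalty(beta, weights, lam):
    """Weighted absolute values: small weights on large coefficients
    reduce the shrinkage bias of the plain lasso."""
    return lam * sum(w * abs(b) for w, b in zip(weights, beta))

def group_adaptive_lasso_penalty(beta, groups, group_weights, lam):
    """Weighted group norms: group-level selection with adaptive weights."""
    return lam * sum(w * group_norm(beta, g)
                     for w, g in zip(group_weights, groups))

# Hypothetical coefficients: one two-variable group and one singleton group.
beta = [3.0, 4.0, 0.0]
groups = [[0, 1], [2]]
p_lasso = lasso_penalty(beta, 1.0)
p_group = group_lasso_penalty(beta, groups, 1.0)
p_adaptive = adaptive_lasso_penalty(beta, [1 / 3, 1 / 4, 1.0], 1.0)
p_group_adaptive = group_adaptive_lasso_penalty(beta, groups, [0.2, 1.0], 1.0)
```

In the adaptive variants the weights are typically taken inversely proportional to an initial estimate of the coefficient or group norm, so that large effects are penalized less.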