\documentclass[11pt]{article}
\usepackage{indentfirst}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amscd}
\usepackage{natbib}
\usepackage{url}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\aic}{AIC}
\DeclareMathOperator{\aicc}{AICc}
\DeclareMathOperator{\bic}{BIC}
\DeclareMathOperator{\hic}{HIC}
\DeclareMathOperator{\lb}{LB}
\DeclareMathOperator{\gcsm}{gcsm}
\DeclareMathOperator{\lcsm}{lcsm}
\DeclareMathOperator{\tr}{tr}
\newcommand{\boldtheta}{{\boldsymbol{\theta}}}
\newcommand{\boldphi}{{\boldsymbol{\varphi}}}
\newcommand{\boldxi}{{\boldsymbol{\xi}}}
\newcommand{\boldmu}{{\boldsymbol{\mu}}}
\newcommand{\boldw}{{\mathbf{w}}}
\newcommand{\boldx}{{\mathbf{x}}}
\newcommand{\boldy}{{\mathbf{y}}}
\newcommand{\boldz}{{\mathbf{z}}}
\hyphenation{Wa-gen-ius}
\setlength{\textheight}{\paperheight}
\addtolength{\textheight}{- 2 in}
\setlength{\topmargin}{0.25 pt}
\setlength{\headheight}{0 pt}
\setlength{\headsep}{0 pt}
\addtolength{\textheight}{- \topmargin}
\addtolength{\textheight}{- \topmargin}
\addtolength{\textheight}{- \footskip}
\setlength{\textwidth}{\paperwidth}
\addtolength{\textwidth}{- 2 in}
\setlength{\oddsidemargin}{0.25 in}
\setlength{\evensidemargin}{0.25 in}
\addtolength{\textwidth}{- \oddsidemargin}
\addtolength{\textwidth}{- \evensidemargin}
\begin{document}
\vspace*{0.9375in}
\begin{center}
{\bfseries Model Selection in Estimation of Fitness Landscapes} \\
By \\
Charles J. Geyer and Ruth G. Shaw \\
Technical Report No.~671 \\
School of Statistics \\
University of Minnesota \\
% April 20, 2005 \\
July 10, 2008 \\
Revised \today
\end{center}
\thispagestyle{empty}
\cleardoublepage
\setcounter{page}{1}
\thispagestyle{empty}
\begin{abstract}
A solution to the problem of estimating fitness landscapes was
proposed by \citet{la}. Another solution, which avoids problematic
aspects of the Lande-Arnold methodology, was proposed by \citet*{aster2},
who also provided an illustrative example involving real data.
An earlier technical report \citep{evo2008talk} gave an example that was
simpler in some ways (the data are simulated from the aster model, so there
are none of the issues of fitting the model to real data) and much
more complicated in others (each individual has five measured components of
fitness over four time periods, 20 variables in all), and that illustrates the
full richness possible in aster analysis of fitness landscapes. The one
issue that technical report did not address is model selection. When
many phenotypic variables are measured, one often does not know which
to put in the model. \citet{la} proposed using principal components
regression as a method of dimension reduction, but this method is known
to have no theoretical basis. Much of late 20th century and 21st century
statistics is about model selection and model averaging, and we apply some
of this methodology (which does have strong theoretical basis) to estimation
of fitness landscapes using another simulated data set.
All analyses are done in R \citep{rcore} using the \texttt{aster}
contributed package described by \citet*{gws} except for analyses
in the style of \citet{la}, which use ordinary least squares regression.
Furthermore, all analyses are done using the \texttt{Sweave} function in R,
so this entire technical report and all of the analyses reported in it
are completely reproducible by anyone who has R with the \texttt{aster}
package installed and the R noweb file specifying the document.
This revision corrects major errors in the frequentist model averaging
calculations (Section~\ref{sec:fma}) in the first version of the technical
report.
\end{abstract}
<<>>=
options(keep.source = TRUE, width = 65)
ps.options(pointsize = 15)
pdf.options(pointsize = 15)
@
\section{R Package Aster}
We use the R statistical computing environment \citep{rcore} in our analysis.
It is free software and can be obtained from
\url{http://cran.r-project.org}. Precompiled binaries
are available for Windows, Macintosh, and popular Linux distributions.
We use the contributed package \verb@aster@.
If R has been installed, but this package has
not yet been installed, do
\begin{verbatim}
install.packages("aster")
\end{verbatim}
from the R command line
(or do the equivalent using the GUI menus if on Apple Macintosh
or Microsoft Windows). This may require root or administrator privileges.
Assuming the \verb@aster@ package has been installed, we load it
<<>>=
library(aster)
@
<<>>=
baz <- library(help = "aster")
baz <- baz$info[[1]]
baz <- baz[grep("Version", baz)]
baz <- sub("^Version: *", "", baz)
bazzer <- paste(R.version$major, R.version$minor, sep = ".")
@
The version of the package used to make this document
is \Sexpr{baz} (which is available on CRAN).
The version of R used to make this document is \Sexpr{bazzer}.
This entire document and all of the calculations shown were made using
the R command \texttt{Sweave} and hence are exactly reproducible by
anyone who has R and the R noweb (RNW) file from which it was created.
Both the RNW file and the PDF document produced from it are available at
\url{http://www.stat.umn.edu/geyer/aster}. For further details on
the use of Sweave and R see Chapter~1 of the technical report by
\citet{aster2tr} available at the same web site.
Not only can one exactly reproduce the results in the printable document,
one can also modify the parameters of the simulation and get different
results. Obvious modifications to try are noted on pages~\pageref{rnw:1},
\pageref{rnw:2}, \pageref{rnw:3}, and~\pageref{rnw:4} below. But, of course,
anything at all can be changed once one has the RNW file.
\section{Simulate Data from Conditional Model} \label{sec:sim-cond}
\subsection{Overview} \label{sec:overview-one}
It is hard to make up an unconditional aster model because the unconditional
parameterization is so unintuitive. Thus we proceed in two steps.
\begin{itemize}
\item First we simulate data using a conditional aster model with parameters
we understand.
\item Then we fit an unconditional aster model to these data, and simulate
data using the fitted unconditional model.
\end{itemize}
Also it is hard to make up a fitness landscape, because we model it on
the canonical parameter scale, and canonical parameters have no simple
connection to the mean value parameter scale. Thus we also proceed in two
steps here.
\begin{itemize}
\item First we simulate data having a flat fitness landscape.
\item Then we fit models having non-flat fitness landscapes, which we
adjust in strength to be moderately statistically significant.
\end{itemize}
The way these two issues interleave is as follows.
\begin{enumerate}
\item We simulate data using a conditional model having a flat
fitness landscape.
\item We simulate data using an unconditional model having a flat
fitness landscape.
\item We simulate data using unconditional models having non-flat fitness
landscapes.
\end{enumerate}
In Section~\ref{sec:sim-cond} we only do the first step.
\subsection{Graphical Model}
We make a graphical model based on consecutive time periods (call them years
for concreteness). Data are ``observed'' over 10 years.
For each individual we observe in year $i$ three random
variables $U_i$, $V_i$, and $W_i$. All random variables on one individual
are connected by the graph shown in Figure~\ref{fig:graph}.
\begin{figure}
\begin{center}
\setlength{\unitlength}{0.45 in}
\thicklines
\begin{picture}(10.3,2.5)(-0.1,-2.3)
\put(0,0){\makebox(0,0){$1_{\hphantom{0}}$}}
\put(1,0){\makebox(0,0){$u_1$}}
\put(2,0){\makebox(0,0){$u_2$}}
\put(3,0){\makebox(0,0){$u_3$}}
\put(4,0){\makebox(0,0){$u_4$}}
\put(5,0){\makebox(0,0){$u_5$}}
\put(6,0){\makebox(0,0){$u_6$}}
\put(7,0){\makebox(0,0){$u_7$}}
\put(8,0){\makebox(0,0){$u_8$}}
\put(9,0){\makebox(0,0){$u_9$}}
\put(10,0){\makebox(0,0){$u_{10}$}}
\multiput(0.35,0)(1,0){10}{\vector(1,0){0.3}}
\put(1,-1){\makebox(0,0){$v_1$}}
\put(2,-1){\makebox(0,0){$v_2$}}
\put(3,-1){\makebox(0,0){$v_3$}}
\put(4,-1){\makebox(0,0){$v_4$}}
\put(5,-1){\makebox(0,0){$v_5$}}
\put(6,-1){\makebox(0,0){$v_6$}}
\put(7,-1){\makebox(0,0){$v_7$}}
\put(8,-1){\makebox(0,0){$v_8$}}
\put(9,-1){\makebox(0,0){$v_9$}}
\put(10,-1){\makebox(0,0){$v_{10}$}}
\multiput(1,-0.25)(1,0){10}{\vector(0,-1){0.5}}
\put(1,-2){\makebox(0,0){$w_1$}}
\put(2,-2){\makebox(0,0){$w_2$}}
\put(3,-2){\makebox(0,0){$w_3$}}
\put(4,-2){\makebox(0,0){$w_4$}}
\put(5,-2){\makebox(0,0){$w_5$}}
\put(6,-2){\makebox(0,0){$w_6$}}
\put(7,-2){\makebox(0,0){$w_7$}}
\put(8,-2){\makebox(0,0){$w_8$}}
\put(9,-2){\makebox(0,0){$w_9$}}
\put(10,-2){\makebox(0,0){$w_{10}$}}
\multiput(1,-1.25)(1,0){10}{\vector(0,-1){0.5}}
\end{picture}
\end{center}
\caption{Graph for the example. $u_i$ indicate survival,
$w_i$ count the number of offspring, and $v_i$ indicate $w_i > 0$.}
\label{fig:graph}
\end{figure}
The variables $U_i$ and $V_i$ are Bernoulli (zero-or-one valued);
the variables $W_i$ are nonnegative-integer-valued.
The $V_i$ are redundant: $V_i = 1$ if and only if $W_i > 0$.
The variables $U_i$ indicate survival through the $i$-th year;
the variables $W_i$ count offspring produced in the $i$-th year.
Hence the variables $V_i$ indicate successful reproduction in the $i$-th year
(one or more offspring).
We model the $W_i$ as zero-truncated Poisson given $V_i$.
If one likes to think of it that way, one can ignore $V_i$ and say that $W_i$
is zero-inflated Poisson given $U_i$, but the aster way to think about this
requires three variables $U_i$, $V_i$, and $W_i$ because that fits the
required exponential family structure.
In these data, the variable $\sum_i W_i$ is considered fitness (total
number of offspring over the course of the study, which will include most
of the lifespan).
\subsection{Simulation}
We set the seed of the random number generator so that we
obtain the same results every time. To get different results,
obtain the RNW file, change this statement, and reprocess
using \texttt{Sweave} and \LaTeX. \label{rnw:1}
<<>>=
RNGversion("2.5.0")
set.seed(42)
@
We also set up some parameters of the simulation. These can also be changed.
<<>>=
nind <- 350
ntime <- 10
psurv <- 0.8
prepr <- 0.7
mpois <- 3
theta.surv <- log(psurv) - log(1 - psurv)
theta.repr <- log(prepr) - log(1 - prepr)
theta.pois <- log(mpois)
tau.pois <- mpois / (1 - exp(- mpois))
@
For a first pass we set
mean survival per unit time to be \verb@psurv@ = \Sexpr{psurv}
and mean reproduction indicator per unit time
to be \verb@prepr@ = \Sexpr{prepr}.
For number of offspring in a time period given any offspring in that
time period, the easiest thing to specify is the mean of the untruncated
Poisson random variable that when truncated is the distribution we want.
This untruncated mean is \verb@mpois@ = \Sexpr{mpois}. The mean of the
corresponding zero-truncated distribution
is $\tau = \Sexpr{round(tau.pois, 3)}$.
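For reference, the conditional canonical parameters just computed are the
logit of the success probabilities and the log of the untruncated Poisson
mean, and the zero-truncated mean $\tau$ follows by renormalizing the
Poisson distribution over the positive integers:
\begin{equation*}
\theta_{\text{surv}} = \log \frac{p_{\text{surv}}}{1 - p_{\text{surv}}},
\qquad
\tau = E(W \mid W > 0)
= \frac{\sum_{k = 1}^\infty k \, e^{-m} m^k / k!}{1 - e^{-m}}
= \frac{m}{1 - e^{-m}}.
\end{equation*}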
<<>>=
vars <- as.vector(outer(c("u", "v", "w"), 1:10, paste, sep = ""))
vtype <- as.factor(substr(as.character(vars), 1, 1))
fam <- rep(1, length(vars))
fam[vtype == "w"] <- 3
pred <- seq(along = vars) - 1
pred[vtype == "u"] <- seq(along = vars)[vtype == "u"] - 3
pred[1] <- 0
@
\label{pg:vars}
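The predecessor vector just constructed encodes the graph in
Figure~\ref{fig:graph}. A minimal re-implementation in Python (purely
illustrative; not part of the R pipeline) shows the intended structure:

```python
# Illustrative re-implementation of the predecessor vector (1-based, as in R).
vars_ = [t + str(i) for i in range(1, 11) for t in ("u", "v", "w")]
pred = [j for j in range(len(vars_))]   # default: previous node in the list
for j, name in enumerate(vars_):
    if name.startswith("u"):            # survival u_i is preceded by u_{i-1},
        pred[j] = j - 2                 # three positions back (1-based index)
pred[0] = 0                             # u_1 descends from the constant root
```

Each $v_i$ points to $u_i$, each $w_i$ to $v_i$, and each $u_i$ to
$u_{i-1}$, exactly the arrows of Figure~\ref{fig:graph}.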
Simulate data.
<<>>=
root <- matrix(1, nrow = nind, ncol = length(vars))
theta <- root
theta[ , vtype == "u"] <- theta.surv
theta[ , vtype == "v"] <- theta.repr
theta[ , vtype == "w"] <- theta.pois
x <- raster(theta, pred, fam, root)
dimnames(x) <- list(NULL, vars)
dat <- as.data.frame(x)
dat <- as.list(dat)
dat[["root"]] <- rep(1, nind)
@
Now we add some covariates. These will play the role of phenotypic variables
in the Lande-Arnold analysis and the competing aster analysis.
Lande-Arnold analysis requires that these be jointly multivariate normal
and centered at zero. Aster analysis does not.
For fair comparison, we make these covariates satisfy
the conditions required for Lande-Arnold analysis.
In addition, we make them have the same variance, and we make all pairs
of them have correlation $1 / 2$. The reason for the correlation is that
this makes some of the modeling issues more difficult.
Of course, all of this can be changed if one has the RNW file
for this document. \label{rnw:2}
<<>>=
npheno <- 10
zbase <- rnorm(nind)
for (i in 1:npheno) {
    labz <- paste("z", i, sep = "")
    dat[[labz]] <- zbase + rnorm(nind)
}
@
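The construction above (a shared \texttt{zbase} plus independent noise)
gives each pair of covariates correlation $1/2$: each $z_i$ has variance
$2$ and each pair has covariance $\var(\texttt{zbase}) = 1$. A quick
Python check (illustrative only; the report's computations are all in R):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                           # large n so the empirical check is sharp
zbase = rng.standard_normal(n)
z1 = zbase + rng.standard_normal(n)   # variance 2, shared component zbase
z2 = zbase + rng.standard_normal(n)
r = np.corrcoef(z1, z2)[0, 1]         # theoretical correlation: 1 / 2
```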
Reshape the data.
<<>>=
names(dat)
dat <- as.data.frame(dat)
redata <- reshape(dat, varying = list(vars), direction = "long",
    timevar = "varb", times = as.factor(vars), v.names = "resp")
@
There is one further step. We wish to make inferences about overall fitness,
taking advantage of the monotonicity property of unconditional aster models,
as explained in Section~3.2 of \citet{evo2008talk}. This requires that we
model only the $W_i$ as dependent on the phenotypic traits. To accomplish
this, we set to zero all the phenotype values except those associated
with the $W_i$, the life history variables that directly reflect fitness.
<<>>=
wind <- grep("w", as.character(redata$varb))
for (labz in grep("z", names(redata), value = TRUE)) {
    redata[[labz]][- wind] <- 0
}
@
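The same masking step, sketched in Python on a toy long-format fragment
(hypothetical values, for illustration only):

```python
# Toy long-format data: the node-label column and one phenotype column.
varb = ["u1", "v1", "w1", "u2", "v2", "w2"]
z1 = [0.3] * 6   # one individual's phenotype value, repeated at every node
# keep the phenotype only at the fitness nodes w_i; zero it elsewhere
z1 = [z if v.startswith("w") else 0.0 for z, v in zip(z1, varb)]
```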
\section{Fit Initial Data} \label{sec:fit-initial}
\subsection{Overview} \label{sec:overview-two}
We have now simulated our ``initial data'' so we have now done step~1
of the overview in Section~\ref{sec:overview-one}. However, we still
do a few checks to see that these data make sense (of course, they
must unless the aster software is buggy, but we check anyway); we do
this in Section~\ref{sec:check-cond-mod}.
In Section~\ref{sec:check-unco-mod} we start on what we called step~2
in the overview in Section~\ref{sec:overview-one}. First we fit an
unconditional aster model to the data we just checked for reasonableness
(Section~\ref{sec:check-cond-mod}).
Then we ``predict'' the mean value parameters and see whether they
make sense (according to our intuition). They sort of make sense,
but we think we can do better, so we fit a more elaborate unconditional
model adding a term \texttt{uyear} to the model formula, which lets
mortality be a linear function of year on the canonical parameter scale.
We again ``predict'' the mean value parameters and see whether they
make sense, and this time decide they are good enough
(according to our intuition). That finishes what we called step~2
in the overview in Section~\ref{sec:overview-one}, and also finishes
Section~\ref{sec:fit-initial}. We continue our overview in
Section~\ref{sec:new-sim}.
\subsection{Conditional Models} \label{sec:check-cond-mod}
Fit the aster model (the one used to generate the data).
<<>>=
redata$vtype <- as.factor(substr(as.character(redata$varb), 1, 1))
out2 <- aster(resp ~ vtype + 0, pred, fam, varb, id, root,
    data = redata, type = "conditional")
summary(out2)
@
For comparison, these should be
<<>>=
theta.surv
theta.repr
theta.pois
@
For further comparison, the likelihood ratio test with this model
as the alternative hypothesis and the simulation truth for the point
null hypothesis can be calculated as follows.
<<>>=
mout2true <- mlogl(c(theta.surv, theta.repr, theta.pois), pred, fam, out2$x,
    out2$root, out2$modmat, type = out2$type, origin = out2$origin)
dev2true <- 2 * mout2true$value - out2$deviance
dev2true
1 - pchisq(dev2true, df = 3)
@
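The quantity \texttt{dev2true} is twice the log likelihood ratio, compared
to a chi-square distribution with 3 degrees of freedom (the number of free
parameters). For three degrees of freedom the survival function even has a
closed form; a Python sketch (illustrative, not the report's R code):

```python
import math

def chisq3_sf(x):
    """P(X > x) for X chi-square with 3 degrees of freedom (closed form)."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

# the 5% critical value is about 7.81: larger deviances are significant
p_at_critical = chisq3_sf(7.8147)
```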
Great! Statistics works. When we simulate data from a model
and estimate the parameters in that model, the maximum likelihood
estimates are close to the simulation truth, and the log likelihood
ratio test statistic comparing the MLE to the simulation truth parameter
value is not significant.
\subsection{Unconditional Models} \label{sec:check-unco-mod}
The next step is to switch to unconditional models.
<<>>=
out3 <- aster(resp ~ vtype + 0, pred, fam, varb, id, root,
    data = redata)
summary(out3)
@
We have no intuition about whether this model makes sense.
We must switch to unconditional models in order to model fitness,
but is this particular unconditional model sensible? To check
that out we look at conditional mean value parameters. Do they
still make sense? We originally simulated a conditional model,
based on certain conditional mean value parameters. Now we have
switched to an unconditional model, but we can still calculate its
conditional mean value parameters. The new model (\texttt{out3})
is ``incorrect'' in the sense that it is not the old model (\texttt{out2}),
but is it close to correct in the sense that the relevant parameters
have not changed?
In order to compute conditional mean value parameters we need to supply
data (since they depend on data, see \citealp*{gws}, Section~2.5).
Since all individuals are alike, we could have a new model matrix with
just one individual all of whose variables that are ``parent'' variables
(i.~e., all of the $U_i$ and $V_i$) are equal to one. But a much simpler
choice is just to use the observed data, if it contains one such individual.
<<>>=
foo <- apply(out3$x, 1, function(bar) all(bar > 0))
ifoo <- seq(along = foo)[foo]
length(ifoo)
ifoo <- ifoo[1]
ifoo
@
It does.
<<>>=
pout3 <- predict(out3, model.type = "conditional")
pout3 <- matrix(pout3, nrow(out3$x))
pout3 <- pout3[ifoo, ]
pout3
@
We see that all of the $W_i$ have conditional expectation
\Sexpr{round(pout3[3], 4)} given $V_i = 1$.
This is ``obvious'' (we admit we did not guess this in advance, but
it is easy to explain in hindsight) from the fact that all $W_i$ are
at terminal nodes, hence their
conditional canonical parameters $\theta_j$ are equal to their unconditional
canonical parameters $\varphi_j$.
We also see that all of the $V_i$ have conditional expectation
\Sexpr{round(pout3[2], 4)} given $U_i = 1$.
This is ``obvious'' (we admit we did not guess this in advance, but
it is easy to explain in hindsight) from the fact that all $V_i$ have
exactly one successor $W_i$ and all of the $W_i$ have the same distribution
and the same conditional
canonical parameter. Thus equation (5) in \citet{gws} makes all the
unconditional canonical parameters $\varphi_j$ for the $V_i$ the same.
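In our notation, the relation being invoked here is (cf.\ equation (5) of
\citealp*{gws}): for any node $j$, with $S(j)$ its set of successor nodes
and $c_k$ the cumulant function of the family at node $k$,
\begin{equation*}
\varphi_j = \theta_j - \sum_{k \in S(j)} c_k(\theta_k).
\end{equation*}
At a terminal node $S(j)$ is empty, so $\varphi_j = \theta_j$; and every
$V_i$ has the single successor $W_i$, all with the same $\theta$, so all
the $\varphi_j$ for the $V_i$ agree.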
So we are left with several questions. First, are the conditional
mean value parameters already discussed reasonable? They are to be
compared with
<<>>=
prepr
tau.pois
@
which are the actual conditional expectations used to simulate the
data. They seem close enough. (We know that the difference between
\Sexpr{round(pout3[3], 4)} and \Sexpr{round(tau.pois, 4)} is just sampling
variation, because there is no difference between conditional and
unconditional canonical parameters for terminal nodes, and the conditional
canonical parameters determine the conditional mean value parameters.)
Second, are the conditional
mean value parameters for the $U_i$ reasonable? These now all differ,
because each has a different number of ``ancestor'' nodes (predecessor,
predecessor of predecessor, etc.).
<<>>=
pout3u <- matrix(pout3, nrow = 3)[1, ]
round(pout3u, 4)
@
Clearly, our model has changed. Originally, we modeled organisms that do
not age: they have the same mortality at all ages. This was just for
simplicity; we could have put in mortality that is a function of age.
But when we switch to an unconditional model, mortality increases (survival
probability decreases) with age.
We have two options. Since we are just
making up data, we could accept this change. Alternatively, we could
put additional terms into the unconditional aster model to allow the
unconditional canonical parameters for the $U_i$ to increase with age
more than they do in the model fit above (\texttt{out3}).
Let's try the latter.
<<>>=
redata$year <- as.numeric(substring(as.character(redata$varb), 2))
redata$uyear <- redata$year *
    as.numeric(as.character(redata$vtype) == "u")
out4 <- aster(resp ~ vtype + uyear, pred, fam, varb, id, root,
data = redata)
summary(out4)
@
Again we look at conditional mean value parameters.
<<>>=
pout4 <- predict(out4, model.type = "conditional")
pout4 <- matrix(pout4, nrow(out4$x))
pout4 <- pout4[ifoo, ]
pout4
@
Again we see that all of the $W_i$ have the same conditional mean
value parameter and that it has hardly changed from what it was in
\texttt{out3}, and the same is true of all the $V_i$. Since we did
not change the model in any way that affects the parameterization
of these variables, this is no surprise.
<<>>=
pout4u <- matrix(pout4, nrow = 3)[1, ]
round(pout4u, 4)
@
Now the conditional mean value parameters $E(U_i \mid U_{i - 1} = 1)$ for
$i \ge 2$ and $E(U_1)$ seem fairly constant. Perhaps they decrease at
large ages (9 and 10), or perhaps this is just chance variation,
since there are few individuals alive at these ages. In any event,
we will be satisfied with this model \texttt{out4}. This will be
our ``baseline'' model, the model with a flat fitness landscape,
since expected fitness, which we take to be $\sum_i E(W_i)$, does not depend
on any of the $Z_i$, because we put none of these variables in any models
fit so far.
\section{Simulate New Data} \label{sec:new-sim}
\subsection{Overview} \label{sec:overview-three}
At this point we have a reasonable unconditional aster model
(\texttt{out4}) with a
flat fitness landscape. Now we want to simulate models with non-flat
fitness landscapes.
We will simulate two such models, which we call Model~1 and Model~2.
In Model~1, the fitness landscape depends on just two
phenotypic variables \texttt{z1} and \texttt{z2}.
This is somewhat unrealistic but allows
us to draw some simple pictures of fitness landscapes.
\citet[Section~1.2.5]{b+a} are particularly emphatic about the biological
unrealism of ``true'' (at least simulation truth) models with only a few
parameters. Hence we also simulate more realistic data that depends on
ten phenotypic variables \texttt{z1}, \ldots, \texttt{z10}.
We make both fitness landscapes depend quadratically on phenotypic variables
on the canonical parameter scale. Since the mean value parameter is a
multivariate monotone function of the canonical parameter, this means
expected fitness is a monotone function of this quadratic function
(see Section~3.2 of \citealp{evo2008talk} for details). We make our
quadratic function have negative curvature (corresponding to stabilizing
selection) in all directions. Thus it has elliptical contours.
By monotonicity, the actual fitness landscape (expected
fitness considered as a function of phenotypic variables) also has elliptical
contours. We locate the fitness maximum to one side of the distribution of
phenotypic values, not right in the middle.
The only remaining decision is how steeply fitness should fall off
as we move away from the fitness maximum. In order for our simulation
to provide useful guidance, the fall-off should be steep enough so that
the fitness landscape is statistically significant but not so steep that
one doesn't need statistics to see that it is significant. (It's always
nice when evidence is so clear that statistics is unnecessary, but this
is not typical.) Thus we adjust the dependence of fitness on phenotype
in both Model~1 and Model~2 to be strong but not too strong.
This proceeds in several steps, which are substeps of what was called
Step~3 in Section~\ref{sec:overview-one}.
\begin{enumerate}
\item[(a)] We fit a model with quadratic fitness landscape to the data
with flat fitness landscape. The sole purpose is to get the coefficient
vector for such a model.
\item[(b)] We adjust the coefficients of the quadratic fitness landscape.
\item[(c)] We simulate new data having this unconditional aster model with
non-flat (quadratic on the canonical parameter scale) fitness landscape.
\item[(d)] We do a likelihood ratio test comparing the models with flat
and non-flat fitness landscape. If the result is not statistically
significant or is too statistically significant (the inference is
so clear that statistics is unnecessary) we go back to substep (b).
\end{enumerate}
\subsection{Model 1} \label{sec:model-1}
For a first try, let us have the fitness landscape depend on just the first
two phenotypic variables \texttt{z1} and \texttt{z2}.
\citet[Section~1.2.5]{b+a} are particularly emphatic about the biological
unrealism of ``true'' (at least simulation truth) models with only a few
parameters. The main virtue of this simple model is that it allows
us to draw some simple pictures of fitness landscapes. We will also
simulate a more complicated model in Section~\ref{sec:model-2}.
Now we will simulate new data using the \texttt{raster} function. To use
that we need to, first, make up a model, including dependence on \texttt{z1}
and \texttt{z2} and, second, convert unconditional canonical parameters
$\varphi$ to conditional canonical parameters $\theta$, because that is what
the \texttt{raster} function wants. There is no function in the \texttt{aster}
package that does this conversion, perhaps a defect in the \texttt{aster}
user interface. We can, however, trick the \texttt{predict.aster} function
into doing this conversion.
First we fit the model we want to use to the data we have. The fitted
parameters will make no sense, because the fitness landscape is flat for
the data we have, but we can use the model structure.
<<>>=
out5 <- aster(resp ~ vtype + uyear + z1 + z2 + I(z1^2) + I(z2^2) + I(z1 * z2),
    pred, fam, varb, id, root, data = redata)
summary(out5)
@
We now want to make up a quadratic function of $z$ that has approximately
mean zero (averaged over the observed $z$ values) so that the average fitness is
more or less the same as in our initial data. We also want reasonably
sane values, so we keep the quadratic term small. For simplicity,
we make the fitness landscape symmetric in the two variables.
(It took several tries to get the quadratic term small enough. These tries are
not shown. The coefficients of the linear and quadratic terms were adjusted
until the $P$-values in the tests on p.~\pageref{pg:out6-8} show that both
were highly statistically significant but not humongously so.)
Of course, all of this can be changed if one has the RNW file
for this document. \label{rnw:3}
<<>>=
z1 <- dat$z1
z2 <- dat$z2
ascal <- 0.013
quad <- ascal * ((z1 + z2) - (z1^2 + z2^2) + z1 * z2)
con <- mean(quad)
mean(quad - con)
@
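Writing this quadratic as $q(z) = a^\top z + \tfrac{1}{2} z^\top H z$, the
Hessian $H$ is negative definite (so the contours are elliptical) and the
maximum solves $H z = -a$, landing at $(1, 1)$, off-center relative to the
mean-zero phenotype distribution. A Python check (illustrative only):

```python
import numpy as np

ascal = 0.013
# q(z) = ascal * (z1 + z2 - z1^2 - z2^2 + z1 * z2) = a.z + (1/2) z' H z
a = ascal * np.array([1.0, 1.0])       # linear coefficients
H = ascal * np.array([[-2.0, 1.0],     # Hessian of q
                      [1.0, -2.0]])
eig = np.linalg.eigvalsh(H)            # both negative: negative definite
zmax = np.linalg.solve(H, -a)          # stationary point of q
```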
Now we change the coefficients in \texttt{out5} to be the ones for this
made up model.
<<>>=
beta.new <- out5$coefficients
beta.new[1:4] <- out4$coefficients
beta.new[3] <- beta.new[3] - con
beta.new[5:6] <- ascal
beta.new[7:8] <- (- ascal)
beta.new[9] <- ascal
beta.new <- round(beta.new, 3)
beta.new
@
Then we ``predict'' the conditional canonical parameters using
\texttt{beta.new} because those are the parameters that the \texttt{raster}
function wants.
<<>>=
theta <- predict(out5, model.type = "conditional", parm.type = "canonical",
    newcoef = beta.new)
theta <- matrix(theta, nrow = nrow(out5$x), ncol = ncol(out5$x))
# apply(theta, 2, range)
xnew <- raster(theta, pred, fam, root)
@
Now we need to reshape these new data just like we did the old.
<<>>=
dimnames(xnew) <- list(NULL, vars)
dnew1 <- as.data.frame(xnew)
renew <- reshape(dnew1, varying = list(vars), direction = "long",
    timevar = "varb", times = as.factor(vars), v.names = "resp")
redata$resp1 <- renew$resp
@
For future reference we also save the unconditional canonical and
mean value ``simulation truth'' parameter values.
<<>>=
redata$tau1 <- predict(out5, newcoef = beta.new)
redata$phi1 <- predict(out5, parm.type = "canonical", newcoef = beta.new)
beta1true <- beta.new
@
\subsection{Model 2} \label{sec:model-2}
As pointed out in Sections~\ref{sec:overview-three} and~\ref{sec:model-1},
\citet[Section~1.2.5]{b+a} are particularly emphatic about the biological
unrealism of simulation truth models like that simulated
in Section~\ref{sec:model-1}.
In reality, the true stochastic mechanism generating the data
is much more complicated than any model under consideration and even if
known would have far too many parameters to be well estimated by available
data. Not all parameters are equally important, of course, and some parameters
of the true model, assuming it were known, could be estimated better than
others.
Hence in this section we create a model that is much like the model in
Section~\ref{sec:model-1} except the quadratic function depends on all of
the $z_i$ but not equally.
As before, we start by fitting the full model to the data at hand.
<<>>=
out5too <- aster(resp ~ vtype + uyear + poly(z1, z2, z3, z4, z5, z6, z7,
    z8, z9, z10, degree = 2, raw = TRUE),
    pred, fam, varb, id, root, data = redata)
names(out5too$coefficients) <- sub("^poly([^)]*)", "",
    names(out5too$coefficients), extended = FALSE)
summary(out5too)
@
Now we have to figure out which coefficients are which.
<<>>=
fred <- strsplit(names(out5too$coefficients)[-(1:4)], "\\.")
fred <- lapply(fred, as.numeric)
@
Then we construct the quadratic function that is the canonical parameter
for fitness. This time we don't bother to adjust the intercept.
Of course, all of this can be changed if one has the RNW file
for this document. \label{rnw:4}
<<>>=
b <- 0.5
bvec <- c(1, 1, b^(1:8))
bvec <- bvec / sum(bvec) * 2
round(bvec, 5)
@
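The weights fall off geometrically after the first two variables and are
rescaled to sum to 2, so \texttt{z1} and \texttt{z2} dominate the landscape
while the remaining variables contribute progressively less. In Python
(illustrative only):

```python
b = 0.5
w = [1.0, 1.0] + [b ** k for k in range(1, 9)]   # 1, 1, 1/2, 1/4, ..., 1/256
total = sum(w)
bvec = [x / total * 2 for x in w]                # rescaled so the weights sum to 2
```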
Again we make a parameter vector \texttt{beta.new}, using the weights
constructed above.
<<>>=
beta.new <- out5too$coefficients
beta.new[1:4] <- out4$coefficients
for (i in seq(along = fred)) {
    freddy <- fred[[i]]
    sally <- freddy > 0
    if (sum(sally) == 1) {
        if (freddy[sally] == 1) {
            ### linear term
            beta.new[i + 4] <- ascal * bvec[sally]
        } else {
            ### quadratic term
            beta.new[i + 4] <- (- ascal * bvec[sally])
        }
    } else {
        ### cross term
        beta.new[i + 4] <- ascal * mean(bvec[sally])
    }
}
beta.new
@
And the rest is just like Section~\ref{sec:model-1}.
<<>>=
theta <- predict(out5too, model.type = "conditional",
    parm.type = "canonical", newcoef = beta.new)
theta <- matrix(theta, nrow = nrow(out5too$x), ncol = ncol(out5too$x))
# apply(theta, 2, range)
xnew <- raster(theta, pred, fam, root)
@
Reshape.
<<>>=
dimnames(xnew) <- list(NULL, vars)
dnew2 <- as.data.frame(xnew)
renew <- reshape(dnew2, varying = list(vars), direction = "long",
    timevar = "varb", times = as.factor(vars), v.names = "resp")
redata$resp2 <- renew$resp
@
For future reference we also save the unconditional canonical and
mean value ``simulation truth'' parameter values.
<<>>=
redata$tau2 <- predict(out5too, newcoef = beta.new)
redata$phi2 <- predict(out5too, parm.type = "canonical", newcoef = beta.new)
beta2true <- beta.new
@
\section{Fit New Data}
\subsection{Overview} \label{sec:overview-four}
At this point, we have finished making up data
(by simulation from an aster model). In real life, none of the work
up to here would be necessary. We would collect data by measuring
fitness components of real living organisms. For the purposes of this
technical report we pretend that the simulated data, \verb@redata$resp1@
and \verb@redata$resp2@, are such real scientific data. Everything we
do from here on is an example of what could be done with real data.
\subsection{Fit Model 1} \label{sec:fit-new}
We should be able to fit the model that was used to simulate these data.
We start with the simpler Model~1, in which only \texttt{z1} and \texttt{z2}
are involved in the simulation truth model.
\label{pg:out6}
<<>>=
out6 <- aster(resp1 ~ vtype + uyear + z1 + z2 + I(z1^2) + I(z2^2) + I(z1 * z2),
    pred, fam, varb, id, root, data = redata)
summary(out6)
@
For further comparison, the likelihood ratio test with this model
as the alternative hypothesis and the simulation truth for the point
null hypothesis can be calculated as follows.
<<>>=
mout6true <- mlogl(beta1true, pred, fam, out6$x,
    out6$root, out6$modmat, type = out6$type, origin = out6$origin)
dev6true <- 2 * mout6true$value - out6$deviance
dev6true
1 - pchisq(dev6true, df = length(beta1true))
@
Great! Statistics works again. When we simulate data from a model
and estimate the parameters in that model, the maximum likelihood
estimates are close to the simulation truth.
We also look at how statistically significant our quadratic effect is.
<<>>=
out8 <- aster(resp1 ~ vtype + uyear,
    pred, fam, varb, id, root, data = redata)
out7 <- aster(resp1 ~ vtype + uyear + z1 + z2,
    pred, fam, varb, id, root, data = redata)
anova(out8, out7, out6)
anova(out8, out6)
@
\label{pg:out6-8}
\subsection{Plot Monotone Function of Fitness}
We extract the quadratic part of the fitness function (on the canonical
parameter scale), which is a monotone transformation of actual fitness;
see Sections~3.4 and~3.10 of \citet*{aster2tr} for details.
<<>>=
qcoef <- out6$coefficients
@
<<>>=
a1 <- qcoef["z1"]
a2 <- qcoef["z2"]
a <- c(a1, a2)
A11 <- qcoef["I(z1^2)"]
A22 <- qcoef["I(z2^2)"]
A12 <- qcoef["I(z1 * z2)"] / 2
A <- matrix(c(A11, A12, A12, A22), 2, 2)
@
Figure~\ref{fig:surf} (page~\pageref{fig:surf})
shows the scatterplot of data values for \texttt{z1} and \texttt{z2}
and the contours of the estimated quadratic fitness function.
It is similar to Figure~3 in the first submission of \citet{aster2},
which is Figure~3.1 in \citet{aster2tr}.
(In the second submission of the paper,
this figure was changed to be like Figure~\ref{fig:nerf} below.)
It is made by the following code.
<