\documentclass[11pt]{article}
\usepackage{indentfirst}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amscd}
\usepackage{natbib}
\usepackage{url}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\aic}{AIC}
\DeclareMathOperator{\aicc}{AICc}
\DeclareMathOperator{\bic}{BIC}
\DeclareMathOperator{\hic}{HIC}
\DeclareMathOperator{\lb}{LB}
\DeclareMathOperator{\gcsm}{gcsm}
\DeclareMathOperator{\lcsm}{lcsm}
\DeclareMathOperator{\tr}{tr}
\newcommand{\boldtheta}{{\boldsymbol{\theta}}}
\newcommand{\boldphi}{{\boldsymbol{\varphi}}}
\newcommand{\boldxi}{{\boldsymbol{\xi}}}
\newcommand{\boldmu}{{\boldsymbol{\mu}}}
\newcommand{\boldw}{{\mathbf{w}}}
\newcommand{\boldx}{{\mathbf{x}}}
\newcommand{\boldy}{{\mathbf{y}}}
\newcommand{\boldz}{{\mathbf{z}}}
\hyphenation{Wa-gen-ius}
\setlength{\textheight}{\paperheight}
\addtolength{\textheight}{- 2 in}
\setlength{\topmargin}{0.25 pt}
\setlength{\headheight}{0 pt}
\setlength{\headsep}{0 pt}
\addtolength{\textheight}{- \topmargin}
\addtolength{\textheight}{- \topmargin}
\addtolength{\textheight}{- \footskip}
\setlength{\textwidth}{\paperwidth}
\addtolength{\textwidth}{- 2 in}
\setlength{\oddsidemargin}{0.25 in}
\setlength{\evensidemargin}{0.25 in}
\addtolength{\textwidth}{- \oddsidemargin}
\addtolength{\textwidth}{- \evensidemargin}
\begin{document}

\vspace*{0.9375in}
\begin{center}
{\bfseries Model Selection in Estimation of Fitness Landscapes} \\
By \\
Charles J. Geyer and Ruth G. Shaw \\
Technical Report No.~671 \\
School of Statistics \\
University of Minnesota \\
% April 20, 2005 \\
July 10, 2008 \\
Revised \today
\end{center}
\thispagestyle{empty}

\cleardoublepage
\setcounter{page}{1}
\thispagestyle{empty}

\begin{abstract}
A solution to the problem of estimating fitness landscapes was proposed
by \citet{la}.
Another solution, which avoids problematic aspects of the Lande-Arnold
methodology, was proposed by \citet*{aster2}, who also provided an
illustrative example involving real data.
An earlier technical report \citep{evo2008talk} gave an example that was
simpler in some ways (the data are simulated from the aster model, so there
are none of the issues of making the data fit the model that arise with
real data) and much more complicated in others (each individual has five
measured components of fitness over four time periods, 20 variables in all)
and that illustrates the full richness possible in aster analysis of fitness
landscapes.
The one issue that technical report did not deal with is model selection.
When many phenotypic variables are measured, one often does not know which
to put in the model.
\citet{la} proposed using principal components regression as a method of
dimension reduction, but this method is known to have no theoretical basis.
Much of late 20th century and 21st century statistics is about model
selection and model averaging, and we apply some of this methodology
(which does have a strong theoretical basis) to estimation of fitness
landscapes using another simulated data set.
All analyses are done in R \citep{rcore} using the \texttt{aster}
contributed package described by \citet*{gws}, except for analyses in the
style of \citet{la}, which use ordinary least squares regression.
Furthermore, all analyses are done using the \texttt{Sweave} function in R,
so this entire technical report and all of the analyses reported in it are
completely reproducible by anyone who has R with the \texttt{aster} package
installed and the R noweb file specifying the document.
This revision corrects major errors in the frequentist model averaging
calculations (Section~\ref{sec:fma}) in the first version of the technical
report.
\end{abstract}

<<>>=
options(keep.source = TRUE, width = 65)
ps.options(pointsize = 15)
pdf.options(pointsize = 15)
@

\section{R Package Aster}

We use the R statistical computing environment \citep{rcore} in our
analysis.  It is free software and can be obtained from
\url{http://cran.r-project.org}.  Precompiled binaries are available for
Windows, Macintosh, and popular Linux distributions.
We use the contributed package \verb@aster@.  If R has been installed, but
this package has not yet been installed, do
\begin{verbatim}
install.packages("aster")
\end{verbatim}
from the R command line (or do the equivalent using the GUI menus if on
Apple Macintosh or Microsoft Windows).  This may require root or
administrator privileges.

Assuming the \verb@aster@ package has been installed, we load it
<<>>=
library(aster)
@

<<>>=
baz <- library(help = "aster")
baz <- baz$info[[1]]
baz <- baz[grep("Version", baz)]
baz <- sub("^Version: *", "", baz)
bazzer <- paste(R.version$major, R.version$minor, sep = ".")
@
The version of the package used to make this document is \Sexpr{baz}
(which is available on CRAN).  The version of R used to make this document
is \Sexpr{bazzer}.

This entire document and all of the calculations shown were made using the
R command \texttt{Sweave} and hence are exactly reproducible by anyone who
has R and the R noweb (RNW) file from which it was created.  Both the RNW
file and the PDF document produced from it are available at
\url{http://www.stat.umn.edu/geyer/aster}.  For further details on the use
of Sweave and R see Chapter~1 of the technical report by \citet{aster2tr}
available at the same web site.

Not only can one exactly reproduce the results in the printable document,
one can also modify the parameters of the simulation and get different
results.  Obvious modifications to try are noted on pages~\pageref{rnw:1},
\pageref{rnw:2}, \pageref{rnw:3}, and~\pageref{rnw:4} below.  But, of
course, anything at all can be changed once one has the RNW file.

\section{Simulate Data from Conditional Model} \label{sec:sim-cond}

\subsection{Overview} \label{sec:overview-one}

It is hard to make up an unconditional aster model because the
unconditional parameterization is so unintuitive.  Thus we proceed in two
steps.
\begin{itemize}
\item First we simulate data using a conditional aster model with
parameters we understand.
\item Then we fit an unconditional aster model to these data, and simulate
data using the fitted unconditional model.
\end{itemize}

Also it is hard to make up a fitness landscape, because we model it on the
canonical parameter scale, and canonical parameters have no simple
connection to the mean value parameter scale.  Thus we also proceed in two
steps here.
\begin{itemize}
\item First we simulate data having a flat fitness landscape.
\item Then we fit models having non-flat fitness landscapes, which we
adjust in strength to be moderately statistically significant.
\end{itemize}

The way these two issues interleave is as follows.
\begin{enumerate}
\item We simulate data using a conditional model having a flat fitness
landscape.
\item We simulate data using an unconditional model having a flat fitness
landscape.
\item We simulate data using unconditional models having non-flat fitness
landscapes.
\end{enumerate}
In Section~\ref{sec:sim-cond} we only do the first step.
\subsection{Graphical Model}

We make a graphical model based on consecutive time periods (call them
years for concreteness).  Data are ``observed'' over 10 years.  For each
individual we observe in year $i$ three random variables $U_i$, $V_i$, and
$W_i$.  All random variables on one individual are connected by the graph
shown in Figure~\ref{fig:graph}.
\begin{figure}
\begin{center}
\setlength{\unitlength}{0.45 in}
\thicklines
\begin{picture}(10.3,2.5)(-0.1,-2.3)
\put(0,0){\makebox(0,0){$1_{\hphantom{0}}$}}
\put(1,0){\makebox(0,0){$u_1$}}
\put(2,0){\makebox(0,0){$u_2$}}
\put(3,0){\makebox(0,0){$u_3$}}
\put(4,0){\makebox(0,0){$u_4$}}
\put(5,0){\makebox(0,0){$u_5$}}
\put(6,0){\makebox(0,0){$u_6$}}
\put(7,0){\makebox(0,0){$u_7$}}
\put(8,0){\makebox(0,0){$u_8$}}
\put(9,0){\makebox(0,0){$u_9$}}
\put(10,0){\makebox(0,0){$u_{10}$}}
\multiput(0.35,0)(1,0){10}{\vector(1,0){0.3}}
\put(1,-1){\makebox(0,0){$v_1$}}
\put(2,-1){\makebox(0,0){$v_2$}}
\put(3,-1){\makebox(0,0){$v_3$}}
\put(4,-1){\makebox(0,0){$v_4$}}
\put(5,-1){\makebox(0,0){$v_5$}}
\put(6,-1){\makebox(0,0){$v_6$}}
\put(7,-1){\makebox(0,0){$v_7$}}
\put(8,-1){\makebox(0,0){$v_8$}}
\put(9,-1){\makebox(0,0){$v_9$}}
\put(10,-1){\makebox(0,0){$v_{10}$}}
\multiput(1,-0.25)(1,0){10}{\vector(0,-1){0.5}}
\put(1,-2){\makebox(0,0){$w_1$}}
\put(2,-2){\makebox(0,0){$w_2$}}
\put(3,-2){\makebox(0,0){$w_3$}}
\put(4,-2){\makebox(0,0){$w_4$}}
\put(5,-2){\makebox(0,0){$w_5$}}
\put(6,-2){\makebox(0,0){$w_6$}}
\put(7,-2){\makebox(0,0){$w_7$}}
\put(8,-2){\makebox(0,0){$w_8$}}
\put(9,-2){\makebox(0,0){$w_9$}}
\put(10,-2){\makebox(0,0){$w_{10}$}}
\multiput(1,-1.25)(1,0){10}{\vector(0,-1){0.5}}
\end{picture}
\end{center}
\caption{Graph for the example.  $u_i$ indicate survival, $w_i$ count
number of offspring, $v_i$ indicate $w_i > 0$.}
\label{fig:graph}
\end{figure}

The variables $U_i$ and $V_i$ are Bernoulli (zero-or-one valued); the
variables $W_i$ are nonnegative-integer-valued.  The $V_i$ are redundant:
$V_i = 1$ if and only if $W_i > 0$.  The variables $U_i$ indicate survival
through the $i$-th year; the variables $W_i$ count offspring produced in
the $i$-th year.  Hence the variables $V_i$ indicate successful
reproduction in the $i$-th year (one or more offspring).  We model the
$W_i$ as zero-truncated Poisson given $V_i$.  If one likes to think of it
that way, one can ignore $V_i$ and say that $W_i$ is zero-inflated Poisson
given $U_i$, but the aster way to think about this requires three variables
$U_i$, $V_i$, and $W_i$ because that fits the required exponential family
structure.

In these data, the variable $\sum_i W_i$ is considered fitness (total
number of offspring over the course of the study, which will include most
of the lifespan).

\subsection{Simulation}

We set the seed of the random number generator so that we obtain the same
results every time.  To get different results, obtain the RNW file, change
this statement, and reprocess using \texttt{Sweave} and \LaTeX.
\label{rnw:1}
<<>>=
RNGversion("2.5.0")
set.seed(42)
@

We also set up some parameters of the simulation.  These can also be
changed.
<<>>=
nind <- 350
ntime <- 10
psurv <- 0.8
prepr <- 0.7
mpois <- 3
theta.surv <- log(psurv) - log(1 - psurv)   # logit of survival probability
theta.repr <- log(prepr) - log(1 - prepr)   # logit of reproduction indicator
theta.pois <- log(mpois)                    # log mean of untruncated Poisson
tau.pois <- mpois / (1 - exp(- mpois))      # mean of zero-truncated Poisson
@
For a first pass we set mean survival per unit time to be \verb@psurv@ =
\Sexpr{psurv} and mean reproduction indicator per unit time to be
\verb@prepr@ = \Sexpr{prepr}.
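These conditional canonical parameters are just the natural parameters of
the corresponding one-parameter exponential families: logits of success
probabilities for the Bernoulli nodes and the log of the untruncated mean
for the zero-truncated Poisson nodes.  The relationships are easy to check
from the R command line (this sketch is not part of the Sweave computations
for this document)
\begin{verbatim}
plogis(log(0.8) - log(0.2))    # recovers psurv = 0.8
plogis(log(0.7) - log(0.3))    # recovers prepr = 0.7
## zero-truncated Poisson mean: formula versus brute force
3 / (1 - exp(- 3))             # about 3.157
w <- rpois(1e6, 3)             # large untruncated Poisson sample
mean(w[w > 0])                 # about the same
\end{verbatim}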
For number of offspring in a time period given any offspring in that time
period, the easiest thing to specify is the mean of the untruncated Poisson
random variable that when truncated is the distribution we want.  This
untruncated mean is \verb@mpois@ = \Sexpr{mpois}.  The mean of the
corresponding zero-truncated distribution is
$\tau = \Sexpr{round(tau.pois, 3)}$.
<<>>=
## nodes are ordered u1, v1, w1, u2, v2, w2, ...
vars <- as.vector(outer(c("u", "v", "w"), 1:10, paste, sep = ""))
vtype <- as.factor(substr(as.character(vars), 1, 1))
## families: 1 = Bernoulli, 3 = zero-truncated Poisson
fam <- rep(1, length(vars))
fam[vtype == "w"] <- 3
## predecessors: v_i <- u_i and w_i <- v_i (the preceding node),
## u_i <- u_{i-1} (three nodes back), and u_1 <- root (coded 0)
pred <- seq(along = vars) - 1
pred[vtype == "u"] <- seq(along = vars)[vtype == "u"] - 3
pred[1] <- 0
@
\label{pg:vars}

Simulate data.
<<>>=
root <- matrix(1, nrow = nind, ncol = length(vars))
theta <- root
theta[ , vtype == "u"] <- theta.surv
theta[ , vtype == "v"] <- theta.repr
theta[ , vtype == "w"] <- theta.pois
x <- raster(theta, pred, fam, root)
dimnames(x) <- list(NULL, vars)
dat <- as.data.frame(x)
dat <- as.list(dat)
dat[["root"]] <- rep(1, nind)
@

Now we add some covariates.  These will play the role of phenotypic
variables in the Lande-Arnold analysis and the competing aster analysis.
Lande-Arnold analysis requires that these be jointly multivariate normal
and centered at zero.  Aster analysis does not.  For fair comparison, we
make these covariates satisfy the conditions required for Lande-Arnold
analysis.  In addition, we make them have the same variance, and we make
all pairs of them have correlation $1 / 2$.  The reason for the correlation
is that this makes some of the modeling issues more difficult.  Of course,
all of this can be changed if one has the RNW file for this document.
\label{rnw:2}
<<>>=
npheno <- 10
zbase <- rnorm(nind)   # shared component makes all pairwise correlations 1/2
for (i in 1:npheno) {
    labz <- paste("z", i, sep = "")
    dat[[labz]] <- zbase + rnorm(nind)
}
@

Reshape the data.
<<>>=
names(dat)
dat <- as.data.frame(dat)
redata <- reshape(dat, varying = list(vars), direction = "long",
    timevar = "varb", times = as.factor(vars), v.names = "resp")
@

There is one further step.  We wish to make inferences about overall
fitness, taking advantage of the monotonicity property of unconditional
aster models, as explained in Section~3.2 of \citet{evo2008talk}.  This
requires that we model only the $W_i$ as dependent on the phenotypic
traits.  To accomplish this, we set to zero all the phenotype values except
those associated with the $W_i$, the life history variables that directly
reflect fitness.
<<>>=
wind <- grep("w", as.character(redata$varb))
for (labz in grep("z", names(redata), value = TRUE)) {
    redata[[labz]][- wind] <- 0
}
@

\section{Fit Initial Data} \label{sec:fit-initial}

\subsection{Overview} \label{sec:overview-two}

We have now simulated our ``initial data,'' so we have done step~1 of the
overview in Section~\ref{sec:overview-one}.  However, we still do a few
checks to see that these data make sense (of course, they must unless the
aster software is buggy, but we check anyway); we do this in
Section~\ref{sec:check-cond-mod}.

In Section~\ref{sec:check-unco-mod} we start on what we called step~2 in
the overview in Section~\ref{sec:overview-one}.  First we fit an
unconditional aster model to the data we just checked for reasonableness
(Section~\ref{sec:check-cond-mod}).  Then we ``predict'' the mean value
parameters and see whether they make sense (according to our intuition).
They sort of make sense, but we think we can do better, so we fit a more
elaborate unconditional model adding a term \texttt{uyear} to the model
formula, which lets mortality be a linear function of year on the canonical
parameter scale.  We again ``predict'' the mean value parameters and see
whether they make sense, and this time decide they are good enough
(according to our intuition).  That finishes what we called step~2 in the
overview in Section~\ref{sec:overview-one}, and also finishes
Section~\ref{sec:fit-initial}.  We continue our overview in
Section~\ref{sec:new-sim}.

\subsection{Conditional Models} \label{sec:check-cond-mod}

Fit the aster model (the one used to generate the data).
<<>>=
redata$vtype <- as.factor(substr(as.character(redata$varb), 1, 1))
out2 <- aster(resp ~ vtype + 0, pred, fam, varb, id, root,
    data = redata, type = "conditional")
summary(out2)
@
For comparison, these should be
<<>>=
theta.surv
theta.repr
theta.pois
@

For further comparison, the likelihood ratio test with this model as the
alternative hypothesis and the simulation truth for the point null
hypothesis can be calculated as follows.
<<>>=
mout2true <- mlogl(c(theta.surv, theta.repr, theta.pois), pred, fam,
    out2$x, out2$root, out2$modmat, type = out2$type,
    origin = out2$origin)
dev2true <- 2 * mout2true$value - out2$deviance
dev2true
1 - pchisq(dev2true, df = 3)
@
Great!  Statistics works.  When we simulate data from a model and estimate
the parameters in that model, the maximum likelihood estimates are close to
the simulation truth, and the log likelihood ratio test statistic comparing
the MLE to the simulation truth parameter value is not significant.

\subsection{Unconditional Models} \label{sec:check-unco-mod}

The next step is to switch to unconditional models.
<<>>=
out3 <- aster(resp ~ vtype + 0, pred, fam, varb, id, root, data = redata)
summary(out3)
@

We have no intuition about whether this model makes sense.  We must switch
to unconditional models in order to model fitness, but is this particular
unconditional model sensible?  To check that out we look at conditional
mean value parameters.  Do they still make sense?  We originally simulated
a conditional model, based on certain conditional mean value parameters.
Now we have switched to an unconditional model, but we can still calculate
its conditional mean value parameters.  The new model (\texttt{out3}) is
``incorrect'' in the sense that it is not the old model (\texttt{out2}),
but is it close to correct in the sense that the relevant parameters have
not changed?

In order to compute conditional mean value parameters we need to supply
data (since they depend on data, see \citealp*{gws}, Section~2.5).  Since
all individuals are alike, we could have a new model matrix with just one
individual, all of whose ``parent'' variables (i.~e., all of the $U_i$ and
$V_i$) are equal to one.  But a much simpler choice is just to use the
observed data, if they contain one such individual.
<<>>=
## find an individual that survived and reproduced in every year
foo <- apply(out3$x, 1, function(bar) all(bar > 0))
ifoo <- seq(along = foo)[foo]
length(ifoo)
ifoo <- ifoo[1]
ifoo
@
They do.
<<>>=
pout3 <- predict(out3, model.type = "conditional")
pout3 <- matrix(pout3, nrow(out3$x))
pout3 <- pout3[ifoo, ]
pout3
@
We see that all of the $W_i$ have conditional expectation
\Sexpr{round(pout3[3], 4)} given $V_i = 1$.
This is ``obvious'' (we admit we did not guess this in advance, but it is
easy to explain in hindsight) from the fact that all $W_i$ are at terminal
nodes, hence their conditional canonical parameters $\theta_j$ are equal to
their unconditional canonical parameters $\varphi_j$.

We also see that all of the $V_i$ have conditional expectation
\Sexpr{round(pout3[2], 4)} given $U_i = 1$.  This is ``obvious'' (we admit
we did not guess this in advance, but it is easy to explain in hindsight)
from the fact that all $V_i$ have exactly one successor $W_i$ and all of
the $W_i$ have the same distribution and the same conditional canonical
parameter.  Thus equation (5) in \citet{gws} makes all the unconditional
canonical parameters $\varphi_j$ for the $V_i$ the same.

So we are left with several questions.  First, are the conditional mean
value parameters already discussed reasonable?  They are to be compared
with
<<>>=
prepr
tau.pois
@
which are the actual conditional expectations used to simulate the data.
They seem close enough.  (We know that the difference between
\Sexpr{round(pout3[3], 4)} and \Sexpr{round(tau.pois, 4)} is just sampling
variation, because there is no difference between conditional and
unconditional canonical parameters for terminal nodes, and the conditional
canonical parameters determine the conditional mean value parameters.)

Second, are the conditional mean value parameters for the $U_i$ reasonable?
These now all differ, because each has a different number of ``ancestor''
nodes (predecessor, predecessor of predecessor, etc.).
<<>>=
pout3u <- matrix(pout3, nrow = 3)[1, ]
round(pout3u, 4)
@
Clearly, our model has changed.  Originally, we modeled organisms that do
not age: they have the same mortality at all ages.  This was just for
simplicity; we could have put in mortality that is a function of age.  But
when we switch to an unconditional model, mortality increases (survival
probability decreases) with age.

We have two options.  Since we are just making up data, we could accept
this change.  Alternatively, we could put additional terms into the
unconditional aster model to allow the unconditional canonical parameters
for the $U_i$ to increase with age more than they do in the model fit above
(\texttt{out3}).  Let's try the latter.
<<>>=
redata$year <- as.numeric(substring(as.character(redata$varb), 2))
redata$uyear <- redata$year * as.numeric(as.character(redata$vtype) == "u")
out4 <- aster(resp ~ vtype + uyear, pred, fam, varb, id, root,
    data = redata)
summary(out4)
@

Again we look at conditional mean value parameters.
<<>>=
pout4 <- predict(out4, model.type = "conditional")
pout4 <- matrix(pout4, nrow(out4$x))
pout4 <- pout4[ifoo, ]
pout4
@
Again we see that all of the $W_i$ have the same conditional mean value
parameter and that it has hardly changed from what it was in \texttt{out3},
and the same is true of all the $V_i$.  Since we did not change the model
in any way that affects the parameterization of these variables, this is no
surprise.
<<>>=
pout4u <- matrix(pout4, nrow = 3)[1, ]
round(pout4u, 4)
@
Now the conditional mean value parameters $E(U_i \mid U_{i - 1} = 1)$ for
$i \ge 2$ and $E(U_1)$ seem fairly constant.  Perhaps they decrease at
large ages (9 and 10), or perhaps this is just chance variation, since
there are few individuals alive at these ages.  In any event, we will be
satisfied with this model \texttt{out4}.
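Before moving on, the parameter transformation behind the two ``obvious''
facts above can be made concrete.  By equation (5) in \citet{gws}, the
unconditional canonical parameter of a node is its conditional canonical
parameter minus the sum of the cumulant functions of its successors
evaluated at their conditional canonical parameters; a terminal node has no
successors, so $\varphi_j = \theta_j$ there.  A sketch for a $v$ node,
whose sole successor is a zero-truncated Poisson $w$ node (plain R, with a
made-up value for $\theta_v$):
\begin{verbatim}
## cumulant function of the zero-truncated Poisson family
cw <- function(theta) log(exp(exp(theta)) - 1)
theta.w <- log(3)    # conditional parameter of a w node
phi.w <- theta.w     # terminal node, so phi = theta
theta.v <- 0.85      # made-up conditional parameter of a v node
phi.v <- theta.v - cw(theta.w)    # subtract the successor's cumulant
phi.v
\end{verbatim}
Since all the $w$ nodes have the same $\theta$, the subtracted term is the
same for every $v$ node, which is why all the $\varphi_j$ for the $V_i$
coincide.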
This model (\texttt{out4}) will be our ``baseline'' model, the model with a
flat fitness landscape, since fitness, which we take to be
$\sum_i E(W_i)$, does not depend on any of the $Z_i$, because we put none
of these variables in any models fit so far.

\section{Simulate New Data} \label{sec:new-sim}

\subsection{Overview} \label{sec:overview-three}

At this point we have a reasonable unconditional aster model
(\texttt{out4}) with a flat fitness landscape.  Now we want to simulate
models with non-flat fitness landscapes.  We will simulate two such models,
which we call Model~1 and Model~2.  In Model~1, the fitness landscape
depends on just two phenotypic variables \texttt{z1} and \texttt{z2}.  This
is somewhat unrealistic but allows us to draw some simple pictures of
fitness landscapes.  \citet[Section~1.2.5]{b+a} are particularly emphatic
about the biological unrealism of ``true'' (at least simulation truth)
models with only a few parameters.  Hence we also simulate more realistic
data that depend on ten phenotypic variables \texttt{z1}, \ldots,
\texttt{z10}.

We make both fitness landscapes depend quadratically on phenotypic
variables on the canonical parameter scale.  Since the mean value parameter
is a multivariate monotone function of the canonical parameter, this means
expected fitness is a monotone function of this quadratic function (see
Section~3.2 of \citealp{evo2008talk} for details).  We make our quadratic
function have negative curvature (corresponding to stabilizing selection)
in all directions.  Thus it has elliptical contours.  By monotonicity, the
actual fitness landscape (expected fitness considered as a function of
phenotypic variables) also has elliptical contours.  We locate the fitness
maximum to one side of the distribution of phenotypic values, not right in
the middle.

The only remaining decision is how steeply fitness should fall off as we
move away from the fitness maximum.  In order for our simulation to provide
useful guidance, the fall-off should be steep enough so that the fitness
landscape is statistically significant but not so steep that one doesn't
need statistics to see that it is significant.  (It's always nice when
evidence is so clear that statistics is unnecessary, but this is not
typical.)  Thus we adjust the dependence of fitness on phenotype in both
Model~1 and Model~2 to be strong but not too strong.

This proceeds in several steps, which are substeps of what was called
Step~3 in Section~\ref{sec:overview-one}.
\begin{enumerate}
\item[(a)] We fit a model with quadratic fitness landscape to the data with
flat fitness landscape.  The sole purpose is to get the coefficient vector
for such a model.
\item[(b)] We adjust the coefficients of the quadratic fitness landscape.
\item[(c)] We simulate new data having this unconditional aster model with
non-flat (quadratic on the canonical parameter scale) fitness landscape.
\item[(d)] We do a likelihood ratio test comparing the models with flat and
non-flat fitness landscape.  If the result is not statistically significant
or is too statistically significant (the inference is so clear that
statistics is unnecessary), we go back to substep (b).
\end{enumerate}

\subsection{Model 1} \label{sec:model-1}

For a first try, let us have the fitness landscape depend on just the first
two phenotypic variables \texttt{z1} and \texttt{z2}.
\citet[Section~1.2.5]{b+a} are particularly emphatic about the biological
unrealism of ``true'' (at least simulation truth) models with only a few
parameters.
The main virtue of this simple model is that it allows us to draw some
simple pictures of fitness landscapes.  We will also simulate a more
complicated model in Section~\ref{sec:model-2}.

Now we will simulate new data using the \texttt{raster} function.  To use
that we need to, first, make up a model, including dependence on
\texttt{z1} and \texttt{z2} and, second, convert unconditional canonical
parameters $\varphi$ to conditional canonical parameters $\theta$, because
that is what the \texttt{raster} function wants.  There is no function in
the \texttt{aster} package that does this conversion, perhaps a defect in
the \texttt{aster} user interface.  We can, however, trick the
\texttt{predict.aster} function into doing this conversion.

First we fit the model we want to use to the data we have.  The fitted
parameters will make no sense, because the fitness landscape is flat for
the data we have, but we can use the model structure.
<<>>=
out5 <- aster(resp ~ vtype + uyear + z1 + z2 + I(z1^2) + I(z2^2) +
    I(z1 * z2), pred, fam, varb, id, root, data = redata)
summary(out5)
@

We now want to make up a quadratic function of $z$ that has approximately
mean zero (averaged over the observed $z$ values), so that the average
fitness is more or less the same as in our initial data.  We also want
reasonably sane values, so we keep the quadratic term small.  For
simplicity, we make the fitness landscape symmetric in the two variables.
(It took several tries to get the quadratic term small enough.  These tries
are not shown.  The coefficients of the linear and quadratic terms were
adjusted until the $P$-values in the tests on p.~\pageref{pg:out6-8} showed
that both were highly statistically significant but not humongously so.)
Of course, all of this can be changed if one has the RNW file for this
document.
\label{rnw:3}
<<>>=
z1 <- dat$z1
z2 <- dat$z2
ascal <- 0.013
quad <- ascal * ((z1 + z2) - (z1^2 + z2^2) + z1 * z2)
con <- mean(quad)
mean(quad - con)
@

Now we change the coefficients in \texttt{out5} to be the ones for this
made up model.
<<>>=
beta.new <- out5$coefficients
beta.new[1:4] <- out4$coefficients   # keep the flat-landscape part
beta.new[3] <- beta.new[3] - con     # absorb mean of quad into w coefficient
beta.new[5:6] <- ascal               # linear terms
beta.new[7:8] <- (- ascal)           # pure quadratic terms
beta.new[9] <- ascal                 # cross term
beta.new <- round(beta.new, 3)
beta.new
@

Then we ``predict'' the conditional canonical parameters using
\texttt{beta.new} because those are the parameters that the \texttt{raster}
function wants.
<<>>=
theta <- predict(out5, model.type = "conditional",
    parm.type = "canonical", newcoef = beta.new)
theta <- matrix(theta, nrow = nrow(out5$x), ncol = ncol(out5$x))
# apply(theta, 2, range)
xnew <- raster(theta, pred, fam, root)
@

Now we need to reshape these new data just like we did the old.
<<>>=
dimnames(xnew) <- list(NULL, vars)
dnew1 <- as.data.frame(xnew)
renew <- reshape(dnew1, varying = list(vars), direction = "long",
    timevar = "varb", times = as.factor(vars), v.names = "resp")
redata$resp1 <- renew$resp
@

For future reference we also save the unconditional canonical and mean
value ``simulation truth'' parameter values.
<<>>=
redata$tau1 <- predict(out5, newcoef = beta.new)
redata$phi1 <- predict(out5, parm.type = "canonical", newcoef = beta.new)
beta1true <- beta.new
@

\subsection{Model 2} \label{sec:model-2}

As pointed out in Sections~\ref{sec:overview-three} and~\ref{sec:model-1},
\citet[Section~1.2.5]{b+a} are particularly emphatic about the biological
unrealism of simulation truth models like that simulated in
Section~\ref{sec:model-1}.
In reality, the true stochastic mechanism generating the data is much more
complicated than any model under consideration and even if known would have
far too many parameters to be well estimated by available data.  Not all
parameters are equally important, of course, and some parameters of the
true model, assuming it were known, could be estimated better than others.
Hence in this section we create a model that is much like the model in
Section~\ref{sec:model-1} except the quadratic function depends on all of
the $z_i$, but not equally.

As before, we start by fitting the full model to the data at hand.
<<>>=
out5too <- aster(resp ~ vtype + uyear + poly(z1, z2, z3, z4, z5, z6,
    z7, z8, z9, z10, degree = 2, raw = TRUE), pred, fam, varb, id,
    root, data = redata)
names(out5too$coefficients) <- sub("^poly([^)]*)", "",
    names(out5too$coefficients), extended = FALSE)
summary(out5too)
@

Now we have to figure out which coefficients are which.
<<>>=
## fred[[i]] gives the degrees of z1, ..., z10 in the i-th poly term
fred <- strsplit(names(out5too$coefficients)[-(1:4)], "\\.")
fred <- lapply(fred, as.numeric)
@

Then we construct the quadratic function that is the canonical parameter
for fitness.  This time we don't bother to adjust the intercept.  Of
course, all of this can be changed if one has the RNW file for this
document.
\label{rnw:4}
<<>>=
b <- 0.5
bvec <- c(1, 1, b^(1:8))    # z1 and z2 matter most, then geometric decay
bvec <- bvec / sum(bvec) * 2
round(bvec, 5)
@

Again we make up a parameter vector \texttt{beta.new}, this time using the
weight vector \texttt{bvec} just constructed.
<<>>=
beta.new <- out5too$coefficients
beta.new[1:4] <- out4$coefficients
for (i in seq(along = fred)) {
    freddy <- fred[[i]]
    sally <- freddy > 0
    if (sum(sally) == 1) {
        if (freddy[sally] == 1) {
            ### linear term
            beta.new[i + 4] <- ascal * bvec[sally]
        } else {
            ### quadratic term
            beta.new[i + 4] <- (- ascal * bvec[sally])
        }
    } else {
        ### cross term
        beta.new[i + 4] <- ascal * mean(bvec[sally])
    }
}
beta.new
@

And the rest is just like Section~\ref{sec:model-1}.
<<>>=
theta <- predict(out5too, model.type = "conditional",
    parm.type = "canonical", newcoef = beta.new)
theta <- matrix(theta, nrow = nrow(out5too$x), ncol = ncol(out5too$x))
# apply(theta, 2, range)
xnew <- raster(theta, pred, fam, root)
@

Reshape.
<<>>=
dimnames(xnew) <- list(NULL, vars)
dnew2 <- as.data.frame(xnew)
renew <- reshape(dnew2, varying = list(vars), direction = "long",
    timevar = "varb", times = as.factor(vars), v.names = "resp")
redata$resp2 <- renew$resp
@

For future reference we also save the unconditional canonical and mean
value ``simulation truth'' parameter values.
<<>>=
redata$tau2 <- predict(out5too, newcoef = beta.new)
redata$phi2 <- predict(out5too, parm.type = "canonical",
    newcoef = beta.new)
beta2true <- beta.new
@

\section{Fit New Data}

\subsection{Overview} \label{sec:overview-four}

At this point, we have finished making up data (by simulation from an aster
model).  In real life, none of the work up to here would be necessary.  We
would collect data by measuring fitness components of real living
organisms.  For the purposes of this technical report we pretend that the
simulated data, \verb@redata$resp1@ and \verb@redata$resp2@, are such real
scientific data.  Everything we do from here on is an example of what could
be done with real data.

\subsection{Fit Model 1} \label{sec:fit-new}

So we should be able to fit the model that was used to simulate these data.
We start with the simpler model~1 that has only \texttt{z1} and \texttt{z2}
involved in the simulation truth model.
\label{pg:out6}
<<>>=
out6 <- aster(resp1 ~ vtype + uyear + z1 + z2 + I(z1^2) + I(z2^2) +
    I(z1 * z2), pred, fam, varb, id, root, data = redata)
summary(out6)
@

For further comparison, the likelihood ratio test with this model as the
alternative hypothesis and the simulation truth for the point null
hypothesis can be calculated as follows.
<<>>=
mout6true <- mlogl(beta1true, pred, fam, out6$x, out6$root, out6$modmat,
    type = out6$type, origin = out6$origin)
dev6true <- 2 * mout6true$value - out6$deviance
dev6true
1 - pchisq(dev6true, df = length(beta1true))
@
Great!  Statistics works again.  When we simulate data from a model and
estimate the parameters in that model, the maximum likelihood estimates are
close to the simulation truth.

We also look at how statistically significant our quadratic effect is.
<<>>=
out8 <- aster(resp1 ~ vtype + uyear, pred, fam, varb, id, root,
    data = redata)
out7 <- aster(resp1 ~ vtype + uyear + z1 + z2, pred, fam, varb, id,
    root, data = redata)
anova(out8, out7, out6)
anova(out8, out6)
@
\label{pg:out6-8}

\subsection{Plot Monotone Function of Fitness}

We extract the quadratic part of the fitness function (on the canonical
parameter scale), which is a monotone transformation of actual fitness;
see Sections~3.4 and~3.10 of \citet*{aster2tr} for details.
<<>>=
qcoef <- out6$coefficients
@
<<quad-extract>>=
a1 <- qcoef["z1"]
a2 <- qcoef["z2"]
a <- c(a1, a2)
A11 <- qcoef["I(z1^2)"]
A22 <- qcoef["I(z2^2)"]
A12 <- qcoef["I(z1 * z2)"] / 2
A <- matrix(c(A11, A12, A12, A22), 2, 2)
@

Figure~\ref{fig:surf} (page~\pageref{fig:surf}) shows the scatterplot of
data values for \texttt{z1} and \texttt{z2} and the contours of the
estimated quadratic fitness function.  It is similar to Figure~3 in the
first submission of \citet{aster2}, which is Figure~3.1 in \citet{aster2tr}.
(In the second submission of the paper, this figure was changed to be like
Figure~\ref{fig:nerf} below.)  It is made by the following code.
<<surf-plot,eval=FALSE>>=
plot(z1, z2)
ufoo <- par("usr")
nx <- 101
ny <- 101
z <- matrix(NA, nx, ny)
x <- seq(ufoo[1], ufoo[2], length = nx)
y <- seq(ufoo[3], ufoo[4], length = ny)
for (i in 1:nx) {
    for (j in 1:ny) {
        b <- c(x[i], y[j])
        z[i, j] <- sum(a * b) + as.numeric(t(b) %*% A %*% b)
    }
}
contour(x, y, z, add = TRUE)
@
\begin{figure}
\begin{center}
<<fig=TRUE,echo=FALSE>>=
<<surf-plot>>
@
\end{center}
\caption{Scatterplot of \texttt{z1} versus \texttt{z2} with contours of the
estimated quadratic function of phenotypic covariates in the linear
predictor, which is only a monotone function of fitness, not actual
fitness.  Compare Figure~\ref{fig:nerf}.}
\label{fig:surf}
\end{figure}

\subsection{Plot Actual Fitness} \label{sec:actual}

The plot produced in the preceding section (Figure~\ref{fig:surf}) is good
enough for most purposes, even though the numbers on the contours are not
actual fitness but a monotone transformation of it (the corresponding
canonical parameter).  The contours shown are contours of actual fitness,
but the numbers attached to them do not reflect actual fitness.

We can get a similar plot where the numbers on the contours are actual
fitness.  It just takes a bit more work.  The first plot of this kind was
made in Section~3.11 of \citet{aster2tr}.  Figure~3 of the second
submission of \citet{aster2} was also a plot of this kind.
Figure~\ref{fig:nerf} (page~\pageref{fig:nerf}) shows the scatterplot of
data values for \texttt{z1} and \texttt{z2} and the contours of the
estimated quadratic fitness function.  It is made by the following code.
<<>>=
xx <- outer(x, y^0)
yy <- outer(x^0, y)
xx <- as.vector(xx)
yy <- as.vector(yy)
n <- length(xx)
foo <- rep(1, n)
bar <- list(z1 = xx, z2 = yy, root = foo)
for (lab in levels(renew$varb)) {
    bar[[lab]] <- foo   ##### response doesn't matter for prediction #####
}
bar <- as.data.frame(bar)
rebar <- reshape(bar, varying = list(vars), direction = "long",
    timevar = "varb", times = as.factor(vars), v.names = "resp1")
rebar$vtype <- as.factor(substr(as.character(rebar$varb), 1, 1))
rebar$year <- as.numeric(substring(as.character(rebar$varb), 2))
rebar$uyear <- rebar$year * as.numeric(as.character(rebar$vtype) == "u")
pbar <- predict(out6, newdata = rebar, varvar = varb, idvar = id,
    root = root)
pbar <- matrix(pbar, nrow = nrow(bar))
pbar <- pbar[ , grep("w", vars)]
zz <- apply(pbar, 1, sum)
zz <- matrix(zz, nx, ny)
@
and
<<fit-contour,eval=FALSE>>=
plot(z1, z2)
contour(x, y, zz, add = TRUE)
@
\begin{figure}
\begin{center}
<<fig=TRUE,echo=FALSE>>=
<<fit-contour>>
@
\end{center}
\caption{Scatterplot of \texttt{z1} versus \texttt{z2} with contours of the
estimated fitness function.  Note that the numbers on the contours are
actual fitness.}
\label{fig:nerf}
\end{figure}

Figure~\ref{fig:nerf} looks quite a bit different from
Figure~\ref{fig:surf}.  Nevertheless, every contour of one is also a
contour of the other.  The only differences are the numbers attached to the
contours and which contours the \texttt{contour} function draws by default.

\section{Lande-Arnold Analysis}

\subsection{Original}

\citet{la} proposed a method of analysis of phenotypic natural selection
that is related to the analysis we did leading to Figure~\ref{fig:surf} but
different in the following respects.  First, it uses ordinary least squares
(OLS) rather than aster models.  Second, it does not estimate the fitness
landscape itself.  If $w$ is fitness and
\begin{equation} \label{eq:fit}
g(z) = E(w \mid z)
\end{equation}
is the fitness landscape, define
\begin{equation} \label{eq:q-repeat}
Q(\alpha, \beta, \gamma)
=
E \bigl\{ (w - \alpha - z^T \beta - \tfrac{1}{2} z^T \gamma z)^2 \bigr\}
\end{equation}
and define $\alpha_2$, $\beta_2$, and $\gamma_2$ to be the values of
$\alpha$, $\beta$, and $\gamma$ that minimize $Q(\alpha, \beta, \gamma)$.
Then
\begin{equation} \label{eq:best-repeat}
g_2(z) = \alpha_2 + z^T \beta_2 + \tfrac{1}{2} z^T \gamma_2 z
\end{equation}
is the best quadratic approximation to the fitness landscape.  \citet{la}
show that $\beta_2$ and $\gamma_2$ have several other equivalent
characterizations when $z$ is multivariate normal, but we will have little
to say about them in this technical report.

So let us try a Lande-Arnold analysis.  First we construct the fitness
variable.
<<>>=
widx <- grep("^w[0-9]", colnames(dnew1))
dnew1$fit <- apply(dnew1[ , widx], 1, sum)
@
Then we do the OLS regression that is, in other respects, just like the
aster ``regression'' producing \texttt{out6}.
<<>>=
lout1 <- lm(fit ~ z1 + z2 + I(z1^2) + I(z2^2) + I(z1 * z2), data = dnew1)
summary(lout1)
@

We now plot the results of the OLS analysis.
<<>>=
qcoef <- lout1$coefficients
intercept <- qcoef[1]
<<quad-extract>>
@
Figure~\ref{fig:lurf} (page~\pageref{fig:lurf}) shows the scatterplot of
data values for \texttt{z1} and \texttt{z2} and the contours of the
estimated quadratic function.  For comparison, it also shows the contours
of the fitness landscape taken from Figure~\ref{fig:nerf}.  It is made by
the following code.
<<lurf-plot,eval=FALSE>>=
<<fit-contour>>
zols <- matrix(NA, nx, ny)
for (i in 1:nx) {
    for (j in 1:ny) {
        b <- c(x[i], y[j])
        zols[i, j] <- intercept + sum(a * b) + as.numeric(t(b) %*% A %*% b)
    }
}
contour(x, y, zols, col = "red", add = TRUE)
@
\begin{figure}
\begin{center}
<<fig=TRUE,echo=FALSE>>=
<<lurf-plot>>
@
\end{center}
\caption{Scatterplot of \texttt{z1} versus \texttt{z2} with contours of the
estimated fitness landscape (black) and contours of the Lande-Arnold
function (red), its best quadratic approximation.}
\label{fig:lurf}
\end{figure}

Clearly from Figure~\ref{fig:lurf} the two analyses are qualitatively
similar but differ in detail.  To the extent they disagree, the aster
analysis is correct and the Lande-Arnold analysis incorrect (we know this
because we know the ``true'' model under which the data were simulated).
The red contours in Figure~\ref{fig:lurf} are the best quadratic
approximation to the fitness landscape, but the fitness landscape is not
close to quadratic, so even the best quadratic approximation is not a very
good approximation.  We know fitness is nonnegative, so all of the red
contours with negative values are nonsense.  To counterbalance the negative
nonsense, the best quadratic approximation must underestimate the peak
(because it must have the correct average value, a property of best
quadratic approximation).  The maximum value of the surface with black
contours (on the grid where it was evaluated) is \Sexpr{round(max(zz), 2)};
the maximum value of the surface with red contours (on the same grid) is
\Sexpr{round(max(zols), 2)}.

\subsection{Improved by Combination with Aster Analysis} \label{sec:improve}

To the extent that we still care about best quadratic approximation now
that we know how to estimate the fitness landscape itself, it is
interesting that OLS regression is not the best way to estimate $\beta$ and
$\gamma$ once we have an aster model for the data, which we do
(\texttt{out6}).  If we knew the fitness landscape $g(z)$, then we should
estimate the best quadratic approximation by regressing $g(z)$ on $z$
rather than regressing $w$ on $z$.  This would give us an estimate that had
no statistical error, only error from approximating a non-quadratic
function by a quadratic one.  We do not know the true fitness landscape,
but we do have a good approximation from the aster model, the function
whose contours are shown in Figure~\ref{fig:nerf}.

So we can improve our estimate of the best quadratic approximation to the
fitness landscape as follows.  First we go back to the aster analysis
(\texttt{out6}) and obtain the unconditional mean value parameters, which
in this context are the $E(w \mid z)$.
<<>>=
pout6 <- predict(out6)
pout6 <- matrix(pout6, nrow = nrow(out6$x), ncol = ncol(out6$x))
wcol <- grep("w", colnames(out6$x))
afit <- apply(pout6[ , wcol], 1, sum)
dnew1$afit <- afit
@
Then we do the OLS regression that is, in other respects, just like the
preceding OLS regression (\texttt{lout1}).
<<>>=
lout2 <- lm(afit ~ z1 + z2 + I(z1^2) + I(z2^2) + I(z1 * z2), data = dnew1)
summary(lout2)
@

To the extent that the estimates in the two OLS regressions (\texttt{lout1}
and \texttt{lout2}) differ, the latter is better.  This is clear from the
smallness of the reported standard errors for the latter.  These standard
errors are invalid, because they are based on the assumption that the
response is normal, which it is not (although it is a lot closer to normal
in the latter).  In order to do the Lande-Arnold analysis, we only needed
to assume that $z$ was multivariate normal and needed no assumption about
the distribution of $w$ given $z$.
(Of course, we do know that distribution --- it is the aster model used to
simulate the data --- but that distribution does not satisfy the ``usual''
assumptions for OLS regression.)  But invalid or not, the standard errors
give a rough indication of the variability of the ``response.''  Of course,
correct standard errors could be derived based on the aster model and the
delta method, but that is a lot of work.

Figure~\ref{fig:alurf} (page~\pageref{fig:alurf}) is just like
Figure~\ref{fig:lurf} except that now our improved Lande-Arnold estimates
based on combining aster analysis with OLS regression (\texttt{lout2}) are
used instead of the crude OLS estimates.
\begin{figure}
\begin{center}
<<fig=TRUE,echo=FALSE>>=
qcoef <- lout2$coefficients
intercept <- qcoef[1]
<<quad-extract>>
<<lurf-plot>>
@
\end{center}
\caption{Scatterplot of \texttt{z1} versus \texttt{z2} with contours of the
estimated quadratic fitness function (black) and contours of the
Lande-Arnold function (red) using the improved Lande-Arnold estimators.}
\label{fig:alurf}
\end{figure}

\subsection{Discussion}

It is clear from Figure~\ref{fig:alurf} that fitness landscape analysis
using aster models and Lande-Arnold analysis estimate different things.
The best quadratic approximation to the fitness landscape may not be a very
good approximation.  It cannot be when the fitness landscape is far from
quadratic, which is the usual case.  \citet{la} are, of course, not to be
faulted for failing to use methods that were first proposed more than 20
years after they wrote their paper, but we can do better now.

\section{Model Selection}

\subsection{Principal Components Regression}

\citet[p.~1214]{la} have a section on model selection in which they
recommend principal components regression (PCR).  They are not to be
faulted for this because PCR is a very popular methodology that is the
subject of many books and articles and has been used for more than 100
years.  However, despite its popularity, PCR has no theory that justifies
it or even motivates it.  Principal components as a method of dimension
reduction in multivariate analysis (where there is no response) is
justified by theory, but PCR is not.  This is well known to statisticians.

The problem with PCR is that the principal components (which in this case
would be linear combinations of the covariates $z_1$, $z_2$, $\ldots$,
$z_{10}$) are chosen without any use of the response (in this case $w$ or
the entire response vector containing all the $u_i$, $v_i$, and $w_i$).
Hence there is no reason why the principal components should be related to
the response.  PCR may work, but only by fortunate coincidence.

In our first simulated data set (model~1), PCR would give terrible results.
We know (because we simulated the data) that $z_1$ and $z_2$ are the
variables related to $w$.  If we look at the loadings of these variables on
the principal components
<<>>=
vz <- dat[ , grep("z[0-9]$|z10", names(dat), value = TRUE)]
names(vz)
vz <- var(vz)   # sample variance matrix of the phenotypic covariates
ez <- eigen(vz, symmetric = TRUE)
ez$values
ez$vectors[1:2, ]   # loadings of z1 and z2 on each principal component
@
we see that we need all the principal components to capture the subspace
spanned by $z_1$ and $z_2$.  Thus principal components does exactly the
wrong thing in our example.  It couldn't be more damaging.  Of course, our
example was constructed to be this way.  In any particular application,
principal components may do better (it couldn't do worse), but one can
never know when it will do a good job.

In our second simulated data set (model~2), PCR would give slightly less
terrible results.
The first principal component is now closer to the principal axis of the
quadratic form that is the canonical parameter for fitness, but still not
close, and the rest of the principal components are random directions
having no relationship to fitness.

\citet{cook} gives a thorough discussion of the origins of PCR and what is
wrong with it and proposes modern methods that have some of the flavor of
PCR without the problems.  The earliest criticism of PCR cited by
\citet{cook} is \citet{cox}, so the problems of PCR have been recognized
for at least 40 years.

\subsection{Information Criteria}

Having given up on the idea that principal components will magically select
the correct model, we move to methods that are justified by statistical
theory.  These methods fit all models in some specified family
$\mathcal{M}$ of models, the ``models of interest,'' and evaluate them
using some criterion.  The oldest and best known criteria are the
\emph{Akaike information criterion} \citep[AIC,][]{aic} and the \emph{Bayes
information criterion} \citep[BIC,][]{schwarz}.  Many other information
criteria have been proposed, but we shall add only one more: corrected AIC
(AICc), which is a finite-sample correction to AIC, originally due to
\citet{sugiura}.

If $l(\hat{\theta}_m)$ is the log likelihood maximized over the parameter
space of model $m$ and $p_m$ is the dimension of the parameter space of
model $m$, then
\begin{equation} \label{eq:aic}
\aic(m) = - 2 l(\hat{\theta}_m) + 2 p_m
\end{equation}
is AIC for model $m$,
\begin{equation} \label{eq:aicc}
\aicc(m) = \aic(m) + \frac{2 p_m [p_m + 1]}{n - p_m - 1}
\end{equation}
is AICc for model $m$, and
\begin{equation} \label{eq:bic}
\bic(m) = - 2 l(\hat{\theta}_m) + \log(n) p_m
\end{equation}
is BIC for model $m$, where $n$ is the sample size.  $\aic(m)$ provides a
consistent estimator of the maximum expected log likelihood for model $m$,
and $\bic(m)$ provides an asymptotic approximation of the Bayes factor for
model $m$.  In our example $\log(n)$ is
<<>>=
log(nind)
@
Note that natural logarithms are used (as they are everywhere in
statistics).  According to \citet{b+a}, AICc should always be preferred to
AIC unless the sample size is very large.

The idea is that the model with the smallest information criterion,
whichever of $\aic(m)$, $\aicc(m)$, or $\bic(m)$ we are using, is the best
estimate of the true unknown model that we can get.  As everywhere else in
statistics, the estimate is not the truth and is random.  When the class
$\mathcal{M}$ of models in which we look for the best model is very large,
then the probability of selecting the best model may be very small (more on
this later).  But we shall try these procedures out and see how they do.

\subsection{The Branch and Bound Algorithm}

We need another idea.  Fitting all models in the class $\mathcal{M}$ is
unpalatable if the class is large.  The \emph{branch and bound} algorithm
from computer science was first introduced into statistical model selection
by \citet{leaps}, a paper that is almost unreadable, being bogged down in
the details of Fortran code for least squares.  A good introduction to the
branch and bound algorithm is given by \citet{hand}.

The branch and bound algorithm works on the principle of divide and
conquer.  ``Branch'' refers to dividing up the work to be done.  In this
case we divide up the work by fixing how one phenotypic covariate enters
the model.  The canonical parameter is constant, linear, or quadratic in
that variable.
That divides the whole class of models to be examined into three disjoint
subclasses (constant, linear, and quadratic in one particular variable).
We can further subdivide the work by splitting on another variable, further
subdivide the subdivisions by splitting on yet another, and so forth.

``Bound'' refers to bounding the criterion function (e.~g., AIC) for a
whole subclass of models.  If we keep track of the AIC of the best model
found so far, this provides an upper bound for the AIC of the best model in
the entire class of models to be examined.  From \eqref{eq:aic} we see
that, as a function of model complexity, the first term of \eqref{eq:aic}
decreases (a supermodel has higher likelihood than one of its submodels)
whereas the second term of \eqref{eq:aic} increases (a supermodel has more
parameters than one of its submodels).  Hence a lower bound for AIC of all
the models in a subclass $M \subset \mathcal{M}$ of models is
\begin{equation} \label{eq:lower-bound}
- 2 l_{\text{LCS}(M)}(\hat{\theta}_{\text{LCS}(M)}) + 2 p_{\text{GCS}(M)}
\end{equation}
where $\text{LCS}(M)$ is the least common supermodel (the smallest model
that is a supermodel of every $m$ in $M$) and $\text{GCS}(M)$ is the
greatest common submodel (the largest model that is a submodel of every $m$
in $M$).  The same argument provides a lower bound for BIC or AICc because
their penalty terms are also increasing functions of $p_m$.  Note further
that the lower bound is cheap to evaluate (we only need to fit two models
no matter how many models are in $M$).

If \eqref{eq:lower-bound} is greater than $\aic(m)$ for the best model seen
so far, then we know that $M$ cannot contain the best model and we need not
examine it further.  If the lower bound does not allow us to dispense with
$M$, then we subdivide it, finding bounds for each subdivision.  When the
subdivisions become small enough they will be eliminated (no subdivision
can be empty, and if it contains just one model, we evaluate that model and
are done with that subdivision).

Branch and bound is not magic.  Typically, the number of models to be
evaluated grows exponentially in the dimension of the problem.  Here we
have $3^k$ models if there are $k$ phenotypic covariates.  The branch and
bound algorithm still takes time exponential in $k$, but the constant is
much smaller.  Typically, the branch and bound algorithm only examines a
small fraction of the possible models.  The number of models actually
examined is reported in each case.  So branch and bound does not allow us
to do arbitrarily large problems, but it does allow us to do problems
somewhat larger than we could do without it.

This algorithm is not implemented in the \texttt{aster} package, so we
provide an implementation for the particular problem we are interested in
here.
<<>>=
source("http://www.stat.umn.edu/geyer/aster/leap-funs.R")
# source("leap-funs.R")
args(aster.leaps)
@
\label{pg:source-leaps}

The function \verb@aster.leaps@ fits all models in a certain class of
models, which we now describe.  First, each model formula starts off
``\verb@resp1 ~ vtype + uyear@,'' just like \texttt{out6}, if given the
argument \verb@response = "resp1"@.  The rest of the terms are linear or
quadratic in some or all of the variables \texttt{z1}, \texttt{z2},
\ldots, the number of such variables being the argument \texttt{nsplit} to
\texttt{aster.leaps}.  The linear predictor may be constant, linear, or
quadratic in such a variable.  If ``constant'' the variable does not appear
in the formula.
If ``linear'' the variable appears linearly; for example, if \texttt{z2}
and \texttt{z3} are linear, then the formula continues
``\verb@+ z2 + z3@.''  If ``quadratic'' the variable appears quadratically;
for example, if \texttt{z1}, \texttt{z4}, and \texttt{z5} are quadratic,
then the formula continues
``\verb@+ poly(z1, z4, z5, degree = 2, raw = TRUE)@''.  Note that this
means mixed terms of degree two, such as \verb@z1 * z4@, are, in effect,
included in the formula.

One argument to \texttt{aster.leaps}, namely \texttt{nsplit}, has already
been described.  The other non-optional arguments \texttt{pred},
\texttt{fam}, and \texttt{data} are passed to the \texttt{aster} function
to fit aster models; in this case, we give them the same values they had in
producing \texttt{out6}, that is, \texttt{pred}, \texttt{fam}, and
\texttt{redata}.

\subsection{Five Predictors} \label{sec:five}

\subsubsection{Model 1}

Let's try it.  We apply the branch and bound algorithm to the data for
Model 1 simulated in Section~\ref{sec:model-1}, in which fitness is a
quadratic function of \texttt{z1} and \texttt{z2} only.  Because this
function may take a lot of time to run, we store the results in the current
working directory, and simply load them if they exist.
<<>>=
options(show.error.messages = FALSE, warn = -1)
try(load("b1out5a.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b1out5a")) {
    b1out5a <- aster.leaps(pred, fam, data = redata, nsplit = 5,
        response = "resp1")
    save(b1out5a, file = "b1out5a.rda")
}
secs <- b1out5a$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
names(b1out5a)
b1out5a$time
b1out5a$fits
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs}
seconds.  The models fit by the branch and bound algorithm are shown.  The
deviance, degrees of freedom, AIC, BIC, and AICc (labeled \texttt{cic}) are
shown for each model.  Of the $3^5 = \Sexpr{3^5}$ models under
consideration, the branch and bound algorithm fit only
\verb@nrow(b1out5a$fits)@ = \Sexpr{nrow(b1out5a$fits)} in determining the
model with the lowest AIC.

The labels for the models indicate the degree of each predictor variable.
For example, \texttt{q125l3} means \texttt{z1}, \texttt{z2}, and
\texttt{z5} are quadratic, \texttt{z3} is linear, and \texttt{z4} is
constant (not in the model).  Note that the simulation truth model ranks
number \Sexpr{match("q12l", rownames(b1out5a$fits))} by AIC among the
models that have been fit.

We shall say no more right now about how well the branch and bound
algorithm with AIC worked.  Nor will we say anything about the performance
of similar uses of the branch and bound algorithm with other numbers of
predictor variables and other information criteria ($\aicc$ and $\bic$) in
the rest of this section.  Some aspects of performance will be examined in
Section~\ref{sec:fma}.

Now we repeat what we just did using $\aicc$ instead of $\aic$.
<<>>=
options(show.error.messages = FALSE, warn = -1)
try(load("b1out5c.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b1out5c")) {
    b1out5c <- aster.leaps(pred, fam, data = redata, nsplit = 5,
        response = "resp1", type = "AICc", envir = b1out5a$envir)
    save(b1out5c, file = "b1out5c.rda")
}
secs <- b1out5c$time[1]
b1out5c$nfit
identical(rownames(b1out5a$fits), rownames(b1out5c$fits))
@
This takes essentially no time (\Sexpr{round(secs, 3)} seconds), since no
new fits need to be done.  However, the orderings of the models according
to AIC and according to AICc differ.
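The reordering is entirely due to the correction term in \eqref{eq:aicc}.
All three criteria are easy to reproduce by hand from a deviance and a
parameter count; here is a sketch with made-up numbers (assuming, as in
\eqref{eq:aic}, that the reported deviance is $- 2 l(\hat{\theta}_m)$ up to
a constant that is the same for all models)
\begin{verbatim}
dev <- 10000    # made-up deviance, -2 times maximized log likelihood
p <- 12         # number of parameters in the model
n <- 350        # sample size (nind)
aic <- dev + 2 * p
aicc <- aic + 2 * p * (p + 1) / (n - p - 1)
bic <- dev + log(n) * p
c(aic = aic, aicc = aicc, bic = bic)
\end{verbatim}
For models with few parameters relative to $n$ the correction is under one
unit here, but it grows roughly quadratically in $p_m$, which is what
reshuffles the order of the more complex models.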
Rather than show all the fits, we now show only those having AICc within 10
of the lowest.
<<>>=
ilow <- b1out5c$fits[ , "cic"] < b1out5c$fits[1, "cic"] + 10
b1out5c$fits[ilow, , drop = FALSE]
@
The number 10 is arbitrary, but \citet[Section~2.6]{b+a} make it the
dividing line between ``considerably less than substantial'' support and
``essentially none.''  Note that the simulation truth model ranks number
\Sexpr{match("q12l", rownames(b1out5c$fits))} by AICc among the models that
have been fit.

Now we repeat what we just did using $\bic$ instead of $\aicc$.
<<>>=
options(show.error.messages = FALSE, warn = -1)
try(load("b1out5b.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b1out5b")) {
    b1out5b <- aster.leaps(pred, fam, data = redata, nsplit = 5,
        response = "resp1", type = "BIC", envir = b1out5c$envir)
    save(b1out5b, file = "b1out5b.rda")
}
secs <- b1out5b$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b1out5b$fits[ , "bic"] < b1out5b$fits[1, "bic"] + 10
b1out5b$fits[ilow, , drop = FALSE]
@
In all three calls to \texttt{aster.leaps} we have fit
\Sexpr{nrow(b1out5b$fits)} models.  In this call we fit
\Sexpr{b1out5b$nfit}.  The time taken for this call was \Sexpr{mins}
minutes and \Sexpr{secs} seconds.  Note that the simulation truth model
ranks number \Sexpr{match("q12l", rownames(b1out5b$fits))} by BIC among the
models that have been fit.

\subsubsection{Model 2}

Now we do the same thing with model~2, simulated in
Section~\ref{sec:model-2}, in which fitness is a quadratic function of
\texttt{z1}, $\ldots$, \texttt{z10}.
<<>>=
options(show.error.messages = FALSE, warn = -1)
try(load("b2out5a.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b2out5a")) {
    b2out5a <- aster.leaps(pred, fam, data = redata, nsplit = 5,
        response = "resp2")
    save(b2out5a, file = "b2out5a.rda")
}
secs <- b2out5a$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b2out5a$fits[ , "aic"] < b2out5a$fits[1, "aic"] + 10
b2out5a$fits[ilow, , drop = FALSE]
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs}
seconds.  Note that the simulation truth model is not among the models
under consideration (it depends on all 10 $z_i$).
<<>>=
options(show.error.messages = FALSE, warn = -1)
try(load("b2out5c.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b2out5c")) {
    b2out5c <- aster.leaps(pred, fam, data = redata, nsplit = 5,
        response = "resp2", type = "AICc", envir = b2out5a$envir)
    save(b2out5c, file = "b2out5c.rda")
}
secs <- b2out5c$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b2out5c$fits[ , "cic"] < b2out5c$fits[1, "cic"] + 10
b2out5c$fits[ilow, , drop = FALSE]
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs}
seconds.  Note that the simulation truth model is not among the models
under consideration (it depends on all 10 $z_i$).
<<>>=
options(show.error.messages = FALSE, warn = -1)
try(load("b2out5b.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b2out5b")) {
    b2out5b <- aster.leaps(pred, fam, data = redata, nsplit = 5,
        response = "resp2", type = "BIC", envir = b2out5c$envir)
    save(b2out5b, file = "b2out5b.rda")
}
secs <- b2out5b$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b2out5b$fits[ , "bic"] < b2out5b$fits[1, "bic"] + 10
b2out5b$fits[ilow, , drop = FALSE]
@
The time taken for this call was \Sexpr{mins} minutes and \Sexpr{secs}
seconds.  Note that the simulation truth model is not among the models
under consideration (it depends on all 10 $z_i$).
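Before scaling up to six and seven predictors, it may help to spell out the
model formulas behind the labels used in this section.  The following
sketch assumes the labeling scheme described above (these formulas are
illustrative reconstructions, not output from \texttt{aster.leaps}, and the
order of terms may differ)
\begin{verbatim}
## label q125l3 (five predictors): z1, z2, z5 quadratic,
## z3 linear, z4 constant (absent)
resp1 ~ vtype + uyear + z3 + poly(z1, z2, z5, degree = 2, raw = TRUE)
## label q12l, the model 1 simulation truth:
## z1 and z2 quadratic, nothing else
resp1 ~ vtype + uyear + poly(z1, z2, degree = 2, raw = TRUE)
\end{verbatim}
Note that the \texttt{poly} term with \verb@raw = TRUE@ contributes the
linear, pure quadratic, and cross terms in the listed variables, just as in
the fit producing \texttt{out5too}.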
Note that the simulation truth model is not among the models under consideration (it depends on all 10 $z_i$).

\subsection{Six Predictors} \label{sec:six}

This section repeats Section~\ref{sec:five}, changing ``five'' to ``six.''

\subsubsection{Model 1}

<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b1out6a.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b1out6a")) {
    b1out6a <- aster.leaps(pred, fam, data = redata, nsplit = 6,
        response = "resp1", type = "AIC", envir = b1out5b$envir)
    save(b1out6a, file = "b1out6a.rda")
}
secs <- b1out6a$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b1out6a$fits[ , "aic"] < b1out6a$fits[1, "aic"] + 10
b1out6a$fits[ilow, , drop = FALSE]
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model ranks number \Sexpr{match("q12l", rownames(b1out6a$fits))} according to AIC among the models that have been fit.
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b1out6c.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b1out6c")) {
    b1out6c <- aster.leaps(pred, fam, data = redata, nsplit = 6,
        response = "resp1", type = "AICc", envir = b1out6a$envir)
    save(b1out6c, file = "b1out6c.rda")
}
secs <- b1out6c$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b1out6c$fits[ , "cic"] < b1out6c$fits[1, "cic"] + 10
b1out6c$fits[ilow, , drop = FALSE]
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model ranks number \Sexpr{match("q12l", rownames(b1out6c$fits))} according to AICc among the models that have been fit.
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b1out6b.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b1out6b")) {
    b1out6b <- aster.leaps(pred, fam, data = redata, nsplit = 6,
        response = "resp1", type = "BIC", envir = b1out6c$envir)
    save(b1out6b, file = "b1out6b.rda")
}
secs <- b1out6b$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b1out6b$fits[ , "bic"] < b1out6b$fits[1, "bic"] + 10
b1out6b$fits[ilow, , drop = FALSE]
@
The time taken for this call was \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model ranks number \Sexpr{match("q12l", rownames(b1out6b$fits))} according to BIC among the models that have been fit.

\subsubsection{Model 2}

<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b2out6a.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b2out6a")) {
    b2out6a <- aster.leaps(pred, fam, data = redata, nsplit = 6,
        response = "resp2", envir = b2out5b$envir)
    save(b2out6a, file = "b2out6a.rda")
}
secs <- b2out6a$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b2out6a$fits[ , "aic"] < b2out6a$fits[1, "aic"] + 10
b2out6a$fits[ilow, , drop = FALSE]
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model is not among the models under consideration (it depends on all 10 $z_i$).
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b2out6c.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b2out6c")) {
    b2out6c <- aster.leaps(pred, fam, data = redata, nsplit = 6,
        response = "resp2", type = "AICc", envir = b2out6a$envir)
    save(b2out6c, file = "b2out6c.rda")
}
secs <- b2out6c$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b2out6c$fits[ , "cic"] < b2out6c$fits[1, "cic"] + 10
b2out6c$fits[ilow, , drop = FALSE]
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model is not among the models under consideration (it depends on all 10 $z_i$).
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b2out6b.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b2out6b")) {
    b2out6b <- aster.leaps(pred, fam, data = redata, nsplit = 6,
        response = "resp2", type = "BIC", envir = b2out6c$envir)
    save(b2out6b, file = "b2out6b.rda")
}
secs <- b2out6b$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b2out6b$fits[ , "bic"] < b2out6b$fits[1, "bic"] + 10
b2out6b$fits[ilow, , drop = FALSE]
@
The time taken for this call was \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model is not among the models under consideration (it depends on all 10 $z_i$).

\subsection{Seven Predictors} \label{sec:seven}

This section repeats Section~\ref{sec:six}, changing ``six'' to ``seven.''

\subsubsection{Model 1}

<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b1out7a.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b1out7a")) {
    b1out7a <- aster.leaps(pred, fam, data = redata, nsplit = 7,
        response = "resp1", type = "AIC", envir = b1out6b$envir)
    save(b1out7a, file = "b1out7a.rda")
}
secs <- b1out7a$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b1out7a$fits[ , "aic"] < b1out7a$fits[1, "aic"] + 10
b1out7a$fits[ilow, , drop = FALSE]
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model ranks number \Sexpr{match("q12l", rownames(b1out7a$fits))} according to AIC among the models that have been fit.
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b1out7c.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b1out7c")) {
    b1out7c <- aster.leaps(pred, fam, data = redata, nsplit = 7,
        response = "resp1", type = "AICc", envir = b1out7a$envir)
    save(b1out7c, file = "b1out7c.rda")
}
secs <- b1out7c$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b1out7c$fits[ , "cic"] < b1out7c$fits[1, "cic"] + 10
b1out7c$fits[ilow, , drop = FALSE]
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model ranks number \Sexpr{match("q12l", rownames(b1out7c$fits))} according to AICc among the models that have been fit.
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b1out7b.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b1out7b")) {
    b1out7b <- aster.leaps(pred, fam, data = redata, nsplit = 7,
        response = "resp1", type = "BIC", envir = b1out7c$envir)
    save(b1out7b, file = "b1out7b.rda")
}
secs <- b1out7b$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b1out7b$fits[ , "bic"] < b1out7b$fits[1, "bic"] + 10
b1out7b$fits[ilow, , drop = FALSE]
@
The time taken for this call was \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model ranks number \Sexpr{match("q12l", rownames(b1out7b$fits))} according to BIC among the models that have been fit.
\subsubsection{Model 2}

<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b2out7a.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b2out7a")) {
    b2out7a <- aster.leaps(pred, fam, data = redata, nsplit = 7,
        response = "resp2", envir = b2out6b$envir)
    save(b2out7a, file = "b2out7a.rda")
}
secs <- b2out7a$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b2out7a$fits[ , "aic"] < b2out7a$fits[1, "aic"] + 10
b2out7a$fits[ilow, , drop = FALSE]
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model is not among the models under consideration (it depends on all 10 $z_i$).
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b2out7c.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b2out7c")) {
    b2out7c <- aster.leaps(pred, fam, data = redata, nsplit = 7,
        response = "resp2", type = "AICc", envir = b2out7a$envir)
    save(b2out7c, file = "b2out7c.rda")
}
secs <- b2out7c$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b2out7c$fits[ , "cic"] < b2out7c$fits[1, "cic"] + 10
b2out7c$fits[ilow, , drop = FALSE]
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model is not among the models under consideration (it depends on all 10 $z_i$).
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("b2out7b.rda"))
options(show.error.messages = TRUE, warn = 0)
if (! exists("b2out7b")) {
    b2out7b <- aster.leaps(pred, fam, data = redata, nsplit = 7,
        response = "resp2", type = "BIC", envir = b2out7c$envir)
    save(b2out7b, file = "b2out7b.rda")
}
secs <- b2out7b$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
ilow <- b2out7b$fits[ , "bic"] < b2out7b$fits[1, "bic"] + 10
b2out7b$fits[ilow, , drop = FALSE]
@
The time taken for this call was \Sexpr{mins} minutes and \Sexpr{secs} seconds. Note that the simulation truth model is not among the models under consideration (it depends on all 10 $z_i$).

\subsection{Total Time}

The total time taken for all calls to \texttt{aster.leaps} is
<>=
leapout <- ls(pattern = "^b[12]out")
secs <- 0
for (i in seq(along = leapout)) {
    bxout <- get(leapout[i])
    secs <- secs + bxout$time[1]
}
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
hrs <- floor(mins / 60)
mins <- floor(mins - hrs * 60)
@
\Sexpr{hrs} hours, \Sexpr{mins} minutes, and \Sexpr{secs} seconds.

\section{Frequentist Model Averaging} \label{sec:fma}

We do not know if any model under consideration (in the class $\mathcal{M}$) is correct; in our second example none are. Even if some model under consideration is correct, we do not know which one it is; in our first example none of $\aic$, $\aicc$, and $\bic$ ever selected the correct model. Thus it makes no sense to proceed as if the model selected by $\aic$, $\aicc$, or $\bic$ is correct. So what do we do?

A recent proposal is frequentist model averaging (FMA), which copies Bayesian model averaging (BMA) in operation though not in philosophy. See \citet{b+a} and \citet{hjort} for FMA and \citet{bma} for BMA.

BMA does the Right Thing from the Bayesian point of view. A Bayesian considers everything unknown a parameter and formulates uncertainty about it as a prior distribution. Here both models and what the frequentist calls parameters within models are unknown. Given a prior on both models and parameters within models, the Bayesian computes the posterior distribution (on both models and parameters within models).
Then, given something to predict (e.~g., the fitness landscape), which is a function of model and parameter within model, the Bayesian predicts it by averaging over the posterior distribution (over both models and parameters within models). In this context full BMA would be very computationally intensive, requiring Markov chain Monte Carlo. FMA does roughly the same thing, is much easier to do, and perhaps works as well, although we do not investigate this.

In this section we will mostly follow \citet[Chapter~4]{b+a}. The idea is to average inferences from various models that are reasonably well supported by the data. The virtue of averaging here is the same as everywhere else in statistics: it cancels errors more often than it amplifies them.

In BMA, $\exp\bigl(- \tfrac{1}{2} \bic(m)\bigr)$ is asymptotically proportional to the posterior probability of model $m$. Hence if $g(\theta)$ is a function of the parameters we are interested in estimating, the weighted average
\begin{equation} \label{eq:bma}
\frac{\sum_{m \in \mathcal{M}} g(\hat{\theta}_m) \exp\bigl(- \tfrac{1}{2} \bic(m)\bigr)}{ \sum_{m \in \mathcal{M}} \exp\bigl(- \tfrac{1}{2} \bic(m)\bigr)}
\end{equation}
is an asymptotic approximation of the posterior expectation of $g(\theta)$. True BMA would use the true posterior probabilities and also integrate over $\theta$ within models \citep{bma}, but this is a reasonable asymptotic approximation. Proceeding by analogy, frequentists would use
\begin{equation} \label{eq:fma}
\frac{\sum_{m \in \mathcal{M}} g(\hat{\theta}_m) \exp\bigl(- \tfrac{1}{2} \aic(m)\bigr)}{ \sum_{m \in \mathcal{M}} \exp\bigl(- \tfrac{1}{2} \aic(m)\bigr)}
\end{equation}
or the same with $\aic$ replaced by $\aicc$ \citep[Chapter~4]{b+a}. There does not seem to be a strong theoretical justification for this particular form of weighted average \citep{hjort}, but any averaging is better than no averaging.

Because we have not fit all models in $\mathcal{M}$ and do not want to (it would take a huge amount of time), we propose to use a shortcut in the spirit of what \citet{occam} called ``Occam's window.'' Instead of averaging over all the models under consideration, we average only over a random subset $\mathcal{A}$, which we define by
\begin{equation} \label{eq:prune}
\mathcal{A} = \left\{\, m \in \mathcal{M} : \aic(m) < 2 \log(c) + \min_{m' \in \mathcal{M}} \aic(m') \,\right\}
\end{equation}
where $c$ is a constant chosen by the investigators. Then we replace $\mathcal{M}$ by $\mathcal{A}$ in \eqref{eq:bma} and \eqref{eq:fma} giving
\begin{equation} \label{eq:fmac}
\frac{\sum_{m \in \mathcal{A}} g(\hat{\theta}_m) \exp\bigl(- \tfrac{1}{2} \aic(m)\bigr)}{ \sum_{m \in \mathcal{A}} \exp\bigl(- \tfrac{1}{2} \aic(m)\bigr)}
\end{equation}
or the same with $\aic$ replaced by $\aicc$ or $\bic$. In their examples \citet{occam} used $c = 20$ ``by analogy with the popular .05 cutoff for $P$ values,'' but that ``popular'' choice is itself arbitrary and without any theoretical justification, hence so is using $c = 20$ in BMA or FMA. Nevertheless, we will use the same choice.

The full ``Occam's window'' proposal of \citet{occam} involved another kind of pruning using a smaller subset than the $\mathcal{A}$ defined here, but Raftery and coauthors seem to have dropped this second kind of pruning; \citet{bma} call it ``optional'' and \citet{volinsky} do not even mention it.
The discussants of \citet{bma} were dubious about Occam's window, and in reply \citet{bma} admitted ``we know of no formal theoretical support for it.'' It is merely a computational convenience. In FMA the change from \eqref{eq:fma} to \eqref{eq:fmac} is, perhaps, a bit less dubious, not because there is stronger justification for the pruning \eqref{eq:prune} but because there is less theoretical justification for \eqref{eq:fma}, there being no philosophically ideal form of frequentist model averaging.

For our examples of FMA we will stop using $\aic$ and use only $\aicc$ and $\bic$.

\subsection{Five Predictors} \label{sec:five-fma}

So far we have been passing around two environments, both now quite full.
<>=
b1out5a$envir
b1out7b$envir
b2out5a$envir
b2out7b$envir
@
These are different names for the same two environments. Hence we reload the saved five-predictor results, restoring the environments to the state we want.
<>=
load("b1out5b.rda")
load("b2out5b.rda")
@

\subsubsection{Model 1}

First we use the data simulated from Model~1. Since we now want to be sure we have all models in the set \eqref{eq:prune}, we need to rerun \verb@aster.leaps@ using its \verb@cutoff@ argument.
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("f1out5c.rda"))
options(show.error.messages = TRUE, warn = 0)
cutoff <- 2 * log(20)
if (! exists("f1out5c")) {
    f1out5c <- aster.leaps(pred, fam, data = redata, nsplit = 5,
        response = "resp1", type = "AICc", envir = b1out5b$envir,
        cutoff = cutoff)
    save(f1out5c, file = "f1out5c.rda")
}
secs <- f1out5c$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. We had to fit \Sexpr{f1out5c$nfit} new models that we had not needed to fit in finding the minimum $\aic$, $\aicc$, and $\bic$.
<>=
ilow <- f1out5c$fits[ , "cic"] < f1out5c$fits[1, "cic"] + cutoff
mods <- f1out5c$fits[ilow, , drop = FALSE]
print(mods)
@
There are \Sexpr{sum(ilow)} models to average over (shown above). For the ``parameter'' to average, we take the whole fitness surface. Since this is a function, we cannot evaluate it at all points. We will evaluate it at the \Sexpr{nind} data points. First we define a function that maps the mean value parameter, $\tau$ in the notation of \citet{gws}, to fitness; it uses the global variable \texttt{vars} defined on p.~\pageref{pg:vars}.
<>=
tau2fit <- function(tau) {
    tau <- matrix(tau, ncol = length(vars))
    widx <- grep("^w[0-9]", vars)
    apply(tau[ , widx], 1, sum)
}
@
Thus, for example, the ``simulation truth'' fitness for Model~1 is given by
<>=
true <- tau2fit(redata$tau1)
@
Since we did not save the whole fit, just the $\aic$, $\aicc$, and $\bic$ values for the models we fit during execution of the branch and bound algorithm, we need to refit the models in the set $\mathcal{A}$. Since we only know these models by their ``names,'' which are character strings
<>=
modnames <- rownames(mods)
modnames
@
we need a function that turns such strings into model formulas and fits the models. Such a function, called \texttt{redomod}, is in the file sourced on p.~\pageref{pg:source-leaps} that also contains the \texttt{aster.leaps} function. Here is how it is called.
<>=
args(redomod)
@
The \texttt{string} argument is the model specification, e.~g., \texttt{"\Sexpr{modnames[1]}"}, the \texttt{data} argument is the data frame containing the variables, for us always \texttt{redata}, and the \texttt{response} argument is the name of the response, for us either \texttt{"resp1"} or \texttt{"resp2"}.
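The string-to-formula step can be illustrated with a hypothetical fragment that recovers which predictors enter the model linearly and which quadratically (the function \texttt{parse.modname} and its details are our illustration only, not the actual \texttt{redomod} code; also, in the ten-predictor model names, \texttt{A} stands for \texttt{z10}):
\begin{verbatim}
## hypothetical sketch: split a model name like "q125l3" into
## its quadratic part ("125") and its linear part ("3")
parse.modname <- function(string) {
    quad <- sub("l.*", "", sub("^q", "", string))
    lin <- ""
    if (length(grep("l", string)) > 0) lin <- sub(".*l", "", string)
    list(quadratic = paste("z", strsplit(quad, "")[[1]], sep = ""),
        linear = paste("z", strsplit(lin, "")[[1]], sep = ""))
}
parse.modname("q125l3")   # quadratic: z1, z2, z5; linear: z3
\end{verbatim}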
Thus, for example, the fitness predicted by the model ``selected'' by $\aicc$ (the model with smallest $\aicc$) is given by
<>=
out <- redomod(modnames[1], data = redata, response = "resp1")
wslct <- tau2fit(predict(out))
@
and the fitness predicted by the correct model, which we know because the data are simulated but which we would not know in real life, is given by
<>=
out <- redomod("q12l", data = redata, response = "resp1")
wbest <- tau2fit(predict(out))
@
Now we want to calculate the fitness predicted using FMA. First we calculate the weights used in the weighted average, then we refit the models and average their predictions using these weights.
<>=
wgt <- mods[ , "cic"]
wgt <- wgt - wgt[1]
wgt <- exp(- wgt / 2)
wgt <- wgt / sum(wgt)
wgt
wpred <- 0 * true
modnames <- rownames(mods)
for (i in seq(along = modnames)) {
    out <- redomod(modnames[i], data = redata, response = "resp1")
    wpred <- wpred + wgt[i] * tau2fit(predict(out))
}
@
Now we compare these ``predictions'' (actually, \emph{estimates} is the better term in this situation). There are several criteria we could use for comparison. The most natural to statisticians is root-mean-square (RMS) error.
<>=
### rms error
sqrt(mean((wslct - true)^2))
sqrt(mean((wpred - true)^2))
sqrt(mean((wbest - true)^2))
@
We see that FMA does slightly better than merely using the model ``selected'' by $\aicc$, but neither does as well as using the correct model, which in real life we would not know. Or we could use mean absolute error.
<>=
### mean abs error
mean(abs(wslct - true))
mean(abs(wpred - true))
mean(abs(wbest - true))
@
Same story. Or we could use maximum error.
<>=
### max error
max(abs(wslct - true))
max(abs(wpred - true))
max(abs(wbest - true))
@
Same story. Since all three criteria tell the same story, from now on we will use only RMS error. We save these results for future reference.
<>=
m1save <- data.frame(npred = c(5, 5, 2),
    type = c("select-AICc", "FMA-AICc", "correct"),
    rms = c(sqrt(mean((wslct - true)^2)),
        sqrt(mean((wpred - true)^2)),
        sqrt(mean((wbest - true)^2))))
print(m1save)
@
Now we redo everything using $\bic$ instead of $\aicc$.
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("f1out5b.rda"))
options(show.error.messages = TRUE, warn = 0)
cutoff <- 2 * log(20)
if (! exists("f1out5b")) {
    f1out5b <- aster.leaps(pred, fam, data = redata, nsplit = 5,
        response = "resp1", type = "BIC", envir = f1out5c$envir,
        cutoff = cutoff)
    save(f1out5b, file = "f1out5b.rda")
}
secs <- f1out5b$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. We had to fit \Sexpr{f1out5b$nfit} new models that we had not yet needed to fit.
<>=
ilow <- f1out5b$fits[ , "bic"] < f1out5b$fits[1, "bic"] + cutoff
mods <- f1out5b$fits[ilow, , drop = FALSE]
print(mods)
@
There are \Sexpr{sum(ilow)} models to average over (shown above), which have ``names''
<>=
modnames <- rownames(mods)
modnames
@
The fitness predicted by the model ``selected'' by $\bic$ is given by
<>=
out <- redomod(modnames[1], data = redata, response = "resp1")
wslct <- tau2fit(predict(out))
@
We do not have to redo \texttt{wbest}. It is the same as before (it did not depend on whether we were using $\aicc$ or $\bic$). Now we want to calculate the fitness predicted using FMA. First we calculate the weights used in the weighted average, then we refit the models and average their predictions using these weights.
<>= wgt <- mods[ , "bic"] wgt <- wgt - wgt[1] wgt <- exp(- wgt / 2) wgt <- wgt / sum(wgt) wgt wpred <- 0 * true modnames <- rownames(mods) for (i in seq(along = modnames)) { out <- redomod(modnames[i], data = redata, response = "resp1") wpred <- wpred + wgt[i] * tau2fit(predict(out)) } @ Now we compare these estimates using RMS error and also compare with our other estimates using $\aicc$. <>= tmp <- data.frame(npred = rep(5, 2), type = c("select-BIC", "FMA-BIC"), rms = c(sqrt(mean((wslct - true)^2)), sqrt(mean((wpred - true)^2)))) m1save <- rbind(m1save, tmp) m1save <- m1save[rev(order(m1save[ , "rms"])), ] print(m1save) @ Despite the fact that $\bic$ is supposed to do better than $\aicc$ in this situation (where one of the models under consideration is true and a rather small one of them), it actually does worse in this particular example, the problem being that the model it ``selects'' \texttt{"\Sexpr{modnames[1]}"} is too small. \subsubsection{Model 2} Now we use the data simulated from Model~2. <>= options(show.error.messages = FALSE, warn = -1) try(load("f2out5c.rda")) options(show.error.messages = TRUE, warn = 0) cutoff <- 2 * log(20) if (! exists("f2out5c")) { f2out5c <- aster.leaps(pred, fam, data = redata, nsplit = 5, response = "resp2", type = "AICc", envir = b2out5b$envir, cutoff = cutoff) save(f2out5c, file = "f2out5c.rda") } secs <- f2out5c$time[1] mins <- floor(secs / 60) secs <- floor(secs - mins * 60) @ The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. We had to fit \Sexpr{f2out5c$nfit} new models that we had not needed to fit in finding the minimum $\aic$, $\aicc$ and $\bic$. <>= ilow <- f2out5c$fits[ , "cic"] < f2out5c$fits[1, "cic"] + cutoff mods <- f2out5c$fits[ilow, , drop = FALSE] print(mods) @ There are \Sexpr{sum(ilow)} models to average over (shown above), which have ``names'' <>= modnames <- rownames(mods) modnames @ The ``simulation truth'' fitness for Model~2 is given by <>= wtrue <- tau2fit(redata$tau2) @ The fitness predicted by the model ``selected'' by $\aicc$ is given by <>= out <- redomod(modnames[1], data = redata, response = "resp2") wslct <- tau2fit(predict(out)) @ and the fitness predicted by the correct model, which we know because the data are simulated but which we would not know in real life, is given by <>= out <- redomod("q123456789Al", data = redata, response = "resp2") wbest <- tau2fit(predict(out)) @ Now we want to calculate the fitness predicted using FMA. First we calculate the weights used in the weighted average, then we refit the models and average the predictions of the models using these weights. <>= wgt <- mods[ , "cic"] wgt <- wgt - wgt[1] wgt <- exp(- wgt / 2) wgt <- wgt / sum(wgt) wgt wpred <- 0 * wtrue modnames <- rownames(mods) for (i in seq(along = modnames)) { out <- redomod(modnames[i], data = redata, response = "resp2") wpred <- wpred + wgt[i] * tau2fit(predict(out)) } @ Now we compare these estimates using root-mean-square error. <>= m2save <- data.frame(npred = c(5, 5, 10), type = c("select-AICc", "FMA-AICc", "correct"), rms = c(sqrt(mean((wslct - true)^2)), sqrt(mean((wpred - true)^2)), sqrt(mean((wbest - true)^2)))) m2save <- m2save[rev(order(m2save[ , "rms"])), ] print(m2save) @ We see that FMA does slightly worse than merely using the model ``selected'' by $\aicc$, and both do better than using the correct model, because the correct model just has too many parameters to estimate well with this amount of data. Now we redo everything using $\bic$ instead of $\aicc$. 
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("f2out5b.rda"))
options(show.error.messages = TRUE, warn = 0)
cutoff <- 2 * log(20)
if (! exists("f2out5b")) {
    f2out5b <- aster.leaps(pred, fam, data = redata, nsplit = 5,
        response = "resp2", type = "BIC", envir = f2out5c$envir,
        cutoff = cutoff)
    save(f2out5b, file = "f2out5b.rda")
}
secs <- f2out5b$time[1]
mins <- floor(secs / 60)
secs <- floor(secs - mins * 60)
@
The call to this function took \Sexpr{mins} minutes and \Sexpr{secs} seconds. We had to fit \Sexpr{f2out5b$nfit} new models that we had not yet needed to fit.
<>=
ilow <- f2out5b$fits[ , "bic"] < f2out5b$fits[1, "bic"] + cutoff
mods <- f2out5b$fits[ilow, , drop = FALSE]
print(mods)
@
There are \Sexpr{sum(ilow)} models to average over (shown above), which have ``names''
<>=
modnames <- rownames(mods)
modnames
@
The fitness predicted by the model ``selected'' by $\bic$ is given by
<>=
out <- redomod(modnames[1], data = redata, response = "resp2")
wslct <- tau2fit(predict(out))
@
We do not have to redo \texttt{wbest}. It is the same as before (it did not depend on whether we were using $\aicc$ or $\bic$). Now we want to calculate the fitness predicted using FMA. First we calculate the weights used in the weighted average, then we refit the models and average their predictions using these weights.
<>=
wgt <- mods[ , "bic"]
wgt <- wgt - wgt[1]
wgt <- exp(- wgt / 2)
wgt <- wgt / sum(wgt)
wgt
wpred <- 0 * wtrue
modnames <- rownames(mods)
for (i in seq(along = modnames)) {
    out <- redomod(modnames[i], data = redata, response = "resp2")
    wpred <- wpred + wgt[i] * tau2fit(predict(out))
}
@
Now we compare these estimates using root-mean-square error.
<>=
tmp <- data.frame(npred = rep(5, 2),
    type = c("select-BIC", "FMA-BIC"),
    rms = c(sqrt(mean((wslct - wtrue)^2)),
        sqrt(mean((wpred - wtrue)^2))))
m2save <- rbind(m2save, tmp)
m2save <- m2save[rev(order(m2save[ , "rms"])), ]
print(m2save)
@
The order between ``selection'' and FMA is now confused, with one FMA doing better than the ``selection'' and the other worse, but all of the model selection and model averaging estimators are better than using the correct model with 10 predictors and \Sexpr{length(out5too$coefficients)} parameters to estimate.

\subsection{Summary}

We originally intended to repeat the analyses in this section using more predictor variables, but since we have seen the pattern we expected (more or less) with five predictors, and since model selection and model averaging are a minor point of the paper (although most of the work), we stop here.

\section{Discussion}

It is hard to know what lessons to draw from simulations. All simulation studies are designed, consciously or unconsciously, to ``prove'' a particular point. Since nearly any method can be made to look good if the simulation is chosen precisely to make it look good, simulations actually prove nothing. All we can say about this simulation is that we had no ax to grind (or at least were consciously aware of none) other than illustrating that principal components regression is a bad idea. But this latter point is so well understood by statisticians that we did not bother to illustrate it with our simulations. It is so obvious that principal components would do a horrible job on our example that there would be no point in actually illustrating it. Among model selection and model averaging ideas that actually have some theoretical justification, we have no ax to grind, not being experts in the area. We tried some and they worked, more or less.
Most readers, however, are probably disappointed in how well they worked. Many users of statistics have no idea how badly most model selection schemes work on realistic problems, and the actual performance of the best schemes known is much worse than users desire. But there is no magic; statistics works as well as it works. Short of a fairy godmother with a magic wand, it is clear that no model selection method known will do much better than the ones we tried here. We make no claim that the methods we tried are optimal; more complicated methods might do slightly better, but not much better.

Of course, how well methods work depends on how obvious the true model is. If the effects are made large enough or if the sample size is large enough, then any method that is any good (this does not include principal components regression) will select the correct model. If one increases the sample size \texttt{nind} (p.~\pageref{rnw:1}) or the strength of the quadratic effect \texttt{ascal} (p.~\pageref{rnw:3}), then the problem becomes much easier, and for sufficiently large \texttt{nind} or \texttt{ascal} all of the methods we tried will ``select'' the simulation truth model with reasonably high probability. But how realistic is that? Note that the data we actually simulated are very easy when the true predictors are known ($P$-values on p.~\pageref{pg:out6-8}). Most scientists plan experiments just large enough to show something, not so large that everything is obvious without statistics.

We did, at least, show two contrasting situations. When the true model, which is, of course, unknown in real life, is ``sparse'' with only a few regression coefficients nonzero, then $\bic$ does better. It has a bias towards small models, so when this bias works in its favor, it does better. When the true model is non-sparse, with many regression coefficients nonzero, then $\aic$ and $\aicc$ do better. They have a bias towards large models, so when this bias works in their favor, they do better. Choose whichever you like, depending on your opinion about the true state of affairs. As we have repeatedly mentioned, \citet[Section~1.2.5]{b+a} are particularly emphatic about the biological unrealism of ``true'' (at least simulation truth) models with only a few parameters. Although we have nothing particular to add to this, we agree, for what it is worth. Thus we would usually use $\aicc$ rather than $\bic$.

Frequentist model averaging is new (less than ten years old), and we have even less to say about it. It does seem to work better than ``selecting'' a model and pretending it is true, a threadbare pretense when there are thousands of models under consideration and a negligible chance of selecting the correct one. In such circumstances, ``selecting'' a model and pretending it is true, especially overinterpreting the ``selected'' model and claiming that the predictors ``selected'' are the ones that explain the phenomena, is clearly wrong, and frequentist model averaging is a sensible substitute.

\begin{thebibliography}{}

\bibitem[Akaike(1973)]{aic}
Akaike, H. (1973).
\newblock Information theory and an extension of the maximum likelihood principle.
\newblock \emph{Second International Symposium on Information Theory}, 267--281.
\bibitem[Burnham and Anderson(2002)]{b+a}
Burnham, K.~P. and Anderson, D.~R. (2002).
\newblock \emph{Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach}, 2nd ed.
\newblock New York: Springer-Verlag.

\bibitem[Cook(2007)]{cook}
Cook, R.~D. (2007).
\newblock Fisher lecture: Dimension reduction in regression (with discussion).
\newblock \emph{Statistical Science}, in press.
\newblock \url{http://www.imstat.org/sts/future_papers.html}

\bibitem[Cox(1968)]{cox}
Cox, D.~R. (1968).
\newblock Notes on some aspects of regression analysis.
\newblock \emph{Journal of the Royal Statistical Society, Ser.~A}, \textbf{131}, 265--279.

\bibitem[Furnival and Wilson(1974)]{leaps}
Furnival, G.~M. and Wilson, R.~W., Jr. (1974).
\newblock Regressions by leaps and bounds.
\newblock \emph{Technometrics}, \textbf{16}, 499--511.
\newblock Reprinted, \emph{Technometrics}, \textbf{42}, 69--79.

\bibitem[Geyer and Shaw(2008)]{evo2008talk}
Geyer, C.~J. and Shaw, R.~G. (2008).
\newblock Supporting data analysis for a talk to be given at Evolution 2008, University of Minnesota, June 20--24.
\newblock University of Minnesota School of Statistics Technical Report No.~669.
\newblock \url{http://www.stat.umn.edu/geyer/aster/}

\bibitem[Geyer et al.(2007)Geyer, Wagenius and Shaw]{gws}
Geyer, C.~J., Wagenius, S. and Shaw, R.~G. (2007).
\newblock Aster models for life history analysis.
\newblock \emph{Biometrika}, \textbf{94}, 415--426.

\bibitem[Hand(1981)]{hand}
Hand, D.~J. (1981).
\newblock Branch and bound in statistical data analysis.
\newblock \emph{The Statistician}, \textbf{30}, 1--13.

\bibitem[Hjort and Claeskens(2003)]{hjort}
Hjort, N.~L. and Claeskens, G. (2003).
\newblock Frequentist model average estimators.
\newblock \emph{Journal of the American Statistical Association}, \textbf{98}, 879--899.

\bibitem[Hoeting et al.(1999)Hoeting, Madigan, Raftery, and Volinsky]{bma}
Hoeting, J.~A., Madigan, D., Raftery, A.~E., and Volinsky, C.~T. (1999).
\newblock Bayesian model averaging: A tutorial (with discussion).
\newblock \emph{Statistical Science}, \textbf{14}, 382--417.
\newblock Corrected version available at \url{http://www.stat.washington.edu/www/research/online/1999/hoeting.pdf}.

\bibitem[Lande and Arnold(1983)]{la}
Lande, R. and Arnold, S.~J. (1983).
\newblock The measurement of selection on correlated characters.
\newblock \emph{Evolution}, \textbf{37}, 1210--1226.

\bibitem[Madigan and Raftery(1994)]{occam}
Madigan, D. and Raftery, A.~E. (1994).
\newblock Model selection and accounting for model uncertainty in graphical models using Occam's window.
\newblock \emph{Journal of the American Statistical Association}, \textbf{89}, 1535--1546.

\bibitem[R Development Core Team(2008)]{rcore}
R Development Core Team (2008).
\newblock R: A language and environment for statistical computing.
\newblock R Foundation for Statistical Computing, Vienna, Austria.
\newblock \url{http://www.R-project.org}.
\bibitem[Schwarz(1978)]{schwarz}
Schwarz, G. (1978).
\newblock Estimating the dimension of a model.
\newblock \emph{Annals of Statistics}, \textbf{6}, 461--464.

\bibitem[Shaw et al.(2008)Shaw, Geyer, Wagenius, Hangelbroek, and Etterson]{aster2}
Shaw, R.~G., Geyer, C.~J., Wagenius, S., Hangelbroek, H.~H., and Etterson, J.~R. (2008).
\newblock Unifying life history analysis for inference of fitness and population growth.
\newblock \emph{American Naturalist}, in press.
\newblock \url{http://www.stat.umn.edu/geyer/aster/}

\bibitem[Shaw et al.(2007)Shaw, Geyer, Wagenius, Hangelbroek, and Etterson]{aster2tr}
Shaw, R.~G., Geyer, C.~J., Wagenius, S., Hangelbroek, H.~H., and Etterson, J.~R. (2007).
\newblock Supporting data analysis for ``Unifying life history analysis for inference of fitness and population growth''.
\newblock University of Minnesota School of Statistics Technical Report No.~658.
\newblock \url{http://www.stat.umn.edu/geyer/aster/}

\bibitem[Sugiura(1978)]{sugiura}
Sugiura, N. (1978).
\newblock Further analysis of the data by Akaike's information criterion and the finite corrections.
\newblock \emph{Communications in Statistics, Theory and Methods}, \textbf{A7}, 13--26.

\bibitem[Volinsky et al.(1997)Volinsky, Madigan, Raftery and Kronmal]{volinsky}
Volinsky, C.~T., Madigan, D., Raftery, A.~E. and Kronmal, R.~A. (1997).
\newblock Bayesian model averaging in proportional hazard models: Assessing the risk of a stroke.
\newblock \emph{Applied Statistics}, \textbf{46}, 433--448.

\end{thebibliography}

\end{document}