\documentclass[11pt,notitlepage]{article} \usepackage{indentfirst} \usepackage{amsmath} \usepackage{amssymb} \usepackage{natbib} \usepackage{url} \usepackage{graphicx} \usepackage[utf8]{inputenc} \usepackage{tikz-cd} \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\text{E}} \newcommand{\Var}{\text{Var}} \newcommand{\Fi}{\Sigma} \newcommand{\iFi}{\Sigma^{-1}} \newcommand{\eiFi}{\widehat{\Sigma}^{-1}} \newcommand{\eFi}{\widehat{\Sigma}} \newcommand{\Rsoft}{\texttt{R}} \newcommand{\astertwo}{\texttt{aster2}\;} \newcommand{\betahat}{\hat{\beta}} \newcommand{\tauhat}{\hat{\tau}} \newcommand{\betaenv}{\hat{\beta}_{\text{env}}} \newcommand{\tauenv}{\hat{\tau}_{\text{env}}} \newcommand{\upenv}{\hat{\upsilon}_{\text{env}}} \newcommand{\B}{\mathrm{B}} \newcommand{\T}{\mathrm{T}} \newcommand{\D}{\mathcal{D}} \newcommand{\Su}{\mathcal{S}} \setlength{\parindent}{0cm} \begin{document} \vspace*{0.9375in} % \begin{center} % {\bfseries Web-based Supplementary Materials for } \\ % {\bfseries ``An Application of Envelope Methodology and Aster Models''} \vspace{0.15cm} \\ % by Daniel J. Eck, Charles J. Geyer, and R. Dennis Cook. \vspace{0.5cm} \\ % % Technical Report No.~801 (I'm just making up this number)\\ % need new number % School of Statistics \\ % University of Minnesota \\ % \today % \end{center} \begin{center} {\bfseries Supporting Data Analysis for \\ ``An Application of Envelope Methodology and Aster Models"} \\ By \\ Daniel J. Eck, Charles J. Geyer, and R. Dennis Cook \\ Technical Report No.~699 \\ School of Statistics \\ University of Minnesota \\ \today \end{center} \thispagestyle{empty} \cleardoublepage \setcounter{page}{1} \thispagestyle{empty} In this technical report, we reproduce the examples that appear in \citet*{eck} and show that the combination of general envelope methodology with aster models estimates expected Darwinian fitness with lower variability than estimation conducted with aster models alone. The \texttt{envlpaster} package \citep{envlp-package} is developed to facilitate these calculations. This package implements the bootstrap algorithms discussed in the paper and contains the two datasets corresponding to the examples discussed in the paper. There are other functions in this package that we do not explain in this technical report. Those functions fall into two categories. They are either necessary for implementing the 1D algorithm or they implement other bootstrap techniques not discussed in \citet*{eck}. The other bootstrap techniques are explained in the help pages of the \texttt{envlpaster} package. \\ The bootstraps and model selection calculations in this technical report take substantial time to run. As a result, these calculations are not performed using this \texttt{knitr} script; they are performed in five \texttt{.RData} files and the code that produced these calculations is given in this technical report. The five \texttt{.RData} files are: \begin{itemize} \item[1.] \texttt{ex1-Efronboot.RData} \item[2.] \texttt{ex1-secondboot.RData} \item[3.] \texttt{ex1-AIC.RData} %\item[4.] \texttt{ex2-Efronboot.RData} %\item[5.] \texttt{ex2-secondboot.RData} \item[4.] \texttt{techreport\_eigenboot\_fitboot.RData} \item[5.] \texttt{main-all-terms-DavidLowrydata.RData} %\item[7.] \texttt{ex2-Efronboot-1d.RData} %\item[8.] \texttt{ex2-secondboot-1d.RData} %\item[9.] \texttt{ex2-MLEboot.RData} %\item[10.] \texttt{ex2-AIC.RData} \end{itemize} Daniel J. Eck (\url{eckxx049@umn.edu}) is the current maintainer of these \texttt{.RData} files and they are available on request. \section{Replicating Example 1} A population of 3000 organisms was simulated to form the dataset used in this aster analysis. We generated this data to include a true envelope model. The code generating this dataset is provided in Appendix B. The last line of Appendix B shows that this code reproduces the dataset included in the \texttt{envlpaster} package. The organisms in this dataset either die or remain living for ten years. We observe three random variables for each year $i$ of the study. These variables are $U_i, Vi,$ and $W_i$. The variables $U_i$ and $Vi$ are Bernoulli where $U_i$ models survival and $V_i$ models whether or not an individual reproduces conditional on survival. The variables $W_i$ count offspring produced in the $i$-th year conditional on reproduction. Thus, $W_i$ is modeled as zero-truncated Poisson given $V_i$. The graphical structure for one individual is given by the graph shown in Figure \ref{graph3}. In these data, the variable $\sum_i W_i$ is considered to be Darwinian fitness. There are two covariates $(z_1,z_2)$ thought to be associated with Darwinian fitness and the aster model selected by the likelihood ratio test (LRT) is a full quadratic model with respect to these covariates. A full aster analysis of this data and its construction can be seen in \citet*{geyer4}. Note that our data differs from that in \citet*{geyer4} since we have included a true envelope model. \\ \begin{figure} \begin{center} \setlength{\unitlength}{0.45 in} \thicklines \begin{picture}(10.3,2.5)(-0.1,-2.3) \put(0,0){\makebox(0,0){$1_{\hphantom{0}}$}} \put(1,0){\makebox(0,0){$U_1$}} \put(2,0){\makebox(0,0){$U_2$}} \put(3,0){\makebox(0,0){$U_3$}} \put(4,0){\makebox(0,0){$U_4$}} \put(5,0){\makebox(0,0){$U_5$}} \put(6,0){\makebox(0,0){$U_6$}} \put(7,0){\makebox(0,0){$U_7$}} \put(8,0){\makebox(0,0){$U_8$}} \put(9,0){\makebox(0,0){$U_9$}} \put(10,0){\makebox(0,0){$U_{10}$}} \multiput(0.35,0)(1,0){10}{\vector(1,0){0.3}} \put(1,-1){\makebox(0,0){$V_1$}} \put(2,-1){\makebox(0,0){$V_2$}} \put(3,-1){\makebox(0,0){$V_3$}} \put(4,-1){\makebox(0,0){$V_4$}} \put(5,-1){\makebox(0,0){$V_5$}} \put(6,-1){\makebox(0,0){$V_6$}} \put(7,-1){\makebox(0,0){$V_7$}} \put(8,-1){\makebox(0,0){$V_8$}} \put(9,-1){\makebox(0,0){$V_9$}} \put(10,-1){\makebox(0,0){$V_{10}$}} \multiput(1,-0.25)(1,0){10}{\vector(0,-1){0.5}} \put(1,-2){\makebox(0,0){$W_1$}} \put(2,-2){\makebox(0,0){$W_2$}} \put(3,-2){\makebox(0,0){$W_3$}} \put(4,-2){\makebox(0,0){$W_4$}} \put(5,-2){\makebox(0,0){$W_5$}} \put(6,-2){\makebox(0,0){$W_6$}} \put(7,-2){\makebox(0,0){$W_7$}} \put(8,-2){\makebox(0,0){$W_8$}} \put(9,-2){\makebox(0,0){$W_9$}} \put(10,-2){\makebox(0,0){$W_{10}$}} \multiput(1,-1.25)(1,0){10}{\vector(0,-1){0.5}} \end{picture} \end{center} \caption{Graphical structure of the aster model for the simulated data in Section 5. The top layer corresponds to survival; these random variables are Bernoulli. The middle layer corresponds to whether or not an individual reproduced; these random variables are also Bernoulli. The bottom layer corresponds to offspring count; these random variables are zero-truncated Poisson.} \label{graph3} \end{figure} The \texttt{envlpaster} and \texttt{aster2} packages are needed to carry out all calculations. The \texttt{envlpaster} packages requires \texttt{aster}, \texttt{MASS}, \texttt{caTools} and \texttt{trust} to be installed. <>= library(aster2) library(envlpaster) data(simdata30nodes) data <- simdata30nodes.asterdata @ The \texttt{simdata30nodes} dataset contains almost all that is necessary to fit the aster model used in the paper. The \texttt{root} and \texttt{pred} vectors and the model matrix \texttt{modmat} necessary for aster model fitting are included directly in the \texttt{asterdata} object included in the \texttt{simdata30nodes} dataset within the \texttt{envlpaster} package. <>= vars <- variables pred <- data$pred families <- data$families code <- data$code @ The \texttt{fam} vector, which specifies the exponential families appearing in the graphical structure of an aster model, is not included but is constructed using the \texttt{code} vector. In this example, the \texttt{code} vector has entries that are 1-2 valued where a 1 corresponds to Bernoulli distributions and a 2 corresponds to zero-truncated Poisson distributions. In the \texttt{aster} library, a zero-truncated Poisson distribution is coded with the number 3 so we change the 2's that appear in \texttt{code} to 3's when we construct the \texttt{fam} vector. <>= fam <- code fam[fam == 2] <- 3 @ The model fitting is not of class \texttt{aster.formula} and is fit using the aster data directly. The first \texttt{nnode} columns of the \texttt{simdata30nodes} dataset contain the responses necessary to fit the aster submodel of interest. Here \texttt{nnode} = \Sexpr{length(data$pred)}. Recall that the full aster model is a saturated statistical model (one unknown parameter per observed response value) and is of little scientific use; the aster submodel has a reduced parameter count where the parameters of this model correspond to the relationship between Darwinian fitness with both phenotypic traits and covariates of interest. The different parameterizations of the aster model are seen in Figure~\ref{parm}. The relationship between the aster model parameterizations displayed in Figure~\ref{parm} are further explained in \citet*{geyer3}. <>= nnode <- length(vars) xnew <- as.matrix(simdata30nodes[,c(1:nnode)]) m1 <- aster(xnew, root, pred, fam, modmat) m1$formula <- formula m1$xlevels <- xlevels m1$terms <- terms summary(m1) @ There are nine components in both the submodel canonical parameter vector $\beta$ and the submodel mean-value parameter vector $\tau$. We write \[ \tau = \left(\begin{array}{c} \gamma \\ \upsilon \end{array}\right) = M^T\mu\] where $\gamma \in \R^{p-k}$ are nuisance parameters and $\upsilon \in \R^k$ are the parameters of interest that are relevant to the estimation of expected Darwinian fitness, $M$ is the model matrix, and $\mu$ is the mean-value parameter vector of the unconditional aster model, see Figure~\ref{parm}. In our example, $p = 9$ and $k = 5$. The parameter vector $\upsilon$ are the five unknown slope coefficients for the full quadratic model in terms of the covariates $z_1$ and $z_2$. Ultimately, there are five parameterizations of interest that are present in the aster analyses we consider. These five parameterizations are given below and the formal relations among the first four parameterizations are shown in Figure~\ref{parm}. These parameterizations are: \begin{itemize} {\setlength\itemindent{10pt} \item[1)] The aster model canonical parameter vector $\beta \in \R^p$. } {\setlength\itemindent{10pt} \item[2)] The aster model mean-value parameter vector $\tau \in \R^p$. } {\setlength\itemindent{10pt} \item[3)] The unconditional aster model canonical parameter vector $\varphi \in \R^m$. } {\setlength\itemindent{10pt} \item[4)] The unconditional aster model mean-value parameter vector $\mu \in \R^m$. } {\setlength\itemindent{10pt} \item[5)] The best surrogate of Darwinian fitness defined in \citet*{eck} as $h(\mu) \in \R^d$ or $g(\tau) \in \R^d$ where $h$ and $g$ are a differentiable functions and $d$ is the number of hypothetical individuals that expected Darwinian fitness is estimated for. } \end{itemize} % Our methods are not applicable with respect to conditional aster models (needs to be included). In this example, true Darwinian fitness is unknown. The best available surrogate of Darwinian fitness is total offspring count. When Darwinian fitness is taken to be total offspring count, the function $h(\mu) = A\mu$ where $A$ is a matrix with entries $a_{i,j} = 1$ when $\mu_j$ corresponds to offspring for individual $i$. The matrix $A$ specifies that expected Darwinian fitness for every individual is the sum of offspring nodes for that individual. \\ \begin{figure}[t!] \begin{center} \begin{tikzpicture}[descr/.style={fill=white},text height=1.5ex, text depth=0.25ex] \node (a) at (0,0) {$\theta$}; \node (b) at (4,0) {$\varphi$}; \node (c) at (8,0) {$\beta$}; \node (d) at (0,-3) {$\xi$}; \node (e) at (4,-3) {$\mu$}; \node (f) at (8,-3) {$\tau$}; \path[->,font=\scriptsize] ([yshift= 5pt]a.east) edge node[above] {\text{aster transform}} ([yshift= 5pt]b.west) ([yshift= -5pt]b.west) edge node[below] {\text{inverse aster transform}} ([yshift= -5pt]a.east) ([yshift= 5pt]b.east) edge node[above] {} ([yshift= 5pt]c.west) ([yshift= -5pt]c.west) edge node[below] {$\varphi = a + M\beta$} ([yshift= -5pt]b.east) ([xshift= 5pt]a.south) edge node[right] {$\xi_j = c_j^{\prime}(\theta)$} ([xshift= 5pt]d.north) ([xshift= -5pt]d.north) edge node[right] {} ([xshift= -5pt]a.south) ([xshift= 5pt]b.south) edge node[right] {$\mu = \nabla c(\varphi)$} ([xshift= 5pt]e.north) ([xshift= -5pt]e.north) edge node[right] {} ([xshift= -5pt]b.south) ([xshift= 5pt]c.south) edge node[right] {$\tau = \nabla_{\beta} c(a + M\beta)$} ([xshift= 5pt]f.north) ([xshift= -5pt]f.north) edge node[right] {} ([xshift= -5pt]c.south) ([yshift= 5pt]e.east) edge node[above] {$\tau = M^T\mu$} ([yshift= 5pt]f.west) ([yshift= -5pt]f.west) edge node[below] {} ([yshift= -5pt]e.east) ([yshift= 5pt]d.east) edge node[above] {\text{multiplication}} ([yshift= 5pt]e.west) ([yshift= -5pt]e.west) edge node[below] {\text{division}} ([yshift= -5pt]d.east); \end{tikzpicture} \end{center} \caption{A depiction of the transformations necessary to change aster model parameterizations. Unlabeled arrows represent transformations that need to be solved using optimization software. $M$ is a known model matrix, $a$ is a known offset vector, and $c$ is the cumulant function for the unconditional aster model.} \label{parm} \end{figure} <>= beta <- m1$coef; p <- length(beta) target <- 5:9; m <- length(target) x <- m1$x nind <- n <- nrow(x) @ %<<>>= %offset <- as.vector(m1$origin) %@ We obtain the maximum likelihood estimator (MLE) of the submodel mean-value parameter, denoted $\tauhat$ using \texttt{aster} software and the relations in Figure~\ref{parm}. \\ <>= mu <- predict(m1, parm.type = "mean.value", model.type = "unconditional") modmat.mat <- matrix(modmat, ncol = p) tau <- crossprod(modmat.mat, mu) @ We now transition from aster models to the combination of aster models and envelope methodology. The goal of this combination is to incorporate envelope methodology with respect to parameters of interest in the $\tau$ parameterization to yield efficiency gains when estimating expected Darwinian fitness. We first provide some envelope methodology basics relevant to aster modeling scenarios. The particular branch of envelope methodology relevant to aster problems is general envelope methodology discussed in \citet*{cook2}. The theory of general envelope methodology requires a $\sqrt{n}$ consistent and asymptotically normal estimator of an unknown parameter vector and a $\sqrt{n}$ consistent estimator of its asymptotic covariance matrix \citep*{cook2}. Aster models, being an exponential family, satisfy these conditions. \\ Let $\T = \text{span}(\upsilon)$. The subspace $\T$ is a line in $\R^k$ since the unknown parameter $\upsilon$ is a point in $\R^k$. The envelope subspace $\mathcal{E}_{\Sigma_{\upsilon,\upsilon}}(\T)$ is defined as the intersection of all reducing subspaces of $\Sigma_{\upsilon,\upsilon}$ that contains $\T$. Reducing subspaces of $\Sigma_{\upsilon,\upsilon}$ are sums of eigenspaces of $\Sigma_{\upsilon,\upsilon}$ where the eigenspaces of $\Sigma_{\upsilon,\upsilon}$ are spans of the eigenvectors of $\Sigma_{\upsilon,\upsilon}$. An eigenvector $x$ of $\Sigma_{\upsilon,\upsilon}$ is a solution to the equation $\Sigma_{\upsilon,\upsilon}x = \lambda x$ where $\lambda$ is an eigenvalue needed to solve the equation. There are $k$ eigenvectors and eigenvalues in total. Eigenspaces are defined to be the space spanned by all eigenvectors corresponding to one particular eigenvalue. When all of the eigenvalues are unique, all distinct reducing subspaces of $\Sigma_{\upsilon,\upsilon}$ can be specified by the $2^k$ unique combinations of spans of eigenvectors. The decomposition of a matrix into eigenvalues and eigenvectors is a general mathematical decomposition that has a statistical context when applied to a covariance matrix. When applied to a covariance matrix, such as $\Sigma_{\upsilon,\upsilon}$, this decomposition provides insight into the directions of variability (given by the eigenvectors) and the strength of variability in that direction (given by the eigenvalues). \\ The envelope subspace, as defined above to be the intersection of all reducing subspaces of $\Sigma_{\upsilon,\upsilon}$ that contain $\T$, satisfies both \begin{itemize} \item[1.] $\T \subset \mathcal{E}_{\Sigma_{\upsilon,\upsilon}}(\T)$ \item[2.] $\Sigma_{\upsilon,\upsilon} = P_{\mathcal{E}}\Sigma_{\upsilon,\upsilon}P_{\mathcal{E}} + Q_{\mathcal{E}}\Sigma_{\upsilon,\upsilon}Q_{\mathcal{E}}$ \end{itemize} where $\mathcal{E}$ denotes the envelope subspace when used as a subscript, $P_{\mathcal{E}}$ is the projection into $\mathcal{E}_{\Sigma_{\upsilon,\upsilon}}(\T)$, and $Q_{\mathcal{E}}$ is the projection into the orthogonal complement of $\mathcal{E}_{\Sigma_{\upsilon,\upsilon}}(\T)$. The reducing subspace assumption is needed for conditional 2 to hold. Intuitively, the envelope estimator reduces variability in estimation at no cost to consistency by making use of the conditions assumed of $\mathcal{E}_{\Sigma_{\upsilon,\upsilon}}(\T)$. The first condition allows for consistent estimation to take place. The second condition states that only a part of the asymptotic covariance matrix $\Sigma_{\upsilon,\upsilon}$, seen as $P_{\mathcal{E}}\Sigma_{\upsilon,\upsilon}P_{\mathcal{E}}$, is variability associated with the estimation of expected Darwinian fitness. The other part, seen as $Q_{\mathcal{E}}\Sigma_{\upsilon,\upsilon}Q_{\mathcal{E}}$, is immaterial to our estimation problem and can be discarded. When $P_{\mathcal{E}}$ is known, the envelope estimator of $\upsilon$ is then $P_{\mathcal{E}}\hat{\upsilon}$, where $$ \sqrt{n}\left(P_{\mathcal{E}}\hat{\upsilon} - \upsilon\right) \stackrel{\D}{\longrightarrow} N\left(0, P_{\mathcal{E}}\Sigma_{\upsilon,\upsilon} P_{\mathcal{E}} \right). $$ With condition 2 in mind, we can see that the envelope estimator of $\upsilon$ has less asymptotic variability than the MLE when $P_{\mathcal{E}}$ is known. In actual problems, $P_{\mathcal{E}}$ is unknown and requires estimation. The \texttt{envlpaster} software can estimate $P_{\mathcal{E}}$ using the 1D algorithm proposed in \citet*[Algorithm 2]{cook2} and discussed in \citet*[Sections 3 and 4]{eck} or by using reducing subspaces directly as in \citet*[Section 4]{eck}. The reducing subspace approach to envelope estimation is of particular interest in the current applications and we only motivate this approach in this technical report. See the references in this paragraph for details on the 1D algorithm, specifically the last paragraph of \citet*[Section 4]{eck}. \\ %In the current analysis, envelope estimators are constructed via reducing subspaces as in \citet*[Section 4]{eck}. In our context, a reducing subspace is the space spanned by a subset of the eigenvectors of $\Sigma_{\upsilon,\upsilon}$ where $\Sigma_{\upsilon,\upsilon}$ is the part of the Fisher information matrix $\Sigma$ corresponding to parameters of interest. The envelope subspace is the intersection of all reducing subspaces $G$ that satisfy both %\begin{itemize} %\item[1.] $\T \subset G$ %\item[2.] $\Sigma_{\upsilon,\upsilon} = P_G\Sigma_{\upsilon,\upsilon}P_G + Q_G\Sigma_{\upsilon,\upsilon}Q_G$ %\end{itemize} %where $P_G$ is the projection into the reducing subspace $G$ and $Q_G$ is the projection into the orthogonal complement of $G$. The restriction to reducing subspaces ensures that condition 2 holds. The two conditions assure that the variability represented by $P_G\Sigma_{\upsilon,\upsilon}P_G$ is the only variability necessary for the estimation of $\upsilon$. We stand to estimate $\upsilon$ at potentially much greater efficiency when the smallest reducing subspace $G$ that satisfies the two envelope conditions is identified. In applications, this space requires estimation and this estimation poses a cost on potential efficiency gains. When the cost of estimation is small relative to the discarded variability $Q_G\Sigma_{\upsilon,\upsilon}Q_G$, then we stand to estimate $\upsilon$ at much greater efficiency and expected Darwinian fitness at potentially greater efficiency. \\ In our applications, envelope estimators are constructed using eigenvectors of the estimated Fisher information matrix $\widehat{\Sigma}_{\upsilon,\upsilon}$ which is the estimate of $\Sigma_{\upsilon,\upsilon}$ obtained from \texttt{aster} software. In this particular analysis, $\widehat{\Sigma}_{\upsilon,\upsilon}$ is the lower 5 by 5 block of $\widehat{\Sigma}$. All possible envelope estimators constructed from reducing subspaces of $\widehat{\Sigma}_{\upsilon,\upsilon}$ are produced and compared using the \texttt{selection} function in the \texttt{envlpaster} package. There are 31 possible reducing subspaces in total and the \texttt{combs} function in the \texttt{caTools} package is used keep track of the indices. Reducing subspaces are constructed one dimension at a time. At a specified dimension, we construct all of the possible envelope estimators using reducing subspaces. There are $\binom{5}{1}$ possible reducing subspaces of dimension $1$, $\binom{5}{2}$ possible reducing subspaces of dimension $2$, and so on. At a particular iteration of this procedure, denote the reducing subspace of $\widehat{\Sigma}_{\upsilon,\upsilon}$ by $\widehat{G}$. The envelope estimator of $\upsilon$ is given by $\hat{\upsilon}_{\text{env}} = P_{\widehat{G}} \hat{\upsilon}$. The envelope estimator of $\tau$ is \[ \hat{\tau}_{\text{env}} = \left(\begin{array}{c} \hat{\gamma} \\ \hat{\upsilon}_{\text{env}} \end{array}\right). \] Estimators of this form correspond to an aster model with a new model matrix that incorporates the estimated projection into the envelope space. This model matrix is given as \[ M_{\text{env}} = M \left(\begin{array}{cc} I & 0 \\ 0 & P_{\widehat{G}} \end{array}\right). \] The model matrix $M_{\text{env}}$ satisfies $\hat{\tau}_{\text{env}} = M_{\text{env}}^T Y$ which says that the envelope estimator of $\tau$ is a maximum likelihood estimator for the aster model with model matrix $M_{\text{env}}$. Once the envelope estimator $\hat{\tau}_{\text{env}}$ is calculated, it is transformed to the $\beta$ parameterization using the model matrix $M_{\text{env}}$ as an argument of \texttt{transformUnconditional}. The parameter vector $\beta$ arises naturally in the log likelihood \begin{equation} \label{likelihood} l(\beta) = \langle M^TY, \beta \rangle - c(\beta). \end{equation} The transformation from $\hat{\tau}$ to $\hat{\beta}$ provides a maximizer of (\ref{likelihood}) with model matrix given by $M_{\text{env}}$. However, $\hat{\beta}$ is not unique since $M_{\text{env}}$ is not of full column rank. This non identifiability is not problemmatic because final inference is not conducted with respect to the $\beta$ parameterization. Final inference is conducted with respect to the parameterization of Darwinian fitness which does not posess such problems. When in the $\beta$ parameterization, we compute AIC, BIC, and the p-value of the LRT using \texttt{aster} software and the likelihood (\ref{likelihood}). The \texttt{selection} function in the \texttt{envlpaster} package below computes the AIC, BIC, and the p-value of the LRT for all 31 envelope models including the model corresponding to the MLE. This is done by specifying the \texttt{method = "eigen"} and \texttt{type = "mean-value"} arguments in the \texttt{selection} function. \\ <>= foo2 <- selection(parm = tau, index = target, model = m1, data = data, alpha = 0.05, type = "mean-value", method = "eigen") @ The calculation above is not performed in this script. The calculation is performed in \texttt{ex1-Efronboot.R} and is sourced into this document from the \texttt{ex1-Efronboot.RData} file. \\ <>= load("ex1-Efronboot.RData") foo2$aic; foo2$bic; foo2$LRT @ We see that BIC and the LRT at size $\alpha = 0.05$ favor the reducing subspace that is the sum of the first, fourth, and fifth eigenspaces of $\widehat{\Sigma}_{\upsilon,\upsilon}$. This reducing space will be referred to as $\widehat{G}_1$. On the other hand, AIC favors the reducing subspace that is the sum of all eigenspaces of $\widehat{\Sigma}_{\upsilon,\upsilon}$ with the exception of the third eigenspace. This reducing subspace will be referred to as $\widehat{G}_2$. Since eigenvalues of $\widehat{\Sigma}_{\upsilon,\upsilon}$ are unique, the space $\widehat{G}_1$ can also be defined as the space spanned by the first, fourth, and fifth eigenvector of $\widehat{\Sigma}_{\upsilon,\upsilon}$ and $\widehat{G}_2$ can be defined similarly. Both of these reducing subspaces are smaller than the full space which is a sum of all of the eigenspaces of $\widehat{\Sigma}_{\upsilon,\upsilon}$ and both are larger than the true reducing subspace. The true reducing subspace of $\widehat{\Sigma}_{\upsilon,\upsilon}$ used to construct this example, $G_{\text{true}}$, is the space spanned by the first and fourth eigenvectors of $\widehat{\Sigma}_{\upsilon,\upsilon}^{1}$ where $\widehat{\Sigma}_{\upsilon,\upsilon}^{1}$ is the estimated asymptotic covariance matrix of the parameters of interest constructed from a preliminary simulated dataset. The realizations in the dataset used in this example were generated from the aster model distribution with submodel mean-value parameter $\tau_{\text{true}} = (\hat{\gamma}_1^T,P_{G_{\text{true}}}\hat{\upsilon}_1^T)^T$. Here, $\hat{\gamma}_1$ and $\hat{\upsilon}_1$ are the MLE of the nuisance parameters and parameters of interest obtained from the preliminary dataset. \\ The parametric bootstrap procedure given by Algorithm 2 \citep*{eck} is used to estimate the envelope estimator selected by BIC and the LRT of size $\alpha = 0.05$. The steps of this procedure are outlined in Algorithm 2 of the current document. Before we run our parametric bootstrap, we see which envelope model is selected when we use the 1D algorithm \citet*{cook2, cook3, eck}. This is done by changing the \texttt{method} argument in the \texttt{selection} function to \texttt{method = "1D"}. All three selection criteria favor an envelope dimension of $u = 5$. This corresponds to a selection of the full aster model with no envelope structure incorporated. The parametric bootstrap procedure discussed in Algorithm 2 is not very interesting in this case. \\ We now construct the envelope estimator of $\tau$ with respect to $\widehat{G}_1$ as suggested by BIC and the LRT of size $\alpha = 0.05$. The envelope estimator is given by $\hat{\tau}_{\text{env}} = M_{\text{env}}^T Y$ where \[ M_{\text{env}} = M \left(\begin{array}{cc} I & 0 \\ 0 & P_{\widehat{G}_1} \end{array}\right). \] Expected Darwinian fitness is estimated for 144 hypothetical individuals possessing unique values of \texttt{z1} and \texttt{z2} suggested by the original data. These hypothetical individuals will be used to build a fitness landscape for estimated expected Darwinian fitness. The fitness landscape is estimated using both maximum likelihood estimation and envelope estimation. Before we build the dataframe corresponding to these 144 hypothetical individuals, the \texttt{vars} vector containing the names of all the nodes in Figure~\ref{graph3} is specified. <>= vars <- as.vector(outer(c("u", "v", "w"), 1:10, paste, sep = "")) @ The dataframe corresponding to the 144 hypothetical individuals is constructed using the code below. <>= nx <- 12; npop <- nx^2 z1.cand <- seq(from = -6, to = 6, length = nx) z2.cand <- seq(from = -6, to = 6, length = nx) fred <- data.frame( cbind(rep(z1.cand, each = nx), rep(z2.cand, nx), matrix(1,nrow = nx^2, ncol = 30))) colnames(fred) <- c("z1","z2",vars) fred$root <- 1 @ Next, \texttt{amat} needs to be specified. \texttt{amat} specifies which nodes in the graphical structure for the hypothetical individuals are Darwinian fitness nodes. \texttt{amat} can either be a three-dimensional array or a matrix. We construct a three-dimensional array as in \citet*{geyer4}. <>= amat <- array(0, c(npop, nnode, npop)) foot <- grepl("w", vars) for (k in 1:npop) amat[k, foot, k] <- 1 @ The \texttt{newdata} argument is specified with a dataframe for the hypothetical individuals created to build the fitness landscape. This dataframe has already been constructed and has been given the name \texttt{fred}. The model matrix and dataframe in long format, corresponding to the provided dataframe specified by \texttt{newdata}, are constructed internally. If the construction of either the new model matrix or the new dataframe requires more calculation than a simple \texttt{model.matrix} or \texttt{renewdata} call respectively, then these quantities can be included using the \texttt{modmat.new} or \texttt{renewdata} arguments respectively. \\ <>= renewdata <- reshape(fred, varying = list(vars), direction = "long", timevar = "varb", times = as.factor(vars), v.names = "resp") layer2 <- gsub("[0-9]", "", as.character(renewdata$varb)) renewdata <- data.frame(renewdata, layer = layer2) renewdata$vtype <- as.factor(substr(as.character( renewdata$varb),1, 1)) renewdata$year <- as.numeric(substring(as.character( renewdata$varb), 2)) renewdata$uyear <- renewdata$year * as.numeric(as.character(renewdata$vtype) == "u") @ We also need to specify the \texttt{modmat.new} argument. The dataframe in long format is calculated as an intermediate step to the construction of this model matrix. <>= modmat.renew <- model.matrix(m1$formula, data = renewdata) @ We now have the required quantities necessary to estimate the variability of estimated expected Darwinian fitness using envelope methodology. The envelope estimator of expected Darwinian fitness for these hypothetical individuals has asymptotic distributions given by \begin{equation}\label{asymptotics-envelope-unknown} \sqrt{n}\left(g(\tauenv) - g(\tau)\right) \overset{\D}{\longrightarrow} N\left(0,\; \nabla g(\tau)\Delta_2 \nabla g(\tau)^T \right) \end{equation} where $\Delta_2$ is the unknown asymptotic covariance matrix of $\tauenv$. The asymptotic covariance matrix of estimated expected Darwinian fitness is estimated using the parametric bootstrap adjusting for model selection as seen in Algorithm 2. The dimension of the envelope space is selected using model selection criteria and is not known in advance. \citet*[Section 4]{efron} provides a double bootstrap procedure which accounts for model selection. This procedure is used to estimate the variance of estimated expected Darwinian fitness using envelope methodology. At first, data are generated as a realization from the aster model distribution evaluated at the envelope estimator. The envelope dimension is determined for each of these generated datasets. The estimator of expected Darwinian fitness is then taken to be the average of all of the envelope estimators obtained from these datasets. To estimate the variability of this envelope estimator, we generate further datasets from the aster model distribution evaluated at each separate envelope estimator of expected Darwinian fitness used to calculate the aforementioned average. The steps of this procedure are presented in \citet*[Algorithm 2]{eck}. For completeness, the steps of this algorithm are restated here. To avoid confusion, we label this algorithm as Algorithm 2. \\ \begin{table}[h!] \begin{tabular}{|p{14cm}|} \hline \textbf{Algorithm 2: Parametric bootstrap reducing subspace envelope estimator of $\upsilon$} \\ \hline \begin{enumerate} \item[1.] Fit the aster model to the data and obtain $\hat{\upsilon}$ and $\widehat{\Sigma}_{\upsilon,\upsilon}$ from the aster model fit. \item[2.] Compute the envelope estimator of $\upsilon$ in the original sample, given as $\upenv = P_{\widehat{G}}\hat{\upsilon}$ where $P_{\widehat{G}}$ is computed using reducing subspaces and selected via a model selection criterion of choice. \item[3.] Perform a parametric bootstrap by generating resamples from the distribution of the aster submodel evaluated at $\tauenv = (\hat{\gamma}^T,\upenv^T)^T$. For iteration $b=1,...,B$ of the procedure: \begin{enumerate} {\setlength\itemindent{10pt} \item[(3a)] Compute $\hat{\tau}^{(b)}$ and $\widehat{\Sigma}_{\upsilon,\upsilon}^{(b)}$ from the aster model fit to the resampled data.} {\setlength\itemindent{10pt} \item[(3b)] Build $P_{\widehat{G}}^{(b)}$ using the indices of $\widehat{\Sigma}_{\upsilon,\upsilon}^{(b)}$ that are selected using the same model selection criterion as Step 2 to build $\widehat{G}$. } {\setlength\itemindent{10pt} \item[(3c)] Compute $\upenv^{(b)} = P_{\hat{\mathcal{E}}}^{(b)}\hat{\upsilon}^{(b)}$ and $\tauenv^{(b)} = \left(\hat{\gamma}^{(b)^T},\upenv^{(b)^T}\right)^T$. } {\setlength\itemindent{10pt} \item[(3d)] Store $\hat{\tau}_{\text{env}^{(b)}}$ and $g\left(\hat{\tau}_{\text{env}}^{(b)}\right)$ where $g$ maps $\tau$ to the parameterization of Darwinian fitness. } \end{enumerate} \item[4.] After $B$ steps, the bootstrap estimator of expected Darwinian fitness is the average of the envelope estimators stored in Step 3d. This completes the first part of the bootstrap procedure. \item[5.] We now proceed with the second level of bootstrapping at the $b$-th stored envelope estimator $\hat{\tau}_{\text{env}}^{(b)}$. For iteration $k=1,...,K$ of the procedure: \begin{enumerate} {\setlength\itemindent{10pt} \item[(5a)] Generate data from the distribution of the aster submodel evaluated at $\hat{\tau}_{\text{env}}^{(b)}$. } {\setlength\itemindent{10pt} \item[(5b)] Perform Steps 3a through 3d with respect to the dataset obtained in Step 5a.} {\setlength\itemindent{10pt} \item[(5c)] Store $\hat{\tau}_{\text{env}}^{(b)^{(k)}}$ and $g\left(\hat{\tau}_{\text{env}}^{(b)^{(k)}}\right)$. } \end{enumerate} \end{enumerate} \\ \hline \end{tabular} \end{table} There are two functions in the \texttt{envlpaster} package necessary to perform all of the steps in Algorithm 2. The reason for separating this chore into two functions is computational. This separation allows users to parallelize this bootstrap using \texttt{mclapply}. The first of these functions is \texttt{fit.boot.Efron}. This function implements the first level of the parametric bootstrap procedure given by either Algorithm 1 or Algorithm 2 in \citet*{eck}. This is detailed in Steps 1 through 3d outlined in Algorithm 2. This parametric bootstrap generates resamples from the distribution evaluated at an envelope estimator of $\tau$ adjusting for model selection volatility. \\ To use the \texttt{fit.boot.Efron} function, the user must specify whether they want to use AIC, BIC, or the LRT of size $\alpha$ as a model selection criterion. The user also specifies which method, either the 1D algorithm or the reducing subspace approach, is to be used to calculate envelope estimators. When one is using a partial envelope model, then this function constructs envelope estimators of $\upsilon$ where we write $\tau = (\gamma^T,\upsilon^T)^T$ and $\upsilon$ corresponds to aster model parameters of interest. In the original sample, the reducing subspaces are indices of eigenvectors of $\widehat{\Sigma}_{\upsilon,\upsilon}$ where $\widehat{\Sigma}_{\upsilon,\upsilon}$ is the part of $\widehat{\Sigma}$ corresponding to our parameters of interest. These indices are specified by \texttt{vectors}. When all of the components of $\tau$ are components of interest, then we write $\widehat{\Sigma}_{\upsilon,\upsilon} = \widehat{\Sigma}$. \\ %When data is generated via the parametric bootstrap, it is the indices of the eigenvectors used to form the reducing subspace (not the original reducing subspaces) that are used to construct envelope estimators constructed using the generated data. \\ We now call on \texttt{fit.boot.Efron} to perform the top level of bootstrapping for our example. The calculation above is not performed in this script. It is performed in \texttt{ex1-Efronboot.R} and is sourced into this document from the \texttt{ex1-Efronboot.RData} file. \\ <>= set.seed(13) nboot <- 100 blah <- fit.boot.Efron(model = m1, nboot = nboot, index = target, dim = foo2$bic, data = data, amat = amat, newdata = fred, modmat.new = modmat.renew, renewdata = renewdata, criterion = "BIC", method = c("eigen"), quiet = FALSE) @ The contents stored in \texttt{blah}, and previously computed quantities, are all that is needed to perform the second level of bootstrapping necessary for the estimation of the variability of estimated expected Darwinian fitness incorporating envelope methodology. The function \texttt{secondboot} in the \texttt{envlpaster} package performs this second level of bootstrapping. This function implements the second level of the parametric bootstrap procedure given by either Algorithm 1 or Algorithm 2 in \citet*{eck} with respect to the mean-value parameterization. This is detailed in Steps 4 through 5c in Algorithm 2. At iteration $b$, this parametric bootstrap generates resamples from the distribution evaluated at the envelope estimator ($\hat{\tau}_{env}^{(b)}$) of $\tau$. In this case, the selected indices producing the reducing subspace which was used to construct the envelope estimator $\hat{\tau}_{env}^{(b)}$ are used to construct envelope estimators for the generated data. These resamples are used to estimate the variability of $\hat{\tau}_{env}^{(b)}$. \\ We now call on \texttt{secondboot} to perform the second level of bootstrapping for our example. The calculations involving \texttt{secondboot} are not performed in this script. These calculations are performed in \texttt{ex1-secondboot.R} and are sourced into this document from the \texttt{ex1-secondboot.RData} file. The bootstrap sample size for the second level of bootstrapping is 500. The script below performs this task. This script also calculates the bootstrapped estimator of the variability of expected Darwinian fitness incorporating envelope methodology. \\ <>= load("ex1-secondboot.RData") @ <>= nboot2 <- 500 set.seed(13) internal <- function(k){ set.seed(13) bar <- secondboot(k, out = blah, model = m1, data = data, nboot2 = nboot2, index = target, newdata = fred, amat = amat, method = "eigen") return(bar$sd.Efron) } @ <>= # $ @ %The steps and quantities needed to calculate the bootstrapped estimator of the variability of expected Darwinian fitness incorporating envelope methodology are outlined in this paragraph. When the top-level of our bootstrap procedure in Algorithm 2 has run for a total of $B$ iterations, we obtain the envelope estimator \begin{equation} \label{tau.boot} \hat{\mu}_g = \frac{1}{B} \sum_{b=1}^B g(\tauenv^{(b)}). \end{equation} The individual estimates of expected Darwinian fitness in (\ref{tau.boot}) involve model selection which is smoothed out by averaging \citep*{efron}. This envelope estimator obtained from the parametric bootstrap in Algorithm 2 has variability analogous to that in \citet*[equation (4.15)]{efron}. As in \citet*{efron}, for $b = 1,...,B$ we define the matrix $\textbf{B}^{(b)} \in \R^{K\times p}$ which has rows $\hat{\tau}_{\text{env}}^{(b)^{(k)}} - \sum_{k=1}^K\hat{\tau}_{\text{env}}^{(b)^{(k)}}/K$ and the matrix $C^{(b)} \in \R^{K \times d}$ which has columns $g\left(\tau_{\text{env}}^{(b)^{(k)}}\right) - g\left(\tau_{\text{env}}^{(b)}\right)$. $\Delta_2$ is estimated by %$g\left(\tau_{\text{env}}^{(b)^{(k)}}\right) - \hat{\mu}_g$ \begin{equation} \label{Delta.boot} \widehat{\Delta}_2 = \sum_{b=1}^B\left[\widehat{\text{cov}}^{(b)^T}\widehat{V}^{(b)^{-1}}\widehat{\text{cov}}^{(b)}\right] / B \end{equation} where \begin{equation} \label{cov.hat} \widehat{\text{cov}}^{(b)} = \textbf{B}^{(b)^T}C^{(b)}/K \end{equation} and \begin{equation} \label{V.hat} \widehat{V}^{(b)} = \textbf{B}^{(b)^T}\textbf{B}^{(b)}/K. \end{equation} The estimator (\ref{Delta.boot}) of $\Delta_2$ takes into account the volatility of model selection when estimating the variability of estimated expected Darwinian fitness using envelope methodology. The method of maximum likelihood estimation does not have the added model selection step that envelope estimation has. The bootstrap procedure outlined in Algorithm 2 efficiently estimates expected Darwinian fitness and accounts for variability associated with model selection volatility. \\ \sloppy The code below calculates (\ref{Delta.boot}). These calculations are performed in \texttt{ex1-secondboot.R} and are sourced into this document from the \texttt{ex1-secondboot.RData} file. Note that a machine with five available cores was used to perform this calculation. \\ <>= y <- nboot / 5 blah.second <- mclapply(1:5, mc.cores = 5, FUN = function(x){ y <- nboot / 5 int <- ((y*(x-1) + 1) : (y*x)) out <- lapply(int, FUN = internal) return(out) } ) @ <>= sd.Efron.mean <- rep(0,144) for(k in 1:5){ for(j in 1:y){ sd.Efron.mean <- sd.Efron.mean + blah.second[[k]][[j]] / nboot } } @ The file \texttt{techreport\_eigenboot\_fitboot.RData} contains output corresponding to the bootstrapped estimator of the variability of expected Darwinian fitness using the MLE. The bootstrap sample size for this particular bootstrap was 500. \\ <>= nboot1 <- nboot load("techreport_eigenboot_fitboot.RData") @ We now compute the ratio of standard errors. Values above 1 indicate that the envelope estimator of expected Darwinian fitness using Algorithm 2 is less variable than the MLE. The code below computes the bootstrapped standard errors of estimated expected Darwinian fitness using the MLE. <>= a <- bar$MLE.boot.out data.fit <- asterdata(fred, vars, pred, group = rep(0,30), code, families) phi.MLE <- origin + modmat.renew %*% beta mu.MLE <- transformSaturated(phi.MLE, data.fit, from = "phi", to = "mu") amat.fit <- matrix(amat, nrow = length(mu.MLE)) fit.MLE <- t(amat.fit) %*% mu.MLE fit.MLE <- as.vector(fit.MLE) a <- a - fit.MLE boot.var.MLE <- (a[, 1] %o% a[, 1]) / ncol(a) for(j in 2:ncol(a)){ boot.var.MLE <- boot.var.MLE + (a[, j] %o% a[, j]) / ncol(a) } sd.MLE <- sqrt(diag(boot.var.MLE)) ratio <- sd.MLE / sd.Efron.mean @ Output is shown for hypothetical individuals who have estimated expected Darwinian fitness greater than 8 where estimation is calculated using the MLE. This table has 5 columns. The first two columns correspond to estimation of expected Darwinian fitness using envelope methodology and its bootstrapped standard error respectively. The next two columns correspond to estimation of expected Darwinian fitness using the MLE and its bootstrapped standard error respectively. The fifth column displays the ratios of estimated standard errors where ratios greater than 1 indicate that the envelope estimator has lower estimated variability. We can see that efficiency gains are obtained for the displayed output. \\ <>= load("techreport_eigenboot_fitboot.RData") nboot <- nboot1 @ <>= tab <- matrix(0, nrow = 144, ncol = 5) tab[, 1] <- rowSums(blah$env.boot.out) / nboot tab[, 2] <- sd.Efron.mean tab[, 3] <- fit.MLE tab[, 4] <- sd.MLE tab[, 5] <- ratio colnames(tab) <- c("env","se(env)","MLE","se(MLE)","ratio") tab[tab[, 3] > 8, ] @ We plot the ratio of estimated standard errors in Figures~\ref{contour-plot} and~\ref{heatmap}. We can see that efficiency gains are obtained when using envelope methodology for hypothetical individuals that overlap the majority of the original data. \\ %The gains are less substantial than those found when using the indices suggested by BIC and the LRT of size $\alpha = 0.05$. \\ <>= z1 <- simdata30nodes$z1 z2 <- simdata30nodes$z2 max.ind <- which(tab[,1] == max(tab[,1])) x.max = fred[max.ind,1] y.max = fred[max.ind,2] plot(z1,z2, pch = 20, col = "grey") points(x = x.max, y = y.max, col = "red", pch = 20) max.ind.env <- which(tab[,3] == max(tab[,3])) x.max.env = fred[max.ind.env,1] y.max.env = fred[max.ind.env,2] points(x = x.max.env, y = y.max.env, col = "blue", pch = 20) contour(z1.cand,z2.cand, t(matrix(tab[,5],nrow = nx)), levels = c(0.9,1,1.5,2,3), labcex = 1, add=TRUE) @ \begin{figure}[h!] \begin{center} <>= z1 <- simdata30nodes$z1 z2 <- simdata30nodes$z2 max.ind <- which(tab[,1] == max(tab[,1])) x.max = fred[max.ind,1] y.max = fred[max.ind,2] plot(z1,z2, pch = 20, col = "grey") points(x = x.max, y = y.max, col = "red", pch = 20) max.ind.env <- which(tab[,3] == max(tab[,3])) x.max.env = fred[max.ind.env,1] y.max.env = fred[max.ind.env,2] points(x = x.max.env, y = y.max.env, col = "blue", pch = 20) contour(z1.cand,z2.cand, t(matrix(tab[,5],nrow = nx)), levels = c(0.9,1,1.5,2,3), labcex = 1, add=TRUE) @ \end{center} \caption{Contour plot for the ratios of $\hat{se}\left(h(\hat{\tau})\right)$ to $\hat{se}\left(h(\hat{\tau}_{\text{env}})\right)$. Ratios greater than 1 indicate efficiency gains using envelope methodology. The point in blue corresponds to the highest estimated expected Darwinian fitness value using envelope methodology and the MLE. } \label{contour-plot} \end{figure} \newpage <>= filled.contour(z1.cand,z2.cand, t(matrix(tab[,5],nrow = nx)), levels = c(0,0.9,1,1.5,2,5)) @ \begin{figure}[h!] \begin{center} <>= filled.contour(z1.cand,z2.cand, t(matrix(tab[,5],nrow = nx)), levels = c(0,0.9,1,1.5,2,5)) @ \end{center} \caption{Heat map for the ratios of $\hat{se}\left(h(\hat{\tau})\right)$ to $\hat{se}\left(h(\hat{\tau}_{\text{env}})\right)$. Ratios greater than 1 indicate efficiency gains using envelope methodology. } \label{heatmap} \end{figure} \newpage We now perform our methods with respect to the reducing subspace $\widehat{G}_2$ selected by AIC. Output is shown for hypothetical individuals who have estimated expected Darwinian fitness greater than 8 where estimation is calculated using the MLE. <>= load("ex1-AIC.RData") @ <>= tab2 <- matrix(0, nrow = 144, ncol = 5) colnames(tab2) <- c("env","se(env)","MLE","se(MLE)","ratio") tab2[, 1] <- rowSums(blah$env.boot.out) / nboot tab2[, 2] <- sd.Efron.mean tab2[, 3] <- fit.MLE tab2[, 4] <- sd.MLE tab2[, 5] <- ratio tab2[tab2[, 3] > 8, ] @ The ratios of estimated standard errors are plotted in Figures~\ref{contour-plot-AIC} and~\ref{heatmap-AIC}. We can see in these figures and in the table above that efficiency gains are obtained using envelope methodology. The gains are less substantial than those found when using the indices suggested by BIC and the LRT of size $\alpha = 0.05$. \\ <>= z1 <- simdata30nodes$z1 z2 <- simdata30nodes$z2 max.ind <- which(tab[,1] == max(tab[,1])) x.max = fred[max.ind,1] y.max = fred[max.ind,2] plot(z1,z2, pch = 20, col = "grey") points(x = x.max, y = y.max, col = "red", pch = 20) max.ind.env <- which(tab[,3] == max(tab[,3])) x.max.env = fred[max.ind.env,1] y.max.env = fred[max.ind.env,2] points(x = x.max.env, y = y.max.env, col = "blue", pch = 20) contour(z1.cand,z2.cand, t(matrix(tab[,5],nrow = nx)), levels = c(0.9,1,1.5,2,3), labcex = 1, add=TRUE) @ \begin{figure}[h!] \begin{center} <>= z1 <- simdata30nodes$z1 z2 <- simdata30nodes$z2 max.ind <- which(tab[,1] == max(tab[,1])) x.max = fred[max.ind,1] y.max = fred[max.ind,2] plot(z1,z2, pch = 20, col = "grey") points(x = x.max, y = y.max, col = "red", pch = 20) max.ind.env <- which(tab[,3] == max(tab[,3])) x.max.env = fred[max.ind.env,1] y.max.env = fred[max.ind.env,2] points(x = x.max.env, y = y.max.env, col = "blue", pch = 20) contour(z1.cand,z2.cand, t(matrix(tab[,5],nrow = nx)), levels = c(0.9,1,1.5,2,3), labcex = 1, add=TRUE) @ \end{center} \caption{Contour plot for the ratios of $\hat{se}\left(h(\hat{\tau})\right)$ to $\hat{se}\left(h(\hat{\tau}_{\text{env}})\right)$ using AIC. Ratios greater than 1 indicate efficiency gains using envelope methodology. The point in blue corresponds to the highest estimated expected Darwinian fitness value using envelope methodology and MLE. } \label{contour-plot-AIC} \end{figure} \newpage <>= filled.contour(z1.cand,z2.cand, t(matrix(tab[,5],nrow = nx)), levels = c(0,0.9,1,1.1,1.25,1.5,2)) @ \newpage \begin{figure}[h!] \begin{center} <>= filled.contour(z1.cand,z2.cand, t(matrix(tab[,5],nrow = nx)), levels = c(0,0.9,1,1.1,1.25,1.5,2)) @ \end{center} \caption{Heat map for the ratios of $\hat{se}\left(h(\hat{\tau})\right)$ to $\hat{se}\left(h(\hat{\tau}_{\text{env}})\right)$ using AIC. Ratios greater than 1 indicate efficiency gains using envelope methodology. } \label{heatmap-AIC} \end{figure} \newpage \begin{figure} \begin{center} \setlength{\unitlength}{0.45 in} \thicklines \begin{picture}(4.0,0.75)(-0.1,-0.3) \put(0,0){\makebox(0,0){$1_{\hphantom{0}}$}} \put(1,0){\makebox(0,0){$Y_1$}} \put(2,0){\makebox(0,0){$Y_2$}} \multiput(0.35,0)(1,0){2}{\vector(1,0){0.3}} \end{picture} \end{center} \caption{Graphical structure of the aster model for the data in Example 2. The first arrow corresponds to survival which is a Bernoulli random variable. The second arrow corresponds to reproduction count conditional on survival which is a zero-truncated Poisson random variable.} \label{astergraph} \end{figure} \section{Replicating Example 2} We now reproduce the second example in \citet*{eck}. The dataset comes from \citet*{lowry} and the study in which the data is obtained investigates the role of chromosomal inversions in adaptation and speciation. Phenotypic traits and covariates are recorded for 2313 yellow monkeyflowers, \emph{Mimulus guttatus}. The lifecycle of the individual \emph{M. guttatus} flowers is depicted in Figure~\ref{astergraph}. Much of the details necessary to the reproduction of this example are similar to those in the first example. The data and corresponding \texttt{.RData} file are loaded in. The data is then manipulated so that \texttt{aster} software can be used and then the quantities needed for the bootstrap functions \texttt{fit.Efron} and \texttt{secondboot} are computed. \\ The code below obtains the relevant quantities needed to fit the aster model corresponding to this data and manipulates the dataset so that aster models can be fit. \\ <>= library(aster2) library(envlpaster) data(Mguttatus) @ <>= data <- Mguttatus redata <- Mguttatus.redata vars <- quantities$vars pred <- quantities$pred group <- quantities$group code <- quantities$code fam <- quantities$fam nnode <- length(vars) n <- nrow(redata) / nnode families <- quantities$families root <- redata$root fit <- redata$fit varvar <- quantities$varvar idvar <- quantities$idvar @ The \texttt{.RData} file containing the calculations done in this example is loaded in. \\ <<>>= load("main-all-terms-DavidLowrydata.RData") @ We fit the main effects model. This model includes genetic background, site, inversion, and ecotype as predictors. \\ <>= m.main <- aster(resp ~ varb + fit:(gen_bac + site + inversion + type), pred, fam, varvar = varvar, idvar = idvar, data = redata, root = root) @ There are eight predictors comprising $\tau$ and we partition $\tau$ into $(\gamma^T, \upsilon^T)^T$ where $\gamma \in \R^2$ are nuisance parameters and $\upsilon \in \R^6$ are relevant to the estimation of expected Darwinian fitness. In this example $\Sigma_{\upsilon,\upsilon}$ is the lower 6 by 6 block of the Fisher information matrix $\Sigma$. These quantities, as well as $\beta$ are extracted from the aster model fit. \\ <>= target <- which(grepl("fit", names(m.main$coef))) avar.targ <- m.main$fisher[target, target] modmat.mat <- matrix(m.main$modmat, nrow = n*nnode) tau <- crossprod(modmat.mat, redata$resp) tau.targ <- tau[target] beta <- m.main$coef p <- length(beta) @ We now call on \texttt{selection} to see which reducing subspace is suggested for envelope estimation. \\ <>= foo <- selection(tau, target, m.main, data, type = "mean-value", method = "eigen", alpha = 0.01) @ All of the selection criteria are in alignment, and they suggest using the reducing subspace that is the sum of the first, second, third, and sixth eigenspaces of $\widehat{\Sigma}_{\upsilon,\upsilon}$. \\ <<>>= foo$bic; foo$aic; foo$LRT @ The variability of estimated expected Darwinian fitness is now estimated using using both \texttt{fit.boot.Efron} and \texttt{secondboot} to perform the bootstrap procedure outlined in Algorithm 2. First, the hypothetical \emph{M. guttatus} flowers in which expected Darwinian fitness is estimated are created. All variables in our original aster model are factor variables, therefore each hypothetical \emph{M. guttatus} flower possesses a unique factor-level combination. \\ <>= a <- levels(test$gen_bac) b <- levels(test$site) c <- levels(test$inversion) d <- levels(test$type) fred <- expand.grid(a = a, b = b, c = c, d = d) colnames(fred) <- c("gen_bac","site","inversion","type") fred$sur_flw <- 1 fred$flws <- 1 @ The dataset in long form and the model matrix corresponding to the main-effects model with these hypothetical \emph{M. guttatus} flowers is now created. \\ <>= renewdata <- reshape(fred, varying = list(vars), direction = "long", timevar = "varb", times = as.factor(vars), v.names = "resp") fit.renew <- as.numeric(grepl("flws", renewdata$varb)) renewdata$fit <- fit.renew renewdata$root <- 1 npop <- nrow(fred) modmat.renew <- model.matrix(m.main$formula, data = renewdata) modmat.renew[, 2] <- 1 - modmat.renew[, 2] index <- !(colnames(modmat.renew) %in% m.main$dropped) modmat.renew <- modmat.renew[, index] @ Next, \texttt{amat} needs to be specified. \texttt{amat} specifies which nodes in the graphical structure for the hypothetical \emph{M. guttatus} flowers correspond to Darwinian fitness nodes. \texttt{amat} can either be a three-dimensional array or a matrix. We construct a three-dimensional array as in \citet*{geyer4}. \\ <>= amat <- matrix(0, nrow = npop, ncol = nnode * npop) for(j in 1:npop) amat[j, npop + j] <- fit.renew[npop + j] @ The \texttt{fit.boot.Efron} function performs the top level of bootstrapping for our example. The calculation below is not performed in this script. The calculation is performed in \texttt{main-all-terms-DavidLowrydata.R} and is sourced into this document from the \texttt{main-all-terms-DavidLowrydata.RData} file. \\ <>= set.seed(13) nboot <- 100 blah <- fit.boot.Efron(model = m.main, nboot = nboot, index = target, vectors = foo$bic, data = data, amat = amat, newdata = fred, modmat.new = modmat.renew, renewdata = renewdata, criterion = "BIC", method = "eigen", quiet = FALSE) @ We now call on \texttt{secondboot} to perform the second level of bootstrapping for our example. The calculations involving \texttt{secondboot} are not performed in this script. These calculations are also performed in \texttt{main-all-terms-DavidLowrydata.R} and are sourced into this document from the \texttt{main-all-terms-DavidLowrydata.RData} file. The bootstrap sample size for the second level of bootstrapping is 500. The script below performs this task. This script also calculates the bootstrapped estimator of the variability of expected Darwinian fitness incorporating envelope methodology. \\ <>= nboot2 <- 500 internal <- function(k){ set.seed(13) bar <- secondboot(k, out = blah, model = m.main, data = data, nboot2 = nboot2, index = target, newdata = fred, amat = amat, method = "eigen") return(bar$sd.Efron) } ## 5 cores are used to perform the second level of bootstrapping ncores <- 5; nsamp <- nboot / ncores blah.second <- mclapply(1:ncores, mc.cores = ncores, FUN = function(x){ int <- ((nsamp*(x-1) + 1) : (nsamp*x)) out <- lapply(int, FUN = internal) return(out) } ) ## estimated standard error using sd.Efron.mean <- rep(0, npop) for(k in 1:ncores){ for(j in 1:nsamp){ sd.Efron.mean <- sd.Efron.mean + blah.second[[k]][[j]] / nboot } } @ Expected Darwinian fitness and its variability are now estimated for the hypothetical \emph{M. guttatus} flowers using the MLE. In this setting, the variability is estimated using the output from \texttt{fit.boot.Efron} to construct the bootstrap estimator of variability given by \[ \frac{1}{99} \sum_{b=1}^{100} \left(h(\hat{\mu}^{(b)}) - h(\hat{\mu})\right)\left(h(\hat{\mu}^{(b)}) - h(\hat{\mu})\right)^T \] where $h(\hat{\mu})$ is estimated expected Darwinian fitness for the hypothetical \emph{M. guttatus} flowers using the MLE and $h(\hat{\mu}^{(b)})$ are the bootstrapped estimates. <>= data.renew <- asterdata(fred, vars = vars, pred = pred, group = rep(0, length(vars)), code = code, families = families) origin.renew <- m.main$origin[1:npop,] origin.renew <- as.vector(origin.renew) mu.renew <- transformUnconditional(beta, data.renew, modmat = modmat.renew, from = "beta", to = "mu", offset = origin.renew) fit <- amat %*% mu.renew dev <- blah$MLE.boot.out - as.vector(fit) x <- (dev[, 1] %*% t(dev[, 1])) for(j in 2:nboot){ x <- x + (dev[, j] %*% t(dev[, j])) } @ The ratio of the the estimated standard error using envelope estimation obtained from Algorithm 2 to the estimated standard error using the MLE is computed. \\ <>= sd.MLE <- sqrt(diag(x) / (nboot - 1)) ratio <- sd.MLE / sd.Efron.mean @ The output below is shown for hypothetical \emph{M. guttatus} flowers who have estimated expected Darwinian fitness greater than 7 where estimation is calculated using MLE. The output table table has 5 columns. The first two columns correspond to estimation of expected Darwinian fitness using envelope methodology and its bootstrapped standard error respectively. The next two columns correspond to estimation of expected Darwinian fitness using MLE and its bootstrapped standard error respectively. The fifth column displays the ratios of estimated standard errors where ratios greater than 1 indicate that the envelope estimator has lower estimated variability. We can see that efficiency gains are obtained for the displayed output. \\ <<>>= tab3 <- matrix(0, nrow = npop, ncol = 5) colnames(tab3) <- c("env","se(env)","MLE","se(MLE)","ratio") tab3[, 1] <- rowSums(blah$env.boot.out) / nboot tab3[, 2] <- sd.Efron.mean tab3[, 3] <- fit tab3[, 4] <- sd.MLE tab3[, 5] <- ratio tab3[(tab3[, 3] > 7), ] @ \begin{thebibliography}{} %\bibitem[Cook, et al.(2010)Cook, Li, and Chiaromonte]{cook} %Cook, R.~D., Li, B., Chiaromonte, F. (2010). %\newblock Envelope models for parsimonious and %efficient multivariate linear regression %\newblock \emph{Statistica Sinica}, \textbf{20}, 927-1010. \bibitem[Cook and Zhang(2014)Cook and Zhang]{cook2} Cook, R.~D., Zhang, X. (2014). \newblock Foundations for Envelope Models and Methods. \newblock \emph{JASA}, \textbf{110:510}: 599-611. \bibitem[Cook and Zhang(2015)Cook and Zhang]{cook3} Cook, R.~D., Zhang, X. (2015). \newblock Algorithms for Envelope Estimation. \newblock \emph{Journal of Computational and Graphical Statistics}, Published online. \url{DOI:10.1080/10618600.2015.1029577}. %\bibitem[Geyer(2014)]{aster-package} %Geyer, C.~J. (2014). %\newblock R package \texttt{aster} (Aster Models), version 0.8-30. %\newblock \url{http://cran.r-project.org/package=aster}. \bibitem[Eck(2015)]{envlp-package} Eck, D.~J. (2015). \newblock R package \texttt{envlpaster}, version 0.1-1. \newblock \url{http://cran.r-project.org/package=envlpaster}. \bibitem[Eck, Geyer, and Cook(2015)Eck, Geyer, and Cook]{eck} Eck, D.~J., Geyer, C.~J., and Cook, R.~D. (2015). \newblock An Application of Envelope Methodology and Aster Models. \newblock \emph{in preparation}. \bibitem[Efron(2014)Efron]{efron} Efron, B. (2014). \newblock Estimation and Accuracy After Model Selection. \newblock \emph{JASA}, \textbf{109:507}, 991-1007. %\bibitem[Etterson, J.~R., et al.]{etterson} %Etterson, J.~R., et al. (2001). %\newblock Constraint to Adaptive Evolution in Response to %Global Warming %\newblock \emph{Science}, \textbf{294}, 151-154. %\bibitem[Fisher (1930)]{fisher} %Fisher, R.~A. (1930). %\newblock The Genetical Theory of Natural Selection. %\newblock Oxford: Clarendon Press. %\bibitem[Geyer(2010)]{aster2-package} %Geyer, C.~J. (2010). %\newblock R package aster2 (aster models), version 0.1. %\newblock \url{http://cran.r-project.org/package=aster2}. %\bibitem[Geyer(2014)]{aster-package} %Geyer, C.~J. (2014). %\newblock R package \texttt{aster} (Aster Models), version 0.8-30. %\newblock \url{http://cran.r-project.org/package=aster}. %\bibitem[Geyer, et al.(2007)Geyer, Wagenius, and Shaw]{geyer} %Geyer, C.~J., Wagenius, Stuart, Shaw, R. (2007). %\newblock Aster models for life history analysis. %\newblock \emph{Biometrika}, \textbf{94}, 415-426. %\bibitem[Geyer, et al.(2013)]{geyer2} %Geyer, C.~J., Ridley, C.~E., Latta, R.~G., Etterson, J.~R., Shaw, R.~G. (2013). %\newblock Local adaptation and genetic effects on fitness: calculations %for exponential family models with random effects %\newblock \emph{Annals of Applied Statistics}, \textbf{7}, 1778-1795. \bibitem[Geyer, C.~J.(2010)Geyer]{geyer3} Geyer, C.~J. (2010). \newblock A Philosophical Look at Aster Models. Technical Report No. 676. School of Statistics, University of Minnesota. \newblock http://purl.umn.edu/57163. \bibitem[Geyer, et al.(2009)Geyer and Shaw]{geyer4} Geyer, C.~J. and Shaw, R. (2009). \newblock Model Selection in Estimation of Fitness Landscapes. Technical Report No. 671. School of Statistics, University of Minnesota. \newblock http://conservancy.umn.edu/handle/11299/56219. %\bibitem[Kingsolver, et al.(2012)Kingsolver, Diamond, Seiter, % and Higgins]{kingsolver-et-al} %Kingsolver, J.~G., Diamond, S.~E., Seiter, S.~A., and Higgins, J.~K. (2012). %\newblock Direct and indirect phenotypic selection on developmental % trajectories in \emph{Manduca sexta}. %\newblock \emph{Functional Ecology}, \textbf{26}, 598--607. %\bibitem[Lande and Arnold(1983)]{lande} %Lande, R., Arnold, S. (1983). %\newblock The measurement of selection on correlated characters. %\newblock \emph{Evolution}, \textbf{37}, 1210-1226. \bibitem[Lowry and Willis(2010)Lowry and Willis]{lowry} Lowry, D.~B. and Willis, J.~H. (2010) \newblock A Widespread Chromosomal Inversion Polymorphism Contributes to a Major Life-History Transition, Local Adaptation, and Reproductive Isolation. \newblock \emph{PLoS Biol}, 8(9): e1000500. \url{doi:10.1371/journal.pbio.1000500}. %\bibitem[Shaw, et al.(2010)Shaw and Geyer]{shaw} %Shaw, R., Geyer, C.~G. (2010). %\newblock Inferring fitness landscapes. %\newblock \emph{Evolution}, \textbf{64}, 2510-2520. %\bibitem[Shaw, et al.(2008)Shaw, Geyer, Wagenius, Hangelbroek, and Etterson]{shaw2} %Shaw, R., Geyer, C.~G., Wagenius, S., Hangelbroek, H., and Etterson, J.~R. (2010). %\newblock Unifying life-history analyses for %inference of fitness and population growth %\newblock \emph{The American Naturalist}, \textbf{172} E35-E47. %\bibitem[Su, et al.(2011)Su and Cook]{su} %Su, Z., Cook, R.~D. (2011). %\newblock Partial envelopes for efficient estimation in %multivariate linear regression %\newblock \emph{Biometrika}, \textbf{98}, 133-146. \end{thebibliography} \section*{A. make the Mguttatus data} In this Appendix we show how the Mguttatus dataset in the \texttt{envlpaster} package was initially created starting from the data for the original analysis (Lowry, et al., 2010), which was a comma separated values (CSV) file. This file can be found at the same location at the University of Minnesota Digital Conservancy (http://conservancy.umn.edu/) where this technical report is found. <<>>= test <- read.csv("Aster_across_sites_DIV1.csv", sep = ",", header = T) @ Now we need to specify the rest of the graph shown in Figure~\ref{astergraph}. One of these variables that specify the graph is obvious. \texttt{vars} names the variables that are pasted together to form the response vector (all the variables at all the nodes of the graph). The others are less obvious. \texttt{pred} specifies the arrows of the graph, and \texttt{group} specifies the lines. In this analysis, there are no groups. If \texttt{pred[j]} is not zero, then there is an arrow from node \texttt{pred[j]} to node \texttt{j}. Otherwise, there is a line from the initial node (marked 1 in the aster graph) to node \texttt{j}. \texttt{code} is an index vector into the families list; \texttt{families[[code[j]]]} is the family for the conditional distribution of the dependence group containing node \texttt{j} given its predecessor node. <>= vars <- c("sur_flw","flws") pred <- c(0,1) group <- rep(0,length(pred)) code <- c(1,2) families <- list("bernoulli", fam.zero.truncated.poisson()) @ Now we make the asterdata object and verify that this aster object is the as the asterdata provided by the \texttt{envlpaster} package. <>= data <- asterdata(data = test, vars = vars, pred = pred, group = group, code = code, families = families) all.equal(data, Mguttatus) @ \section*{B. make the simulated data} The last line of code shows that the generated dataset analyzed in Example 1 is the same as the dataset included in the \texttt{envlpaster} package. <>= set.seed(13) nind <- 3000 ntime <- 10 psurv <- 0.95 prepr <- 0.7 mpois <- 1 theta.surv <- log(psurv) - log(1 - psurv) theta.repr <- log(prepr) - log(1 - prepr) theta.pois <- log(mpois) tau.pois <- mpois/(1 - exp(-mpois)) vars <- as.vector(outer(c("u", "v", "w"), 1:10, paste, sep = "")) vtype <- as.factor(substr(as.character(vars), 1,1)) fam <- rep(1, length(vars)) fam[vtype == "w"] <- 3 pred <- seq(along = vars) - 1 pred[vtype == "u"] <- seq(along = vars)[vtype == "u"] - 3 pred[1] <- 0 root <- matrix(1, nrow = nind, ncol = length(vars)) theta <- root theta[, vtype == "u"] <- theta.surv theta[, vtype == "v"] <- theta.repr theta[, vtype == "w"] <- theta.pois x <- raster(theta, pred, fam, root) dimnames(x) <- list(NULL, vars) dat <- as.data.frame(x) dat <- as.list(dat) dat[["root"]] <- rep(1, nind) npheno <- 10 zbase <- rnorm(nind) for (i in 1:npheno) { labz <- paste("z", i, sep = "") dat[[labz]] <- zbase + rnorm(nind) } dat <- as.data.frame(dat) redata <- reshape(dat, varying = list(vars), direction = "long", timevar = "varb", times = as.factor(vars), v.names = "resp") wind <- grep("w", as.character(redata$varb)) for (labz in grep("z", names(redata), value = TRUE)) { redata[[labz]][-wind] <- 0 } redata$vtype <- as.factor(substr(as.character(redata$varb), 1, 1)) out2 <- aster(resp ~ vtype + 0, pred, fam, varb, id, root, data = redata, type = "conditional") out3 <- aster(resp ~ vtype + 0, pred, fam, varb, id, root, data = redata) redata$year <- as.numeric(substring(as.character(redata$varb), 2)) redata$uyear <- redata$year * as.numeric(as.character(redata$vtype) == "u") out4 <- aster(resp ~ vtype + uyear, pred, fam, varb, id, root, data = redata) ############################### ##### Section 4.2 model 1 ##### out5 <- aster(resp ~ vtype + uyear + z1 + z2 + I(z1^2) + I(z2^2) + I(z1 * z2), pred, fam, varb, id, root, data = redata) # change the covariate data so that it is mean 0 z1 <- dat$z1 z2 <- dat$z2 ascal <- 0.013 quad <- ascal * ((z1 + z2) - (z1^2 + z2^2) + z1 * z2) con <- mean(quad) beta.new <- out5$coefficients beta.new[1:4] <- out4$coefficients beta.new[3] <- beta.new[3] - con beta.new[5:6] <- ascal beta.new[7:8] <- (-ascal) beta.new[9] <- ascal beta.new <- round(beta.new, 3) theta <- predict(out5, model.type = "conditional", parm.type = "canonical", newcoef = beta.new) theta <- matrix(theta, nrow = nrow(out5$x), ncol = ncol(out5$x)) xnew <- raster(theta, pred, fam, root) dimnames(xnew) <- list(NULL, vars) dnew1 <- as.data.frame(xnew) renew <- reshape(dnew1, varying = list(vars), direction = "long", timevar = "varb", times = as.factor(vars), v.names = "resp") redata$resp1 <- renew$resp redata$tau1 <- predict(out5, newcoef = beta.new) redata$phi1 <- predict(out5, parm.type = "canonical", newcoef = beta.new) beta1true <- beta.new out6 <- aster(resp1 ~ vtype + uyear + z1 + z2 + I(z1^2) + I(z2^2) + I(z1 * z2), pred, fam, varb, id, root, data = redata) ###################### # tau envelope model stuff target <- 5:9 m <- length(target) p <- length(out6$coef) avar <- out6$fisher beta <- out6$coef avar.targ <- avar[target,target] beta.targ <- beta[target] mu <- predict(out6, parm.type = "mean.value", model.type = "unconditional") modmat <- out6$modmat modmat.mat <- matrix(modmat, ncol = length(beta)) offset <- as.vector(out6$origin) x <- out6$x tau <- crossprod(modmat.mat, mu) tau.targ <- tau[target] # check whether transformUnconditional and predict.aster # are returning the same quantity fam[which(fam == 3)] <- 2 code <- fam families = list(fam.bernoulli(),fam.zero.truncated.poisson()) data <- asterdata(data = dat, vars = vars, pred = pred, group = rep(0, length(vars)), code = code, families = families) tau.foo <- transformUnconditional(parm = beta, modmat.mat, data, from = "beta", to = "tau", offset = offset, tol = 1e-10) # put the envelope truth into the model G <- eigen(avar.targ)$vec[,c(1,4)] P <- tcrossprod(G) tau_true <- tau.foo M <- t(modmat.mat) M2 <- M[target,]; M2 <- P %*% M2 M[target,] <- M2 modmat.mat.env <- t(M) tau_true[target] <- P %*% tau.targ beta.foo <- transformUnconditional(parm = tau_true, modmat.mat, data, from = "tau", to = "beta", offset = offset) theta <- predict(out6, model.type = "conditional", parm.type = "canonical", newcoef = beta.foo) theta <- matrix(theta, nrow = nrow(out6$x), ncol = ncol(out6$x)) xnew <- raster(theta, out6$pred, out6$fam, out6$root) dimnames(xnew) <- list(NULL, vars) xnew <- cbind(xnew, dat[,c(32:41)]) xnew$root <- 1 data <- asterdata(data = xnew, vars = vars, pred = pred, group = rep(0, length(vars)), code = code, families = families) @ <>= all.equal(data, simdata30nodes.asterdata) @ \end{document}