\documentclass{article}

\usepackage{indentfirst}
\usepackage{amsmath}
\usepackage{amsfonts}
% \usepackage{amssymb}
% \usepackage{amscd}
\usepackage{natbib}
\usepackage{url}
% \usepackage{graphicx}
\usepackage[utf8]{inputenc}

\addtolength{\textheight}{\headsep}
\addtolength{\textheight}{\headheight}
\setlength{\headheight}{0 pt}
\setlength{\headsep}{0 pt}

\newcommand{\set}[1]{\{\,#1\,\}}
\newcommand{\real}{\mathbb{R}}

\begin{document}

  \vspace*{0.9375in}
  \begin{center}
    {\bfseries Supplementary Material for the paper
    ``Asymptotics for Constrained Dirichlet Distributions''} \\
    By \\
    Charles J. Geyer and Glen D. Meeden \\
    Technical Report No.~691 \\
    School of Statistics \\
    University of Minnesota \\
    March 20, 2012 \\
%     \today
  \end{center}
  \thispagestyle{empty}
  \cleardoublepage
  \setcounter{page}{1}
  \thispagestyle{empty}

\begin{abstract}
This document is supplementary material for a paper.
It shows how to simulate the linear-equality-and-inequality-constrained
normal distribution that is the large sample approximation to
a similarly constrained Dirichlet posterior.
\end{abstract}

\section{Introduction} \label{sec:intro}

Throughout this document ``the paper'' refers to \citet{dirichlet}.
We are interested in simulating a constrained Dirichlet distribution that is a
posterior for multinomial data using conjugate priors.
According to the discussion in Section~4.2 of the paper,
if $y$ is the vector of multinomial data, then the unconstrained posterior
is Dirichlet with hyperparameter $\hat{\alpha}_n = y + \xi$, where $\xi$ is
the hyperparameter of the prior (which is also Dirichlet).  Here we use
$\xi = 0$, which gives an improper prior but a proper posterior so long
as no component of $y$ is zero.
The constrained Dirichlet posterior restricts the Dirichlet probability
density function to a constraint set determined by a finite set of linear
equality and inequality constraints and renormalizes so it integrates to
one over the constraint set.

According to Remark~6 in the paper, we approximate the posterior with a
normal distribution having unnormalized probability density function
\begin{equation} \label{eq:the-approx}
   h(\lambda)
   =
   \exp\left( - \frac{n}{2} (\lambda - \hat{\lambda}_n)^T
   \widehat{\Lambda}_n^{- 1} (\lambda - \hat{\lambda}_n) \right)
\end{equation}
(this is equation (22) in the paper),
where the scalar $n$ is the multinomial sample size,
the vector $\hat{\lambda}_n$ is the unconstrained posterior
mean $\hat{\alpha}_n / n$,
and the matrix $\widehat{\Lambda}_n$ is diagonal with diagonal entries that
are the corresponding components of the vector $\hat{\lambda}_n$.
The constrained asymptotic approximation of the posterior restricts
\eqref{eq:the-approx} to the constraint set
and renormalizes so it integrates to one over the constraint set.

We impose the constraints in two steps: equality constraints first, and then
inequality constraints.  The equality constraints constrain the parameter
vector $\lambda$ to an affine subspace.

Suppose the equality constraints have the form $B \lambda = a$ where $B$
is a known matrix and $a$ is a known vector, let $V$ denote the vector
subspace
$$
   V = \set{ w \in \real^d : B w = 0 }
$$
(this is equation (19) in the paper), and let $M$ be a matrix whose columns
are a basis for $V$.  Then every $\lambda$ in the equality constraint set
has the form
\begin{equation} \label{eq:beta-to-lambda}
   \lambda = \lambda_0 + M \beta
\end{equation}
for some $\beta$, where $\lambda_0$ is any point in the equality constraint
set, that is, any vector satisfying $B \lambda_0 = a$.
We can think of $\beta$ as a new parameter for the problem that, like $\lambda$
has an approximate normal distribution, and, moreover, unlike $\lambda$,
has a nondegenerate normal approximation.  The mean and variance of this
normal approximation are given in Remark~6 of the paper as
\begin{equation} \label{eq:beta-star}
   \beta^*_n
   =
   \left(M^T \widehat{\Lambda}_n^{- 1} M\right)^{- 1}
   M^T \widehat{\Lambda}_n^{- 1} (\hat{\lambda}_n - \lambda_0)
\end{equation}
(this is equation (25) in the paper) and
\begin{equation} \label{eq:beta-var}
   n^{- 1} (M^T \widehat{\Lambda}_n^{- 1} M)^{- 1}
\end{equation}
We simulate $\lambda$ having the normal approximation to the equality
constrained Dirichlet posterior by simulating $\beta$ multivariate normal
with mean vector \eqref{eq:beta-star} and variance matrix \eqref{eq:beta-var}
and then transforming to the original parameter $\lambda$
via \eqref{eq:beta-to-lambda}.

We then impose the inequality constraints.  In the simulation we reject
any $\lambda$ that do not satisfy the constraints.

\section{R Package RCDD}

<<foo,include=FALSE,echo=FALSE>>=
options(width = 70)
@
We use the R statistical computing environment \citep{rcore} in our analysis.
It is free software and can be obtained from
\url{http://cran.r-project.org}.  Precompiled binaries
are available for Windows, Macintosh, and popular Linux distributions.
We use the contributed package \verb@rcdd@ \citep{package-rcdd}.
If R has been installed, but this package has
not yet been installed, do
\begin{verbatim}
install.packages("rcdd")
\end{verbatim}
from the R command line
(or do the equivalent using the GUI menus if on Apple Macintosh
or Microsoft Windows).  This may require root or administrator privileges.

Assuming the \verb@rcdd@ package has been installed, we load it
<<library>>=
suppressPackageStartupMessages(library(rcdd))
@
<<baz,include=FALSE,echo=FALSE>>=
baz <- library(help = "rcdd")
baz <- baz$info[[1]]
baz <- baz[grep("Version", baz)]
baz <- sub("^Version: *", "", baz)
bazzer <- paste(R.version$major, R.version$minor, sep = ".")
@
The version of the package used to make this document
is \Sexpr{baz}.
The version of R used to make this document is \Sexpr{bazzer}.

This entire document and all of the calculations shown were made using
the R command \texttt{Sweave} and hence are exactly reproducible by
anyone who has R and the R noweb (RNW) file from which it was created.
Both the RNW file and and the PDF document produced from it will be made
available at the University of Minnesota Digital Conservancy.

Not only can one exactly reproduce the results in the printable document,
one can also modify the parameters of the simulation and get different
results.  Anything at all can be changed once one has the RNW file.

In particular, we set the ``seed'' of the random number generator
<<set-seed>>=
set.seed(42)
@
so that every time this RNW file is run it produces the same results.
Changing the argument of \texttt{set.seed} or removing this chunk of
R code will produce different results.

<<dimension>>=
d <- 9
n <- 1000
if (d < 7) stop("need dimension at least 7")
@
These statements choose the dimension of the parameter space
(\Sexpr{d}) and the multinomial sample size (\Sexpr{n}).
Modifying either or both of these statements will change the simulation
accordingly.

\section{The Constraints}

We set up the constraints as an RCDD H-representation.
First, the constraints defining the unit simplex.
<<simplex>>=
hrep <- makeH(- diag(d), rep(0, d), rep(1, d), 1)
@
These are the constraints that the components of the parameter are nonnegative
and sum to one.

Second, a constraint on the mean of a certain random variable.
Let $X$ be a random variable taking values 1, $\ldots,$ $d$ with probabilities
given by the vector $\lambda$.
<<mean>>=
x <- d2q(1:d)
mu <- qdq(qsum(x), d2q(d))
hrep <- addHeq(x, mu, hrep)
@
This equality constraint requires that the mean of $X$ be
$\mu = (d + 1) / 2 = \Sexpr{(d + 1) / 2}$.

Third, constraints on the median of the same random variable.
<<median>>=
dev.from.mean <- qmq(x, rep(mu, d))
dev.from.mean.plus.two <- qpq(dev.from.mean, rep("2", d))
dev.from.mean.minus.two <- qmq(dev.from.mean, rep("2", d))
hrep <- addHin(as.numeric(qsign(dev.from.mean.plus.two) < 0),
    "1/2", hrep)
hrep <- addHin(as.numeric(qsign(dev.from.mean.minus.two) > 0),
    "1/2", hrep)
@
These inequality constraints require that the median of $X$ be
in the closed interval with endpoints $\mu - 2 = \Sexpr{q2d(mu) - 2}$
and $\mu + 2 = \Sexpr{q2d(mu) + 2}$.

We now want to add some constraints on the variance of $X$ but do not
know what is reasonable.  Since any linear constraint is maximized or
minimized at an extreme point of the constraint set, we check what are
the extreme points of the current constraint set and what the variance
of $X$ is at each.
<<extreme-variance>>=
vout <- scdd(hrep, incidence = TRUE)
extremes <- vout$output[ , - c(1, 2)]
squared.dev.from.mean <- qxq(dev.from.mean,
    dev.from.mean)
extreme.var <- qmatmult(extremes, cbind(squared.dev.from.mean))
extreme.var <- as.vector(extreme.var)
extreme.var[order(q2d(extreme.var))]
more.extreme.var <- extreme.var[sapply(vout$incidence,
    function(foo) nrow(hrep) %in% foo)]
more.extreme.var[order(q2d(more.extreme.var))]
@
The points which are rows of the matrix \texttt{extremes} are the extreme
points of the current constraint set (represented by \texttt{hrep}).
The components of the vector \texttt{extreme.var} are the values of the
variance of the random variable $X$ at those extreme points.
The components of the vector \texttt{more.extreme.var} are the values for the
subset of those points at which the constraint determined by the last row
of \texttt{hrep} is binding (holds with equality).  This is the upper bound
median constraint, so this makes the point $\mu + 2 = \Sexpr{q2d(mu) + 2}$
a median of the random variable $X$.

Now we find the upper and lower quartiles of those variance values.
<<quarter>>=
fred <- q2d(more.extreme.var)
sally <- quantile(fred, type = 1)[c(2, 4)]
varbound <- more.extreme.var[fred %in% sally]
varbound <- unique(varbound)
varbound <- varbound[order(q2d(varbound))]
varbound
@
Finally, we constrain the variance of $X$ to be between those bounds.
<<variance>>=
hrep <- addHin(qneg(squared.dev.from.mean),
    qneg(varbound[1]), hrep)
hrep <- addHin(squared.dev.from.mean,
    varbound[2], hrep)
@
This completes the construction of the constraint set.
The R object \texttt{hrep} specifies it.

\section{Faces}

This is a fairly complicated convex polytope.
<<faces>>=
fout <- allfaces(hrep)
dims <- unlist(fout$dimension)
length(dims)
f <- tabulate(dims + 1, nbins = d + 1)
f
@
There is one (improper) face of dimension $d - 2$, which is the whole
constraint set.  The dimension is $d - 2$ because there are two equality
constraints: the components of $\lambda$ sum to one and the mean of $X$ is
$\mu$.  There is also one (improper) face which is the empty set (and which
is not counted above, since the function \texttt{allfaces} only lists
nonempty faces).
There are \Sexpr{f[1]} vertices (faces consisting of a single point),
\Sexpr{f[2]} edges (faces which are line segments), \Sexpr{f[d - 2]}
facets (proper faces of maximal dimension, in this case $d - 3$),
and \Sexpr{f[d - 3]} ridges (faces of dimension one less than the dimension
of facets, in this case $d - 4$).

\section{Simulation Truth}

In order that our simulated data be a good test case, we want the inequality
constraints to have effect.  For this reason we choose the simulation truth
parameter value (the $\lambda$ we use to simulate multinomial data) to satisfy
two of the four complicated inequality constraints with equality
(where ``complicated'' means not the nonnegativity constraints,
which are all satisfied with strict inequality).
We choose the two upper bound constraints, which are the last row and the
third to last row of \texttt{hrep}.

We make the simulation truth parameter value
a relative interior point of the ridge satisfying these two upper bound
constraints with equality as well as the equality constraints.
<<interior>>=
ncons <- nrow(hrep)
inies <- sapply(fout$active.set,
    function(foo) all(c(ncons - 2, ncons) %in% foo))
inies.max.dim <- max(unlist(fout$dimension[inies]))
inies.of.max.dim <- inies &
    unlist(fout$dimension) == inies.max.dim
lambda.truth <- fout$relative.interior.point[inies.of.max.dim]
length(lambda.truth) == 1
lambda.truth <- lambda.truth[[1]]
lambda.truth <- q2d(lambda.truth)
lambda.truth
@

Clearly this $\lambda$ satisfies the nonnegativity constraints.
We check ``by hand'' using the most obvious code, that it satisfies the
other constraints.
<<check>>=
x <- q2d(x)
mu <- q2d(mu)
sum(lambda.truth)
sum(lambda.truth * x)
sum(lambda.truth[x < mu - 2])
sum(lambda.truth[x > mu + 2])
sum(lambda.truth * (x - mu)^2)
q2d(varbound)
@
We see that, indeed, all of the constraints are satisfied at this point,
and two of the complicated inequality constraints are satisfied
with equality, and, of course, this is all that are possible to satisfy
with equality, because the median cannot simultaneously be
at both its lower and upper bound and similarly for the variance.

\section{Data}

Make up data.
<<data>>=
y <- rmultinom(1, size = n, prob = lambda.truth)
y <- as.vector(y)
if (any(y <= 0)) stop("must have all y values strictly positive")
y
@

\section{Remark}

None of the work done to this point in this document would need to be
done for a real data analysis.  The data vector \texttt{y} would be
given (it would be the data).  The equality and inequality constraints
would be determined by prior knowledge (perhaps influenced by other data
about the random variable $X$ that is involved in the complicated constraints).
A real data analysis would start here.

\section{Affine Hull}

We determine the affine hull of the constraint set.
<<affine>>=
equalities <- hrep[ , 1] == "1"
equalities

av <- scdd(hrep[equalities, ])$output
av.point <- av[av[ , 1] == "0", ]
av.lines <- av[av[ , 1] == "1", ]
av.point <- as.vector(av.point[- c(1, 2)])
av.lines <- av.lines[ , - c(1, 2)]
av.point
av.lines
@
The vector \texttt{av.point} is a point in the affine hull of the constraint
set.  The rows of the matrix \texttt{av.lines} are a basis for the tangent
space of the affine hull of the constraint set (the vector space parallel
to it).

We now change notation to follow Remark~6 of the paper.
<<affine-too>>=
m <- t(q2d(av.lines))
lambda.zero <- q2d(av.point)
@
Our \texttt{m} is $M$ in the paper, and our \texttt{lambda.zero}
is $\lambda_0$ in the paper.

We check to see that these objects behave well with respect to
the constraints (as they must if the work above is correct).
First we check that $\lambda_0$ satisfies the equality constraints
(it does not satisfy the inequality constraints, and does not need to).
<<check-affine-point>>=
sum(lambda.zero)
sum(x * lambda.zero)
@
Second we check that the columns of $M$ satisfy the equality constraints
with the right-hand side changed to zero (because movement along these vectors 
should go from $\lambda_0$ to other points satisfying the equality constraints).
<<check-affine-lines>>=
as.vector(rbind(rep(1, d)) %*% m)
as.vector(rbind(x) %*% m)
@

\section{Point Estimates}

<<point-estimates>>=
alpha.hat <- y
lambda.hat <- alpha.hat / n
@
In notation of Section~4.2 of the paper, we are using the improper
prior determined by setting the hyperparameter $\xi$ to the zero vector.
Our \texttt{alpha.hat} is $\hat{\alpha}_n$ in the paper, and
our \texttt{lambda.hat} is $\hat{\lambda}_n$ in the paper.

\section{Beta}

Now we find the vector \eqref{eq:beta-star} and the matrix \eqref{eq:beta-var}
that are the mean and variance of the normal distribution for the new
parameter $\beta$ that lives on the affine hull.

First, we find $\beta^*_n$ given by \eqref{eq:beta-star}.
<<beta-star>>=
lout <- lm(lambda.hat - lambda.zero ~ 0 + m,
    weights = 1 / lambda.hat)
beta.star <- lout$coefficients
@
We can also do this by literally implementing \eqref{eq:beta-star}.
<<beta-star-too>>=
beta.star.too <- solve(t(m) %*% diag(1 / lambda.hat) %*% m) %*%
    t(m) %*% diag(1 / lambda.hat) %*% cbind(lambda.hat - lambda.zero)
all.equal(as.vector(beta.star), as.vector(beta.star.too))
@

Second, we find the variance matrix given by \eqref{eq:beta-var}.
<<beta-var>>=
varmat <- solve(t(m) %*% diag(1 / lambda.hat) %*% m) / n
@

\section{Simulate Equality Constrained Posterior}

<<simulate-lambda>>=
library(MASS)
nboot <- 1e5
beta <- mvrnorm(nboot, beta.star, varmat)
lambda <- beta %*% t(m)
lambda <- sweep(lambda, 2, lambda.zero, "+")
@
This code simulates $\beta$ vectors (each row of the matrix \texttt{beta}
is one such simulation.  Then we transform to $\lambda$ using
\eqref{eq:beta-to-lambda},
except since \texttt{beta} is actually a matrix, must use sweep instead of add.

Each row of the matrix \texttt{lambda} is a simulated normal random vector
having the required mean vector and variance matrix.  We check that all
of these simulations satisfy the equality constraints.
<<lambda-check>>=
range(lambda %*% cbind(x))
range(apply(lambda, 1, sum))
@

\section{Apply Inequality Constraints to Simulation}

Now we apply the inequality constraints to the simulated realizations
of the unconstrained posterior.  To do this we need the constraints
in the form $B \lambda \le a$.
<<inequality-constraints>>=
inequalities <- hrep[ , 1] == "0"
bmat <- q2d(hrep[inequalities, ])
avec <- as.vector(bmat[ , 2])
bmat <- bmat[ , - c(1, 2)]
bmat <- (- bmat)
@
This code makes matrix \texttt{bmat} and vector \texttt{avec}, which
are $B$ and $a$ in mathematical notation, that determine the inequality
constraints as described.

Since \texttt{lambda} is actually a matrix, we must sweep instead of add
<<rejection>>=
goodies <- lambda %*% t(bmat)
goodies <- sweep(goodies, 2, avec)
goodies <- apply(goodies < 0, 1, all)
mean(goodies)
@

Our Monte Carlo (MC) sample is the ``goodies.''
<<goodies>>=
lambda <- lambda[goodies, ]
@
Now we check --- the dumb way to be safe --- that our MC sample does indeed
satisfy the constraints.
<<check-final>>=
dim(lambda)
range(apply(lambda, 1, min))
range(apply(lambda, 1, sum))
range(apply(lambda, 1, function(lambda) sum(lambda[x < mu - 2])))
range(apply(lambda, 1, function(lambda) sum(lambda[x > mu + 2])))
range(apply(lambda, 1, function(lambda)
    sum(q2d(squared.dev.from.mean) * lambda)))
q2d(varbound)
@

\section{Remark}

That ends the simulation.  The rows of the matrix \texttt{lambda} are
independent and identically distributed Monte Carlo simulations of the
(approximate, large sample, asymptotic) constrained posterior distribution.
 
\section{Importance Weights}

Suppose we were to use the asymptotic
approximation as an importance sampling distribution to sample the
correct (finite $n$) distribution.

The Dirichlet posterior distribution has unnormalized probability density
function
\begin{equation} \label{eq:dirichlet}
   g(\lambda) = \prod_{i = 1}^n \lambda_i^{\hat{\alpha}_i - 1}
\end{equation}
and the asymptotic approximation has unnormalized probability density function
\eqref{eq:the-approx}.  Their ratio is the unnormalized importance weights.
<<importance-weights-unnormalized>>=
log.g.lambda <- apply(lambda, 1, function(lambda)
    sum((alpha.hat - 1) * log(lambda)))
log.h.lambda <- apply(lambda, 1, function(lambda) {
    foo <- lambda - lambda.hat
    bar <- sum(foo^2 / lambda.hat)
    return(- n * bar / 2)})
weigh <- log.g.lambda - log.h.lambda
@
The vector \texttt{weigh} now contains the log unnormalized importance weights.
Because of the constraints, we do not know the normalizing constants for the
distributions.  Hence these unnormalized importance weights are useless for
importance sampling.  By normalizing them, however, we do make them useful
\citep[Section~11.1]{geyer-important}.
<<importance-weights-normalized>>=
weigh <- weigh - max(weigh)
weigh <- exp(weigh)
weigh <- weigh / sum(weigh)
@
If $w_i$ are these importance weights
(components of our R vector \texttt{weigh}) and $\lambda_i$ are the samples
(rows of our R matrix \texttt{lambda}), then for any function $f$ the weighted
average
$$
   \sum_{i = 1}^m w_i f(\lambda_i)
$$
is an unbiased estimator of the true posterior expectation
$E\{ f(\lambda) \mid y \}$, assuming this posterior expectation exists
(that is, that $f$ is integrable with respect to the posterior distribution).

It will be a good estimator so long as the importance weights are not too
uneven.  So let us look at them.  The following R statements make the plot
Figure~\ref{fig:weigh}.
<<label=fig1too,include=FALSE>>=
par(mar = c(5, 4, 1, 1) + 0.1)
plot(log.h.lambda, weigh, log = "y")
@
\begin{figure}
\begin{center}
<<label=fig1,fig=TRUE,echo=FALSE>>=
<<fig1too>>
@
\end{center}
\caption{Normalized Importance Weights versus Log Unnormalized Density
     of Asymptotic Normal Distribution (vertical axis is log scale).}
\label{fig:weigh}
\end{figure}
We see from the figure that the normalized importance weights are not
too unevenly distributed.  One indication of this is Figure~\ref{fig:weigh}.
Another is the ratio of the maximum importance weight to the average.
<<import-max>>=
max(weigh) * length(weigh)
@
Hence we conclude that importance sampling will work well (for these
simulated data \texttt{y}), and we can use our asymptotic approximation
to calculate expectations with respect to the exact posterior distribution.

\section{Discussion}

It is somewhat surprising that we went to all that trouble to simulate
the constrained normal approximation to the constrained exact posterior
and then didn't use the constrained normal approximation except as an
importance sampling distribution.  But this does make sense.  We need
the asymptotic approximation because we do not know how to simulate
an equality constrained Dirichlet distribution (we do know how to simulate
an unconstrained Dirichlet distribution because it is a product of beta
conditionals and marginals, but that is no help), whereas we do know how
to simulate an equality constrained normal distribution.  Thus the usefulness
of the normal approximation is that it allows us to impose
the equality constraints.
This cannot be done by rejection sampling because the equality constrained
distribution lies in a lower-dimensional affine subspace that has probability
zero under the unconstrained distribution.  In contrast the inequality
constraints can be imposed (as we did above) by rejection sampling
(simply reject the points that do not satisfy the inequality constraints).

This importance sampling scheme will work well when the normal approximation
is good (when $n$ is large) and will not work well when it isn't (when $n$
is small).  In this respect, it is no different from any other use of
asymptotic approximation.

Rather than produce a massive simulation study (repeat the above with lots
of different random number generator seeds, different sample sizes \texttt{n}
and different dimensions \texttt{d}), we have produced a ``simulation schema''
that allows interested readers to redo this document with different
choices of these quantities.  Industrious readers can with a bit more work
also change the constraints.

\begin{thebibliography}{}

\bibitem[Geyer(2011)]{geyer-important}
Geyer, C.~J. (2011).
\newblock Importance sampling, simulated tempering, and umbrella sampling.
\newblock In \emph{Handbook of Markov Chain Monte Carlo}.
    Edited by S. P. Brooks, A. E. Gelman, G. L. Jones, and X. L. Meng.
\newblock Chapman \& Hall/CRC, Boca Raton, pp.~295--311.

\bibitem[Geyer and Meeden(2009)]{package-rcdd}
Geyer, C.~J. and Meeden, G.~D. (2009).
\newblock R package \texttt{rcdd} (C Double Description for R).
\newblock Current version 1.1-3 (14 Jul 2009).
\newblock \url{http://www.stat.umn.edu/geyer/rcdd/}.

\bibitem[Geyer and Meeden(submitted)]{dirichlet}
Geyer, C.~J. and Meeden, G.~D. (submitted).
\newblock Asymptotics for constrained Dirichlet distributions.
\newblock Revised and resubmitted to \emph{Bayesian Analysis}.

\bibitem[R Development Core Team(2012)]{rcore}
R Development Core Team (2012).
\newblock R: A language and environment for statistical computing.
\newblock R Foundation for Statistical Computing, Vienna, Austria.
\newblock \url{http://www.R-project.org}.

\end{thebibliography}

\end{document}

\begin{center} \LARGE REVISED DOWN TO HERE \end{center}