## Fitting a Mixture Sampling Model, Part II

In the previous post, we introduced the mixture sampling model and showed that the posterior had an interesting bimodal shape. Here we illustrate the use of two Metropolis random walk algorithms to sample from this posterior.

Recall that the definition of the log posterior is in poisson.mix.R and the data is contained in the vector y.

We begin with a random walk algorithm using an identity var-cov matrix and a scale constant of 0.01. We begin at value (3, 3) and run for 10,000 iterations. We store the fit in the variable fit1.

proposal=list(var=diag(c(1,1)),scale=.01)

start=array(c(3,3),c(1,2))

fit1=rwmetrop(poisson.mix,proposal,start,10000,list(y=y,p=.4))

In our second algorithm, we also use an identity var-cov matrix, but use the large scale constant of 0.2.

proposal=list(var=diag(c(1,1)),scale=.2)

start=array(c(3,3),c(1,2))

fit2=rwmetrop(poisson.mix,proposal,start,10000,list(y=y,p=.4))

The acceptance rates of the two algorithms are:

Acceptance rate

fit1 94%

fit2 23%

One basic method of MCMC output analysis uses trace plots which are time series plots of the simulated draws. We show trace plots for each random walk below.

Random walk (scale = 0.01)

Note the snake-like appearance of this graph. This is typical of MCMC random walk runs with high acceptance rates.

Random walk (scale = 0.2)

This plot displays better mixing and it visits both areas of concentration of the posterior.

To check the accuracy of each chain in simulating the marginal posterior of theta1 = log(lambda1), we first use the function simcontour that is based on simulations from the grid that covers the actual posterior. A density estimate on the draws of theta1 based on this “exact” algorithm is presented as a red line and the density estimate of the draws of theta1 from the MCMC algorithm are presented in blue.

We first show the random walk with scale 0.01 and then the random walk with scale 0.2.

In this example, it seems that the second MCMC run produced a better approximation to the marginal posterior density of theta1. The first run with scale 0.01 actually missed the important area of the posterior.

## Fitting a Mixture Sampling Model

To illustrate the application of a MCMC random walk algorithm, consider the following mixture model. Suppose we observe season home run counts y1, …, y20 from a group of baseball players. Some of these players are “sluggers” who average many home runs (per season) and other players are “light hitters” who don’t hit as many home runs. We know that 40% of all players are sluggers, but we don’t know the identity (slugger or non-slugger) for each player. The sampling density for the ith player’s home run count yi has the density

f(yi | lambda1, lambda2) = p f(yi | lambda1) + (1- p) f(yi | lambda2),

where we assume f(yi | lambda) is Poisson with mean lambda and we assume we know p = .4.

Suppose we assume that (lambda1, lambda2) has the prior 1/(lambda1 lambda2). Then the posterior is given (up to a proportionality constant) by

g(lambda1, lambda2 | y) = 1/(lambda1 lambda2) prod from i=1 to 20 [p f(yi | lambda1) + (1- p) f(yi | lambda2)]

Here is some simulated data

25 42 29 29 28 25 21 34 19 11 16 13 15 17 10 16 21 17 16 20

We first transform the parameters to the real-valued parameters

theta1 = log(lambda1), theta2 = log(lambda2)

and write the following function that computes the posterior of (theta1, theta2). Here the input variable data.p is a list that contains two elements: p, the known value of p, and y, the vector of observations.

poisson.mix=function(theta, data.p)

{

lambda1=exp(theta[,1])

lambda2=exp(theta[,2])

y=data.p$y

p=data.p$p

val=0*lambda1

for (i in 1:length(y))

val=val+log(p*dpois(y[i],lambda1)+(1-p)*dpois(y[i],lambda2))

return(val)

}

The data is stored in the vector y. We use the mycontour function to graph the bivariate posterior. We also show a perspective plot of this posterior.

mycontour(poisson.mix,c(2.2,3.8,2.2,3.8),list(y=y,p=.4))

You might be surprised to see that the posterior is bimodal. This is happening since the model is not well-defined. But in the next posting, we’ll ignore this identifiability problem, and see if a random walk Metropolis can be successful in sampling from this posterior.

## Choosing the scale for a Metropolis RW algorithm

Let’s return to the example where we are sampling from a Cauchy density with unknown median theta and known scale parameter 1 and we place a uniform prior on theta. We observe the data

1.3 1.9 3.2 5.1 1.4 5.8 2.5 4.7 2.9 5.2 11.6 8.8 8.5 10.7 10.2

9.8 12.9 7.2 8.1 9.5

An attractive method of obtaining a simulated sample from this posterior is the Metropolis random walk algorithm. If theta^(t-1) represents the current simulated draw, then the next candidate is generated from the distribution

theta^(t) = theta^(t-1) + c Z,

where Z is standard normal. The main issue here is the selection of the scale constant c.

I did a simple experiment. I simulated samples of 10,000 draws using the scale values 0.2, 1, 5, 25. Here’s the R code for the Metropolis random walk with the choice of scale parameter 0.2.

cpost=function(theta,y)

{

val=0*theta

for (j in 1:length(y))

val=val+dt(y[j]-theta,df=1,log=TRUE)

return(val)

}

proposal=list(var=1,scale=.2)

fit1=rwmetrop(cpost,proposal,20,10000,data)

The acceptance rates for these four values are given by

scale acceptance rate

0.2 93%

1.0 73%

5.0 31%

25 7%

To assess the accuracy of a particular sample, we draw the exact posterior in red, and superimpose a density estimate of the simulated draws in blue.

Comparing these four figures, the scale values of 1 and 5 seem to do pretty well.

## Summarizing a posterior

Continuing our selected data example, remember that we have programmed the posterior of the transformed parameters theta1 and theta2 in the R function kaminsky.

To find a normal approximation to the posterior, we apply the function laplace in the LearnBayes package. The inputs are (1) the function defining the log posterior, (2) a starting guess at the mode, (3) the number of iterations of the Newton algorithm, and (4) the data vector (the 5th and 15th order statistics).

> start=array(c(-2,-1),c(1,2))

> fit.laplace=laplace(kaminsky,start,10,data)

The output is a list containing mode, the value of the posterior mode, and var, the estimate at the variance-covariance matrix.

> fit.laplace$mode

[,1] [,2]

[1,] -2.367745 -1.091989

> fit.laplace$var

[,1] [,2]

[1,] 0.3201467 0.1191059

[2,] 0.1191059 0.1191059

We can get more accurate summaries of the posterior by means of a Metropolis random walk algorithm. The function rwmetrop implements this algorithm for an arbitrary posterior. To use this function, we define “proposal”, a list containing the variance and scale parameter for the normal proposal density, the starting value for the MCMC chain, the number of simulated draws, and the data vector. Note that we are using the approximate variance-covariance matrix from laplace in the proposal density for rwmetrop.

> proposal=list(var=fit.laplace$var,scale=2)

> fit.mcmc=rwmetrop(kaminsky,proposal,start,10000,data)

The output of rwmetrop is a list containing accept, the acceptance rate for the chain, and par, the matrix of simulated draws.

At this point, we should run some convergence diagnostics to see if the simulated draws show sufficient mixing and don’t display unusually high autocorrrelations. For this example, the acceptance rate is about 29% which is within the acceptable range for this algorithm.

We display the simulated draws on top of the contour plot of theta1 and theta2 — it seems that that most the simulated draws fall within the first contour line.

> mycontour(kaminsky,c(-5,0,-2.5,1),data)

> title(xlab=”LOG(Y5-MU)”,ylab=”LOG BETA”)

> points(fit.mcmc$par[,1],fit.mcmc$par[,2])

We are interested in the parameters mu and beta. We first compute vectors of simulated draws of mu and beta by transforming back the simulated draws of theta1 and theta.

> MU=data[1]-exp(fit.mcmc$par[,1])

> BETA=exp(fit.mcmc$par[,2])

We display the marginal posteriors of mu and beta.

> par(mfrow=c(2,1))

> plot(density(MU),lwd=3,main=”POSTERIOR OF MU”,xlab=”MU”)

> plot(density(BETA),lwd=3,main=”POSTERIOR OF BETA”,xlab=”BETA”)

We construct 90% interval estimates by extracting quantiles from the collection of simulated draws.

> quantile(MU,c(.05,.95))

5% 95%

9.864888 10.065765

> quantile(BETA,c(.05,.95))

5% 95%

0.2012635 0.6520858

Last, suppose we are interested in predicting the 5th order statistic ys5 from a future sample of 20 observations.

To simulate from the distribution of ys5, we (1) simulate (mu, beta) from the posterior and then (2) simulate a future sample y1,…,y20 from the exponential distribution with parameters mu and beta, and (3) storing the 5th ordered observation from the simulated sample. We repeat this process 1000 times, obtaining a simulated sample from ys5. We display this predictive distribution by a histogram.

ys5=rep(0,1000)

for (j in 1:1000)

{

ys=rexp(20,rate=1/BETA[5000+j])+MU[5000+j]

ys5[j]=sort(ys)[5]

}

hist(ys5,col=”orange”)

## Learning from selected order statistics

To illustrate some computational methods for summarizing a posterior, I describe a missing data problem motivated from my undergraduate days at Bucknell. Ken Kaminsky and Paul Nelson, two of my Bucknell professors, were interested in learning about populations based on selected order statistics. (I wrote an undergraduate honors thesis on this topic. ) Here is a simple illustration of the problem.

Suppose a sample y1, …, y20 is taken from the two-parameter exponential distribution of the form f(y | mu, beta) =1/beta exp(-(y-mu)/beta), y > mu. But you don’t observe the complete dataset — all you observe are the two order statistics y(5) and y(15) (the order statistics are the observations arranged in ascending order).

Based on this selected data, we wish to (1) estimate the parameters mu and beta by 90% interval estimates and (2) predict the value of the order statistics y*(5) and y*(20) from a future sample taken from the same population.

Here’s the plan:

1. First, we write the likelihood which is the density of the observed data (y(5) and y(20)) given values of the exponential parameters mu and beta. One can show that this likelihood is given by

L(mu, beta) = f(y(5)) f(y(15)) F(y(5))^4 (F(y(15) )-F(y(5)))^9 (1- P(y(15)))^5, mu>0, beta>0.

2. Assuming a flat (uniform) prior on (mu, beta), the posterior density is proportional to the likelihood. We write a R function kaminsky0.R that computes the logarithm of the posterior — here the parameters are (mu, beta) and the data is (y(5), y(15)).

kaminsky0=function(theta,data)

{

f=function(y,mu,beta)

return(dexp(y-mu,rate=1/beta))

F=function(y,mu,beta)

return(pexp(y-mu,rate=1/beta))

y5=data[1]; y15=data[2]

mu=theta[,1]

beta=theta[,2]

loglike=log(f(y5,mu,beta))+log(f(y15,mu,beta))+

4*log(F(y5,mu,beta))+9*log(F(y15,mu,beta)-F(y5,mu,beta))+

5*log(1-F(y15,mu,beta))

return(loglike)

}

3. Graphing the posterior of (mu, beta), we see strong skewness in both parameters.

It is usually helpful to transform to real-valued parameters

theta1 = log(y(5) – mu) , theta1 = log(beta).

We write the following function kaminsky.R that computes the log posterior of (theta1, theta2).

kaminsky=function(theta,data)

{

f=function(y,mu,beta)

return(dexp(y-mu,rate=1/beta))

F=function(y,mu,beta)

return(pexp(y-mu,rate=1/beta))

y5=data[1]; y15=data[2]

mu=y5-exp(theta[,1])

beta=exp(theta[,2])

loglike=log(f(y5,mu,beta))+log(f(y15,mu,beta))+

4*log(F(y5,mu,beta))+9*log(F(y15,mu,beta)-F(y5,mu,beta))+

5*log(1-F(y15,mu,beta))

logjack=theta[,1]+theta[,2]

return(loglike+logjack)

}

Here’s a graph of the posterior of the reexpressed parameters — note that it is much more normal-shaped.

4. We’ll use several functions in the next posting to summarize the posterior.

(a) The laplace function is useful in finding the posterior mode and normal approximation to the posterior.

(b) By use of the rwmetrop function, we construct a random-walk Metropolis algorithm to simulate from the joint posterior.

## The SIR algorithm

There is a simple method, called the SIR algorithm, of taking a simulated sample of draws from one distribution p, and using these draws to produce a sample from a different distribution g. We illustrate this method for the Cauchy sampling model example introduced in the last post.

Suppose that we have a proposal density g(theta) that we believe is a rough approximation to the posterior (in terms of location and spread). Here we suppose that a t density with mean 7, variance 9 and degrees of freedom 3 is a rough approximation to our bimodal posterior density for theta.

There are three steps in the SIR algorithm.

1. (S) We Sample 1000 draws from the proposal density p. We are storing these in the vector theta.p.

theta.p=sqrt(VAR)*rt(1000,DF)+MEAN

2. (I) We compute Importance sampling weights for this sample equal to the ratios of the target density (g) to the proposal density (p).

p.theta=dt(theta.p-MEAN,DF)/sqrt(VAR)

g.theta=exp(cpost(theta.p,y))

weights=g.theta/p.theta

The following figure plots the simulated draws from the proposal density against the weights.

plot(theta.p,weights,xlim=c(2,12),xlab=”THETA”,ylab=”POST/PROPOSAL”)

3. (R) We resample 1000 draws with replacement from the simulated draws theta.p, where the sampling probabilities are proportional to the weights.

probs=weights/sum(weights)

theta.sample=sample(theta.p,size=1000,prob=probs,replace=TRUE)

The values in the vector theta should be (approximately) from the posterior density g.

The sir algorithm with a t proposal density is implemented in the function sir.R in the LearnBayes algorithm.

To illustrate this function, remember the definition of the log posterior is in cpost and the data is stored in the vector y. We create a list tpar that contains the components of the t proposal density.

MEAN=7; VAR=9; DF=3

tpar=list(m=MEAN,var=VAR,df=DF)

Then we implement the algorithm using the sir function — the output is a vector of simulated draws from the posterior.

s=sir(cpost,tpar,1000,y)

## Robust Modeling

To illustrate some of the computational methods to summarize a posterior, suppose that we observe a sample y1, …, yn from a Cauchy density with location theta and scale parameter 1. If we assign a uniform prior to theta, then the posterior density of theta is proportional to

product from i=1 to n [ (1 + (yi – theta))^{-1} ].

Suppose we observe the following 20 values:

1.3 1.9 3.2 5.1 1.4 5.8 2.5 4.7 2.9 5.2 11.6 8.8 8.5 10.7 10.2

9.8 12.9 7.2 8.1 9.5

In R, we define a new function cpost.R that computes the logarithm of the posterior density.

cpost=function(theta,y)

{

val=0*theta

for (j in 1:length(y))

val=val+dt(y[j]-theta,df=1,log=TRUE)

return(val)

}

We compute and display the posterior on the interval [2, 12].

Note the interesting bimodal shape of the posterior. Clearly a normal approximation to this posterior will not be an accurate representation.

Suppose we wish to simulate a sample from this posterior. A general method for generating a sample is the reject algorithm. To construct a reject algorithm, we find a suitable p that is easily to simulate from and covers the posterior g in the sense that g(theta) <= c p(theta) for all theta. Then one simulates a variate u from a uniform(0, 1) distribution; if u < [ g(u)/c/p(u)], then we accept u as a draw from the target distribution g. Suppose we let p be a t density with mean mu, variance v, and degrees of freedom df. Here we let

MEAN=7; VAR=9; DF=3

To find the bounding constant, we plot the logarithm of the ratio, log g(theta) – log p(theta) over the range of theta.

p.density=dt(theta-MEAN,DF)/sqrt(VAR)

plot(theta,log(post/p.density),type=”l”) From inspection of the graph (and a little calculation), it appears that the log of the ratio is bounded above by -62.98.

In the R code, we draw the posterior density and the constant x proposal density that covers the posterior.

The function rejectsampling.R will implement reject sampling with a t covering density. The inputs to this function are (1) the definition of the log target density, (2) a list giving the parameters of the t covering density (mean, variance, df), (3) the value of the log of the bounding constant, (4) the number of values simulated from the proposal density, and (5) the data used in the target function. The output of this function is a vector of values from the target distribution.

tpar=list(m=MEAN,var=VAR,df=DF)

dmax=-62.98

n=10000

theta.sim=rejectsampling(cpost,tpar,dmax,n,y)

One can compute the acceptance rate of this algorithm by dividing the length of the output vector theta.sim by the number of simulated draws from the proposal. The acceptance rate for this example is about 12%. To demonstrate that this algorithm works, we draw in the following figure (1) the exact posterior in red and (2) a density estimate of about 6000 simulated draws in blue. I’m convinced that the algorithm has indeed produced a sample from the posterior distribution.