## A Prediction Contest

Concluding our baseball example, recall that we observed the home run rates for 20 players in the month of April and we were interested in predicting their home run rates for the next month. Since we have collected data for both April and May, we can check the accuracy of three prediction methods.

1. The naive method would be to simply use the April rates to predict the May rates. Recall that the data matrix is d, where the first column contains the at-bats and the second column the home run counts.

pred1=d[,2]/d[,1]

2. A second method, which we called the “pooled” method, predicts each player’s home run rate by the pooled home run rate for all 20 players in April.

pred2=sum(d[,2])/sum(d[,1])

3. The Bayesian method predicts a player’s May rate by the posterior mean of his true rate lambda_j.

pred3=post.means

One measure of accuracy is the sum of absolute prediction errors.

The May home run rates are stored in the vector may.rates. We first compute the individual prediction errors for all three methods.

error1=abs(pred1-may.rates)

error2=abs(pred2-may.rates)

error3=abs(pred3-may.rates)

We use the apply statement to sum the absolute errors.

errors=cbind(error1,error2,error3)

apply(errors,2,sum) # sum of absolute errors for all methods

error1 error2 error3

0.3393553 0.3111552 0.2622523

By this criterion, Bayes beats Pooled beats Naive.

Finally, suppose we are interested in predicting the number of home runs hit by the first player Chase Utley in May. Suppose we know that he’ll have 115 at-bats in May.

1. We have already simulated 10,000 draws from Utley’s true home run rate lambda1 — these are stored in the first column of the matrix lam.

2. Then 10,000 draws from Utley’s posterior predictive distribution are obtained by use of the rpois function.

AB=115

ys1=rpois(10000,AB*lam[,1])

We graph this predictive distribution by the command

plot(table(ys1),xlab="NUMBER OF MAY HOME RUNS")

The most likely number of May home runs is 3, but a 90% prediction interval is [1, 9].
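The interval quoted above can be read off with the quantile function. Here is a minimal sketch, using illustrative gamma draws as a stand-in for the first column of lam (the actual draws come from the MCMC fit):

```r
# stand-in posterior draws of Utley's true rate (illustrative gamma draws,
# not the actual MCMC output)
set.seed(1)
lam1 <- rgamma(10000, shape = 8, rate = 200)

AB <- 115                          # May at-bats
ys1 <- rpois(10000, AB * lam1)     # posterior predictive draws of May home runs

most.likely <- as.numeric(names(which.max(table(ys1))))
ci <- quantile(ys1, c(0.05, 0.95)) # a 90% prediction interval
```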

## Fitting an Exchangeable Model

Continuing our baseball example, we observe yi home runs in ei at-bats for the ith player. We assume

1. y1, …, y20 are independent, with yi distributed Poisson(ei lambda_i)

2. the true rates lambda_1,…, lambda_20 are independent Gamma(alpha, alpha/mu)

3. the hyperparameters alpha, mu are independent, mu is distributed 1/mu, alpha is distributed according to the proper prior z0/(alpha+z0)^2, where z0 is the prior median

The data is stored as a matrix d, where the first column contains the ei and the second column the yi. In our example, we let z0 = 1; that is, the prior median of alpha is one.

Here is our computing strategy.

1. First we learn about the hyperparameters (alpha, mu). The posterior of (log alpha, log mu) is programmed in the LearnBayes function poissgamexch. Here is a contour plot.

mycontour(poissgamexch,c(-1,8,-4.2,-2.5),datapar)

title(xlab="LOG ALPHA",ylab="LOG MU")

We use the function laplace to find the posterior mode and associated variance-covariance matrix. The output from laplace is used to find a proposal variance and scale parameter to use in a random walk Metropolis chain. We simulate 10,000 draws from the posterior of (log alpha, log mu). (This chain has approximately a 30% acceptance rate.)

fit=laplace(poissgamexch,c(2,-3.2),datapar)

proposal=list(var=fit$var,scale=2)

mcmcfit=rwmetrop(poissgamexch,proposal,c(1,-3),10000,datapar)
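For readers without the package at hand, the random-walk Metropolis step that rwmetrop performs can be sketched in a few lines of base R. This is only a sketch of the algorithm on a toy two-parameter normal target, not the LearnBayes implementation:

```r
# toy log posterior: independent normals with mode at (2, -3)
logpost <- function(theta) sum(dnorm(theta, mean = c(2, -3), sd = 1, log = TRUE))

rwmetrop.sketch <- function(logpost, start, m, scale) {
  p <- length(start)
  draws <- matrix(0, m, p)
  cur <- start; lcur <- logpost(cur); n.accept <- 0
  for (i in 1:m) {
    cand <- cur + scale * rnorm(p)       # random-walk proposal
    lcand <- logpost(cand)
    if (log(runif(1)) < lcand - lcur) {  # Metropolis accept/reject
      cur <- cand; lcur <- lcand; n.accept <- n.accept + 1
    }
    draws[i, ] <- cur
  }
  list(par = draws, accept = n.accept / m)
}

set.seed(42)
out <- rwmetrop.sketch(logpost, start = c(0, 0), m = 5000, scale = 2)
out$accept   # acceptance rate; tune scale so this lands in a healthy range
```

As in the text, the proposal scale is the main tuning knob: larger scales lower the acceptance rate, and a rate in the 20-40% neighborhood is a common target for random-walk chains.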

By exponentiating the simulated draws from mcmcfit, we get draws from the marginal posteriors of alpha and mu. We draw a density plot of alpha and superimpose the prior density of alpha.

alpha=exp(mcmcfit$par[,1])

mu=exp(mcmcfit$par[,2])

plot(density(alpha,adjust=2),xlim=c(0,20),col="blue",lwd=3,xlab="ALPHA",
main="POSTERIOR OF ALPHA")

prior=function(alpha,z0) z0/(alpha+z0)^2

theta=seq(.0001,20,length=200)

lines(theta,prior(theta,1),col="red",lwd=3)

2. Now we can learn about the true rate parameters. Conditional on hyperparameter values, lambda_1, …, lambda_20 have independent gamma posteriors.

We write a short function to simulate draws of lambda_j for a particular player j.

trueratesim=function(j,data,alpha,mu)

{

e=data[,1]; y=data[,2]

rgamma(length(alpha),shape=alpha+y[j],rate=alpha/mu+e[j])

}

Then we can simulate draws of all the true rates by use of the sapply function.

lam=sapply(1:20,trueratesim,d,alpha,mu)

The output lam is a 10,000 by 20 matrix. We can compute posterior means by the apply function.

post.means=apply(lam,2,mean)

To show the behavior of the posterior means, we draw a plot where

(1) we show the observed rates yi/ei as red dots

(2) we show the posterior means as blue dots

(3) a horizontal line is drawn at the pooled home run rate for all 20 players.
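The plot described above can be sketched as follows. Since the plotting commands are not listed in this post, the quantities below (at-bats, counts, posterior means) are simulated stand-ins for d and post.means:

```r
# simulated stand-ins for the real quantities computed earlier
set.seed(7)
e <- round(runif(20, 60, 120))                # at-bats for 20 players
true.rates <- rgamma(20, shape = 4, rate = 100)
y <- rpois(20, e * true.rates)                # home run counts
obs.rates <- y / e
pooled <- sum(y) / sum(e)
post.means <- 0.6 * obs.rates + 0.4 * pooled  # mock shrinkage values

plot(1:20, obs.rates, col = "red", pch = 19,
     xlab = "PLAYER", ylab = "HOME RUN RATE")
points(1:20, post.means, col = "blue", pch = 19)
abline(h = pooled)   # pooled rate for all 20 players
```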

## Predicting home run rates

Here is a simple prediction problem. Suppose we observe the number of home runs and the number of at-bats for 20 baseball players during the first month of the baseball season (April).

We observe corresponding home run rates: Utley 4/88, Weeks 1/77, Aurilia 4/71, and so on.

From this data, we want to predict the home run rates for the same 20 players for the next month (May).

How can we best do this? Here are some opening comments.

1. One idea would be to estimate the May rates just by using the April rates. So we predict Utley’s May rate to be 4/88, Weeks rate to be 1/77, etc. This may not be a good idea since we have limited data for each player.

2. Or maybe a better strategy would be to combine the data. Collectively in April, this group hit 60 home runs in 1697 at-bats for a rate of 60/1697 = 0.035. We could estimate each player’s May rate by 0.035. But this would ignore the fact that players have different abilities to hit home runs.

3. Actually, the best prediction strategy is a compromise between the first two ideas. A good plan is to predict a player’s May rate by a “shrinkage” estimate that shrinks a player’s individual rate towards the combined rate.
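Under the exchangeable model fit in the posts above, this shrinkage estimate has a simple closed form given the hyperparameters: the posterior mean of lambda_j is (alpha + y_j)/(alpha/mu + e_j). The hyperparameter values below are illustrative, not the fitted ones:

```r
# posterior mean of lambda_j given hyperparameters alpha and mu:
# a compromise between the individual rate y/e and the overall rate mu
shrink.est <- function(y, e, alpha, mu) (alpha + y) / (alpha / mu + e)

alpha <- 10; mu <- 0.035       # illustrative hyperparameter values
shrink.est(4, 88, alpha, mu)   # Utley: pulled down from 4/88 toward 0.035
shrink.est(1, 77, alpha, mu)   # Weeks: pulled up from 1/77 toward 0.035
```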

In the following postings, we’ll illustrate how to fit a Bayesian exchangeable model to these data that yields a good prediction strategy.

## View of an Exchangeable Prior

In Chapter 7, we consider the following exchangeable prior for Poisson rates lam_1, …, lam_k that is described in two stages.

Stage I. Conditional on parameters alpha, mu, lam_1, …, lam_k are independent Gamma(alpha, alpha/mu)

Stage II. The parameters (alpha, mu) come from a specified prior g(alpha, mu).

Here mu is the prior mean of lam_i and alpha is a precision parameter. This structure induces the following prior on lam_1, …, lam_k:

g(lam_1, …, lam_k) = integral prod P(lam_j | alpha, mu) g(alpha, mu) dalpha dmu.

To see how this prior reflects dependence between the parameters, suppose we fix alpha to the value alpha_0 and let mu be distributed inverse gamma(a, b). Then one can show the prior on lam_1,…, lam_k is given (up to a proportionality constant) by

g(lam_1, …, lam_k) = P^(alpha_0-1)/(alpha_0 S + b)^(k alpha_0 + a),

where P is the product of lam_j and S is the sum of lam_j.
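As a sanity check, this closed form can be verified numerically for k = 2 by integrating the product of the two gamma densities against the inverse-gamma(a, b) density for mu. The parameter values below are arbitrary test values:

```r
# unnormalized closed form for k = 2:
# P^(alpha0 - 1) / (alpha0*S + b)^(2*alpha0 + a)
closed.form <- function(lam, alpha0, a, b)
  prod(lam)^(alpha0 - 1) / (alpha0 * sum(lam) + b)^(2 * alpha0 + a)

# direct integral over mu of the two-stage prior:
# prod of Gamma(alpha0, alpha0/mu) densities times an inverse-gamma(a, b) density
numeric.form <- function(lam, alpha0, a, b)
  integrate(function(mu)
    sapply(mu, function(m)
      prod(dgamma(lam, shape = alpha0, rate = alpha0 / m)) *
        b^a / gamma(a) * m^(-a - 1) * exp(-b / m)),
    lower = 0, upper = Inf)$value

alpha0 <- 5; a <- 10; b <- 10
r1 <- numeric.form(c(0.8, 1.5), alpha0, a, b) / closed.form(c(0.8, 1.5), alpha0, a, b)
r2 <- numeric.form(c(2.0, 0.5), alpha0, a, b) / closed.form(c(2.0, 0.5), alpha0, a, b)
c(r1, r2)   # equal ratios: the two forms agree up to a constant
```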

To see this prior, we program a simple function pgexchprior that computes the logarithm of the prior of lam_1 and lam_2 given parameter values (alpha_0, a, b).

pgexchprior=function(lambda,pars)

{

alpha=pars[1]

a=pars[2]

b=pars[3]

(alpha-1)*log(prod(lambda))-(2*alpha+a)*log(alpha*sum(lambda)+b)

}

The following R commands construct contour plots of the prior for lam_1 and lam_2 for the precision parameters alpha_0 = 5, 20, 80, and 400. (In each case, we assign mu an inverse-gamma(10, 10) prior.)

alpha=c(5,20,80,400)

par(mfrow=c(2,2))

for (j in 1:4)

{

mycontour(pgexchprior,c(.001,5,.001,5),c(alpha[j],10,10))

title(main=paste("ALPHA = ",alpha[j]),xlab="LAMBDA1",ylab="LAMBDA2")

}

These plots clearly show that, as alpha increases, the prior induces stronger correlation between the two Poisson rates.
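The same point can be made without contour plots by simulating pairs (lam_1, lam_2) directly from the two-stage prior and computing their correlation; the inverse-gamma(10, 10) prior on mu matches the plots above:

```r
# prior correlation of (lam1, lam2) for a given precision alpha:
# draw mu ~ inverse-gamma(a, b), then lam1, lam2 | mu ~ Gamma(alpha, alpha/mu)
prior.cor <- function(alpha, N = 20000, a = 10, b = 10) {
  mu <- 1 / rgamma(N, shape = a, rate = b)   # inverse-gamma draws
  lam1 <- rgamma(N, shape = alpha, rate = alpha / mu)
  lam2 <- rgamma(N, shape = alpha, rate = alpha / mu)
  cor(lam1, lam2)
}

set.seed(123)
sapply(c(5, 20, 80, 400), prior.cor)   # correlations increase with alpha
```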

## Illustration of posterior predictive checking

To illustrate posterior predictive checking, suppose we observe data y1,…,yn that we assume are N(mu, sigma). We place a noninformative prior on (mu, sigma) and learn about the parameters from the posterior distribution. We then use the posterior predictive distribution to check our model — to see if our sample is consistent with samples predicted from the fitted model.

In practice, we construct a checking function d that is a function of the future sample y*. In this example (66 measurements of the speed of light), we suspect that the smallest observation is an outlier, so we use the checking function d(y*) = min(y*).

Here’s an R procedure for checking our model using this diagnostic.

We load in the speed of light data:

> y=scan("speedoflight.txt")

Read 66 items

> y

[1] 28 22 36 26 28 28 26 24 32 30 27 24 33 21 36 32 31 25 24

[20] 25 28 36 27 32 34 30 25 26 26 25 -44 23 21 30 33 29 27 29

[39] 28 22 26 27 16 31 29 36 32 28 40 19 37 23 32 29 -2 24 25

[58] 27 24 16 29 20 28 27 39 23

We simulate 1000 draws from the joint normal/scaled inverse-chi-square posterior of (mu, sigma2) using the function normpostsim.

parameters=normpostsim(y,1000)

The output is a list — parameters$mu contains the draws of mu, and parameters$sigma2 contains the draws of sigma2.
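For reference, the sampling scheme behind normpostsim is the standard noninformative-prior result, which is easy to sketch in base R (a sketch of the same scheme, not the LearnBayes code; the data vector here is simulated, not the speed-of-light measurements):

```r
# draws from the joint posterior of (mu, sigma2) under the prior 1/sigma2:
#   sigma2 | y      ~  (n - 1) s^2 / chi-square(n - 1)
#   mu | sigma2, y  ~  N(ybar, sigma2 / n)
normpostsim.sketch <- function(y, m) {
  n <- length(y); ybar <- mean(y); s2 <- var(y)
  sigma2 <- (n - 1) * s2 / rchisq(m, n - 1)
  mu <- rnorm(m, mean = ybar, sd = sqrt(sigma2 / n))
  list(mu = mu, sigma2 = sigma2)
}

set.seed(1)
y.sim <- rnorm(66, 26, 10)   # stand-in data of the same size as the real sample
parameters.sketch <- normpostsim.sketch(y.sim, 1000)
```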

The simulation of samples from the posterior predictive distribution is done by the function normpostpred. By use of comments, we explain how this function works.

normpostpred=function(parameters,sample.size,f=min)

{

# the function normalsample simulates a single sample given parameter values mu and sigma2.

# the index j is the index of the simulation number

normalsample=function(j,parameters,sample.size)

rnorm(sample.size,mean=parameters$mu[j],sd=sqrt(parameters$sigma2[j]))

# we use the sapply command to simulate many samples, where the number of samples

# corresponds to the number of parameter simulations

m=length(parameters$mu)

post.pred.samples=sapply(1:m,normalsample,parameters,sample.size)

# on each posterior predictive sample we compute a stat, the default stat is min

stat=apply(post.pred.samples,2,f)

# the function returns all of the posterior predictive samples and a vector of values of the

# checking diagnostic.

return(list(samples=post.pred.samples,stat=stat))

}

Here the size of the original sample is 66. To generate 1000 samples of size 66 from the posterior predictive distribution, storing the min of each sample, we type

post.pred=normpostpred(parameters,66)

The minimum of each sample is stored in the vector post.pred$stat. We display the mins in a histogram and draw a vertical line showing the location of the min of the data.

> hist(post.pred$stat,xlim=c(-50,15))

> lines(min(y)*c(1,1),c(0,350),lwd=3,col="red")

Clearly the smallest observation is inconsistent with the mins generated from the posterior predictive distribution. This casts doubt on the normality assumption for these data.
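The visual impression can be quantified with a posterior predictive p-value: the proportion of simulated minimums at least as small as the observed one. The simulated stats and observed minimum below are illustrative stand-ins for post.pred$stat and min(y):

```r
# illustrative stand-ins: normal draws playing the role of post.pred$stat,
# and an observed minimum far out in the left tail
set.seed(2)
stat <- rnorm(1000, mean = 10, sd = 6)
obs.min <- -44

p.value <- mean(stat <= obs.min)   # proportion as extreme as the observed min
p.value                            # essentially zero here, flagging the outlier
```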

## No more loops!

One aspect of the LearnBayes package that I’ve been uncomfortable with is the use of loops in my definitions of the functions for log posteriors. I had a reason for using the loops (the theta argument to the function could be a matrix), but it is generally bad programming practice. The worst aspect of the looping is that it makes the process of writing a posterior more difficult than it really is. Looping is bad from the user’s perspective. After all, we are teaching statistics, not programming, and I want to encourage people to code the posteriors for their problems in R.

Here is a simple example. Suppose your model is that y1,…, yn are independent Cauchy with location mu and scale sigma. The log posterior is given by

log g = sum (log f),

where log f is the log of the Cauchy density conditional on parameters. My old way of programming the posterior had the loop

for (i in 1:length(data))

val = val + log(dt((data[i] - mu)/sigma, df = 1)/sigma)

I think this new way is preferable. First you define the function logf for a single observation y:

logf=function(y,mu,sigma) log(dt((y-mu)/sigma,df=1)/sigma)

Then the log posterior is given by

sum(logf(data,mu,sigma))
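A quick check that the vectorized version reproduces the old loop, with made-up data and parameter values:

```r
logf <- function(y, mu, sigma) log(dt((y - mu)/sigma, df = 1)/sigma)

set.seed(3)
data <- rcauchy(25, location = 1, scale = 2)   # made-up Cauchy sample
mu <- 1; sigma <- 2

# old loop version
val <- 0
for (i in 1:length(data))
  val <- val + log(dt((data[i] - mu)/sigma, df = 1)/sigma)

# new vectorized version
val2 <- sum(logf(data, mu, sigma))

all.equal(val, val2)   # TRUE
```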

Anyway, I think that by avoiding loops, the function for the log posterior becomes more transparent.

The new version of the LearnBayes package will contain fewer loops.