Archive

Archive for October, 2009

Home run rates — prediction

Last post, we illustrated fitting an exchangeable model on home run hitting probabilities for all players (non pitchers) in the 2007 season.

But actually, teams aren’t really interested in estimating hitting probabilities.  They are primarily interested in predicting a player’s home run performance the following 2008 baseball season.

We already have the posterior distribution for the 2007 home run probabilities.  Assuming that the probabilities stay the same for the following season and that the players will have the same number of at-bats, we can easily predict their 2008 home run counts from the posterior predictive distribution.

For each player, I simulated the predictive distribution of the number of home runs $y_i^*$ in $n_i$ at-bats in the 2008 season.  I summarize each distribution by its mode — this is the most probable value of $y_i^*$ from the predictive distribution.

In the first graph, for all players I plot the number of 2007 home runs (horizontal) against the predicted number of 2008 home runs – the number of 2007 home runs (vertical).   Given the 2007 home run count, I’m interested in my prediction of the change to the following season.  (I jittered the points so you can see the individual observations.)  For the weak home run hitters (on the left side of the graph), I predict they will perform a little worse or better the following season.  But for the heavy hitters, I predict they will hit 5-7 home runs worse in the 2008 season.  Generally, there is a negative slope in the graph that illustrates the “regression to the mean” effect.

Since we have the actual 2008 home run counts, we can actually compare the 2007 and 2008 counts to see if there is a similar effect.  In the following graph, I plot the 2007 counts against the change (2008 count – 2007 count). What do we see?  There is a negative correlation as we suspected.  Hitters who did well in 2007 tended to get worse in 2008, and a number of light hitters in 2007 tended to do better in 2008.  The biggest difference between the actual data and our predictions is the magnitude of the changes.  Three of the top hitters, for example, hit 15-20 home runs in 2008.

Categories: Hierarchical modeling

Home run rates — fitting an exchangeable model

Continuing our home run hitting example, we observe home run counts $y_1, ..., y_k$ where $y_i$ is distributed $Bin(n_i, p_i)$. Suppose we assign the home run probabilities the following exchangeable model.

1.  $p_1, ..., p_k$ distributed $Beta(K, \eta)$

2. $K, \eta$ independent, $\eta$ distributed $\eta^{-1}(1-\eta)^{-1}, K$ distributed $\frac{1}{1+K^2}$.

We described the computing strategy in class to summarize the posterior distribution of $(p_1, .., p_k), \eta, K$.  We find

1.  For $\eta$, the posterior median is 0.0283 and a 90% interval estimate is (.0268, .0296).

2. For $K$, the posterior median is 99 and a 90% interval estimate is (85, 116).

We find posterior means for the home run probabilities $p_1, ..., p_k$.  We plot the posterior means against the square root of the at-bats for the 605 non-pitchers.

This is a very different graph pattern compared to the figure plotting sqrt(AB) against the observed home run rates.  Several things to note.

1.  For the players with few AB, their home run rates are strongly shrunk towards the overall home run rate.  This is reasonable — we have little information about these players’ true home run hitting abilities.

2.  Who are the best home run hitters?  It is clear from the figure that four hitters stand out — they are the ones in the upper right corner of the graph that I’ve labeled with the number of home runs.  The best home run hitters, that is the ones with the highest estimated home run probabilities, are the two player with 46 and 54 at-bats.

3.  I’ve drawn a smoothing curve to indicate the main pattern in the graph.  Note that it’s an increasing curve — this means that players with more AB tend to have higher home run probabilities.  This makes sense — the better hitters tend to get more playing time and more at-bats.

Categories: Hierarchical modeling

Estimating home run rates

Since the World Series starts tomorrow, it seems appropriate to use baseball data to illustrate hierarchical modeling.

I collected home run for all nonpitchers in the 2007 baseball season.  We observe the pair $(HR_i, AB_i)$, the number of home runs and the number of at-bats (opportunities) for the $i$th player.

Who was the best home run hitter this season?  Major League Baseball’s criteria is simply the number of home runs hit.  Alex Rodriquez hit the most home runs  (54) that season.

Was A-Rod the best home run hitter in 2007?  Since the number of at-bats varies among players, maybe it makes more sense to consider the player who had the highest home run rate RATE = HR/AB.

Who had the highest home run rate in the 2007 season?  Actually, it was Charlton Jimerson who had 1 home run in 2 at-bats for a home run RATE = 1/2 = 0.500.

Should Jimerson be given the home run hitting crown?  Of course not, since he had only two at-bats.  It makes more sense to restrict our attention to hitters who had more opportunities.  But how many opportunities (AB) is enough?

Below I’ve plotted the home run rate HR/AB against the square root of the number of AB for all nonpitchers in the 2007 season.

This figure dramatically illustrates the problem of interpreting a collection of rates when the sample sizes vary.  The home run rates are much more variable for small sample sizes.  If one wishes to simultaneously estimate the probabilities of a home run for all players, it seems desirable to adjust the small sample rates towards the average rate.  We’ll see that we can accomplish this by means of an exchangeable prior model.

Categories: Hierarchical modeling

Learning about exponential parameters — part III

The two-parameter exponential sampling problem can be used to illustrate Gibbs sampling.  The  joint posterior density for $(\mu, \beta)$ has the form

$g(\mu, \beta | {\rm data}) \propto \beta^n \exp(-\beta(s - n\mu)) I(\mu < \min y)$

Note that each of the two conditional posteriors have simple forms.

1.  If we fix a value of $\mu$, the conditional posterior of $\beta$ has the gamma form with shape $n+1$ and rate $(s - n\mu)$.

2.  Turning things around, if we fix a value of $\beta$, the conditional posterior of $\mu$ has the form

$g(\mu | y, \beta) \propto \exp(n\beta \mu), \mu < \min y$.

This is the mirror image of an exponential random variable with rate beta and location $\min y$.  By use of the inversion method, one can simulate a value of $\mu$.

Here is a short function to implement Gibbs sampling for this example.  The inputs are the data vector y and the number of cycles of Gibbs sampling iter.   One cycle of GS is accomplished by the two lines

beta=rgamma(1,shape=n+1,rate=s-n*mu)  # simulates from [$\beta | \mu$]
mu=log(runif(1))/(beta*n)+min(y)  # simulates from [$\mu | \beta$]

gibbs.exp=function(y,iter)
{
n=length(y)
s=sum(y)
theta=array(0,c(iter,2))
mu=min(y)
for (j in 1:iter)
{
beta=rgamma(1,shape=n+1,rate=s-n*mu)
mu=log(runif(1))/(beta*n)+min(y)
theta[j,]=c(mu,beta)
}

gibbs.exp=function(y,iter)
{
n=length(y)
s=sum(y)
theta=array(0,c(iter,2))
mu=min(y)
for (j in 1:iter)
{
beta=rgamma(1,shape=n+1,rate=s-n*mu)
mu=log(runif(1))/(beta*n)+min(y)
theta[j,]=c(mu,beta)
}

I ran this for 10,000 iterations and the MCMC chain appeared to mix well.
Categories: MCMC

Learning about exponential parameters, part II

In the previous post, I described using a random walk Metropolis algorithm to sample from the posterior of $(\mu, \beta)$, where $\mu$ is the location and $\beta$ is the rate parameter for the exponential.   But since the MCMC algorithm is based on steps generated from a multivariate normal distribution, it seems that this algorithm would perform better if the parameters are real-valued.

So it makes sense to transform each parameter into ones that are real-valued.  We let

$\theta_1 = \log (\min y - \mu), \theta_2 = \log(\beta)$.

When we transform $(\mu, \beta)$ to $(\theta_1, \theta_2)$, we need to include a Jacobian to write down the posterior for $(\theta_1, \theta_2)$.  It can be shown that the Jacobian is equal to $J = \exp(\theta_1+\theta_2)$ and the new posterior is

$g_{new}(\theta_1, \theta_2 | y) = g(\mu, \beta|y) |J|$

Here is the function for computing the log posterior of $(\theta_1, \theta_2)$ and a contour plot of the new parameters.  Note that the shape of the posterior is close to normal, suggesting that one can use the random walk Metropolis algorithm effectively with the new posterior.

logpost2=function(theta,y)
{
mu=min(y)-exp(theta[1])
beta=exp(theta[2])
sum(dexp(y-mu,beta,log=TRUE))+sum(theta)
}
fit=laplace(logpost2,c(0,0),y)
mycontour(logpost2,c(-10,1,-2.2,0),y,
xlab=”LOG(MIN Y – MU)”,ylab=”LOG BETA”)

logpost2=function(theta,y)
{
mu=min(y)-exp(theta[1])
beta=exp(theta[2])
sum(dexp(y-mu,beta,log=TRUE))+sum(theta)
}
mycontour(logpost2,c(-10,1,-2.2,0),y,
xlab=”LOG(MIN Y – MU)”,ylab=”LOG BETA”)

Categories: MCMC

Learning about an exponential rate and location

To illustrate MCMC fitting for a two parameter problem, suppose one observes a sample $y_1, ..., y_n$ from a gamma density with location $\mu$ and rate $\beta$:

$f(y) = \beta exp(-\beta(y - \mu)), y \ge \mu .$

If we assume a uniform prior on $(\mu, \beta)$, then the posterior density is given by

$g(\mu, \beta | y) \propto \prod f(y_i)$,

where $-\infty < \mu < \min y_i, \beta > 0$.

Suppose we observe the following data:

[1]  5.1  5.3  7.1  6.5  3.2  5.2  6.4  9.2 11.4  3.1  3.0  9.4  3.6  5.7  5.0
[16]  3.6  3.1 11.5  3.5  5.5

5.1  5.3  7.1  6.5  3.2  5.2  6.4  9.2 11.4  3.1  3.0  9.4  3.6  5.7  5.0
3.6  3.1 11.5  3.5  5.5

1.  First we write a function logpost that computes the log posterior of $(\mu, \beta)$.  Since $\beta > 0$, we have to be careful when negative values are input for $\beta$.  In the ifelse function, we assign a very large negative value when $\beta < 0$.

logpost=function(theta,y)
ifelse(theta[2]>0,sum(dexp(y-theta[1],theta[2],log=TRUE)),-9e200)

2.  Using the mycontour function in the LearnBayes package, we construct a contour plot of the posterior — we add the line $\mu = \min y$ to show the boundary for $\mu$.

mycontour(logpost,c(1.5,3.5,.1,.8),y,xlab=”MU”,ylab=”BETA”,col=”red”)
abline(v=min(y))

3.   We’ll try using a random walk Metropolis algorithm where a candidate value is found by
$(\mu, \beta)^c = (\mu, \beta)^i + scale \times Z$,
where $Z$ is normal with mean 0 and variance-covariance matrix $V$.  Here it is not quite obvious how to choose $scale$ and $V$ — a normal approximation won’t work here.  Looking at the graph, it seems that most of the probability for $\mu$ is in (2, 3) and most of the probability for $\beta$ is in (0.1, 0.6).  If we use the rough rule of thumb “4 sd = range of data”, I get estimates for the two standard deviations and these can be squared to get estimates at the variances.  I obtain the estimate $V$ = diag(.06, .015).  After some trial and error, it seems that the value scale = 1 works ok — the acceptance rate is 25%.
Here are the arguments for the function rwmetrop — I show the simulated draws on top of the scatterplot.  It looks like a reasonable sample from the posterior.
fit=rwmetrop(logpost,list(var=diag(c(.06,.015)),scale=1),c(2,.3),10000,y)
Categories: MCMC

Learning about a correlation — part IV

Currently we are introducing MCMC fitting of posterior distributions.  Here we illustrate the use of one popular algorithm, the Metropolis random walk chain, to simulate from an arbitrary posterior distribution.

Let’s return to the problem of learning about a correlation from standardized data.  We have programmed the logarithm of the posterior density in the function cor.sampling

cor.sampling=function(rho,data)
{
x=data[,1]
y=data[,2]
ifelse(abs(rho)<1,sum(-0.5*log(1-rho^2)-0.5/(1-rho^2)*(x^2-2*rho*x*y+y^2)),
-9.99e200)
}

Here the parameter rho falls in the interval (-1, 1), but I want to define the log posterior for all values of rho.  Note that in the ifelse command, I’m checking if abs(rho) < 1 — if it isn’t, then I’m defining the log posterior to be a very large negative value (so the posterior is essentially zero outside of that interval).

How does one implement a random walk Metropolis chain for this example?

1.  Recall that in this algorithm, you have a current simulated draw $\rho^{i}$ and one proposes a new simulated draw $\rho^C$ where

$\rho^C = \rho^i + c Z$,

$c$ is a scale parameter and $Z$ is a  simulated draw from a normal distribution with mean 0 and standard deviation $\sqrt{V}$ .  One has to choose a value of the scale parameter $c$ and the variance $V$.

2.  Then when you implement the algorithm, you choose a starting parameter value (where the simulation begins) and the total number of simulation draws you collect.

The function rwmetrop in the LearnBayes package has inputs:  (1)  the name of the function that defines the log posterior, (2) a list containing values of $V$ and $c$, (3) the starting value of the parameter in the simulation, and (4) any data used in the log posterior function.

Here I’m using the function cor.sampling, I’ll choose $V = 0.009, c = 2$, start the simulation at $\rho = 0.5$, collect 10,000 simulation draws, and my data is stored in dz.

R1=rwmetrop(cor.sampling,list(var=.009,scale=2),.5,10000,dz)

The output is R1 — R1$par contains the simulated draws and R$accept gives the acceptance rate of the algorithm.

1.  Typically it is good to let $Z$ have a standard deviation equal to twice the estimated standard deviation of the posterior.  So you let $var$ be an estimated posterior variance and $scale = 2$.
3.  Acceptance rate?  A random walk Metropolis algorithm works well when the acceptance rate is in the 25 – 40% range.  You can monitor this value by displaying the R1$accept value. 4. A trace plot of the simulated parameter values is helpful for seeing how well the simulated sample is exploring the posterior density. One looks for a trend in this graph (is the MCMC algorithm slow in converging to the posterior) and for a snake-like appearance (this indicates that you have poor mixing in the chain). Trace plots and other diagnostics are available as part of the coda package. Let’s look at two runs of this algorithm — one for a good scale value and one for a poor choice of scale. The acceptance rate for the first run is 50% and the acceptance rate for the second run is 94%. library(coda) R1=rwmetrop(cor.sampling,list(var=.009,scale=2),.5,10000,dz) R1$accept
traceplot(mcmc(R1$par),main=”Good choice of scale”) R2=rwmetrop(cor.sampling,list(var=.009,scale=0.2),.5,10000,dz) R2$accept