Here is a basic proportion problem. We are approaching the end of the 2009 baseball season and Ryan Howard of the Philadelphia Phillies currently has 38 home runs in 540 at-bats (opportunities). I project that he will have 85 more at-bats this season. How many additional home runs will he hit?

Let p be the probability that Howard hits a home run in a single at-bat during the 2009 season. We’ll address this prediction problem in three steps. First, we’ll use past data from previous seasons to form a prior distribution for p. Second, we’ll update our beliefs about p using the 2009 data. Last, we’ll predict the number of future at-bats from the posterior predictive distribution. In this process, we’ll illustrate the use of some functions from the LearnBayes package in R.

1. My prior.

In the four previous seasons (2005 through 2008) his home runs/at-bats have been 22/312. 58/581. 47/529, and 48/610. The home run rates are 0.071, 0.100, 0.089, 0.079. Remember that p is the 2009 home run hitting probability. Based on these data, I believe that

(1) P(p < 0.075) = 0.5

(2) P(p < 0.115) = 0.9

In other words, my best guess at p is 0.075 and I’m pretty confident that p is smaller than 0.115. I find the beta(a, b) that matches this information by use of the beta.select function in LearnBayes. I first define the two quantiles as lists — these quantiles are the arguments in beta.select.

> quantile1=list(p=.5,x=.075)

> quantile2=list(p=.9,x=.115)

> beta.select(quantile1, quantile2)

[1] 7.26 85.77

We see that a beta(7.26, 85.77) matches this information.

2. Update my prior with the 2009 data.

In the 2009 season, Howard had 38 successes (home runs) and 502 failures (not home runs). The posterior for p will be beta with updated parameters

a = 7.26 + 38 = 45.26, b = 85.77 + 502 = 587.77

3. Predict.

The predictive distribution with a beta density has the beta/binomial form. These probabilities are computed using the pbetap function in LearnBayes. The arguments are the beta parameters (a, b), the future sample size, and the values of y of interest. Here we use a = 45.66, b = 587.77, we use 85 at-bats, and we’re interested in the probabilities of y = 0, 1, …, 85.

> ab=c(45.66, 587.77)

> n=85

> y=0:85

> pred.probs=pbetap(ab,n,y)

The predictive probabilities are stored in the vector pred.probs. To summarize these probs, we use the function disc.int. The inputs are the probability distribution where the first column are the values of y and the second column are the probabilities, and a probability content of interest. The output is an interval of values that contain y with a given probability. We use this function twice — once to find a 50% interval and a second time to find a 90% interval.

> discint(cbind(y,pred.probs),.5)

$prob

[1] 0.575821

$set

[1] 4 5 6 7

> discint(cbind(y,pred.probs),.9)

$prob

[1] 0.9301308

$set

[1] 2 3 4 5 6 7 8 9 10

So P(4 <= y <= 7) = 0.58 and P(2 <= y <= 10) = 0.93. This means that I’m somewhat confident that Howard will hit between 4 and 7 home runs, and very confident that Howard will hit between 2 and 10 home runs in the remainder of the season.

> quantile1=list(p=.5,x=.075)

> quantile2=list(p=.9,x=.115)

> beta.select(quantile1, quantile2)

[1] 7.26 85.77