## Graphical elicitation of a beta prior

My students are doing a project where they have to elicit a beta prior for a proportion. This is a difficult task and any tools to make this process a little easier are helpful.

I’ve been playing with a R package manipulate that is included as part of the RStudio installation (for those who don’t know, RStudio is a nice interface for R). This package makes it easy to construct sliders, push buttons, and pull-down menus to provide a nice user interface for functions one creates.

Here is a simple illustration of the use of the manipulate package. One assesses the shape parameters of the beta indirectly by specifying (1) the prior median p.med and (2) the prior 90th percentile p.90. One inputs values of p.med and p.90 by sliders and you see the beta density curve that matches this information. I wrote a simple R script that constructs the graph with sliders — the main tools are the manipulate package and the LearnBayes package (the beta.select function).

Here is the script and here is a movie that shows the sliders in action.

## Home run hitting — priors?

At our department’s colloquium last week, there was an interesting presentation on the use of an exchangeable prior in a regression context. The analysis was done using WinBUGS using some standard choices of “weekly informative” priors. That raises the natural question — how did you decide on this prior?

Let’s return to the home run hitting example where an exchangeable prior was placed on the set of home run hitting probabilities. We assumed that were iid and then the hyperparameters and were assumed independent where was assigned the prior and was assigned the log logistic form .

When I described these priors, I just said that they were “noninformative” and didn’t explain how I obtained them.

The main questions are:

1. Does the choice of vague prior matter?

2. If the answer to question 1 is “yes”, what should we do?

When one fits an exchangeable model, one is shrinking the observed rates towards a common value. The posterior of tells us about the common value and the posterior of tells us about the degree of shrinkage. The choice of vague prior for is not a concern — we can assume is uniform or distributed according to the improper prior . However, the prior on can be a concern. If one assigns the improper form , it can be shown that the posterior may be improper. So in general one needs to assign a proper prior to . One convenient choice is to assume that has a logistic density with location and scale . In my example, I assumed that .

Since I was arbitrary in my choice of parameters for this logistic prior on , I tried different choices for and . I tried values of between -4 and 4 and values of between 0.2 and 10. What did I find? The posterior mode of stayed in the 4.56 – 4.61 range for all of the priors I considered. In this example, we have data from 605 players and clearly the likelihood is driving the inference.

What if the choice of and does matter? Then one has to think more carefully about these parameters. In this baseball context, I am pretty familiar with home run hitting rates. Based on this knowledge, I have to figure out a reasonable guess at the variability of the home run probabilities that would give me information about .

## Constructing a prior in a 2 x 2 table

At BGSU, all freshmen take a mathematics placement test and the results of the test, together with some other information, is used to advise students on the appropriate first math class to take.

The other information consists of the student’s score on the ACT Math exam and his/her high school grade point average, or GPA. We are interested in examining the relationship between the GPA and the ACT Math score.

For the student taking Form B of the math placement exam, 63% of the students have a GPA over 3.0. Suppose you categorize these students into two groups, those who have scored 21 or less on the ACT exam, and those who have scored over 21 on the ACT.

GPA <= 3.0 GPA > 3.0

ACT <=21 p1 1 – p1

ACT > 21 p2 1 – p2

How do we construct a prior on the proportions p1 and p2? Let’s step through the process.

1. Here it doesn’t make any sense to assign p1 and p2 independent distributions. If I thought that if student probability of getting a low GPA is 0.37 and then you told me that the student scored low on the ACT exam, then I would think it is more likely that the student gets a low GPA. That is, I would think the student’s chance of getting a GPA <= 3.0 would be higher than 0.37. Clearly, p1 and p2 should have a dependent prior.

2. A popular way of expressing the relationship between p1 and p2 is by use of a log odds-ratio. Let T1 = log (p1/(1-p1)) be the log odds of p1, and let T2 = log(p2/(1-p2)) be the log odds of p2. A log odds converts a proportion (between 0 and 1) to a real-valued quantity. A proportion of 0.5 corresponds to a log odds of 0, a probability less than 0.5 corresponds to a negative odds ratio, and so on. Then one compares the proportions by means of the difference in log odds, or the log odds ratio

.

3. Suppose we transform the proportions and to the log odds ratio and the sum of log odds (or logits)

,

and place independent priors on and .

4. Let’s illustrate using this prior. I assume and are independent with

, .

The prior says that (1) I don’t know much about the sizes of the proportions (a vague prior placed on ), but (2) I believe that a larger proportion of the “low ACT” students will have low GPAs than the “high ACT students” (an informative prior placed on ).

5. This prior induces a dependent prior on and . This is easy to illustrate using R. First I’ll simulate 1000 draws from the posterior of and and transform these values back to the proportions and . I construct a scatterplot of () and put the line on top. This clearly shows the dependence in the prior and it reflects the belief that .

theta1=rnorm(1000,0.5,.3)

theta2=rnorm(1000,0,2)

logitp1 = (theta1 + theta2)/2

logitp2 = (theta2 – theta1)/2

p1=exp(logitp1)/(1+exp(logitp1))

p2=exp(logitp2)/(1+exp(logitp2))

plot(p1,p2,pch=19,col=”red”)

abline(0,1,lwd=3)

## Constructing a prior for a Poisson mean

I am interested in learning about my son’s cell phone use. Suppose the mean number of text messages that he sends and receives per day is equal to . I will observe the number of text messages for days — we’ll assume that conditional on , are independent Poisson().

How do I construct a prior density on ? Suppose I model my beliefs by use of a gamma density with shape parameter and rate parameter . I want to figure out the values of the prior parameters and .

There are two distinct ways of assessing your prior. One can think about plausible values of the mean number of daily text messages . Alternatively, it may be easier to think about the actual number of text messages — if we assume a gamma prior, then one can show that the predictive density of is given by

Personally, I think it is easier to think about the number of text messages . My best prediction at is 10. After some additional thought, I decide that my standard deviation for is equal to 3.5. One can show that the mean and standard deviation of the predictive density are given by

If one matches my guesses with these expressions, one can show .

To see if these values make any sense, I plot my predictive density. This density is a special case of a negative binomial density where (using the R notation)

size = a, prob = b /(b + 1).

I graph using the following R commands. One can compute P(5 <= y <= 15) = 0.89 which means that on 89% of the days, I believe Steven will send between 5 and 15 messages.

a = 44.4; b = 4.4

plot(0:30,dnbinom(0:30,size=a,prob=b/(b+1)),type=”h”,

xlab=”y”, ylab=”Predictive Prob”, col=”red”, lwd=2)

## Probit Modeling

I should have put more prior modeling in my Bayesian R book. One of the obvious advantages of the Bayesian approach is the ability to incorporate prior information. I’ll illustrate the use of informative priors in a simple setting — binary regression modeling with a probit link where one has prior information about the regression vector.

At my school, many students typically take a precalculus class that prepares them to take a business calculus class. We want our students to do well (that is, get a A or B) in the calculus class and wish to understand how the student’s performance in the precalculus class and his/her ACT math score are useful in predicting the student’s success.

Here is the probit model. If y=1 and y=0 represent respectively a student doing well and not well in the calculus class, then we model the probability that y=1 by

Prob(y=1) = Phi(beta0 + beta1 (PrecalculusGrade) + beta2 (ACT))

where Phi() is the standard normal cdf, PrecalculusGrade is 1 (0) if the student gets an A (B or C) in the precalculus class, and ACT is the math ACT score.

Suppose that the regression vector beta = (beta0, beta1, beta2) is assigned a multivariate normal prior with mean vector beta0 and precision matrix P. Then there is a simple Gibbs sampling algorithm for simulating from the posterior distribution of beta. The algorithm is based on the idea of augmenting the problem with latent continuous data from a normal distribution.

Although you might not understand the code, one can implement one iteration of Gibbs sampling by three lines of R code:

z=rtruncated(N,LO,HI,pnorm,qnorm,X%*%beta,1)

mn=solve(BI+t(X)%*%X,BIbeta0+t(X)%*%z)

beta = t(aa) %*% array(rnorm(p), c(p, 1)) + mn

Anyway, I want to focus on using this model with prior information.

1. First suppose I have a “prior dataset” of 50 students. I fit this probit model with a vague prior on beta. The inputs to the function bayes.probit.prior are (1) the vector of binary responses y, (2) the covariate matrix X, and (3) the number of iterations of the Gibbs sampoler.

fit1=bayes.probit.prior(prior.data[,1],prior.data[,-1],1000)

2. I compute the posterior mean and posterior variance-covariance matrix of the simulated draws of beta. I use these values for my multivariate normal prior on beta.

prior=list(beta=apply(fit1,2,mean),P=solve(var(fit1)))

(Note that I’m inputting the precision matrix P that is the inverse of the var-cov matrix.)

3. Using this informative prior, I fit the probit model with a new sample of 100 students.

fit2=bayes.probit.prior(DATA[,1],DATA[,-1],10000,prior=prior)

The only change in the input is that I input a list “prior” that includes the mean “beta” and the precision matrix P.

4. Now I summarize my fit to learn about the relationship of previous grade and ACT on the success of the calculus students. I define a grid of ACT scores and consider two sets of covariates corresponding to students who were not successful (0) and successful (1) in precalculs.

act=seq(15,29)

x0=cbind(1,0,act)

x1=cbind(1,1,act)

Then I use the function bprobit.probs to obtain posterior samples of the probability of success in calculus for each set of covariates.

fit.x0=bprobit.probs(x0,fit2)

fit.x1=bprobit.probs(x1,fit2)

I graph the posterior means of the fitted probabilities in the below graph. There are two lines — one corresponding to the students who aced the precalculus class, and another line corresponding to the students who did not ace precalculus.

Several things are clear from this graph. The performance in the precalc class matters — students who ace precalculus have a 30% higher chance of succeeding in the calculus class. On the other hand, the ACT score (that the student took during high school) has essentially no predictive ability. It is interesting that the slopes of the lines are negative, but these clearly isn’t significant.

## Conditional means prior

In an earlier post, we illustrated Bayesian fitting of a logistic model using a noninformative prior. Suppose instead that we have subjective beliefs about the regression vector. A convenient way of representing these beliefs is by use of a conditional means prior. We illustrate this for our math placement example.

First, we consider the probability p1 that a student in placement level 1 receives an A in the class. Our best guess at p1 is 0.05 and this belief is worth 200 observations — we match this info to a beta(200*0.05, 200*0.95) prior. Next, we consider the probability p5 that a student in level 5 gets an A — our guess at this probability is 0.15 and this guess is worth 200 observations. We match this belief to a beta(200*0.15, 200*0.85) prior. Assuming that our beliefs about p1 and p5 are independent, the joint prior on (p1, p5) is a product of beta densities. Transforming back to the (beta0, beta1) scale, one can show that the prior on beta is given by

g(beta) = p1^a1 (1-p1)^b1 p5^a2 (1-p5)^b2

(Note that the conditional means prior translates to the same functional form as the likelihood where the beta parameters a1, b1, a2, b2 play the role of “prior data”.)

The below figure displays (in red) a sample from the likelihood and (in blue) a sample from the prior. Here we see some conflict between the prior beliefs and the data information about the parameter.