## Bayesian model selection

Here we illustrate one advantage of Bayesian regression modeling. By the use of an informative prior, it is straightforward to implement regression model selection.

Arnold Zellner introduced an attractive method of implementing prior information into a regression model. He assumes [beta | sigma^2] is normal with mean beta0 and variance-covariance matrix of the form

V = c sigma^2 (X’X)^(-1)

and then takes [sigma^2] distributed according to the noninformative prior proportional to 1/sigma^2. This is called Zellner’s G-prior.

One nice thing about this prior is that it requires only two prior inputs from the user: (1) a choice of the prior mean beta0, and (2) c that can be interpreted as the amount of information in the prior relative to the sample.

We can use Zellner’s G-prior to implement model selection in a regression analysis. Suppose we have p predictors of interest — there are 2^p possible models corresponding to the inclusion or exclusion of each predictor in the model.

A G-prior is placed on the full model that contains all parameters. We assume that beta0 is the zero vector and choose c to be a large value reflecting vague prior beliefs. The prior on (beta, sigma^2) for this full model is

N(beta; beta0, c sigma^2 (X’X)^(-1)) (1/sigma^2)

Then for any submodel defined by a reduced design matrix X1, we take the prior on (beta, sigma^2) to be

N(beta; beta0, c sigma^2 (X1’X1)^(-1)) (1/sigma^2)

Then we can compare models by the computation of associated predictive probabilities.

To illustrate this methodology, we consider an interesting dataset on the behavior of puffins from Devore and Peck’s text. One is interesting in understanding the nesting frequency behavior of these birds (the y variable) on the basis of four covariates: x1, the grass cover, x2, the mean soil depth, x3, the angle of slope, and x4, the distance from cliff edge.

We first write a function that computes the log posterior of (beta, log(sigma)) for a regression model with normal sampling and a Zellner G prior with beta0 = 0 and a given value of c.

regpost.mod=function(theta,stuff)

{

y=stuff$y; X=stuff$X; c=stuff$c

beta=theta[-length(theta)]; sigma=exp(theta[length(theta)])

if (length(beta)>1)

loglike=sum(dnorm(y,mean=X%*%as.vector(beta),sd=sigma,log=TRUE))

else

loglike=sum(dnorm(y,mean=X*beta,sd=sigma,log=TRUE))

logprior=dmnorm(beta,mean=0*beta,varcov=c*sigma^2*solve(t(X)%*%X),log=TRUE)

return(loglike+logprior)

}

We read in the puffin data and define the design matrix X for the full model.

puffin=read.table(“puffin.txt”,header=T)

X=cbind(1,puffin$x1,puffin$x2,puffin$x3,puffin$x4)

S, the input for the log posterior function, is a list with components y, X, and c.

S=list(y=puffin$y, X=X, c=100)

Since there are 4 covariates, there are 2^4 = 16 possible models. We define a logical matrix GAM of dimension 16 x 5 that describes the inclusion of exclusion of each covariate in the model. (The first column is TRUE since we want to include the constant term in each regression model.)

GAM=array(T,c(16,5))

TF=c(T,F); k=0

for (i1 in 1:2) {for (i2 in 1:2) {for (i3 in 1:2) {for (i4 in 1:2){

k=k+1; GAM[k,]=cbind(T,TF[i1],TF[i2],TF[i3],TF[i4])}}}}

For each model, we use the laplace function (in the LearnBayes package) to compute the marginal likelihood. The inputs to laplace are the function regpost.mod defining our model, an intelligent guess at the model (given by a least-squares fit), and the list S that contains y, X, and c.

gof=rep(0,16)

for (j in 1:16)

{

S$X=X[,GAM[j,]]

theta=c(lm(S$y~0+S$X)$coef,0)

gof[j]=laplace(regpost.mod,theta,S)$int

}

We display each model and the associated marginal likelihood values (on the log scale).

data.frame(GAM,gof)

X1 X2 X3 X4 X5 gof

1 TRUE TRUE TRUE TRUE TRUE -104.1850

2 TRUE TRUE TRUE TRUE FALSE -115.4042

3 TRUE TRUE TRUE FALSE TRUE -102.3523

4 TRUE TRUE TRUE FALSE FALSE -136.3972

5 TRUE TRUE FALSE TRUE TRUE -105.0931

6 TRUE TRUE FALSE TRUE FALSE -113.1782

7 TRUE TRUE FALSE FALSE TRUE -105.5690

8 TRUE TRUE FALSE FALSE FALSE -134.0486

9 TRUE FALSE TRUE TRUE TRUE -101.8833

10 TRUE FALSE TRUE TRUE FALSE -114.9573

11 TRUE FALSE TRUE FALSE TRUE -100.3735

12 TRUE FALSE TRUE FALSE FALSE -134.5129

13 TRUE FALSE FALSE TRUE TRUE -102.8117

14 TRUE FALSE FALSE TRUE FALSE -112.6721

15 TRUE FALSE FALSE FALSE TRUE -103.2963

16 TRUE FALSE FALSE FALSE FALSE -132.1824

I highlight the model (inclusion of covariates X3 and X5) that has the largest value of the log marginal likelihood. This tells us that the best model for understanding nesting behavior is the one that includes mean soil depth (X3) and the distance from cliff edge (X5) . One can also compare different models by the use of Bayes factors.

****************************************************************************

For students who have to do their own model selection, I’ve written a simple function

bayes.model.selection.R

that will give these log marginal likelihood values for all regression models. You have to download this function from bayes.bgsu.edu/m648 and have LearnBayes 1.11 installed on your machine.

Here’s an example of using this function. I first load in the puffin dataset and define the response vector y, the covariate matrix X, and the prior parameter c.

puffin=read.table(“puffin.txt”,header=T)

X=cbind(1,puffin$x1,puffin$x2,puffin$x3,puffin$x4)

y=puffin$y

c=100

Then I just run the function bayes.model.selection with y, X, and c as inputs.

bayes.model.selection(y,X,c)

The output will be the data frame shown above.

## Looking for True Streakiness

There is a lot of interest in streaky behavior in sports. One observes players or teams that appear streaky with the implicit conclusion that this says something about the character of the athlete.

Eric Byrnes had 412 opportunities to hit during the 2005 baseball season. Here is his sequence of hits (successes) and outs (failures) during the season.

[1] 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0

[38] 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0

[75] 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0

[112] 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0

[149] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0

[186] 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0

[223] 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0

[260] 0 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0

[297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0

[334] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

[371] 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 1 0 0

[408] 0 0 0 0 0

One way of seeing the streaky behavior in this sequence is by a moving average graph where one plots the success rate (batting average) for windows of 40 at-bats. I wrote a short program mavg.R to compute the moving averages. The following R code plots the moving averages and plots a lowess smooth on top to help see the pattern.

MAVG=mavg(byrne$x,k=40)

plot(MAVG,type=”l”,lwd=2,col=”red”,xlab=”GAME”,ylab=”AVG”,

main=”ERIC BYRNES”)

We some interesting patterns.. It seemed that Byrnes had a cold spell in the first part of the season, followed by a hot period, and then a very cold period.

The interesting question is: is this streaky pattern “real” or is it just a byproduct of bernoulli chance variation?

We answer this question by means of a Bayes factor. Suppose we partition Byrnes’ 412 at-bats into groups of 20 at-bats. We observe counts y1, …, yn, where yi is the number of hits in the ith group. Suppose yi is binomial(20, pi) where pi is the probability of a hit in the ith period.

We define two hypotheses:

H (not streaky) the probabilities across periods are equal, p1 = … = pn = p

A (streaky) the probabilities across periods vary according to a beta distribution with mean eta and precision K. This model is indexed by the parameter K.

The functions bfexch and laplace in the LearnBayes package can be used to compute a Bayes factor in support of A over H. Here is how we do it.

1. The raw data is in the matrix BRYNE — the first column contains the data (0’s and 1’s) and the second column contains the attempts (column of 1’s). We regroup the data into periods of 20 at-bats using the regroup function.

regroup(BRYNE, 20)

2. The following R function laplace.int will compute the log, base10 of the Bayes factor in support of streakiness for a fixed value of log(K).

laplace.int=function(logK,data=data1)

log10(exp(laplace(bfexch,0,list(data=data,K=exp(logK)))$int))

To illustrate, suppose we want to compute the log10 Bayes factor for our data for logK = 3:

> laplace.int(3,regroup(BRYNE,20))

[,1]

[1,] 1.386111

This indicates support for streakiness — the log Bayes factor is 1.38 which means that A is over 10 times more likely than H.

3. Generally we’d like to compute the log10 Bayes factor for a sequence of values of log K. I first write a simple function that does this:

s.laplace.int=function(logK,data)

list(x=logK,y=sapply(logK,laplace.int,data))

and then I use this function to compute the Bayes factor for values of log K from 2 to 6 in steps of 0.2. I use the plot command to graph these values. I draw a line at the value log10 BF = 0 — this corresponds to the case where neither model is supported.

plot(s.laplace.int(seq(2,6,by=.2),regroup(BRYNE,20)),type=”l”,

xlab=”LOG K”, ylab=”LOG 10 BAYES FACTOR”, lwd=3, col=”red”, ylim=c(-3,2))

lines(c(1,7),c(0,0),lwd=3,col=”blue”)

title(main=”ERIC BYRNES”)

What we see that, for a range of values of K, the Bayes factor favors the model A by a factor of 10 or more.

Actually we only looked at Eric Byrnes since he exhibited unusually streaky behavior during this 2005 season. What if we look at other players? Here are the Bayes factors graphs for the hitting data for two other players Chase Utley and Damian Miller (we are grouping the data in the same way).

Here for both players, note that the log10 Bayes factors are entirely negative for the range of K values. For both players, there is support for the non-streaky model H. One distinctive features of Bayes factors is that they can provide support for the null or the alternative hypothesis.

## Test of Independence in a 2 x 2 Table

Consider data collected in a study described in Dorn (1954) to assess the relationship between smoking and lung cancer. In this study, a sample of 86 lung-cancer patients and a sample of 86 controls were questioned about their smoking habits. The two groups were chosen to represent random samples from a sub-population of lung-cancer patients and an otherwise similar population of cancer-free individuals.

Here is the 2 x 2 table of responses:

Cancer Control

Smokers 83 72

Non-smokers 3 14

Let pL and pC denote the proportions of lung-cancer patients and controls who smoke. We wish to test H: pL = pC against the alternative A: pL pC.

To construct a Bayesian test, we define a suitable model for H and for A, and then compute the Bayes factor in support of the alternative A.

1. To describe these models, first we transform the proportions to the logits

LogitL = log(pL/(1-pL)), Logit C = log(pC/(1-pC))

2. We then define two parameters theta1, theta2, that are equal to the difference and sum of the logits.

theta1 = LogitL – LogitC, theta2 = LogitL + LogitC.

theta1 is the log odds ratio, a popular measure of association in a 2 x 2 table. Under the hypothesis of independence H, theta1 = 0.

3. Consider the following prior on theta1 and theta2. We assume they are independent where

theta1 is N(0, tau1), theta2 is N(0, tau2).

4. Under H (independence), we assume theta1 = 0, so we set tau1 = 0. theta2 is a nuisance parameter that we arbitrarily be N(0, 1). (The Bayes factor will be insensitive to this choice.)

5. Under A (not independence), we assume theta1 is N(0, tau1), where tau1 reflects our beliefs about the location of theta1 when the proportions are different. We also assume again that theta2 is N(0, 1). (This means that our beliefs about theta2 are insensitive to our beliefs about theta1.)

To compute the marginal densities, we write a function that computes the logarithm of the posterior when (theta1, theta2) have the above prior.

logctable.test=function (theta, datapar)

{

theta1 = theta[1] # log odds ratio

theta2 = theta[2] # log odds product

s1 = datapar$data[1,1]

f1 = datapar$data[1,2]

s2 = datapar$data[2,1]

f2 = datapar$data[2,2]

logitp1 = (theta1 + theta2)/2

logitp2 = (theta2 – theta1)/2

loglike = s1 * logitp1 – (s1 + f1) * log(1 + exp(logitp1))+

s2 * logitp2 – (s2 + f2) * log(1 + exp(logitp2))

logprior = dnorm(theta1,mean=0,sd=datapar$tau1,log=TRUE)+

dnorm(theta2,mean=0,sd=datapar$tau2,log=TRUE)

return(loglike+logprior)

}

We enter the data as a 2 x 2 matrix:

data=matrix(c(83,3,72,14),c(2,2))

data

[,1] [,2]

[1,] 83 72

[2,] 3 14

The argument datapar in the function is a list consisting of data, the 2 x 2 data table, and the values of tau1 and tau2.

Suppose we assume theta1 is N(0, .8) under the alternative hypothesis. This prior is shown in the below figure.

By using the laplace function, we compute the log marginal density under both models. (For H, we are approximating the point mass of theta1 on 1 by a normal density with a tiny standard deviation tau1.)

l.marg0=laplace(logctable.test,c(0,0),list(data=data,tau1=.0001,tau2=1))$int

l.marg1=laplace(logctable.test,c(0,0),list(data=data,tau1=0.8,tau2=1))$int

We compute the Bayes factor in support of the hypothesis A.

BF.10=exp(l.marg1-l.marg0)

BF.10

[1] 7.001088

The conclusion is that the alternative hypothesis A is seven times more plausible than the null hypothesis H.

## Simple Illustration of Bayes Factors

Suppose we collect the number of accidents in a year for 30 Belgium drivers. We assume that y1,…, y30 are independent Poisson(lambda), where lambda is the average number of accidents for all Belgium drivers.

Consider the following four priors for lambda:

PRIOR 1: lambda is gamma with shape 3.8 and rate 8.1. This prior reflects the belief that the quartiles of lambda are 0.29 and 0.60.

PRIOR 2: lambda is gamma with shape 3.8 and rate 4. The mean of this prior is 0.95 so this prior reflects one’s belief that lambda is close to 1.

PRIOR 3: lambda is gamma with shape 0.38 and 0.81. This prior has the same mean as PRIOR 1, but it is much more diffuse, reflecting weaker information about lambda.

PRIOR 4: log lambda is normal with mean -0.87 and standard deviation 0.60. On the surface, this looks different from the previous priors, but this prior also matches the belief that the quartiles of lambda are 0.29 and 0.60.

Suppose we observe some data — for the 30 drivers, 22 had no accidents, 7 had exactly one accident, and 1 had two accidents. The likelihood is given by

LIKE = exp(-30 lambda) lambda^9

In the below graphs, we display the likelihood in blue and show the four priors in red. Here’s the R code to produce one of the graphs. We simulate draws from the likelihood and the prior and display density estimates.

like=rgamma(10000,shape=10,rate=30)

p1=rgamma(10000,shape=3.8,rate=8.1)

plot(density(p1),xlim=c(0,3),ylim=c(0,4),

main=”PRIOR 1″,xlab=”LAMBDA”,lwd=3,col=”red”,col.main=”red”)

lines(density(like),lwd=3,col=”blue”)

text(1.2,3,”LIKELIHOOD”,col=”blue”)

Note that Priors 1 and 4 are pretty consistent with the likelihood. There is some conflict between Prior 2 and the likelihood and Prior 3 is pretty flat relative to the likelihood.

We can compare the four models by use of Bayes factors. We first compute a function that computes the log posterior for each prior. There already is a function logpoissgamma in the LearnBayes package that computes the posterior of log lambda with Poisson sampling and a gamma prior. (This can be used for priors 1, 2, and 3.) The function logpoissnormal can be used for Poisson sampling and a normal prior (prior 4). Then we use the function laplace to approximate the value of the log predictive density.

For example, here’s the code to compute the log marginal density for prior 1.

datapar=list(data=d,par=c(3.8,8.1))

laplace(logpoissgamma,.5,datapar)$int

0.4952788

So log m(y) for prior 1 is about 0.5.

We do this for each prior and get the following values:

model log m(y)

—————–

PRIOR 1 0.495

PRIOR 2 -0.729

PRIOR 3 – 0.454

PRIOR 4 0.558

We can use this output to compute Bayes factors. For example, the Bayes factor in support of PRIOR 1 over PRIOR 2 is

BF_12 = exp(0.495 – (-0.729)) = 3.4

This means that the model with PRIOR 1 is about three and a half times as likely as the model with PRIOR 2. This is not surprising, seeing the conflict between the likelihood and the Bayes factor in the graph.

## Conflict between Bayesian and Frequentist Measures of Evidence

Here’s a simple illustration of the conflict between a p-value and a Bayesian measure of evidence.

Suppose you take a sample y1,…, yn from a normal population with mean mu and known standard deviation sigma. You wish to test

H: mu = mu0 A: mu not equal to mu0

The usual test is based on the statistic Z = sqrt(n)*(ybar – mu0)/sigma. One computes the p-value

p-value = 2 x P(Z >= z0)

and rejects H if the p-value is small. Suppose mu0 = 0, sigma = 1, and one takes a sample of size n = 4 and observe ybar = 0.98. Then one computes

Z = sqrt(4)*0.98 = 1.96

and the p-value is

p-value = 2 * P(Z > = 1.96) = 0.05.

Consider the following Bayes test of H against A. A Bayesian model is a specification of the sampling density and the prior density. One model M0 says that the mean mu = mu0. To complete the second model M1, we place a normal prior with mean mu0 and standard deviation tau on mu. The Bayes factor in support of the M0 over the model M1 is given by the ratio of predictive densities

BF = m(y | M0)/m(y|M1)

and the posterior probability of M0 is given by

P(M0| y) = p0 BF/(p0 BF + p1),

where p0 is the prior probability of M0.

The function mnormt.twosided in the LearnBayes package does this calculation. To use this function, we specify (1) the value m0 to be tested, (2) the prior probability of H, (3) the value of tau (the spread of the prior under A), and (4) the data vector that is (ybar, n, sigma).

Here we specify the inputs:

mu0=0; prob=.5; tau=0.5

ybar = 0.98; n = 4; sigma=1

data=c(ybar,n,sigma)

Then we can use mnormt.twosided — the outputs are the Bayes factor and the posterior probability of H:

mnormt.twosided(mu0,prob,tau,data)

$bf

[1] 0.5412758

$post

[1] 0.3511868

We see that the posterior probability of H0 is 0.35 which is substantially higher than the p-value of 0.05.

In this calculation, we assumed that tau = 0.5 — this reflect our belief about the spread of mu about mu0 under the alternative hypothesis. What if we chose a different value for tau?

We investigate the sensitivity of this posterior probability calculation with respect to tau.

We write a function that computes the posterior probability for a given value of tau.

post.prob=function(tau)

{

data=c(.98,4,1); mu0=0; prob0=.5

mnormt.twosided(mu0,prob,tau,data)$post

}

Then we use the curve function to plot this function for values of tau between 0.01 to 4.

curve(post.prob,from=.01,to=4,xlab=”TAU”,ylab=”PROB(H0)”,lwd=3,col=”red”)

In the figure below, it looks like the probability of H exceeds 0.32 for all tau.

## Illustration of posterior predictive checking

To illustrate posterior predictive checking, suppose we observe data y1,…,yn that we assume is N(mu, sigma). We place a noninformative prior on (mu, sigma) and learn about the parameters based on the posterior distribution. We use the posterior predictive distribution to check out model — to see if our sample is consistent with samples predicted from our fitted model.

In practice, we construct a checking function d that is a function of the future sample y*. In this example (66 measurements of the speed of light), we suspect that the smallest observation is an outlier. So we use a checking function d(y) = y_min.

Here’s an R procedure for checking our model using this diagnostic.

We load in the speed of light data:

> y=scan(“speedoflight.txt”)

Read 66 items

> y

[1] 28 22 36 26 28 28 26 24 32 30 27 24 33 21 36 32 31 25 24

[20] 25 28 36 27 32 34 30 25 26 26 25 -44 23 21 30 33 29 27 29

[39] 28 22 26 27 16 31 29 36 32 28 40 19 37 23 32 29 -2 24 25

[58] 27 24 16 29 20 28 27 39 23

We simulate 1000 draws from the normal/scale x inv-chi-square posterior using the function normpostsim.

parameters=normpostsim(y,1000)

The output is a list — here parameters$mu contains draws from mu, and parameters$sigma2 contains draws of sigma2.

The simulation of samples from the posterior predictive distribution is done by the function normpostpred. By use of comments, we explain how this function works.

normpostpred=function(parameters,sample.size,f=min)

{

# the function normalsample simulates a single sample given parameter values mu and sigma2.

# the index j is the index of the simulation number

normalsample=function(j,parameters,sample.size)

rnorm(sample.size,mean=parameters$mu[j],sd=sqrt(parameters$sigma2[j]))

# we use the sapply command to simulate many samples, where the number of samples

# corresponds to the number of parameter simulations

m=length(parameters$mu)

post.pred.samples=sapply(1:m,normalsample,parameters,sample.size)

# on each posterior predictive sample we compute a stat, the default stat is min

stat=apply(post.pred.samples,2,f)

# the function returns all of the posterior predictive samples and a vector of values of the

# checking diagnostic.

return(list(samples=post.pred.samples,stat=stat))

}

Here the size of the original sample is 66. To generate 1000 samples of size 66 from the posterior predictive distribution, and storing the mins of the samples, we type

post.pred=normpostpred(parameters,66)

The values of the minimum observation in all samples is stored in the vector post.pred$stat. We display the mins in a histogram and draw a vertical line showing the location of the min of the data.

> hist(post.pred$stat,xlim=c(-50,15))

> lines(min(y)*c(1,1),c(0,350),lwd=3,col=”red”)

Clearly the smallest observation is inconsistent with the mins generated from samples from the posterior predictive distribution. This cast doubts on the normality assumption of the data.

## Test of hypothesis that coin is fair

In Section 3.5, I describe a Bayesian test of the hypothesis H that a proportion is equal to 0.5. The R function pbetat.R (in the LearnBayes package) computes the posterior probability of H assuming a beta prior on the alternative hypothesis. We illustrate how the posterior probability depends on the beta parameter a.

Here’s a quick way of constructing Figure 3.5 (page 52) using the curve function.

First we revise the function pbetat to accept a matrix argument for the beta parameters where each row of the matrix corresponds to a pair (a, b). The new function is called pbetat.v. Next, we write a short function best that computes the posterior probability of H for our example (5 successes and 15 failures) for a vector of values of log a. (We are assuming a symmetric beta(a, a) prior on the alternative hypothesis.)

best=function(loga)

{p0=.5; prob=.5

data=c(5,15)

AB=exp(cbind(loga,loga))

s=pbetat.v(p0,prob,AB,data)

return(s$post)}

To generate the figure, we can use the curve function.

curve(best,from=-4,to=5,xlab=”log a”,ylab=”Prob(coin is fair)”,lwd=3)

We see that for all values of the beta parameter a, the posterior probability of the hypothesis exceeds 0.2.