## Comparing sampling models by Bayes factors

In the last posting, I illustrated fitting a t(4) sampling model to some baseball data where I suspected there was an outlier. To compare this model to other sampling models, we can use Bayes factors.

Here is a t sampling model with a convenient choice of prior.

1.is a random sample from

2. and are independent, with distributed and distributed

I define a R function tsampling.R that computes the logarithm of the posterior. Note that I am careful to include all of the normalizing constants, since I am primarily interested in computing the marginal density of .

tsampling=function(theta,datapar)

{

mu=theta[1]; logsigma=theta[2]; sigma=exp(logsigma)

y=datapar$y; df=datapar$df

mu.mean=datapar$mu.mean; mu.sd=datapar$mu.sd

lsigma.mean=datapar$lsigma.mean; lsigma.sd=datapar$lsigma.sd

loglike=sum(dt((y-mu)/sigma,df,log=TRUE)-log(sigma))

logprior=dnorm(mu,mu.mean,mu.sd,log=TRUE)+

dnorm(logsigma,lsigma.mean,lsigma.sd,log=TRUE)

return(loglike+logprior)

}

We use this function together with the function laplace to compute the log marginal density for two models.

Model 1 — t(4) sampling, is normal(90, 20), is normal(1, 1).

Model 2 — t(30) sampling, is normal(90, 20), is normal(1, 1).

Note that I’m using the same prior for both models — the only difference is the choice of degrees of freedom in the sampling density.

dataprior1=list(y=y, df=4, mu.mean=90, mu.sd=20,

lsigma.mean=1, lsigma.sd=1)

log.m1=laplace(tsampling,c(80, 3), dataprior1)$int

dataprior2=list(y=y, df=30, mu.mean=90, mu.sd=20,

lsigma.mean=1, lsigma.sd=1)

log.m2=laplace(tsampling,c(80, 3), dataprior2)$int

BF.12=exp(log.m1-log.m2)

BF.12

[1] 14.8463

We see that there is substantial support for the t(4) model over the “close to normal” t(30) model.

## Robust Modeling using Gibbs Sampling

To illustrate the use of Gibbs sampling for robust modeling, here are the batting statistics for the “regular” players on the San Francisco Giants for the 2007 baseball season:

Pos Player Ag G AB R H 2B 3B HR RBI BB SO BA OBP SLG SB CS GDP HBP SH SF IBB OPS+

---+-------------------+--+----+----+----+----+---+--+---+----+---+----+-----+-----+-----+---+---+---+---+---+---+---+----+

C Bengie Molina 32 134 497 38 137 19 1 19 81 15 53 .276 .298 .433 0 0 13 2 1 2 2 86

1B *Ryan Klesko 36 116 362 51 94 27 3 6 44 46 68 .260 .344 .401 5 1 14 1 1 1 2 92

2B #Ray Durham 35 138 464 56 101 21 2 11 71 53 75 .218 .295 .343 10 2 18 2 0 9 6 65

3B Pedro Feliz 32 150 557 61 141 28 2 20 72 29 70 .253 .290 .418 2 2 15 1 0 3 2 81

SS #Omar Vizquel 40 145 513 54 126 18 3 4 51 44 48 .246 .305 .316 14 6 14 1 14 3 6 62

LF *Barry Bonds 42 126 340 75 94 14 0 28 66 132 54 .276 .480 .565 5 0 13 3 0 2 43 170

CF *Dave Roberts 35 114 396 61 103 17 9 2 23 42 66 .260 .331 .364 31 5 4 0 4 0 1 80

RF #Randy Winn 33 155 593 73 178 42 1 14 65 44 85 .300 .353 .445 15 3 12 7 4 5 3 105

Rich Aurilia 35 99 329 40 83 19 2 5 33 22 45 .252 .304 .368 0 0 8 4 0 3 1 73

Kevin Frandsen 25 109 264 26 71 12 1 5 31 21 24 .269 .331 .379 4 3 17 5 3 3 3 84

*Fred Lewis 26 58 157 34 45 6 2 3 19 19 32 .287 .374 .408 5 1 4 3 1 0 0 103

#Dan Ortmeier 26 62 157 20 45 7 4 6 16 7 41 .287 .317 .497 2 1 2 1 0 2 1 107

Rajai Davis 26 51 142 26 40 9 1 1 7 14 25 .282 .363 .380 17 4 0 4 2 0 1 93

*Nate Schierholtz 23 39 112 9 34 5 3 0 10 2 19 .304 .316 .402 3 1 0 1 0 2 0 85

*Mark Sweeney 37 76 90 18 23 8 0 2 10 13 18 .256 .368 .411 2 0 0 3 1 0 0 102

We’ll focus on the last batting measure OPS that is a summary of a player’s batting effectiveness.

We read the OPS values for the 15 players into R into the vector y.

> y

[1] 86 92 65 81 62 170 80 105 73 84 103 107 93 85 102

We assume that the observations y1, …, y15 are iid from a t distribution with location , scale and 4 degrees of freedom. We place the usual noninformative prior on .

We implement 10,000 iterations of Gibbs sampling by use of the function robustt.R in the LearnBayes package. This function is easy to use — we just input the data vector y, the degrees of freedom, and the number of iterations.

fit=robustt(y,4,10000)

The object fit is a list with components mu, sigma2, and lam — mu is a vector of simulated draws of , sigma2 is a vector of simulated draws of , and lam is a matrix of simulated draws of the scale parameters . (The are helpful for identifying outliers in the data — here the outlier is pretty obvious.)

Below I have graphed the data as solid dots and placed a density estimate of the posterior of on top. We see that the estimate of appears to ignore the one outlier (Barry Bonds) in the data.

plot(y,0*y,cex=1.5,pch=19,ylim=c(0,.1),ylab=”DENSITY”,xlab=”MU”)

lines(density(fit$mu),lwd=3,col=”red”)

By the way, we have assumed that robust modeling was suitable for this dataset since I knew that the Giants had one unusually good hitter on their team. Can we formally show that robust modeling the t(4) distribution is better than normal modeling for these data?

Sure — we can define two models (with the choice of proper prior distributions) and compare the models by use of a Bayes factor. I’ll illustrate this in my next posting.

## Gibbs Sampling for Censored Regression

Gibbs sampling is very convenient for many “missing data” problems. To illustrate this situation, suppose we have the simple regression model

where the errors are iid . The problem is that some of the response variables are right-censored and we actually observe , where is a known censoring value. We know which observations are uncensored () and which observations are censored ().

We illustrate this situation by use of a picture inspired by a similar one in Martin Tanner’s book. We show a scatterplot of the covariate against the observed . The points highlighted in red correspond to censored observations — the actual observations (the ) exceed the censored values, which we indicate by arrows.

To apply Gibbs sampling to this situation, we imagine a complete data set where all of the ‘s are known. The unknowns in this problem are the regression coefficients , the error variance , and the ‘s corresponding to the censored observations.

The joint posterior of all unobservables (assuming the usual vague prior for regression) has the form

where is equal to 1 if the observation is not censored, and if the observation is censored at .

With the introduction of the complete data set, Gibbs sampling is straightforward.

Suppose one has initial estimates at the regression coefficients and the error variance. Then

(1) One simulates from the distribution of the missing data (the for the censored observations) given the parameters. Specifically is simulated from a normal() distribution censored below by .

(2) Given the complete data, one simulates values of the parameters using the usual approach for a normal linear regression model.

To do this on R, I have a couple of useful tools. The function rtruncated.R will simulate draws from an arbitrary truncated distribution

rtruncated=function(n,lo,hi,pf,qf,…)

qf(pf(lo,…)+runif(n)*(pf(hi,…)-pf(lo,…)),…)

For example, if one wishes to simulate 20 draws of a normal(mean = 10, sd = 2) distribution that is truncated below by 4, one writes

rtruncated(20, 4, Inf, pnorm, qnorm, 10, 2)

Also the function blinreg.R in the LearnBayes package will simulate draws of a regression coefficient and the error variance for a normal linear model with a noninformative prior.

## Latex in my blog?

I didn’t know if it was possible to add latex to my postings. I asked John Shonder who is currently working on solutions on my book and has some latex in his WordPress blog. John referred me to a page

http://wolverinex02.googlepages.com/emoticonsforblogger2

that describes a simple procedure for typing in latex in one’s postings.

Anyway, I tried it out on the “Gibbs sampling for hierarchical models” posting and it works!

## A Comparison of Two MCMC Methods

In the last posting, I illustrated a Gibbs sampling algorithm for simulating from a normal/normal hierarchical model. This method was based on successive substitution sampling from the conditional posterior densities of theta1, …, thetak, mu, and tau2. A second sampling method is outlined in Chapter 7 of BCUR. In this method that we’ll call “Exact MCMC” one integrates out the first-stage parameters theta1, …, thetak and uses a random-walk Metropolis algorithm to simulate from the marginal posterior of (mu, log tau).

We’ll use a few pictures to compare these two methods, specifically in learning about the variance hyperparameter tau. Here’s the exact MCMC method:

1. We write a function defining the log posterior of mu and log tau.

2. We simulate from (mu, log tau) using the function rwmetrop.R in the LearnBayes package. We choose a suitable proposal function from output from the laplace.R function. The acceptance rate of this random walk algorithm is 20%. We sample for 10,000 iterations, saving the draws of log tau in a vector.

We then run the Gibbs sampler for 10,000 iterations, also saving the draws of log tau.

DENSITY PLOTS. We first compare density estimates of the simulated samples of log tau using the two methods. The two estimates seem pretty similar. The density estimate for the Gibbs sampling draws looks smoother, but that doesn’t mean this sample is a better estimate of the posterior of log tau.

TRACE PLOTS. We compare trace plots of the two sets of simulated draws. They look pretty similar, but there are differences in closer inspection. The draws from the Gibbs sampling run look more irregular, or perhaps they have more of a “snake-like” appearance.

AUTOCORRELATION PLOTS. A autocorrelation plot is a graph of the autocorrelations

corr (theta(j), theta(j-g))

graphed against the lag g. For most MCMC runs, the values of the stream of simulated draws will show positive autocorrelation, but hopefully the autocorrelation values decrease quickly as the lag increases. It is pretty obvious that the lag autocorrelation values drop off slower for method 2 (Gibbs sampling); in contrast, the lag autocorrelations decrease to zero for method 1 (Exact MCMC) for values of the lag from 1 to 15.

The moral of the story is that the exact MCMC method shows better mixing than Gibbs sampling in this example. This means that, for a given sample size (here 10,000), one will have more accurate estimates at the posterior of tau using the exact MCMC method.

## Gibbs Sampling for Hierarchical Models

Gibbs sampling is an attractive “automatic” method of setting up a MCMC algorithm for many classes of models. Here we illustrate using R to write a Gibbs sampling algorithm for the normal/normal exchangeable model.

We write the model in three stages as follows.

1. The observations y1, …, yk are independent where yj is N(thetaj, sigmaj^2), where we write N(mu, sigma2) to denote the normal density with mean mu and variance sigma2. We assume the sampling variances sigma1^2, …, sigmak^2 are known.

2. The means theta1,…, thetak are a random sample from a N(mu, tau2) population. (tau2 is the variance of the population).

3. The hyperparameters (mu, tau) are assumed to have a uniform distribution. This implies that the parameters (mu, tau2) have a prior proportional to (tau2)^(-1/2).

To write a Gibbs sampling algorithm, we write down the joint posterior of all parameters (theta1, …, thetak, mu, tau2):

From this expression, one can see

1. The posterior of thetaj conditional on all remaining parameters is normal, where the mean and variance are given by the usual normal density/normal prior updating formula.

2. The hyperparameter mu has a normal posterior with mean theta_bar (the sample mean of the thetaj) and variance tau2/k.

3. The hyperparameter tau2 has an inverse gamma posterior with shape (k-1)/2 and rate 1/2 sum(thetaj – mu)^2.

Given that all the conditional posteriors have convenient functional forms, we write a R function to implement the Gibbs sampling. The only inputs are the data matrix (columns containing yj and sigmaj^2) and the number of iterations m.

I’ll display the function normnormexch.gibbs.R with notes.

normnormexch.gibbs=function(data,m)

{

y=data[,1]; k=length(y); sigma2=data[,2] # HERE I READ IN THE DATA

THETA=array(0,c(m,k)); MU=rep(0,m); TAU2=rep(0,m) # SET UP STORAGE

mu=mean(y); tau2=median(sigma2) # INITIAL ESTIMATES OF MU AND TAU2

for (j in 1:m) # HERE’S THE GIBBS SAMPLING

{

p.means=(y/sigma2+mu/tau2)/(1/sigma2+1/tau2) # CONDITIONAL POSTERIORS

p.sds=sqrt(1/(1/sigma2+1/tau2)) # OF THETA1,…THETAK

theta=rnorm(k,mean=p.means,sd=p.sds)

mu=rnorm(1,mean=mean(theta),sd=sqrt(tau2/k)) # CONDITIONAL POSTERIOR OF MU

tau2=rigamma(1,(k-1)/2,sum((theta-mu)^2)/2) # CONDITIONAL POSTERIOR OF TAU2

THETA[j,]=theta; MU[j]=mu; TAU2[j]=tau2 # STORE SIMULATED DRAWS

}

return(list(theta=THETA,mu=MU,tau2=TAU2)) # RETURN A LIST WITH SAMPLES

}

Here is an illustration of this algorithm for the SAT example from Gelman et al.

y=c(28,8,-3,7,-1,1,18,12)

sigma=c(15,10,16,11,9,11,10,18)

data=cbind(y,sigma^2)

fit=normnormexch.gibbs(data,1000)

In the Chapter 7 exercise that used this example, a different sampling algorithm was used to simulate from the joint posterior of (theta, mu, tau2) — it was a direct sampling algorithm based on the decomposition

[theta, mu, tau2] = [mu, tau2] [theta | mu, tau2]

In a later posting, I’ll compare the two sampling algorithms.

## Bayesian model selection

Here we illustrate one advantage of Bayesian regression modeling. By the use of an informative prior, it is straightforward to implement regression model selection.

Arnold Zellner introduced an attractive method of implementing prior information into a regression model. He assumes [beta | sigma^2] is normal with mean beta0 and variance-covariance matrix of the form

V = c sigma^2 (X’X)^(-1)

and then takes [sigma^2] distributed according to the noninformative prior proportional to 1/sigma^2. This is called Zellner’s G-prior.

One nice thing about this prior is that it requires only two prior inputs from the user: (1) a choice of the prior mean beta0, and (2) c that can be interpreted as the amount of information in the prior relative to the sample.

We can use Zellner’s G-prior to implement model selection in a regression analysis. Suppose we have p predictors of interest — there are 2^p possible models corresponding to the inclusion or exclusion of each predictor in the model.

A G-prior is placed on the full model that contains all parameters. We assume that beta0 is the zero vector and choose c to be a large value reflecting vague prior beliefs. The prior on (beta, sigma^2) for this full model is

N(beta; beta0, c sigma^2 (X’X)^(-1)) (1/sigma^2)

Then for any submodel defined by a reduced design matrix X1, we take the prior on (beta, sigma^2) to be

N(beta; beta0, c sigma^2 (X1’X1)^(-1)) (1/sigma^2)

Then we can compare models by the computation of associated predictive probabilities.

To illustrate this methodology, we consider an interesting dataset on the behavior of puffins from Devore and Peck’s text. One is interesting in understanding the nesting frequency behavior of these birds (the y variable) on the basis of four covariates: x1, the grass cover, x2, the mean soil depth, x3, the angle of slope, and x4, the distance from cliff edge.

We first write a function that computes the log posterior of (beta, log(sigma)) for a regression model with normal sampling and a Zellner G prior with beta0 = 0 and a given value of c.

regpost.mod=function(theta,stuff)

{

y=stuff$y; X=stuff$X; c=stuff$c

beta=theta[-length(theta)]; sigma=exp(theta[length(theta)])

if (length(beta)>1)

loglike=sum(dnorm(y,mean=X%*%as.vector(beta),sd=sigma,log=TRUE))

else

loglike=sum(dnorm(y,mean=X*beta,sd=sigma,log=TRUE))

logprior=dmnorm(beta,mean=0*beta,varcov=c*sigma^2*solve(t(X)%*%X),log=TRUE)

return(loglike+logprior)

}

We read in the puffin data and define the design matrix X for the full model.

puffin=read.table(“puffin.txt”,header=T)

X=cbind(1,puffin$x1,puffin$x2,puffin$x3,puffin$x4)

S, the input for the log posterior function, is a list with components y, X, and c.

S=list(y=puffin$y, X=X, c=100)

Since there are 4 covariates, there are 2^4 = 16 possible models. We define a logical matrix GAM of dimension 16 x 5 that describes the inclusion of exclusion of each covariate in the model. (The first column is TRUE since we want to include the constant term in each regression model.)

GAM=array(T,c(16,5))

TF=c(T,F); k=0

for (i1 in 1:2) {for (i2 in 1:2) {for (i3 in 1:2) {for (i4 in 1:2){

k=k+1; GAM[k,]=cbind(T,TF[i1],TF[i2],TF[i3],TF[i4])}}}}

For each model, we use the laplace function (in the LearnBayes package) to compute the marginal likelihood. The inputs to laplace are the function regpost.mod defining our model, an intelligent guess at the model (given by a least-squares fit), and the list S that contains y, X, and c.

gof=rep(0,16)

for (j in 1:16)

{

S$X=X[,GAM[j,]]

theta=c(lm(S$y~0+S$X)$coef,0)

gof[j]=laplace(regpost.mod,theta,S)$int

}

We display each model and the associated marginal likelihood values (on the log scale).

data.frame(GAM,gof)

X1 X2 X3 X4 X5 gof

1 TRUE TRUE TRUE TRUE TRUE -104.1850

2 TRUE TRUE TRUE TRUE FALSE -115.4042

3 TRUE TRUE TRUE FALSE TRUE -102.3523

4 TRUE TRUE TRUE FALSE FALSE -136.3972

5 TRUE TRUE FALSE TRUE TRUE -105.0931

6 TRUE TRUE FALSE TRUE FALSE -113.1782

7 TRUE TRUE FALSE FALSE TRUE -105.5690

8 TRUE TRUE FALSE FALSE FALSE -134.0486

9 TRUE FALSE TRUE TRUE TRUE -101.8833

10 TRUE FALSE TRUE TRUE FALSE -114.9573

11 TRUE FALSE TRUE FALSE TRUE -100.3735

12 TRUE FALSE TRUE FALSE FALSE -134.5129

13 TRUE FALSE FALSE TRUE TRUE -102.8117

14 TRUE FALSE FALSE TRUE FALSE -112.6721

15 TRUE FALSE FALSE FALSE TRUE -103.2963

16 TRUE FALSE FALSE FALSE FALSE -132.1824

I highlight the model (inclusion of covariates X3 and X5) that has the largest value of the log marginal likelihood. This tells us that the best model for understanding nesting behavior is the one that includes mean soil depth (X3) and the distance from cliff edge (X5) . One can also compare different models by the use of Bayes factors.

****************************************************************************

For students who have to do their own model selection, I’ve written a simple function

bayes.model.selection.R

that will give these log marginal likelihood values for all regression models. You have to download this function from bayes.bgsu.edu/m648 and have LearnBayes 1.11 installed on your machine.

Here’s an example of using this function. I first load in the puffin dataset and define the response vector y, the covariate matrix X, and the prior parameter c.

puffin=read.table(“puffin.txt”,header=T)

X=cbind(1,puffin$x1,puffin$x2,puffin$x3,puffin$x4)

y=puffin$y

c=100

Then I just run the function bayes.model.selection with y, X, and c as inputs.

bayes.model.selection(y,X,c)

The output will be the data frame shown above.