To illustrate Bayesian regression, I’ll use an example from Simon Sheather’s new book on linear regression.  A restaurant guide collects a number of variables from a group of Italian restaurants in New York City.  One is interested in predicting the Price of a meal based on Food (measure of quality), Decor (measure of decor of the restaurant), Service (measure of quality of service), and East (dummy variable indicating if the restaurant is east or west of 5th Avenue).

Here is the dataset in csv format:  http://www.stat.tamu.edu/~sheather/book/docs/datasets/nyc.csv

Assuming the usual regression model

y_i = x_i' \beta + \epsilon_i, \epsilon_i \sim N(0, \sigma^2),

suppose we assign the following prior to (\beta, \sigma^2):  we assume independence with \beta \sim N(\beta_0, B_0) and \sigma^2 \sim IG(c_0/2, d_0/2).  Here B_0 is the precision matrix, the inverse of the variance-covariance matrix.

How do we obtain values of the parameters \beta_0, B_0, c_0, d_0?  We’ll use a small part of the data to construct our prior, and then use the remaining part of the data to do inference and eventually do some model selection.

First, I read in the data, select a random sample of rows, and partition the data into the two datasets italian1 and italian2.

italian=read.csv(“http://www.stat.tamu.edu/~sheather/book/docs/datasets/nyc.csv”)
indices=sample(168,size=25)
italian1=italian[indices,]
italian2=italian[-indices,]

I use the function blinreg in the LearnBayes package to simulate from the posterior using a vague prior.
library(LearnBayes)
y=italian1$Price
X=with(italian1,cbind(1,Food,Decor,Service,East))
fit=blinreg(y,X,5000)
From the simulated draws, I can now get estimates of my prior.  The parameters for the normal prior on \beta are found by finding the sample mean and sample var-cov matrix of the simulated draws.  The precision matrix B_0 is the inverse of the var-cov matrix.  The parameters c_0, d_0 correspond to the sum of squared errors and degrees of freedom of my initial fit.
beta0=apply(fit$beta,2,mean)
Sigma0=cov(fit$beta)
B0=solve(Sigma0)
c0=(25-5)
d0=sum((y-X%*%beta0)^2)
The function MCMCregress in the MCMCpack package will simulate from the posterior of (\beta, \sigma^2) using this informative prior.  It seems pretty to use.  I show the function including all of the arguments.  Note that  I’m using the second part of the data italian2, I input my prior through the b0, B0, c0, d0 arguments, and I’m going to simulate 10,000 draws (through the mcmc=10000 input).
fit2=MCMCregress(Price~Food+Decor+Service+East,
data = italian2, burnin = 0, mcmc = 10000,
thin = 1, verbose = 0, seed = NA, beta.start = NA,
b0 = beta0, B0 = B0, c0 = c0, d0 = d0,
marginal.likelihood = c(“Chib95″))
The output of MCMCregress, fit2, is a MCMC object that one can easily summarize and display using the summary and plot methods in the coda package.
Suppose we are interested in estimating the mean price at a restaurant on the east side (east = 1) with Food = 30, Decor = 30 and Service = 30.  We define a row matrix that consists of these covariates.  Then we compute the linear predictor x' \beta for each of the simulated beta vectors.  The vector lp contains 10,000 draws from the posterior of x' \beta.
x=matrix(c(1,20,20,20,1),5,1)
beta=fit2[,1:5]
sigma=sqrt(fit2[,6])
lp=c(beta%*%x)
Suppose next that we’re interested in predicting the price of a meal at the same covariates.  Since we have a sample from the posterior of the linear predictor, we just need to do an additional step to draw from the posterior predictive distribution — we store these values in ys.
ys=rnorm(10000,lp,sigma)
We illustrate the difference between estimating the mean price and predicting the price by summarizing the two sets of simulated draws.  The point estimates are similar, but, as expected, the predictive interval estimate is much wider.
summary(lp)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
43.59   46.44   47.00   46.99   47.53   50.17
summary(ys)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
25.10   43.12   46.90   46.93   50.75   67.82

A few weeks ago, I asked my class for an example of data that would be normally distributed and someone mentioned weights of people.

Are weights, more specifically birthweights of newborns normally distributed?

Peter Dunn in a JSE article here analyzes statistics for 44 babies born in a single day in an Australian hospital.  The article has a link to the dataset babyboom.dat that contains the data.

If we construct a histogram of the birthweights, we get the following display.

babyWe see some left skewness in this distribution — does this indicate that our normality assumption is false?

One can check the normality assumption by means of the posterior predictive distribution.  Here’s an outline of the approach.  (There are special functions in the LearnBayes package that make this process simple to implement.)

1.  First we fit the normal(\mu, \sigma^2)) model using the traditional noninformative prior proportional to 1/\sigma^2.  A simulated sample from the posterior is found using the function normpostsim.

fit=normpostsim(birth.weight,10000)

2.  Next we think of an appropriate testing function.  Since the birth weights don’t appear symmetric, a natural checking function is the sample skewness.  We write a short function “skew” to compute this statistic.

skew=function(y)
sum((y-mean(y))^3)/sd(y)^3/(length(y)-1)
3.  Last, we simulate draws from the posterior predictive distribution of the sample skewness.  This is implemented using the function normpostpred — the arguments are the list of simulated draws of the posterior (obtained by normpostsim), the size of the future sample (we’ll use the same sample size as that for the observed sample), and the function defining the testing function.  The vector of simulated draws from the testing function are stored in SKEW.
SKEW=normpostpred(fit,length(birth.weight),skew)
4.  I plot a histogram of the predictive distribution of the skew statistic and overlay a vertical line showing the skew of the observed data.
hist(SKEW); abline(v=skew(birth.weight),col=”red”,lwd=3)
skew
Clearly the skew in the data is not representative of the skew values predicted from the fitted model.   The normal population assumption appears inappropriate.

Here is another illustration of Bayes factors.   We are worried about the possibility of outliers and so we assume the observations y_1, ..., y_n are distributed from a t distribution with location \mu, scale \sigma, and degrees of freedom df.  We have some prior beliefs about the location and scale:  we assume \mu, \sigma independent with \mu \sim N(100, 10) and \log \sigma \sim N(0, 2).

We observe the following data:

y=c(114.1,113.1,103.5,89.4,94.7,110.6,103.2,123.0,112.3,110.976.8,
112.6,86.0,99.1,106.3,101.5,80.5,99.6,108.6,95.9)
We define a model M_{df} that defines the degrees of freedom parameter to be a specific value.
We can use Bayes factors to decide on appropriate values of df.
I define a function tlikepost defining the log posterior of \theta = (\mu, \log \sigma).  Here the argument stuff is a list with two components, the data vector y and the degrees of freedom value df.  Here the sampling density of y_i is given by
f(y_i) = \frac{1}{\sigma} dt\left(\frac{y_i-\mu}{\sigma}, df\right)
tlikepost=function(theta,stuff)
{
mu=theta[1]
sigma=exp(theta[2])
y=stuff$y
df=stuff$df
loglike=sum(dt((y-mu)/sigma, df, log=TRUE)-log(sigma))
logprior=dnorm(mu,100,10,log=TRUE)+dnorm(log(sigma),0,2, log=TRUE)
loglike+logprior
}
By using the function laplace with the “int” component, we can approximate the log of the marginal density of y.   To illustrate, suppose we wish to compute \log f(y) for a model that assumes that we have 30 degrees of freedom.
> laplace(tlikepost,c(0,0),list(y=y.out,df=30))$int
[1] -89.23436
So for the model M_{30}, we have log f(y) = -89.23.  To compare this model with, say the model M_4, we repeat this using df=4, compare the log marginal density, and then we can compare the two models by a Bayes factor.
Here are some things to try.
1.  With this data set, try comparing different values for the degrees of freedom parameter.  What you should find is that large df values are supported — this indicates that a normal sampling model is appropriate.
2.  Now change the dataset by introducing an outlier and repeat the analysis, comparing different M_{df} models.  What you should find that a t model with small degrees of freedom is supported with this new dataset.

One problem with a p-value is that it is hard to interpret if you don’t know the sample size.  Andy Gelman talks about this problem on his blog.

Here I’ll contrast the message communicated by a p-value and a Bayesian test by looking at the following three tables:

Table A:  [32 18; 18 32]
Table B: [272 228; 228,272]
Table C: [2570 2430; 2430,2570]

Each of the tables has a Pearson chi-square value in the (7.4, 7.8) range and so each has (approximately) the same p-value.  But, following Gelman’s reasoning, the small p-value for the large sample size in Table C doesn’t have much significance and and the same p-value for Table A (small sample size) is more impressive.
Here is a reasonable Bayesian test of the equality of two proportions.  First, we define a class of priors indexed by the parameter K:
1.  p_1, p_2 are iid beta(K\eta, K(1-\eta))
2.  \eta has a uniform density
Testing H: p_1 = p_2, A: p_1 \neq p_2 is equivalent to testing H: K = \infty against K = 1 (I’ll talk more about this in class.)
Chapter 8 in BCWR gives an expression for the Bayes factor in support of A over H.   In the LearnBayes package, suppose the data is a matrix of two columns where the first column contains the counts of successes of the two samples and the second column contains the sample sizes.  Then the Bayes factor (on the log scale) is computed by the single line
s=laplace(bfexch,0,list(data=data,K=1))$int
Let’s illustrate using this for the three tables:
> laplace(bfexch,0,list(data=cbind(c(32,18),c(50,50)),K=1))$int
[1] 0.9436246
> laplace(bfexch,0,list(data=cbind(c(272,228),c(500,500)),K=1))$int
[1] -0.3035413
> laplace(bfexch,0,list(data=cbind(c(2570,2430),c(5000,5000)),K=1))$int
[1] -1.411366
So the Bayes factor against equality of proportions is exp(.94) = 2.56 for Table A, exp(-0.30) = 0.74 for Table B, and exp(-1.41) = 0.24 for Table C.  In contrast to the message communicated by a p-value, the Bayes factor calculation indicates that Table A is the most significant table.

Here is today’s example illustrating a Bayesian test of a point null hypothesis.  In Agresti and Franklin’s intro stats book, there is an example wondering if Americans work, on average, 40 hours per week.   If \mu represents the mean hours worked per week for all Americans, we are interested in testing H: \mu = 40 against the alternative A: \mu \neq 40.

In the example in Agresti and Franklin’s text, a sample of 868 workers were selected and one observes that the workers worked an average of 39.11 hours and one assumes the population standard deviation is equal to \sigma = 14.  The test statistic is

z = \frac{\sqrt{868}(\bar y - 40)}{14} = -1.87

and one computes a p-value of 0.061.  There seems to be moderately significant evidence that the mean working hours for all workers is different from 40.

Here’s an outline of a Bayesian test (I give more details in class):

1.  We assign probabilities to the hypotheses H and A.  Here we assume each hypothesis is equally likely.

2.  Under the hypothesis H, the prior for \mu is a point mass at the hypotheized value of 40.

3.  Under the alternative hypothesis, we’ll place a N(40, \tau) prior on \mu.  This reflects the belief that, under the alternative, values of \mu near 40 are more likely that values far from 40.

Under these assumptions, we give (in class) the posterior probability of the hypothesis H. This probability is programmed in the function mnormt.twosided in the LearnBayes package.  The inputs to this function are (1) the value to be tested, (2) the prior probability of H, (3) a vector of values of \tau and (4) a data vector consisting of \bar y, n, \sigma.  The outputs are the Bayes factor in support of H and the posterior probability of H.

Since it is not clear how to specify the prior standard deviation \tau — we compute the posterior probability for a range of values of \tau.  I write a short function to output the posterior probability as a function of \tau and then I use the curve function to plot this probability for values of \tau from 0.01 to 2.

What we see is that the posterior probability of \mu = 40 exceeds 0.35 for all values of \tau.  This suggests that the p-value overstates the evidence against the null hypothesis.

> data=c(39.11, 868, 14)
> prob.H=function(tau)
+   mnormt.twosided(40,.5,tau,data)$post

> curve(prob.H(x),from=.01,to=2,xlab=”TAU”,ylab=”P(H)”)testing

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


In R, I think one of the most useful functions is sapply.   By using this function, one can avoid writing a loop.  Let me illustrate using this function for the binomial/beta exchangeable model.

The model is

1.  y_i distributed binomial(n_i, p_i)
2.  p_1, ..., p_k are iid beta(K, \eta)
3.  (K, \eta) have a vague prior

The log posterior of \theta_1 = logit(\eta), \theta_2 = \log K is programmed in the function betabinexch in the LearnBayes package.

Suppose we wish to obtain a 90% interval estimate for p_j.  To simulate from the posterior, we first simulate a value of (K, \eta) from its posterior and then simulate p_j from the posterior  beta(y_j + K \eta, n_j - y_j + K (1-\eta) )

Let’s consider the famous Efron and Morris baseball data — y_j is the number of hits in n_j at-bats for the jth player.

I’ll read in the data, get a normal approximation to the posterior (function laplace), and then do a random walk Metropolis chain — I obtain a simulated sample from (\theta_1, \theta_2).

y=c(12,7,14,9,18,16,15,13,8,10,17,10,11,10,14,11,10,10)
n=rep(45,18)
library(LearnBayes)
data=as.matrix(cbind(y,n))
fit=laplace(betabinexch,c(-1,4),data)
mcmc.fit=rwmetrop(betabinexch,list(var=fit$var,scale=2),c(-1,4),10000,data)
Now I write a function interval.estimate that has a single input j — it will simulate from the posterior of p_j and output a 90% interval estimate.
interval.estimate=function(j)
{
K=exp(mcmc.fit$par[,2])
eta=exp(mcmc.fit$par[,1])/(1+exp(mcmc.fit$par[,1]))
thetaj=rbeta(1000,y[j]+K*eta,n[j]-y[j]+K*(1-eta))
quantile(thetaj,c(.05,.95))
}
If I wish to obtain a 90% interval estimate for p_1, I type
> interval.estimate(1)
5%       95%
0.2052578 0.3288738
To do this in a single stroke for all 18 probabilities, I use sapply.  I wish to apply interval.estimate over all values 1, …, 18:
> sapply(1:18,interval.estimate)
[,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
5%  0.1981516 0.1535653 0.2240657 0.1825992 0.2481277 0.2339640 0.2203171
95% 0.3314778 0.2906053 0.3578334 0.3076716 0.4013249 0.3805313 0.3658903
[,8]      [,9]     [,10]     [,11]     [,12]     [,13]     [,14]
5%  0.2061546 0.1637018 0.1841709 0.2375183 0.1837387 0.1948228 0.1878043
95% 0.3435467 0.3000630 0.3211896 0.3969801 0.3151900 0.3279358 0.3136143
[,15]     [,16]     [,17]     [,18]
5%  0.2193047 0.1949027 0.1821835 0.1864202
95% 0.3616939 0.3240116 0.3109836 0.3109164
I get interval estimates for all probabilities.  The 18th column contains the 5th and 95th percentiles of the posterior of p_{18}.

The Phillies are still alive in the World Series and so I’m allowed to continue to talk about baseball.

In the situation where we observe a sample of measurement data y_1, ..., y_n, a common assumption to assume that they are a random sample from a normal population with mean \mu and standard deviation \sigma.  But sometimes this assumption is inappropriate.  One measure of hitting effectiveness is the OBP, the proportion of time that a hitter gets on base.  I collected the OBPs (actually 1000 x OBP) for a group of 24 players who played for the west coast NL teams (Dodgers, Giants, and Padres) in the 2004 season.  Here’s a jittered dotplot of the OBPs:

bonds1

Clearly, a normal population is inappropriate here since we have one obvious outlier — Barry Bonds in that season had an usually large OBP of 609.

In many situations, it makes more sense to assume that y_1, ..., y_n are distributed from a “flat-tailed” distribution where outliers are more likely to occur.  One example of a flat-tailed distribution is the t form that depends on a location \mu, a scale \sigma and a degrees of freedom \nu.  Here we assume a small value of the degrees of freedom and focus on estimating (\mu, \sigma).

This is an attractive way of fitting a t sampling model.  First, a t distribution can represented as a scale mixtures of normals.  Suppose we let y be distributed normal with mean \mu and variance \sigma^2/\lambda, and then we let the parameter \lambda be distributed Gamma with shape \nu/2 and rate \nu/2.  When you integrate out \lambda, one sees this is equivalent to assuming y is N(\mu, \sigma^2).

Suppose we use this scale mixture representation for each observation.  We can write this as a hierarchical model:

1.  y_i is distributed N(\mu, \sigma^2/\lambda_i)

2. \lambda_i is distributed Gamma (\nu/2,\nu/2)

3. (\mu, \sigma^2) has the prior 1/\sigma^2.

In class, we’ll show that can sample from this model easily using Gibbs sampling.  This algorithm is implemented using the function robustt in the LearnBayes package.

The posteriors of the extra scale parameters \lambda_i are of interest, since they reflect the influence of the ith observation on the estimation of the mean and variance.  Below I display error bars that show 90% interval estimates for each of the \lambda_i for my OBP example.  Note that the posterior for the scale parameter for Barry Bonds is concentrated near 0.  This is reasonable, since it is desirable to downweight the influence of Bonds OBP in the estimation of, say, the average OBP.

bonds


At our department’s colloquium last week, there was an interesting presentation on the use of an exchangeable prior in a regression context.  The analysis was done using WinBUGS using some standard choices of “weekly informative” priors.  That raises the natural question — how did you decide on this prior?

Let’s return to the home run hitting example where an exchangeable prior was placed on the set of home run hitting probabilities.  We assumed that p_1, ..., p_k were iid B(\eta, K) and then the hyperparameters \eta and K were assumed independent where \eta was assigned the prior \eta^{-1}(1-\eta)^{-1} and K was assigned the log logistic form g(K) = 1/(K+1)^2.

When I described these priors, I just said that they were “noninformative” and didn’t explain how I obtained them.

The main questions are:

1.  Does the choice of vague prior matter?
2.  If the answer to question 1 is “yes”, what should we do?

When one fits an exchangeable model, one is shrinking the observed rates y_i/n_i towards a common value.  The posterior of \eta tells us about the common value and the posterior of K tells us about the degree of shrinkage.  The choice of vague prior for \eta is not a concern — we can assume \eta is uniform or distributed according to the improper prior \eta^{-1}(1-\eta)^{-1}.   However, the prior on K can be a concern.  If one assigns the improper form g(K) = 1/K, it can be shown that the posterior may be improper.  So in general one needs to assign a proper prior to K.  One convenient choice is to assume that \log K has a logistic density with location \mu and scale \tau.  In my example, I assumed that \mu = 0, \tau = 1.

Since I was arbitrary in my choice of parameters for this logistic prior on \log K, I tried different choices for \mu and \tau.  I tried values of \mu between -4 and 4 and values of \tau between 0.2 and 10.  What did I find?  The posterior mode of \log K stayed in the 4.56 – 4.61 range for all of the priors I considered.  In this example, we have data from 605 players and clearly the likelihood is driving the inference.

What if the choice of \mu and \tau does matter?  Then one has to think more carefully about these parameters.  In this baseball context, I am pretty familiar with home run hitting rates.  Based on this knowledge, I have to figure out a reasonable guess at the variability of the home run probabilities that would give me information about K.

 


Last post, we illustrated fitting an exchangeable model on home run hitting probabilities for all players (non pitchers) in the 2007 season.

But actually, teams aren’t really interested in estimating hitting probabilities.  They are primarily interested in predicting a player’s home run performance the following 2008 baseball season.

We already have the posterior distribution for the 2007 home run probabilities.  Assuming that the probabilities stay the same for the following season and that the players will have the same number of at-bats, we can easily predict their 2008 home run counts from the posterior predictive distribution.

For each player, I simulated the predictive distribution of the number of home runs y_i^* in n_i at-bats in the 2008 season.  I summarize each distribution by its mode — this is the most probable value of y_i^* from the predictive distribution.

In the first graph, for all players I plot the number of 2007 home runs (horizontal) against the predicted number of 2008 home runs – the number of 2007 home runs (vertical).   Given the 2007 home run count, I’m interested in my prediction of the change to the following season.  (I jittered the points so you can see the individual observations.)  For the weak home run hitters (on the left side of the graph), I predict they will perform a little worse or better the following season.  But for the heavy hitters, I predict they will hit 5-7 home runs worse in the 2008 season.  Generally, there is a negative slope in the graph that illustrates the “regression to the mean” effect.

Since we have the actual 2008 home run counts, we can actually compare the 2007 and 2008 counts to see if there is a similar effect.  In the following graph, I plot the 2007 counts against the change (2008 count – 2007 count). What do we see?  There is a negative correlation as we suspected.  Hitters who did well in 2007 tended to get worse in 2008, and a number of light hitters in 2007 tended to do better in 2008.  The biggest difference between the actual data and our predictions is the magnitude of the changes.  Three of the top hitters, for example, hit 15-20 home runs in 2008.

prediction2

prediction1

 


homerun3

Continuing our home run hitting example, we observe home run counts y_1, ..., y_k where y_i is distributed Bin(n_i, p_i). Suppose we assign the home run probabilities the following exchangeable model.

1.  p_1, ..., p_k distributed Beta(K, \eta)

2. K, \eta independent, \eta distributed \eta^{-1}(1-\eta)^{-1}, K distributed \frac{1}{1+K^2}.

We described the computing strategy in class to summarize the posterior distribution of (p_1, .., p_k), \eta, K.  We find

1.  For \eta, the posterior median is 0.0283 and a 90% interval estimate is (.0268, .0296).

2. For K, the posterior median is 99 and a 90% interval estimate is (85, 116).

We find posterior means for the home run probabilities p_1, ..., p_k.  We plot the posterior means against the square root of the at-bats for the 605 non-pitchers.

This is a very different graph pattern compared to the figure plotting sqrt(AB) against the observed home run rates.  Several things to note.

1.  For the players with few AB, their home run rates are strongly shrunk towards the overall home run rate.  This is reasonable — we have little information about these players’ true home run hitting abilities.

2.  Who are the best home run hitters?  It is clear from the figure that four hitters stand out — they are the ones in the upper right corner of the graph that I’ve labeled with the number of home runs.  The best home run hitters, that is the ones with the highest estimated home run probabilities, are the two player with 46 and 54 at-bats.

3.  I’ve drawn a smoothing curve to indicate the main pattern in the graph.  Note that it’s an increasing curve — this means that players with more AB tend to have higher home run probabilities.  This makes sense — the better hitters tend to get more playing time and more at-bats.