### Archive

Archive for November, 2009

## Bayesian regression – part II

In the earlier posting (part I), I demonstrated how one constructs a prior from a portion of the data and then combines it with the remaining portion of the data to obtain the posterior.

One advantage of using proper priors is that one can implement model selection.

Recall that the output of the MCMCregress function was fit2 — this is the regression model including all four predictors Food, Decor, Service, and East.

Suppose I want to compare this model with the model that removes the Food variable.  I just adjust my prior.  If I don’t believe this variable should be in the model, my prior for that regression coefficient should be 0 with a tiny variance or a large precision.   Currently my prior mean is stored in the variable beta0 and the precision matrix is stored in B0.   I change the component of beta0 corresponding to Food to be 0 and then assign a really big value to the (Food, Food) entry in the precision matrix B0.  I rerun MCMCregress using this new model and the output is stored in fit2.a.

beta0.a=beta0; beta0.a[2]=0
B0.a=B0; B0.a[2,2]=200000
fit2.a=MCMCregress(Price~Food+Decor+Service+East,
data = italian2, burnin = 0, mcmc = 10000,
thin = 1, verbose = 0, seed = NA, beta.start = NA,
b0 = beta0.a, B0 = B0.a, c0 = c0, d0 = d0,
marginal.likelihood = c("Chib95"))

To compare the two models (the full one and the one with Food removed), I use the BayesFactor function in MCMCpack.  This gives the log marginal likelihood for each model and compares models by Bayes factors.  Here is the key output.

BayesFactor(fit2,fit2.a)
The matrix of the natural log Bayes Factors is:

fit2 fit2.a
fit2      0    303
fit2.a -303      0

On the ln scale, the full model is far superior to the model with Food removed since the log Bayes factor in support of the full model is 303.  Clearly, Food must be an important variable.
By the way, in LearnBayes, I have a function bayes.model.selection that implements Bayesian model selection using a g-prior on the regression coefficients.   This function gives reasonable results for this example.
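To show the idea behind a g-prior comparison, here is a base-R sketch. This is my own reimplementation using the standard closed-form log Bayes factor for a fixed-g Zellner prior, not the bayes.model.selection code, and the data below are simulated stand-ins for the restaurant data (all numbers hypothetical).

```r
# A sketch of model comparison under a Zellner g-prior (my own base-R version,
# not the LearnBayes bayes.model.selection code).  For a model with p slopes,
# the log Bayes factor against the intercept-only model has the closed form
#   ((n - p - 1)/2) log(1 + g) - ((n - 1)/2) log(1 + g (1 - R^2)).
gprior.logBF <- function(fit, g) {
  n <- length(residuals(fit))
  p <- length(coef(fit)) - 1
  R2 <- summary(fit)$r.squared
  ((n - p - 1)/2)*log(1 + g) - ((n - 1)/2)*log(1 + g*(1 - R2))
}

# Simulated stand-in for the restaurant data (hypothetical coefficients):
set.seed(42)
Food <- rnorm(143, 20, 3); Decor <- rnorm(143, 18, 3)
Service <- rnorm(143, 19, 3); East <- rbinom(143, 1, 0.5)
Price <- -25 + 1.5*Food + 1.9*Decor + 2*East + rnorm(143, 0, 4)

full <- lm(Price ~ Food + Decor + Service + East)
noFood <- lm(Price ~ Decor + Service + East)

# log Bayes factor of the full model over the no-Food model, taking g = n:
gprior.logBF(full, g = 143) - gprior.logBF(noFood, g = 143)
```

Since Food has a strong effect in the simulated data, the log Bayes factor comes out large and positive, mirroring the conclusion from the Chib marginal likelihoods above.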
Categories: Regression

## Logistic regression exercise

In my Bayesian class, I assigned a problem (exercise 8.3 from BCWR) where one is using logistic regression to model the trajectory of Mike Schmidt’s home run rates.

First, I have written a function to compute the log posterior for a logistic regression model with a vague prior placed on the regression vector.  The function, together with a short example of how it works, can be found here.

Second, it seems that there is a problem getting the laplace function to converge if you don’t use a really good starting value.  Here is an alternative way of getting the posterior mode.

Suppose y is a vector containing the home run counts and n is a vector with the sample sizes.  Let age be the vector of ages and age2 be the vector of ages squared.  Define the response to be a matrix containing the vectors of successes and failures.

response=cbind(y,n-y)

Then the mle of beta is found using the glm command with the family=binomial option:

fit=glm(response~age+age2,family=binomial)

The posterior mode is the mle:

beta=coef(fit)

The approximation to the posterior variance-covariance matrix is found by

V=vcov(fit)

Now you should be able to use the random walk metropolis algorithm to simulate from the posterior.
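The Metropolis step above can be sketched in base R, using the glm mode and vcov output as the proposal. The data here are simulated stand-ins (the ages, sample sizes, and coefficients are hypothetical, not Schmidt's actual numbers):

```r
# Random-walk Metropolis for the logistic regression posterior (flat prior).
# Simulated stand-in data; in the exercise, y, n, age come from Schmidt's record.
set.seed(1)
age <- 22:35
age2 <- age^2
n <- rep(500, length(age))
y <- rbinom(length(age), n, plogis(-8 + 0.55*age - 0.01*age2))

# log posterior = binomial log likelihood (vague flat prior on beta)
logpost <- function(beta, y, n, X) {
  eta <- drop(X %*% beta)
  sum(y*eta - n*log(1 + exp(eta)))
}

X <- cbind(1, age, age2)
fit <- glm(cbind(y, n - y) ~ age + age2, family = binomial)
beta.hat <- coef(fit)       # posterior mode (the mle)
V <- vcov(fit)              # approximate posterior var-cov matrix

m <- 5000
C <- chol(4 * V)            # proposal scale = 2 times the posterior scale
draws <- matrix(0, m, 3)
cur <- beta.hat
cur.lp <- logpost(cur, y, n, X)
accept <- 0
for (i in 1:m) {
  prop <- cur + drop(rnorm(3) %*% C)   # random-walk proposal
  prop.lp <- logpost(prop, y, n, X)
  if (log(runif(1)) < prop.lp - cur.lp) {
    cur <- prop; cur.lp <- prop.lp; accept <- accept + 1
  }
  draws[i, ] <- cur
}
accept/m                    # acceptance rate; aim for roughly 0.2-0.4
```

With the proposal variance tied to vcov(fit), the chain mixes reasonably without hand-tuning; the draws matrix then plays the role of the simulated posterior sample.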

Categories: Regression

## Bayesian regression — part I

To illustrate Bayesian regression, I’ll use an example from Simon Sheather’s new book on linear regression.  A restaurant guide collects a number of variables from a group of Italian restaurants in New York City.  One is interested in predicting the Price of a meal based on Food (measure of quality), Decor (measure of decor of the restaurant), Service (measure of quality of service), and East (dummy variable indicating if the restaurant is east or west of 5th Avenue).

Here is the dataset in csv format:  http://www.stat.tamu.edu/~sheather/book/docs/datasets/nyc.csv

Assuming the usual regression model

$y_i = x_i' \beta + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$,

suppose we assign the following prior to $(\beta, \sigma^2)$:  we assume independence with $\beta \sim N(\beta_0, B_0)$ and $\sigma^2 \sim IG(c_0/2, d_0/2)$.  Here $B_0$ is the precision matrix, the inverse of the variance-covariance matrix.

How do we obtain values of the parameters $\beta_0, B_0, c_0, d_0$?  We’ll use a small part of the data to construct our prior, and then use the remaining part of the data to do inference and eventually do some model selection.

First, I read in the data, select a random sample of rows, and partition the data into the two datasets italian1 and italian2.

italian=read.csv("http://www.stat.tamu.edu/~sheather/book/docs/datasets/nyc.csv")
indices=sample(168,size=25)
italian1=italian[indices,]
italian2=italian[-indices,]

I use the function blinreg in the LearnBayes package to simulate from the posterior using a vague prior.
library(LearnBayes)
y=italian1$Price
X=with(italian1,cbind(1,Food,Decor,Service,East))
fit=blinreg(y,X,5000)

From the simulated draws, I can now get estimates of my prior.  The parameters for the normal prior on $\beta$ are found by computing the sample mean and sample var-cov matrix of the simulated draws.  The precision matrix $B_0$ is the inverse of the var-cov matrix.  The parameters $c_0, d_0$ correspond to the degrees of freedom and the sum of squared errors of my initial fit.

beta0=apply(fit$beta,2,mean)
Sigma0=cov(fit$beta)
B0=solve(Sigma0)
c0=(25-5)
d0=sum((y-X%*%beta0)^2)

The function MCMCregress in the MCMCpack package will simulate from the posterior of $(\beta, \sigma^2)$ using this informative prior.  It seems pretty easy to use.  I show the function including all of its arguments.  Note that I’m using the second part of the data (italian2), that I input my prior through the b0, B0, c0, d0 arguments, and that I’m going to simulate 10,000 draws (through the mcmc=10000 input).

fit2=MCMCregress(Price~Food+Decor+Service+East,
data = italian2, burnin = 0, mcmc = 10000,
thin = 1, verbose = 0, seed = NA, beta.start = NA,
b0 = beta0, B0 = B0, c0 = c0, d0 = d0,
marginal.likelihood = c("Chib95"))

The output of MCMCregress, fit2, is an MCMC object that one can easily summarize and display using the summary and plot methods in the coda package.

Suppose we are interested in estimating the mean price at a restaurant on the east side (East = 1) with Food = 20, Decor = 20, and Service = 20.  We define a column matrix x that contains these covariates.  Then we compute the linear predictor $x' \beta$ for each of the simulated beta vectors.  The vector lp contains 10,000 draws from the posterior of $x' \beta$.

x=matrix(c(1,20,20,20,1),5,1)
beta=fit2[,1:5]
sigma=sqrt(fit2[,6])
lp=c(beta%*%x)

Suppose next that we’re interested in predicting the price of a meal at the same covariate values.  Since we have a sample from the posterior of the linear predictor, we just need one additional step to draw from the posterior predictive distribution; we store these values in ys.

ys=rnorm(10000,lp,sigma)

We illustrate the difference between estimating the mean price and predicting the price by summarizing the two sets of simulated draws.  The point estimates are similar, but, as expected, the predictive interval estimate is much wider.

summary(lp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  43.59   46.44   47.00   46.99   47.53   50.17 
summary(ys)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.10   43.12   46.90   46.93   50.75   67.82 

Categories: Regression

## Are birth weights normal?

November 16, 2009

A few weeks ago, I asked my class for an example of data that would be normally distributed, and someone mentioned weights of people.  Are weights, more specifically birth weights of newborns, normally distributed?

Peter Dunn, in a JSE article here, analyzes statistics for 44 babies born in a single day in an Australian hospital.  The article has a link to the dataset babyboom.dat that contains the data.  If we construct a histogram of the birth weights, we get the following display.  We see some left skewness in this distribution; does this indicate that our normality assumption is false?

One can check the normality assumption by means of the posterior predictive distribution.  Here’s an outline of the approach.  (There are special functions in the LearnBayes package that make this process simple to implement.)

1.  First we fit the normal($\mu, \sigma^2$) model using the traditional noninformative prior proportional to $1/\sigma^2$.  A simulated sample from the posterior is found using the function normpostsim.

fit=normpostsim(birth.weight,10000)

2.  Next we think of an appropriate testing function.  Since the birth weights don’t appear symmetric, a natural checking function is the sample skewness.  We write a short function skew to compute this statistic.

skew=function(y) sum((y-mean(y))^3)/sd(y)^3/(length(y)-1)

3.  Last, we simulate draws from the posterior predictive distribution of the sample skewness.  This is implemented using the function normpostpred; the arguments are the list of simulated draws from the posterior (obtained by normpostsim), the size of the future sample (we’ll use the same sample size as that of the observed sample), and the function defining the testing statistic.  The vector of simulated draws of the testing statistic is stored in SKEW.

SKEW=normpostpred(fit,length(birth.weight),skew)
4.  I plot a histogram of the predictive distribution of the skew statistic and overlay a vertical line showing the skew of the observed data.

hist(SKEW); abline(v=skew(birth.weight),col="red",lwd=3)

Clearly the skew in the data is not representative of the skew values predicted from the fitted model.  The normal population assumption appears inappropriate.

## A “robustness” illustration of Bayes factors

November 16, 2009

Here is another illustration of Bayes factors.  We are worried about the possibility of outliers, and so we assume the observations $y_1, ..., y_n$ are distributed from a t distribution with location $\mu$, scale $\sigma$, and degrees of freedom $df$.  We have some prior beliefs about the location and scale: we assume $\mu, \sigma$ independent with $\mu \sim N(100, 10)$ and $\log \sigma \sim N(0, 2)$.  We observe the following data:

y=c(114.1,113.1,103.5,89.4,94.7,110.6,103.2,123.0,112.3,110.9,76.8,
112.6,86.0,99.1,106.3,101.5,80.5,99.6,108.6,95.9)

We define a model $M_{df}$ that fixes the degrees of freedom parameter at a specific value.  We can use Bayes factors to decide on appropriate values of $df$.  I define a function tlikepost giving the log posterior of $\theta = (\mu, \log \sigma)$.  Here the argument stuff is a list with two components, the data vector y and the degrees of freedom value df.  The sampling density of $y_i$ is given by

$f(y_i) = \frac{1}{\sigma} dt\left(\frac{y_i-\mu}{\sigma}, df\right)$

tlikepost=function(theta,stuff)
{
mu=theta[1]
sigma=exp(theta[2])
y=stuff$y
df=stuff$df
loglike=sum(dt((y-mu)/sigma,df,log=TRUE)-log(sigma))
logprior=dnorm(mu,100,10,log=TRUE)+dnorm(log(sigma),0,2,log=TRUE)
loglike+logprior
}

By using the function laplace with the “int” component, we can approximate the log of the marginal density of $y$.  To illustrate, suppose we wish to compute $\log f(y)$ for a model that assumes 30 degrees of freedom.

> laplace(tlikepost,c(0,0),list(y=y.out,df=30))$int
[1] -89.23436
So for the model $M_{30}$, we have $\log f(y) = -89.23$.  To compare this model with, say, the model $M_4$, we repeat the calculation with $df=4$ and then compare the two log marginal densities; the difference is the log Bayes factor between the two models.
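The laplace step can also be replicated in base R with optim: find the posterior mode, take the numerical Hessian, and apply the Laplace approximation to the log marginal density. This is my own sketch, and it assumes the two run-together values in the data listing above read 110.9 and 76.8.

```r
# Base-R sketch of the Laplace approximation to log f(y) for the t model.
# Assumes the garbled pair in the post's data listing reads 110.9, 76.8.
y <- c(114.1,113.1,103.5,89.4,94.7,110.6,103.2,123.0,112.3,110.9,76.8,
       112.6,86.0,99.1,106.3,101.5,80.5,99.6,108.6,95.9)

# log posterior of theta = (mu, log sigma), as in tlikepost
logpost <- function(theta, y, df) {
  mu <- theta[1]; sigma <- exp(theta[2])
  sum(dt((y - mu)/sigma, df, log = TRUE) - log(sigma)) +
    dnorm(mu, 100, 10, log = TRUE) + dnorm(log(sigma), 0, 2, log = TRUE)
}

# Laplace approximation:
#   log f(y) ~ (p/2) log(2 pi) + (1/2) log det(-H)^(-1) + log g(theta.hat)
log.marginal <- function(df, start = c(100, 2)) {
  fit <- optim(start, logpost, y = y, df = df,
               control = list(fnscale = -1), hessian = TRUE)
  p <- length(start)
  (p/2)*log(2*pi) - 0.5*log(det(-fit$hessian)) + fit$value
}

log.marginal(30)  # log marginal density under M_30
log.marginal(4)   # under M_4; the difference is the log Bayes factor
```

Starting optim near the data (c(100, 2)) avoids the convergence trouble mentioned in the logistic regression post above.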
Here are some things to try.
1.  With this data set, try comparing different values for the degrees of freedom parameter.  What you should find is that large $df$ values are supported — this indicates that a normal sampling model is appropriate.
2.  Now change the dataset by introducing an outlier and repeat the analysis, comparing different $M_{df}$ models.  What you should find is that a t model with a small degrees of freedom value is supported by this new dataset.

## A Bayesian test of the equality of proportions

One problem with a p-value is that it is hard to interpret if you don’t know the sample size.  Andy Gelman talks about this problem on his blog.

Here I’ll contrast the message communicated by a p-value and a Bayesian test by looking at the following three tables:

Table A: [32 18; 18 32]
Table B: [272 228; 228 272]
Table C: [2570 2430; 2430 2570]

Each of the tables has a Pearson chi-square value in the (7.4, 7.8) range, and so each has (approximately) the same p-value.  But, following Gelman’s reasoning, the small p-value for the large sample size in Table C doesn’t have much significance, while the same p-value for Table A (small sample size) is more impressive.
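The chi-square statistics are easy to verify in base R using the Pearson statistic without the continuity correction:

```r
# Pearson chi-square statistics for the three tables (no continuity correction)
tabA <- matrix(c(32, 18, 18, 32), 2, 2)
tabB <- matrix(c(272, 228, 228, 272), 2, 2)
tabC <- matrix(c(2570, 2430, 2430, 2570), 2, 2)
sapply(list(A = tabA, B = tabB, C = tabC),
       function(tab) unname(chisq.test(tab, correct = FALSE)$statistic))
```

All three statistics land close together (a one-degree-of-freedom chi-square near 7.7 to 7.8), so the three two-sided p-values are all roughly 0.005.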
Here is a reasonable Bayesian test of the equality of two proportions.  First, we define a class of priors indexed by the parameter K:
1.  $p_1, p_2$ are iid beta($K\eta, K(1-\eta)$)
2.  $\eta$ has a uniform density
Testing H: $p_1 = p_2$ against A: $p_1 \neq p_2$ is equivalent to testing H: $K = \infty$ against A: $K = 1$.  (I’ll talk more about this in class.)
Chapter 8 in BCWR gives an expression for the Bayes factor in support of $A$ over $H$.   In the LearnBayes package, suppose the data is a matrix of two columns where the first column contains the counts of successes of the two samples and the second column contains the sample sizes.  Then the Bayes factor (on the log scale) is computed by the single line
s=laplace(bfexch,0,list(data=data,K=1))$int

Let’s illustrate using this for the three tables:

> laplace(bfexch,0,list(data=cbind(c(32,18),c(50,50)),K=1))$int
[1] 0.9436246
> laplace(bfexch,0,list(data=cbind(c(272,228),c(500,500)),K=1))$int
[1] -0.3035413
> laplace(bfexch,0,list(data=cbind(c(2570,2430),c(5000,5000)),K=1))$int
[1] -1.411366
So the Bayes factor against equality of proportions is exp(.94) = 2.56 for Table A, exp(-0.30) = 0.74 for Table B, and exp(-1.41) = 0.24 for Table C.  In contrast to the message communicated by a p-value, the Bayes factor calculation indicates that Table A is the most significant table.

## Bayesian testing of a point null hypothesis

Here is today’s example illustrating a Bayesian test of a point null hypothesis.  In Agresti and Franklin’s intro stats book, there is an example asking whether Americans work, on average, 40 hours per week.  If $\mu$ represents the mean hours worked per week for all Americans, we are interested in testing $H: \mu = 40$ against the alternative $A: \mu \neq 40$.

In the example in Agresti and Franklin’s text, a sample of 868 workers was selected; the workers worked an average of 39.11 hours, and one assumes the population standard deviation is $\sigma = 14$.  The test statistic is

$z = \frac{\sqrt{868}(\bar y - 40)}{14} = -1.87$

and one computes a p-value of 0.061.  There seems to be moderately significant evidence that the mean working hours for all workers is different from 40.
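The frequentist calculation takes one line of base R:

```r
# Reproducing the z statistic and two-sided p-value in base R
ybar <- 39.11; n <- 868; sigma <- 14; mu0 <- 40
z <- sqrt(n) * (ybar - mu0) / sigma
p.value <- 2 * pnorm(-abs(z))
z        # about -1.87
p.value  # about 0.061
```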

Here’s an outline of a Bayesian test (I give more details in class):

1.  We assign probabilities to the hypotheses $H$ and $A$.  Here we assume each hypothesis is equally likely.

2.  Under the hypothesis $H$, the prior for $\mu$ is a point mass at the hypothesized value of 40.

3.  Under the alternative hypothesis, we’ll place a N(40, $\tau^2$) prior on $\mu$, where $\tau$ is the prior standard deviation.  This reflects the belief that, under the alternative, values of $\mu$ near 40 are more likely than values far from 40.

Under these assumptions, we give (in class) the posterior probability of the hypothesis $H$. This probability is programmed in the function mnormt.twosided in the LearnBayes package.  The inputs to this function are (1) the value to be tested, (2) the prior probability of $H$, (3) a vector of values of $\tau$ and (4) a data vector consisting of $\bar y, n, \sigma$.  The outputs are the Bayes factor in support of $H$ and the posterior probability of $H$.

Since it is not clear how to specify the prior standard deviation $\tau$, we compute the posterior probability for a range of values of $\tau$.  I write a short function to output the posterior probability as a function of $\tau$ and then use the curve function to plot this probability for values of $\tau$ from 0.01 to 2.

What we see is that the posterior probability of $\mu = 40$ exceeds 0.35 for all values of $\tau$.  This suggests that the p-value overstates the evidence against the null hypothesis.

> data=c(39.11, 868, 14)
> prob.H=function(tau)
+   mnormt.twosided(40,.5,tau,data)$post

> curve(prob.H(x),from=.01,to=2,xlab="TAU",ylab="P(H)")
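For readers who want to see the calculation behind this, here is a base-R sketch (my own reimplementation under the stated model, not the LearnBayes source).  Under $H$, $\bar y \sim N(40, \sigma^2/n)$; under $A$, marginally $\bar y \sim N(40, \sigma^2/n + \tau^2)$, so the Bayes factor for $H$ is the ratio of these two densities.

```r
# Posterior probability of H: mu = 40, with prior P(H) = 0.5.
# Under A, mu ~ N(40, tau^2), so marginally ybar ~ N(40, sigma^2/n + tau^2).
ybar <- 39.11; n <- 868; sigma <- 14; mu0 <- 40
post.H <- function(tau) {
  se2 <- sigma^2 / n
  BF <- dnorm(ybar, mu0, sqrt(se2)) / dnorm(ybar, mu0, sqrt(se2 + tau^2))
  BF / (1 + BF)          # equal prior odds, so P(H | y) = BF/(1 + BF)
}
post.H(1)                # posterior probability of H at tau = 1
curve(post.H(x), from = .01, to = 2, xlab = "TAU", ylab = "P(H)")
```

As $\tau \to 0$ the two models coincide and the probability returns to the prior value 0.5; over the plotted range the probability of $H$ stays far above the p-value of 0.061.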