## Increasing popularity of Bayesian thinking

It is remarkable to see the growth in the development and use of Bayesian methods over my academic lifetime. One way of measuring this growth is simply to count the number of Bayesian papers presented at meetings. In Statistics, our major meeting is JSM (Joint Statistical Meetings), which is held each summer in a major U.S. city. I pulled out the program for the 1983 JSM. Scanning through the abstracts, I found 18 presented papers that had Bayes in the title. Looking at the online program for the 2011 JSM, I found 58 sessions where Bayes was in the title of the session, and a session typically includes 4-5 papers. Using these data, I would guess that there were approximately 15 times as many Bayesian papers presented in 2011 as in 1983.

Another way of measuring the growth is to look at the explosion of Bayesian texts that have been recently published. At first, the Bayesian books were more advanced with limited applications. Now there are many applied statistics books that illustrate the application of Bayesian thinking in many disciplines like economics, biology, ecology, and the social sciences.

Categories: General

## Welcome to MATH 6480 – Fall 2011

Welcome to MATH 6480 Bayesian Statistical Analysis. I’ll be using this blog to help you with Bayesian computation (specifically R examples) and with general issues that come up in this course.

$g(\theta|y) \propto g(\theta) \times L(\theta)$
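As a small illustration of this proportionality, the posterior can be approximated on a grid by multiplying prior and likelihood pointwise and normalizing. The sketch below is only an illustration: the beta(2, 2) prior and the data (7 successes in 10 trials) are made-up assumptions, not values from the course.

```r
# Grid approximation of posterior = prior x likelihood for a proportion.
# The prior and the data below are hypothetical, chosen for illustration.
theta <- seq(0.001, 0.999, length.out = 999)  # grid of proportion values
prior <- dbeta(theta, 2, 2)                   # g(theta)
y <- 7; n <- 10                               # assumed data
like  <- dbinom(y, n, theta)                  # L(theta)
post  <- prior * like                         # unnormalized g(theta | y)
post  <- post / sum(post)                     # normalize over the grid
post_mean <- sum(theta * post)                # posterior mean on the grid
post_mean
```

Here the exact posterior is beta(9, 5), so the grid answer can be checked against the exact posterior mean 9/14.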

Categories: General

## Bayesian communication

Here are some thoughts about the project that my Bayesian students are working on now.

1.  When one communicates a Bayesian analysis, one should clearly state the prior beliefs and the prior distribution that matches these beliefs, the likelihood, and the posterior distribution.

2.  In any problem, there are particular inferential questions, and a Bayesian report should give the summaries of the posterior that answer the inferential questions.

3.  In the project, the questions are how the two proportions compare and whether it is reasonable to assume that the proportions are equal.   (The first question is an estimation problem and the second question relates to the choice of model.)

4.  What role does the informative prior play in the final inference?  In the project, the students perform two analyses, one with an informative prior and one with a vague prior.  By comparing the two posterior inferences, one can better understand the influence of the prior information.

5.  There is a computational aspect involved in obtaining the posterior distribution.   In a Bayesian report, one can talk about the general algorithms that were used.  But the computational details (like R code) have to be in the background, say in an appendix.

6.  The focus of the project (of course) is the Bayesian analysis.  But it is helpful to contrast the Bayesian analysis with frequentist methods.   The student should think of frequentist methods for estimation and testing and which methods are appropriate for addressing these questions.   In the project drafts, it seemed the weakest part of the draft was the description of the frequentist methods.
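The prior-sensitivity comparison in item 4 can be sketched in a few lines of R. Everything specific here is an assumption for illustration: the counts, the beta(1, 1) "vague" prior, and the beta(10, 10) "informative" prior are invented and are not the project data. Independent beta priors are placed on the two proportions and the difference is simulated from its posterior.

```r
# Compare two proportions under a vague and an informative prior.
# Data and prior parameters are hypothetical.
set.seed(42)
y1 <- 12; n1 <- 30   # successes / trials, group 1 (assumed)
y2 <- 20; n2 <- 30   # successes / trials, group 2 (assumed)

compare_props <- function(a, b, m = 10000) {
  # beta(a, b) prior on each proportion gives a beta posterior
  p1 <- rbeta(m, a + y1, b + n1 - y1)
  p2 <- rbeta(m, a + y2, b + n2 - y2)
  d  <- p2 - p1                        # simulated draws of the difference
  c(mean = mean(d), quantile(d, c(0.05, 0.95)), prob_positive = mean(d > 0))
}

vague       <- compare_props(1, 1)     # flat prior
informative <- compare_props(10, 10)   # prior concentrated near 0.5
round(rbind(vague, informative), 3)
```

Placing the two rows side by side shows how much the informative prior pulls the estimated difference toward zero. This addresses only the estimation question; the question of whether the proportions are equal needs a model comparison.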

Categories: General

## Bayesian software

A lot has happened in, say, the last 10 years with respect to Bayesian software.  This has been a controversial subject and it would be worthwhile to talk about some of the main issues.

1.  First, if one was going to design a Bayesian software package, what would it look like?  One advantage of the Bayesian paradigm is its flexibility in defining models, doing inference, and checking and comparing models.  So any software should allow for the input of “general” models including priors, a variety of methods for fitting the models, and also a variety of methods for doing inference (say, find a marginal posterior for a function of parameters of interest) and checking the validity of the model.

2.  Of course, the most popular Bayesian software program is BUGS, together with its derivatives such as WinBUGS and OpenBUGS.  It allows for general model specifications by writing a “model script”, it has a general MCMC computing engine that works for many problems, and it allows for general inference and model checking.

3.  OK, so should we all use BUGS for Bayesian computing?  Actually, I purposely don’t use BUGS in my Bayesian class and instead use my package LearnBayes in the R system.  Why?  Well, although BUGS is pretty easy to use, it is sort of a black box that one can use without understanding the issues in MCMC computing and diagnostics.   I want my students to understand the basic MCMC algorithms like Gibbs and Metropolis sampling and get some experience in implementing these algorithms to understand the pitfalls.  I would feel more comfortable teaching BUGS after the student has had some practice with MCMC, especially for examples where MCMC hasn’t converged or has mixing problems.

4.  Another approach is to program MCMC algorithms for specific Bayesian models.  This approach is taken by the R package MCMCpack.  For example, suppose I want to do a Bayesian linear regression using a normal prior on the regression vector and an inverse gamma prior on the variance.  Then there is a function in MCMCpack that will work fine, implement the Gibbs sampling, and give you a matrix of simulated draws and also the prior predictive density value (the marginal likelihood) that can be used in model comparison.  But suppose I want to use a t prior instead of a normal prior for beta.  Then I’m stuck.  These specific algorithms are useful if you want to fit “standard” models, but we lose the flexibility that is one of the advantages of Bayes.

5.  Of course, as the programs become more flexible, it takes a more experienced Bayesian to actually run them.  If we wish to introduce Bayes to the masses, maybe we need to provide a suite of canned programs.
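To make the Gibbs sampling mentioned in items 3 and 4 concrete, here is a minimal base R sketch for the linear regression model with a normal prior on the regression vector (mean b0, precision B0) and an inverse gamma(c0/2, d0/2) prior on the variance. The simulated data, the prior values, and all variable names are assumptions for illustration; this is a hand-rolled sampler, not the MCMCpack implementation.

```r
# Gibbs sampler for y = X beta + e, e ~ N(0, sigma2), with
# beta ~ N(b0, B0^{-1}) and sigma2 ~ IG(c0/2, d0/2).
# Data and prior settings are hypothetical.
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n))                       # intercept + one predictor
y <- c(X %*% c(2, -1) + rnorm(n, sd = 0.5))   # simulated response

b0 <- c(0, 0); B0 <- diag(0.01, 2)            # weak normal prior (precision)
c0 <- 0.01; d0 <- 0.01                        # weak inverse gamma prior

n_iter <- 2000
draws <- matrix(NA, n_iter, 3,
                dimnames = list(NULL, c("beta1", "beta2", "sigma2")))
XtX <- crossprod(X); Xty <- crossprod(X, y)
sigma2 <- 1                                   # starting value
for (i in 1:n_iter) {
  # beta | sigma2, y is normal with precision B0 + X'X / sigma2
  V <- solve(B0 + XtX / sigma2)
  m <- V %*% (B0 %*% b0 + Xty / sigma2)
  beta <- c(m + t(chol(V)) %*% rnorm(2))
  # sigma2 | beta, y is inverse gamma
  sse <- sum((y - X %*% beta)^2)
  sigma2 <- 1 / rgamma(1, shape = (c0 + n) / 2, rate = (d0 + sse) / 2)
  draws[i, ] <- c(beta, sigma2)
}
post_mean <- colMeans(draws[-(1:500), ])      # discard 500 burn-in draws
round(post_mean, 2)
```

Swapping the normal prior for a t prior would change the conditional distribution of beta, which is exactly the kind of change a model-specific sampler cannot absorb without being rewritten.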

It will be interesting to see how Bayesian software will evolve.  It is pretty clear that BUGS will be a major player in the future, perhaps with a new interface.

Categories: Bayesian software

## Bayesian regression – part II

In the earlier posting (part I), I demonstrated how one obtains a prior using a portion of the data and then uses this as a prior with the remaining portion of the data to obtain the posterior.

One advantage of using proper priors is that one can implement model selection.

Recall that the output of the MCMCregress function was fit2 — this is the regression model including all four predictors Food, Decor, Service, and East.

Suppose I want to compare this model with the model that removes the Food variable.  I just adjust my prior.  If I don’t believe this variable should be in the model, my prior for that regression coefficient should be centered at 0 with a tiny variance (equivalently, a large precision).   Currently my prior mean is stored in the variable beta0 and the precision matrix is stored in B0.   I change the component of beta0 corresponding to Food to be 0 and then assign a really big value to the (Food, Food) entry in the precision matrix B0.  I rerun MCMCregress using this new model and the output is stored in fit2.a.

# Food is the second coefficient (after the intercept)
beta0.a=beta0; beta0.a[2]=0
# large prior precision pins the Food coefficient near 0
B0.a=B0; B0.a[2,2]=200000
fit2.a=MCMCregress(Price~Food+Decor+Service+East,
data = italian2, burnin = 0, mcmc = 10000,
thin = 1, verbose = 0, seed = NA, beta.start = NA,
b0 = beta0.a, B0 = B0.a, c0 = c0, d0 = d0,
marginal.likelihood = c("Chib95"))

To compare the two models (the full one and the one with Food removed), I use the BayesFactor function in MCMCpack.  This gives the log marginal likelihood for each model and compares models by Bayes factors.  Here is the key output.

BayesFactor(fit2,fit2.a)

The matrix of the natural log Bayes Factors is:

       fit2 fit2.a
fit2      0    303
fit2.a -303      0

On the natural log scale, the full model is far superior to the model with Food removed, since the log Bayes factor in support of the full model is 303.  Clearly, Food must be an important variable.
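
To get a feel for how large a log Bayes factor of 303 is, a short calculation may help. The sketch below assumes equal prior probabilities for the two models, which is my assumption and is not stated above; the variable names are illustrative.

```r
# Interpreting a natural-log Bayes factor of 303 (full model vs. Food removed).
log_bf <- 303                      # value read from the BayesFactor output
log10_bf <- log_bf / log(10)       # same evidence on the base-10 scale
# Posterior probability of the full model, assuming equal prior model
# probabilities (an assumption): BF / (1 + BF) = plogis(log BF)
post_prob_full <- plogis(log_bf)
c(log10_bf = log10_bf, post_prob_full = post_prob_full)
```

Working on the log scale avoids computing the Bayes factor itself, which is on the order of 10 raised to the power log10_bf.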

By the way, in LearnBayes, I have a function bayes.model.selection that implements Bayesian model selection using a g-prior on the regression coefficients.   This function gives reasonable results for this example.

Categories: Regression

## Logistic regression exercise

In my Bayesian class, I assigned a problem (exercise 8.3 from BCWR) where one is using logistic regression to model the trajectory of Mike Schmidt’s home run rates.

First, I have written a function to compute the log posterior for a logistic regression model with a vague prior placed on the regression vector.  The function, together with a short example of how it works, can be found here.

Second, it seems that there is a problem getting the laplace function to converge if you don’t use a really good starting value.  Here is an alternative way of getting the posterior mode.

Suppose y is a vector containing the home run counts and n is a vector with the sample sizes.  Let age be the vector of ages and age2 be the vector of ages squared.  Define the response to be a matrix containing the vectors of successes and failures.

response=cbind(y,n-y)

Then the MLE of beta is found using the glm command with the family=binomial option:

fit=glm(response~age+age2,family=binomial)

Since the prior is vague, the posterior mode is essentially the MLE:

beta=coef(fit)

The approximation to the posterior variance-covariance matrix is found by

V=vcov(fit)

Now you should be able to use the random walk Metropolis algorithm to simulate from the posterior, using beta as the starting value and V to scale the proposal.
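
Putting these steps together, a random walk Metropolis run might look like the following base R sketch. It is not the function linked above and it does not use the Mike Schmidt data: the counts are simulated, the prior is taken to be flat, and the proposal scale and all names are assumptions for illustration.

```r
# Random walk Metropolis for a quadratic logistic regression, flat prior.
# The data are simulated stand-ins, not the exercise data.
set.seed(123)
age  <- 23:40
age2 <- age^2
n <- rep(500, length(age))                       # assumed sample sizes
y <- rbinom(length(age), n, plogis(-9 + 0.4 * age - 0.0065 * age2))
X <- cbind(1, age, age2)

# log posterior = binomial log likelihood (up to a constant) under a flat prior
logpost <- function(beta) {
  eta <- c(X %*% beta)
  sum(y * eta - n * log1p(exp(eta)))
}

fit   <- glm(cbind(y, n - y) ~ age + age2, family = binomial)
start <- coef(fit)                               # starting value: the MLE
V     <- vcov(fit)                               # proposal shape

m <- 5000; scale <- 1                            # scale is a tuning choice
R <- chol(V)
draws <- matrix(NA, m, 3)
cur <- start; cur_lp <- logpost(cur); accept <- 0
for (i in 1:m) {
  cand    <- cur + scale * c(t(R) %*% rnorm(3))  # N(cur, scale^2 V) proposal
  cand_lp <- logpost(cand)
  if (log(runif(1)) < cand_lp - cur_lp) {        # Metropolis accept step
    cur <- cand; cur_lp <- cand_lp; accept <- accept + 1
  }
  draws[i, ] <- cur
}
accept_rate <- accept / m
post_mean   <- colMeans(draws)
```

The acceptance rate is worth checking before trusting the draws; if it is very low or very high, adjusting scale is the usual first remedy.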

Categories: Regression

## Bayesian regression – part I

To illustrate Bayesian regression, I’ll use an example from Simon Sheather’s new book on linear regression.  A restaurant guide collects a number of variables from a group of Italian restaurants in New York City.  One is interested in predicting the Price of a meal based on Food (measure of quality), Decor (measure of decor of the restaurant), Service (measure of quality of service), and East (dummy variable indicating if the restaurant is east or west of 5th Avenue).

Here is the dataset in csv format:  http://www.stat.tamu.edu/~sheather/book/docs/datasets/nyc.csv

Assuming the usual regression model

$y_i = x_i' \beta + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$,

suppose we assign the following prior to $(\beta, \sigma^2)$:  we assume independence with $\beta \sim N(\beta_0, B_0^{-1})$ and $\sigma^2 \sim IG(c_0/2, d_0/2)$.  Here $B_0$ is the precision matrix, the inverse of the variance-covariance matrix.

How do we obtain values of the parameters $\beta_0, B_0, c_0, d_0$?  We’ll use a small part of the data to construct our prior, and then use the remaining part of the data to do inference and eventually do some model selection.

First, I read in the data, select a random sample of rows, and partition the data into the two datasets italian1 and italian2.

italian=read.csv("http://www.stat.tamu.edu/~sheather/book/docs/datasets/nyc.csv")
indices=sample(168,size=25)
italian1=italian[indices,]
italian2=italian[-indices,]

I use the function blinreg in the LearnBayes package to simulate from the posterior using a vague prior.

library(LearnBayes)
y=italian1$Price
X=with(italian1,cbind(1,Food,Decor,Service,East))
fit=blinreg(y,X,5000)

From the simulated draws, I can now get estimates of my prior.  The parameters for the normal prior on $\beta$ are found by finding the sample mean and sample var-cov matrix of the simulated draws.  The precision matrix $B_0$ is the inverse of the var-cov matrix.  The parameters $c_0, d_0$ correspond to the degrees of freedom and the sum of squared errors of my initial fit.

beta0=apply(fit$beta,2,mean)
Sigma0=cov(fit$beta)
B0=solve(Sigma0)
c0=(25-5)
d0=sum((y-X%*%beta0)^2)

The function MCMCregress in the MCMCpack package will simulate from the posterior of $(\beta, \sigma^2)$ using this informative prior.  It seems pretty easy to use.  I show the function including all of the arguments.  Note that I’m using the second part of the data italian2, I input my prior through the b0, B0, c0, d0 arguments, and I’m going to simulate 10,000 draws (through the mcmc=10000 input).

library(MCMCpack)
fit2=MCMCregress(Price~Food+Decor+Service+East,
data = italian2, burnin = 0, mcmc = 10000,
thin = 1, verbose = 0, seed = NA, beta.start = NA,
b0 = beta0, B0 = B0, c0 = c0, d0 = d0,
marginal.likelihood = c("Chib95"))

The output of MCMCregress, fit2, is an MCMC object that one can easily summarize and display using the summary and plot methods in the coda package.

Suppose we are interested in estimating the mean price at a restaurant on the east side (East = 1) with Food = 20, Decor = 20 and Service = 20.  We define a column matrix that consists of these covariates (with a leading 1 for the intercept).  Then we compute the linear predictor $x' \beta$ for each of the simulated beta vectors.  The vector lp contains 10,000 draws from the posterior of $x' \beta$.

x=matrix(c(1,20,20,20,1),5,1)
beta=fit2[,1:5]
sigma=sqrt(fit2[,6])
lp=c(beta%*%x)

Suppose next that we’re interested in predicting the price of a meal at the same covariates.  Since we have a sample from the posterior of the linear predictor, we just need to do an additional step to draw from the posterior predictive distribution; we store these values in ys.

ys=rnorm(10000,lp,sigma)

We illustrate the difference between estimating the mean price and predicting the price by summarizing the two sets of simulated draws.  The point estimates are similar, but, as expected, the predictive interval estimate is much wider.

summary(lp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  43.59   46.44   47.00   46.99   47.53   50.17

summary(ys)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  25.10   43.12   46.90   46.93   50.75   67.82

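To report the two interval estimates directly, one can take quantiles of the simulated draws. The sketch below is self-contained, so lp and sigma are replaced by simulated stand-ins whose centers and spreads are rough guesses chosen to resemble the summaries above; they are assumptions, not the actual MCMC output.

```r
# 90% intervals for the mean price and for a predicted price.
# lp and sigma below are hypothetical stand-ins for the MCMC draws.
set.seed(2011)
lp    <- rnorm(10000, mean = 47, sd = 0.8)                    # draws of x'beta
sigma <- sqrt(1 / rgamma(10000, shape = 70, rate = 70 * 32))  # draws of sigma
ys    <- rnorm(10000, lp, sigma)                              # predictive draws

int_mean <- quantile(lp, c(0.05, 0.95))   # interval for the mean price
int_pred <- quantile(ys, c(0.05, 0.95))   # interval for a future price
rbind(int_mean, int_pred)
```

The predictive interval is wider because it combines the uncertainty in the linear predictor with the sampling variability of an individual price.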
Categories: Regression