Archive for November, 2009

Using the sapply function in R

November 5, 2009

In R, I think one of the most useful functions is sapply. By using this function, one can avoid writing an explicit loop. Let me illustrate its use with the binomial/beta exchangeable model.
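
As a quick illustration of the idea (a made-up example, not part of the model below), here is the same calculation done with an explicit loop and with sapply:

out=rep(NA,5)
for (j in 1:5) out[j]=j^2
out
sapply(1:5,function(j) j^2)   # same result, no explicit loop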

The model is

1.  y_i is distributed binomial(n_i, p_i)
2.  p_1, ..., p_k are iid beta with mean \eta and precision K, that is, beta(K\eta, K(1-\eta))
3.  (K, \eta) has a vague prior

The log posterior of \theta_1 = logit(\eta), \theta_2 = \log K is programmed in the function betabinexch in the LearnBayes package.

Suppose we wish to obtain a 90% interval estimate for p_j. To simulate from the posterior, we first simulate a value of (K, \eta) from its posterior and then simulate p_j from its conditional posterior, beta(y_j + K\eta, n_j - y_j + K(1-\eta)).

Let’s consider the famous Efron and Morris baseball data — y_j is the number of hits in n_j at-bats for the jth player.

I’ll read in the data, get a normal approximation to the posterior (using the function laplace), and then run a random walk Metropolis chain (rwmetrop) to obtain a simulated sample from the posterior of (\theta_1, \theta_2).

# Efron and Morris data: hits in 45 at-bats for each of the 18 players
y=c(12,7,14,9,18,16,15,13,8,10,17,10,11,10,14,11,10,10)
n=rep(45,18)
library(LearnBayes)
data=as.matrix(cbind(y,n))
# normal approximation to the posterior of (theta1, theta2) = (logit(eta), log K)
fit=laplace(betabinexch,c(-1,4),data)
# random walk Metropolis sample of 10000 draws from the posterior
mcmc.fit=rwmetrop(betabinexch,list(var=fit$var,scale=2),c(-1,4),10000,data)

Now I write a function interval.estimate that has a single input j; it simulates from the posterior of p_j and outputs a 90% interval estimate.

interval.estimate=function(j)
{
  # transform the simulated (theta1, theta2) back to K and eta
  K=exp(mcmc.fit$par[,2])
  eta=exp(mcmc.fit$par[,1])/(1+exp(mcmc.fit$par[,1]))
  # simulate p_j from its beta posterior (this uses the first 1000 of the 10000 draws)
  thetaj=rbeta(1000,y[j]+K*eta,n[j]-y[j]+K*(1-eta))
  # report the 5th and 95th percentiles as a 90% interval estimate
  quantile(thetaj,c(.05,.95))
}

If I wish to obtain a 90% interval estimate for p_1, I type
> interval.estimate(1)
5%       95%
0.2052578 0.3288738

To do this in a single stroke for all 18 probabilities, I use sapply. I wish to apply interval.estimate over all values 1, ..., 18:
> sapply(1:18,interval.estimate)
[,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
5%  0.1981516 0.1535653 0.2240657 0.1825992 0.2481277 0.2339640 0.2203171
95% 0.3314778 0.2906053 0.3578334 0.3076716 0.4013249 0.3805313 0.3658903
[,8]      [,9]     [,10]     [,11]     [,12]     [,13]     [,14]
5%  0.2061546 0.1637018 0.1841709 0.2375183 0.1837387 0.1948228 0.1878043
95% 0.3435467 0.3000630 0.3211896 0.3969801 0.3151900 0.3279358 0.3136143
[,15]     [,16]     [,17]     [,18]
5%  0.2193047 0.1949027 0.1821835 0.1864202
95% 0.3616939 0.3240116 0.3109836 0.3109164

I get interval estimates for all probabilities. The 18th column contains the 5th and 95th percentiles of the posterior of p_{18}.
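
If the sapply output is saved, each column is one player's interval, so individual intervals are easy to pull out. (This is a small follow-up of my own; the numbers will differ slightly from those above because new beta draws are simulated.)

intervals=sapply(1:18,interval.estimate)
intervals[,18]    # the 90% interval for p_18
t(intervals)      # one row per player, if that layout is preferred
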
Categories: R work

Robust modeling

November 4, 2009

The Phillies are still alive in the World Series and so I’m allowed to continue to talk about baseball.

In the situation where we observe a sample of measurement data y_1, ..., y_n, a common assumption is that they are a random sample from a normal population with mean \mu and standard deviation \sigma. But sometimes this assumption is inappropriate. One measure of hitting effectiveness is the OBP, the proportion of time that a hitter gets on base. I collected the OBPs (actually 1000 x OBP) for a group of 24 players who played for the west coast NL teams (Dodgers, Giants, and Padres) in the 2004 season. Here’s a jittered dotplot of the OBPs:

[Figure: jittered dotplot of the 2004 OBPs for the 24 west coast NL players]

Clearly, a normal population is inappropriate here since we have one obvious outlier: Barry Bonds in that season had an unusually large OBP of 609.

In many situations, it makes more sense to assume that y_1, ..., y_n are distributed from a “flat-tailed” distribution where outliers are more likely to occur. One example of a flat-tailed distribution is the t form, which depends on a location \mu, a scale \sigma, and a degrees of freedom \nu. Here we assume a small value for the degrees of freedom and focus on estimating (\mu, \sigma).

There is an attractive way of fitting a t sampling model. The key fact is that a t distribution can be represented as a scale mixture of normals. Suppose we let y be distributed normal with mean \mu and variance \sigma^2/\lambda, and then we let the parameter \lambda be distributed Gamma with shape \nu/2 and rate \nu/2. When you integrate out \lambda, one sees this is equivalent to assuming y has a t distribution with location \mu, scale \sigma, and \nu degrees of freedom.
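
Here is a quick numerical check of this representation (my own sketch, not from the post): draws produced by the two-stage recipe should look like draws of \mu + \sigma t_\nu.

set.seed(1)
mu=0; sigma=1; nu=4
lambda=rgamma(10000,shape=nu/2,rate=nu/2)         # lambda ~ Gamma(nu/2, nu/2)
y.mix=rnorm(10000,mean=mu,sd=sigma/sqrt(lambda))  # y | lambda ~ N(mu, sigma^2/lambda)
y.t=mu+sigma*rt(10000,df=nu)                      # direct t draws for comparison
qqplot(y.mix,y.t)                                 # points should fall near a straight line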

Suppose we use this scale mixture representation for each observation.  We can write this as a hierarchical model:

1.  y_i is distributed N(\mu, \sigma^2/\lambda_i)

2. \lambda_i is distributed Gamma (\nu/2,\nu/2)

3. (\mu, \sigma^2) has the noninformative prior proportional to 1/\sigma^2.

In class, we’ll show that one can sample from this model easily using Gibbs sampling. This algorithm is implemented in the function robustt in the LearnBayes package.

The posteriors of the extra scale parameters \lambda_i are of interest, since they reflect the influence of the ith observation on the estimation of the mean and variance. Below I display error bars that show 90% interval estimates for each of the \lambda_i in my OBP example. Note that the posterior for the scale parameter for Barry Bonds is concentrated near 0. This is reasonable, since it is desirable to downweight the influence of Bonds’ OBP in the estimation of, say, the average OBP.

[Figure: error bars showing 90% interval estimates for the scale parameters \lambda_i]
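
For completeness, here is a sketch of how this fit and these interval estimates might be computed (my own reconstruction, not code from the post). It assumes a vector obp holding the 24 values of 1000 x OBP, which are not listed here, uses \nu = 4 purely for illustration, and assumes the robustt interface of data, degrees of freedom, and number of Gibbs iterations, with the \lambda_i draws returned in the component lam.

library(LearnBayes)
fit=robustt(obp,4,10000)                        # Gibbs sampling for the t model (nu = 4 assumed)
lam.int=apply(fit$lam,2,quantile,c(.05,.95))    # 90% interval estimate for each lambda_i
lam.int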

Categories: Hierarchical modeling

Home run hitting — priors?

November 1, 2009

At our department’s colloquium last week, there was an interesting presentation on the use of an exchangeable prior in a regression context. The analysis was done in WinBUGS with some standard choices of “weakly informative” priors. That raises a natural question: how did you decide on this prior?

Let’s return to the home run hitting example where an exchangeable prior was placed on the set of home run hitting probabilities. We assumed that p_1, ..., p_k were iid beta with mean \eta and precision K, and the hyperparameters \eta and K were assumed independent, where \eta was assigned the prior \eta^{-1}(1-\eta)^{-1} and K was assigned the log logistic form g(K) = 1/(K+1)^2.

When I described these priors, I just said that they were “noninformative” and didn’t explain how I obtained them.

The main questions are:

1.  Does the choice of vague prior matter?
2.  If the answer to question 1 is “yes”, what should we do?

When one fits an exchangeable model, one is shrinking the observed rates y_i/n_i towards a common value.  The posterior of \eta tells us about the common value and the posterior of K tells us about the degree of shrinkage.  The choice of vague prior for \eta is not a concern — we can assume \eta is uniform or distributed according to the improper prior \eta^{-1}(1-\eta)^{-1}.   However, the prior on K can be a concern.  If one assigns the improper form g(K) = 1/K, it can be shown that the posterior may be improper.  So in general one needs to assign a proper prior to K.  One convenient choice is to assume that \log K has a logistic density with location \mu and scale \tau.  In my example, I assumed that \mu = 0, \tau = 1.
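
As a quick check connecting this choice to the earlier form g(K) = 1/(K+1)^2: if \log K has a standard logistic density (\mu = 0, \tau = 1), then transforming back to the K scale gives exactly that density. A short verification of my own:

K=c(.5,2,10,100)
dlogis(log(K),location=0,scale=1)/K   # density of K implied by log K ~ logistic(0, 1)
1/(K+1)^2                             # the log logistic form used earlier; the two agree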

Since my choice of parameters for this logistic prior on \log K was arbitrary, I tried different choices of \mu and \tau: values of \mu between -4 and 4 and values of \tau between 0.2 and 10. What did I find? The posterior mode of \log K stayed in the 4.56 to 4.61 range for all of the priors I considered. In this example, we have data from 605 players, and clearly the likelihood is driving the inference.
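
Here is a rough sketch of that kind of sensitivity check (my own reconstruction, not the author's code). It assumes vectors y and n holding the home run counts and at-bats for the 605 players, which are not given in the post, writes out the beta-binomial log posterior directly so the logistic prior on \log K can be varied, and uses laplace from LearnBayes to locate the posterior mode for each (\mu, \tau):

library(LearnBayes)
logpost=function(theta,stuff)
{
  # theta = (logit(eta), log K); stuff carries the data and the prior parameters
  y=stuff$data[,1]; n=stuff$data[,2]
  eta=exp(theta[1])/(1+exp(theta[1]))
  K=exp(theta[2])
  # beta-binomial log likelihood for the exchangeable model
  loglike=sum(lbeta(K*eta+y,K*(1-eta)+n-y)-lbeta(K*eta,K*(1-eta)))
  # logistic prior on log K with location mu and scale tau
  # (the eta prior is uniform on the logit scale, so it only adds a constant)
  loglike+dlogis(theta[2],stuff$mu,stuff$tau,log=TRUE)
}
modes=NULL
for (mu in c(-4,0,4))
  for (tau in c(.2,1,10))
  {
    fit=laplace(logpost,c(-4,4),list(data=cbind(y,n),mu=mu,tau=tau))
    modes=rbind(modes,c(mu,tau,fit$mode[2]))   # posterior mode of log K for this prior
  }
modes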

What if the choice of \mu and \tau does matter?  Then one has to think more carefully about these parameters.  In this baseball context, I am pretty familiar with home run hitting rates.  Based on this knowledge, I have to figure out a reasonable guess at the variability of the home run probabilities that would give me information about K.