Archive for November, 2009

Using the sapply function in R

In R, I think one of the most useful functions is sapply.  By using this function, one can often avoid writing an explicit loop.  Let me illustrate its use with the binomial/beta exchangeable model.

The model is

1.  $y_i$ distributed binomial($n_i, p_i$)
2.  $p_1, ..., p_k$ are iid beta($K, \eta$)
3.  $(K, \eta)$ have a vague prior

The log posterior of $\theta_1 = \text{logit}(\eta)$ and $\theta_2 = \log K$ is programmed in the function betabinexch in the LearnBayes package.

Suppose we wish to obtain a 90% interval estimate for $p_j$.  To simulate from the posterior, we first simulate a value of $(K, \eta)$ from its posterior and then simulate $p_j$ from the beta($y_j + K \eta$, $n_j - y_j + K (1-\eta)$) distribution.

Let’s consider the famous Efron and Morris baseball data — $y_j$ is the number of hits in $n_j$ at-bats for the $j$th player.

I’ll read in the data, get a normal approximation to the posterior (function laplace), and then do a random walk Metropolis chain — I obtain a simulated sample from $(\theta_1, \theta_2)$.

y=c(12,7,14,9,18,16,15,13,8,10,17,10,11,10,14,11,10,10)
n=rep(45,18)
library(LearnBayes)
data=as.matrix(cbind(y,n))
fit=laplace(betabinexch,c(-1,4),data)
mcmc.fit=rwmetrop(betabinexch,list(var=fit$var,scale=2),c(-1,4),10000,data)

Now I write a function interval.estimate that has a single input j — it will simulate from the posterior of $p_j$ and output a 90% interval estimate.

interval.estimate=function(j)
{
K=exp(mcmc.fit$par[,2])
eta=exp(mcmc.fit$par[,1])/(1+exp(mcmc.fit$par[,1]))
thetaj=rbeta(10000,y[j]+K*eta,n[j]-y[j]+K*(1-eta))
quantile(thetaj,c(.05,.95))
}
If I wish to obtain a 90% interval estimate for $p_1$, I type
> interval.estimate(1)
       5%       95%
0.2052578 0.3288738
To do this in a single stroke for all 18 probabilities, I use sapply.  I wish to apply interval.estimate over all values 1, …, 18:
> sapply(1:18,interval.estimate)
         [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
5%  0.1981516 0.1535653 0.2240657 0.1825992 0.2481277 0.2339640 0.2203171
95% 0.3314778 0.2906053 0.3578334 0.3076716 0.4013249 0.3805313 0.3658903
         [,8]      [,9]     [,10]     [,11]     [,12]     [,13]     [,14]
5%  0.2061546 0.1637018 0.1841709 0.2375183 0.1837387 0.1948228 0.1878043
95% 0.3435467 0.3000630 0.3211896 0.3969801 0.3151900 0.3279358 0.3136143
        [,15]     [,16]     [,17]     [,18]
5%  0.2193047 0.1949027 0.1821835 0.1864202
95% 0.3616939 0.3240116 0.3109836 0.3109164
I get interval estimates for all probabilities.  The 18th column contains the 5th and 95th percentiles of the posterior of $p_{18}$.
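To see what sapply is buying us, here is a self-contained comparison with an explicit loop.  I use conjugate beta($y_j+1$, $n_j-y_j+1$) posteriors in place of the MCMC fit so the snippet runs without LearnBayes — the flat-prior analysis here is only an assumption for the demo, not the exchangeable model above.

```r
# Self-contained sketch: sapply vs. an explicit loop.
# Conjugate beta(y[j]+1, n[j]-y[j]+1) posteriors stand in for the MCMC fit.
y <- c(12,7,14,9,18,16,15,13,8,10,17,10,11,10,14,11,10,10)
n <- rep(45, 18)

# 90% interval from the beta(y[j]+1, n[j]-y[j]+1) posterior
interval.conjugate <- function(j)
  qbeta(c(.05, .95), y[j] + 1, n[j] - y[j] + 1)

# Loop version: preallocate a matrix and fill it column by column
loop.out <- matrix(0, 2, 18)
for (j in 1:18) loop.out[, j] <- interval.conjugate(j)

# sapply version: one line, same result
sapply.out <- sapply(1:18, interval.conjugate)
all.equal(loop.out, sapply.out)   # TRUE
```

Since interval.conjugate returns a vector of length 2, sapply automatically assembles the results into a 2 x 18 matrix, just as in the output above.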
Categories: R work

Robust modeling

The Phillies are still alive in the World Series and so I’m allowed to continue to talk about baseball.

In the situation where we observe a sample of measurement data $y_1, ..., y_n$, it is common to assume that they are a random sample from a normal population with mean $\mu$ and standard deviation $\sigma$.  But sometimes this assumption is inappropriate.  One measure of hitting effectiveness is the OBP, the proportion of time that a hitter gets on base.  I collected the OBPs (actually 1000 x OBP) for a group of 24 players who played for the west coast NL teams (Dodgers, Giants, and Padres) in the 2004 season.  Here’s a jittered dotplot of the OBPs:

Clearly, a normal population is inappropriate here, since we have one obvious outlier — Barry Bonds in that season had an unusually large OBP of 609.

In many situations, it makes more sense to assume that $y_1, ..., y_n$ are distributed from a “flat-tailed” distribution where outliers are more likely to occur.  One example of a flat-tailed distribution is the t distribution, which depends on a location $\mu$, a scale $\sigma$, and degrees of freedom $\nu$.  Here we assume a small value for the degrees of freedom and focus on estimating $(\mu, \sigma)$.

There is an attractive way of fitting a t sampling model: a t distribution can be represented as a scale mixture of normals.  Suppose we let $y$ be distributed normal with mean $\mu$ and variance $\sigma^2/\lambda$, and then we let the parameter $\lambda$ be distributed Gamma with shape $\nu/2$ and rate $\nu/2$.  When you integrate out $\lambda$, one sees this is equivalent to assuming $y$ has a t distribution with location $\mu$, scale $\sigma$, and $\nu$ degrees of freedom.
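This representation is easy to check by simulation — here is a quick sketch (the particular values of $\nu$, $\mu$, and $\sigma$ are my own choices for illustration):

```r
# Draw lambda ~ Gamma(nu/2, rate = nu/2), then y | lambda ~ N(mu, sigma^2/lambda);
# marginally, y should follow a t distribution with nu degrees of freedom,
# location mu, and scale sigma.
set.seed(1)
nu <- 4; mu <- 0; sigma <- 1
lambda <- rgamma(1e5, shape = nu/2, rate = nu/2)
y <- rnorm(1e5, mean = mu, sd = sigma/sqrt(lambda))

# empirical quantiles of the mixture vs. the exact t quantiles
quantile(y, c(.05, .50, .95))
mu + sigma*qt(c(.05, .50, .95), df = nu)   # should agree closely
```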

Suppose we use this scale mixture representation for each observation.  We can write this as a hierarchical model:

1.  $y_i$ is distributed N($\mu, \sigma^2/\lambda_i$)

2. $\lambda_i$ is distributed Gamma ($\nu/2,\nu/2$)

3. $(\mu, \sigma^2)$ has the prior $1/\sigma^2$.

In class, we’ll show that one can sample from this model easily using Gibbs sampling.  This algorithm is implemented in the function robustt in the LearnBayes package.
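The full conditional distributions are all standard: each $\lambda_i$ given the rest is Gamma, $\mu$ is normal, and $\sigma^2$ is inverse gamma.  Here is a minimal Gibbs sketch of my own along those lines (robustt in LearnBayes implements the same idea; the toy data below are simulated stand-ins, not the actual OBPs):

```r
# Minimal Gibbs sampler for the scale-mixture t model with prior 1/sigma^2.
gibbs.t <- function(y, nu, m = 2000) {
  n <- length(y)
  mu <- mean(y); sig2 <- var(y)   # starting values
  out <- matrix(0, m, 2, dimnames = list(NULL, c("mu", "sigma2")))
  for (s in 1:m) {
    # lambda_i | rest ~ Gamma((nu+1)/2, rate = (nu + (y_i - mu)^2/sig2)/2)
    lam <- rgamma(n, (nu + 1)/2, rate = (nu + (y - mu)^2/sig2)/2)
    # mu | rest ~ N(weighted mean, sig2/sum(lam))
    mu <- rnorm(1, sum(lam*y)/sum(lam), sqrt(sig2/sum(lam)))
    # sig2 | rest ~ inverse gamma(n/2, sum(lam*(y - mu)^2)/2)
    sig2 <- 1/rgamma(1, n/2, rate = sum(lam*(y - mu)^2)/2)
    out[s, ] <- c(mu, sig2)
  }
  out
}

# Toy check: 23 ordinary values plus one wild outlier like Bonds' 609.
set.seed(2)
obp <- c(rnorm(23, 340, 30), 609)
draws <- gibbs.t(obp, nu = 4)
mean(draws[-(1:500), "mu"])   # should stay near 340, not be dragged toward 609
```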

The posteriors of the extra scale parameters $\lambda_i$ are of interest, since they reflect the influence of the $i$th observation on the estimation of the mean and variance.  Below I display error bars that show 90% interval estimates for each of the $\lambda_i$ for my OBP example.  Note that the posterior for the scale parameter for Barry Bonds is concentrated near 0.  This is reasonable, since it is desirable to downweight the influence of Bonds’ OBP in the estimation of, say, the average OBP.

Categories: Hierarchical modeling

Home run hitting — priors?

At our department’s colloquium last week, there was an interesting presentation on the use of an exchangeable prior in a regression context.  The analysis was done in WinBUGS using some standard choices of “weakly informative” priors.  That raises a natural question — how did you decide on this prior?

Let’s return to the home run hitting example where an exchangeable prior was placed on the set of home run hitting probabilities.  We assumed that $p_1, ..., p_k$ were iid beta($K, \eta$) — a beta with mean $\eta$ and precision $K$ — and then the hyperparameters $\eta$ and $K$ were assumed independent, where $\eta$ was assigned the prior $\eta^{-1}(1-\eta)^{-1}$ and $K$ was assigned the log-logistic form $g(K) = 1/(K+1)^2$.

When I described these priors, I just said that they were “noninformative” and didn’t explain how I obtained them.

The main questions are:

1.  Does the choice of vague prior matter?
2.  If the answer to question 1 is “yes”, what should we do?

When one fits an exchangeable model, one is shrinking the observed rates $y_i/n_i$ towards a common value.  The posterior of $\eta$ tells us about the common value and the posterior of $K$ tells us about the degree of shrinkage.  The choice of vague prior for $\eta$ is not a concern — we can assume $\eta$ is uniform or distributed according to the improper prior $\eta^{-1}(1-\eta)^{-1}$.   However, the prior on $K$ can be a concern.  If one assigns the improper form $g(K) = 1/K$, it can be shown that the posterior may be improper.  So in general one needs to assign a proper prior to $K$.  One convenient choice is to assume that $\log K$ has a logistic density with location $\mu$ and scale $\tau$.  In my example, I assumed that $\mu = 0, \tau = 1$.
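One can verify directly that the logistic choice with $\mu = 0, \tau = 1$ reproduces the $g(K) = 1/(K+1)^2$ form mentioned above, and that it is a proper prior.  A quick check of my own:

```r
# Prior on K induced by log K ~ logistic(location 0, scale 1):
# by the change of variables K = exp(log K), g(K) = dlogis(log K)/K.
gK <- function(K) dlogis(log(K), location = 0, scale = 1)/K

# Matches the log-logistic form 1/(K+1)^2 at a few test points ...
K <- c(0.5, 1, 10)
all.equal(gK(K), 1/(K + 1)^2)                 # TRUE

# ... and is proper: it integrates to 1 over K > 0
integrate(gK, lower = 0, upper = Inf)$value   # essentially 1
```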

Since my choice of parameters for this logistic prior on $\log K$ was arbitrary, I tried different choices for $\mu$ and $\tau$: values of $\mu$ between -4 and 4 and values of $\tau$ between 0.2 and 10.  What did I find?  The posterior mode of $\log K$ stayed in the 4.56 – 4.61 range for all of the priors I considered.  In this example, we have data from 605 players, and clearly the likelihood is driving the inference.

What if the choice of $\mu$ and $\tau$ does matter?  Then one has to think more carefully about these parameters.  In this baseball context, I am pretty familiar with home run hitting rates.  Based on this knowledge, I have to figure out a reasonable guess at the variability of the home run probabilities that would give me information about $K$.