Home > MCMC > Robust Modeling using Gibbs Sampling

Robust Modeling using Gibbs Sampling

To illustrate the use of Gibbs sampling for robust modeling, here are the batting statistics for the “regular” players on the San Francisco Giants for the 2007 baseball season:

`Pos Player              Ag   G   AB    R    H   2B 3B  HR  RBI  BB  SO   BA    OBP   SLG  SB  CS  GDP HBP  SH  SF IBB  OPS+---+-------------------+--+----+----+----+----+---+--+---+----+---+----+-----+-----+-----+---+---+---+---+---+---+---+----+C   Bengie Molina       32  134  497   38  137  19  1  19   81  15   53  .276  .298  .433   0   0  13   2   1   2   2   861B *Ryan Klesko         36  116  362   51   94  27  3   6   44  46   68  .260  .344  .401   5   1  14   1   1   1   2   922B #Ray Durham          35  138  464   56  101  21  2  11   71  53   75  .218  .295  .343  10   2  18   2   0   9   6   653B  Pedro Feliz         32  150  557   61  141  28  2  20   72  29   70  .253  .290  .418   2   2  15   1   0   3   2   81SS #Omar Vizquel        40  145  513   54  126  18  3   4   51  44   48  .246  .305  .316  14   6  14   1  14   3   6   62LF *Barry Bonds         42  126  340   75   94  14  0  28   66 132   54  .276  .480  .565   5   0  13   3   0   2  43  170CF *Dave Roberts        35  114  396   61  103  17  9   2   23  42   66  .260  .331  .364  31   5   4   0   4   0   1   80RF #Randy Winn          33  155  593   73  178  42  1  14   65  44   85  .300  .353  .445  15   3  12   7   4   5   3  105    Rich Aurilia        35   99  329   40   83  19  2   5   33  22   45  .252  .304  .368   0   0   8   4   0   3   1   73    Kevin Frandsen      25  109  264   26   71  12  1   5   31  21   24  .269  .331  .379   4   3  17   5   3   3   3   84   *Fred Lewis          26   58  157   34   45   6  2   3   19  19   32  .287  .374  .408   5   1   4   3   1   0   0  103   #Dan Ortmeier        26   62  157   20   45   7  4   6   16   7   41  .287  .317  .497   2   1   2   1   0   2   1  107    Rajai Davis         26   51  142   26   40   9  1   1    7  14   25  .282  .363  .380  17   4   0   4   2   0   1   93   *Nate Schierholtz    23   39  112    9   34   5  3   0   10   2   19  .304  .316  .402   3   1   0   1   0   2   0   85   *Mark Sweeney        37   76   90   18   23   8  0   2   10  13   18  .256  .368  .411   2   0   0   3   1   0   0  102`

We’ll focus on the last batting measure OPS that is a summary of a player’s batting effectiveness.

We read the OPS values for the 15 players into R into the vector y.

> y
[1] 86 92 65 81 62 170 80 105 73 84 103 107 93 85 102

We assume that the observations y1, …, y15 are iid from a t distribution with location , scale  and 4 degrees of freedom. We place the usual noninformative prior  on .

We implement 10,000 iterations of Gibbs sampling by use of the function robustt.R in the LearnBayes package. This function is easy to use — we just input the data vector y, the degrees of freedom, and the number of iterations.

fit=robustt(y,4,10000)

The object fit is a list with components mu, sigma2, and lam — mu is a vector of simulated draws of , sigma2 is a vector of simulated draws of , and lam is a matrix of simulated draws of the scale parameters . (The  are helpful for identifying outliers in the data — here the outlier is pretty obvious.)

Below I have graphed the data as solid dots and placed a density estimate of the posterior of  on top. We see that the estimate of  appears to ignore the one outlier (Barry Bonds) in the data.

plot(y,0*y,cex=1.5,pch=19,ylim=c(0,.1),ylab=”DENSITY”,xlab=”MU”)
lines(density(fit\$mu),lwd=3,col=”red”)

By the way, we have assumed that robust modeling was suitable for this dataset since I knew that the Giants had one unusually good hitter on their team. Can we formally show that robust modeling the t(4) distribution is better than normal modeling for these data?

Sure — we can define two models (with the choice of proper prior distributions) and compare the models by use of a Bayes factor. I’ll illustrate this in my next posting.