Sunday, March 14, 2010

A glimpse at my current baseball research

I spent this weekend battling OpenOffice's slowness over coffee while compiling ball-in-play stats. My current project is to assemble run data for projecting pitchers for 2010, and perhaps fielders as well. The projections will be based on a four-year composite that weights last year's run expectancy stats most heavily (to account for present-day parks and talent) and each preceding season less, down to 2006, which is weighted the least. I could go as far back as 2005 if the sample is still small enough that the averages shift. But I'm already seeing the run values settle in around firm numbers. Based on my composite RE data, groundouts are worth -0.088 run, flyballs about 0.025 run and line drives 0.324 run. I still need to verify my walk and strikeout numbers (0.30 and -0.28, respectively), as well as my stolen base and caught stealing numbers.
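For anyone who wants to see the composite idea in concrete terms, here is a minimal Python sketch of a weighted average across seasons. The weights (4, 3, 2, 1) and the per-season groundout values are made-up placeholders, not the numbers I actually use, and the function name `composite_run_value` is just for illustration.

```python
# Hypothetical season weights: the most recent season counts most.
# These are placeholders, not the actual weights behind the composite.
SEASON_WEIGHTS = {2009: 4.0, 2008: 3.0, 2007: 2.0, 2006: 1.0}


def composite_run_value(values_by_season, weights=SEASON_WEIGHTS):
    """Return the weighted average of per-season run values for one event type."""
    # Only use seasons that have a weight assigned.
    seasons = [s for s in values_by_season if s in weights]
    if not seasons:
        raise ValueError("no seasons overlap with the weights")
    total_weight = sum(weights[s] for s in seasons)
    weighted_sum = sum(weights[s] * values_by_season[s] for s in seasons)
    return weighted_sum / total_weight


# Made-up per-season groundout run values, for illustration only.
example_groundout = {2009: -0.09, 2008: -0.08, 2007: -0.10, 2006: -0.07}
composite = composite_run_value(example_groundout)
print(round(composite, 3))
```

Adding 2005 would just mean adding one more entry to the weights and the per-season values. A fuller version might also scale each season by its number of events, which this sketch ignores.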

Ultimately, even if someone were to copy this approach verbatim (always a risk, especially among bitter enemies), I would probably weight the year-to-year data differently than the incumbent sabermetric community would. I would also tweak my methodology and breakdown as I discovered new trends, rather than testing methodologies within the context of incumbent theories and methods, as is the norm.

Plus I'm not doing this for any sort of sabermetric glory, or for anything other than being able to project players and perhaps succeed at fantasy leagues. Anyone who wants to try these ideas out in the context of wOBA, SIERA or tRA can certainly do so: those numbers have their uses and their limits, and this approach can be productive within the context of those metrics.

I may look into doing ball-in-play run values by base/out situation, but that could take some time, and I may not end up with a large enough sample to give me firm data there. I ended up with about 40000 flyballs, over 20000 line drives and over 500000 groundballs for 2009. Broken up by base and out situation, would I have enough data to conclusively state a run value for, say, a line drive hit with two outs and men on the corners? You could go back ten years and get all the sample you need, but the farther back you go, the less contextually relevant the data becomes. The run environments of 2002 aren't the run environments of today. Sure, you could park adjust, but any adjustment is based on incumbent assumptions, which defeats the point of collecting a sample to determine accurate inherent values in the first place.
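To put rough numbers on the sample size worry, here is a small Python sketch. It assumes the usual 24 base/out states (8 base configurations times 3 out counts) and pretends events are spread evenly across them, which they are not in real play. The standard deviation passed to `standard_error` is a placeholder I picked for illustration, not a measured figure.

```python
import math

BASE_STATES = 8   # empty, 1st, 2nd, 3rd, 1st+2nd, 1st+3rd, 2nd+3rd, loaded
OUT_STATES = 3    # 0, 1 or 2 outs
N_STATES = BASE_STATES * OUT_STATES


def average_per_state(total_events):
    """Average events per base/out state if they were spread evenly (a simplification)."""
    return total_events / N_STATES


def standard_error(std_dev, n):
    """Standard error of a mean run value estimated from n events."""
    if n <= 0:
        raise ValueError("n must be positive")
    return std_dev / math.sqrt(n)


# 20000 line drives, split evenly across all base/out states.
per_state = average_per_state(20000)

# Placeholder standard deviation of the run value of a single line drive.
assumed_std_dev = 0.5
print(round(per_state), round(standard_error(assumed_std_dev, per_state), 3))
```

An even split leaves roughly 833 line drives per state, and the rarer states (such as two outs with men on the corners) would get far fewer than that, so their run values would be much shakier than the overall figure.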

This is just a sliver of what I've been working on with my baseball research. More will come as things flesh out.
