Sunday, April 23, 2006

Everyday they write the book


Baseball Prospectus 2006, Goldman et al., Workman, $18.95


Two nights ago, stuck in a Barnes and Noble waiting out my daughter's community service (honor society, not criminal justice!), I caved and picked up a copy of BP2006. BP is a must-read for anybody interested in the data-analysis side of baseball, from general managers to player agents to serious fans. The book and the website are always thought-provoking, usually entertaining, and often well written.

I just wish they weren't so damned arrogant. From the blurbs on the back cover trumpeting what they got right to the self-congratulatory throwing-around of "best" everywhere inside (which they are, but they don't have to keep telling me this), the book is indelibly marred by the constant self-promotion. Take this: "In this book, Baseball Prospectus presents the most advanced analytical view..." (p. 1). Or this: "PECOTA is already the best system of its kind" (p. 6). Both may be true. Both are throwaway sentences that should have been edited out, as they convey no information (other than that the author is inordinately proud) and waste paper and ink. BP is to baseball analysis what Ein Heldenleben was to symphony orchestra performances: bombast to the point of vulgarity critically marring elegant and deft work elsewhere.

The saving grace is that the book is an ensemble production, and contains priceless essays like Gary Huckabay's "Where Does Statistical Analysis Fall Down? Reality and Perception" and "Iceberg Stories", maybe the best baseball-as-business analysis I've read in forever. There are thousands of clever, well-informed player comments. At 553 pages, there are enough nuggets to entertain for months.

Then there's Figure 1, on page 510, the most pompous example of oversimplification imaginable. This figure presents a computer-science-like decision tree for a stolen base attempt, showing the three outcomes of the decision: no attempt, successful steal, and caught stealing. Fine. Then Table 7 shows the breakeven percentage to a tenth of a percent. My problem is that the outcome tree for "attempt steal" is indistinguishable in the data from "try hit and run" (or "try run and hit"), and the number of outcomes is not three but dozens. The overwhelming majority of the time, an attempted steal involves a pitch delivered to a batter who may or may not swing. They ignore "catcher throws ball into center field", or "pitcher balks", or "batter fouls off pitch", or "batter lines into double play". A stolen base or a caught stealing, as data, is the precipitate in the bottom of the flask. The decision to run (or not) may be made by the manager, and is several events away from the actual recorded data column. These factors are difficult to divine and happen far more often than the tenths-of-a-percent place in the numbers would indicate, and ignoring them fatally poisons the subsequent analysis. Drawing the conclusion (as Keith Woolner does) that Tad Iguchi was the least opportune base stealer in MLB last year is like Sherlock Holmes solving cases from his monographs on tobacco ash: fanciful, fun, and utter fiction. Printing three-digit breakeven numbers from this oversimplified decision tree leads to analysis paralysis and, rather than contributing to the knowledge base, just fuels the skepticism so elegantly detailed in the Huckabay essay's Woodwardian interview section. Drawing conclusions about player abilities from this noise is hubris worthy of Greek tragedy.
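To make the complaint concrete, here is a rough sketch of the breakeven arithmetic behind a three-outcome tree, and of what happens when just one of the ignored outcomes is added back in. Everything in it is my own assumption for illustration: the run-expectancy values, the 5% chance of a wild throw, and the function names are all hypothetical and are not taken from the book or from Woolner's tables.

```python
# Hypothetical run-expectancy values (expected runs for the rest of the
# inning). These are placeholders, not figures from BP2006.
RE_STAY = 0.90      # runner holds at first
RE_SUCCESS = 1.15   # runner steals second safely
RE_CAUGHT = 0.30    # runner is thrown out
RE_ERROR = 1.40     # catcher's throw sails into center, runner takes third


def breakeven_rate(re_stay, re_success, re_caught):
    """Success rate p where p*re_success + (1-p)*re_caught == re_stay."""
    if re_success <= re_caught:
        raise ValueError("a successful steal must be worth more than an out")
    return (re_stay - re_caught) / (re_success - re_caught)


def breakeven_with_error(re_stay, re_success, re_caught, re_error, q_error):
    """Breakeven p when a fraction q_error of attempts end in a wild throw.

    Solves q*re_error + (1-q)*(p*re_success + (1-p)*re_caught) == re_stay
    for p, the success rate on the attempts that do not end in an error.
    """
    if not 0 <= q_error < 1:
        raise ValueError("q_error must be in [0, 1)")
    adjusted_stay = (re_stay - q_error * re_error) / (1 - q_error)
    return breakeven_rate(adjusted_stay, re_success, re_caught)


simple = breakeven_rate(RE_STAY, RE_SUCCESS, RE_CAUGHT)
with_error = breakeven_with_error(
    RE_STAY, RE_SUCCESS, RE_CAUGHT, RE_ERROR, q_error=0.05
)

print(f"three-outcome breakeven: {simple:.1%}")
print(f"with 5% wild throws:     {with_error:.1%}")
print(f"shift:                   {simple - with_error:.1%}")
```

With these made-up inputs, one extra branch moves the breakeven by about three percentage points, which is roughly thirty times the tenth-of-a-percent precision being printed. The real size of the shift depends entirely on numbers nobody is recording.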

So buy the book. It's entertaining (and lacks the Kenny Williams character assassination this year), but don't take them at their word. It's hardly scholarly, as the peer review is a bunch of people interested in making money from the book, and the methods are usually kept proprietary. Hold your nose through the self-promotion, as hard as it is.
