Analyzing Baseball Data with R, Second Edition
Analyzing Baseball Data with R, Second Edition introduces R to sabermetricians, baseball enthusiasts, and students interested in exploring the richness of baseball data. It equips you with the skills and software tools needed to perform every step of an analysis: importing the data, transforming them into an appropriate format, visualizing them with graphs, and performing a statistical analysis.
The authors first present an overview of publicly available baseball datasets and a gentle introduction to the data structures and the exploratory and data management capabilities of R. They also cover the ggplot2 graphics functions and employ a tidyverse-friendly workflow throughout. Much of the book illustrates the use of R through popular sabermetrics topics, including the Pythagorean formula, run expectancy, catcher framing, career trajectories, simulation of games and seasons, patterns of streaky behavior of players, and launch angles and exit velocities. All the datasets and R code used in the text are available online.
New to the second edition are a systematic adoption of the tidyverse and the incorporation of Statcast player tracking data (made available by Baseball Savant). All code from the first edition has been revised according to the principles of the tidyverse. Tidyverse packages, including dplyr, ggplot2, tidyr, purrr, and broom, are emphasized throughout the book. Two entirely new chapters are made possible by the availability of Statcast data: one explores the notion of catcher framing ability, and the other uses launch angle and exit velocity to estimate the probability of a home run. Through the book’s various examples, you will learn about modern sabermetrics and how to conduct your own baseball analyses.
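To give a flavor of the kind of analysis described above, here is a minimal sketch of the Pythagorean formula, which estimates a team's winning percentage as R^2 / (R^2 + RA^2), where R is runs scored and RA is runs allowed. The function name `pythag_wpct`, the `teams` data frame, and its numbers are hypothetical and made up for illustration; this is not code from the book, and it uses base R only so that it runs without extra packages.

```r
# Pythagorean expectation: estimated winning percentage from runs scored
# and runs allowed. The exponent k = 2 is the classic choice; other values
# can be passed in to experiment.
pythag_wpct <- function(runs, runs_allowed, k = 2) {
  runs^k / (runs^k + runs_allowed^k)
}

# Hypothetical season totals (illustrative only, not real team data).
teams <- data.frame(
  team = c("A", "B", "C"),
  R    = c(800, 700, 650),
  RA   = c(650, 700, 800)
)

# Add the estimated winning percentage as a new column.
teams$W_pct_pyt <- pythag_wpct(teams$R, teams$RA)
print(teams)
```

A team that scores exactly as many runs as it allows gets an estimate of 0.5, and the estimate rises as runs scored outpace runs allowed.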
Max Marchi is a Baseball Analytics Analyst for the Cleveland Indians. He was a regular contributor to The Hardball Times and Baseball Prospectus websites and previously consulted for other MLB clubs.
Jim Albert is a Distinguished University Professor of statistics at Bowling Green State University. He has authored or coauthored several books including Curve Ball and Visualizing Baseball and was the editor of the Journal of Quantitative Analysis of Sports.
Ben Baumer is an assistant professor of statistical & data sciences at Smith College. Previously a statistical analyst for the New York Mets, he is a co-author of The Sabermetric Revolution and Modern Data Science with R.
- Introduction
- Introduction to R
- Graphics
- The Relation Between Runs and Wins
- Value of Plays Using Run Expectancy
- Balls and Strikes Effects
- Catcher Framing
- Career Trajectories
- Simulation
- Exploring Streaky Performances
- Using a Database to Compute Park Factors
- Batted Ball Data from Statcast
The Lahman Database: Season-by-Season Data
Bonds, Aaron, Ruth, and Rodriguez home run trajectories
Obtaining the database
The Master table
The Batting table
The Pitching table
The Fielding table
The Teams table
Baseball questions
Retrosheet Game-by-Game Data
The McGwire and Sosa home run race
Retrosheet
Game logs
Obtaining the game logs from Retrosheet
Game log example
Baseball questions
Retrosheet Play-by-Play Data
Event files
Event example
Baseball questions
Pitch-by-Pitch Data
MLBAM Gameday and PITCHf/x
PITCHf/x Example
Baseball questions
Player Movement and Off-the-Bat Data
Statcast
Baseball Savant data
Baseball questions
Summary
Further Reading
Exercises
Introduction
Installing R and RStudio
The Tidyverse
dplyr
The pipe
ggplot2
Other packages
Data Frames
Career of Warren Spahn
Introduction
Manipulations with data frames
Merging and selecting from data frames
Vectors
Defining and computing with vectors
Vector functions
Vector index and logical variables
Objects and Containers in R
Character data and data frames
Factors
Lists
Collection of R Commands
R scripts
R functions
Reading and Writing Data in R
Importing data from a file
Saving datasets
Packages
Splitting, Applying and Combining Data
Iterating using map()
Another example
Getting Help
Further Reading
Exercises
Introduction
Character Variable
A bar graph
Add axes labels and a title
Other graphs of a character variable
Saving Graphs
Numeric Variable: One-Dimensional Scatterplot and Histogram
Two Numeric Variables
Scatterplot
Building a graph, step-by-step
A Numeric Variable and a Factor Variable
Parallel stripcharts
Parallel boxplots
Comparing Ruth, Aaron, Bonds and A-Rod
Getting the data
Creating the player data frames
Constructing the graph
The Home Run Race
Getting the data
Extracting the variables
Constructing the graph
Further Reading
Exercises
Introduction
The Teams Table in the Lahman Database
Linear Regression
The Pythagorean Formula for Winning Percentage
The Exponent in the Pythagorean model
Good and Bad Predictions by the Pythagorean model
How Many Runs for a Win?
Further Reading
Exercises
The Run Expectancy Matrix
Runs Scored in the Remainder of the Inning
Creating the Matrix
Measuring Success of a Batting Play
Jose Altuve
Opportunity and Success for all Hitters
Position in the Batting Lineup
Run Values of Different Base Hits
Value of a home run
Value of a single
Value of Base Stealing
Further Reading
Exercises
Introduction
Hitter’s Counts and Pitcher’s Counts
An example for a single pitcher
Pitch sequences from Retrosheet
Functions for string manipulation
Finding plate appearances going through a given count
Expected run value by count
The importance of the previous count
Behavior by Count
Swinging tendencies by count
Propensity to swing by location
Effect of the ball/strike count
Pitch selection by count
Umpires' behavior by count
Further Reading
Exercises
Introduction
Acquiring Pitch-Level Data
Where is the Strike Zone?
Modeling Called Strike Percentage
Visualizing the estimates
Visualizing the estimated surface
Controlling for handedness
Modeling Catcher Framing
Further Reading
Exercises
Introduction
Mickey Mantle’s Batting Trajectory
Comparing Trajectories
Some preliminary work
Computing career statistics
Computing similarity scores
Defining age, OBP, SLG, and OPS variables
Fitting and plotting trajectories
General Patterns of Peak Ages
Computing all fitted trajectories
Patterns of peak age over time
Peak age and career at-bats
Trajectories and Fielding Position
Further Reading
Exercises
Introduction
Simulating a Half Inning
Markov chains
Review of work in run expectancy
Computing the transition probabilities
Simulating the Markov chain
Beyond run expectancy
Transition probabilities for individual teams
Simulating a Baseball Season
The Bradley-Terry model
Making up a schedule
Simulating talents and computing win probabilities
Simulating the regular season
Simulating the post-season
Function to simulate one season
Simulating many seasons
Further Reading
Exercises
Introduction
The Great Streak
Finding game hitting streaks
Moving batting averages
Streaks in Individual At-Bats
Streaks of hits and outs
Moving batting averages
Finding hitting slumps for all players
Were Ichiro Suzuki and Mike Trout unusually streaky?
Local Patterns of Statcast Launch Velocity
Further Reading
Exercises
Introduction
Installing MySQL and Creating a Database
Connecting R to MySQL
Connecting using RMySQL
Connecting R to other SQL backends
Filling a MySQL Game Log Database from R
From Retrosheet to R
From R to MySQL
Querying Data From R
Introduction
Coors Field and run scoring
Building Your Own Baseball Database
Lahman's database
Retrosheet database
PITCHf/x database
Statcast database
Calculating Basic Park Factors
Loading the data into R
Home run park factor
Assumptions of the proposed approach
Applying park factors
Further Reading
Exercises
Introduction
Spray Charts
Acquiring a year's worth of Statcast data
Hitters' spray tendencies and infield defense
Launch Angles and Exit Velocities
Scatterplot of launch angle vs exit velocity
Modeling Home Run Probabilities
Generalized additive model
Smooth predictions
Using this model to estimate home run production
Are Launch Angles Skills?
Distribution of launch angle
Is launch angle a skill?
Further Reading
Exercises
Appendix A Retrosheet Files Reference
Appendix B Accessing and Using MLBAM Gameday and PITCHf/x Data
Appendix C Accessing and Using Statcast Data from Baseball-Savant
"Overall, the book meets its main aim of teaching the reader to analyze real data using R. It is well suited to baseball fans, who have a solid statistical background, and want to learn R or modernize their style of R programming. Baseball fans with a more basic statistical education will also learn from this book . . ."
~Tim Downie, Journal of Statistical Software