2nd Edition

Analyzing Baseball Data with R, Second Edition

By Max Marchi, Jim Albert, Benjamin S. Baumer
Copyright 2019
    360 Pages
    by Chapman & Hall

    Analyzing Baseball Data with R, Second Edition introduces R to sabermetricians, baseball enthusiasts, and students interested in exploring the richness of baseball data. It equips you with the skills and software tools to perform every step of an analysis: importing the data, transforming them into an appropriate format, visualizing them with graphs, and performing a statistical analysis.

    The authors first present an overview of publicly available baseball datasets and a gentle introduction to the types of data structures and the exploratory and data management capabilities of R. They also cover the ggplot2 graphics functions and employ a tidyverse-friendly workflow throughout. Much of the book illustrates the use of R through popular sabermetrics topics, including the Pythagorean formula, run expectancy, catcher framing, career trajectories, simulation of games and seasons, patterns of streaky behavior of players, and launch angles and exit velocities. All the datasets and R code used in the text are available online.
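
As a small illustration of one topic named above, the Pythagorean formula estimates a team's winning percentage from its runs scored and runs allowed. The sketch below is not taken from the book, which works in R; it is a minimal Python version that assumes the classic exponent of 2. The function name `pythagorean_win_pct` and the sample run totals are hypothetical, chosen only for the example.

```python
def pythagorean_win_pct(runs_scored: float, runs_allowed: float,
                        exponent: float = 2.0) -> float:
    """Estimate winning percentage as RS^k / (RS^k + RA^k).

    k = 2 is the classic choice and is an assumption of this sketch.
    """
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals must be non-negative")
    rs_k = runs_scored ** exponent
    ra_k = runs_allowed ** exponent
    if rs_k + ra_k == 0:
        raise ValueError("at least one run total must be positive")
    return rs_k / (rs_k + ra_k)


# Hypothetical team: 800 runs scored and 700 allowed over 162 games.
win_pct = pythagorean_win_pct(800, 700)
expected_wins = 162 * win_pct
print(f"{win_pct:.3f} winning percentage, about {expected_wins:.0f} wins")
```

Passing a different `exponent` makes it easy to compare how well alternative exponents track actual winning percentages.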

    New to the second edition are a systematic adoption of the tidyverse and incorporation of Statcast player tracking data (made available by Baseball Savant). All code from the first edition has been revised according to the principles of the tidyverse. Tidyverse packages, including dplyr, ggplot2, tidyr, purrr, and broom, are emphasized throughout the book. Two entirely new chapters are made possible by the availability of Statcast data: one explores the notion of catcher framing ability, and the other uses launch angle and exit velocity to estimate the probability of a home run. Through the book’s various examples, you will learn about modern sabermetrics and how to conduct your own baseball analyses.

    Max Marchi is a Baseball Analytics Analyst for the Cleveland Indians. He was a regular contributor to The Hardball Times and Baseball Prospectus websites and previously consulted for other MLB clubs.

    Jim Albert is a Distinguished University Professor of statistics at Bowling Green State University. He has authored or coauthored several books including Curve Ball and Visualizing Baseball and was the editor of the Journal of Quantitative Analysis of Sports.

    Ben Baumer is an assistant professor of statistical & data sciences at Smith College. Previously a statistical analyst for the New York Mets, he is a co-author of The Sabermetric Revolution and Modern Data Science with R.

     

    1. Introduction

      The Lahman Database: Season-by-Season Data

      Bonds, Aaron, Ruth, and Rodriguez home run trajectories

      Obtaining the database

      The Master table

      The Batting table

      The Pitching table

      The Fielding table

      The Teams table

      Baseball questions


      Retrosheet Game-by-Game Data

      The McGwire and Sosa home run race

      Retrosheet

      Game logs

      Obtaining the game logs from Retrosheet

      Game log example

      Baseball questions


      Retrosheet Play-by-Play Data

      Event files

      Event example

      Baseball questions

      Pitch-by-Pitch Data

      MLBAM Gameday and PITCHf/x

      PITCHf/x Example

      Baseball questions

      Player Movement and Off-the-Bat Data

      Statcast

      Baseball Savant data

      Baseball questions

      Summary

      Further Reading

      Exercises

    2. Introduction to R

      Introduction

      Installing R and RStudio

      The Tidyverse

      dplyr

      The pipe

      ggplot2

      Other packages


      Data Frames

      Career of Warren Spahn

      Introduction

      Manipulations with data frames

      Merging and selecting from data frames


      Vectors

      Defining and computing with vectors

      Vector functions

      Vector index and logical variables


      Objects and Containers in R

      Character data and data frames

      Factors

      Lists


      Collection of R Commands

      R scripts

      R functions

      Reading and Writing Data in R

      Importing data from a file

      Saving datasets


      Packages

      Splitting, Applying and Combining Data

      Iterating using map()

      Another example

      Getting Help

      Further Reading

      Exercises

    3. Graphics

      Introduction

      Character Variable

      A bar graph

      Add axes labels and a title

      Other graphs of a character variable

      Saving Graphs

      Numeric Variable: One-Dimensional Scatterplot and Histogram

      Two Numeric Variables

      Scatterplot

      Building a graph, step-by-step

      A Numeric Variable and a Factor Variable

      Parallel stripcharts

      Parallel boxplots

      Comparing Ruth, Aaron, Bonds and A-Rod

      Getting the data

      Creating the player data frames

      Constructing the graph

      The Home Run Race

      Getting the data

      Extracting the variables

      Constructing the graph

      Further Reading

      Exercises

    4. The Relation Between Runs and Wins

      Introduction

      The Teams Table in the Lahman Database

      Linear Regression

      The Pythagorean Formula for Winning Percentage

      The Exponent in the Pythagorean model

      Good and Bad Predictions by the Pythagorean model

      How Many Runs for a Win?

      Further Reading

      Exercises

    5. Value of Plays Using Run Expectancy

      The Run Expectancy Matrix

      Runs Scored in the Remainder of the Inning

      Creating the Matrix

      Measuring Success of a Batting Play

      Jose Altuve

      Opportunity and Success for all Hitters

      Position in the Batting Lineup

      Run Values of Different Base Hits

      Value of a home run

      Value of a single

      Value of Base Stealing

      Further Reading

      Exercises

    6. Balls and Strikes Effects

      Introduction

      Hitter’s Counts and Pitcher’s Counts

      An example for a single pitcher

      Pitch sequences from Retrosheet

      Functions for string manipulation

      Finding plate appearances going through a given count

      Expected run value by count

      The importance of the previous count

      Behavior by Count

      Swinging tendencies by count

      Propensity to swing by location

      Effect of the ball/strike count

      Pitch selection by count

      Umpires' behavior by count

      Further Reading

      Exercises

    7. Catcher Framing

      Introduction

      Acquiring Pitch-Level Data

      Where is the Strike Zone?

      Modeling Called Strike Percentage

      Visualizing the estimates

      Visualizing the estimated surface

      Controlling for handedness

      Modeling Catcher Framing

      Further Reading

      Exercises

    8. Career Trajectories

      Introduction

      Mickey Mantle’s Batting Trajectory

      Comparing Trajectories

      Some preliminary work

      Computing career statistics

      Computing similarity scores

      Defining age, OBP, SLG, and OPS variables

      Fitting and plotting trajectories

      General Patterns of Peak Ages

      Computing all fitted trajectories

      Patterns of peak age over time

      Peak age and career at-bats

      Trajectories and Fielding Position

      Further Reading

      Exercises

    9. Simulation

      Introduction

      Simulating a Half Inning

      Markov chains

      Review of work in run expectancy

      Computing the transition probabilities

      Simulating the Markov chain

      Beyond run expectancy

      Transition probabilities for individual teams

      Simulating a Baseball Season

      The Bradley-Terry model

      Making up a schedule

      Simulating talents and computing win probabilities

      Simulating the regular season

      Simulating the post-season

      Function to simulate one season

      Simulating many seasons

      Further Reading

      Exercises

       

    10. Exploring Streaky Performances

      Introduction

      The Great Streak

      Finding game hitting streaks

      Moving batting averages

      Streaks in Individual At-Bats

      Streaks of hits and outs

      Moving batting averages

      Finding hitting slumps for all players

      Were Ichiro Suzuki and Mike Trout unusually streaky?

      Local Patterns of Statcast Launch Velocity

      Further Reading

      Exercises

    11. Using a Database to Compute Park Factors

      Introduction

      Installing MySQL and Creating a Database

      Connecting R to MySQL

      Connecting using RMySQL

      Connecting R to other SQL backends

      Filling a MySQL Game Log Database From

      From Retrosheet to R

      From R to MySQL

      Querying Data From R

      Introduction

      Coors Field and run scoring

      Building Your Own Baseball Database

      Lahman's database

      Retrosheet database

      PITCHf/x database

      Statcast database

      Calculating Basic Park Factors

      Loading the data into R

      Home run park factor

      Assumptions of the proposed approach

      Applying park factors

      Further Reading

      Exercises

    12. Batted Ball Data from Statcast

      Introduction

      Spray Charts

      Acquiring a year's worth of Statcast data

      Hitters' spray tendencies and infield defense

      Launch Angles and Exit Velocities

      Scatterplot of launch angle vs exit velocity

      Modeling Home Run Probabilities

      Generalized additive model

      Smooth predictions

      Using this model to estimate home run production

      Are Launch Angles Skills?

      Distribution of launch angle

      Is launch angle a skill?

      Further Reading

      Exercises

    Appendix A Retrosheet Files Reference

    Appendix B Accessing and Using MLBAM Gameday and PITCHf/x Data

    Appendix C Accessing and Using Statcast Data from Baseball-Savant

     

    "Overall, the book meets its main aim of teaching the reader to analyze real data using R. It is well suited to baseball fans, who have a solid statistical background, and want to learn R or modernize their style of R programming. Baseball fans with a more basic statistical education will also learn from this book . . ."
    ~Tim Downie, Journal of Statistical Software