Analyzing Baseball Data with R, Second Edition

2nd Edition

By Max Marchi, Jim Albert, Benjamin S. Baumer

Chapman and Hall/CRC

342 pages

Purchasing Options ($ = USD):
Paperback: ISBN 9780815353515, published 2018-12-03, $49.56 (list $61.95, save ~$12.39)
Hardback: ISBN 9780367024864, published 2018-11-22, $103.96 (list $129.95, save ~$25.99)
eBook (VitalSource): ISBN 9781351107099, published 2018-11-19, from $30.98

Description

Analyzing Baseball Data with R, Second Edition introduces R to sabermetricians, baseball enthusiasts, and students interested in exploring the richness of baseball data. It equips you with the skills and software tools needed for every step of an analysis: importing the data, transforming them into an appropriate format, visualizing them with graphs, and performing a statistical analysis.

The authors first present an overview of publicly available baseball datasets and a gentle introduction to the types of data structures and the exploratory and data management capabilities of R. They also cover the ggplot2 graphics functions and employ a tidyverse-friendly workflow throughout. Much of the book illustrates the use of R through popular sabermetrics topics, including the Pythagorean formula, run expectancy, catcher framing, career trajectories, simulation of games and seasons, patterns of streaky behavior of players, and launch angles and exit velocities. All the datasets and R code used in the text are available online.

New to the second edition are a systematic adoption of the tidyverse and incorporation of Statcast player tracking data (made available by Baseball Savant). All code from the first edition has been revised according to the principles of the tidyverse. Tidyverse packages, including dplyr, ggplot2, tidyr, purrr, and broom are emphasized throughout the book. Two entirely new chapters are made possible by the availability of Statcast data: one explores the notion of catcher framing ability, and the other uses launch angle and exit velocity to estimate the probability of a home run. Through the book’s various examples, you will learn about modern sabermetrics and how to conduct your own baseball analyses.

Max Marchi is a Baseball Analytics Analyst for the Cleveland Indians. He was a regular contributor to The Hardball Times and Baseball Prospectus websites and previously consulted for other MLB clubs.

Jim Albert is a Distinguished University Professor of statistics at Bowling Green State University. He has authored or coauthored several books including Curve Ball and Visualizing Baseball and was the editor of the Journal of Quantitative Analysis of Sports.

Ben Baumer is an assistant professor of statistical & data sciences at Smith College. Previously a statistical analyst for the New York Mets, he is a co-author of The Sabermetric Revolution and Modern Data Science with R.

Table of Contents

  1. The Baseball Datasets

    Introduction

    The Lahman Database: Season-by-Season Data

    Bonds, Aaron, Ruth, and Rodriguez home run trajectories

    Obtaining the database

    The Master table

    The Batting table

    The Pitching table

    The Fielding table

    The Teams table

    Baseball questions

    Retrosheet Game-by-Game Data

    The McGwire and Sosa home run race

    Retrosheet

    Game logs

    Obtaining the game logs from Retrosheet

    Game log example

    Baseball questions

    Retrosheet Play-by-Play Data

    Event files

    Event example

    Baseball questions

    Pitch-by-Pitch Data

    MLBAM Gameday and PITCHf/x

    PITCHf/x Example

    Baseball questions

    Player Movement and Off-the-Bat Data

    Statcast

    Baseball Savant data

    Baseball questions

    Summary

    Further Reading

    Exercises

  2. Introduction to R

    Introduction

    Installing R and RStudio

    The Tidyverse

    dplyr

    The pipe

    ggplot

    Other packages

    Data Frames

    Career of Warren Spahn

    Introduction

    Manipulations with data frames

    Merging and selecting from data frames

    Vectors

    Defining and computing with vectors

    Vector functions

    Vector index and logical variables

    Objects and Containers in R

    Character data and data frames

    Factors

    Lists

    Collection of R Commands

    R scripts

    R functions

    Reading and Writing Data in R

    Importing data from a file

    Saving datasets

    Packages

    Splitting, Applying and Combining Data

    Iterating using map()

    Another example

    Getting Help

    Further Reading

    Exercises

  3. Graphics

    Introduction

    Character Variable

    A bar graph

    Add axes labels and a title

    Other graphs of a character variable

    Saving Graphs

    Numeric Variable: One-Dimensional Scatterplot and Histogram

    Two Numeric Variables

    Scatterplot

    Building a graph, step-by-step

    A Numeric Variable and a Factor Variable

    Parallel stripcharts

    Parallel boxplots

    Comparing Ruth, Aaron, Bonds and A-Rod

    Getting the data

    Creating the player data frames

    Constructing the graph

    The Home Run Race

    Getting the data

    Extracting the variables

    Constructing the graph

    Further Reading

    Exercises

  4. The Relation Between Runs and Wins

    Introduction

    The Teams Table in the Lahman Database

    Linear Regression

    The Pythagorean Formula for Winning Percentage

    The Exponent in the Pythagorean model

    Good and Bad Predictions by the Pythagorean model

    How Many Runs for a Win?

    Further Reading

    Exercises

  5. Value of Plays Using Run Expectancy

    The Run Expectancy Matrix

    Runs Scored in the Remainder of the Inning

    Creating the Matrix

    Measuring Success of a Batting Play

    Jose Altuve

    Opportunity and Success for all Hitters

    Position in the Batting Lineup

    Run Values of Different Base Hits

    Value of a home run

    Value of a single

    Value of Base Stealing

    Further Reading

    Exercises

  6. Balls and Strikes Effects

    Introduction

    Hitter’s Counts and Pitcher’s Counts

    An example for a single pitcher

    Pitch sequences from Retrosheet

    Functions for string manipulation

    Finding plate appearances going through a given count

    Expected run value by count

    The importance of the previous count

    Behavior by Count

    Swinging tendencies by count

    Propensity to swing by location

    Effect of the ball/strike count

    Pitch selection by count

    Umpires' behavior by count

    Further Reading

    Exercises

  7. Catcher Framing

    Introduction

    Acquiring Pitch-Level Data

    Where is the Strike Zone?

    Modeling Called Strike Percentage

    Visualizing the estimates

    Visualizing the estimated surface

    Controlling for handedness

    Modeling Catcher Framing

    Further Reading

    Exercises

  8. Career Trajectories

    Introduction

    Mickey Mantle’s Batting Trajectory

    Comparing Trajectories

    Some preliminary work

    Computing career statistics

    Computing similarity scores

    Defining age, OBP, SLG, and OPS variables

    Fitting and plotting trajectories

    General Patterns of Peak Ages

    Computing all fitted trajectories

    Patterns of peak age over time

    Peak age and career at-bats

    Trajectories and Fielding Position

    Further Reading

    Exercises

  9. Simulation

    Introduction

    Simulating a Half Inning

    Markov chains

    Review of work in run expectancy

    Computing the transition probabilities

    Simulating the Markov chain

    Beyond run expectancy

    Transition probabilities for individual teams

    Simulating a Baseball Season

    The Bradley-Terry model

    Making up a schedule

    Simulating talents and computing win probabilities

    Simulating the regular season

    Simulating the post-season

    Function to simulate one season

    Simulating many seasons

    Further Reading

    Exercises

  10. Exploring Streaky Performances

    Introduction

    The Great Streak

    Finding game hitting streaks

    Moving batting averages

    Streaks in Individual At-Bats

    Streaks of hits and outs

    Moving batting averages

    Finding hitting slumps for all players

    Were Ichiro Suzuki and Mike Trout unusually streaky?

    Local Patterns of Statcast Launch Velocity

    Further Reading

    Exercises

  11. Using a Database to Compute Park Factors

    Introduction

    Installing MySQL and Creating a Database

    Connecting R to MySQL

    Connecting using RMySQL

    Connecting R to other SQL backends

    Filling a MySQL Game Log Database From R

    From Retrosheet to R

    From R to MySQL

    Querying Data From R

    Introduction

    Coors Field and run scoring

    Building Your Own Baseball Database

    Lahman's database

    Retrosheet database

    PITCHf/x database

    Statcast database

    Calculating Basic Park Factors

    Loading the data into R

    Home run park factor

    Assumptions of the proposed approach

    Applying park factors

    Further Reading

    Exercises

  12. Batted Ball Data from Statcast

    Introduction

    Spray Charts

    Acquiring a year's worth of Statcast data

    Hitters' spray tendencies and infield defense

    Launch Angles and Exit Velocities

    Scatterplot of launch angle vs exit velocity

    Modeling Home Run Probabilities

    Generalized additive model

    Smooth predictions

    Using this model to estimate home run production

    Are Launch Angles Skills?

    Distribution of launch angle

    Is launch angle a skill?

    Further Reading

    Exercises

Appendix A Retrosheet Files Reference

Appendix B Accessing and Using MLBAM Gameday and PITCHf/x Data

Appendix C Accessing and Using Statcast Data from Baseball-Savant

About the Series

Chapman & Hall/CRC The R Series

Subject Categories

BISAC Subject Codes/Headings:
MAT029000
MATHEMATICS / Probability & Statistics / General