650 Pages
    by Chapman & Hall

    From a review of the first edition: "Modern Data Science with R… is rich with examples and is guided by a strong narrative voice. What’s more, it presents an organizing framework that makes a convincing argument that data science is a course distinct from applied statistics" (The American Statistician).

    Modern Data Science with R is a comprehensive data science textbook for undergraduates that incorporates statistical and computational thinking to solve real-world data problems. Rather than focus exclusively on case studies or programming syntax, this book illustrates how statistical programming in the state-of-the-art R/RStudio computing environment can be leveraged to extract meaningful information from a variety of data in the service of addressing compelling questions.

    The second edition is updated to reflect the growing influence of the tidyverse set of packages. All code in the book has been revised and styled to be more readable and easier to understand. New functionality from packages like sf, purrr, tidymodels, and tidytext is now integrated into the text. All chapters have been revised, and several have been split, re-organized, or re-imagined to meet the shifting landscape of best practice.

    Preface

    Background and motivation

    Intended audience

    Key features of this book

    Changes in the second edition

    Key role of technology

    How to use this book

    Acknowledgments

    Part I: Introduction to Data Science

    1.  Prologue: Why data science?

    What is data science?

    Case study: The evolution of sabermetrics

    Datasets

    Further resources

    2.  Data visualization

    The federal election cycle

    Composing data graphics

    Importance of data graphics: Challenger

    Creating effective presentations

    The wider world of data visualization

    Further resources

    Exercises

    Supplementary exercises

    3.  A grammar for graphics

    A grammar for data graphics

    Canonical data graphics in R

    Extended example: Historical baby names

    Further resources

    Exercises

    Supplementary exercises

    4.  Data wrangling on one table

    A grammar for data wrangling

    Extended example: Ben’s time with the Mets

    Further resources

    Exercises

    Supplementary exercises

    5.  Data wrangling on multiple tables

    inner_join()

    left_join()

    Extended example: Manny Ramirez

    Further resources

    Exercises

    Supplementary exercises

    6.  Tidy data

    Tidy data

    Reshaping data

    Naming conventions

    Data intake

    Further resources

    Exercises

    Supplementary exercises

    7.  Iteration

    Vectorized operations

    Using across() with dplyr functions

    The map() family of functions

    Iterating over a one-dimensional vector

    Iteration over subgroups

    Simulation

    Extended example: Factors associated with BMI

    Further resources

    Exercises

    Supplementary exercises

    8.  Data science ethics

    Introduction

    Truthful falsehoods

    Role of data science in society

    Some settings for professional ethics

    Some principles to guide ethical action

    Algorithmic bias

    Data and disclosure

    Reproducibility

    Ethics, collectively

    Professional guidelines for ethical conduct

    Further resources

    Exercises

    Supplementary exercises

    Part II: Statistics and Modeling

    9.  Statistical foundations

    Samples and populations

    Sample statistics

    The bootstrap

    Outliers

    Statistical models: Explaining variation

    Confounding and accounting for other factors

    The perils of p-values

    Further resources

    Exercises

    Supplementary exercises

    10. Predictive modeling

    Predictive modeling

    Simple classification models

    Evaluating models

    Extended example: Who has diabetes?

    Further resources

    Exercises

    Supplementary exercises

    11. Supervised learning

    Non-regression classifiers

    Parameter tuning

    Example: Evaluation of income models redux

    Extended example: Who has diabetes this time?

    Regularization

    Further resources

    Exercises

    Supplementary exercises

    12. Unsupervised learning

    Clustering

    Dimension reduction

    Further resources

    Exercises

    Supplementary exercises

    13. Simulation

    Reasoning in reverse

    Extended example: Grouping cancers

    Randomizing functions

    Simulating variability

    Random networks

    Key principles of simulation

    Further resources

    Exercises

    Supplementary exercises

    Part III: Topics in Data Science

    14. Dynamic and customized data graphics

    Rich Web content using D3.js and htmlwidgets

    Animation

    Flexdashboard

    Interactive Web apps with Shiny

    Customization of ggplot2 graphics

    Extended example: Hot dog eating

    Further resources

    Exercises

    Supplementary exercises

    15. Database querying using SQL

    From dplyr to SQL

    Flat-file databases

    The SQL universe

    The SQL data manipulation language

    Extended example: FiveThirtyEight flights

    SQL vs R

    Further resources

    Exercises

    Supplementary exercises

    16. Database administration

    Constructing efficient SQL databases

    Changing SQL data

    Extended example: Building a database

    Scalability

    Further resources

    Exercises

    Supplementary exercises

    17. Working with geospatial data

    Motivation: What’s so great about geospatial data?

    Spatial data structures

    Making maps

    Extended example: Congressional districts

    Effective maps: How (not) to lie

    Projecting polygons

    Playing well with others

    Further resources

    Exercises

    Supplementary exercises

    18. Geospatial computations

    Geospatial operations

    Geospatial aggregation

    Geospatial joins

    Extended example: Trail elevations at MacLeish

    Further resources

    Exercises

    Supplementary exercises

    19. Text as data

    Regular expressions using Macbeth

    Extended example: Analyzing textual data from arXiv.org

    Ingesting text

    Further resources

    Exercises

    Supplementary exercises

    20. Network science

    Introduction to network science

    Extended example: Six degrees of Kristen Stewart

    PageRank

    Extended example: 1996 men’s college basketball

    Further resources

    Exercises

    Supplementary exercises

    21. Epilogue: Towards "big data"

    Notions of big data

    Tools for bigger data

    Alternatives to R

    Closing thoughts

    Further resources

    Part IV: Appendices

    A Packages used in this book

    The mdsr package

    Other packages

    Further resources

    B Introduction to R and RStudio

    Installation

    Learning R

    Fundamental structures and objects

    Add-ons: Packages

    Further resources

    Exercises

    Supplementary exercises

    C Algorithmic thinking

    Introduction

    Simple example

    Extended example: Law of large numbers

    Non-standard evaluation

    Debugging and defensive coding

    Further resources

    Exercises

    Supplementary exercises

    D Reproducible analysis and workflow

    Scriptable statistical computing

    Reproducible analysis with R Markdown

    Projects and version control

    Further resources

    Exercises

    Supplementary exercises

    E Regression modeling

    Multiple regression

    Inference for regression

    Assumptions underlying regression

    Logistic regression

    Further resources

    Exercises

    Supplementary exercises

    F Setting up a database server

    SQLite

    MySQL

    PostgreSQL

    Connecting to SQL

    Biography

    Benjamin S. Baumer is an associate professor in the Statistical & Data Sciences program at Smith College. He has been a practicing data scientist since 2004, when he became the first full-time statistical analyst for the New York Mets. Ben is a co-author of The Sabermetric Revolution and Analyzing Baseball Data with R. He received the 2019 Waller Education Award and the 2016 Significant Contributor Award from the Society for American Baseball Research.

    Daniel T. Kaplan is the DeWitt Wallace emeritus professor of mathematics and computer science at Macalester College. He is the author of several textbooks on statistical modeling and statistical computing. Danny received the 2006 Macalester Excellence in Teaching award and the 2017 CAUSE Lifetime Achievement Award.

    Nicholas J. Horton is Beitzel Professor of Technology and Society (Statistics and Data Science) at Amherst College. He is a Fellow of the ASA and the AAAS, co-chair of the National Academies Committee on Applied and Theoretical Statistics, recipient of a number of national teaching awards, author of a series of books on statistical computing, and actively involved in data science curriculum efforts to help students "think with data".

    "[...] To answer a wide range of modern research questions, this book by Baumer, Kaplan, and Horton features an excellent introduction to data wrangling, visualization, statistical modeling, machine learning, and other advanced statistical applications through the RStudio environment following the tidyverse syntax. [...] Overall, Modern Data Science with R, 2nd edition serves as an excellent introductory resource to help develop techniques to extract, transform, visualize, and learn from datasets through the R environment. It focuses on implementing those techniques in R and does not provide a theoretical background for the discussed methods. The book will be a perfect reference for a broad audience ranging from undergraduates in data science courses to advanced graduate students and professionals from a variety of research fields."
    -Kohma Arai and Vyacheslav Lyubchich, in Technometrics, July 2022

    "Overall, I enjoyed reading this book. The authors were very good at creating a complete tool for studying data science. Therefore, I recommend this book, for its content, writing, and organization, to graduate students in data science and statistics. I also recommend the book to professionals who should prepare themselves for the challenges they are going to face in the future with the voluminous and heterogenous amount of data that should be timely analyzed to extract meaningful information to guide action."
    -Georgios Nikolopoulos, in ISCB News, June 2022

    "The authors have successfully completed the job of choosing the content with relevant topics and, deciding the extent of knowledge to be delivered, and finally, putting them in an understandable sequence. This is a well-written book and does not cover much theory. ... The book’s second edition contents are updated, expanded, revised, split, rewritten and rearranged compared to the first edition. The key changes are the use of recently developed R packages, ... (and) updated exercises in the chapters ..."
    -Shalabh, in Journal of the Royal Statistical Society Series A, August 2021

    "[This book] provides an excellent basis for statisticians who want to dig deeper into, for example, data handling, for computer scientists who aim to strengthen their knowledge of statistical methods as well as for all other researchers who are interested in data science in general. ... Each section is structured as an interplay between R-code and explanatory text for understanding. The division into several stand-alone segments is an advantage, because the reader may easily choose the section she or he is interested in without missing relevant information. A key feature of the book is its focus on different example data sets that are available via R-packages or from URLs that are embedded in the text. These data sets are used to illustrate the methodology presented using R-code. Their availability allows the reader to reproduce the code while working with the book. ... It can be warmly recommended to practical researchers who seek a comprehensive overview of different topics in data science with focus on implementations in R."
    -Annika Hoyer, in Biometrical Journal, August 2021

    "This text continues to be fantastic! There are a number of courses for which I would require this book and others that I would recommend it as a supplement. I would likely require it for courses focused on computing in R or courses in data science. I would include it as a recommended text in introductory and other statistics courses that used R as the software of choice, where this text could be used as a supplemental resource in how to use R to work with data."
    -Hunter Glanz, Cal Poly San Luis Obispo

    "Easy for students to read and relate to the exercises and examples. Many questions and hands-on activities with data sets to practice skills."
    -Lynn Collen, St. Cloud State University

    "I used the first edition of this book as the primary text for an intermediate data science course a few years ago and I liked it very much…I think that the technical breadth, writing style, and level of difficulty are very clear strengths. Also, my students and I found the `tidyverse` approach to be particularly well-suited for teaching and learning R…and I love that the MDSR book includes such complete code. Students can program everything they see in the book, and often times there are tips & tricks for them to discover along the way just by studying expert code provided by the authors. This really sets MDSR apart from other books I considered for the course."
    -Matthew Beckman, Penn State University

    "The authors have covered almost all aspects of data science, a revolutionary field that marries elements of computational thinking and traditional statistical theory. The book can thus equip the readers with the necessary knowledge and skills to extract data from a variety of sources, restructure observations in a form that allows analysis, store data in efficient databases, and work effectively on massive and complex data sets in order to produce actionable information."
    -Georgios Nikolopoulos, University of Cyprus, in ISCB Book Reviews, June 2022