2nd Edition

Modern Data Science with R



  • Available for pre-order. Item will ship after March 12, 2021
ISBN 9780367191498
March 12, 2021 Forthcoming by Chapman and Hall/CRC
673 Pages

USD $99.95




Book Description

From a review of the first edition: "Modern Data Science with R… is rich with examples and is guided by a strong narrative voice. What’s more, it presents an organizing framework that makes a convincing argument that data science is a course distinct from applied statistics" (The American Statistician).

Modern Data Science with R is a comprehensive data science textbook for undergraduates that incorporates statistical and computational thinking to solve real-world data problems. Rather than focus exclusively on case studies or programming syntax, this book illustrates how statistical programming in the state-of-the-art R/RStudio computing environment can be leveraged to extract meaningful information from a variety of data in the service of addressing compelling questions.

The second edition is updated to reflect the growing influence of the tidyverse set of packages. All code in the book has been revised and styled to be more readable and easier to understand. New functionality from packages like sf, purrr, tidymodels, and tidytext is now integrated into the text. All chapters have been revised, and several have been split, re-organized, or re-imagined to meet the shifting landscape of best practice.

Table of Contents

Preface

Background and motivation

Intended audience

Key features of this book

Changes in the second edition

Key role of technology

How to use this book

Acknowledgments

Part I: Introduction to Data Science

1.  Prologue: Why data science?

What is data science?

Case study: The evolution of sabermetrics

Datasets

Further resources

2.  Data visualization

The federal election cycle

Composing data graphics

Importance of data graphics: Challenger

Creating effective presentations

The wider world of data visualization

Further resources

Exercises

Supplementary exercises

3.  A grammar for graphics

A grammar for data graphics

Canonical data graphics in R

Extended example: Historical baby names

Further resources

Exercises

Supplementary exercises

4.  Data wrangling on one table

A grammar for data wrangling

Extended example: Ben’s time with the Mets

Further resources

Exercises

Supplementary exercises

5.  Data wrangling on multiple tables

inner_join()

left_join()

Extended example: Manny Ramirez

Further resources

Exercises

Supplementary exercises

6.  Tidy data

Tidy data

Reshaping data

Naming conventions

Data intake

Further resources

Exercises

Supplementary exercises

7.  Iteration

Vectorized operations

Using across() with dplyr functions

The map() family of functions

Iterating over a one-dimensional vector

Iteration over subgroups

Simulation

Extended example: Factors associated with BMI

Further resources

Exercises

Supplementary exercises

8.  Data Science Ethics

Introduction

Truthful falsehoods

Role of data science in society

Some settings for professional ethics

Some principles to guide ethical action

Algorithmic bias

Data and disclosure

Reproducibility

Ethics, collectively

Professional guidelines for ethical conduct

Further resources

Exercises

Supplementary exercises

Part II: Statistics and Modeling

9.  Statistical foundations

Samples and populations

Sample statistics

The bootstrap

Outliers

Statistical models: Explaining variation

Confounding and accounting for other factors

The perils of p-values

Further resources

Exercises

Supplementary exercises

10. Predictive modeling

Predictive modeling

Simple classification models

Evaluating models

Extended example: Who has diabetes?

Further resources

Exercises

Supplementary exercises

11. Supervised learning

Non-regression classifiers

Parameter tuning

Example: Evaluation of income models redux

Extended example: Who has diabetes this time?

Regularization

Further resources

Exercises

Supplementary exercises

12. Unsupervised learning

Clustering

Dimension reduction

Further resources

Exercises

Supplementary exercises

13. Simulation

Reasoning in reverse

Extended example: Grouping cancers

Randomizing functions

Simulating variability

Random networks

Key principles of simulation

Further resources

Exercises

Supplementary exercises

Part III: Topics in Data Science

14. Dynamic and customized data graphics

Rich Web content using D3.js and htmlwidgets

Animation

Flexdashboard

Interactive Web apps with Shiny

Customization of ggplot2 graphics

Extended example: Hot dog eating

Further resources

Exercises

Supplementary exercises

15. Database querying using SQL

From dplyr to SQL

Flat-file databases

The SQL universe

The SQL data manipulation language

Extended example: FiveThirtyEight flights

SQL vs R

Further resources

Exercises

Supplementary exercises

16. Database administration

Constructing efficient SQL databases

Changing SQL data

Extended example: Building a database

Scalability

Further resources

Exercises

Supplementary exercises

17. Working with geospatial data

Motivation: What’s so great about geospatial data?

Spatial data structures

Making maps

Extended example: Congressional districts

Effective maps: How (not) to lie

Projecting polygons

Playing well with others

Further resources

Exercises

Supplementary exercises

18. Geospatial computations

Geospatial operations

Geospatial aggregation

Geospatial joins

Extended example: Trail elevations at MacLeish

Further resources

Exercises

Supplementary exercises

19. Text as data

Regular expressions using Macbeth

Extended example: Analyzing textual data from arXiv.org

Ingesting text

Further resources

Exercises

Supplementary exercises

20. Network science

Introduction to network science

Extended example: Six degrees of Kristen Stewart

PageRank

Extended example: men’s college basketball

Further resources

Exercises

Supplementary exercises

21. Epilogue: Towards "big data"

Notions of big data

Tools for bigger data

Alternatives to R

Closing thoughts

Further resources

Part IV: Appendices

A Packages used in this book

The mdsr package

Other packages

Further resources

B Introduction to R and RStudio

Installation

Learning R

Fundamental structures and objects

Add-ons: Packages

Further resources

Exercises

Supplementary exercises

C Algorithmic thinking

Introduction

Simple example

Extended example: Law of large numbers

Non-standard evaluation

Debugging and defensive coding

Further resources

Exercises

Supplementary exercises

D Reproducible analysis and workflow

Scriptable statistical computing

Reproducible analysis with R Markdown

Projects and version control

Further resources

Exercises

Supplementary exercises

E Regression modeling

Multiple regression

Inference for regression

Assumptions underlying regression

Logistic regression

Further resources

Exercises

Supplementary exercises

F Setting up a database server

SQLite

MySQL

PostgreSQL

Connecting to SQL

...

Author(s)

Biography

Benjamin S. Baumer is an associate professor in the Statistical & Data Sciences program at Smith College. He has been a practicing data scientist since 2004, when he became the first full-time statistical analyst for the New York Mets. Ben is a co-author of The Sabermetric Revolution and Analyzing Baseball Data with R. He received the 2019 Waller Education Award and the 2016 Significant Contributor Award from the Society for American Baseball Research.

Daniel T. Kaplan is the DeWitt Wallace emeritus professor of mathematics and computer science at Macalester College. He is the author of several textbooks on statistical modeling and statistical computing. Danny received the 2006 Macalester Excellence in Teaching award and the 2017 CAUSE Lifetime Achievement Award.

Nicholas J. Horton is Beitzel Professor of Technology and Society (Statistics and Data Science) at Amherst College. He is a Fellow of the ASA and the AAAS, co-chair of the National Academies Committee on Applied and Theoretical Statistics, recipient of a number of national teaching awards, author of a series of books on statistical computing, and actively involved in data science curriculum efforts to help students "think with data".

Reviews

"This text continues to be fantastic! There are a number of courses for which I would require this book and others that I would recommend it as a supplement. I would likely require it for courses focused on computing in R or courses in data science. I would include it as a recommended text in introductory and other statistics courses that used R as the software of choice, where this text could be used as a supplemental resource in how to use R to work with data." (Hunter Glanz, Cal Poly San Luis Obispo)

"Easy for students to read and relate to the exercises and examples. Many questions and hands-on activities with data sets to practice skills." (Lynn Collen, St. Cloud State Univ.)

"I used the first edition of this book as the primary text for an intermediate data science course a few years ago and I liked it very much…I think that the technical breadth, writing style, and level of difficulty are very clear strengths. Also, my students and I found the `tidyverse` approach to be particularly well-suited for teaching and learning R…and I love that the MDSR book includes such complete code. Students can program everything they see in the book, and often times there are tips & tricks for them to discover along the way just by studying expert code provided by the authors. This really set MDSR apart from other books I considered for the course." (Matthew Beckman, Penn State University)