Modern Data Science with R is a comprehensive data science textbook for undergraduates that incorporates statistical and computational thinking to solve real-world problems with data. Rather than focus exclusively on case studies or programming syntax, this book illustrates how statistical programming in the state-of-the-art R/RStudio computing environment can be leveraged to extract meaningful information from a variety of data in the service of addressing compelling statistical questions.
Contemporary data science requires a tight integration of knowledge from statistics, computer science, mathematics, and a domain of application. This book will help readers with some background in statistics and modest prior experience with coding develop and practice the appropriate skills to tackle complex data science projects. The book features a number of exercises and has a flexible organization conducive to teaching a variety of semester courses.
Table of Contents
Introduction to Data Science
Prologue: Why data science?
A grammar for graphics
Tidy data and iteration
Statistics and Modeling
Statistical learning and predictive analytics
Topics in Data Science
Interactive data graphics
Database querying using SQL
Working with spatial data
Text as data
Epilogue: Towards "big data"
Packages used in this book
Introduction to R and RStudio
Reproducible analysis and workflow
Setting up a database server
Benjamin S. Baumer is an assistant professor in the Statistical & Data Sciences program at Smith College. He has been a practicing data scientist since 2004, when he became the first full-time statistical analyst for the New York Mets. Ben is a co-author of The Sabermetric Revolution and won the 2016 Contemporary Baseball Analysis Award from the Society for American Baseball Research.
Daniel T. Kaplan is the DeWitt Wallace professor of mathematics and computer science at Macalester College. He is the author of several textbooks on statistical modeling and statistical computing, and received the 2006 Macalester Excellence in Teaching award.
Nicholas J. Horton is a professor of statistics at Amherst College. He is a Fellow of the American Statistical Association (ASA), member of the NRC Committee on Applied and Theoretical Statistics, recipient of a number of national teaching awards, author of a series of books on statistical computing, and actively involved in curricular reform to help students "think with data."
"Modern Data Science with R is one of the first textbooks to provide a comprehensive introduction to data science for students at the undergraduate level (it is also suitable for graduate students and professionals in other fields). The authors follow the approach taken by Garrett Grolemund and Hadley Wickham in their book, R for Data Science, and David Robinson in Teach the Tidyverse to Beginners, which emphasizes the teaching of data visualization and the tidyverse (using dplyr and chained pipes) before covering base R, along with using real-world data and modern data science methods. The textbook includes end of chapter exercises (an instructor’s solution manual is available), and a series of lab activities is also under development. The result is an excellent textbook that provides a solid foundation in data science for students and professionals alike... Modern Data Science with R is a breakthrough textbook." ~ ACM SIGACT News
"Only about 60 of the book’s 551 pages address the questions of uncertainty and inference that constitute the core of the statistics tradition. The remaining pages attend the other components of working with data—the import, wrangling, tidying, visualization, and storage—that are often the more prominent barriers to understanding modern datasets...Modern Data Science with R is a landmark: the first full textbook in data science. (It can serve) as the backbone of a semester-long course targeted at students with little background in statistics or computing. It is rich with examples and is guided by a strong narrative voice. What’s more, it presents an organizing framework that makes a convincing argument that data science is a course distinct from applied statistics…By using the tidyverse, the textbook authors are able to seamlessly interweave a conceptual framework for data science with the corresponding implementation in R code….Even though this book is heavily dependent on R, readers come away with a more general natural language with which to talk and think about data. Indeed, if R were to cease to exist tomorrow, these readers would still be well-situated to be data scientists. In a nutshell, that approach is what makes this such a successful textbook." ~The American Statistician
"Baumer, Kaplan, and Horton have managed to write a book that will serve a huge variety of educators while being endlessly interesting and useful to students of a modern era. Modern Data Science in R is a compilation of ideas from both ends of the data science and statistics spectrum—tools for setting up databases and working with regular expressions are intermixed with fundamentals like regression analysis. Additionally, the authors pull together fantastic examples from the scientific community as well as the media at large. Their examples will engage today's students into understanding why data wrangling, reproducibility, and ethics are a fundamental part of any data analysis.
Good visualization skills (Tukey) and ethical analyses (Huff, "How to Lie with Statistics") are not new ideas. However, they have recently been lost in the drive for more sophisticated mathematical and computational methods for working with data. Baumer et al. modernize the need for good visualization and communication in ways that will resonate with today's practitioners. Like Wickham's "ggplot2" and "The Elements of Statistical Learning" by Hastie et al., "Modern Data Science with R" promises to be a staple on every data analyst's bookshelf. Accessible to students and a valuable resource for those who have been in the field for many years, this book promises to be a treasure you will want to discover." ~ Jo Hardin, Pomona College
"This book would be an excellent text book for an introductory data science course. Many academic institutions are now trying to open data science programs. But, there is not a good text book available for data science courses." ~ Mahbubul Majumder, U. of Nebraska Omaha
"The book is unique. It is an encyclopedia of Data Science, and it covers a wide variety of modern topics; another positive aspect is that it contains lots of examples and code, and the layout is quite catchy. One can learn (and teach) subjects as diverse as: How to give talks, administrating databases, how to model spatial data, and even ethics---all in one book." ~ Miguel de Carvalho, The University of Edinburgh
"It would undoubtedly be useful to many postgraduate students of applied statistics. The handbook style will also be of use to statisticians who want to keep up to date in this area. In particular the book utilizes functions from many different R packages, and will be helpful for data analysts to keep their R skills up to date. Although one of the appendices covers an introduction to R (R Core Team 2017) and RStudio (RStudio Team 2017), realistically it is expected that the reader has some experience with R. Existing R users with no experience of RStudio might find the appendix useful, but RStudio is not required to work through this book. Overall the book is well written, well structured and the general writing style is both objective and entertaining . . . The book is divided into three major parts, Introduction to Data Science, Statistics and Modeling, and Topics in Data Science, followed by six appendices . . . In conclusion, I recommend this book as a course companion to a master’s level course in data analysis and to statisticians who want to keep their skills in the field of data science up to date." ~ Tim Downie, Journal of Statistical Software
"Modern Data Science with R is different . . .as it presents an abundance of R codes, functions and packages clearly with several useful examples. For people with a statistical background, the book covers computational topics like simulation and also includes appropriate computer science topics such as Data Wrangling, Database Querying using SQL and Text as Data. The book is well-structured and is presented in an easy-to-understand manner, making it suitable for a wide range of readers. . . This book is unique because it incorporates theoretical fundamentals such as statistical learning and regression modelling with the modern, practical elements of data science, including setting up databases and debugging . . . This book is a valuable resource to all those studying and interested in data science."
~Shuangzhe Liu, University of Canberra