1st Edition

Foundations of Statistics for Data Scientists With R and Python

By Alan Agresti, Maria Kateri
Copyright 2022
    486 Pages, 104 Color & 3 B/W Illustrations
    by Chapman & Hall

    Foundations of Statistics for Data Scientists: With R and Python is designed as a textbook for a one- or two-term introduction to mathematical statistics for students training to become data scientists. It is an in-depth presentation of the topics in statistical science with which any data scientist should be familiar, including probability distributions, descriptive and inferential statistical methods, and linear modeling. The book assumes knowledge of basic calculus, so the presentation can focus on "why it works" as well as "how to do it." Compared to traditional "mathematical statistics" textbooks, however, the book has less emphasis on probability theory and more emphasis on using software to implement statistical methods and to conduct simulations to illustrate key concepts. All statistical analyses in the book use R software, with an appendix showing the same analyses with Python.

    Key Features:

    • Shows the elements of statistical science that are important for students who plan to become data scientists.
    • Includes Bayesian and regularized fitting of models (e.g., showing an example using the lasso), classification and clustering, and implementing methods with modern software (R and Python).
    • Contains nearly 500 exercises.

    The book also introduces modern topics that do not normally appear in mathematical statistics texts but are highly relevant for data scientists, such as Bayesian inference, generalized linear models for non-normal responses (e.g., logistic regression and Poisson loglinear models), and regularized model fitting. The nearly 500 exercises are grouped into "Data Analysis and Applications" and "Methods and Concepts." Appendices introduce R and Python and contain solutions for odd-numbered exercises. The book's website (http://stat4ds.rwth-aachen.de/) has expanded R, Python, and Matlab appendices and all data sets from the examples and exercises.

    1. Introduction to Statistical Science

    1.1 Statistical science: Description and inference

    Design, descriptive statistics, and inferential statistics

    Populations and samples

    Parameters: Numerical summaries of the population

    Defining populations: actual and conceptual

    1.2 Types of data and variables

    Data files

    Example: The General Social Survey (GSS)

    Variables

    Quantitative variables and categorical variables

    Discrete variables and continuous variables

    Associations: response variables and explanatory variables

    1.3 Data collection and randomization

    Randomization

    Collecting data with a sample survey

    Collecting data with an experiment

    Collecting data with an observational study

    Establishing cause and effect: observational versus experimental studies

    1.4 Descriptive statistics: Summarizing data

    Example: Carbon dioxide emissions in European nations

    Frequency distribution and histogram graphic

    Describing the center of the data: mean and median

    Describing data variability: standard deviation and variance

    Describing position: percentiles, quantiles, and box plots

    1.5 Descriptive statistics: Summarizing multivariate data

    Bivariate quantitative data: The scatterplot, correlation, and regression

    Bivariate categorical data: Contingency tables

    Descriptive statistics for samples and for populations

    1.6 Chapter summary

    Exercises

    2. Probability Distributions

    2.1 Introduction to probability

    Probabilities and long-run relative frequencies

    Sample spaces and events

    Probability axioms and implied probability rules

    Example: Diagnostics for disease screening

    Bayes' theorem

    Multiplicative law of probability, and independent events

    2.2 Random variables and probability distributions

    Probability distributions for discrete random variables

    Example: Geometric probability distribution

    Probability distributions for continuous random variables

    Example: Uniform distribution

    Probability functions (pdf, pmf) and cumulative distribution function (cdf)

    Example: Exponential random variable

    Families of probability distributions indexed by parameters

    2.3 Expectations of random variables

    Expected value and variability of a discrete random variable

    Expected values for continuous random variables

    Example: Mean and variability for uniform random variable

    Higher moments: Skewness

    Expectations of linear functions of random variables

    Standardizing a random variable

    2.4 Discrete probability distributions

    Binomial distribution

    Example: Hispanic composition of jury list

    Mean, variability, and skewness of binomial distribution

    Example: Predicting results of a sample survey

    The sample proportion as a scaled binomial random variable

    Poisson distribution

    Poisson variability and overdispersion

    2.5 Continuous probability distributions

    The normal distribution

    The standard normal distribution

    Examples: Finding normal probabilities and percentiles

    The gamma distribution

    The exponential distribution and Poisson processes

    Quantiles of a probability distribution

    Using the uniform to randomly generate a continuous random variable

    2.6 Joint and conditional distributions and independence

    Joint and marginal probability distributions

    Example: Joint and marginal distributions of happiness and family income

    Conditional probability distributions

    Trials with multiple categories: the multinomial distribution

    Expectations of sums of random variables

    Independence of random variables

    Markov chain dependence and conditional independence

    2.7 Correlation between random variables

    Covariance and correlation

    Example: Correlation between income and happiness

    Independence implies zero correlation, but not converse

    Bivariate normal distribution *

    2.8 Chapter summary

    Exercises

    3. Sampling Distributions

    3.1 Sampling distributions: Probability distributions for statistics

    Example: Predicting an election result from an exit poll

    Sampling distribution: Variability of a statistic's value among samples

    Constructing a sampling distribution

    Example: Simulating to estimate mean restaurant sales

    3.2 Sampling distributions of sample means

    Mean and variance of sample mean of random variables

    Standard error of a statistic

    Example: Standard error of sample mean sales

    Example: Standard error of sample proportion in exit poll

    Law of large numbers: Sample mean converges to population mean

    Normal, binomial, and Poisson sums of random variables have the same distribution

    3.3 Central limit theorem: Normal sampling distribution for large samples

    Sampling distribution of sample mean is approximately normal

    Simulations illustrate normal sampling distribution in CLT

    Summary: Population, sample data, and sampling distributions

    3.4 Large-sample normal sampling distributions for many statistics *

    Delta method

    Delta method applied to root Poisson stabilizes the variance

    Simulating sampling distributions of other statistics

    The key role of sampling distributions in statistical inference

    3.5 Chapter summary

    Exercises

    4. Statistical Inference: Estimation

    4.1 Point estimates and confidence intervals

    Properties of estimators: Unbiasedness, consistency, efficiency

    Evaluating properties of estimators

    Interval estimation: Confidence intervals for parameters

    4.2 The likelihood function and maximum likelihood estimation

    The likelihood function

    Maximum likelihood method of estimation

    Properties of maximum likelihood estimators

    Example: Variance of ML estimator of binomial parameter

    Example: Variance of ML estimator of Poisson mean

    Sufficiency and invariance for ML estimators

    4.3 Constructing confidence intervals

    Using a pivotal quantity to induce a confidence interval

    A large-sample confidence interval for the mean

    Confidence intervals for proportions

    Example: Atheists and agnostics in Europe

    Using simulation to illustrate long-run performance of CIs

    Determining the sample size before collecting the data

    Example: Sample size for evaluating an advertising strategy

    4.4 Confidence intervals for means of normal populations

    The $t$ distribution

    Confidence interval for a mean using the $t$ distribution

    Example: Estimating mean weight change for anorexic girls

    Robustness for violations of normal population assumption

    Construction of $t$ distribution using chi-squared and standard normal

    Why does the pivotal quantity have the $t$ distribution?

    Cauchy distribution: t distribution with df=1 has unusual behavior

    4.5 Comparing two population means or proportions

    A model for comparing means: Normality with common variability

    A standard error and confidence interval for comparing means

    Example: Comparing a therapy to a control group

    Confidence interval comparing two proportions

    Example: Does prayer help coronary surgery patients?

    4.6 The bootstrap

    Computational resampling and bootstrap confidence intervals

    Example: Confidence intervals for library data

    4.7 The Bayesian approach to statistical inference

    Bayesian prior and posterior distributions

    Bayesian binomial inference: Beta prior distributions

    Example: Belief in hell

    Interpretation: Bayesian versus classical intervals

    Bayesian posterior interval comparing proportions

    Highest posterior density (HPD) posterior intervals

    4.8 Bayesian inference for means

    Bayesian inference for a normal mean

    Example: Bayesian analysis for anorexia therapy

    Bayesian inference for normal means with improper priors

    Predicting a future observation: Bayesian predictive distribution

    The Bayesian perspective, and empirical Bayes and hierarchical Bayes extensions

    4.9 Why maximum likelihood and Bayes estimators perform well *

    ML estimators have large-sample normal distributions

    Asymptotic efficiency of ML estimators same as best unbiased estimators

    Bayesian estimators also have good large-sample performance

    The likelihood principle

    4.10 Chapter summary

    Exercises

    5. Statistical Inference: Significance Testing

    5.1 The elements of a significance test

    Example: Testing for bias in selecting managers

    Assumptions, hypotheses, test statistic, P-value and conclusion

    5.2 Significance tests for proportions and means

    The elements of a significance test for a proportion

    Example: Climate change a major threat?

    One-sided significance tests

    The elements of a significance test for a mean

    Example: Significance test about political ideology

    5.3 Significance tests comparing means

    Significance tests for the difference between two means

    Example: Comparing a therapy to a control group

    Effect size for comparison of two means

    Bayesian inference for comparing means

    Example: Bayesian comparison of therapy and control groups

    5.4 Significance tests comparing proportions

    Significance test for the difference between two proportions

    Example: Comparing prayer and non-prayer surgery patients

    Bayesian inference for comparing two proportions

    Chi-squared tests for multiple proportions in contingency table

    Example: Happiness and marital status

    Standardized residuals: Describing the nature of an association

    5.5 Significance test decisions and errors

    The alpha-level: Making a decision based on the P-value

    Never "accept H_0" in a significance test

    Type I and Type II errors

    As P(Type I error) decreases, P(Type II error) increases

    Example: Testing whether astrology has some truth

    The power of a test

    Making decisions versus reporting the P-value

    5.6 Duality between significance tests and confidence intervals

    Connection between two-sided tests and confidence intervals

    Effect of sample size: Statistical versus practical significance

    Significance tests are less useful than confidence intervals

    Significance tests and P-values can be misleading

    5.7 Likelihood-ratio tests and confidence intervals *

    The likelihood-ratio and a chi-squared test statistic

    Likelihood-ratio test and confidence interval for a proportion

    Likelihood-ratio, Wald, score test triad

    5.8 Nonparametric tests

    A permutation test to compare two groups

    Example: Petting versus praise of dogs

    Wilcoxon test: Comparing mean ranks for two groups

    Comparing survival time distributions with censored data

    5.9 Chapter summary

    Exercises

    6. Linear Models and Least Squares

    6.1 The linear regression model and its least squares fit

    The linear model describes a conditional expectation

    Describing variation around the conditional expectation

    Least squares model fitting

    Example: Linear model for Scottish hill races

    The correlation

    Regression toward the mean in linear regression models

    Linear models and reality

    6.2 Multiple regression: Linear models with multiple explanatory variables

    Interpreting effects in multiple regression models

    Example: Multiple regression for Scottish hill races

    Association and causation

    Confounding, spuriousness, and conditional independence

    Example: Modeling the crime rate in Florida

    Equations for least squares estimates in multiple regression

    Interaction between explanatory variables in their effects

    Cook's distance: Checking for unusual observations

    6.3 Summarizing variability in linear regression models

    The error variance and chi-squared for linear models

    Decomposing variability into model explained and unexplained parts

    R-squared and the multiple correlation

    Example: R-squared for modeling Scottish hill races

    6.4 Statistical inference for normal linear models

    The F distribution: Testing that all effects equal 0

    Example: Linear model for mental impairment

    t tests and confidence intervals for individual effects

    Multicollinearity: Nearly redundant explanatory variables

    Confidence interval for E(Y) and prediction interval for Y

    The F test that all effects equal 0 is a likelihood-ratio test *

    6.5 Categorical explanatory variables in linear models

    Indicator variables for categories

    Example: Comparing mean incomes of racial-ethnic groups

    Analysis of variance (ANOVA): An F test comparing several means

    Multiple comparisons of means: Bonferroni and Tukey methods

    Models with both categorical and quantitative explanatory variables

    Comparing two nested normal linear models

    Interaction with categorical and quantitative explanatory variables

    6.6 Bayesian inference for normal linear models

    Prior and posterior distributions for normal linear models

    Example: Bayesian linear model for mental impairment

    Bayesian approach to the normal one-way layout

    6.7 Matrix formulation of linear models

    The model matrix

    Least squares estimates and standard errors

    The hat matrix and the leverage

    Alternatives to least squares: Robust regression and regularization

    Restricted optimality of least squares: Gauss--Markov theorem

    Matrix formulation of Bayesian normal linear model

    6.8 Chapter summary

    Exercises

    7. Generalized Linear Models

    7.1 Introduction to generalized linear models

    The three components of a generalized linear model

    GLMs for normal, binomial, and Poisson responses

    Example: GLMs for house selling prices

    The deviance

    Likelihood-ratio model comparison uses deviance difference

    Model selection: AIC and the bias/variance tradeoff

    Advantages of GLMs versus transforming the data

    Example: Normal and gamma GLMs for Covid-19 data

    7.2 Logistic regression for binary data

    Logistic regression: Model expressions

    Interpreting beta_j: effects on probabilities and odds

    Example: Dose-response study for flour beetles

    Grouped and ungrouped binary data: Effects on estimates and deviance

    Example: Modeling Italian employment with logit and identity links

    Complete separation and infinite logistic parameter estimates

    7.3 Bayesian inference for generalized linear models

    Normal prior distributions for GLM parameters

    Example: Bayesian logistic regression for endometrial cancer patients

    7.4 Poisson loglinear models for count data

    Poisson loglinear models

    Example: Modeling horseshoe crab satellite counts

    Modeling rates: Including an offset in the model

    Example: Lung cancer survival

    7.5 Negative binomial models for overdispersed count data *

    Increased variance due to heterogeneity

    Negative binomial: Gamma mixture of Poisson distributions

    Example: Negative binomial modeling of horseshoe crab data

    7.6 Iterative GLM model fitting *

    The Newton--Raphson method

    Newton--Raphson fitting of logistic regression model

    Covariance matrix of parameter estimates and Fisher scoring

    Likelihood equations and covariance matrix for Poisson GLMs

    7.7 Regularization with large numbers of parameters

    Penalized likelihood methods

    Penalized likelihood methods: The lasso

    Example: Predicting opinions with student survey data

    Why shrink ML estimates toward 0?

    Dimension reduction: Principal component analysis

    Bayesian inference with a large number of parameters

    Huge n: Handling big data

    7.8 Chapter summary

    Exercises

    8. Classification and Clustering

    8.1 Classification: Linear Discriminant Analysis and Graphical Trees

    Classification with Fisher's linear discriminant function

    Example: Predicting whether horseshoe crabs have satellites

    Summarizing predictive power: Classification tables and ROC curves

    Classification trees: Graphical prediction

    Logistic regression versus linear discriminant analysis and classification trees

    Other methods for classification: k-nearest neighbors and neural networks prediction

    8.2 Cluster Analysis

    Measuring dissimilarity between observations on binary responses

    Hierarchical clustering algorithm and its dendrogram

    Example: Clustering states on presidential election outcomes

    8.3 Chapter summary

    Exercises

    9. Statistical Science: A Historical Overview

    9.1 The evolution of statistical science

    Evolution of probability

    Evolution of descriptive and inferential statistics

    9.2 Pillars of statistical wisdom and practice

    Stigler's seven pillars of statistical wisdom

    Seven pillars of wisdom for practicing data science

    Appendix A: Using R in Statistical Science

    Appendix B: Using Python in Statistical Science

    Appendix C: Brief Solutions to Odd-Numbered Exercises

    Bibliography

    Example Index

    Subject Index

    Biography

    Alan Agresti, Distinguished Professor Emeritus at the University of Florida, is the author of seven books, including Categorical Data Analysis (Wiley) and Statistics: The Art and Science of Learning from Data (Pearson), and has presented short courses in 35 countries. His awards include an honorary doctorate from De Montfort University (UK) and Statistician of the Year from the American Statistical Association (Chicago chapter).

    Maria Kateri, Professor of Statistics and Data Science at the RWTH Aachen University, authored the monograph Contingency Table Analysis: Methods and Implementation Using R (Birkhäuser/Springer) and a textbook on mathematics for economists (in German). She has long-term experience in teaching statistics courses to students of Data Science, Mathematics, Statistics, Computer Science, Business Administration, and Engineering.

    "[...] Overall, I found the book to be a creative and refreshing take on the challenge of building foundations of “classical” statistics while helping introduce newer topics that are increasingly central to the statistical sciences. Important ideas of the past 50 years (see Gelman and Vehtari 2021) such as resampling, regularization, and hierarchical modeling are incorporated as optional sections (marked with an asterisk). The authors have captured much of the excitement of the statistical sciences and shared it in a way that I believe that students (and instructors) will share their enthusiasm. I look forward to teaching using this book."
    -Nicholas J. Horton in the Journal of the American Statistical Association, July 2022

    "If you find the other books (co-)authored by A. Agresti interesting, you will not be disappointed this time either. The book is a very good mixture of theory and practice. It presents the topics in statistical science that any data scientist should be familiar with. ... In general, the theory is provided in an easy to read and understand way. Mathematical details are limited to minimum. The emphasis is on the intuitive explanation of the statistical theory and its implementation in practice. And because of that the theory is broadly illustrated with examples based on the real data (which is an additional asset of the book). ... Another pro worth mentioning is the way how the book is organized. It is extremely easy to go back and find the content which is needed. Blueshaded areas with key messages, R codes presented in blue, summaries at the end of each sections – all of this makes this book very transparent and well organized. The book can be truly recommended to students who would like to start their journey as Data Scientists or young practitioners in this field. It can be also a great inspiration for lecturers."
    -Kinga Sałapa in ISCB Book Reviews, September 2022

    "The main goal of this textbook is to present foundational statistical methods and theory that are relevant in the field of data science. The authors depart from the typical approaches taken by many conventional mathematical statistics textbooks by placing more emphasis on providing the students with intuitive and practical interpretations of those methods with the aid of R programming codes. The book also takes slightly different organizations and presents a few topics that are not commonly found in conventional mathematical statistics textbooks. Notably, the book introduces both the frequentist approach and the Bayesian approach for each chapter on statistical inference in Chapters 4 – 6...I find its particular strength to be its intuitive presentation of statistical theory and methods without getting bogged down in mathematical details that are perhaps less useful to the practitioners."
    -Mintaek Lee, Boise State University

    "The statistical training for budding data scientists is different than the statistical training for budding statisticians, or other scientists. Data scientists require a different mix of theory and practice than statisticians, plus a great deal more exposure to computation than many other types of scientists. The aspects of this manuscript that I find appealing for the courses I teach: 1. The use of real data. 2. The use of R but with the option to use Python. 3. A good mix of theory and practice. 4. The text is well-written with good exercises. 5. The coverage of topics (e.g. Bayesian methods and clustering) that are not usually part of a course in statistics at the level of this book".
    -Jason M. Graham, University of Scranton

    "This book distinguishes itself with its focus on computational aspects of statistics (the appendices on R and Python and the examples throughout the text that use R). The ‘cost’ of this approach seems to be that much less attention is given to probability than in a standard text. There is a definite market for this approach – computational statistics/data science do not really require as much probability background as is usually given, while more focus on the way that things are actually done in practice (with software such as R or Python) is extremely beneficial to students that are looking to apply statistical methods. There is a wealth of problems in the book, and their variety (both computational and theoretical) is much appreciated. Also, the expansive appendices on R and Python wonderful, and will be of great help to students…Two major reasons that I would adopt the book are that its discussions seem to be slightly nontraditional in some cases (see above), yet still getting the salient points across. I also am happy about the examples throughout the text that use R–this is very useful for my students."
    -Christopher Gaffney, Drexel University

    "I will most likely adopt the proposed book for my class. The book seems to provide just about right level of mathematics—not too theoretical or like many other cookbooks which are available for R programming."
    -Tumulesh Solanky, University of New Orleans

    "The book is well-written and the examples are well-suited for building foundations for statistical science for data science as a discipline. The material covers most of the theoretical backgrounds in statistics. Throughout the book, the authors have used R programming to illustrate the concepts. In many cases, simulations were presented to support the theory. Each chapter has abundant practical exercises for the readers to explore the materials further. This textbook can serve as a textbook for a data science curriculum."
    -Steve Chung, Cal State University Fresno