1st Edition

Probability and Statistics for Data Science
Math + R + Data

ISBN 9781138393295
Published June 20, 2019 by Chapman and Hall/CRC
444 Pages

SAVE $20.99
was $69.95
USD $48.96



Book Description

Probability and Statistics for Data Science: Math + R + Data covers "math stat"—distributions, expected value, estimation etc.—but takes the phrase "Data Science" in the title quite seriously:

* Real datasets are used extensively.

* All data analysis is supported by R coding.

* Includes many Data Science applications, such as PCA, mixture distributions, random graph models, Hidden Markov models, linear and logistic regression, and neural networks.

* Leads the student to think critically about the "how" and "why" of statistics, and to "see the big picture."

* Not "theorem/proof"-oriented, but concepts and models are stated in a mathematically precise manner.

Prerequisites are calculus, some matrix algebra, and some experience in programming.

Norman Matloff is a professor of computer science at the University of California, Davis, and was formerly a statistics professor there. He is on the editorial boards of the Journal of Statistical Software and The R Journal. His book Statistical Regression and Classification: From Linear Models to Machine Learning was the recipient of the Ziegel Award for the best book reviewed in Technometrics in 2017. He is a recipient of his university's Distinguished Teaching Award.

Table of Contents

  1. Basic Probability Models

    Example: Bus Ridership

    A "Notebook" View: the Notion of a Repeatable Experiment

    Theoretical Approaches

    A More Intuitive Approach

    Our Definitions

    "Mailing Tubes"

    Example: Bus Ridership Model (cont'd)

    Example: ALOHA Network

    ALOHA Network Model Summary

    ALOHA Network Computations

    ALOHA in the Notebook Context

    Example: A Simple Board Game

    Bayes' Rule

    General Principle

    Example: Document Classification

    Random Graph Models

    Example: Preferential Attachment Model

    Combinatorics-Based Probability Computation

    Which Is More Likely in Five Cards, One King or Two Hearts?

    Example: Random Groups of Students

    Example: Lottery Tickets

    Example: "Association Rules"

    Example: Gaps between Numbers

    Multinomial Coefficients

    Example: Probability of Getting Four Aces in a Bridge Hand

  2. Monte Carlo Simulation

    Example: Rolling Dice

    First Improvement

    Second Improvement

    Third Improvement

    Example: Dice Problem

    Use of runif() for Simulating Events

    Example: ALOHA Network (cont'd)

    Example: Bus Ridership (cont'd)

    Example: Board Game (cont'd)

    Example: Broken Rod

    How Long Should We Run the Simulation?

    Computational Complements

    More on the replicate() Function

  3. Discrete Random Variables: Expected Value

    Random Variables

    Discrete Random Variables

    Independent Random Variables

    Example: The Monty Hall Problem

    Expected Value

    Generality: Not Just for Discrete Random Variables


    Definition and Notebook View

    Properties of Expected Value

    Computational Formula

    Further Properties of Expected Value

    Finding Approximate Expected Values via Simulation

    Casinos, Insurance Companies and "Sum Users," Compared to Others

    Mathematical Complements

    Proof of Property E:

  4. Discrete Random Variables: Variance

    Variance


    Central Importance of the Concept of Variance

    Intuition Regarding the Size of Var(X)

    Chebychev's Inequality

    The Coefficient of Variation

    A Useful Fact


    Indicator Random Variables, and Their Means and Variances

    Example: Return Time for Library Books, Version I

    Example: Return Time for Library Books, Version II

    Example: Indicator Variables in a Committee Problem


    Mathematical Complements

    Proof of Chebychev's Inequality

  5. Discrete Parametric Distribution Families

    Distributions

    Example: Toss Coin Until First Head

    Example: Sum of Two Dice

    Example: Watts-Strogatz Random Graph Model

    The Model

    Parametric Families of Distributions

    The Case of Importance to Us: Parametric Families of pmfs

    Distributions Based on Bernoulli Trials

    The Geometric Family of Distributions

    R Functions

    Example: a Parking Space Problem

    The Binomial Family of Distributions

    R Functions

    Example: Parking Space Model

    The Negative Binomial Family of Distributions

    R Functions

    Example: Backup Batteries

    Two Major Non-Bernoulli Models

    The Poisson Family of Distributions

    R Functions

    Example: Broken Rod

    Fitting the Poisson and Power Law Models to Data

    Example: the Bus Ridership Problem

    Example: Flipping Coins with Bonuses

    Example: Analysis of Social Networks

    Mathematical Complements

    Computational Complements

    Graphics and Visualization in R

  6. Introduction to Discrete Markov Chains

    Matrix Formulation

    Example: Die Game

    Long-Run State Probabilities

    Stationary Distribution

    Calculation of π

    Simulation Calculation of π

    Example: -Heads-in-a-Row Game

    Example: Bus Ridership Problem

    Hidden Markov Models

    Example: Bus Ridership


    Google PageRank

  7. Continuous Probability Models

    A Random Dart

    Individual Values Now Have Probability Zero

    But Now We Have a Problem

    Our Way Out of the Problem: Cumulative Distribution Functions


    Non-Discrete, Non-Continuous Distributions

    Density Functions

    Properties of Densities

    Intuitive Meaning of Densities

    Expected Values

    A First Example

    Famous Parametric Families of Continuous Distributions

    The Uniform Distributions

    Density and Properties

    R Functions

    Example: Modeling of Disk Performance

    Example: Modeling of Denial-of-Service Attack

    The Normal (Gaussian) Family of Continuous Distributions

    Density and Properties

    R Functions

    Importance in Modeling

    The Exponential Family of Distributions

    Density and Properties

    R Functions

    Example: Garage Parking Fees

    Memoryless Property of Exponential Distributions

    Importance in Modeling

    The Gamma Family of Distributions

    Density and Properties

    Example: Network Buffer

    Importance in Modeling

    The Beta Family of Distributions

    Density Etc

    Importance in Modeling

    Mathematical Complements

    Duality of the Exponential Family with the Poisson Family

    Computational Complements

    Inverse Method for Sampling from a Density

    Sampling from a Poisson Distribution

  8. Statistics: Prologue

    Importance of This Chapter

    Sampling Distributions

    Random Samples

    The Sample Mean: a Random Variable

    Toy Population Example

    Expected Value and Variance of X̄

    Toy Population Example Again


    Notebook View

    Simple Random Sample Case

    The Sample Variance: Another Random Variable

    Intuitive Estimation of σ²

    Easier Computation

    Special Case: X Is an Indicator Variable

    To Divide by n or n-1?

    Statistical Bias

    The Concept of a "Standard Error"

    Example: Pima Diabetes Study

    Don't Forget: Sample ≠ Population!

    Simulation Issues

    Sample Estimates

    Infinite Populations?

    Observational Studies

    The Bayesian Philosophy

    How Does It Work?

    Arguments for and Against

    Computational Complements

    R's split() and tapply() Functions

  9. Fitting Continuous Models

    Estimating a Density from Sample Data

    Example: BMI Data

    The Number of Bins

    The Bias-Variance Tradeoff

    The Bias-Variance Tradeoff in the Histogram Case

    A General Issue: Choosing the Degree of


    Parameter Estimation

    Method of Moments

    Example: BMI Data

    The Method of Maximum Likelihood

    Example: Humidity Data

    MM vs MLE

    Advanced Methods for Density Estimation

    Assessment of Goodness of Fit

    Mathematical Complements

    Details of Kernel Density Estimators

    Computational Complements

    Generic Functions

    The gmm Package

    The gmm() Function

    Example: Bodyfat Data

  10. The Family of Normal Distributions

    Density and Properties

    Closure Under Affine Transformation

    Closure Under Independent Summation

    A Mystery

    R Functions

    The Standard Normal Distribution

    Evaluating Normal cdfs

    Example: Network Intrusion

    Example: Class Enrollment Size

    The Central Limit Theorem

    Example: Cumulative Roundoff Error

    Example: Coin Tosses

    Example: Museum Demonstration

    A Bit of Insight into the Mystery

    X̄ Is Approximately Normal, No Matter What the Population Distribution Is

    Approximate Distribution of (Centered and Scaled) X̄

    Improved Assessment of Accuracy of X̄

    Importance in Modeling

    The Chi-Squared Family of Distributions

    Density and Properties

    Example: Error in Pin Placement

    Importance in Modeling

    Relation to Gamma Family

    Mathematical Complements

    Convergence in Distribution, and the Precisely-Stated CLT

    Computational Complements

    Example: Generating Normal Random Numbers

  11. Introduction to Statistical Inference

    The Role of Normal Distributions

    Confidence Intervals for Means

    Basic Formulation

    Example: Pima Diabetes Study

    Example: Humidity Data

    Meaning of Confidence Intervals

    A Weight Survey in Davis

    Confidence Intervals for Proportions

    Example: Machine Classification of Forest Covers

    The Student-t Distribution

    Introduction to Significance Tests

    The Proverbial Fair Coin

    The Basics

    General Testing Based on Normally Distributed Estimators

    The Notion of "p-Values"

    What's Random and What Is Not

    Example: the Forest Cover Data

    Problems with Significance Testing

    History of Significance Testing, and Where We Are Today

    The Basic Issues

    Alternative Approach

    The Problem of "P-hacking"

    A Thought Experiment

    Multiple Inference Methods

    Philosophy of Statistics

    More about Interpretation of CIs

    The Bayesian View of Confidence Intervals

  12. Multivariate Distributions

    Multivariate Distributions: Discrete Case

    Example: Marbles in a Bag

    Multivariate pmfs

    Multivariate Distributions: Continuous Case

    Multivariate Densities

    Motivation and Definition

    Use of Multivariate Densities in Finding Probabilities and Expected Values

    Example: a Triangular Distribution

    Example: Train Rendezvous

    Multivariate Distributions: Mixed Discrete-Continuous Case

    Measuring Co-variation of Random Variables


    Example: the Committee Example Again


    Example: Correlation in the Triangular Distribution

    Sample Estimates

    Sets of Independent Random Variables


    Expected Values Factor

    Covariance Is 0

    Variances Add

    Examples Involving Sets of Independent Random Variables

    Example: Dice

    Matrix Formulations

    Properties of Mean Vectors

    Covariance Matrices

    Covariance Matrices Linear Combinations of Random


    More on Sets of Independent Random Variables

    Probability Mass Functions and Densities Factor in the Independent Case


    Example: Ethernet

    Example: Backup Battery

    The Multivariate Normal Family of Distributions


    Geometric Interpretation

    R Functions

    Special Case: New Variable Is a Single Linear Combination of a Random Vector

    Properties of Multivariate Normal Distributions

    The Multivariate Central Limit Theorem

    Iterated Expectations

    Conditional Distributions

    The Theorem

    Example: Flipping Coins with Bonuses

    Conditional Expectation as a Random Variable

    What about Variance?

    Mixture Distributions

    Derivation of Mean and Variance

    Mathematical Complements

    Transform Methods

    Generating Functions

    Sums of Independent Poisson Random Variables Are Poisson Distributed

    A Geometric View of Conditional Expectation

    Alternate Proof of E(UV) = EU EV for Independent U,V

    Computational Complements

    Generating Multivariate Normal Random Vectors

  13. Dimension Reduction

    Principal Components Analysis


    Properties of PCA

    Example: Turkish Teaching Evaluations

    Mathematical Complements

    Derivation of PCA

  14. Predictive Modeling

    Example: Heritage Health Prize

    The Goals: Prediction and Description


    What Does "Relationship" Really Mean?

    Precise Definition

    Parametric Models for the Regression Function m()

    Estimation in Linear Parametric Regression Models

    Example: Baseball Data

    R Code

    Multiple Regression: More Than One Predictor Variable

    Example: Baseball Data (cont'd)

    Interaction Terms

    Parametric Estimation of Linear Regression Functions

    Meaning of "Linear"

    Random-X and Fixed-X Regression

    Point Estimates and Matrix Formulation

    Approximate Confidence Intervals

    Example: Baseball Data (cont'd)

    Dummy Variables


    Classification = Regression

    Logistic Regression

    The Logistic Model: Motivations

    Estimation and Inference for Logit Coefficients

    Example: Forest Cover Data

    R Code

    Analysis of the Results

    Multiclass Case

    Machine Learning Methods: Neural Networks

    Example: Predicting Vertebral Abnormalities

    But What Is Really Going On?

    R Packages

    Mathematical Complements

    Matrix Derivatives and Minimizing the Sum of Squares

    Computational Complements

    Some Computational Details in Section

    More Regarding glm()

  15. Model Parsimony and Overfitting

          What Is Overfitting?

          Example: Histograms

          Example: Polynomial Regression

          Can Anything Be Done about It?


A.      R Quick Start

          Correspondences

          Starting R

          First Sample Programming Session

          Vectorization

          Second Sample Programming Session

          Recycling

          More on Vectorization

          Third Sample Programming Session

          Default Argument Values

          The R List Type

          The Basics

          S3 Classes

          Some Workhorse Functions

          Data Frames

          Online Help

          Debugging in R

          Further Reading

B.      Matrix Algebra

          Terminology and Notation

          Matrix Addition and Multiplication

          Matrix Transpose

          Linear Independence

          Determinants

          Matrix Inverse

          Eigenvalues and Eigenvectors

          Matrix Algebra in R






"I quite like this book. I believe that the book describes itself quite well when it says: Mathematically correct yet highly intuitive…This book would be great for a class that one takes before one takes my statistical learning class. I often run into beginning graduate Data Science students whose background is not math (e.g., CS or Business) and they are not ready…The book fills an important niche, in that it provides a self-contained introduction to material that is useful for a higher-level statistical learning course. I think that it compares well with competing books, particularly in that it takes a more "Data Science" and "example driven" approach than more classical books."
~Randy Paffenroth, Worcester Polytechnic Institute

"This text by Matloff (Univ. of California, Davis) affords an excellent introduction to statistics for the data science student…Its examples are often drawn from data science applications such as hidden Markov models and remote sensing, to name a few… All the models and concepts are explained well in precise mathematical terms (not presented as formal proofs), to help students gain an intuitive understanding."