1st Edition

# Probability and Statistics for Data Science: Math + R + Data

**Probability and Statistics for Data Science: Math + R + Data** covers "math stat"—distributions, expected value, estimation etc.—but takes the phrase "Data Science" in the title quite seriously:

* Real datasets are used extensively.

* All data analysis is supported by R coding.

* Includes many Data Science applications, such as PCA, mixture distributions, random graph models, Hidden Markov models, linear and logistic regression, and neural networks.

* Leads the student to think critically about the "how" and "why" of statistics, and to "see the big picture."

* Not "theorem/proof"-oriented, but concepts and models are stated in a mathematically precise manner.

Prerequisites are calculus, some matrix algebra, and some experience in programming.

**Norman Matloff** is a professor of computer science at the University of California, Davis, and was formerly a statistics professor there. He is on the editorial boards of the *Journal of Statistical Software* and *The R Journal*. His book *Statistical Regression and Classification: From Linear Models to Machine Learning* was the recipient of the Ziegel Award for the best book reviewed in *Technometrics* in 2017. He is a recipient of his university's Distinguished Teaching Award.

- Basic Probability Models
- Monte Carlo Simulation
- Discrete Random Variables: Expected Value
- Discrete Random Variables: Variance
- Discrete Parametric Distribution Families
- Introduction to Discrete Markov Chains
- Continuous Probability Models
- Statistics: Prologue
- Fitting Continuous Models
- The Family of Normal Distributions
- Introduction to Statistical Inference
- Multivariate Distributions
- Dimension Reduction
- Predictive Modeling
- Model Parsimony and Overfitting

Example: Bus Ridership

A "Notebook" View: the Notion of a Repeatable Experiment

Theoretical Approaches

A More Intuitive Approach

Our Definitions

"Mailing Tubes"

Example: Bus Ridership Model (cont'd)

Example: ALOHA Network

ALOHA Network Model Summary

ALOHA Network Computations

ALOHA in the Notebook Context

Example: A Simple Board Game

Bayes' Rule

General Principle

Example: Document Classification

Random Graph Models

Example: Preferential Attachment Model

Combinatorics-Based Probability Computation

Which Is More Likely in Five Cards, One King or Two Hearts?

Example: Random Groups of Students

Example: Lottery Tickets

Example: "Association Rules"

Example: Gaps between Numbers

Multinomial Coefficients

Example: Probability of Getting Four Aces in a Bridge Hand

Example: Rolling Dice

First Improvement

Second Improvement

Third Improvement

Example: Dice Problem

Use of runif() for Simulating Events

Example: ALOHA Network (cont'd)

Example: Bus Ridership (cont'd)

Example: Board Game (cont'd)

Example: Broken Rod

How Long Should We Run the Simulation?

Computational Complements

More on the replicate() Function

Random Variables

Discrete Random Variables

Independent Random Variables

Example: The Monty Hall Problem

Expected Value

Generality: Not Just for Discrete Random Variables

Misnomer

Definition and Notebook View

Properties of Expected Value

Computational Formula

Further Properties of Expected Value

Finding Approximate Expected Values via Simulation

Casinos, Insurance Companies and "Sum Users," Compared to Others

Mathematical Complements

Proof of Property E:

Variance

Definition

Central Importance of the Concept of Variance

Intuition Regarding the Size of Var(X)

Chebychev's Inequality

The Coefficient of Variation

A Useful Fact

Covariance

Indicator Random Variables, and Their Means and Variances

Example: Return Time for Library Books, Version I

Example: Return Time for Library Books, Version II

Example: Indicator Variables in a Committee Problem

Skewness

Mathematical Complements

Proof of Chebychev's Inequality

Distributions

Example: Toss Coin Until First Head

Example: Sum of Two Dice

Example: Watts-Strogatz Random Graph Model

The Model

Parametric Families of Distributions

The Case of Importance to Us: Parametric Families of pmfs

Distributions Based on Bernoulli Trials

The Geometric Family of Distributions

R Functions

Example: a Parking Space Problem

The Binomial Family of Distributions

R Functions

Example: Parking Space Model

The Negative Binomial Family of Distributions

R Functions

Example: Backup Batteries

Two Major Non-Bernoulli Models

The Poisson Family of Distributions

R Functions

Example: Broken Rod

Fitting the Poisson and Power Law Models to Data

Example: the Bus Ridership Problem

Example: Flipping Coins with Bonuses

Example: Analysis of Social Networks

Mathematical Complements

Computational Complements

Graphics and Visualization in R

Matrix Formulation

Example: Die Game

Long-Run State Probabilities

Stationary Distribution

Calculation of π

Simulation Calculation of π

Example: 3-Heads-in-a-Row Game

Example: Bus Ridership Problem

Hidden Markov Models

Example: Bus Ridership

Computation

Google PageRank

A Random Dart

Individual Values Now Have Probability Zero

But Now We Have a Problem

Our Way Out of the Problem: Cumulative Distribution Functions

CDFs

Non-Discrete, Non-Continuous Distributions

Density Functions

Properties of Densities

Intuitive Meaning of Densities

Expected Values

A First Example

Famous Parametric Families of Continuous Distributions

The Uniform Distributions

Density and Properties

R Functions

Example: Modeling of Disk Performance

Example: Modeling of Denial-of-Service Attack

The Normal (Gaussian) Family of Continuous Distributions

Density and Properties

R Functions

Importance in Modeling

The Exponential Family of Distributions

Density and Properties

R Functions

Example: Garage Parking Fees

Memoryless Property of Exponential Distributions

Importance in Modeling

The Gamma Family of Distributions

Density and Properties

Example: Network Buffer

Importance in Modeling

The Beta Family of Distributions

Density Etc

Importance in Modeling

Mathematical Complements

Duality of the Exponential Family with the Poisson Family

Computational Complements

Inverse Method for Sampling from a Density

Sampling from a Poisson Distribution

Importance of This Chapter

Sampling Distributions

Random Samples

The Sample Mean: a Random Variable

Toy Population Example

Expected Value and Variance of X̄

Toy Population Example Again

Interpretation

Notebook View

Simple Random Sample Case

The Sample Variance: Another Random Variable

Intuitive Estimation of σ²

Easier Computation

Special Case: X Is an Indicator Variable

To Divide by n or n-1?

Statistical Bias

The Concept of a "Standard Error"

Example: Pima Diabetes Study

Don't Forget: Sample ≠ Population!

Simulation Issues

Sample Estimates

Infinite Populations?

Observational Studies

The Bayesian Philosophy

How Does It Work?

Arguments for and Against

Computational Complements

R's split() and tapply() Functions

Estimating a Density from Sample Data

Example: BMI Data

The Number of Bins

The Bias-Variance Tradeoff

The Bias-Variance Tradeoff in the Histogram Case

A General Issue: Choosing the Degree of Smoothing

Parameter Estimation

Method of Moments

Example: BMI Data

The Method of Maximum Likelihood

Example: Humidity Data

MM vs MLE

Advanced Methods for Density Estimation

Assessment of Goodness of Fit

Mathematical Complements

Details of Kernel Density Estimators

Computational Complements

Generic Functions

The gmm Package

The gmm() Function

Example: Bodyfat Data

Density and Properties

Closure Under Affine Transformation

Closure Under Independent Summation

A Mystery

R Functions

The Standard Normal Distribution

Evaluating Normal cdfs

Example: Network Intrusion

Example: Class Enrollment Size

The Central Limit Theorem

Example: Cumulative Roundoff Error

Example: Coin Tosses

Example: Museum Demonstration

A Bit of Insight into the Mystery

X̄ Is Approximately Normal, No Matter What the Population Distribution Is

Approximate Distribution of (Centered and Scaled) X̄

Improved Assessment of Accuracy of X̄

Importance in Modeling

The Chi-Squared Family of Distributions

Density and Properties

Example: Error in Pin Placement

Importance in Modeling

Relation to Gamma Family

Mathematical Complements

Convergence in Distribution, and the Precisely-Stated CLT

Computational Complements

Example: Generating Normal Random Numbers

The Role of Normal Distributions

Confidence Intervals for Means

Basic Formulation

Example: Pima Diabetes Study

Example: Humidity Data

Meaning of Confidence Intervals

A Weight Survey in Davis

Confidence Intervals for Proportions

Example: Machine Classification of Forest Covers

The Student-t Distribution

Introduction to Significance Tests

The Proverbial Fair Coin

The Basics

General Testing Based on Normally Distributed Estimators

The Notion of "p-Values"

What's Random and What Is Not

Example: the Forest Cover Data

Problems with Significance Testing

History of Significance Testing, and Where We Are Today

The Basic Issues

Alternative Approach

The Problem of "P-hacking"

A Thought Experiment

Multiple Inference Methods

Philosophy of Statistics

More about Interpretation of CIs

The Bayesian View of Confidence Intervals

Multivariate Distributions: Discrete Case

Example: Marbles in a Bag

Multivariate pmfs

Multivariate Distributions: Continuous Case

Multivariate Densities

Motivation and Definition

Use of Multivariate Densities in Finding Probabilities and Expected Values

Example: a Triangular Distribution

Example: Train Rendezvous

Multivariate Distributions: Mixed Discrete-Continuous Case

Measuring Co-variation of Random Variables

Covariance

Example: the Committee Example Again

Correlation

Example: Correlation in the Triangular Distribution

Sample Estimates

Sets of Independent Random Variables

Properties

Expected Values Factor

Covariance Is 0

Variances Add

Examples Involving Sets of Independent Random Variables

Example: Dice

Matrix Formulations

Properties of Mean Vectors

Covariance Matrices

Covariance Matrices of Linear Combinations of Random Vectors

More on Sets of Independent Random Variables

Probability Mass Functions and Densities Factor in the Independent Case

Convolution

Example: Ethernet

Example: Backup Battery

The Multivariate Normal Family of Distributions

Densities

Geometric Interpretation

R Functions

Special Case: New Variable Is a Single Linear Combination of a Random Vector

Properties of Multivariate Normal Distributions

The Multivariate Central Limit Theorem

Iterated Expectations

Conditional Distributions

The Theorem

Example: Flipping Coins with Bonuses

Conditional Expectation as a Random Variable

What about Variance?

Mixture Distributions

Derivation of Mean and Variance

Mathematical Complements

Transform Methods

Generating Functions

Sums of Independent Poisson Random Variables Are Poisson Distributed

A Geometric View of Conditional Expectation

Alternate Proof of E(UV) = EU EV for Independent U,V

Computational Complements

Generating Multivariate Normal Random Vectors

Principal Components Analysis

Intuition

Properties of PCA

Example: Turkish Teaching Evaluations

Mathematical Complements

Derivation of PCA

Example: Heritage Health Prize

The Goals: Prediction and Description

Terminology

What Does "Relationship" Really Mean?

Precise Definition

Parametric Models for the Regression Function m()

Estimation in Linear Parametric Regression Models

Example: Baseball Data

R Code

Multiple Regression: More Than One Predictor Variable

Example: Baseball Data (cont'd)

Interaction Terms

Parametric Estimation of Linear Regression Functions

Meaning of "Linear"

Random-X and Fixed-X Regression

Point Estimates and Matrix Formulation

Approximate Confidence Intervals

Example: Baseball Data (cont'd)

Dummy Variables

Classification

Classification = Regression

Logistic Regression

The Logistic Model: Motivations

Estimation and Inference for Logit Coefficients

Example: Forest Cover Data

R Code

Analysis of the Results

Multiclass Case

Machine Learning Methods: Neural Networks

Example: Predicting Vertebral Abnormalities

But What Is Really Going On?

R Packages

Mathematical Complements

Matrix Derivatives and Minimizing the Sum of Squares

Computational Complements

Some Computational Details in Section

More Regarding glm()

What Is Overfitting?

Example: Histograms

Example: Polynomial Regression

Can Anything Be Done about It?

Cross-Validation

A. R Quick Start

Correspondences

Starting R

First Sample Programming Session

Vectorization

Second Sample Programming Session

Recycling

More on Vectorization

Third Sample Programming Session

Default Argument Values

The R List Type

The Basics

S3 Classes

Some Workhorse Functions

Data Frames

Online Help

Debugging in R

Further Reading

B. Matrix Algebra

Terminology and Notation

Matrix Addition and Multiplication

Matrix Transpose

Linear Independence

Determinants

Matrix Inverse

Eigenvalues and Eigenvectors

Matrix Algebra in R

"I quite like this book. I believe that the book describes itself quite well when it says: Mathematically correct yet highly intuitive…This book would be great for a class that one takes before one takes my statistical learning class. I often run into beginning graduate Data Science students whose background is not math (e.g., CS or Business) and they are not ready…The book fills an important niche, in that it provides a self-contained introduction to material that is useful for a higher-level statistical learning course. I think that it compares well with competing books, particularly in that it takes a more "Data Science" and "example driven" approach than more classical books."

~Randy Paffenroth, Worcester Polytechnic Institute

"This text by Matloff (Univ. of California, Davis) affords an excellent introduction to statistics for the data science student…Its examples are often drawn from data science applications such as hidden Markov models and remote sensing, to name a few… All the models and concepts are explained well in precise mathematical terms (not presented as formal proofs), to help students gain an intuitive understanding."

~CHOICE