1st Edition

Probability and Statistics for Data Science: Math + R + Data

By Norman Matloff
Copyright 2020
    444 Pages
    by Chapman & Hall

    Probability and Statistics for Data Science: Math + R + Data covers "math stat"—distributions, expected value, estimation etc.—but takes the phrase "Data Science" in the title quite seriously:

    * Real datasets are used extensively.

    * All data analysis is supported by R coding.

    * Includes many Data Science applications, such as PCA, mixture distributions, random graph models, Hidden Markov models, linear and logistic regression, and neural networks.

    * Leads the student to think critically about the "how" and "why" of statistics, and to "see the big picture."

    * Not "theorem/proof"-oriented, but concepts and models are stated in a mathematically precise manner.

    Prerequisites are calculus, some matrix algebra, and some experience in programming.

    Norman Matloff is a professor of computer science at the University of California, Davis, and was formerly a statistics professor there. He is on the editorial boards of the Journal of Statistical Software and The R Journal. His book Statistical Regression and Classification: From Linear Models to Machine Learning was the recipient of the Ziegel Award for the best book reviewed in Technometrics in 2017. He is a recipient of his university's Distinguished Teaching Award.

    1. Basic Probability Models
    2. Example: Bus Ridership

      A "Notebook" View: the Notion of a Repeatable Experiment

      Theoretical Approaches

      A More Intuitive Approach

      Our Definitions

      "Mailing Tubes"

      Example: Bus Ridership Model (cont'd)

      Example: ALOHA Network

      ALOHA Network Model Summary

      ALOHA Network Computations

      ALOHA in the Notebook Context

      Example: A Simple Board Game

      Bayes' Rule

      General Principle

      Example: Document Classification

      Random Graph Models

      Example: Preferential Attachment Model

      Combinatorics-Based Probability Computation

      Which Is More Likely in Five Cards, One King or Two Hearts?

      Example: Random Groups of Students

      Example: Lottery Tickets

      Example: "Association Rules"

      Example: Gaps between Numbers

      Multinomial Coefficients

      Example: Probability of Getting Four Aces in a Bridge Hand

    3. Monte Carlo Simulation
    4. Example: Rolling Dice

      First Improvement

      Second Improvement

      Third Improvement

      Example: Dice Problem

      Use of runif() for Simulating Events

      Example: ALOHA Network (cont'd)

      Example: Bus Ridership (cont'd)

      Example: Board Game (cont'd)

      Example: Broken Rod

      How Long Should We Run the Simulation?

      Computational Complements

      More on the replicate() Function

    5. Discrete Random Variables: Expected Value
    6. Random Variables

      Discrete Random Variables

      Independent Random Variables

      Example: The Monty Hall Problem

      Expected Value

      Generality: Not Just for Discrete Random Variables

      Misnomer

      Definition and Notebook View

      Properties of Expected Value

      Computational Formula

      Further Properties of Expected Value

      Finding Approximate Expected Values via Simulation

      Casinos, Insurance Companies and "Sum Users," Compared to Others

      Mathematical Complements

      Proof of Property E:

    7. Discrete Random Variables: Variance
    8. Variance

      Definition

      Central Importance of the Concept of Variance

      Intuition Regarding the Size of Var(X)

      Chebychev's Inequality

      The Coefficient of Variation

      A Useful Fact

      Covariance

      Indicator Random Variables, and Their Means and Variances

      Example: Return Time for Library Books, Version I

      Example: Return Time for Library Books, Version II

      Example: Indicator Variables in a Committee Problem

      Skewness

      Mathematical Complements

      Proof of Chebychev's Inequality

    9. Discrete Parametric Distribution Families
    10. Distributions

      Example: Toss Coin Until First Head

      Example: Sum of Two Dice

      Example: Watts-Strogatz Random Graph Model

      The Model

      Parametric Families of Distributions

      The Case of Importance to Us: Parametric Families of pmfs

      Distributions Based on Bernoulli Trials

      The Geometric Family of Distributions

      R Functions

      Example: a Parking Space Problem

      The Binomial Family of Distributions

      R Functions

      Example: Parking Space Model

      The Negative Binomial Family of Distributions

      R Functions

      Example: Backup Batteries

      Two Major Non-Bernoulli Models

      The Poisson Family of Distributions

      R Functions

      Example: Broken Rod

      Fitting the Poisson and Power Law Models to Data

      Example: the Bus Ridership Problem

      Example: Flipping Coins with Bonuses

      Example: Analysis of Social Networks

      Mathematical Complements

      Computational Complements

      Graphics and Visualization in R

    11. Introduction to Discrete Markov Chains
    12. Matrix Formulation

      Example: Die Game

      Long-Run State Probabilities

      Stationary Distribution

      Calculation of π

      Simulation Calculation of π

      Example: -Heads-in-a-Row Game

      Example: Bus Ridership Problem

      Hidden Markov Models

      Example: Bus Ridership

      Computation

      Google PageRank

    13. Continuous Probability Models
    14. A Random Dart

      Individual Values Now Have Probability Zero

      But Now We Have a Problem

      Our Way Out of the Problem: Cumulative Distribution Functions

      CDFs

      Non-Discrete, Non-Continuous Distributions

      Density Functions

      Properties of Densities

      Intuitive Meaning of Densities

      Expected Values

      A First Example

      Famous Parametric Families of Continuous Distributions

      The Uniform Distributions

      Density and Properties

      R Functions

      Example: Modeling of Disk Performance

      Example: Modeling of Denial-of-Service Attack

      The Normal (Gaussian) Family of Continuous Distributions

      Density and Properties

      R Functions

      Importance in Modeling

      The Exponential Family of Distributions

      Density and Properties

      R Functions

      Example: Garage Parking Fees

      Memoryless Property of Exponential Distributions

      Importance in Modeling

      The Gamma Family of Distributions

      Density and Properties

      Example: Network Buffer

      Importance in Modeling

      The Beta Family of Distributions

      Density Etc

      Importance in Modeling

      Mathematical Complements

      Duality of the Exponential Family with the Poisson Family

      Computational Complements

      Inverse Method for Sampling from a Density

      Sampling from a Poisson Distribution

    15. Statistics: Prologue
    16. Importance of This Chapter

      Sampling Distributions

      Random Samples

      The Sample Mean: a Random Variable

      Toy Population Example

      Expected Value and Variance of X̄

      Toy Population Example Again

      Interpretation

      Notebook View

      Simple Random Sample Case

      The Sample Variance: Another Random Variable

      Intuitive Estimation of σ²

      Easier Computation

      Special Case: X Is an Indicator Variable

      To Divide by n or n-1?

      Statistical Bias

      The Concept of a "Standard Error"

      Example: Pima Diabetes Study

      Don't Forget: Sample ≠ Population!

      Simulation Issues

      Sample Estimates

      Infinite Populations?

      Observational Studies

      The Bayesian Philosophy

      How Does It Work?

      Arguments for and Against

      Computational Complements

      R's split() and tapply() Functions

    17. Fitting Continuous Models
    18. Estimating a Density from Sample Data

      Example: BMI Data

      The Number of Bins

      The Bias-Variance Tradeoff

      The Bias-Variance Tradeoff in the Histogram Case

      A General Issue: Choosing the Degree of Smoothing

      Parameter Estimation

      Method of Moments

      Example: BMI Data

      The Method of Maximum Likelihood

      Example: Humidity Data

      MM vs MLE

      Advanced Methods for Density Estimation

      Assessment of Goodness of Fit

      Mathematical Complements

      Details of Kernel Density Estimators

      Computational Complements

      Generic Functions

      The gmm Package

      The gmm() Function

      Example: Bodyfat Data

    19. The Family of Normal Distributions
    20. Density and Properties

      Closure Under Affine Transformation

      Closure Under Independent Summation

      A Mystery

      R Functions

      The Standard Normal Distribution

      Evaluating Normal cdfs

      Example: Network Intrusion

      Example: Class Enrollment Size

      The Central Limit Theorem

      Example: Cumulative Roundoff Error

      Example: Coin Tosses

      Example: Museum Demonstration

      A Bit of Insight into the Mystery

      X̄ Is Approximately Normal: No Matter What the Population Distribution Is

      Approximate Distribution of (Centered and Scaled) X̄

      Improved Assessment of Accuracy of X̄

      Importance in Modeling

      The Chi-Squared Family of Distributions

      Density and Properties

      Example: Error in Pin Placement

      Importance in Modeling

      Relation to Gamma Family

      Mathematical Complements

      Convergence in Distribution, and the Precisely-Stated CLT

      Computational Complements

      Example: Generating Normal Random Numbers

    21. Introduction to Statistical Inference
    22. The Role of Normal Distributions

      Confidence Intervals for Means

      Basic Formulation

      Example: Pima Diabetes Study

      Example: Humidity Data

      Meaning of Confidence Intervals

      A Weight Survey in Davis

      Confidence Intervals for Proportions

      Example: Machine Classification of Forest Covers

      The Student-t Distribution

      Introduction to Significance Tests

      The Proverbial Fair Coin

      The Basics

      General Testing Based on Normally Distributed Estimators

      The Notion of "p-Values"

      What's Random and What Is Not

      Example: the Forest Cover Data

      Problems with Significance Testing

      History of Significance Testing, and Where We Are Today

      The Basic Issues

      Alternative Approach

      The Problem of "P-hacking"

      A Thought Experiment

      Multiple Inference Methods

      Philosophy of Statistics

      More about Interpretation of CIs

      The Bayesian View of Confidence Intervals

    23. Multivariate Distributions
    24. Multivariate Distributions: Discrete Case

      Example: Marbles in a Bag

      Multivariate pmfs

      Multivariate Distributions: Continuous Case

      Multivariate Densities

      Motivation and Definition

      Use of Multivariate Densities in Finding Probabilities and Expected Values

      Example: a Triangular Distribution

      Example: Train Rendezvous

      Multivariate Distributions: Mixed Discrete-Continuous Case

      Measuring Co-variation of Random Variables

      Covariance

      Example: the Committee Example Again

      Correlation

      Example: Correlation in the Triangular Distribution

      Sample Estimates

      Sets of Independent Random Variables

      Properties

      Expected Values Factor

      Covariance Is 0

      Variances Add

      Examples Involving Sets of Independent Random Variables

      Example: Dice

      Matrix Formulations

      Properties of Mean Vectors

      Covariance Matrices

      Covariance Matrices of Linear Combinations of Random Vectors

      More on Sets of Independent Random Variables

      Probability Mass Functions and Densities Factor in the Independent Case

      Convolution

      Example: Ethernet

      Example: Backup Battery

      The Multivariate Normal Family of Distributions

      Densities

      Geometric Interpretation

      R Functions

      Special Case: New Variable Is a Single Linear Combination of a Random Vector

      Properties of Multivariate Normal Distributions

      The Multivariate Central Limit Theorem

      Iterated Expectations

      Conditional Distributions

      The Theorem

      Example: Flipping Coins with Bonuses

      Conditional Expectation as a Random Variable

      What about Variance?

      Mixture Distributions

      Derivation of Mean and Variance

      Mathematical Complements

      Transform Methods

      Generating Functions

      Sums of Independent Poisson Random Variables Are Poisson Distributed

      A Geometric View of Conditional Expectation

      Alternate Proof of E(UV) = EU EV for Independent U,V

      Computational Complements

      Generating Multivariate Normal Random Vectors

    25. Dimension Reduction
    26. Principal Components Analysis

      Intuition

      Properties of PCA

      Example: Turkish Teaching Evaluations

      Mathematical Complements

      Derivation of PCA

    27. Predictive Modeling
    28. Example: Heritage Health Prize

      The Goals: Prediction and Description

      Terminology

      What Does "Relationship" Really Mean?

      Precise Definition

      Parametric Models for the Regression Function m()

      Estimation in Linear Parametric Regression Models

      Example: Baseball Data

      R Code

      Multiple Regression: More Than One Predictor Variable

      Example: Baseball Data (cont'd)

      Interaction Terms

      Parametric Estimation of Linear Regression Functions

      Meaning of "Linear"

      Random-X and Fixed-X Regression

      Point Estimates and Matrix Formulation

      Approximate Confidence Intervals

      Example: Baseball Data (cont'd)

      Dummy Variables

      Classification

      Classification = Regression

      Logistic Regression

      The Logistic Model: Motivations

      Estimation and Inference for Logit Coefficients

      Example: Forest Cover Data

      R Code

      Analysis of the Results

      Multiclass Case

      Machine Learning Methods: Neural Networks

      Example: Predicting Vertebral Abnormalities

      But What Is Really Going On?

      R Packages

      Mathematical Complements

      Matrix Derivatives and Minimizing the Sum of Squares

      Computational Complements

      Some Computational Details in Section

      More Regarding glm()

    29. Model Parsimony and Overfitting

      What Is Overfitting?

      Example: Histograms

      Example: Polynomial Regression

      Can Anything Be Done about It?

      Cross-Validation

    A. R Quick Start

      Correspondences

      Starting R

      First Sample Programming Session

      Vectorization

      Second Sample Programming Session

      Recycling

      More on Vectorization

      Third Sample Programming Session

      Default Argument Values

      The R List Type

      The Basics

      S Classes

      Some Workhorse Functions

      Data Frames

      Online Help

      Debugging in R

      Further Reading

    B. Matrix Algebra

      Terminology and Notation

      Matrix Addition and Multiplication

      Matrix Transpose

      Linear Independence

      Determinants

      Matrix Inverse

      Eigenvalues and Eigenvectors

      Matrix Algebra in R

    "I quite like this book. I believe that the book describes itself quite well when it says: Mathematically correct yet highly intuitive…This book would be great for a class that one takes before one takes my statistical learning class. I often run into beginning graduate Data Science students whose background is not math (e.g., CS or Business) and they are not ready…The book fills an important niche, in that it provides a self-contained introduction to material that is useful for a higher-level statistical learning course. I think that it compares well with competing books, particularly in that it takes a more "Data Science" and "example driven" approach than more classical books."
    ~Randy Paffenroth, Worcester Polytechnic Institute

    "This text by Matloff (Univ. of California, Davis) affords an excellent introduction to statistics for the data science student…Its examples are often drawn from data science applications such as hidden Markov models and remote sensing, to name a few… All the models and concepts are explained well in precise mathematical terms (not presented as formal proofs), to help students gain an intuitive understanding."
    ~CHOICE