1st Edition

# Statistical Regression and Classification: From Linear Models to Machine Learning

This text provides a modern introduction to regression and classification with an emphasis on big data and R. Each chapter is partitioned into a main body section and an extras section. The main body uses mathematical statistics ("math stat") very sparingly and always in the context of something concrete, so readers can skip the math stat content entirely if they wish. The extras section is for those who feel comfortable with analysis using math stat.

***Statistical Regression and Classification: From Linear Models to Machine Learning** was awarded the 2017 Ziegel Award for the best book reviewed in Technometrics in 2017.*

**Setting the Stage**

**Example: Predicting Bike-Sharing Activity**

**Example of the Prediction Goal: Body Fat**

**Example of the Description Goal: Who Clicks Web Ads?**

**Optimal Prediction**

**A Note About E(), Samples and Populations**

**Example: Do Baseball Players Gain Weight As They Age?**

Prediction vs Description

A First Estimator

A Possibly Better Estimator, Using a Linear Model

**Multipredictor Linear Models**

*Parametric vs Nonparametric Models*

*Example: Click-Through Rate*

Several Predictor Variables

Estimation of Coefficients

The Description Goal

Nonparametric Regression Estimation: k-NN

Looking at Nearby Points

Measures of Nearness

The k-NN Method, and Tuning Parameters

Nearest-Neighbor Analysis in the regtools Package

Example: Baseball Player Data

**Parametric Settings**

*After Fitting a Model, How Do We Use It for Prediction?*

Nonparametric Settings

The Generic predict() Function

**Intuition**

*Overfitting, and the Variance-Bias Tradeoff*

Example: Student Evaluations of Instructors

*Linear Model Case*

**Cross-Validation**

The Code

Applying the Code

k-NN Case

Choosing the Partition Sizes

*Important Note on Tuning Parameters*

*Rough Rule of Thumb*

**Linear Modeling of μ(t)**

*Example: Bike-Sharing Data*

Nonparametric Analysis

**Example: Salaries of Female Programmers and Engineers**

*Interaction Terms, Including Quadratics*

**Higher-Order Polynomial Models**

*Saving Your Work*

**It's a Regression Problem!**

*Classification Techniques*

Example: Bike-Sharing Data

*Crucial Advice: Don't Automate, Participate!*

**Indicator Random Variables**

*Mathematical Complements*

Mean Squared Error of an Estimator

μ(t) Minimizes Mean Squared Prediction Error

μ(t) Minimizes the Misclassification Rate

Kernel-Based Nonparametric Estimation of Regression Functions

General Nonparametric Regression

Some Properties of Conditional Expectation

Conditional Expectation As a Random Variable

The Law of Total Expectation

Law of Total Variance

Tower Property

Geometric View

**CRAN Packages**

*Computational Complements*

The Function tapply() and Its Cousins

The Innards of the k-NN Code

Function Dispatch

**Centering and Scaling**

**Further Exploration: Data, Code and Math Problems**

Linear Regression Models

**Notation**

Motivation

The "Error Term"

Random- vs Fixed-X Cases

Least-Squares Estimation

Matrix Formulations

() in Matrix Terms

Using Matrix Operations to Minimize ()

Models Without an Intercept Term

**Statistical Inference**

*A Closer Look at lm() Output*

**Classical**

*Assumptions*

Motivation: the Multivariate Normal Distribution Family

**β̂ Is Unbiased**

*Unbiasedness and Consistency*

Bias As an Issue/Nonissue

β̂ Is Statistically Consistent

*Inference under Homoscedasticity*

*Review: Classical Inference on a Single Mean*

Back to Reality

The Concept of a Standard Error

Extension to the Regression Case

Example: Bike-Sharing Data

*Collective Predictive Strength of the X(j)*

*Basic Properties*

Definition of R²

Bias Issues

Adjusted-R²

The "Leaving-One-Out Method"

Extensions of LOOM

LOOM for k-NN

Other Measures

*The Practical Value of p-Values: Small OR Large*

*Misleadingly Small p-Values*

Example: Forest Cover Data

Example: Click Through Data

Misleadingly LARGE p-Values

The Verdict

*Missing Values*

**Mathematical Complements**

Covariance Matrices

The Multivariate Normal Distribution Family

The Central Limit Theorem

Details on Models Without a Constant Term

Unbiasedness of the Least-Squares Estimator

Consistency of the Least-Squares Estimator

Biased Nature of S

*Random Variables As Inner Product Spaces*

**The Geometry of Conditional Expectation**

Projections

Conditional Expectations As Projections

Predicted Values and Error Terms Are Uncorrelated

Classical "Exact" Inference

Asymptotic (p + 1)-Variate Normality of β̂

**Details of the Computation of ()**

*Computational Complements*

R Functions Relating to the Multivariate Normal Distribution Family

Example: Simulation Computation of a Bivariate Normal Quantity

More Details of 'lm' Objects

**Homoscedasticity and Other Assumptions in Practice**

**Normality Assumption**

**Independence Assumption: Don't Overlook It**

Estimation of a Single Mean

Inference on Linear Regression Coefficients

What Can Be Done?

Example: MovieLens Data

**Robustness of the Homoscedasticity Assumption**

*Dropping the Homoscedasticity Assumption*

Weighted Least Squares

A Procedure for Valid Inference

The Methodology

Example: Female Wages

Simulation Test

Variance-Stabilizing Transformations

The Verdict

*Further Reading*

**The R merge() Function**

*Computational Complements*

**The Delta Method**

*Mathematical Complements*

Distortion Due to Transformation

*Further Exploration: Data, Code and Math Problems*

**Generalized Linear and Nonlinear Models**

*Example: Enzyme Kinetics Model*

**The Generalized Linear Model (GLM)**

Definition

Poisson Regression

Exponential Families

GLM Computation

R's glm() Function

**Motivation**

*GLM: the Logistic Model*

Example: Pima Diabetes Data

Interpretation of Coefficients

The predict() Function Again

Overall Prediction Accuracy

Example: Predicting Spam E-mail

Linear Boundary

*GLM: the Poisson Regression Model*

**The Gauss-Newton Method**

*Least-Squares Computation for Nonlinear Models*

Eicker-White Asymptotic Standard Errors

Example: Bike Sharing Data

The "Elephant in the Room": Convergence Issues

*Further Reading*

**R Factors**

*Computational Complements*

**Maximum Likelihood Estimation**

*Mathematical Complements*

**Which Is Better?**

*Further Exploration: Data, Code and Math Problems*

Multiclass Classification Problems

*Key Notation*

Key Equations

Estimating the Functions μ_i(t)

How Do We Use Models for Prediction?

One vs All or All vs All?

Example: Vertebrae Data

Intuition

Example: Letter Recognition Data

Example: k-NN on the Letter Recognition Data

The Verdict

**Background**

*The Classical Approach: Fisher Linear Discriminant Analysis*

Derivation

Example: Vertebrae Data

LDA Code and Results

**Model**

*Multinomial Logistic Model*

Software

Example: Vertebrae Data

**Why the Concern Regarding Balance?**

*The Issue of "Unbalanced" (and Balanced) Data*

A Crucial Sampling Issue

It All Depends on How We Sample

Remedies

Example: Letter Recognition

**Unequal Misclassification Costs**

*Going Beyond Using the 0.5 Threshold*

Revisiting the Problem of Unbalanced Data

The Confusion Matrix and the ROC Curve

Code

Example: Spam Data

**Classification via Density Estimation**

*Mathematical Complements*

Methods for Density Estimation

Time Complexity Comparison, OVA vs AVA

Optimal Classification Rule for Unequal Error Costs

**R Code for OVA and AVA Logit Analysis**

*Computational Complements*

ROC Code

*Further Exploration: Data, Code and Math Problems*

**Prediction Context**

Model Fit: Assessment and Improvement

*Aims of This Chapter*

Methods

Notation

Goals of Model Fit-Checking

Description Context

Center vs Fringes of the Data Set

*Example: Currency Data*

*R-Squared, Revisited*

**Overall Measures of Model Fit**

Cross-Validation, Revisited

Plotting Parametric Fit Against Nonparametric One

Residuals vs Smoothing

**Partial Residual Plots**

*Diagnostics Related to Individual Predictors*

Plotting Nonparametric Fit Against Each Predictor

The freqparcoord Package

Parallel Coordinates

The regdiag() Function

**The influence() Function**

*Effects of Unusual Observations on Model Fit*

Example: Currency Data

Use of freqparcoord for Outlier Detection

**Median Regression**

*Automated Outlier Resistance*

Example: Currency Data

*Example: Vocabulary Acquisition*

**Example: Pima Diabetes Study**

*Classification Settings*

**Deleting Terms from the Model**

*Improving Fit*

Adding Polynomial Terms

Example: Currency Data

Example: Programmer/Engineer Census Data

Boosting

View from the 30,000 Foot Level

Performance

*A Tool to Aid Model Selection*

*Special Note on the Description Goal*

**Data Wrangling for the Word Bank Dataset**

*Computational Complements*

Mathematical Complements

The Hat Matrix

Matrix Inverse Update

The Median Minimizes Mean Absolute Deviation

*Further Exploration: Data, Code and Math Problems*

Disaggregating Regressor Effects

**A Small Analytical Example**

Example: UCB Admissions Data (Logit)

Example: Baseball Player Data

Simpson's Paradox

The Verdict

**Instrumental Variables (IVs)**

*Unobserved Predictor Variables*The IV Method

2-Stage Least Squares

Example: Years of Schooling

Multiple Predictors

The Verdict

Random Effects Models

Example: Movie Ratings Data, Random Effects

Multiple Random Effects

Why Use Random/Mixed Effects Models?

**Estimating the Counterfactual**

*Regression Function Averaging*

Example: Job Training

Small Area Estimation: "Borrowing from Neighbors"

The Verdict

**The Frequent Occurrence of Extreme Events**

*Multiple Inference*

Relation to Statistical Inference

The Bonferroni Inequality

Scheffe's Method

Example: MovieLens Data

The Verdict

**MovieLens Data Wrangling**

*Computational Complements*

More Data Wrangling in the MovieLens Example

**Iterated Projections**

*Mathematical Complements*

Standard Errors for RFA

Asymptotic Chi-Square Distributions

*Further Exploration: Data, Code and Math Problems*

**Shrinkage Estimators**

*Relevance of James-Stein to Regression Estimation*

**What's All the Fuss About?**

*Multicollinearity*

A Simple Guiding Model

"Wrong" Signs in Estimated Coefficients

Checking for Multicollinearity

The Variance Inflation Factor

Example: Currency Data

What Can/Should One Do?

Do Nothing

Eliminate Some Predictors

Employ a Shrinkage Method

**Alternate Definitions**

*Ridge Regression*

Yes, It Is Smaller

Choosing the Value of λ

Example: Currency Data

**Definition**

*The LASSO*

The lars Package

Example: Currency Data

The Elastic Net

*Why It May Work*

**Cases of Exact Multicollinearity, Including p > n**

Example: R mtcars Data

Additional Motivation for the Elastic Net

**Example: Vertebrae Data**

*Bias, Standard Errors and Significance Tests*

Generalized Linear Models

**James-Stein Theory**

*Other Terminology*

Further Reading

Mathematical Complements

Definition

Theoretical Properties

When Might Shrunken Estimators Be Helpful?

Ridge Action Increases Eigenvalues

**Code for ridgelm()**

*Computational Complements*

*Further Exploration: Data, Code and Math Problems*

**Variable Selection and Dimension Reduction**

**A Closer Look at Under/Overfitting**

A Simple Guiding Example

*How Many Is Too Many?*

**Some Common Measures**

*Fit Criteria*

No Panacea!

**Basic Notion**

*Variable Selection Methods*

Simple Use of p-Values: Pitfalls

Asking "What If" Questions

Stepwise Selection

Forward vs Backward Selection

R Functions for Stepwise Regression

Example: Bodyfat Data

Classification Settings

Example: Bank Marketing Data

Example: Vertebrae Data

Nonparametric Settings

Is Dimension Reduction Important in the Nonparametric Setting?

The LASSO

Why the LASSO Often Performs Subsetting

Example: Bodyfat Data

*Post-Selection Inference*

**Informal Nature**

*Direct Methods for Dimension Reduction*

Role in Regression Analysis

PCA

Issues

Example: Bodyfat Data

Example: Instructor Evaluations

Nonnegative Matrix Factorization (NMF)

Overview

Interpretation

Sum-of-Parts Property

Example: Spam Detection

Use of freqparcoord for Dimension Reduction

Example: Student Evaluations of Instructors

Dimension Reduction for Dummy/R Factor Variables

**Computation for NMF**

*The Verdict*

Further Reading

Computational Complements

**MSEs for the Simple Example**

*Mathematical Complements*

*Further Exploration: Data, Code and Math Problems*

**Partition-Based Methods**

**CART**

Bagging

Example: Vertebral Column Data

Technical Details

Statistical Consistency

Tuning Parameters

Random Forests

Example: Vertebrae Data

Example: Letter Recognition

*Other Implementations of CART*

Further Exploration: Data, Code and Math Problems

**Semi-Linear Methods**

**Extrapolation Via lm()**

*k-NN with Linear Smoothing*

Multicollinearity Issues

Example: Bodyfat Data

Tuning Parameter

**SVMs**

*Linear Approximation of Class Boundaries*

Geometric Motivation

Reduced Convex Hulls

Tuning Parameter

Nonlinear Boundaries

Statistical Consistency

Example: Letter Recognition Data

Neural Networks

Example: Vertebrae Data

Tuning Parameters and Other Technical Details

Dimension Reduction

Statistical Consistency

*The Verdict*

**Mathematical Complements**

**Edge Bias with k-NN and Kernel Methods**

Dual Formulation for SVM

The Kernel Trick

*Further Reading*

Further Exploration: Data, Code and Math Problems

Regression and Classification in Big Data

**Software Alchemy**

*Solving the Big-n Problem*

Example: Flight Delay Data

More on the Insufficient Memory Issue

Deceivingly Big-n

The Independence Assumption in Big-n Data

*How Many Is Too Many?*

**Addressing Big-p**

Toy Model

Results from the Research Literature

A Much Simpler and More Direct Approach

Nonparametric Case

The Curse of Dimensionality

Example: Currency Data

Example: Quiz Documents

The Verdict

**Speedup from Software Alchemy**

*Mathematical Complements*

**The partools Package**

*Computational Complements*

Use of the tm Package

*Further Exploration: Data, Code and Math Problems*

### Biography

**Norman Matloff** is a professor of computer science at the University of California, Davis, and was a founder of the Statistics Department at that institution. **Statistical Regression and Classification: From Linear Models to Machine Learning** was awarded the 2017 Ziegel Award for the best book reviewed in Technometrics in 2017. His current research focus is on recommender systems, and applications of regression methods to small area estimation and bias reduction in observational studies. He is on the editorial boards of the *Journal of Statistical Computation* and the *R Journal*. An award-winning teacher, he is the author of *The Art of R Programming* and *Parallel Computation in Data Science: With Examples in R, C++ and CUDA*.

" . . . Matloff delivers a well-balanced book for advanced beginners. Besides the mathematical formulas, he also presents many chunks of R code, and if the reader is able to read R code, the formulas and calculations become clearer. Due to the computational R code, the well-written Appendix, and an overall clear English, the book will help students and autodidacts. Matloff has written a textbook of the best kind for such a broad topic."

~ Jochen Kruppa, Biometrical Journal

". . . the book is well suitable for a wide audience: For practitioners interested in applying the methodology, for students in statistics as well as economics/social sciences and computer science. Even in more mathematically oriented classes it can be used as a complimentary text to the usual theoretic textbooks deepening students ability to interpret and question statistical results."

~ Claudia Kirch, Magdeburg

"This is an application-oriented book introducing frequently used classification and regression methods and the principles behind them. This book tries to keep a balance between theory and practice. It not only elaborates the theories of statistical regression and classification, but also provides large amount of real world examples and R codes to help the reader practice what they learned. As stated in the preface, the targeted readers are data analysts and college students. The style of the book fits well to the anticipated audience."

~ Quanquan Gu, University of Virginia

"I consider this book as very useful for the practitioner, the instructor and the student. It contains a wealth of material, both conceptual and practical, and above all stimulates the reader to think by him/herself, without being misled by recipes."

~Ricardo Maronna, Statistical Papers