1st Edition

# Statistical Regression and Classification: From Linear Models to Machine Learning

By Norman Matloff Copyright 2017
532 Pages
by Chapman & Hall

528 Pages
by Chapman & Hall

This text provides a modern introduction to regression and classification with an emphasis on big data and R. Each chapter is partitioned into a main body section and an extras section. The main body uses math stat very sparingly and always in the context of something concrete, which means that readers can skip the math stat content entirely if they wish. The extras section is for those who feel comfortable with analysis using math stat.

*Statistical Regression and Classification: From Linear Models to Machine Learning was awarded the 2017 Ziegel Award for the best book reviewed in Technometrics in 2017.*

Setting the Stage
Example: Predicting Bike-Sharing Activity
Example of the Prediction Goal: Body Fat
Example of the Description Goal: Who Clicks Web Ads?
Optimal Prediction
A Note About E(), Samples and Populations
Example: Do Baseball Players Gain Weight As They Age?
Prediction vs Description
A First Estimator
A Possibly Better Estimator, Using a Linear Model
Parametric vs Nonparametric Models
Example: Click-Through Rate
Several Predictor Variables
Multipredictor Linear Models
Estimation of Coefficients
The Description Goal
Nonparametric Regression Estimation: k-NN
Looking at Nearby Points
Measures of Nearness
The k-NN Method, and Tuning Parameters
Nearest-Neighbor Analysis in the regtools Package
Example: Baseball Player Data
After Fitting a Model, How Do We Use It for Prediction?
Parametric Settings
Nonparametric Settings
The Generic predict() Function
Overfitting, and the Variance-Bias Tradeoff
Intuition
Example: Student Evaluations of Instructors
Cross-Validation
Linear Model Case
The Code
Applying the Code
k-NN Case
Choosing the Partition Sizes
Important Note on Tuning Parameters
Rough Rule of Thumb
Example: Bike-Sharing Data
Linear Modeling of μ(t)
Nonparametric Analysis
Interaction Terms, Including Quadratics
Example: Salaries of Female Programmers and Engineers
Higher-Order Polynomial Models
Classification Techniques
It's a Regression Problem!
Example: Bike-Sharing Data
Crucial Advice: Don't Automate, Participate!
Mathematical Complements
Indicator Random Variables
Mean Squared Error of an Estimator
μ(t) Minimizes Mean Squared Prediction Error
μ(t) Minimizes the Misclassification Rate
Kernel-Based Nonparametric Estimation of Regression Functions
General Nonparametric Regression
Some Properties of Conditional Expectation
Conditional Expectation As a Random Variable
The Law of Total Expectation
Law of Total Variance
Tower Property
Geometric View
Computational Complements
CRAN Packages
The Function tapply() and Its Cousins
The Innards of the k-NN Code
Function Dispatch
Centering and Scaling
Further Exploration: Data, Code and Math Problems

Linear Regression Models
Notation
The "Error Term"
Random- vs Fixed-X Cases
Least-Squares Estimation
Motivation
Matrix Formulations
() in Matrix Terms
Using Matrix Operations to Minimize ()
Models Without an Intercept Term
A Closer Look at lm() Output
Statistical Inference
Assumptions
Classical
Motivation: the Multivariate Normal Distribution Family
Unbiasedness and Consistency
β̂ Is Unbiased
Bias As an Issue/Nonissue
β̂ Is Statistically Consistent
Inference under Homoscedasticity
Review: Classical Inference on a Single Mean
Back to Reality
The Concept of a Standard Error
Extension to the Regression Case
Example: Bike-Sharing Data
Collective Predictive Strength of the X(j)
Basic Properties
Definition of R²
Bias Issues
The "Leaving-One-Out Method"
Extensions of LOOM
LOOM for k-NN
Other Measures
The Practical Value of p-Values: Small OR Large
Example: Forest Cover Data
Example: Click Through Data
The Verdict
Missing Values
Mathematical Complements
Covariance Matrices
The Multivariate Normal Distribution Family
The Central Limit Theorem
Details on Models Without a Constant Term
Unbiasedness of the Least-Squares Estimator
Consistency of the Least-Squares Estimator
Biased Nature of S
The Geometry of Conditional Expectation
Random Variables As Inner Product Spaces
Projections
Conditional Expectations As Projections
Predicted Values and Error Terms Are Uncorrelated
Classical "Exact" Inference
Asymptotic (p + 1)-Variate Normality of β̂
Computational Complements
Details of the Computation of ()
R Functions Relating to the Multivariate Normal Distribution Family
Example: Simulation Computation of a Bivariate Normal Quantity
More Details of 'lm' Objects

Homoscedasticity and Other Assumptions in Practice
Normality Assumption
Independence Assumption: Don't Overlook It
Estimation of a Single Mean
Inference on Linear Regression Coefficients
What Can Be Done?
Example: MovieLens Data
Dropping the Homoscedasticity Assumption
Robustness of the Homoscedasticity Assumption
Weighted Least Squares
A Procedure for Valid Inference
The Methodology
Example: Female Wages
Simulation Test
Variance-Stabilizing Transformations
The Verdict
Computational Complements
The R merge() Function
Mathematical Complements
The Delta Method
Distortion Due to Transformation
Further Exploration: Data, Code and Math Problems

Generalized Linear and Nonlinear Models
Example: Enzyme Kinetics Model
The Generalized Linear Model (GLM)
Definition
Poisson Regression
Exponential Families
GLM Computation
R's glm() Function
GLM: the Logistic Model
Motivation
Example: Pima Diabetes Data
Interpretation of Coefficients
The predict() Function Again
Overall Prediction Accuracy
Example: Predicting Spam E-mail
Linear Boundary
GLM: the Poisson Regression Model
Least-Squares Computation for Nonlinear Models
The Gauss-Newton Method
Eicker-White Asymptotic Standard Errors
Example: Bike Sharing Data
The "Elephant in the Room": Convergence Issues
Computational Complements
R Factors
Mathematical Complements
Maximum Likelihood Estimation
Further Exploration: Data, Code and Math Problems

Multiclass Classification Problems
Key Notation
Key Equations
Estimating the Functions μ_i(t)
How Do We Use Models for Prediction?
One vs All or All vs All?
Which Is Better?
Example: Vertebrae Data
Intuition
Example: Letter Recognition Data
Example: k-NN on the Letter Recognition Data
The Verdict
The Classical Approach: Fisher Linear Discriminant Analysis
Background
Derivation
Example: Vertebrae Data
LDA Code and Results
Multinomial Logistic Model
Model
Software
Example: Vertebrae Data
The Issue of "Unbalanced" (and Balanced) Data
Why the Concern Regarding Balance?
A Crucial Sampling Issue
It All Depends on How We Sample
Remedies
Example: Letter Recognition
Going Beyond Using the 0.5 Threshold
Unequal Misclassification Costs
Revisiting the Problem of Unbalanced Data
The Confusion Matrix and the ROC Curve
Code
Example: Spam Data
Mathematical Complements
Classification via Density Estimation
Methods for Density Estimation
Time Complexity Comparison, OVA vs AVA
Optimal Classification Rule for Unequal Error Costs
Computational Complements
R Code for OVA and AVA Logit Analysis
ROC Code
Further Exploration: Data, Code and Math Problems

Model Fit: Assessment and Improvement
Aims of This Chapter
Methods
Notation
Goals of Model Fit-Checking
Prediction Context
Description Context
Center vs Fringes of the Data Set
Example: Currency Data
Overall Measures of Model Fit
R-Squared, Revisited
Cross-Validation, Revisited
Plotting Parametric Fit Against Nonparametric One
Residuals vs Smoothing
Diagnostics Related to Individual Predictors
Partial Residual Plots
Plotting Nonparametric Fit Against Each Predictor
The freqparcoord Package
Parallel Coordinates
The regdiag() Function
Effects of Unusual Observations on Model Fit
The influence() Function
Example: Currency Data
Use of freqparcoord for Outlier Detection
Automated Outlier Resistance
Median Regression
Example: Currency Data
Example: Vocabulary Acquisition
Classification Settings
Example: Pima Diabetes Study
Improving Fit
Deleting Terms from the Model
Example: Currency Data
Example: Programmer/Engineer Census Data
Boosting
View from the 30,000 Foot Level
Performance
A Tool to Aid Model Selection
Special Note on the Description Goal
Computational Complements
Data Wrangling for the Word Bank Dataset
Mathematical Complements
The Hat Matrix
Matrix Inverse Update
The Median Minimizes Mean Absolute Deviation
Further Exploration: Data, Code and Math Problems

Disaggregating Regressor Effects
A Small Analytical Example
Example: Baseball Player Data
Example: UCB Admissions Data (Logit)
The Verdict
Unobserved Predictor Variables
Instrumental Variables (IVs)
The IV Method
Two-Stage Least Squares
Example: Years of Schooling
Multiple Predictors
The Verdict
Random Effects Models
Example: Movie Ratings Data, Random Effects
Multiple Random Effects
Why Use Random/Mixed Effects Models?
Regression Function Averaging
Estimating the Counterfactual
Example: Job Training
Small Area Estimation: "Borrowing from Neighbors"
The Verdict
Multiple Inference
The Frequent Occurrence of Extreme Events
Relation to Statistical Inference
The Bonferroni Inequality
Scheffe's Method
Example: MovieLens Data
The Verdict
Computational Complements
MovieLens Data Wrangling
More Data Wrangling in the MovieLens Example
Mathematical Complements
Iterated Projections
Standard Errors for RFA
Asymptotic Chi-Square Distributions
Further Exploration: Data, Code and Math Problems

Shrinkage Estimators
Relevance of James-Stein to Regression Estimation
Multicollinearity
What's All the Fuss About?
A Simple Guiding Model
"Wrong" Signs in Estimated Coefficients
Checking for Multicollinearity
The Variance Inflation Factor
Example: Currency Data
What Can/Should One Do?
Do Nothing
Eliminate Some Predictors
Employ a Shrinkage Method
Ridge Regression
Alternate Definitions
Yes, It Is Smaller
Choosing the Value of λ
Example: Currency Data
The LASSO
Definition
The lars Package
Example: Currency Data
The Elastic Net
Cases of Exact Multicollinearity, Including p > n
Why It May Work
Example: R mtcars Data
Additional Motivation for the Elastic Net
Bias, Standard Errors and Significance Tests
Generalized Linear Models
Example: Vertebrae Data
Other Terminology
Mathematical Complements
James-Stein Theory
Definition
Theoretical Properties
When Might Shrunken Estimators Be Helpful?
Ridge Action Increases Eigenvalues
Computational Complements
Code for ridgelm()
Further Exploration: Data, Code and Math Problems

Variable Selection and Dimension Reduction
A Closer Look at Under/Overfitting
A Simple Guiding Example
How Many Is Too Many?
Fit Criteria
Some Common Measures
No Panacea!
Variable Selection Methods
Simple Use of p-Values: Pitfalls
Asking "What If" Questions
Stepwise Selection
Basic Notion
Forward vs Backward Selection
R Functions for Stepwise Regression
Example: Bodyfat Data
Classification Settings
Example: Bank Marketing Data
Example: Vertebrae Data
Nonparametric Settings
Is Dimension Reduction Important in the Nonparametric Setting?
The LASSO
Why the LASSO Often Performs Subsetting
Example: Bodyfat Data
Post-Selection Inference
Direct Methods for Dimension Reduction
Informal Nature
Role in Regression Analysis
PCA
Issues
Example: Bodyfat Data
Example: Instructor Evaluations
Nonnegative Matrix Factorization (NMF)
Overview
Interpretation
Sum-of-Parts Property
Example: Spam Detection
Use of freqparcoord for Dimension Reduction
Example: Student Evaluations of Instructors
Dimension Reduction for Dummy/R Factor Variables
The Verdict
Computational Complements
Computation for NMF
Mathematical Complements
MSEs for the Simple Example
Further Exploration: Data, Code and Math Problems

Partition-Based Methods
CART
Example: Vertebral Column Data
Technical Details
Statistical Consistency
Tuning Parameters
Random Forests
Bagging
Example: Vertebrae Data
Example: Letter Recognition
Other Implementations of CART
Further Exploration: Data, Code and Math Problems

Semi-Linear Methods
k-NN with Linear Smoothing
Extrapolation Via lm()
Multicollinearity Issues
Example: Bodyfat Data
Tuning Parameter
Linear Approximation of Class Boundaries
SVMs
Geometric Motivation
Reduced Convex Hulls
Tuning Parameter
Nonlinear Boundaries
Statistical Consistency
Example: Letter Recognition Data
Neural Networks
Example: Vertebrae Data
Tuning Parameters and Other Technical Details
Dimension Reduction
Statistical Consistency
The Verdict
Mathematical Complements
Edge Bias with k-NN and Kernel Methods
Dual Formulation for SVM
The Kernel Trick
Further Exploration: Data, Code and Math Problems

Regression and Classification in Big Data
Solving the Big-n Problem
Software Alchemy
Example: Flight Delay Data
More on the Insufficient Memory Issue
Deceivingly Big-n
The Independence Assumption in Big-n Data
How Many Is Too Many?
Toy Model
Results from the Research Literature
A Much Simpler and More Direct Approach
Nonparametric Case
The Curse of Dimensionality
Example: Currency Data
Example: Quiz Documents
The Verdict
Mathematical Complements
Speedup from Software Alchemy
Computational Complements
The partools Package
Use of the tm Package
Further Exploration: Data, Code and Math Problems

### Biography

Norman Matloff is a professor of computer science at the University of California, Davis, and was a founder of the Statistics Department at that institution. Statistical Regression and Classification: From Linear Models to Machine Learning was awarded the 2017 Ziegel Award for the best book reviewed in Technometrics in 2017. His current research focus is on recommender systems, and applications of regression methods to small area estimation and bias reduction in observational studies. He is on the editorial boards of the Journal of Statistical Computation and the R Journal. An award-winning teacher, he is the author of The Art of R Programming and Parallel Computation in Data Science: With Examples in R, C++ and CUDA.

" . . . Matloff delivers a well-balanced book for advanced beginners. Besides the mathematical formulas, he also presents many chunks of R code, and if the reader is able to read R code, the formulas and calculations become clearer. Due to the computational R code, the well-written Appendix, and an overall clear English, the book will help students and autodidacts. Matloff has written a textbook of the best kind for such a broad topic."
~ Jochen Kruppa, Biometrical Journal

". . . the book is well suitable for a wide audience: For practitioners interested in applying the methodology, for students in statistics as well as economics/social sciences and computer science. Even in more mathematically oriented classes it can be used as a complementary text to the usual theoretic textbooks deepening students' ability to interpret and question statistical results."
~ Claudia Kirch, Magdeburg

"This is an application-oriented book introducing frequently used classification and regression methods and the principles behind them. This book tries to keep a balance between theory and practice. It not only elaborates the theories of statistical regression and classification, but also provides large amount of real world examples and R codes to help the reader practice what they learned. As stated in the preface, the targeted readers are data analysts and college students. The style of the book fits well to the anticipated audience."
~ Quanquan Gu, University of Virginia

"I consider this book as very useful for the practitioner, the instructor and the student. It contains a wealth of material, both conceptual and practical, and above all stimulates the reader to think by him/herself, without being misled by recipes."
~ Ricardo Maronna, Statistical Papers