Statistical Regression and Classification
From Linear Models to Machine Learning
Book Description
Statistical Regression and Classification: From Linear Models to Machine Learning takes an innovative look at the traditional statistical regression course, presenting a contemporary treatment in line with today's applications and users. The text takes a modern look at regression:
* A thorough treatment of classical linear and generalized linear models, supplemented with introductory material on machine learning methods.
* Since classification is the focus of many contemporary applications, the book covers this topic in detail, especially the multiclass case.
* In view of the voluminous nature of many modern datasets, there is a chapter on Big Data.
* Special Mathematical and Computational Complements sections appear at the ends of chapters, and exercises are partitioned into Data, Math and Complements problems.
* Instructors can tailor coverage for specific audiences such as majors in Statistics, Computer Science, or Economics.
* More than 75 examples using real data.
The book treats classical regression methods in an innovative, contemporary manner. Though some statistical learning methods are introduced, the primary methodology used is linear and generalized linear parametric models, covering both the Description and Prediction goals of regression methods. The author is just as interested in Description applications of regression, such as measuring the gender wage gap in Silicon Valley, as in forecasting tomorrow's demand for bike rentals. An entire chapter is devoted to measuring such effects, including discussion of Simpson's Paradox, multiple inference, and causation issues. Similarly, there is an entire chapter on parametric model fit, making use of both residual analysis and assessment via nonparametric analysis.
Norman Matloff is a professor of computer science at the University of California, Davis, and was a founder of the Statistics Department at that institution. His current research focus is on recommender systems, and applications of regression methods to small area estimation and bias reduction in observational studies. He is on the editorial boards of the Journal of Statistical Computation and the R Journal. An award-winning teacher, he is the author of The Art of R Programming and Parallel Computation in Data Science: With Examples in R, C++ and CUDA.
Table of Contents
*Statistical Regression and Classification: From Linear Models to Machine Learning was awarded the 2017 Ziegel Award for the best book reviewed in Technometrics in 2017.*
Setting the Stage
Example: Predicting Bike-Sharing Activity
Example of the Prediction Goal: Body Fat
Example of the Description Goal: Who Clicks Web Ads?
Optimal Prediction
A Note About E(), Samples and Populations
Example: Do Baseball Players Gain Weight As They Age?
Prediction vs Description
A First Estimator
A Possibly Better Estimator, Using a Linear Model
Parametric vs Nonparametric Models
Example: Click-Through Rate
Several Predictor Variables
Multipredictor Linear Models
Estimation of Coefficients
The Description Goal
Nonparametric Regression Estimation: kNN
Looking at Nearby Points
Measures of Nearness
The kNN Method, and Tuning Parameters
Nearest-Neighbor Analysis in the regtools Package
Example: Baseball Player Data
After Fitting a Model, How Do We Use It for Prediction?
Parametric Settings
Nonparametric Settings
The Generic predict() Function
Overfitting, and the Variance-Bias Tradeoff
Intuition
Example: Student Evaluations of Instructors
Cross-Validation
Linear Model Case
The Code
Applying the Code
kNN Case
Choosing the Partition Sizes
Important Note on Tuning Parameters
Rough Rule of Thumb
Example: Bike-Sharing Data
Linear Modeling of μ(t)
Nonparametric Analysis
Interaction Terms, Including Quadratics
Example: Salaries of Female Programmers and Engineers
Saving Your Work
Higher-Order Polynomial Models
Classification Techniques
It's a Regression Problem!
Example: Bike-Sharing Data
Crucial Advice: Don't Automate, Participate!
Mathematical Complements
Indicator Random Variables
Mean Squared Error of an Estimator
μ(t) Minimizes Mean Squared Prediction Error
μ(t) Minimizes the Misclassification Rate
Kernel-Based Nonparametric Estimation of Regression Functions
General Nonparametric Regression
Some Properties of Conditional Expectation
Conditional Expectation As a Random Variable
The Law of Total Expectation
Law of Total Variance
Tower Property
Geometric View
Computational Complements
CRAN Packages
The Function tapply() and Its Cousins
The Innards of the kNN Code
Function Dispatch
Centering and Scaling
Further Exploration: Data, Code and Math Problems
Linear Regression Models
Notation
The "Error Term"
Random- vs Fixed-X Cases
Least-Squares Estimation
Motivation
Matrix Formulations
() in Matrix Terms
Using Matrix Operations to Minimize ()
Models Without an Intercept Term
A Closer Look at lm() Output
Statistical Inference
Assumptions
Classical
Motivation: the Multivariate Normal Distribution Family
Unbiasedness and Consistency
β̂ Is Unbiased
Bias As an Issue/Nonissue
β̂ Is Statistically Consistent
Inference under Homoscedasticity
Review: Classical Inference on a Single Mean
Back to Reality
The Concept of a Standard Error
Extension to the Regression Case
Example: Bike-Sharing Data
Collective Predictive Strength of the X(j)
Basic Properties
Definition of R²
Bias Issues
Adjusted-R²
The "Leaving-One-Out Method"
Extensions of LOOM
LOOM for kNN
Other Measures
The Practical Value of p-Values: Small OR Large
Misleadingly Small p-Values
Example: Forest Cover Data
Example: Click Through Data
Misleadingly LARGE p-Values
The Verdict
Missing Values
Mathematical Complements
Covariance Matrices
The Multivariate Normal Distribution Family
The Central Limit Theorem
Details on Models Without a Constant Term
Unbiasedness of the Least-Squares Estimator
Consistency of the Least-Squares Estimator
Biased Nature of S
The Geometry of Conditional Expectation
Random Variables As Inner Product Spaces
Projections
Conditional Expectations As Projections
Predicted Values and Error Terms Are Uncorrelated
Classical "Exact" Inference
Asymptotic (p + 1)-Variate Normality of β̂
Computational Complements
Details of the Computation of ()
R Functions Relating to the Multivariate Normal Distribution Family
Example: Simulation Computation of a Bivariate Normal Quantity
More Details of 'lm' Objects
Homoscedasticity and Other Assumptions in Practice
Normality Assumption
Independence Assumption: Don't Overlook It
Estimation of a Single Mean
Inference on Linear Regression Coefficients
What Can Be Done?
Example: MovieLens Data
Dropping the Homoscedasticity Assumption
Robustness of the Homoscedasticity Assumption
Weighted Least Squares
A Procedure for Valid Inference
The Methodology
Example: Female Wages
Simulation Test
Variance-Stabilizing Transformations
The Verdict
Further Reading
Computational Complements
The R merge() Function
Mathematical Complements
The Delta Method
Distortion Due to Transformation
Further Exploration: Data, Code and Math Problems
Generalized Linear and Nonlinear Models
Example: Enzyme Kinetics Model
The Generalized Linear Model (GLM)
Definition
Poisson Regression
Exponential Families
GLM Computation
R's glm() Function
GLM: the Logistic Model
Motivation
Example: Pima Diabetes Data
Interpretation of Coefficients
The predict() Function Again
Overall Prediction Accuracy
Example: Predicting Spam Email
Linear Boundary
GLM: the Poisson Regression Model
Least-Squares Computation for Nonlinear Models
The Gauss-Newton Method
Eicker-White Asymptotic Standard Errors
Example: Bike Sharing Data
The "Elephant in the Room": Convergence Issues
Further Reading
Computational Complements
R Factors
Mathematical Complements
Maximum Likelihood Estimation
Further Exploration: Data, Code and Math Problems
Multiclass Classification Problems
Key Notation
Key Equations
Estimating the Functions μ_i(t)
How Do We Use Models for Prediction?
One vs All or All vs All?
Which Is Better?
Example: Vertebrae Data
Intuition
Example: Letter Recognition Data
Example: kNN on the Letter Recognition Data
The Verdict
The Classical Approach: Fisher Linear Discriminant Analysis
Background
Derivation
Example: Vertebrae Data
LDA Code and Results
Multinomial Logistic Model
Model
Software
Example: Vertebrae Data
The Issue of "Unbalanced" (and Balanced) Data
Why the Concern Regarding Balance?
A Crucial Sampling Issue
It All Depends on How We Sample
Remedies
Example: Letter Recognition
Going Beyond Using the 0.5 Threshold
Unequal Misclassification Costs
Revisiting the Problem of Unbalanced Data
The Confusion Matrix and the ROC Curve
Code
Example: Spam Data
Mathematical Complements
Classification via Density Estimation
Methods for Density Estimation
Time Complexity Comparison, OVA vs AVA
Optimal Classification Rule for Unequal Error Costs
Computational Complements
R Code for OVA and AVA Logit Analysis
ROC Code
Further Exploration: Data, Code and Math Problems
Model Fit: Assessment and Improvement
Aims of This Chapter
Methods
Notation
Goals of Model Fit-Checking
Prediction Context
Description Context
Center vs Fringes of the Data Set
Example: Currency Data
Overall Measures of Model Fit
R-Squared, Revisited
Cross-Validation, Revisited
Plotting Parametric Fit Against Nonparametric One
Residuals vs Smoothing
Diagnostics Related to Individual Predictors
Partial Residual Plots
Plotting Nonparametric Fit Against Each Predictor
The freqparcoord Package
Parallel Coordinates
The regdiag() Function
Effects of Unusual Observations on Model Fit
The influence() Function
Example: Currency Data
Use of freqparcoord for Outlier Detection
Automated Outlier Resistance
Median Regression
Example: Currency Data
Example: Vocabulary Acquisition
Classification Settings
Example: Pima Diabetes Study
Improving Fit
Deleting Terms from the Model
Adding Polynomial Terms
Example: Currency Data
Example: Programmer/Engineer Census Data
Boosting
View from the 30,000 Foot Level
Performance
A Tool to Aid Model Selection
Special Note on the Description Goal
Computational Complements
Data Wrangling for the Word Bank Dataset
Mathematical Complements
The Hat Matrix
Matrix Inverse Update
The Median Minimizes Mean Absolute Deviation
Further Exploration: Data, Code and Math Problems
Disaggregating Regressor Effects
A Small Analytical Example
Example: Baseball Player Data
Simpson's Paradox
Example: UCB Admissions Data (Logit)
The Verdict
Unobserved Predictor Variables
Instrumental Variables (IVs)
The IV Method
Two-Stage Least Squares
Example: Years of Schooling
Multiple Predictors
The Verdict
Random Effects Models
Example: Movie Ratings Data, Random Effects
Multiple Random Effects
Why Use Random/Mixed Effects Models?
Regression Function Averaging
Estimating the Counterfactual
Example: Job Training
Small Area Estimation: "Borrowing from Neighbors"
The Verdict
Multiple Inference
The Frequent Occurrence of Extreme Events
Relation to Statistical Inference
The Bonferroni Inequality
Scheffé's Method
Example: MovieLens Data
The Verdict
Computational Complements
MovieLens Data Wrangling
More Data Wrangling in the MovieLens Example
Mathematical Complements
Iterated Projections
Standard Errors for RFA
Asymptotic Chi-Square Distributions
Further Exploration: Data, Code and Math Problems
Shrinkage Estimators
Relevance of James-Stein to Regression Estimation
Multicollinearity
What's All the Fuss About?
A Simple Guiding Model
"Wrong" Signs in Estimated Coefficients
Checking for Multicollinearity
The Variance Inflation Factor
Example: Currency Data
What Can/Should One Do?
Do Nothing
Eliminate Some Predictors
Employ a Shrinkage Method
Ridge Regression
Alternate Definitions
Yes, It Is Smaller
Choosing the Value of λ
Example: Currency Data
The LASSO
Definition
The lars Package
Example: Currency Data
The Elastic Net
Cases of Exact Multicollinearity, Including p > n
Why It May Work
Example: R mtcars Data
Additional Motivation for the Elastic Net
Bias, Standard Errors and Significance Tests
Generalized Linear Models
Example: Vertebrae Data
Other Terminology
Further Reading
Mathematical Complements
James-Stein Theory
Definition
Theoretical Properties
When Might Shrunken Estimators Be Helpful?
Ridge Action Increases Eigenvalues
Computational Complements
Code for ridgelm()
Further Exploration: Data, Code and Math Problems
Variable Selection and Dimension Reduction
A Closer Look at Under/Overfitting
A Simple Guiding Example
How Many Is Too Many?
Fit Criteria
Some Common Measures
No Panacea!
Variable Selection Methods
Simple Use of p-Values: Pitfalls
Asking "What If" Questions
Stepwise Selection
Basic Notion
Forward vs Backward Selection
R Functions for Stepwise Regression
Example: Bodyfat Data
Classification Settings
Example: Bank Marketing Data
Example: Vertebrae Data
Nonparametric Settings
Is Dimension Reduction Important in the Nonparametric Setting?
The LASSO
Why the LASSO Often Performs Subsetting
Example: Bodyfat Data
Post-Selection Inference
Direct Methods for Dimension Reduction
Informal Nature
Role in Regression Analysis
PCA
Issues
Example: Bodyfat Data
Example: Instructor Evaluations
Nonnegative Matrix Factorization (NMF)
Overview
Interpretation
Sum-of-Parts Property
Example: Spam Detection
Use of freqparcoord for Dimension Reduction
Example: Student Evaluations of Instructors
Dimension Reduction for Dummy/R Factor Variables
The Verdict
Further Reading
Computational Complements
Computation for NMF
Mathematical Complements
MSEs for the Simple Example
Further Exploration: Data, Code and Math Problems
Partition-Based Methods
CART
Example: Vertebral Column Data
Technical Details
Statistical Consistency
Tuning Parameters
Random Forests
Bagging
Example: Vertebrae Data
Example: Letter Recognition
Other Implementations of CART
Further Exploration: Data, Code and Math Problems
Semi-Linear Methods
kNN with Linear Smoothing
Extrapolation Via lm()
Multicollinearity Issues
Example: Bodyfat Data
Tuning Parameter
Linear Approximation of Class Boundaries
SVMs
Geometric Motivation
Reduced Convex Hulls
Tuning Parameter
Nonlinear Boundaries
Statistical Consistency
Example: Letter Recognition Data
Neural Networks
Example: Vertebrae Data
Tuning Parameters and Other Technical Details
Dimension Reduction
Statistical Consistency
The Verdict
Mathematical Complements
Edge Bias with kNN and Kernel Methods
Dual Formulation for SVM
The Kernel Trick
Further Reading
Further Exploration: Data, Code and Math Problems
Regression and Classification in Big Data
Solving the Big-n Problem
Software Alchemy
Example: Flight Delay Data
More on the Insufficient Memory Issue
Deceivingly Big n
The Independence Assumption in Big-n Data
Addressing Big-p
How Many Is Too Many?
Toy Model
Results from the Research Literature
A Much Simpler and More Direct Approach
Nonparametric Case
The Curse of Dimensionality
Example: Currency Data
Example: Quiz Documents
The Verdict
Mathematical Complements
Speedup from Software Alchemy
Computational Complements
The partools Package
Use of the tm Package
Further Exploration: Data, Code and Math Problems
Author(s)
Biography
Norman Matloff is a professor of computer science at the University of California, Davis, and was a founder of the Statistics Department at that institution. His current research focus is on recommender systems, and applications of regression methods to small area estimation and bias reduction in observational studies. He is on the editorial boards of the Journal of Statistical Computation and the R Journal. An award-winning teacher, he is the author of The Art of R Programming and Parallel Computation in Data Science: With Examples in R, C++ and CUDA. Statistical Regression and Classification: From Linear Models to Machine Learning was awarded the 2017 Ziegel Award for the best book reviewed in Technometrics in 2017.
Reviews
" . . . Matloff delivers a well-balanced book for advanced beginners. Besides the mathematical formulas, he also presents many chunks of R code, and if the reader is able to read R code, the formulas and calculations become clearer. Due to the computational R code, the well-written Appendix, and an overall clear English, the book will help students and autodidacts. Matloff has written a textbook of the best kind for such a broad topic."
~ Jochen Kruppa, Biometric Journal
". . . the book is well suitable for a wide audience: For practitioners interested in applying the methodology, for students in statistics as well as economics/social sciences and computer science. Even in more mathematically oriented classes it can be used as a complimentary text to the usual theoretic textbooks deepening students ability to interpret and question statistical results."
~ Claudia Kirch, Magdeburg
"This is an application-oriented book introducing frequently used classification and regression methods and the principles behind them. This book tries to keep a balance between theory and practice. It not only elaborates the theories of statistical regression and classification, but also provides large amount of real world examples and R codes to help the reader practice what they learned. As stated in the preface, the targeted readers are data analysts and college students. The style of the book fits well to the anticipated audience."
~ Quanquan Gu, University of Virginia
"I consider this book as very useful for the practitioner, the instructor and the student. It contains a wealth of material, both conceptual and practical, and above all stimulates the reader to think by him/herself, without being misled by recipes."
~ Ricardo Maronna, Statistical Papers