1st Edition

Statistical Regression and Classification: From Linear Models to Machine Learning

By Norman Matloff
Copyright 2017
    532 Pages
    by Chapman & Hall

    This text provides a modern introduction to regression and classification with an emphasis on big data and R. Each chapter is partitioned into a main body section and an extras section. The main body uses math stat very sparingly and always in the context of something concrete, which means that readers can skip the math stat content entirely if they wish. The extras section is for those who feel comfortable with analysis using math stat.

    *Statistical Regression and Classification: From Linear Models to Machine Learning was awarded the 2017 Ziegel Award for the best book reviewed in Technometrics in 2017.*


    Setting the Stage
    Example: Predicting Bike-Sharing Activity
    Example of the Prediction Goal: Body Fat
    Example of the Description Goal: Who Clicks Web Ads?
    Optimal Prediction
    A Note About E(), Samples and Populations
    Example: Do Baseball Players Gain Weight As They Age?

    Prediction vs Description
    A First Estimator
    A Possibly Better Estimator, Using a Linear Model
    Parametric vs Nonparametric Models
    Example: Click-Through Rate
    Several Predictor Variables
    Multipredictor Linear Models
    Estimation of Coefficients
    The Description Goal
    Nonparametric Regression Estimation: k-NN
    Looking at Nearby Points
    Measures of Nearness
    The k-NN Method, and Tuning Parameters
    Nearest-Neighbor Analysis in the regtools Package
    Example: Baseball Player Data
    After Fitting a Model, How Do We Use It for Prediction?
    Parametric Settings
    Nonparametric Settings
    The Generic predict() Function
    Overfitting, and the Variance-Bias Tradeoff
    Intuition
    Example: Student Evaluations of Instructors
    Cross-Validation
    Linear Model Case
    The Code
    Applying the Code
    k-NN Case
    Choosing the Partition Sizes
    Important Note on Tuning Parameters
    Rough Rule of Thumb
    Example: Bike-Sharing Data
    Linear Modeling of μ(t)
    Nonparametric Analysis
    Interaction Terms, Including Quadratics
    Example: Salaries of Female Programmers and Engineers
    Saving Your Work
    Higher-Order Polynomial Models
    Classification Techniques
    It's a Regression Problem!
    Example: Bike-Sharing Data
    Crucial Advice: Don't Automate, Participate!
    Mathematical Complements
    Indicator Random Variables
    Mean Squared Error of an Estimator
    μ(t) Minimizes Mean Squared Prediction Error
    μ(t) Minimizes the Misclassification Rate
    Kernel-Based Nonparametric Estimation of Regression Functions
    General Nonparametric Regression
    Some Properties of Conditional Expectation
    Conditional Expectation As a Random Variable
    The Law of Total Expectation
    Law of Total Variance
    Tower Property
    Geometric View
    Computational Complements
    CRAN Packages
    The Function tapply() and Its Cousins
    The Innards of the k-NN Code
    Function Dispatch
    Centering and Scaling
    Further Exploration: Data, Code and Math Problems


    Linear Regression Models

    Notation
    The "Error Term"
    Random- vs Fixed-X Cases
    Least-Squares Estimation
    Motivation
    Matrix Formulations
    () in Matrix Terms
    Using Matrix Operations to Minimize ()
    Models Without an Intercept Term
    A Closer Look at lm() Output
    Statistical Inference
    Assumptions
    Classical
    Motivation: the Multivariate Normal Distribution Family
    Unbiasedness and Consistency
    β̂ Is Unbiased
    Bias As an Issue/Nonissue
    β̂ Is Statistically Consistent
    Inference under Homoscedasticity
    Review: Classical Inference on a Single Mean
    Back to Reality
    The Concept of a Standard Error
    Extension to the Regression Case
    Example: Bike-Sharing Data
    Collective Predictive Strength of the X(j)
    Basic Properties
    Definition of R²
    Bias Issues
    Adjusted-R²
    The "Leaving-One-Out Method"
    Extensions of LOOM
    LOOM for k-NN
    Other Measures
    The Practical Value of p-Values: Small OR Large
    Misleadingly Small p-Values
    Example: Forest Cover Data
    Example: Click Through Data
    Misleadingly LARGE p-Values
    The Verdict
    Missing Values
    Mathematical Complements

    Covariance Matrices
    The Multivariate Normal Distribution Family
    The Central Limit Theorem
    Details on Models Without a Constant Term
    Unbiasedness of the Least-Squares Estimator
    Consistency of the Least-Squares Estimator
    Biased Nature of S
    The Geometry of Conditional Expectation
    Random Variables As Inner Product Spaces
    Projections
    Conditional Expectations As Projections
    Predicted Values and Error Terms Are Uncorrelated
    Classical "Exact" Inference
    Asymptotic (p + 1)-Variate Normality of β̂
    Computational Complements
    Details of the Computation of ()
    R Functions Relating to the Multivariate Normal Distribution Family
    Example: Simulation Computation of a Bivariate Normal Quantity
    More Details of 'lm' Objects

    Homoscedasticity and Other Assumptions in Practice
    Normality Assumption
    Independence Assumption: Don't Overlook It
    Estimation of a Single Mean
    Inference on Linear Regression Coefficients
    What Can Be Done?
    Example: MovieLens Data
    Dropping the Homoscedasticity Assumption
    Robustness of the Homoscedasticity Assumption
    Weighted Least Squares
    A Procedure for Valid Inference
    The Methodology
    Example: Female Wages
    Simulation Test
    Variance-Stabilizing Transformations
    The Verdict
    Further Reading
    Computational Complements
    The R merge() Function
    Mathematical Complements
    The Delta Method
    Distortion Due to Transformation
    Further Exploration: Data, Code and Math Problems

    Generalized Linear and Nonlinear Models
    Example: Enzyme Kinetics Model
    The Generalized Linear Model (GLM)
    Definition
    Poisson Regression
    Exponential Families
    GLM Computation
    R's glm() Function
    GLM: the Logistic Model
    Motivation
    Example: Pima Diabetes Data
    Interpretation of Coefficients
    The predict() Function Again
    Overall Prediction Accuracy
    Example: Predicting Spam E-mail
    Linear Boundary
    GLM: the Poisson Regression Model
    Least-Squares Computation for Nonlinear Models
    The Gauss-Newton Method
    Eicker-White Asymptotic Standard Errors
    Example: Bike Sharing Data
    The "Elephant in the Room": Convergence Issues
    Further Reading
    Computational Complements
    R Factors
    Mathematical Complements
    Maximum Likelihood Estimation
    Further Exploration: Data, Code and Math Problems

    Multiclass Classification Problems
    Key Notation
    Key Equations
    Estimating the Functions μ_i(t)
    How Do We Use Models for Prediction?
    One vs All or All vs All?
    Which Is Better?
    Example: Vertebrae Data
    Intuition
    Example: Letter Recognition Data
    Example: k-NN on the Letter Recognition Data
    The Verdict
    The Classical Approach: Fisher Linear Discriminant Analysis
    Background
    Derivation
    Example: Vertebrae Data
    LDA Code and Results
    Multinomial Logistic Model
    Model
    Software
    Example: Vertebrae Data
    The Issue of "Unbalanced" (and Balanced) Data
    Why the Concern Regarding Balance?
    A Crucial Sampling Issue
    It All Depends on How We Sample
    Remedies
    Example: Letter Recognition
    Going Beyond Using the 0.5 Threshold
    Unequal Misclassification Costs
    Revisiting the Problem of Unbalanced Data
    The Confusion Matrix and the ROC Curve
    Code
    Example: Spam Data
    Mathematical Complements
    Classification via Density Estimation
    Methods for Density Estimation
    Time Complexity Comparison, OVA vs AVA
    Optimal Classification Rule for Unequal Error Costs
    Computational Complements
    R Code for OVA and AVA Logit Analysis
    ROC Code
    Further Exploration: Data, Code and Math Problems

    Model Fit: Assessment and Improvement
    Aims of This Chapter
    Methods
    Notation
    Goals of Model Fit-Checking
    Prediction Context
    Description Context
    Center vs Fringes of the Data Set
    Example: Currency Data
    Overall Measures of Model Fit
    R-Squared, Revisited
    Cross-Validation, Revisited
    Plotting Parametric Fit Against Nonparametric One
    Residuals vs Smoothing
    Diagnostics Related to Individual Predictors
    Partial Residual Plots
    Plotting Nonparametric Fit Against Each Predictor
    The freqparcoord Package
    Parallel Coordinates
    The regdiag() Function
    Effects of Unusual Observations on Model Fit
    The influence() Function
    Example: Currency Data
    Use of freqparcoord for Outlier Detection
    Automated Outlier Resistance
    Median Regression
    Example: Currency Data
    Example: Vocabulary Acquisition
    Classification Settings
    Example: Pima Diabetes Study
    Improving Fit
    Deleting Terms from the Model
    Adding Polynomial Terms
    Example: Currency Data
    Example: Programmer/Engineer Census Data
    Boosting
    View from the 30,000 Foot Level
    Performance
    A Tool to Aid Model Selection
    Special Note on the Description Goal
    Computational Complements
    Data Wrangling for the Word Bank Dataset
    Mathematical Complements
    The Hat Matrix
    Matrix Inverse Update
    The Median Minimizes Mean Absolute Deviation
    Further Exploration: Data, Code and Math Problems


    Disaggregating Regressor Effects

    A Small Analytical Example
    Example: Baseball Player Data
    Simpson's Paradox
    Example: UCB Admissions Data (Logit)
    The Verdict
    Unobserved Predictor Variables
    Instrumental Variables (IVs)
    The IV Method
    Two-Stage Least Squares
    Example: Years of Schooling
    Multiple Predictors
    The Verdict
    Random Effects Models
    Example: Movie Ratings Data, Random Effects
    Multiple Random Effects
    Why Use Random/Mixed Effects Models?
    Regression Function Averaging
    Estimating the Counterfactual
    Example: Job Training
    Small Area Estimation: "Borrowing from Neighbors"
    The Verdict
    Multiple Inference
    The Frequent Occurrence of Extreme Events
    Relation to Statistical Inference
    The Bonferroni Inequality
    Scheffe's Method
    Example: MovieLens Data
    The Verdict
    Computational Complements
    MovieLens Data Wrangling
    More Data Wrangling in the MovieLens Example
    Mathematical Complements
    Iterated Projections
    Standard Errors for RFA
    Asymptotic Chi-Square Distributions
    Further Exploration: Data, Code and Math Problems

    Shrinkage Estimators
    Relevance of James-Stein to Regression Estimation
    Multicollinearity
    What's All the Fuss About?
    A Simple Guiding Model
    "Wrong" Signs in Estimated Coefficients
    Checking for Multicollinearity
    The Variance Inflation Factor
    Example: Currency Data
    What Can/Should One Do?
    Do Nothing
    Eliminate Some Predictors
    Employ a Shrinkage Method
    Ridge Regression
    Alternate Definitions
    Yes, It Is Smaller
    Choosing the Value of λ
    Example: Currency Data
    The LASSO
    Definition
    The lars Package
    Example: Currency Data
    The Elastic Net
    Cases of Exact Multicollinearity, Including p > n
    Why It May Work
    Example: R mtcars Data
    Additional Motivation for the Elastic Net
    Bias, Standard Errors and Significance Tests
    Generalized Linear Models
    Example: Vertebrae Data
    Other Terminology
    Further Reading
    Mathematical Complements
    James-Stein Theory
    Definition
    Theoretical Properties
    When Might Shrunken Estimators Be Helpful?
    Ridge Action Increases Eigenvalues
    Computational Complements
    Code for ridgelm()
    Further Exploration: Data, Code and Math Problems

    Variable Selection and Dimension Reduction
    A Closer Look at Under/Overfitting
    A Simple Guiding Example
    How Many Is Too Many?
    Fit Criteria
    Some Common Measures
    No Panacea!
    Variable Selection Methods
    Simple Use of p-Values: Pitfalls
    Asking "What If" Questions
    Stepwise Selection
    Basic Notion
    Forward vs Backward Selection
    R Functions for Stepwise Regression
    Example: Bodyfat Data
    Classification Settings
    Example: Bank Marketing Data
    Example: Vertebrae Data
    Nonparametric Settings
    Is Dimension Reduction Important in the Nonparametric Setting?
    The LASSO
    Why the LASSO Often Performs Subsetting
    Example: Bodyfat Data
    Post-Selection Inference
    Direct Methods for Dimension Reduction
    Informal Nature
    Role in Regression Analysis
    PCA
    Issues
    Example: Bodyfat Data
    Example: Instructor Evaluations
    Nonnegative Matrix Factorization (NMF)
    Overview
    Interpretation
    Sum-of-Parts Property
    Example: Spam Detection
    Use of freqparcoord for Dimension Reduction
    Example: Student Evaluations of Instructors
    Dimension Reduction for Dummy/R Factor Variables
    The Verdict
    Further Reading
    Computational Complements
    Computation for NMF
    Mathematical Complements
    MSEs for the Simple Example
    Further Exploration: Data, Code and Math Problems

    Partition-Based Methods
    CART
    Example: Vertebral Column Data
    Technical Details
    Statistical Consistency
    Tuning Parameters
    Random Forests
    Bagging
    Example: Vertebrae Data
    Example: Letter Recognition
    Other Implementations of CART
    Further Exploration: Data, Code and Math Problems

    Semi-Linear Methods
    k-NN with Linear Smoothing
    Extrapolation Via lm()
    Multicollinearity Issues
    Example: Bodyfat Data
    Tuning Parameter
    Linear Approximation of Class Boundaries
    SVMs
    Geometric Motivation
    Reduced Convex Hulls
    Tuning Parameter
    Nonlinear Boundaries
    Statistical Consistency
    Example: Letter Recognition Data
    Neural Networks
    Example: Vertebrae Data
    Tuning Parameters and Other Technical Details
    Dimension Reduction
    Statistical Consistency
    The Verdict
    Mathematical Complements
    Edge Bias with k-NN and Kernel Methods
    Dual Formulation for SVM
    The Kernel Trick
    Further Reading
    Further Exploration: Data, Code and Math Problems


    Regression and Classification in Big Data
    Solving the Big-n Problem
    Software Alchemy
    Example: Flight Delay Data
    More on the Insufficient Memory Issue
    Deceivingly Big-n
    The Independence Assumption in Big-n Data
    Addressing Big-p
    How Many Is Too Many?
    Toy Model
    Results from the Research Literature
    A Much Simpler and More Direct Approach
    Nonparametric Case
    The Curse of Dimensionality
    Example: Currency Data
    Example: Quiz Documents
    The Verdict
    Mathematical Complements
    Speedup from Software Alchemy
    Computational Complements
    The partools Package
    Use of the tm Package
    Further Exploration: Data, Code and Math Problems

    Biography

    Norman Matloff is a professor of computer science at the University of California, Davis, and was a founder of the Statistics Department at that institution. Statistical Regression and Classification: From Linear Models to Machine Learning was awarded the 2017 Ziegel Award for the best book reviewed in Technometrics in 2017. His current research focus is on recommender systems, and applications of regression methods to small area estimation and bias reduction in observational studies. He is on the editorial boards of the Journal of Statistical Computation and the R Journal. An award-winning teacher, he is the author of The Art of R Programming and Parallel Computation in Data Science: With Examples in R, C++ and CUDA.

    " . . . Matloff delivers a well-balanced book for advanced beginners. Besides the mathematical formulas, he also presents many chunks of R code, and if the reader is able to read R code, the formulas and calculations become clearer. Due to the computational R code, the well-written Appendix, and an overall clear English, the book will help students and autodidacts. Matloff has written a textbook of the best kind for such a broad topic."
    ~ Jochen Kruppa, Biometrical Journal

    ". . . the book is well suitable for a wide audience: For practitioners interested in applying the methodology, for students in statistics as well as economics/social sciences and computer science. Even in more mathematically oriented classes it can be used as a complementary text to the usual theoretic textbooks deepening students' ability to interpret and question statistical results."
    ~ Claudia Kirch, Magdeburg

    "This is an application-oriented book introducing frequently used classification and regression methods and the principles behind them. This book tries to keep a balance between theory and practice. It not only elaborates the theories of statistical regression and classification, but also provides large amount of real world examples and R codes to help the reader practice what they learned. As stated in the preface, the targeted readers are data analysts and college students. The style of the book fits well to the anticipated audience."
    ~ Quanquan Gu, University of Virginia


    "I consider this book as very useful for the practitioner, the instructor and the student. It contains a wealth of material, both conceptual and practical, and above all stimulates the reader to think by him/herself, without being misled by recipes."
    ~ Ricardo Maronna, Statistical Papers