# Statistical Foundations of Data Science

## Book Description

**Statistical Foundations of Data Science** gives a thorough introduction to commonly used statistical models and contemporary statistical machine learning techniques and algorithms, along with their mathematical insights and statistical theories. It aims to serve as a graduate-level textbook and a research monograph on high-dimensional statistics, sparsity and covariance learning, machine learning, and statistical inference. It includes ample exercises that involve both theoretical studies and empirical applications.

The book begins with an introduction to the stylized features of big data and their impacts on statistical analysis. It then introduces multiple linear regression and expands the techniques of model building via nonparametric regression and kernel tricks. It provides a comprehensive account of sparsity exploration and model selection for multiple regression, generalized linear models, quantile regression, robust regression, and hazards regression, among others. High-dimensional inference and feature screening are also thoroughly addressed. The book also provides a comprehensive account of high-dimensional covariance estimation and the learning of latent factors and hidden structures, as well as their applications to statistical estimation, inference, prediction, and machine learning problems. Finally, it thoroughly introduces statistical machine learning theory and methods for classification, clustering, and prediction, including CART, random forests, boosting, support vector machines, clustering algorithms, sparse PCA, and deep learning.

## Table of Contents

**1. Introduction**

Rise of Big Data and Dimensionality

Biological Sciences

Health Sciences

Computer and Information Sciences

Economics and Finance

Business and Program Evaluation

Earth Sciences and Astronomy

Impact of Big Data

Impact of Dimensionality

Computation

Noise Accumulation

Spurious Correlation

Statistical theory

Aim of High-dimensional Statistical Learning

What big data can do

Scope of the book

**2. Multiple and Nonparametric Regression**

Introduction

Multiple Linear Regression

The Gauss-Markov Theorem

Statistical Tests

Weighted Least-Squares

Box-Cox Transformation

Model Building and Basis Expansions

Polynomial Regression

Spline Regression

Multiple Covariates

Ridge Regression

Bias-Variance Tradeoff

Penalized Least Squares

Bayesian Interpretation

Ridge Regression Solution Path

Kernel Ridge Regression

Regression in Reproducing Kernel Hilbert Space

Leave-one-out and Generalized Cross-validation

Exercises

**3. Introduction to Penalized Least-Squares**

Classical Variable Selection Criteria

Subset selection

Relation with penalized regression

Selection of regularization parameters

Folded-concave Penalized Least Squares

Orthonormal designs

Penalty functions

Thresholding by SCAD and MCP

Risk properties

Characterization of folded-concave PLS

Lasso and L1 Regularization

Nonnegative garrote

Lasso

Adaptive Lasso

Elastic Net

Dantzig selector

SLOPE and Sorted Penalties

Concentration inequalities and uniform convergence

A brief history of model selection

Bayesian Variable Selection

Bayesian view of the PLS

A Bayesian framework for selection

Numerical Algorithms

Quadratic programs

Least angle regression

Local quadratic approximations

Local linear algorithm

Penalized linear unbiased selection

Cyclic coordinate descent algorithms

Iterative shrinkage-thresholding algorithms

Projected proximal gradient method

ADMM

Iterative Local Adaptive Majorization and Minimization

Other Methods and Timeline

Regularization parameters for PLS

Degrees of freedom

Extension of information criteria

Application to PLS estimators

Residual variance and refitted cross-validation

Residual variance of Lasso

Refitted cross-validation

Extensions to Nonparametric Modeling

Structured nonparametric models

Group penalty

Applications

Bibliographical notes

Exercises

**4. Penalized Least Squares: Properties**

Performance Benchmarks

Performance measures

Impact of model uncertainty

Bayes lower bounds for orthogonal design

Minimax lower bounds for general design

Performance goals, sparsity and sub-Gaussian noise

Penalized L0 Selection

Lasso and Dantzig Selector

Selection consistency

Prediction and coefficient estimation errors

Model size and least squares after selection

Properties of the Dantzig selector

Regularity conditions on the design matrix

Properties of Concave PLS

Properties of penalty functions

Local and oracle solutions

Properties of local solutions

Global and approximate global solutions

Smaller and Sorted Penalties

Sorted concave penalties and its local approximation

Approximate PLS with smaller and sorted penalties

Properties of LLA and LCA

Bibliographical notes

Exercises

**5. Generalized Linear Models and Penalized Likelihood**

Generalized Linear Models

Exponential family

Elements of generalized linear models

Maximum likelihood

Computing MLE: Iteratively reweighted least squares

Deviance and Analysis of Deviance

Residuals

Examples

Bernoulli and binomial models

Models for count responses

Models for nonnegative continuous responses

Normal error models

Sparsest solution in high confidence set

A general setup

Examples

Properties

Variable Selection via Penalized Likelihood

Algorithms

Local quadratic approximation

Local linear approximation

Coordinate descent

Iterative Local Adaptive Majorization and Minimization

Tuning parameter selection

An Application

Sampling Properties in low-dimension

Notation and regularity conditions

The oracle property

Sampling Properties with Diverging Dimensions

Asymptotic properties of GIC selectors

Properties under Ultrahigh Dimensions

The Lasso penalized estimator and its risk property

Strong oracle property

Numeric studies

Risk properties

Bibliographical notes

Exercises

**6. Penalized M-estimators**

Penalized quantile regression

Quantile regression

Variable selection in quantile regression

A fast algorithm for penalized quantile regression

Penalized composite quantile regression

Variable selection in robust regression

Robust regression

Variable selection in Huber regression

Rank regression and its variable selection

Rank regression

Penalized weighted rank regression

Variable Selection for Survival Data

Partial likelihood

Variable selection via penalized partial likelihood and its properties

Theory of folded-concave penalized M-estimator

Conditions on penalty and restricted strong convexity

Statistical accuracy of penalized M-estimator with folded concave penalties

Computational accuracy

Bibliographical notes

Exercises

**7. High Dimensional Inference**

Inference in linear regression

Debias of regularized regression estimators

Choices of weights

Inference for the noise level

Inference in generalized linear models

Desparsified Lasso

Decorrelated score estimator

Test of linear hypotheses

Numerical comparison

An application

Asymptotic efficiency

Statistical efficiency and Fisher information

Linear regression with random design

Partial linear regression

Gaussian graphical models

Inference via penalized least squares

Sample size in regression and graphical models

General solutions

Local semi-LD decomposition

Data swap

Gradient approximation

Bibliographical notes

Exercises

**8. Feature Screening**

Correlation Screening

Sure screening property

Connection to multiple comparison

Iterative SIS

Generalized and Rank Correlation Screening

Feature Screening for Parametric Models

Generalized linear models

A unified strategy for parametric feature screening

Conditional sure independence screening

Nonparametric Screening

Additive models

Varying coefficient models

Heterogeneous nonparametric models

Model-free Feature Screening

Sure independent ranking screening procedure

Feature screening via distance correlation

Feature screening for high-dimensional categorical data

Screening and Selection

Feature screening via forward regression

Sparse maximum likelihood estimate

Feature screening via partial correlation

Refitted Cross-Validation

RCV algorithm

RCV in linear models

RCV in nonparametric regression

An Illustration

Bibliographical notes

Exercises

**9. Covariance Regularization and Graphical Models**

Basic facts about matrices

Sparse Covariance Matrix Estimation

Covariance regularization by thresholding and banding

Asymptotic properties

Nearest positive definite matrices

Robust covariance inputs

Sparse Precision Matrix and Graphical Models

Gaussian graphical models

Penalized likelihood and M-estimation

Penalized least-squares

CLIME and its adaptive version

Latent Gaussian Graphical Models

Technical Proofs

Proof of Theorem

Proof of Theorem

Proof of Theorem

Proof of Theorem

Bibliographical notes

Exercises

**10. Covariance Learning and Factor Models**

Principal Component Analysis

Introduction to PCA

Power Method

Factor Models and Structured Covariance Learning

Factor model and high-dimensional PCA

Extracting latent factors and POET

Methods for selecting number of factors

Covariance and Precision Learning with Known Factors

Factor model with observable factors

Robust initial estimation of covariance matrix

Augmented factor models and projected PCA

Asymptotic Properties

Properties for estimating loading matrix

Properties for estimating covariance matrices

Properties for estimating realized latent factors

Properties for estimating idiosyncratic components

Technical Proofs

Proof of Theorem

Proof of Theorem

Proof of Theorem

Proof of Theorem

Bibliographical Notes

Exercises

**11. Applications of Factor Models and PCA**

Factor-adjusted Regularized Model Selection

Importance of factor adjustments

FarmSelect

Application to forecasting bond risk premia

Application to neuroblastoma data

Asymptotic theory for FarmSelect

Factor-adjusted robust multiple testing

False discovery rate control

Multiple testing under dependence measurements

Power of factor adjustments

FarmTest

Application to neuroblastoma data

Factor Augmented Regression Methods

Principal Component Regression

Augmented Principal Component Regression

Application to Forecast Bond Risk Premia

Applications to Statistical Machine Learning

Community detection

Topic model

Matrix completion

Item ranking

Gaussian Mixture models

Bibliographical Notes

Exercises

**12. Supervised Learning**

Model-based Classifiers

Linear and quadratic discriminant analysis

Logistic regression

Kernel Density Classifiers and Naive Bayes

Nearest Neighbor Classifiers

Classification Trees and Ensemble Classifiers

Classification trees

Bagging

Random forests

Boosting

Support Vector Machines

The standard support vector machine

Generalizations of SVMs

Sparse Classifiers via Penalized Empirical Loss

The importance of sparsity under high-dimensionality

Sparse support vector machines

Sparse large margin classifiers

Sparse Discriminant Analysis

Nearest shrunken centroids classifier

Features annealed independent rule

Selection bias of sparse independence rules

Regularized optimal affine discriminant

Linear programming discriminant

Direct sparse discriminant analysis

Solution path equivalence between ROAD and DSDA

Feature Augmentation and Sparse Additive Classifiers

Feature augmentation

Penalized additive logistic regression

Semiparametric sparse discriminant analysis

Bibliographical notes

Exercises

**13. Unsupervised Learning**

Cluster Analysis

K-means clustering

Hierarchical clustering

Model-based clustering

Spectral clustering

Data-driven choices of the number of clusters

Variable Selection in Clustering

Sparse clustering

Sparse model-based clustering

Sparse mixture of experts model

An Introduction to High Dimensional PCA

Inconsistency of the regular PCA

Consistency under sparse eigenvector model

Sparse Principal Component Analysis

Sparse PCA

An iterative SVD thresholding approach

A penalized matrix decomposition approach

A semidefinite programming approach

A generalized power method

Bibliographical notes

Exercises

**14. An Introduction to Deep Learning**

Rise of Deep Learning

Feed-forward neural networks

Model setup

Back-propagation in computational graphs

Popular models

Convolutional neural networks

Recurrent neural networks

Vanilla RNNs

GRUs and LSTM

Multilayer RNNs

Modules

Deep unsupervised learning

Autoencoders

Generative adversarial networks

Sampling view of GANs

Minimum distance view of GANs

Training deep neural nets

Stochastic gradient descent

Mini-batch SGD

Momentum-based SGD

SGD with adaptive learning rates

Easing numerical instability

ReLU activation function

Skip connections

Batch normalization

Regularization techniques

Weight decay

Dropout

Data augmentation

Example: image classification

Bibliographical notes

## Author(s)

### Biography

The authors are international authorities and leaders on the presented topics. All are fellows of the Institute of Mathematical Statistics and the American Statistical Association.

**Jianqing Fan** is Frederick L. Moore Professor, Princeton University. He is co-editing *Journal of Business and Economic Statistics* and was the co-editor of *The Annals of Statistics*, *Probability Theory and Related Fields*, and *Journal of Econometrics*. He has been recognized by the 2000 COPSS Presidents' Award, AAAS Fellow, Guggenheim Fellow, Guy medal in silver, Noether Senior Scholar Award, and Academician of Academia Sinica.

**Runze Li** is Eberly Family Chair Professor and AAAS fellow, Pennsylvania State University, and was co-editor of *The Annals of Statistics*.

**Cun-Hui Zhang** is distinguished professor, Rutgers University and was co-editor of *Statistical Science*.

**Hui Zou** is professor, University of Minnesota and was action editor of *Journal of Machine Learning Research*.

## Reviews

"This book delivers a very comprehensive summary of the development of statistical foundations of data science. The authors no doubt are doing frontier research and have made several crucial contributions to the field. Therefore, the book offers a very good account of the most cutting-edge development. The book is suitable for both master and Ph.D. students in statistics, and also for researchers in both applied and theoretical data science. Researchers can take this book as an index of topics, as it summarizes in brief many significant research articles in an accessible way. Each chapter can be read independently by experienced researchers. It provides a nice cover of key concepts in those topics and researchers can benefit from reading the specific chapters and paragraphs to get a big picture rather than diving into many technical articles. There are altogether 14 chapters. It can serve as a textbook for two semesters. The book also provides handy codes and data sets, which is a great treasure for practitioners."

~Journal of Time Series Analysis