1st Edition
Data Science and Machine Learning: Mathematical and Statistical Methods
"This textbook is a well-rounded, rigorous, and informative work presenting the mathematics behind modern machine learning techniques. It hits all the right notes: the choice of topics is up-to-date and perfect for a course on data science for mathematics students at the advanced undergraduate or early graduate level. This book fills a sorely-needed gap in the existing literature by not sacrificing depth for breadth, presenting proofs of major theorems and subsequent derivations, as well as providing a copious amount of Python code. I only wish a book like this had been around when I first began my journey!" -Nicholas Hoell, University of Toronto
"This is a well-written book that provides a deeper dive into data-scientific methods than many introductory texts. The writing is clear, and the text logically builds up regularization, classification, and decision trees. Compared to its probable competitors, it carves out a unique niche. -Adam Loy, Carleton College
The purpose of Data Science and Machine Learning: Mathematical and Statistical Methods is to provide an accessible, yet comprehensive textbook intended for students interested in gaining a better understanding of the mathematics and statistics that underpin the rich variety of ideas and machine learning algorithms in data science.
Key Features:
- Focuses on mathematical understanding.
- Presentation is self-contained, accessible, and comprehensive.
- Extensive list of exercises and worked-out examples.
- Many concrete algorithms with Python code.
- Full color throughout.
Further resources can be found on the authors' website: https://github.com/DSML-book/Lectures
Preface
Notation
Importing, Summarizing, and Visualizing Data
Introduction
Structuring Features According to Type
Summary Tables
Summary Statistics
Visualizing Data
Plotting Qualitative Variables
Plotting Quantitative Variables
Data Visualization in a Bivariate Setting
Exercises
Statistical Learning
Introduction
Supervised and Unsupervised Learning
Training and Test Loss
Tradeoffs in Statistical Learning
Estimating Risk
In-Sample Risk
Cross-Validation
Modeling Data
Multivariate Normal Models
Normal Linear Models
Bayesian Learning
Exercises
Monte Carlo Methods
Introduction
Monte Carlo Sampling
Generating Random Numbers
Simulating Random Variables
Simulating Random Vectors and Processes
Resampling
Markov Chain Monte Carlo
Monte Carlo Estimation
Crude Monte Carlo
Bootstrap Method
Variance Reduction
Monte Carlo for Optimization
Simulated Annealing
Cross-Entropy Method
Splitting for Optimization
Noisy Optimization
Exercises
Unsupervised Learning
Introduction
Risk and Loss in Unsupervised Learning
Expectation–Maximization (EM) Algorithm
Empirical Distribution and Density Estimation
Clustering via Mixture Models
Mixture Models
EM Algorithm for Mixture Models
Clustering via Vector Quantization
K-Means
Clustering via Continuous Multiextremal Optimization
Hierarchical Clustering
Principal Component Analysis (PCA)
Motivation: Principal Axes of an Ellipsoid
PCA and Singular Value Decomposition (SVD)
Exercises
Regression
Introduction
Linear Regression
Analysis via Linear Models
Parameter Estimation
Model Selection and Prediction
Cross-Validation and Predictive Residual Sum of Squares
In-Sample Risk and Akaike Information Criterion
Categorical Features
Nested Models
Coefficient of Determination
Inference for Normal Linear Models
Comparing Two Normal Linear Models
Confidence and Prediction Intervals
Nonlinear Regression Models
Linear Models in Python
Modeling
Analysis
Analysis of Variance (ANOVA)
Confidence and Prediction Intervals
Model Validation
Variable Selection
Generalized Linear Models
Exercises
Regularization and Kernel Methods
Introduction
Regularization
Reproducing Kernel Hilbert Spaces
Construction of Reproducing Kernels
Reproducing Kernels via Feature Mapping
Kernels from Characteristic Functions
Reproducing Kernels Using Orthonormal Features
Kernels from Kernels
Representer Theorem
Smoothing Cubic Splines
Gaussian Process Regression
Kernel PCA
Exercises
Classification
Introduction
Classification Metrics
Classification via Bayes’ Rule
Linear and Quadratic Discriminant Analysis
Logistic Regression and Softmax Classification
K-Nearest Neighbors Classification
Support Vector Machine
Classification with Scikit-Learn
Exercises
Decision Trees and Ensemble Methods
Introduction
Top-Down Construction of Decision Trees
Regional Prediction Functions
Splitting Rules
Termination Criterion
Basic Implementation
Additional Considerations
Binary Versus Non-Binary Trees
Data Preprocessing
Alternative Splitting Rules
Categorical Variables
Missing Values
Controlling the Tree Shape
Cost-Complexity Pruning
Advantages and Limitations of Decision Trees
Bootstrap Aggregation
Random Forests
Boosting
Exercises
Deep Learning
Introduction
Feed-Forward Neural Networks
Back-Propagation
Methods for Training
Steepest Descent
Levenberg–Marquardt Method
Limited-Memory BFGS Method
Adaptive Gradient Methods
Examples in Python
Simple Polynomial Regression
Image Classification
Exercises
Linear Algebra and Functional Analysis
Vector Spaces, Bases, and Matrices
Inner Product
Complex Vectors and Matrices
Orthogonal Projections
Eigenvalues and Eigenvectors
Left- and Right-Eigenvectors
Matrix Decompositions
(P)LU Decomposition
Woodbury Identity
Cholesky Decomposition
QR Decomposition and the Gram–Schmidt Procedure
Singular Value Decomposition
Solving Structured Matrix Equations
Functional Analysis
Fourier Transforms
Discrete Fourier Transform
Fast Fourier Transform
Multivariate Differentiation and Optimization
Multivariate Differentiation
Taylor Expansion
Chain Rule
Optimization Theory
Convexity and Optimization
Lagrangian Method
Duality
Numerical Root-Finding and Minimization
Newton-Like Methods
Quasi-Newton Methods
Normal Approximation Method
Nonlinear Least Squares
Constrained Minimization via Penalty Functions
Probability and Statistics
Random Experiments and Probability Spaces
Random Variables and Probability Distributions
Expectation
Joint Distributions
Conditioning and Independence
Conditional Probability
Independence
Expectation and Covariance
Conditional Density and Conditional Expectation
Functions of Random Variables
Multivariate Normal Distribution
Convergence of Random Variables
Law of Large Numbers and Central Limit Theorem
Markov Chains
Statistics
Estimation
Method of Moments
Maximum Likelihood Method
Confidence Intervals
Hypothesis Testing
Python Primer
Getting Started
Python Objects
Types and Operators
Functions and Methods
Modules
Flow Control
Iteration
Classes
Files
NumPy
Creating and Shaping Arrays
Slicing
Array Operations
Random Numbers
Matplotlib
Creating a Basic Plot
Pandas
Series and DataFrame
Manipulating Data Frames
Extracting Information
Plotting
Scikit-learn
Partitioning the Data
Standardization
Fitting and Prediction
Testing the Model
System Calls, URL Access, and Speed-Up
Bibliography
Index
Biography
Dirk P. Kroese, PhD, is a Professor of Mathematics and Statistics at The University of Queensland. He has published over 120 articles and five books in a wide range of areas in mathematics, statistics, data science, machine learning, and Monte Carlo methods. He is a pioneer of the well-known Cross-Entropy method—an adaptive Monte Carlo technique, which is being used around the world to help solve difficult estimation and optimization problems in science, engineering, and finance.
Zdravko Botev, PhD, is an Australian Mathematical Sciences Institute Lecturer in Data Science and Machine Learning with an appointment at the University of New South Wales in Sydney, Australia. He is the recipient of the 2018 Christopher Heyde Medal of the Australian Academy of Science for distinguished research in the Mathematical Sciences.
Thomas Taimre, PhD, is a Senior Lecturer of Mathematics and Statistics at The University of Queensland. His research interests range from applied probability and Monte Carlo methods to applied physics and the remarkably universal self-mixing effect in lasers. He has published over 100 articles, holds a patent, and is the coauthor of Handbook of Monte Carlo Methods (Wiley).
Radislav Vaisman, PhD, is a Lecturer of Mathematics and Statistics at The University of Queensland. His research interests lie at the intersection of applied probability, machine learning, and computer science. He has published over 20 articles and two books.
"The first impression when handling and opening this book at a random page is superb. A big format (A4) and heavy weight, because the paper quality is high, along with a spectacular style and large font, much colour and many plots, and blocks of python code enhanced in colour boxes. This makes the book attractive and easy to study...The book is a very well-designed data science course, with mathematical rigor in mind. Key concepts are highlighted in red in the margins, often with links to other parts of the book...This book will be excellent for those that want to build a strong mathematical foundation for their knowledge on the main machine learning techniques, and at the same time get python recipes on how to perform the analyses for worked examples."
- Victor Moreno, ISCB News, December 2020