
Exploratory Data Analysis with MATLAB
Book Description
Praise for the Second Edition:
"The authors present an intuitive and easy-to-read book. … accompanied by many examples, proposed exercises, good references, and comprehensive appendices that initiate the reader unfamiliar with MATLAB."
—Adolfo Alvarez Pinto, International Statistical Review
"Practitioners of EDA who use MATLAB will want a copy of this book. … The authors have done a great service by bringing together so many EDA routines, but their main accomplishment in this dynamic text is providing the understanding and tools to do EDA."
—David A. Huckaby, MAA Reviews
Exploratory Data Analysis (EDA) is an important part of the data analysis process. The methods presented in this text are ones that should be in the toolkit of every data scientist. As computational sophistication has increased and data sets have grown in size and complexity, EDA has become an even more important process for visualizing and summarizing data before making assumptions to generate hypotheses and models.
Exploratory Data Analysis with MATLAB, Third Edition presents EDA methods from a computational perspective and uses numerous examples and applications to show how the methods are used in practice. The authors use MATLAB code, pseudo-code, and algorithm descriptions to illustrate the concepts. The MATLAB code for examples, data sets, and the EDA Toolbox are available for download on the book’s website.
New to the Third Edition
- Random projections and estimating local intrinsic dimensionality
- Deep learning autoencoders and stochastic neighbor embedding
- Minimum spanning tree and additional cluster validity indices
- Kernel density estimation
- Plots for visualizing data distributions, such as beanplots and violin plots
- A chapter on visualizing categorical data
Table of Contents
Part I
Introduction to Exploratory Data Analysis
What is Exploratory Data Analysis
Overview of the Text
A Few Words about Notation
Data Sets Used in the Book
Unstructured Text Documents
Gene Expression Data
Oronsay Data Set
Software Inspection
Transforming Data
Power Transformations
Standardization
Sphering the Data
Further Reading
Exercises
Part II
EDA as Pattern Discovery
Dimensionality Reduction — Linear Methods
Introduction
Principal Component Analysis — PCA
PCA Using the Sample Covariance Matrix
PCA Using the Sample Correlation Matrix
How Many Dimensions Should We Keep?
Singular Value Decomposition — SVD
Nonnegative Matrix Factorization
Factor Analysis
Fisher’s Linear Discriminant
Random Projections
Intrinsic Dimensionality
Nearest Neighbor Approach
Correlation Dimension
Maximum Likelihood Approach
Estimation Using Packing Numbers
Estimation of Local Dimension
Summary and Further Reading
Exercises
Dimensionality Reduction — Nonlinear Methods
Multidimensional Scaling — MDS
Metric MDS
Nonmetric MDS
Manifold Learning
Locally Linear Embedding
Isometric Feature Mapping — ISOMAP
Hessian Eigenmaps
Artificial Neural Network Approaches
Self-Organizing Maps
Generative Topographic Maps
Curvilinear Component Analysis
Autoencoders
Stochastic Neighbor Embedding
Summary and Further Reading
Exercises
Data Tours
Grand Tour
Torus Winding Method
Pseudo Grand Tour
Interpolation Tours
Projection Pursuit
Projection Pursuit Indexes
Posse Chi-Square Index
Moment Index
Independent Component Analysis
Summary and Further Reading
Exercises
Finding Clusters
Introduction
Hierarchical Methods
Optimization Methods — k-Means
Spectral Clustering
Document Clustering
Nonnegative Matrix Factorization — Revisited
Probabilistic Latent Semantic Analysis
Minimal Spanning Trees and Clustering
Definitions
Minimum Spanning Tree Clustering
Evaluating the Clusters
Rand Index
Cophenetic Correlation
Upper Tail Rule
Silhouette Plot
Gap Statistic
Cluster Validity Indices
Summary and Further Reading
Exercises
Model-Based Clustering
Overview of Model-Based Clustering
Finite Mixtures
Multivariate Finite Mixtures
Component Models — Constraining the Covariances
Expectation-Maximization Algorithm
Hierarchical Agglomerative Model-Based Clustering
Model-Based Clustering
MBC for Density Estimation and Discriminant Analysis
Introduction to Pattern Recognition
Bayes Decision Theory
Estimating Probability Densities with MBC
Generating Random Variables from a Mixture Model
Summary and Further Reading
Exercises
Smoothing Scatterplots
Introduction
Loess
Robust Loess
Residuals and Diagnostics with Loess
Residual Plots
Spread Smooth
Loess Envelopes — Upper and Lower Smooths
Smoothing Splines
Regression with Splines
Smoothing Splines
Smoothing Splines for Uniformly Spaced Data
Choosing the Smoothing Parameter
Bivariate Distribution Smooths
Pairs of Middle Smoothings
Polar Smoothing
Curve Fitting Toolbox
Summary and Further Reading
Exercises
Part III
Graphical Methods for EDA
Visualizing Clusters
Dendrogram
Treemaps
Rectangle Plots
ReClus Plots
Data Image
Summary and Further Reading
Exercises
Distribution Shapes
Histograms
Univariate Histograms
Bivariate Histograms
Kernel Density
Univariate Kernel Density Estimation
Multivariate Kernel Density Estimation
Boxplots
The Basic Boxplot
Variations of the Basic Boxplot
Violin Plots
Beeswarm Plot
Bean Plot
Quantile Plots
Probability Plots
Quantile-Quantile Plot
Quantile Plot
Bagplots
Rangefinder Boxplot
Summary and Further Reading
Exercises
Multivariate Visualization
Glyph Plots
Scatterplots
2-D and 3-D Scatterplots
Scatterplot Matrices
Scatterplots with Hexagonal Binning
Dynamic Graphics
Identification of Data
Linking
Brushing
Coplots
Dot Charts
Basic Dot Chart
Multiway Dot Chart
Plotting Points as Curves
Parallel Coordinate Plots
Andrews’ Curves
Andrews’ Images
More Plot Matrices
Data Tours Revisited
Grand Tour
Permutation Tour
Biplots
Summary and Further Reading
Exercises
Visualizing Categorical Data
Discrete Distributions
Binomial Distribution
Poisson Distribution
Exploring Distribution Shapes
Poissonness Plot
Binomialness Plot
Hanging Rootogram
Contingency Tables
Background
Bar Plots
Spine Plots
Mosaic Plots
Sieve Diagrams
Log Odds Plot
Summary and Further Reading
Exercises
Appendix A
Proximity Measures
Appendix B
Software Resources for EDA
Appendix C
Appendix D
MATLAB® Basics
Author(s)
Biography
Wendy L. Martinez is a mathematical statistician with the U.S. Bureau of Labor Statistics. She is a fellow of the American Statistical Association, a co-author of several popular Chapman & Hall/CRC books, and a MATLAB® user for more than 20 years. Her research interests include text data mining, probability density estimation, signal processing, scientific visualization, and statistical pattern recognition. She earned an M.S. in aerospace engineering from George Washington University and a Ph.D. in computational sciences and informatics from George Mason University.
Angel R. Martinez is fully retired after a long career with the U.S. federal government and as an adjunct professor at Strayer University, where he taught undergraduate and graduate courses in statistics and mathematics. Before retiring from government service, he worked for the U.S. Navy as an operations research analyst and a computer scientist. He earned an M.S. in systems engineering from the Virginia Polytechnic Institute and State University and a Ph.D. in computational sciences and informatics from George Mason University.
Since 1984, Jeffrey L. Solka has been working in statistical pattern recognition for the Department of the Navy. He has published over 120 journal, conference, and technical papers; has won numerous awards; and holds 4 patents. He earned an M.S. in mathematics from James Madison University, an M.S. in physics from Virginia Polytechnic Institute and State University, and a Ph.D. in computational sciences and informatics from George Mason University.
Reviews
“This book presents an extensive coverage in exploratory data analysis (EDA) using the software Matlab. Although this software is used throughout the book, readers can modify the algorithms for different statistical packages. … This book is intended for a wide audience including statisticians, computer scientists, and engineers. A wide range of topics along with Matlab codes are given. Each chapter ends with a good number of exercises which would be very helpful to complement the knowledge learned from the chapter. It is a great source for the students/researchers. It is suitable for a course in the targeted areas at the senior undergraduate or graduate courses. Although Matlab is used throughout the book, the algorithms can easily be converted in other platforms.”
—Morteza Marzjarani in Technometrics, November 2019