1st Edition

Robust Cluster Analysis and Variable Selection




ISBN 9781439857960
Published September 2, 2014 by Chapman and Hall/CRC
392 Pages, 60 B/W Illustrations

USD $180.00

Book Description

Clustering remains a vibrant area of research in statistics. Although there are many books on this topic, there are relatively few that are well founded in the theoretical aspects. In Robust Cluster Analysis and Variable Selection, Gunter Ritter presents an overview of the theory and applications of probabilistic clustering and variable selection, synthesizing the key research results of the last 50 years.

The author focuses on the robust clustering methods he found to be the most useful on simulated data and real-time applications. The book provides clear guidance for the varying needs of both applications, describing scenarios in which accuracy and speed are the primary goals.

Robust Cluster Analysis and Variable Selection includes all of the important theoretical details, and covers the key probabilistic models, robustness issues, optimization algorithms, validation techniques, and variable selection methods. The book illustrates the different methods with simulated data and applies them to real-world data sets that can be easily downloaded from the web. This gives you guidance on how to use clustering methods, along with procedures and algorithms that can be applied without having to understand their probabilistic fundamentals.

Table of Contents

Introduction

Mixture and classification models and their likelihood estimators
General consistency and asymptotic normality
Local likelihood estimates
Maximum likelihood estimates
Notes
Mixture models and their likelihood estimators
Latent distributions
Finite mixture models
Identifiable mixture models
Asymptotic properties of local likelihood maxima
Asymptotic properties of the MLE: constrained nonparametric mixture models
Asymptotic properties of the MLE: constrained parametric mixture models
Notes
Classification models and their criteria
Probabilistic criteria for general populations
Admissibility and size constraints
Steady partitions
Elliptical models
Normal models
Geometric considerations
Consistency of the MAP criterion
Notes


Robustification by trimming
Outliers and measures of robustness
Outliers
The sensitivities
Sensitivity of ML estimates of mixture models
Breakdown points
Trimming the mixture model
Trimmed likelihood function of the mixture model
Normal components
Universal breakdown points of covariance matrices, mixing rates, and means
Restricted breakdown point of mixing rates and means
Notes
Trimming the classification model – the TDC
Trimmed MAP classification model
Normal case – the Trimmed Determinant Criterion, TDC
Breakdown robustness of the constrained TDC
Universal breakdown point of covariance matrices and means
Restricted breakdown point of the means
Notes

Algorithms
EM algorithm for mixtures
General mixtures
Normal mixtures
Mixtures of multivariate t-distributions
Trimming – the EMT algorithm
Order of convergence
Acceleration of the mixture EM
Notes
k-Parameters algorithms
General and elliptically symmetric models
Steady solutions and trimming
Using combinatorial optimization
Overall algorithms
Notes
Hierarchical methods for initial solutions

Favorite solutions and cluster validation
Scale balance and Pareto solutions
Number of components of uncontaminated data
Likelihood–ratio tests
Using cluster criteria as test statistics
Model selection criteria
Ridgeline manifold
Number of components and outliers
Classification trimmed likelihood curves
Trimmed BIC
Adjusted BIC
Cluster validation
Separation indices
Normality and related tests
Visualization
Measures of agreement of partitions
Stability
Notes

Variable selection in clustering
Irrelevance
Definition and general properties
The normal case
Filters
Univariate filters
Multivariate filters
Wrappers
Using the likelihood ratio test
Using Bayes factors and their BIC approximations
Maximum likelihood subset selection
Consistency of the MAP cluster criterion with variable selection
Practical guidelines
Notes

Applications
Miscellaneous data sets
IRIS data
SWISS BILLS
STONE FLAKES
Gene expression data
Supervised and unsupervised methods
Combining gene selection and profile clustering
Application to the LEUKEMIA data
Notes

Appendix A: Geometry and linear algebra
Appendix B: Topology
Appendix C: Analysis
Appendix D: Measures and probabilities
Appendix E: Probability
Appendix F: Statistics
Appendix G: Optimization

Author(s)

Biography

Dr. Gunter Ritter is an emeritus professor in the Department of Mathematics and Computer Science at the University of Passau. He is the author and coauthor of numerous research papers in scientific journals in the areas of measure theory, probability theory, queuing theory, statistics, pattern and image recognition, and Fourier analysis. He is a member of the International Federation of Classification Societies and its German branch GfKl as well as the German Mathematical Society.

Reviews

"I congratulate the author on his hard work, which provides a well-founded mathematical/probabilistic treatment for some valuable clustering techniques...I highly recommend this book and find it very enjoyable to read for those with enough background and who wish to gain a more in-depth knowledge of cluster analysis. For this audience, it is a stimulating read covering several novel and useful ideas and providing a starting point for developing more advanced theory and applications...In summary, I found the book appealing and I appreciate the effort made by the author in providing a rigorous approach to (robust) cluster analysis."
Journal of the American Statistical Association, May 2016

"This book is quite theoretical, with parts of the text being highly technical. However, it also provides the practitioner with useful procedures and algorithms applicable without having to understand the probabilistic fundamentals."
Mathematical Reviews, August 2015

"Professor Ritter has contributed an original and highly valuable volume on probabilistic clustering. This book provides a selection of methods that the author has found most useful in the analysis of real and simulated data. It places a special focus on problems caused by outliers and irrelevant variables, yet, above all, with an admirable and in-depth attention to the roots of methods in mathematical theorems and fundamental statistical principles. An absolute must for those who really care about the mathematical-statistical foundations of probabilistic cluster analysis and who want to get to the bottom of them."
—Iven Van Mechelen, Past President of the International Federation of Classification Societies

"This book provides a marvelous, deep, comprehensive, and knowledgeable presentation of basic and advanced methods and algorithms for data clustering related to model-based approaches (mixture model and fixed-classification approach). Its special and innovative features consist in the presentation of outlier models and corresponding trimming variants of classical maximum likelihood clustering methods and in a complete derivation of the large-sample theory of the resulting estimates, with and without outliers (this did not exist in book form before). In addition to presenting suitable (EM and k-parameters) clustering algorithms, the author proposes new ideas to cope with a possibly large number of ‘local’ clustering solutions, e.g., by selecting ‘favorite’ classifications, together with methods for cluster evaluation and variable selection. Real-case examples, e.g., from gene analysis, illustrate the proposed methods.
Given its broad methodological range, the presentation of new concepts and methods in clustering, the consideration of outlier-infected situations, and the complete and exact derivation of results, this book can be considered a standard work for all classificationists and data analysts. For practitioners, it contains a wealth of models and algorithms to choose from and many tricky practical advices for computing and interpretation. For researchers, this will be an indispensable source of information concerning the statistical and mathematical foundations and results in the context of model-based clustering."
—Hans-Hermann Bock, RWTH Aachen University, Germany