Robust Cluster Analysis and Variable Selection

1st Edition

By Gunter Ritter

Chapman and Hall/CRC

392 pages | 60 B/W Illus.

Purchasing Options (prices in USD)
Hardback: ISBN 9781439857960, pub: 2014-09-02, $115.00
eBook (VitalSource): ISBN 9780429063435, pub: 2014-09-02, from $28.98


Description

Clustering remains a vibrant area of research in statistics. Although there are many books on this topic, there are relatively few that are well founded in the theoretical aspects. In Robust Cluster Analysis and Variable Selection, Gunter Ritter presents an overview of the theory and applications of probabilistic clustering and variable selection, synthesizing the key research results of the last 50 years.

The author focuses on the robust clustering methods he found most useful on simulated data and in real-time applications. The book provides clear guidance for the varying needs of both settings, describing scenarios in which accuracy and speed are the primary goals.

Robust Cluster Analysis and Variable Selection includes all of the important theoretical details, and covers the key probabilistic models, robustness issues, optimization algorithms, validation techniques, and variable selection methods. The book illustrates the different methods with simulated data and applies them to real-world data sets that can be easily downloaded from the web. This gives you guidance on how to use clustering methods, along with procedures and algorithms that can be applied without having to understand their probabilistic fundamentals.

Reviews

"I congratulate the author on his hard work, which provides a well-founded mathematical/probabilistic treatment for some valuable clustering techniques…I highly recommend this book and find it very enjoyable to read for those with enough background and who wish to gain a more in-depth knowledge of cluster analysis. For this audience, it is a stimulating read covering several novel and useful ideas and providing a starting point for developing more advanced theory and applications…In summary, I found the book appealing and I appreciate the effort made by the author in providing a rigorous approach to (robust) cluster analysis."

Journal of the American Statistical Association, May 2016

"This book is quite theoretical, with parts of the text being highly technical. However, it also provides the practitioner with useful procedures and algorithms applicable without having to understand the probabilistic fundamentals."

Mathematical Reviews, August 2015

"Professor Ritter has contributed an original and highly valuable volume on probabilistic clustering. This book provides a selection of methods that the author has found most useful in the analysis of real and simulated data. It places a special focus on problems caused by outliers and irrelevant variables, yet, above all, with an admirable and in-depth attention to the roots of methods in mathematical theorems and fundamental statistical principles. An absolute must for those who really care about the mathematical-statistical foundations of probabilistic cluster analysis and who want to get to the bottom of them."

—Iven Van Mechelen, Past President of the International Federation of Classification Societies

"This book provides a marvelous, deep, comprehensive, and knowledgeable presentation of basic and advanced methods and algorithms for data clustering related to model-based approaches (mixture model and fixed-classification approach). Its special and innovative features consist in the presentation of outlier models and corresponding trimming variants of classical maximum likelihood clustering methods and in a complete derivation of the large-sample theory of the resulting estimates, with and without outliers (this did not exist in book form before). In addition to presenting suitable (EM and k-parameters) clustering algorithms, the author proposes new ideas to cope with a possibly large number of ‘local’ clustering solutions, e.g., by selecting ‘favorite’ classifications, together with methods for cluster evaluation and variable selection. Real-case examples, e.g., from gene analysis, illustrate the proposed methods.

Given its broad methodological range, the presentation of new concepts and methods in clustering, the consideration of outlier-infected situations, and the complete and exact derivation of results, this book can be considered a standard work for all classificationists and data analysts. For practitioners, it contains a wealth of models and algorithms to choose from and many tricky practical advices for computing and interpretation. For researchers, this will be an indispensable source of information concerning the statistical and mathematical foundations and results in the context of model-based clustering."

—Hans-Hermann Bock, RWTH Aachen University, Germany

Table of Contents

Introduction

Mixture and classification models and their likelihood estimators

General consistency and asymptotic normality

Local likelihood estimates

Maximum likelihood estimates

Notes

Mixture models and their likelihood estimators

Latent distributions

Finite mixture models

Identifiable mixture models

Asymptotic properties of local likelihood maxima

Asymptotic properties of the MLE: constrained nonparametric mixture models

Asymptotic properties of the MLE: constrained parametric mixture models

Notes

Classification models and their criteria

Probabilistic criteria for general populations

Admissibility and size constraints

Steady partitions

Elliptical models

Normal models

Geometric considerations

Consistency of the MAP criterion

Notes

Robustification by trimming

Outliers and measures of robustness

Outliers

The sensitivities

Sensitivity of ML estimates of mixture models

Breakdown points

Trimming the mixture model

Trimmed likelihood function of the mixture model

Normal components

Universal breakdown points of covariance matrices, mixing rates, and means

Restricted breakdown point of mixing rates and means

Notes

Trimming the classification model – the TDC

Trimmed MAP classification model

Normal case – the Trimmed Determinant Criterion, TDC

Breakdown robustness of the constrained TDC

Universal breakdown point of covariance matrices and means

Restricted breakdown point of the means

Notes

Algorithms

EM algorithm for mixtures

General mixtures

Normal mixtures

Mixtures of multivariate t-distributions

Trimming – the EMT algorithm

Order of Convergence

Acceleration of the mixture EM

Notes

k-Parameters algorithms

General and elliptically symmetric models

Steady solutions and trimming

Using combinatorial optimization

Overall algorithms

Notes

Hierarchical methods for initial solutions

Favorite solutions and cluster validation

Scale balance and Pareto solutions

Number of components of uncontaminated data

Likelihood-ratio tests

Using cluster criteria as test statistics

Model selection criteria

Ridgeline manifold

Number of components and outliers

Classification trimmed likelihood curves

Trimmed BIC

Adjusted BIC

Cluster validation

Separation indices

Normality and related tests

Visualization

Measures of agreement of partitions

Stability

Notes

Variable selection in clustering

Irrelevance

Definition and general properties

The normal case

Filters

Univariate filters

Multivariate filters

Wrappers

Using the likelihood ratio test

Using Bayes factors and their BIC approximations

Maximum likelihood subset selection

Consistency of the MAP cluster criterion with variable selection

Practical guidelines

Notes

Applications

Miscellaneous data sets

IRIS data

SWISS BILLS

STONE FLAKES

Gene expression data

Supervised and unsupervised methods

Combining gene selection and profile clustering

Application to the LEUKEMIA data

Notes

Appendix A: Geometry and linear algebra

Appendix B: Topology

Appendix C: Analysis

Appendix D: Measures and probabilities

Appendix E: Probability

Appendix F: Statistics

Appendix G: Optimization

About the Author

Dr. Gunter Ritter is an emeritus professor in the Department of Mathematics and Computer Science at the University of Passau. He is the author and coauthor of numerous research papers in scientific journals in the areas of measure theory, probability theory, queuing theory, statistics, pattern and image recognition, and Fourier analysis. He is a member of the International Federation of Classification Societies and its German branch GfKl as well as the German Mathematical Society.

About the Series

Chapman & Hall/CRC Monographs on Statistics and Applied Probability

Subject Categories

BISAC Subject Codes/Headings:
COM021030: COMPUTERS / Database Management / Data Mining
MAT029000: MATHEMATICS / Probability & Statistics / General