1st Edition

Robust Cluster Analysis and Variable Selection

By Gunter Ritter
Copyright 2015
    392 Pages, 60 B/W Illustrations
    by Chapman & Hall

    Clustering remains a vibrant area of research in statistics. Although there are many books on this topic, there are relatively few that are well founded in the theoretical aspects. In Robust Cluster Analysis and Variable Selection, Gunter Ritter presents an overview of the theory and applications of probabilistic clustering and variable selection, synthesizing the key research results of the last 50 years.

    The author focuses on the robust clustering methods he found to be the most useful on simulated data and real-time applications. The book provides clear guidance for the varying needs of both settings, describing scenarios in which accuracy and speed are the primary goals.

    Robust Cluster Analysis and Variable Selection includes all of the important theoretical details and covers the key probabilistic models, robustness issues, optimization algorithms, validation techniques, and variable selection methods. The book illustrates the different methods with simulated data and applies them to real-world data sets that can be easily downloaded from the web. This gives you guidance on how to use clustering methods, as well as applicable procedures and algorithms, without having to understand their probabilistic fundamentals.

    Introduction

    Mixture and classification models and their likelihood estimators
    General consistency and asymptotic normality
    Local likelihood estimates
    Maximum likelihood estimates
    Notes
    Mixture models and their likelihood estimators
    Latent distributions
    Finite mixture models
    Identifiable mixture models
    Asymptotic properties of local likelihood maxima
    Asymptotic properties of the MLE: constrained nonparametric mixture models
    Asymptotic properties of the MLE: constrained parametric mixture models
    Notes
    Classification models and their criteria
    Probabilistic criteria for general populations
    Admissibility and size constraints
    Steady partitions
    Elliptical models
    Normal models
    Geometric considerations
    Consistency of the MAP criterion
    Notes


    Robustification by trimming
    Outliers and measures of robustness
    Outliers
    The sensitivities
    Sensitivity of ML estimates of mixture models
    Breakdown points
    Trimming the mixture model
    Trimmed likelihood function of the mixture model
    Normal components
    Universal breakdown points of covariance matrices, mixing rates, and means
    Restricted breakdown point of mixing rates and means
    Notes
    Trimming the classification model – the TDC
    Trimmed MAP classification model
    Normal case – the Trimmed Determinant Criterion, TDC
    Breakdown robustness of the constrained TDC
    Universal breakdown point of covariance matrices and means
    Restricted breakdown point of the means
    Notes

    Algorithms
    EM algorithm for mixtures
    General mixtures
    Normal mixtures
    Mixtures of multivariate t-distributions
    Trimming – the EMT algorithm
    Order of Convergence
    Acceleration of the mixture EM
    Notes
    k-Parameters algorithms
    General and elliptically symmetric models
    Steady solutions and trimming
    Using combinatorial optimization
    Overall algorithms
    Notes
    Hierarchical methods for initial solutions

    Favorite solutions and cluster validation
    Scale balance and Pareto solutions
    Number of components of uncontaminated data
    Likelihood-ratio tests
    Using cluster criteria as test statistics
    Model selection criteria
    Ridgeline manifold
    Number of components and outliers
    Classification trimmed likelihood curves
    Trimmed BIC
    Adjusted BIC
    Cluster validation
    Separation indices
    Normality and related tests
    Visualization
    Measures of agreement of partitions
    Stability
    Notes

    Variable selection in clustering
    Irrelevance
    Definition and general properties
    The normal case
    Filters
    Univariate filters
    Multivariate filters
    Wrappers
    Using the likelihood ratio test
    Using Bayes factors and their BIC approximations
    Maximum likelihood subset selection
    Consistency of the MAP cluster criterion with variable selection
    Practical guidelines
    Notes

    Applications
    Miscellaneous data sets
    IRIS data
    SWISS BILLS
    STONE FLAKES
    Gene expression data
    Supervised and unsupervised methods
    Combining gene selection and profile clustering
    Application to the LEUKEMIA data
    Notes

    Appendix A: Geometry and linear algebra
    Appendix B: Topology
    Appendix C: Analysis
    Appendix D: Measures and probabilities
    Appendix E: Probability
    Appendix F: Statistics
    Appendix G: Optimization

    Biography

    Dr. Gunter Ritter is an emeritus professor in the Department of Mathematics and Computer Science at the University of Passau. He is the author and coauthor of numerous research papers in scientific journals in the areas of measure theory, probability theory, queuing theory, statistics, pattern and image recognition, and Fourier analysis. He is a member of the International Federation of Classification Societies and its German branch GfKl as well as the German Mathematical Society.

    "I congratulate the author on his hard work, which provides a well-founded mathematical/probabilistic treatment for some valuable clustering techniques...I highly recommend this book and find it very enjoyable to read for those with enough background and who wish to gain a more in-depth knowledge of cluster analysis. For this audience, it is a stimulating read covering several novel and useful ideas and providing a starting point for developing more advanced theory and applications...In summary, I found the book appealing and I appreciate the effort made by the author in providing a rigorous approach to (robust) cluster analysis."
    Journal of the American Statistical Association, May 2016

    "This book is quite theoretical, with parts of the text being highly technical. However, it also provides the practitioner with useful procedures and algorithms applicable without having to understand the probabilistic fundamentals."
    Mathematical Reviews, August 2015

    "Professor Ritter has contributed an original and highly valuable volume on probabilistic clustering. This book provides a selection of methods that the author has found most useful in the analysis of real and simulated data. It places a special focus on problems caused by outliers and irrelevant variables, yet, above all, with an admirable and in-depth attention to the roots of methods in mathematical theorems and fundamental statistical principles. An absolute must for those who really care about the mathematical-statistical foundations of probabilistic cluster analysis and who want to get to the bottom of them."
    —Iven Van Mechelen, Past President of the International Federation of Classification Societies

    "This book provides a marvelous, deep, comprehensive, and knowledgeable presentation of basic and advanced methods and algorithms for data clustering related to model-based approaches (mixture model and fixed-classification approach). Its special and innovative features consist in the presentation of outlier models and corresponding trimming variants of classical maximum likelihood clustering methods and in a complete derivation of the large-sample theory of the resulting estimates, with and without outliers (this did not exist in book form before). In addition to presenting suitable (EM and k-parameters) clustering algorithms, the author proposes new ideas to cope with a possibly large number of ‘local’ clustering solutions, e.g., by selecting ‘favorite’ classifications, together with methods for cluster evaluation and variable selection. Real-case examples, e.g., from gene analysis, illustrate the proposed methods.
    Given its broad methodological range, the presentation of new concepts and methods in clustering, the consideration of outlier-infected situations, and the complete and exact derivation of results, this book can be considered a standard work for all classificationists and data analysts. For practitioners, it contains a wealth of models and algorithms to choose from and many tricky practical advices for computing and interpretation. For researchers, this will be an indispensable source of information concerning the statistical and mathematical foundations and results in the context of model-based clustering."
    —Hans-Hermann Bock, RWTH Aachen University, Germany