1st Edition

Feature Engineering and Selection: A Practical Approach for Predictive Models

By Max Kuhn, Kjell Johnson
Copyright 2020
    314 Pages
    by Chapman & Hall

    The process of developing predictive models includes many stages. Most resources focus on the modeling algorithms but neglect other critical aspects of the modeling process. This book describes techniques for finding the best representations of predictors for modeling and for finding the best subset of predictors for improving model performance. A variety of example data sets are used to illustrate the techniques along with R programs for reproducing the results.

    1. Introduction
    A Simple Example
    Important Concepts
    A More Complex Example
    Feature Selection
    An Outline of the Book

    2. Illustrative Example: Predicting Risk of Ischemic Stroke
    Predictive Modeling Across Sets
    Other Considerations

    3. A Review of the Predictive Modeling Process
    Illustrative Example: OkCupid Profile Data
    Measuring Performance
    Data Splitting
    Tuning Parameters and Overfitting
    Model Optimization and Tuning
    Comparing Models Using the Training Set
    Feature Engineering Without Overfitting

    4. Exploratory Visualizations
    Introduction to the Chicago Train Ridership Data
    Visualizations for Numeric Data: Exploring Train Ridership Data
    Visualizations for Categorical Data: Exploring the OkCupid Data
    Post Modeling Exploratory Visualizations

    5. Encoding Categorical Predictors
    Creating Dummy Variables for Unordered Categories
    Encoding Predictors with Many Categories
    Approaches for Novel Categories
    Supervised Encoding Methods
    Encodings for Ordered Data
    Creating Features from Text Data
    Factors versus Dummy Variables in Tree-Based Models

    6. Engineering Numeric Predictors
    1:1 Transformations
    1:Many Transformations
    Many:Many Transformations

    7. Detecting Interaction Effects
    Guiding Principles in the Search for Interactions
    Practical Considerations
    The Brute-Force Approach to Identifying Predictive Interactions
    Approaches when Complete Enumeration is Practically Impossible
    Other Potentially Useful Tools

    8. Handling Missing Data
    Understanding the Nature and Severity of Missing Information
    Models that are Resistant to Missing Values
    Deletion of Data
    Encoding Missingness
    Imputation Methods
    Special Cases

    9. Working with Profile Data
    Illustrative Data: Pharmaceutical Manufacturing Monitoring
    What are the Experimental Unit and the Unit of Prediction?
    Reducing Background
    Reducing Other Noise
    Exploiting Correlation
    Impacts of Data Processing on Modeling

    10. Feature Selection Overview
    Goals of Feature Selection
    Classes of Feature Selection Methodologies
    Effect of Irrelevant Features
    Overfitting to Predictors and External Validation
    A Case Study
    Next Steps

    11. Greedy Search Methods
    Illustrative Data: Predicting Parkinson’s Disease
    Simple Filters
    Recursive Feature Elimination
    Stepwise Selection

    12. Global Search Methods
    Naive Bayes Models
    Simulated Annealing
    Genetic Algorithms
    Test Set Results


    Max Kuhn, Ph.D., is a software engineer at RStudio. He worked for 18 years in drug discovery and medical diagnostics, applying predictive models to real data. He has authored numerous R packages for predictive modeling and machine learning.

    Kjell Johnson, Ph.D., is the owner and founder of Stat Tenacity, a firm that provides statistical and predictive modeling consulting services. He has taught short courses on predictive modeling for the American Society for Quality, American Chemical Society, International Biometric Society, and for many corporations.

    Kuhn and Johnson have also authored Applied Predictive Modeling, which is a comprehensive, practical guide to the process of building a predictive model. The text won the 2014 Technometrics Ziegel Prize for Outstanding Book.

    "The book is timely and needed. The interest in all things 'data science' morphed into everybody pretending to do, or know, Machine Learning. Kuhn and Johnson happen to actually know this—as evidenced by their earlier and still-popular tome entitled ‘Applied Predictive Modeling.’ The proposed ‘Feature Engineering and Selection’ builds on this and extends it. I expect it to become as popular with a wide reach as both a textbook, self-study material, and reference."
    ~Dirk Eddelbuettel, University of Illinois at Urbana-Champaign

    "As a reviewer, it has been exciting and edifying to see this book develop into what is likely to become one of the foundational works on feature engineering. It is launching propitiously on the current tide of interest in both interpretable models and AutoML."
    ~Robert Horton, Microsoft

    "In recent years, the statistics literature has featured new developments in modeling and predictive analytics. Approaches such as cross-validation and statistical/machine learning techniques have become widespread. The author's previous book ("Applied Predictive Modeling", APM) provided a wide-ranging introduction and integration of these methods and suggested a workflow in R to carry out exploratory and confirmation analyses. With this project, the authors have identified an important and interesting component of these methods that describes building better models by focusing on the predictors (feature engineering)…The authors focus on the variables that go into the model (and how they are represented) and argue that such issues are as important (or more important) than the particular methods that are applied to an analysis...The proposed book is likely to serve as a textbook (for a number of undergraduate and graduate courses in a variety of disciplines) and reference (for a large number of statisticians seeking principled and well-organized modeling)."
    ~Nicholas Horton, Amherst College

    "I think this book is great and a joy to read…I like the pragmatic and practical approach taken in the book, and the examples given are very illustrative. The emphasis on how and when to use resampling is refreshing and something that the community needs to hear more."
    ~Andreas C. Muller, Columbia University