Feature Engineering and Selection

A Practical Approach for Predictive Models, 1st Edition

By Max Kuhn, Kjell Johnson

Chapman and Hall/CRC

308 pages

Hardback: 9781138079229
pub: 2019-09-12

The process of developing predictive models includes many stages. Most resources focus on the modeling algorithms but neglect other critical aspects of the modeling process. This book describes techniques for finding the best representations of predictors for modeling and for finding the best subset of predictors for improving model performance. A variety of example data sets are used to illustrate the techniques along with R programs for reproducing the results.


"The book is timely and needed. The interest in all things 'data science' morphed into everybody pretending to do, or know, Machine Learning. Kuhn and Johnson happen to actually know this—as evidenced by their earlier and still-popular tome entitled ‘Applied Predictive Modeling.’ The proposed ‘Feature Engineering and Selection’ builds on this and extends it. I expect it to become as popular with a wide reach as both a textbook, self-study material, and reference." ~Dirk Eddelbuettel, University of Illinois at Urbana-Champaign

"As a reviewer, it has been exciting and edifying to see this book develop into what is likely to become one of the foundational works on feature engineering. It is launching propitiously on the current tide of interest in both interpretable models and AutoML." ~Robert Horton, Microsoft

"In recent years, the statistics literature has featured new developments in modeling and predictive analytics. Approaches such as cross-validation and statistical/machine learning techniques have become widespread. The author's previous book ("Applied Predictive Modeling", APM) provided a wide-ranging introduction and integration of these methods and suggested a workflow in R to carry out exploratory and confirmation analyses. With this project, the authors have identified an important and interesting component of these methods that describes building better models by focusing on the predictors (feature engineering)…The authors focus on the variables that go into the model (and how they are represented) and argue that such issues are as important (or more important) than the particular methods that are applied to an analysis…The proposed book is likely to serve as a textbook (for a number of undergraduate and graduate courses in a variety of disciplines) and reference (for a large number of statisticians seeking principled and well-organized modeling)." ~Nicholas Horton, Amherst College

Table of Contents

1. Introduction

A Simple Example

Important Concepts

A More Complex Example

Feature Selection

An Outline of the Book


2. Illustrative Example: Predicting Risk of Ischemic Stroke




Predictive Modeling Across Sets

Other Considerations


3. A Review of the Predictive Modeling Process

Illustrative Example: OkCupid Profile Data

Measuring Performance

Data Splitting


Tuning Parameters and Overfitting

Model Optimization and Tuning

Comparing Models Using the Training Set

Feature Engineering Without Overfitting



4. Exploratory Visualizations

Introduction to the Chicago Train Ridership Data

Visualizations for Numeric Data: Exploring Train Ridership Data

Visualizations for Categorical Data: Exploring the OkCupid Data

Post Modeling Exploratory Visualizations



5. Encoding Categorical Predictors

Creating Dummy Variables for Unordered Categories

Encoding Predictors with Many Categories

Approaches for Novel Categories

Supervised Encoding Methods

Encodings for Ordered Data

Creating Features from Text Data

Factors versus Dummy Variables in Tree-Based Models



6. Engineering Numeric Predictors


1:1 Transformations

1:Many Transformations

Many:Many Transformations



7. Detecting Interaction Effects

Guiding Principles in the Search for Interactions

Practical Considerations

The Brute-Force Approach to Identifying Predictive Interactions

Approaches when Complete Enumeration is Practically Impossible

Other Potentially Useful Tools



8. Handling Missing Data

Understanding the Nature and Severity of Missing Information

Models that are Resistant to Missing Values

Deletion of Data

Encoding Missingness

Imputation Methods

Special Cases



9. Working with Profile Data

Illustrative Data: Pharmaceutical Manufacturing Monitoring

What are the Experimental Unit and the Unit of Prediction?

Reducing Background

Reducing Other Noise

Exploiting Correlation

Impacts of Data Processing on Modeling



10. Feature Selection Overview

Goals of Feature Selection

Classes of Feature Selection Methodologies

Effect of Irrelevant Features

Overfitting to Predictors and External Validation

A Case Study

Next Steps


11. Greedy Search Methods

Illustrative Data: Predicting Parkinson’s Disease

Simple Filters

Recursive Feature Elimination

Stepwise Selection



12. Global Search Methods

Naive Bayes Models

Simulated Annealing

Genetic Algorithms

Test Set Results



About the Authors

Max Kuhn, Ph.D., is a software engineer at RStudio. He worked for 18 years in drug discovery and medical diagnostics, applying predictive models to real data. He has authored numerous R packages for predictive modeling and machine learning.

Kjell Johnson, Ph.D., is the owner and founder of Stat Tenacity, a firm that provides statistical and predictive modeling consulting services. He has taught short courses on predictive modeling for the American Society for Quality, American Chemical Society, International Biometric Society, and for many corporations.

Kuhn and Johnson have also authored Applied Predictive Modeling, which is a comprehensive, practical guide to the process of building a predictive model. The text won the 2014 Technometrics Ziegel Prize for Outstanding Book.

About the Series

Chapman & Hall/CRC Data Science Series

Reflecting the interdisciplinary nature of the field, this new data science book series brings together researchers, practitioners, and instructors from statistics, computer science, machine learning, and analytics. The series will publish cutting-edge research, industry applications, and textbooks in data science.

* Presents the latest research and applications in the field, including new statistical and computational techniques
* Covers a broad range of interdisciplinary topics
* Provides guidance on the use of software for data science, including R, Python, and Julia
* Includes both introductory and advanced material for students and professionals
* Presents concepts while assuming minimal theoretical background

The scope of the series is broad, including titles in machine learning, pattern recognition, predictive analytics, business analytics, visualization, programming, software, learning analytics, data collection and wrangling, interactive graphics, reproducible research, and more. The inclusion of examples, applications, and code implementation is essential.

Subject Categories

BISAC Subject Codes/Headings:
COMPUTERS / Database Management / Data Mining
COMPUTERS / Machine Theory
MATHEMATICS / Probability & Statistics / General