1st Edition

Data Analytics: A Small Data Approach

By Shuai Huang, Houtao Deng
    Copyright 2021
    273 Pages
    by Chapman & Hall

    Data Analytics: A Small Data Approach is suitable for an introductory data analytics course and helps students understand some of the main statistical learning models. It uses many small datasets to guide students through pencil-and-paper solutions of the models, which they can then compare with results obtained from established R packages. Because data science practice is a process that should be told as a story, the book also includes extensive material on exploratory data analysis, residual analysis, and flowcharts for developing and validating models and data pipelines.

    The main models covered in this book include linear regression, logistic regression, tree models and random forests, ensemble learning, sparse learning, principal component analysis, kernel methods including the support vector machine and kernel regression, and deep learning. Each chapter introduces two or three techniques. For each technique, the book first highlights the intuition and rationale, then shows how mathematics is used to articulate the intuition and formulate the learning problem. R is used to implement the techniques on both simulated and real-world datasets. Python code is also available at the book’s website: http://dataanalyticsbook.info.

    1. INTRODUCTION

    Who will benefit from this book

    Overview of a Data Analytics Pipeline

    Topics in a Nutshell

    2. ABSTRACTION

    Regression & tree models

    Overview

    Regression Models

    Tree Models

    Remarks

    Exercises

    3. RECOGNITION

    Logistic regression & ranking

    Overview

    Logistic Regression Model

    A Ranking Problem by Pairwise Comparison

    Statistical Process Control using Decision Tree

    Remarks

    Exercises

    4. RESONANCE

    Bootstrap & random forests

    Overview

    How Bootstrap Works

    Random Forests

    Remarks

    Exercises

    5. LEARNING (I)

    Cross validation & OOB

    Overview

    Cross-Validation

    Out-of-bag error in Random Forest

    Remarks

    Exercises

    6. DIAGNOSIS

    Residuals & heterogeneity

    Overview

    Diagnosis in Regression

    Diagnosis in Random Forests

    Clustering

    Remarks

    Exercises

    7. LEARNING (II)

    SVM & ensemble learning

    Overview

    Support Vector Machine

    Ensemble Learning

    Remarks

    Exercises

    8. SCALABILITY

    LASSO & PCA

    Overview

    LASSO

    Principal Component Analysis

    Remarks

    Exercises

    9. PRAGMATISM

    Experience & experimental

    Overview

    Kernel Regression Model

    Conditional Variance Regression Model

    Remarks

    Exercises

    10. SYNTHESIS

    Architecture & pipeline

    Overview

    Deep Learning

    inTrees

    Remarks

    Exercises

    CONCLUSION

    APPENDIX: A BRIEF REVIEW OF BACKGROUND KNOWLEDGE

    The normal distribution

    Matrix operations

    Optimization

    Biography

    Shuai Huang is an associate professor in the Department of Industrial & Systems Engineering at the University of Washington. He conducts interdisciplinary research in machine learning, data analytics, and applied operations research, with applications in healthcare, manufacturing, and transportation.

    Houtao Deng is a data science researcher and practitioner. He developed several new decision tree methods such as inTrees. He has built data-driven products for forecasting, scheduling, pricing, recommendation, fraud detection, and image recognition.

    "Another strength of the book is that the authors cover the regression methods comprehensively, starting from the relationship between variables, to the connections between methods. As a result, this book may be an introductory guide for health care professionals, students, and lecturers, both by showing the exercises with manual solutions and giving the R coding of the methods."
    -Selen Yilmaz Isikhan in ISCB, September 2022