1st Edition

Machine Learning for Factor Investing: R Version

By Guillaume Coqueret, Tony Guida
Copyright 2021
    342 Pages
    by Chapman & Hall

    Machine learning (ML) is progressively reshaping the fields of quantitative finance and algorithmic trading. ML tools are increasingly adopted by hedge funds and asset managers, notably for alpha signal generation and stock selection. The technicality of the subject can make it hard for non-specialists to join the bandwagon, as the jargon and coding requirements may seem out of reach. Machine Learning for Factor Investing: R Version bridges this gap. It provides a comprehensive tour of modern ML-based investment strategies that rely on firm characteristics.

    The book covers a wide array of subjects which range from economic rationales to rigorous portfolio back-testing and encompass both data processing and model interpretability. Common supervised learning algorithms such as tree models and neural networks are explained in the context of style investing and the reader can also dig into more complex techniques like autoencoder asset returns, Bayesian additive trees, and causal models.

    All topics are illustrated with self-contained R code samples and snippets that are applied to a large public dataset that contains over 90 predictors. The material, along with the content of the book, is available online so that readers can reproduce and enhance the examples at their convenience. If you have even a basic knowledge of quantitative finance, this combination of theoretical concepts and practical illustrations will help you learn quickly and deepen your financial and technical expertise.

    I Introduction

    1. Preface
     What this book is not about                          
     The targeted audience                              
     How this book is structured                          
     Companion website                               
     Why R?                                      
     Coding instructions                               
     Future developments                              
    2. Notations and data

    3. Introduction
     Portfolio construction: the workflow                      
     Machine learning is no magic wand

    4. Factor investing and asset pricing anomalies
     Detecting anomalies                               
     Simple portfolio sorts                          
     Predictive regressions, sorts, and p-value issues            
     Fama-MacBeth regressions
     Factor competition                            
     Advanced techniques                           
     Factors or characteristics?                           
     Hot topics: momentum, timing and ESG                   
     Factor momentum                            
     Factor timing                               
     The green factors                             
     The link with machine learning                         
     A short list of recent references                     
     Explicit connections with asset pricing models            
     Coding exercises                                 
    5. Data preprocessing

     Know your data                                 
     Missing data                                   
     Outlier detection                                 
     Feature engineering                               
     Feature selection                             
     Scaling the predictors                          
     Simple labels                               
     Categorical labels                             
     The triple barrier method                        
     Filtering the sample                           
     Return horizons                             
     Handling persistence                               
     Transforming features                          
     Macro-economic variables                        
     Active learning                              
     Additional code and results                           
     Impact of rescaling: graphical representation             
     Impact of rescaling: toy example                    
     Coding exercises                                 

    II Common supervised algorithms

    6. Penalized regressions and sparse hedging for minimum variance portfolios
     Penalized regressions
     Simple regressions                            
     Forms of penalizations                          
     Sparse hedging for minimum variance portfolios               
     Presentation and derivations                      
     Predictive regressions                              
     Literature review and principle                     
     Code and results                             
     Coding exercise                                 

    7. Tree-based methods
     Simple trees                                   
     Further details on classification                     
     Pruning criteria                              
     Code and interpretation                         
     Random forests                                 
     Code and results                             
     Boosted trees: Adaboost                            
     Boosted trees: extreme gradient boosting                   
     Managing loss
     Tree structure                               
     Code and results                             
     Instance weighting                            
     Coding exercises                                 
    8. Neural networks
     The original perceptron                             
     Multilayer perceptron (MLP)                          
     Introduction and notations                       
     Universal approximation                         
     Learning via back-propagation                     
     Further details on classification                     
     How deep should we go? And other practical issues             
     Architectural choices                           
     Frequency of weight updates and learning duration          
     Penalizations and dropout                        
     Code samples and comments for vanilla MLP                 
     Regression example                           
     Classification example                          
     Custom losses                               
     Recurrent networks                               
     Code and results                             
     Other common architectures                          
     Generative adversarial networks                    
     A word on convolutional networks                   
     Advanced architectures                         
     Coding exercise                                 
    9. Support vector machines
     SVM for classification                              
     SVM for regression                               
     Coding exercises                                 
    10. Bayesian methods

     The Bayesian framework                            
     Bayesian sampling                                
     Gibbs sampling                              
     Metropolis-Hastings sampling                      
     Bayesian linear regression                            
     Naive Bayes classifier                              
     Bayesian additive trees                             
     General formulation                           
     Sampling and predictions                        

    III From predictions to portfolios
    11. Validating and tuning

     Learning metrics                                 
     Regression analysis                            
     Classification analysis                          
     The variance-bias tradeoff: theory                   
     The variance-bias tradeoff: illustration                 
     The risk of overfitting: principle                     
     The risk of overfitting: some solutions                 
     The search for good hyperparameters                     
     Example: grid search                           
     Example: Bayesian optimization                    
     Short discussion on validation in backtests                  

    12. Ensemble models
     Linear ensembles                                 
     Stacked ensembles                                
     Two stage training                            
     Code and results                             
     Exogenous variables                           
     Shrinking inter-model correlations                   
    13. Portfolio backtesting
     Setting the protocol                               
     Turning signals into portfolio weights                     
     Performance metrics                               
     Pure performance and risk indicators                  
     Factor-based evaluation                         
     Risk-adjusted measures                         
     Transaction costs and turnover                     
     Common errors and issues                           
     Forward looking data                          
     Backtest overfitting                           
     Simple safeguards                            
     Implication of non-stationarity: forecasting is hard              
     General comments                            
     The no free lunch theorem                        
     Coding exercises                                 

    IV Further important topics

    14. Interpretability
     Global interpretations                              
     Simple models as surrogates                       
     Variable importance (tree-based)                    
     Variable importance (agnostic)                     
     Partial dependence plot                         
     Local interpretations                              
     Shapley values                              
    15. Two key concepts: causality and non-stationarity
     Granger causality                             
     Causal additive models                         
     Structural time-series models                      
     Dealing with changing environments                      
     Non-stationarity: yet another illustration               
     Online learning                              
     Homogeneous transfer learning                     
    16. Unsupervised learning
     The problem with correlated predictors                    
     Principal component analysis and autoencoders               
     A bit of algebra                              
     Clustering via k-means                             
     Nearest neighbors                                
     Coding exercise                                 
    17. Reinforcement learning

     Theoretical layout                                
     General framework                            
     The curse of dimensionality                           
     Policy gradient                                  
     Simple examples                                 
     Q-learning with simulations                       
     Q-learning with market data                      
     Concluding remarks                               

    V Appendix

    Data Description
    Solution to exercises


    Guillaume Coqueret is associate professor of finance and data science at EMLYON Business School. His recent research revolves around applications of machine learning tools in financial economics.

    Tony Guida is executive director at RAM Active Investments. He serves as chair of the machineByte think tank and is the author of Big Data and Machine Learning in Quantitative Investment.

    "This book is the perfect one for any data scientist on financial markets. It is well written, with lots of illustrations, examples, pieces of code, tips on the different statistical package available to perform the various algos. This book requires for sure a strong knowledge in quantitative finance and Machine Learning, so it cannot be put in any hands. But for those who are familiar with quantitative finance, this book can be a reference, as Hull's book is as regards to derivatives products. I liked the good and detailed analysis of the different Machine Learning algos, and the different examples used throughout the book. This book is perfect for assets managers having to run backtests and searching for innovative ways to enhance the return of their portfolios. I spent quite a good time reading this manuscript, and I would recommend it."
    -Frédéric Girod, Union of European Football Associations