1st Edition

Computational Genomics with R

By Altuna Akalin
Copyright 2021
    462 Pages
    by Chapman & Hall

    Computational Genomics with R provides a starting point for beginners in genomic data analysis and also guides more advanced practitioners to sophisticated data analysis techniques in genomics. The book covers topics from R programming, to machine learning and statistics, to the latest genomic data analysis techniques. The text provides accessible information and explanations, always with the genomics context in the background. It also contains practical and well-documented examples in R, so readers can analyze their own data by reusing the code presented. Because the field of computational genomics is interdisciplinary, it requires different starting points for people with different backgrounds. For example, a biologist might skip sections on basic genome biology and start with R programming, whereas a computer scientist might want to start with genome biology.

    After reading:

    • You will have the basics of R and be able to dive right into specialized uses of R for computational genomics such as using Bioconductor packages.
    • You will be familiar with statistics, supervised and unsupervised learning techniques that are important in data modeling, and exploratory analysis of high-dimensional data.
    • You will understand genomic intervals and operations on them that are used for tasks such as aligned read counting and genomic feature annotation.
    • You will know the basics of processing and quality checking high-throughput sequencing data.
    • You will be able to do sequence analysis, such as calculating GC content for parts of a genome or finding transcription factor binding sites.
    • You will know about visualization techniques used in genomics, such as heatmaps, meta-gene plots, and genomic track visualization.
    • You will be familiar with analysis of different high-throughput sequencing data sets, such as RNA-seq, ChIP-seq, and BS-seq.
    • You will know basic techniques for integrating and interpreting multi-omics datasets.

    Altuna Akalin is a group leader and head of the Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center, Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. He has published an extensive body of work in this area. The framework for this book grew out of the yearly computational genomics courses he has been organizing and teaching since 2015.

    1. Introduction to Genomics
     Genes, DNA and central dogma                  
     What is a genome?                     
     What is a gene?                       
     How genes are controlled? The transcriptional and the post-transcriptional  
     What does a gene look like?                 
     Elements of gene regulation                    
     Transcriptional regulation                 
     Post-transcriptional regulation               
     Shaping the genome: DNA mutation                
     High-throughput experimental methods in genomics       
     The general idea behind high-throughput techniques   
     High-throughput sequencing               
     Visualization and data repositories for genomics         

    2. Introduction to R for Genomic Data Analysis
     Steps of (genomic) data analysis                  
     Data collection                       
     Data quality check and cleaning              
     Data processing                       
     Exploratory data analysis and modeling          
     Visualization and reporting                
     Why use R for genomics?
     Getting started with R                       
     Installing packages                     
     Installing packages in custom locations          
     Getting help on functions and packages          
     Computations in R                         
     Data structures                           
     Data Frames                        
     Data types                              
     Reading and writing data                      
     Reading large files                     
     Plotting in R with base graphics                  
     Combining multiple plots                 
     Saving plots                         
     Plotting in R with ggplot                     
     Combining multiple plots                 
     ggplot and tidyverse                    
     Functions and control structures (for, if/else etc)         
     User defined functions                   
     Loops and looping structures in R             
     Computations in R                     
     Data structures in R                    
     Reading in and writing data out in R            
     Plotting in R                         
     Functions and control structures (for, if/else etc)     

    3. Statistics for Genomics
     How to summarize collection of data points: The idea behind statistical distributions
     Describing the central tendency: mean and median    
     Describing the spread: measurements of variation    
     Precision of estimates: Confidence intervals        
     How to test for differences between samples           
     randomization based testing for difference of the means 
     Using t-test for difference of the means between two samples                             
     multiple testing correction                 
     moderated t-tests: using information from multiple comparisons                           
     Relationship between variables: linear models and correlation  
     How to fit a line                       
     How to estimate the error of the coefficients        
     Accuracy of the model                   
     Regression with categorical variables           
     Regression pitfalls                     
     How to summarize collection of data points: The idea behind statistical distributions
     How to test for differences in samples           
     Relationship between variables: linear models and correlation                             

    4. Exploratory Data Analysis with Unsupervised Machine Learning
     Clustering: grouping samples based on their similarity      
     Distance metrics                      
     Hierarchical clustering
     K-means clustering                     
     how to choose “k”, the number of clusters         
     Dimensionality reduction techniques: visualizing complex data sets in 2D
     Principal component analysis               
     Other matrix factorization methods for dimensionality reduction                           
     Multi-dimensional scaling                 
     t-Distributed Stochastic Neighbor Embedding (t-SNE)  
     Dimension Reduction                   

    5. Predictive Modeling with Supervised Machine Learning
     How machine learning models are fit?               
     Machine learning vs Statistics               
     Steps in supervised machine learning               
     Use case: Disease subtype from genomics data          
     Data preprocessing                         
     data transformation                    
     Filtering data and scaling                  
     Dealing with missing values                
     Splitting the data                          
     Holdout test dataset                    
     Bootstrap resampling                    
     Predicting the subtype with k-nearest neighbors         
     Assessing the performance of our model              
     Receiver Operating Characteristic (ROC) Curves     
     Model tuning and avoiding overfitting               
     Model complexity and bias variance trade-off       
     Data split strategies for model tuning and testing     
     Variable importance                        
     How to deal with class imbalance                 
     Sampling for class balance                 
     Altering case weights                    
     selecting different classification score cutoffs       
     Dealing with correlated predictors                 
     Trees and forests: Random forests in action            
     decision trees                        
     Trees to forests                       
     Variable importance                    
     Logistic regression and regularization               
     regularization in order to avoid overfitting        
     variable importance                     
     Other supervised algorithms                    
     Gradient boosting                     
     Support Vector Machines (SVM)              
     Neural networks and deep versions of it          
     Ensemble learning                     
     Predicting continuous variables: regression with machine learning                                 
     Use case: Predicting age from DNA methylation      
     reading and processing the data              
     Running random forest regression             

    6. Operations on Genomic Intervals and Genome Arithmetic
     Operations on Genomic Intervals with GenomicRanges package 
     How to create and manipulate a GRanges object      
     Getting genomic regions into R as GRanges objects    
     Finding regions that do/do not overlap with another set of regions                           
     Dealing with mapped high-throughput sequencing reads     
     Counting mapped reads for a set of regions        
     Dealing with continuous scores over the genome         
     Extracting subsections of Rle and RleList objects     
     Genomic intervals with more information: SummarizedExperiment class                              
     Create a SummarizedExperiment object          
     Subset and manipulate the SummarizedExperiment object                             
     Visualizing and summarizing genomic intervals         
     Visualizing intervals on a locus of interest         
     Summaries of genomic intervals on multiple loci     
     Making karyograms and circos plots            
     Operations on Genomic Intervals with GenomicRanges package                           
     Dealing with mapped high-throughput sequencing reads 
     Dealing with contiguous scores over the genome     
     Visualizing and summarizing genomic intervals     

    7. Quality Check, Processing and Alignment of High-throughput Sequencing Reads
     FASTA and FASTQ formats                     
     Quality check on sequencing reads                 
     Sequence quality per base/cycle              
     Sequence content per base/cycle              
     Read frequency plot                     
     Other quality metrics and QC tools             
     Filtering and trimming reads                    
     Mapping/aligning reads to the genome              
     Further processing of aligned reads                

    8. RNA-seq Analysis
     What is gene expression?                      
     Methods to detect gene expression                 
     Gene Expression Analysis Using High-throughput Sequencing Technologies                            
     Processing raw data                    
     Within sample normalization of the read counts     
     Computing different normalization schemes in R     
     Exploratory analysis of the read count table        
     Differential expression analysis              
     Functional Enrichment Analysis              
     Accounting for additional sources of variation       
     Other applications of RNA-seq                   
     Exploring the count tables                 
     Differential expression analysis              
     Functional enrichment analysis              
     Removing unwanted variation from the expression data 

    9. ChIP-seq analysis
     Regulatory protein-DNA interactions               
     Measuring protein-DNA interactions with ChIP-seq       
     Factors that affect ChIP-seq experiment and analysis quality   
     Antibody specificity                     
     Sequencing depth                      
     PCR duplication                       
     Biological replicates                     
     Control experiments                    
     Using tagged proteins                   
     Pre-processing ChIP data                      
     Mapping of ChIP-seq data                 
     ChIP quality control                        
     The data                           
     Sample clustering                      
     Visualization in the Genome Browser           
     Plus and minus strand cross-correlation          
     GC bias quantification                   
     Sequence read genomic distribution            
     Peak calling                             
     Types of ChIP-seq experiments               
     Peak calling - sharp peaks                  
     Peak calling - Broad regions                
     Peak quality control                     
     Peak annotation                      
     Motif discovery                           
     Motif comparison                      
     What to do next?                          
     Quality control:                       

    10. DNA methylation analysis using bisulfite sequencing data
     What is DNA methylation?
     How DNA methylation is set?
     How to measure DNA methylation with bisulfite sequencing
     Analyzing DNA methylation data                  
     Processing raw data and getting data into R            
     Data filtering and exploratory analysis               
     Reading methylation call files               
     Further quality check                    
     Merging samples into a single table            
     Filtering CpGs                       
     Clustering samples                     
     Principal component analysis               
     Extracting interesting regions: segmentation and differential methylation                             
     Differential methylation                  
     Methylation segmentation                 
     Working with large files                  
     Annotation of DMRs/DMCs and segments             
     Further annotation with genes or gene sets        
     Other R packages that can be used for methylation analysis    
     Differential methylation                  
     Methylome segmentation                 

    11. Multi-omics Analysis
     Use case: Multi-omics data from colorectal cancer     
     Latent variable models for multi-omics integration        
     Matrix factorization methods for unsupervised multi-omics data integration                             
     Multiple Factor Analysis                  
     Joint Non-negative Matrix Factorization          
     Clustering using latent factors                   
     One-hot clustering                     
     K-means clustering                     
     Biological interpretation of latent factors             
     Inspection of feature weights in loading vectors      
     Making sense of factors using enrichment analysis    
     Interpretation using additional covariates         
     Matrix factorization methods               
     Clustering using latent factors               
     Biological interpretation of latent factors         



    Dr. Altuna Akalin is a bioinformatics scientist and the head of Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center in Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. His interest is in using machine learning and statistics to uncover patterns related to important biological variables such as disease state and type. He has lived in the USA, Norway, Turkey, Japan, and Switzerland in order to pursue research work and education related to computational genomics.

    'This book provides a basic overview of computational tools developed in R for carrying out data analyses in genomics. It can be a valuable companion for anyone who wants to utilise the computational tools developed within the Bioconductor and R environments for education and research. This book’s main target audience are students of computational biology to get a first look at the diversity of machine learning methods. The book will also serve well biomedical researchers needing a guide to packages that can help them with the analysis of data that they encounter in their work.'

    Krzysztof Podgórski, International Statistical Review (2021) doi: 10.1111/insr.12453