1st Edition

Computational Genomics with R

ISBN 9781498781855
Published December 29, 2020 by Chapman and Hall/CRC
462 Pages

USD $130.00



Book Description

Computational Genomics with R provides a starting point for beginners in genomic data analysis and also guides more advanced practitioners to sophisticated data analysis techniques in genomics. The book covers topics from R programming, to machine learning and statistics, to the latest genomic data analysis techniques. The text provides accessible information and explanations, always with the genomics context in the background. The book also contains practical and well-documented examples in R so readers can analyze their data by simply reusing the code presented. As the field of computational genomics is interdisciplinary, it requires different starting points for people with different backgrounds. For example, a biologist might skip sections on basic genome biology and start with R programming, whereas a computer scientist might want to start with genome biology.

After reading:

  • You will have the basics of R and be able to dive right into specialized uses of R for computational genomics such as using Bioconductor packages.
  • You will be familiar with statistics, supervised and unsupervised learning techniques that are important in data modeling, and exploratory analysis of high-dimensional data.
  • You will understand genomic intervals and operations on them that are used for tasks such as aligned read counting and genomic feature annotation.
  • You will know the basics of processing and quality checking high-throughput sequencing data.
  • You will be able to do sequence analysis, such as calculating GC content for parts of a genome or finding transcription factor binding sites.
  • You will know about visualization techniques used in genomics, such as heatmaps, meta-gene plots, and genomic track visualization.
  • You will be familiar with analysis of different high-throughput sequencing data sets, such as RNA-seq, ChIP-seq, and BS-seq.
  • You will know basic techniques for integrating and interpreting multi-omics datasets.

Altuna Akalin is a group leader and head of the Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center, Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. He has published an extensive body of work in this area. The framework for this book grew out of the yearly computational genomics courses he has been organizing and teaching since 2015.

Table of Contents

1. Introduction to Genomics
 Genes, DNA and central dogma                  
 What is a genome?                     
 What is a gene?                       
 How are genes controlled? The transcriptional and the post-transcriptional regulation
 What does a gene look like?                 
 Elements of gene regulation                    
 Transcriptional regulation                 
 Post-transcriptional regulation               
 Shaping the genome: DNA mutation                
 High-throughput experimental methods in genomics       
 The general idea behind high-throughput techniques   
 High-throughput sequencing               
 Visualization and data repositories for genomics         

2. Introduction to R for Genomic Data Analysis
 Steps of (genomic) data analysis                  
 Data collection                       
 Data quality check and cleaning              
 Data processing                       
 Exploratory data analysis and modeling          
 Visualization and reporting                
 Why use R for genomics?
 Getting started with R                       
 Installing packages                     
 Installing packages in custom locations          
 Getting help on functions and packages          
 Computations in R                         
 Data structures                           
 Data Frames                        
 Data types                              
 Reading and writing data                      
 Reading large files                     
 Plotting in R with base graphics                  
 Combining multiple plots                 
 Saving plots                         
 Plotting in R with ggplot                     
 Combining multiple plots                 
 ggplot and tidyverse                    
 Functions and control structures (for, if/else etc)         
 User defined functions                   
 Loops and looping structures in R             
 Computations in R                     
 Data structures in R                    
 Reading in and writing data out in R            
 Plotting in R                         
 Functions and control structures (for, if/else etc)     

3. Statistics for Genomics
 How to summarize collection of data points: The idea behind statistical distributions
 Describing the central tendency: mean and median    
 Describing the spread: measurements of variation    
 Precision of estimates: Confidence intervals        
 How to test for differences between samples           
 Randomization-based testing for difference of the means
 Using t-test for difference of the means between two samples
 Multiple testing correction
 Moderated t-tests: using information from multiple comparisons
 Relationship between variables: linear models and correlation  
 How to fit a line                       
 How to estimate the error of the coefficients        
 Accuracy of the model                   
 Regression with categorical variables           
 Regression pitfalls                     
 How to summarize collection of data points: The idea behind statistical distributions
 How to test for differences in samples           
 Relationship between variables: linear models and correlation                             

4. Exploratory Data Analysis with Unsupervised Machine Learning
 Clustering: grouping samples based on their similarity      
 Distance metrics                      
 Hierarchical clustering
 K-means clustering                     
 How to choose “k”, the number of clusters
 Dimensionality reduction techniques: visualizing complex data sets in 2D
 Principal component analysis               
 Other matrix factorization methods for dimensionality reduction                           
 Multi-dimensional scaling                 
 t-Distributed Stochastic Neighbor Embedding (t-SNE)  
 Dimension Reduction                   

5. Predictive Modeling with Supervised Machine Learning
 How are machine learning models fit?
 Machine learning vs. statistics
 Steps in supervised machine learning               
 Use case: Disease subtype from genomics data          
 Data preprocessing                         
 Data transformation
 Filtering data and scaling                  
 Dealing with missing values                
 Splitting the data                          
 Holdout test dataset                    
 Bootstrap resampling                    
 Predicting the subtype with k-nearest neighbors         
 Assessing the performance of our model              
 Receiver Operating Characteristic (ROC) Curves     
 Model tuning and avoiding overfitting               
 Model complexity and bias variance trade-off       
 Data split strategies for model tuning and testing     
 Variable importance                        
 How to deal with class imbalance                 
 Sampling for class balance                 
 Altering case weights                    
 Selecting different classification score cutoffs
 Dealing with correlated predictors                 
 Trees and forests: Random forests in action            
 Decision trees
 Trees to forests                       
 Variable importance                    
 Logistic regression and regularization               
 Regularization in order to avoid overfitting
 Variable importance
 Other supervised algorithms                    
 Gradient boosting                     
 Support Vector Machines (SVM)              
 Neural networks and deep versions of it          
 Ensemble learning                     
 Predicting continuous variables: regression with machine learning                                 
 Use case: Predicting age from DNA methylation      
 Reading and processing the data
 Running random forest regression             

6. Operations on Genomic Intervals and Genome Arithmetic
 Operations on Genomic Intervals with GenomicRanges package 
 How to create and manipulate a GRanges object      
 Getting genomic regions into R as GRanges objects    
 Finding regions that do/do not overlap with another set of regions                           
 Dealing with mapped high-throughput sequencing reads     
 Counting mapped reads for a set of regions        
 Dealing with continuous scores over the genome         
 Extracting subsections of Rle and RleList objects     
 Genomic intervals with more information: SummarizedExperiment class                              
 Create a SummarizedExperiment object          
 Subset and manipulate the SummarizedExperiment object                             
 Visualizing and summarizing genomic intervals         
 Visualizing intervals on a locus of interest         
 Summaries of genomic intervals on multiple loci     
 Making karyograms and circos plots            
 Operations on Genomic Intervals with GenomicRanges package                           
 Dealing with mapped high-throughput sequencing reads 
 Dealing with contiguous scores over the genome     
 Visualizing and summarizing genomic intervals     

7. Quality Check, Processing and Alignment of High-throughput Sequencing Reads
 FASTA and FASTQ formats                     
 Quality check on sequencing reads                 
 Sequence quality per base/cycle              
 Sequence content per base/cycle              
 Read frequency plot                     
 Other quality metrics and QC tools             
 Filtering and trimming reads                    
 Mapping/aligning reads to the genome              
 Further processing of aligned reads                

8. RNA-seq Analysis
 What is gene expression?                      
 Methods to detect gene expression                 
 Gene Expression Analysis Using High-throughput Sequencing Technologies                            
 Processing raw data                    
 Within sample normalization of the read counts     
 Computing different normalization schemes in R     
 Exploratory analysis of the read count table        
 Differential expression analysis              
 Functional Enrichment Analysis              
 Accounting for additional sources of variation       
 Other applications of RNA-seq                   
 Exploring the count tables                 
 Differential expression analysis              
 Functional enrichment analysis              
 Removing unwanted variation from the expression data 

9. ChIP-seq Analysis
 Regulatory protein-DNA interactions               
 Measuring protein-DNA interactions with ChIP-seq       
 Factors that affect ChIP-seq experiment and analysis quality   
 Antibody specificity                     
 Sequencing depth                      
 PCR duplication                       
 Biological replicates                     
 Control experiments                    
 Using tagged proteins                   
 Pre-processing ChIP data                      
 Mapping of ChIP-seq data                 
 ChIP quality control                        
 The data                           
 Sample clustering                      
 Visualization in the Genome Browser           
 Plus and minus strand cross-correlation          
 GC bias quantification                   
 Sequence read genomic distribution            
 Peak calling                             
 Types of ChIP-seq experiments               
 Peak calling - sharp peaks                  
 Peak calling - Broad regions                
 Peak quality control                     
 Peak annotation                      
 Motif discovery                           
 Motif comparison                      
 What to do next?                          
 Quality control

10. DNA Methylation Analysis Using Bisulfite Sequencing Data
 What is DNA methylation?
 How is DNA methylation set?
 How to measure DNA methylation with bisulfite sequencing
 Analyzing DNA methylation data                  
 Processing raw data and getting data into R            
 Data filtering and exploratory analysis               
 Reading methylation call files               
 Further quality check                    
 Merging samples into a single table            
 Filtering CpGs                       
 Clustering samples                     
 Principal component analysis               
 Extracting interesting regions: segmentation and differential methylation                             
 Differential methylation                  
 Methylation segmentation                 
 Working with large files                  
 Annotation of DMRs/DMCs and segments             
 Further annotation with genes or gene sets        
 Other R packages that can be used for methylation analysis    
 Differential methylation                  
 Methylome segmentation                 

11. Multi-omics Analysis
 Use case: Multi-omics data from colorectal cancer     
 Latent variable models for multi-omics integration        
 Matrix factorization methods for unsupervised multi-omics data integration                             
 Multiple Factor Analysis                  
 Joint Non-negative Matrix Factorization          
 Clustering using latent factors                   
 One-hot clustering                     
 K-means clustering                     
 Biological interpretation of latent factors             
 Inspection of feature weights in loading vectors      
 Making sense of factors using enrichment analysis    
 Interpretation using additional covariates         
 Matrix factorization methods               
 Clustering using latent factors               
 Biological interpretation of latent factors         



Dr. Altuna Akalin is a bioinformatics scientist and the head of Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center in Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. His interest is in using machine learning and statistics to uncover patterns related to important biological variables such as disease state and type. He has lived in the USA, Norway, Turkey, Japan, and Switzerland in order to pursue research work and education related to computational genomics.


'This book provides a basic overview of computational tools developed in R for carrying out data analyses in genomics. It can be a valuable companion for anyone who wants to utilise the computational tools developed within the Bioconductor and R environments for education and research. This book’s main target audience are students of computational biology to get a first look at the diversity of machine learning methods. The book will also serve well biomedical researchers needing a guide to packages that can help them with the analysis of data that they encounter in their work.'

Krzysztof Podgórski, International Statistical Review (2021) doi: 10.1111/insr.12453