Computational Genomics with R provides a starting point for beginners in genomic data analysis and also guides more advanced practitioners to sophisticated data analysis techniques in genomics. The book covers topics from R programming, to machine learning and statistics, to the latest genomic data analysis techniques. The text provides accessible information and explanations, always with the genomics context in the background. The book also contains practical and well-documented examples in R so readers can analyze their data by simply reusing the code presented. As the field of computational genomics is interdisciplinary, it requires different starting points for people with different backgrounds. For example, a biologist might skip sections on basic genome biology and start with R programming, whereas a computer scientist might want to start with genome biology.
After reading:
- You will have the basics of R and be able to dive right into specialized uses of R for computational genomics such as using Bioconductor packages.
- You will be familiar with statistics, supervised and unsupervised learning techniques that are important in data modeling, and exploratory analysis of high-dimensional data.
- You will understand genomic intervals and operations on them that are used for tasks such as aligned read counting and genomic feature annotation.
- You will know the basics of processing and quality checking high-throughput sequencing data.
- You will be able to do sequence analysis, such as calculating GC content for parts of a genome or finding transcription factor binding sites.
- You will know about visualization techniques used in genomics, such as heatmaps, meta-gene plots, and genomic track visualization.
- You will be familiar with analysis of different high-throughput sequencing data sets, such as RNA-seq, ChIP-seq, and BS-seq.
- You will know basic techniques for integrating and interpreting multi-omics datasets.
Altuna Akalin is a group leader and head of the Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center, Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. He has published an extensive body of work in this area. The framework for this book grew out of the yearly computational genomics courses he has been organizing and teaching since 2015.
1. Introduction to Genomics
Genes, DNA and central dogma
What is a genome?
What is a gene?
How are genes controlled? The transcriptional and the post-transcriptional regulation
What does a gene look like?
Elements of gene regulation
Transcriptional regulation
Post-transcriptional regulation
Shaping the genome: DNA mutation
High-throughput experimental methods in genomics
The general idea behind high-throughput techniques
High-throughput sequencing
Visualization and data repositories for genomics
2. Introduction to R for Genomic Data Analysis
Steps of (genomic) data analysis
Data collection
Data quality check and cleaning
Data processing
Exploratory data analysis and modeling
Visualization and reporting
Why use R for genomics?
Getting started with R
Installing packages
Installing packages in custom locations
Getting help on functions and packages
Computations in R
Data structures
Vectors
Matrices
Data Frames
Lists
Factors
Data types
Reading and writing data
Reading large files
Plotting in R with base graphics
Combining multiple plots
Saving plots
Plotting in R with ggplot
Combining multiple plots
ggplot and tidyverse
Functions and control structures (for, if/else etc)
User defined functions
Loops and looping structures in R
Exercises
Computations in R
Data structures in R
Reading in and writing data out in R
Plotting in R
Functions and control structures (for, if/else etc)
3. Statistics for Genomics
How to summarize collection of data points: The idea behind statistical distributions
Describing the central tendency: mean and median
Describing the spread: measurements of variation
Precision of estimates: Confidence intervals
How to test for differences between samples
Randomization-based testing for difference of the means
Using t-test for difference of the means between two samples
Multiple testing correction
Moderated t-tests: using information from multiple comparisons
Relationship between variables: linear models and correlation
How to fit a line
How to estimate the error of the coefficients
Accuracy of the model
Regression with categorical variables
Regression pitfalls
Exercises
How to summarize collection of data points: The idea behind statistical distributions
How to test for differences in samples
Relationship between variables: linear models and correlation
4. Exploratory Data Analysis with Unsupervised Machine Learning
Clustering: grouping samples based on their similarity
Distance metrics
Hierarchical clustering
K-means clustering
How to choose “k”, the number of clusters
Dimensionality reduction techniques: visualizing complex data sets in 2D
Principal component analysis
Other matrix factorization methods for dimensionality reduction
Multi-dimensional scaling
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Exercises
Clustering
Dimension Reduction
5. Predictive Modeling with Supervised Machine Learning
How are machine learning models fit?
Machine learning vs Statistics
Steps in supervised machine learning
Use case: Disease subtype from genomics data
Data preprocessing
Data transformation
Filtering data and scaling
Dealing with missing values
Splitting the data
Holdout test dataset
Cross-validation
Bootstrap resampling
Predicting the subtype with k-nearest neighbors
Assessing the performance of our model
Receiver Operating Characteristic (ROC) Curves
Model tuning and avoiding overfitting
Model complexity and bias variance trade-off
Data split strategies for model tuning and testing
Variable importance
How to deal with class imbalance
Sampling for class balance
Altering case weights
Selecting different classification score cutoffs
Dealing with correlated predictors
Trees and forests: Random forests in action
Decision trees
Trees to forests
Variable importance
Logistic regression and regularization
Regularization in order to avoid overfitting
Variable importance
Other supervised algorithms
Gradient boosting
Support Vector Machines (SVM)
Neural networks and deep versions of it
Ensemble learning
Predicting continuous variables: regression with machine learning
Use case: Predicting age from DNA methylation
Reading and processing the data
Running random forest regression
Exercises
Classification
Regression
6. Operations on Genomic Intervals and Genome Arithmetic
Operations on Genomic Intervals with GenomicRanges package
How to create and manipulate a GRanges object
Getting genomic regions into R as GRanges objects
Finding regions that do/do not overlap with another set of regions
Dealing with mapped high-throughput sequencing reads
Counting mapped reads for a set of regions
Dealing with continuous scores over the genome
Extracting subsections of Rle and RleList objects
Genomic intervals with more information: SummarizedExperiment class
Create a SummarizedExperiment object
Subset and manipulate the SummarizedExperiment object
Visualizing and summarizing genomic intervals
Visualizing intervals on a locus of interest
Summaries of genomic intervals on multiple loci
Making karyograms and circos plots
Exercises
Operations on Genomic Intervals with GenomicRanges package
Dealing with mapped high-throughput sequencing reads
Dealing with contiguous scores over the genome
Visualizing and summarizing genomic intervals
7. Quality Check, Processing and Alignment of High-throughput Sequencing Reads
FASTA and FASTQ formats
Quality check on sequencing reads
Sequence quality per base/cycle
Sequence content per base/cycle
Read frequency plot
Other quality metrics and QC tools
Filtering and trimming reads
Mapping/aligning reads to the genome
Further processing of aligned reads
Exercises
8. RNA-seq Analysis
What is gene expression?
Methods to detect gene expression
Gene Expression Analysis Using High-throughput Sequencing Technologies
Processing raw data
Alignment
Quantification
Within sample normalization of the read counts
Computing different normalization schemes in R
Exploratory analysis of the read count table
Differential expression analysis
Functional Enrichment Analysis
Accounting for additional sources of variation
Other applications of RNA-seq
Exercises
Exploring the count tables
Differential expression analysis
Functional enrichment analysis
Removing unwanted variation from the expression data
9. ChIP-seq analysis
Regulatory protein-DNA interactions
Measuring protein-DNA interactions with ChIP-seq
Factors that affect ChIP-seq experiment and analysis quality
Antibody specificity
Sequencing depth
PCR duplication
Biological replicates
Control experiments
Using tagged proteins
Pre-processing ChIP data
Mapping of ChIP-seq data
ChIP quality control
The data
Sample clustering
Visualization in the Genome Browser
Plus and minus strand cross-correlation
GC bias quantification
Sequence read genomic distribution
Peak calling
Types of ChIP-seq experiments
Peak calling - sharp peaks
Peak calling - Broad regions
Peak quality control
Peak annotation
Motif discovery
Motif comparison
What to do next?
Exercises
Quality control
10. DNA methylation analysis using bisulfite sequencing data
What is DNA methylation?
How is DNA methylation set?
How to measure DNA methylation with bisulfite sequencing
Analyzing DNA methylation data
Processing raw data and getting data into R
Data filtering and exploratory analysis
Reading methylation call files
Further quality check
Merging samples into a single table
Filtering CpGs
Clustering samples
Principal component analysis
Extracting interesting regions: segmentation and differential methylation
Differential methylation
Methylation segmentation
Working with large files
Annotation of DMRs/DMCs and segments
Further annotation with genes or gene sets
Other R packages that can be used for methylation analysis
Exercises
Differential methylation
Methylome segmentation
11. Multi-omics Analysis
Use case: Multi-omics data from colorectal cancer
Latent variable models for multi-omics integration
Matrix factorization methods for unsupervised multi-omics data integration
Multiple Factor Analysis
Joint Non-negative Matrix Factorization
iCluster
Clustering using latent factors
One-hot clustering
K-means clustering
Biological interpretation of latent factors
Inspection of feature weights in loading vectors
Making sense of factors using enrichment analysis
Interpretation using additional covariates
Exercises
Matrix factorization methods
Clustering using latent factors
Biological interpretation of latent factors
Biography
Dr. Altuna Akalin is a bioinformatics scientist and the head of Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center in Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. His interest is in using machine learning and statistics to uncover patterns related to important biological variables such as disease state and type. He has lived in the USA, Norway, Turkey, Japan, and Switzerland in order to pursue research work and education related to computational genomics.
'This book provides a basic overview of computational tools developed in R for carrying out data analyses in genomics. It can be a valuable companion for anyone who wants to utilise the computational tools developed within the Bioconductor and R environments for education and research. This book’s main target audience are students of computational biology to get a first look at the diversity of machine learning methods. The book will also serve well biomedical researchers needing a guide to packages that can help them with the analysis of data that they encounter in their work.'
- Krzysztof Podgórski, International Statistical Review (2021) doi: 10.1111/insr.12453