1st Edition

Population Genomics with R

By Emmanuel Paradis
Copyright 2020
    396 Pages
    by Chapman & Hall

    394 Pages
    by Chapman & Hall

    Population Genomics with R presents a multidisciplinary approach to the analysis of population genomics. The methods treated cover a large number of topics from traditional population genetics to large-scale genomics with high-throughput sequencing data. Several dozen R packages are examined and integrated to provide a coherent software environment with a wide range of computational, statistical, and graphical tools. Small examples are used to illustrate the basics and published data are used as case studies. Readers are expected to have a basic knowledge of biology, genetics, and statistical inference methods. Graduate students and postdoctoral researchers will find resources to analyze their population genetic and genomic data and to help them design new studies.

    The first four chapters review the basics of population genomics, data acquisition, and the use of R to store and manipulate genomic data. Chapter 5 treats the exploration of genomic data, an important issue when analyzing large data sets. The other five chapters cover linkage disequilibrium, population genomic structure, geographical structure, past demographic events, and natural selection. These chapters include supervised and unsupervised methods, admixture analysis, an in-depth treatment of multivariate methods, and advice on how to handle GIS data. The analysis of natural selection, a traditional issue in evolutionary biology, has seen a revival with modern population genomic data. All chapters include exercises. Supplemental materials are available online (http://ape-package.ird.fr/PGR.html).

    1. Introduction

    Heredity, Genetics, and Genomics

    Principles of Population Genomics

    Units

    Genome Structures

    Mutations

    Drift and Selection

    R Packages and Conventions

    Required Knowledge and Other Readings

    2. Data Acquisition

    Samples and Sampling Designs

    How Much DNA in a Sample?

    Degraded Samples

    Sampling Designs

    Low-Throughput Technologies

    Genotypes From Phenotypes

    DNA Cleavage Methods

    Repeat Length Polymorphism

    Sanger and Shotgun Sequencing

    DNA Methylation and Bisulfite Sequencing

    High-Throughput Technologies

    DNA Microarrays

    High-Throughput Sequencing

    Restriction Site Associated DNA

    RNA Sequencing

    Exome Sequencing

    Sequencing of Pooled Individuals

    Designing a Study With HTS

    The Future of DNA Sequencing

    File Formats

    Data Files

    Archiving and Compression

    Bioinformatics and Genomics

    Processing Sanger Sequencing Data With sangerseqR

    Read Mapping With Rsubread

    Managing Read Alignments With Rsamtools

    Simulation of High-Throughput Sequencing Data

    Exercises

    3. Genomic Data in R

    What is an R Data Object?

    Data Classes for Genomic Data

    The Class "loci" (pegas)

    The Class "genind" (adegenet)

    The Classes "SNPbin" and "genlight" (adegenet)

    The Class "SnpMatrix" (snpStats)

    The Class "DNAbin" (ape)

    The Classes "XString" and "XStringSet" (Biostrings)

    The Package SNPRelate

    Data Input and Output

    Reading Text Files

    Reading Spreadsheet Files

    Reading VCF Files

    Reading PED and BED Files

    Reading Sequence Files

    Reading Annotation Files

    Writing Files

    Internet Databases

    Managing Files and Projects

    Exercises

    4. Data Manipulation

    Basic Data Manipulation in R

    Subsetting, Replacement, and Deletion

    Commonly Used Functions

    Recycling and Coercion

    Logical Vectors

    Memory Management

    Conversions

    Case Studies

    Mitochondrial Genomes of the Asiatic Golden Cat

    Complete Genomes of the Fruit Fly

    Human Genomes

    Influenza H1N1 Virus Sequences

    Jaguar Microsatellites

    Bacterial Whole Genome Sequences

    Metabarcoding of Fish Communities

    Exercises

    5. Data Exploration and Summaries

    Genotype and Allele Frequencies

    Allelic Richness

    Missing Data

    Haplotype and Nucleotide Diversity

    The Class "haplotype"

    Haplotype and Nucleotide Diversity From DNA Sequences

    Genetic and Genomic Distances

    Theoretical Background

    Hamming Distance

    Distances From DNA Sequences

    Distances From Allele Sharing

    Distances From Microsatellites

    Summary by Groups

    Sliding Windows

    DNA Sequences

    Summaries With Genomic Positions

    Package SNPRelate

    Multivariate Methods

    Matrix Decomposition

    Eigendecomposition

    Singular Value Decomposition

    Power Method and Random Matrices

    Principal Component Analysis

    adegenet

    SNPRelate

    flashpcaR

    Multidimensional Scaling

    Case Studies

    Mitochondrial Genomes of the Asiatic Golden Cat

    Complete Genomes of the Fruit Fly

    Human Genomes

    Influenza H1N1 Virus Sequences

    Jaguar Microsatellites

    Bacterial Whole Genome Sequences

    Metabarcoding of Fish Communities

    Exercises

    6. Linkage Disequilibrium and Haplotype Structure

    Why Linkage Disequilibrium is Important?

    Linkage Disequilibrium: Two Loci

    Phased Genotypes

    Theoretical Background

    Implementation in pegas

    Unphased Genotypes

    More Than Two Loci

    Haplotypes From Unphased Genotypes

    The Expectation–Maximization Algorithm

    Implementation in haplo.stats

    Locus-Specific Imputation

    Maps of Linkage Disequilibrium

    Phased Genotypes With pegas

    SNPRelate

    snpStats

    Case Studies

    Complete Genomes of the Fruit Fly

    Human Genomes

    Jaguar Microsatellites

    Exercises

    7. Population Genetic Structure

    Hardy–Weinberg Equilibrium

    F-Statistics

    Theoretical Background

    Implementations in pegas and in mmod

    Implementations in snpStats and in SNPRelate

    Trees and Networks

    Minimum Spanning Trees and Networks

    Statistical Parsimony

    Median Networks

    Phylogenetic Trees

    Multivariate Methods

    Principles of Discriminant Analysis

    Discriminant Analysis of Principal Components

    Clustering

    Maximum Likelihood Methods

    Bayesian Clustering

    Admixture

    Likelihood Method

    Principal Component Analysis of Coancestry

    A Second Look at F-Statistics

    Case Studies

    Mitochondrial Genomes of the Asiatic Golden Cat

    Complete Genomes of the Fruit Fly

    Influenza H1N1 Virus Sequences

    Jaguar Microsatellites

    Exercises

    8. Geographical Structure

    Geographical Data in R

    Packages and Classes

    Calculating Geographical Distances

    A Third Look at F-Statistics

    Hierarchical Components of Genetic Diversity

    Analysis of Molecular Variance

    Moran I and Spatial Autocorrelation

    Spatial Principal Component Analysis

    Finding Boundaries Between Populations

    Spatial Ancestry (tess3r)

    Bayesian Methods (Geneland)

    Case Studies

    Complete Genomes of the Fruit Fly

    Human Genomes

    Exercises

    9. Past Demographic Events

    The Coalescent

    The Standard Coalescent

    The Sequential Markovian Coalescent

    Simulation of Coalescent Data

    Estimation of Θ

    Heterozygosity

    Number of Alleles

    Segregating Sites

    Microsatellites

    Trees

    Coalescent-Based Inference

    Maximum Likelihood Methods

    Analysis of Markov Chain Monte Carlo Outputs

    Skyline Plots

    Bayesian Methods

    Heterochronous Samples

    Site Frequency Spectrum Methods

    The Stairway Method

    CubSFS

    Popsicle

    Whole-Genome Methods (psmcr)

    Case Studies

    Mitochondrial Genomes of the Asiatic Golden Cat

    Complete Genomes of the Fruit Fly

    Influenza H1N1 Virus Sequences

    Bacterial Whole Genome Sequences

    Exercises

    10. Natural Selection

    Testing Neutrality

    Simple Tests

    Selection in Protein-Coding Sequences

    Selection Scans

    A Fourth Look at F-Statistics

    Association Studies (LEA)

    Principal Component Analysis (pcadapt)

    Scans for Selection With Extended Haplotypes

    FST Outliers

    Time-Series of Allele Frequencies

    Case Studies

    Mitochondrial Genomes of the Asiatic Golden Cat

    Complete Genomes of the Fruit Fly

    Influenza H1N1 Virus Sequences

    Exercises

    A Installing R Packages

    B Compressing Large Sequence Files

    C Sampling of Alleles in a Population

    Biography

    Emmanuel Paradis is a senior researcher at the French Institute of Research for Development (IRD). His research focuses on evolutionary models and their applications. The development and publication of software associated with his research has been an important aspect of his activities for more than twenty years. He adopted R as his main software for data analysis in 2000 and has since published and maintained several packages, including ape since 2002 and pegas since 2009. He regularly gives workshops and training courses in several countries.

    "The author has taken good care of including several important as well as emerging topics (data acquisition, next generation sequencing) that would be extremely useful for the readers. I suggest that this book be targeted to graduate students and researchers who have some background in basic genetics or are taking a graduate level population genetics course… The data acquisition chapter, descriptions of DNA sample quality, and file formats are the strengths. Case studies are very valuable and would provide more "hands-on" training on working on specific population genetics problems."
    ~Santhosh Girirajan, Pennsylvania State University

    "The strength of those chapters is to provide a global coverage of the field of population genetics based on a broad spectrum of statistical methods. The author proposes to deal with population genetic analyses in a unified programming framework that uses specific classes of the R packages ape/pegas and adegenet, and I was impressed by the work done."
    ~Olivier François, University Grenoble Alpes

    "This book could serve as both a reference book and a textbook. Population genetics, applied bioinformatics, genomics, molecular ecology, and conservation genetic classes with a lab component at both undergraduate and graduate levels could teach from this text. Graduate students and possibly postdocs in evolutionary biology and applied bioinformatics could use this as a reference. Additionally, government and non-profit organizations that process genetic samples for conservation and management purposes would find this instruction useful. … What this text offers is unique in that it is focused on practical steps to analyze data using already available programs that users can install… Given the variety of subjects and types of analyses, I think it could be a valuable resource for many students."
    ~Sarah Hendricks, San Diego Zoo Institute for Conservation Research