
Big Data in Omics and Imaging
Integrated Analysis and Causal Inference
Book Description
Big Data in Omics and Imaging: Integrated Analysis and Causal Inference addresses recent developments in integrated genomic, epigenomic, and imaging data analysis and in causal inference in the big data era. Despite significant progress in dissecting the genetic architecture of complex diseases through genome-wide association studies (GWAS), genome-wide expression studies (GWES), and epigenome-wide association studies (EWAS), the overall contribution of the newly identified genetic variants is small, and a large fraction of genetic variants remains hidden. The etiology and causal chain of mechanisms underlying complex diseases remain elusive. It is time to bring big data, machine learning, and the causal revolution to bear on a new generation of genetic analysis, shifting the current paradigm from shallow association analysis to deep causal inference, and from genetic analysis alone to integrated omics and imaging data analysis, in order to unravel the mechanisms of complex diseases.
FEATURES
- Provides a natural extension of, and companion volume to, Big Data in Omics and Imaging: Association Analysis, but can be read independently.
- Introduces causal inference theory to genomic, epigenomic, and imaging data analysis.
- Develops novel statistics for genome-wide causation studies and epigenome-wide causation studies.
- Bridges the gap between traditional association analysis and modern causation analysis.
- Uses combinatorial optimization methods and various causal models as a general framework for inferring multilevel omic and image causal networks.
- Presents statistical methods and computational algorithms for searching causal paths from genetic variant to disease.
- Develops causal machine learning methods that integrate causal inference and machine learning.
- Develops statistics for testing for significant differences in directed edges, paths, and graphs, and for assessing causal relationships between two networks.
The book is designed for graduate students and researchers in genomics, epigenomics, medical imaging, bioinformatics, and data science. Topics covered are: mathematical formulation of causal inference, information geometry for causal inference, topological groups and Haar measure, additive noise models, distance correlation, multivariate causal inference and causal networks, dynamic causal networks, multivariate and functional structural equation models, mixed structural equation models, causal inference with confounders, integer programming, deep learning and differential equations for wearable computing, genetic analysis of function-valued traits, RNA-seq data analysis, causal networks for genetic methylation analysis, gene expression and methylation deconvolution, cell-specific causal networks, deep learning for image segmentation and image analysis, imaging and genomic data analysis, and integrated multilevel causal genomic, epigenomic, and imaging data analysis.
Table of Contents
1. Genotype-Phenotype Network Analysis
Undirected Graphs for Genotype Network
Gaussian Graphical Model
Alternating Direction Method of Multipliers for Estimation of Gaussian Graphical Model
Coordinate Descent Algorithm and Graphical Lasso
Multiple Graphical Models
Directed Graphs and Structural Equation Models for Networks
Directed Acyclic Graphs
Linear Structural Equation Models
Estimation Methods
Sparse Linear Structural Equations
Penalized Maximum Likelihood Estimation
Penalized Two-Stage Least Squares Estimation
Penalized Three-Stage Least Squares Estimation
Functional Structural Equation Models for Genotype-Phenotype Networks
Functional Structural Equation Models
Group Lasso and ADMM for Parameter Estimation in the Functional Structural Equation Models
Causal Calculus
Effect Decomposition and Estimation
Graphical Tools for Causal Inference in Linear SEMs
Identification and Single-door Criterion
Instrumental Variables
Total Effects and Backdoor Criterion
Counterfactuals and Linear SEMs
Simulations and Real Data Analysis
Simulations for Model Evaluation
Application to Real Data Examples
Appendix 1A
Appendix 1B
Exercises
Figure Legend
2. Causal Analysis and Network Biology
Bayesian Networks as a General Framework for Causal Inference
Parameter Estimation and Bayesian Dirichlet Equivalent Uniform Score for Discrete Bayesian Networks
Structural Equations and Score Metrics for Continuous Causal Networks
Multivariate SEMs for Generating Node Score Metrics
Mixed SEMs for Pedigree-based Causal Inference
Bayesian Networks with Discrete and Continuous Variables
Two-class Network Penalized Logistic Regression for Learning Hybrid Bayesian Networks
Multiple Network Penalized Functional Logistic Regression Models for NGS Data
Multi-class Network Penalized Logistic Regression for Learning Hybrid Bayesian Networks
Other Statistical Models for Quantifying Node Score Function
Integer Programming for Causal Structure Learning
Introduction
Integer Linear Programming Formulation of DAG Learning
Cutting Plane for Integer Linear Programming
Branch and Cut Algorithm for Integer Linear Programming
Sink Finding Primal Heuristic Algorithm
Simulations and Real Data Analysis
Simulations
Real Data Analysis
Figure Legend
Software Package
Appendix 2A Introduction to Smoothing Splines
Smoothing Spline Regression for a Single Variable
Smoothing Spline Regression for Multiple Variables
Appendix 2B Penalized Likelihood Function for Jointly Observational and Interventional Data
Exercises
Figure Legend
3. Wearable Computing and Genetic Analysis of Function-valued Traits
Classification of Wearable Biosensor Data
Introduction
Functional Data Analysis for Classification of Time Course Wearable Biosensor Data
Differential Equations for Extracting Features of the Dynamic Process and for Classification of Time Course Data
Deep Learning for Physiological Time Series Data Analysis
Association Studies of Function-Valued Traits
Introduction
Functional Linear Models with both Functional Response and Predictors for Association Analysis of Function-valued Traits
Test Statistics
Null Distribution of Test Statistics
Power
Real Data Analysis
Association Analysis of Multiple Function-valued Traits
Gene-gene Interaction Analysis of Function-Valued Traits
Introduction
Functional Regression Models
Estimation of Interaction Effect Function
Test Statistics
Simulations
Real Data Analysis
Figure Legend
Appendix 3.A Gradient Methods for Parameter Estimation in the Convolutional Neural Networks
Multilayer Feedforward Pass
Backpropagation Pass
Convolutional Layer
Exercises
4. RNA-seq Data Analysis
Normalization Methods on RNA-seq Data Analysis
Gene Expression
RNA Sequencing Expression Profiling
Methods for Normalization
Differential Expression Analysis for RNA-Seq Data
Distribution-based Approach to Differential Expression Analysis
Functional Expansion Approach to Differential Expression Analysis of RNA-Seq Data
Differential Analysis of Allele Specific Expressions with RNA-Seq Data
eQTL and eQTL Epistasis Analysis with RNA-Seq Data
Matrix Factorization
Quadratically Regularized Matrix Factorization and Canonical Correlation Analysis
QRFCCA for eQTL and eQTL Epistasis Analysis of RNA-Seq Data
Real Data Analysis
Gene Co-expression Network and Gene Regulatory Networks
Co-expression Network Construction with RNA-Seq Data by CCA and FCCA
Graphical Gaussian Models
Real Data Applications
Directed Graph and Gene Regulatory Networks
Hierarchical Bayesian Networks for Whole Genome Regulatory Networks
Linear Regulatory Networks
Nonlinear Regulatory Networks
Dynamic Bayesian Network and Longitudinal Expression Data Analysis
Single Cell RNA-Seq Data Analysis, Gene Expression Deconvolution and Genetic Screening
Cell Type Identification
Gene Expression Deconvolution and Cell Type-Specific Expression
Figure Legend
Software Package
Appendix 4.1A Variational Bayesian Theory for Parameter Estimation and RNA-Seq Normalization
Variational Methods for the Expectation-Maximization (EM) Algorithm
Variational Methods for Bayesian Learning
Appendix 4.2A Log-linear Model for Differential Expression Analysis of the RNA-Seq Data with Negative Binomial Distribution
Appendix 4.5A Derivation of ADMM Algorithm
Appendix 4.5B Low Rank Representation Induced Sparse Structural Equation Models
Appendix 4.6A Maximum Likelihood (ML) Estimation of Parameters for Dynamic Structural Equation Models
Appendix 4.6B Generalized Least Squares Estimator of the Parameters in Dynamic Structural Equation Models
Appendix 4.6C Proximal Algorithm for L1-Penalized Maximum Likelihood Estimation of Dynamic Structural Equation Model
Appendix 4.6D Proximal Algorithm for L1-Penalized Generalized Least Squares Estimation of Parameters in the Dynamic Structural Equation Models
Appendix 4.7A Multikernel Learning and Spectral Clustering for Cell Type Identification
Exercises
5. Methylation Data Analysis
DNA Methylation Analysis
Epigenome-wide Association Studies (EWAS)
Single-Locus Test
Set-based Methods
Epigenome-wide Causal Studies
Introduction
Additive Functional Model for EWCS
Genome-wide DNA Methylation Quantitative Trait Locus (mQTL) Analysis
Causal Networks for Genetic-Methylation Analysis
Structural Equation Models with Scalar Endogenous Variables and Functional Exogenous Variables
Functional Structural Equation Models with Functional Endogenous Variables and Scalar Exogenous Variables (FSEMS)
Functional Structural Equation Models with both Functional Endogenous Variables and Exogenous Variables (FSEMF)
Figure Legend
Software Package
Appendix 5A Biased and Unbiased Estimators of the HSIC
Appendix 5B Asymptotic Null Distribution of Block-Based HSIC
Exercises
6. Imaging and Genomics
Introduction
Image Segmentation
Unsupervised Learning Methods for Image Segmentation
Supervised Deep Learning Methods for Image Segmentation
Two- or Three-Dimensional Functional Principal Component Analysis for Image Data Reduction
Formulation
Integral Equation and Eigenfunctions
Association Analysis of Imaging-Genomic Data
Multivariate Functional Regression Models for Imaging-Genomic Data Analysis
Multivariate Functional Regression Models for Longitudinal Imaging-Genetics Analysis
Quadratically Regularized Functional Canonical Correlation Analysis for Gene-Gene Interaction Detection in Imaging-Genetic Studies
Causal Analysis of Imaging-Genomic Data
Sparse SEMs for Joint Causal Analysis of Structural Imaging and Genomic Data
Sparse Functional Structural Equation Models for Phenotype and Genotype Networks
Conditional Gaussian Graphical Models (CGGMs) for Structural Imaging and Genomic Data Analysis
Time Series SEMs for Integrated Causal Analysis of fMRI and Genomic Data
Models
Reduced Form Equations
Single Equation and Generalized Least Squares Estimator
Sparse SEMs and Alternating Direction Method of Multipliers
Causal Machine Learning
Figure Legend
Software Package
Appendix 6A Factor Graphs and Mean Field Methods for Prediction of Marginal Distribution
Exercises
7. From Association Analysis to Integrated Causal Inference
Genome-wide Causal Studies
Mathematical Formulation of Causal Analysis
Basic Causal Assumptions
Linear Additive SEMs with non-Gaussian Noise
Information Geometry Approach
Causal Inference on Discrete Data
Multivariate Causal Inference and Causal Networks
Markov Condition, Markov Equivalence, Faithfulness and Minimality
Multilevel Causal Networks for Integrative Omics and Imaging Data Analysis
Causal Inference with Confounders
Causal Sufficiency
Instrumental Variables
Figure Legend
Software Package
Appendix 7A Approximation of log-likelihood Ratio for the LiNGAM
Appendix 7B Orthogonality Conditions and Covariance
Appendix 7C Equivalent Formulations of Orthogonality Conditions
Appendix 7D K-L Distance in Backward Direction
Appendix 7E Multiplicativity of Traces
Appendix 7F Anisotropy and K-L Distance
Appendix 7G Trace Method for Noise Linear Model
Appendix 7H Characterization of Association
Appendix 7I Algorithm for Sparse Trace Method
Exercises
Author(s)
Biography
Momiao Xiong is a professor of Biostatistics at the University of Texas Health Science Center in Houston where he has worked since 1997. He received his PhD in 1993 from the University of Georgia.
Reviews
"I would like to recommend a new option in the library market, Big Data in Omics and Imaging: Integrated Analysis and Causal Inference, written by Momiao Xiong, a Professor of Biostatistics at the University of Texas Health Science Center in Houston. It is an extensive and comprehensive textbook on big data in biomedical sciences. Indeed, its contents is very valuable, because it concerns the analysis of large-scale datasets, which now regularly occur in computational biology and medicine, in particular in ‘omics’ problems... The book introduces in detail the currently developed statistical methods and software for big genomic and epi-genomic, wearable biosensors, computing, and image data analysis. It covers important topics in this area, such as: genotype-phenotype network analysis, causal analysis and network biology, wearable computing and genetic analysis of function-valued traits, RNA-seq data analysis, methylation data analysis, imaging, and genomics... It was really interesting and fascinating to go through the pages of the book. It would hold a very valuable position on the home shelf-book or university library; I warmly recommend the book."
- Malgorzata Cwiklinska-Jurkowska, ISCB, December 2019
"In his book, Professor Xiong introduces, discusses, and implements a rich variety of statistical tools that can be used to study large-scale features obtained from the human brain and genome, map neural and genetic signatures to behavioral and disease outcomes, and make causal enquiries into their relationships. The scope of the book is comprehensive, the concepts deep, and technicalities oftentimes mathematically heavy...the book discusses statistical concepts and devices that readers may find useful in studying general problems in human neuroscience and human genetics."
- Oliver Y. Chén, Journal of the American Statistical Association, March 2020