1st Edition

# Big Data in Omics and Imaging: Integrated Analysis and Causal Inference

**Big Data in Omics and Imaging: Integrated Analysis and Causal Inference** addresses recent developments in integrated genomic, epigenomic, and imaging data analysis and in causal inference in the big data era. Despite significant progress in dissecting the genetic architecture of complex diseases through genome-wide association studies (GWAS), genome-wide expression studies (GWES), and epigenome-wide association studies (EWAS), the overall contribution of the newly identified genetic variants is small, and a large fraction of genetic variants remains hidden. The etiology of complex diseases and the causal chain of mechanisms underlying them remain elusive. It is time to bring big data, machine learning, and the causal revolution to a new generation of genetic analysis, shifting the current paradigm from shallow association analysis to deep causal inference, and from genetic analysis alone to integrated omics and imaging data analysis, in order to unravel the mechanisms of complex diseases.

FEATURES

- Provides a natural extension and companion volume to *Big Data in Omics and Imaging: Association Analysis*, but can be read independently
- Introduces causal inference theory to genomic, epigenomic, and imaging data analysis
- Develops novel statistics for genome-wide causation studies and epigenome-wide causation studies
- Bridges the gap between traditional association analysis and modern causation analysis
- Uses combinatorial optimization methods and various causal models as a general framework for inferring multilevel omic and image causal networks
- Presents statistical methods and computational algorithms for searching causal paths from genetic variant to disease
- Develops causal machine learning methods integrating causal inference and machine learning
- Develops statistics for testing significant differences in directed edges, paths, and graphs, and for assessing causal relationships between two networks

The book is designed for graduate students and researchers in genomics, epigenomics, medical imaging, bioinformatics, and data science. Topics covered include: mathematical formulation of causal inference, information geometry for causal inference, topological groups and Haar measure, additive noise models, distance correlation, multivariate causal inference and causal networks, dynamic causal networks, multivariate and functional structural equation models, mixed structural equation models, causal inference with confounders, integer programming, deep learning and differential equations for wearable computing, genetic analysis of function-valued traits, RNA-seq data analysis, causal networks for genetic methylation analysis, gene expression and methylation deconvolution, cell-specific causal networks, deep learning for image segmentation and image analysis, imaging and genomic data analysis, and integrated multilevel causal genomic, epigenomic, and imaging data analysis.

**1. Genotype-Phenotype Network Analysis**

**Undirected Graphs for Genotype Network**

Gaussian Graphical Model

Alternating Direction Method of Multipliers for Estimation of Gaussian Graphical Model

Coordinate Descent Algorithm and Graphical Lasso

Multiple Graphical Models

Directed Graphs and Structural Equation Models for Networks

Directed Acyclic Graphs

Linear Structural Equation Models

Estimation Methods

**Sparse Linear Structural Equations**

Penalized Maximum Likelihood Estimation

Penalized Two Stage Least Square Estimation

Penalized Three Stage Least Square Estimation

Functional Structural Equation Models for Genotype-Phenotype Networks

Functional Structural Equation Models

Group Lasso and ADMM for Parameter Estimation in the Functional Structural Equation Models

Causal Calculus

Effect Decomposition and Estimation

Graphical Tools for Causal Inference in Linear SEMs

Identification and Single-door Criterion

Instrumental Variables

Total Effects and Backdoor Criterion

Counterfactuals and Linear SEMs

Simulations and Real Data Analysis

Simulations for Model Evaluation

Application to Real Data Examples

**Appendix 1A**

Appendix 1B

Exercises

**Figure Legend**

**2. Causal Analysis and Network Biology**

Bayesian Networks as a General Framework for Causal Inference

Parameter Estimation and Bayesian Dirichlet Equivalent Uniform Score for Discrete Bayesian Networks

Structural Equations and Score Metrics for Continuous Causal Networks

Multivariate SEMs for Generating Node Score Metrics

Mixed SEMs for Pedigree-based Causal Inference

Bayesian Networks with Discrete and Continuous Variables

Two-class Network Penalized Logistic Regression for Learning Hybrid Bayesian Networks

Multiple Network Penalized Functional Logistic Regression Models for NGS Data

Multi-class Network Penalized Logistic Regression for Learning Hybrid Bayesian Networks

Other Statistical Models for Quantifying Node Score Function

Integer Programming for Causal Structure Learning

Introduction

Integer Linear Programming Formulation of DAG Learning

Cutting Plane for Integer Linear Programming

Branch and Cut Algorithm for Integer Linear Programming

Sink Finding Primal Heuristic Algorithm

Simulations and Real Data Analysis

**Simulations**

Real Data Analysis

Figure Legend

**Software Package**

**Appendix 2A** Introduction to Smoothing Splines

Smoothing Spline Regression for a Single Variable

Smoothing Spline Regression for Multiple Variables

**Appendix 2B** Penalized Likelihood Function for Jointly Observational and Interventional Data

**Exercises**

**Figure Legend**

**3. Wearable Computing and Genetic Analysis of Function-valued Traits**

Classification of Wearable Biosensor Data

Introduction

Functional Data Analysis for Classification of Time Course Wearable Biosensor Data

Differential Equations for Extracting Features of the Dynamic Process and for Classification of Time Course Data

Deep Learning for Physiological Time Series Data Analysis

**Association Studies of Function-Valued Traits**

Introduction

Functional Linear Models with both Functional Response and Predictors for Association Analysis of Function-valued Traits

Test Statistics

Null Distribution of Test Statistics

Power

Real Data Analysis

Association Analysis of Multiple Function-valued Traits

Gene-gene Interaction Analysis of Function-Valued Traits

Introduction

Functional Regression Models

Estimation of Interaction Effect Function

Test Statistics

Simulations

Real Data Analysis

**Figure Legend**

**Appendix 3.A** Gradient Methods for Parameter Estimation in the Convolutional Neural Networks

Multilayer Feedforward Pass

Backpropagation Pass

Convolutional Layer

Exercises

**4. RNA-seq Data Analysis**

Normalization Methods on RNA-seq Data Analysis

Gene Expression

RNA Sequencing Expression Profiling

Methods for Normalization

**Differential Expression Analysis for RNA-Seq Data**

Distribution-based Approach to Differential Expression Analysis

Functional Expansion Approach to Differential Expression Analysis of RNA-Seq Data

Differential Analysis of Allele Specific Expressions with RNA-Seq Data

eQTL and eQTL Epistasis Analysis with RNA-Seq Data

Matrix Factorization

Quadratically Regularized Matrix Factorization and Canonical Correlation Analysis

QRFCCA for eQTL and eQTL Epistasis Analysis of RNA-Seq Data

Real Data Analysis

Gene Co-expression Network and Gene Regulatory Networks

Co-expression Network Construction with RNA-Seq Data by CCA and FCCA

Graphical Gaussian Models

Real Data Applications

Directed Graph and Gene Regulatory Networks

Hierarchical Bayesian Networks for Whole Genome Regulatory Networks

Linear Regulatory Networks

Nonlinear Regulatory Networks

Dynamic Bayesian Network and Longitudinal Expression Data Analysis

Single Cell RNA-Seq Data Analysis, Gene Expression Deconvolution and Genetic Screening

Cell Type Identification

Gene Expression Deconvolution and Cell Type-Specific Expression

**Figure Legend**

**Software Package**

**Appendix 4.1A** Variational Bayesian Theory for Parameter Estimation and RNA-Seq Normalization

Variational Methods for the Expectation-Maximization (EM) Algorithm

Variational Methods for Bayesian Learning

**Appendix 4.2A** Log-linear Model for Differential Expression Analysis of the RNA-Seq Data with Negative Binomial Distribution

**Appendix 4.5A** Derivation of ADMM Algorithm

**Appendix 4.5B** Low Rank Representation Induced Sparse Structural Equation Models

**Appendix 4.6A** Maximum Likelihood (ML) Estimation of Parameters for Dynamic Structural Equation Models

**Appendix 4.6B** Generalized Least Squares Estimator of the Parameters in Dynamic Structural Equation Models

**Appendix 4.6C** Proximal Algorithm for L1-Penalized Maximum Likelihood Estimation of Dynamic Structural Equation Model

**Appendix 4.6D** Proximal Algorithm for L1-Penalized Generalized Least Square Estimation of Parameters in the Dynamic Structural Equation Models

**Appendix 4.7A** Multikernel Learning and Spectral Clustering for Cell Type Identification

**Exercises**

**5. Methylation Data Analysis**

**DNA Methylation Analysis**

**Epigenome-wide Association Studies (EWAS)**

Single-Locus Test

Set-based Methods

Epigenome-wide Causal Studies

Introduction

Additive Functional Model for EWCS

Genome-wide DNA Methylation Quantitative Trait Locus (mQTL) Analysis

**Causal Networks for Genetic-Methylation Analysis**

Structural Equation Models with Scalar Endogenous Variables and Functional Exogenous Variables

Functional Structural Equation Models with Functional Endogenous Variables and Scalar Exogenous Variables (FSEMS)

Functional Structural Equation Models with both Functional Endogenous Variables and Exogenous Variables (FSEMF)

**Figure Legend**

**Software Package**

**Appendix 5A** Biased and Unbiased Estimators of the HSIC

**Appendix 5B** Asymptotic Null Distribution of Block-Based HSIC

**Exercises**

**6. Imaging and Genomics**

**Introduction**

**Image Segmentation**

Unsupervised Learning Methods for Image Segmentation

Supervised Deep Learning Methods for Image Segmentation

Two- or Three-Dimensional Functional Principal Component Analysis for Image Data Reduction

Formulation

Integral Equation and Eigenfunctions

**Association Analysis of Imaging-Genomic Data**

Multivariate Functional Regression Models for Imaging-Genomic Data Analysis

Multivariate Functional Regression Models for Longitudinal Imaging-Genetics Analysis

Quadratically Regularized Functional Canonical Correlation Analysis for Gene-Gene Interaction Detection in Imaging-Genetic Studies

Causal Analysis of Imaging-Genomic Data

Sparse SEMs for Joint Causal Analysis of Structural Imaging and Genomic Data

Sparse Functional Structural Equation Models for Phenotype and Genotype Networks

Conditional Gaussian Graphical Models (CGGMs) for Structural Imaging and Genomic Data Analysis

**Time Series SEMs for Integrated Causal Analysis of fMRI and Genomic Data Models**

Reduced Form Equations

Single Equation and Generalized Least Square Estimator

Sparse SEMs and Alternating Direction Method of Multipliers

Causal Machine Learning

**Figure Legend**

Software Package

**Appendix 6A** Factor Graphs and Mean Field Methods for Prediction of Marginal Distribution

Exercises

**7. From Association Analysis to Integrated Causal Inference**

Genome-wide Causal Studies

Mathematical Formulation of Causal Analysis

Basic Causal Assumptions

Linear Additive SEMs with non-Gaussian Noise

Information Geometry Approach

Causal Inference on Discrete Data

Multivariate Causal Inference and Causal Networks

Markov Condition, Markov Equivalence, Faithfulness and Minimality

Multilevel Causal Networks for Integrative Omics and Imaging Data Analysis

Causal Inference with Confounders

Causal Sufficiency

Instrumental Variables

Figure Legend

**Software Package**

**Appendix 7A** Approximation of log-likelihood Ratio for the LiNGAM

**Appendix 7B** Orthogonality Conditions and Covariance

**Appendix 7C** Equivalent Formulations of Orthogonality Conditions

**Appendix 7D** K-L Distance in Backward Direction

**Appendix 7E** Multiplicativity of Traces

**Appendix 7F** Anisotropy and K-L Distance

**Appendix 7G** Trace Method for Noise Linear Model

**Appendix 7H** Characterization of Association

**Appendix 7I** Algorithm for Sparse Trace Method

**Exercises**

### Biography

**Momiao Xiong** is a professor of Biostatistics at the University of Texas Health Science Center in Houston where he has worked since 1997. He received his PhD in 1993 from the University of Georgia.

"I would like to recommend a new option in the library market, Big Data in Omics and Imaging: Integrated Analysis and Causal Inference, written by Momiao Xiong, a Professor of Biostatistics at the University of Texas Health Science Center in Houston. It is an extensive and comprehensive textbook on big data in biomedical sciences. Indeed, its contents is very valuable, because it concerns the analysis of large-scale datasets, which now regularly occur in computational biology and medicine, in particular in ‘omics’ problems... The book introduces in detail the currently developed statistical methods and software for big genomic and epi-genomic, wearable biosensors, computing, and image data analysis. It covers important topics in this area, such as: genotype-phenotype network analysis, causal analysis and network biology, wearable computing and genetic analysis of function-valued traits, RNA-seq data analysis, methylation data analysis, imaging, and genomics... It was really interesting and fascinating to go through the pages of the book. It would hold a very valuable position on the home shelf-book or university library; I warmly recommend the book."

- Malgorzata Cwiklinska-Jurkowska, ISCB, December 2019

"In his book, Professor Xiong introduces, discusses, and implements a rich variety of statistical tools that can be used to study large-scale features obtained from the human brain and genome, map neural and genetic signatures to behavioral and disease outcomes, and make causal enquiries into their relationships. The scope of the book is comprehensive, the concepts deep, and technicalities oftentimes mathematically heavy... the book discusses statistical concepts and devices that readers may find useful in studying general problems in human neuroscience and human genetics."

- Oliver Y. Chén, Journal of the American Statistical Association, March 2020