Data Mining for Bioinformatics

By Sumeet Dua, Pradeep Chowriappa

© 2012 – CRC Press

348 pages | 92 B/W Illus.

Purchasing Options:
Hardback: 9780849328015
pub: 2012-11-05
US Dollars $97.95

About the Book

Covering theory, algorithms, and methodologies, as well as data mining technologies, Data Mining for Bioinformatics provides a comprehensive discussion of data-intensive computations used in data mining with applications in bioinformatics. It supplies a broad, yet in-depth, overview of the application domains of data mining for bioinformatics to help readers from both biology and computer science backgrounds gain an enhanced understanding of this cross-disciplinary field.

The book offers authoritative coverage of data mining techniques, technologies, and frameworks used for storing, analyzing, and extracting knowledge from large databases in the bioinformatics domains, including genomics and proteomics. It begins by describing the evolution of bioinformatics and highlighting the challenges that can be addressed using data mining techniques. Introducing the various data mining techniques that can be employed in biological databases, the text is organized into four sections:

  1. Supplies a complete overview of the evolution of the field and its intersection with computational learning
  2. Describes the role of data mining in analyzing large biological databases, explaining the breadth of the various feature selection and feature extraction techniques that data mining has to offer
  3. Focuses on concepts of unsupervised learning using clustering techniques and their application to large biological data
  4. Covers supervised learning using classification techniques most commonly used in bioinformatics—addressing the need for validation and benchmarking of inferences derived using either clustering or classification

The book describes the various biological databases prominently referred to in bioinformatics and includes a detailed list of the applications of advanced clustering algorithms used in bioinformatics. Highlighting the challenges encountered during the application of classification on biological databases, it considers systems of both single and ensemble classifiers and shares effort-saving tips for model selection and performance estimation strategies.

Table of Contents

Introduction to Bioinformatics


Transcription and Translation

The Central Dogma of Molecular Biology

The Human Genome Project

Beyond the Human Genome Project

Sequencing Technology

Dideoxy Sequencing

Cyclic Array Sequencing

Sequencing by Hybridization


Mass Spectrometry

Nanopore Sequencing

Next-Generation Sequencing

Challenges of Handling NGS Data

Sequence Variation Studies

Kinds of Genomic Variations

SNP Characterization

Functional Genomics

Splicing and Alternative Splicing

Microarray-Based Functional Genomics

Comparative Genomics

Functional Annotation

Function Prediction Aspects



Biological Databases and Integration

Introduction: Scientific Work Flows and Knowledge Discovery

Biological Data Storage and Analysis

Challenges of Biological Data

Classification of Bioscience Databases

Primary versus Secondary Databases

Deep versus Broad Databases

Point Solution versus General Solution Databases

Gene Expression Omnibus (GEO) Database

The Protein Data Bank (PDB)

The Curse of Dimensionality

Data Cleaning

Problems of Data Cleaning

Challenges of Handling Evolving Databases

Problems Associated with Single-Source Techniques

Problems Associated with Multisource Integration

Data Argumentation: Cleaning at the Schema Level

Knowledge-Based Framework: Cleaning at the Instance Level

Data Integration


Sequence Retrieval System (SRS)

IBM’s DiscoveryLink

Wrappers: Customizable Database Software

Data Warehousing: Data Management with Query Optimization

Data Integration in the PDB



Knowledge Discovery in Databases


Analysis of Data Using Large Databases

Distance Metrics

Data Cleaning and Data Preprocessing

Challenges in Data Cleaning

Models of Data Cleaning

Proximity-Based Techniques

Parametric Methods

Nonparametric Methods

Semiparametric Methods

Neural Networks

Machine Learning

Hybrid Systems

Data Integration

Data Integration and Data Linkage

Schema Integration Issues

Field Matching Techniques

Character-Based Similarity Metrics

Token-Based Similarity Metrics

Data Linkage/Matching Techniques

Data Warehousing

Online Analytical Processing

Differences between OLAP and OLTP

OLAP Tasks

Life Cycle of a Data Warehouse



Section II

Feature Selection and Extraction Strategies in Data Mining



Data Transformation

Data Smoothing by Discretization

Discretization of Continuous Attributes

Normalization and Standardization

Min-Max Normalization

z-Score Standardization

Normalization by Decimal Scaling

Features and Relevance

Strongly Relevant Features

Weakly Relevant to the Dataset/Distribution

Pearson Correlation Coefficient

Information Theoretic Ranking Criteria

Overview of Feature Selection

Filter Approaches

Wrapper Approaches

Filter Approaches for Feature Selection

FOCUS Algorithm

Relief Method: Weight-Based Approach

Feature Subset Selection Using Forward Selection

Gram-Schmidt Forward Feature Selection

Other Nested Subset Selection Methods

Feature Construction and Extraction

Matrix Factorization

LU Decomposition

QR Factorization to Extract Orthogonal Features

Eigenvalues and Eigenvectors of a Matrix

Other Properties of a Matrix

A Square Matrix and Matrix Diagonalization

Symmetric Real Matrix: Spectral Theorem

Singular Value Decomposition (SVD)

Principal Component Analysis (PCA)

Jordan Decomposition of a Matrix

Principal Components

Partial Least-Squares-Based Dimension Reduction (PLS)

Factor Analysis (FA)

Independent Component Analysis (ICA)

Multidimensional Scaling (MDS)



Feature Interpretation for Biological Learning


Normalization Techniques for Gene Expression Analysis

Normalization and Standardization Techniques

Expression Ratios

Intensity-Based Normalization

Total Intensity Normalization

Intensity-Based Filtering of Array Elements

Identification of Differentially Expressed Genes

Selection Bias of Gene Expression Data

Data Preprocessing of Mass Spectrometry Data

Data Transformation Techniques

Baseline Subtraction (Smoothing)



Peak Detection

Peak Alignment

Application of Dimensionality Reduction

Techniques for MS Data Analysis

Feature Selection Techniques

Univariate Methods

Multivariate Methods

Data Preprocessing for Genomic Sequence Data

Feature Selection for Sequence Analysis

Ontologies in Bioinformatics

The Role of Ontologies in Bioinformatics

Description Logics

Gene Ontology (GO)

Open Biomedical Ontologies (OBO)



Section III

Clustering Techniques in Bioinformatics


Clustering in Bioinformatics

Clustering Techniques

Distance-Based Clustering and Measures

Mahalanobis Distance

Minkowski Distance

Pearson Correlation

Binary Features

Nominal Features

Mixed Variables

Distance Measure Properties

k-Means Algorithm

k-Modes Algorithm

Genetic Distance Measure (GDM)

Applications of Distance-Based Clustering in Bioinformatics

New Distance Metric in Gene Expressions for Coexpressed Genes

Gene Expression Clustering Using Mutual Information Distance Measure

Gene Expression Data Clustering Using a Local Shape-Based Clustering

Exact Similarity Computation

Approximate Similarity Computation

Implementation of k-Means in WEKA

Hierarchical Clustering

Agglomerative Hierarchical Clustering

Cluster Splitting and Merging

Calculate Distance between Clusters

Applications of Hierarchical Clustering Techniques in Bioinformatics

Hierarchical Clustering Based on Partially Overlapping and Irregular Data

Cluster Stability Estimation for Microarray Data

Comparing Gene Expression Sequences Using Pairwise Average Linking

Implementation of Hierarchical Clustering

Self-Organizing Maps Clustering

SOM Algorithm

Application of SOM in Bioinformatics

Identifying Distinct Gene Expression Patterns Using SOM

SOTA: Combining SOM and Hierarchical Clustering for Representation of Genes

Fuzzy Clustering

Fuzzy c-Means (FCM)

Application of Fuzzy Clustering in Bioinformatics

Clustering Genes Using Fuzzy J-Means and VNS Methods

Fuzzy k-Means Clustering on Gene Expression

Comparison of Fuzzy Clustering Algorithms

Implementation of Expectation Maximization Algorithm



Advanced Clustering Techniques

Graph-Based Clustering

Graph-Based Cluster Properties

Cut in a Graph

Intracluster and Intercluster Density

Measures for Identifying Clusters

Identifying Clusters by Computing Values for the Vertices or Vertex Similarity

Distance and Similarity Measure

Adjacency-Based Measures

Connectivity Measures

Computing the Fitness Measure

Density Measure

Cut-Based Measures

Determining a Split in the Graph


Spectral Methods


Graph-Based Algorithms

Chameleon Algorithm

CLICK Algorithm

Application of Graph-Based Clustering in Bioinformatics

Analysis of Gene Expression Data Using Shortest Path (SP)

Construction of Genetic Linkage Maps Using Minimum Spanning Tree of a Graph

Finding Isolated Groups in a Random Graph Process

Implementation in Cytoscape

Seeding Method

Kernel-Based Clustering

Kernel Functions

Gaussian Function

Application of Kernel Clustering in Bioinformatics

Kernel Clustering

Kernel-Based Support Vector Clustering

Analyzing Gene Expression Data Using SOM and Kernel-Based Clustering

Model-Based Clustering for Gene Expression Data

Gaussian Mixtures

Diagonal Model

Model Selection

Relevant Number of Genes

A Resampling-Based Approach for Identifying Stable and Tight Patterns

Overcoming the Local Minimum Problem in k-Means Clustering

Tight Clustering

Tight Clustering of Gene Expression Time Courses

Higher-Order Mining

Clustering for Association Rule Discovery

Clustering of Association Rules

Clustering Clusters



Section IV

Classification Techniques in Bioinformatics


Bias-Variance Trade-Off in Supervised Learning

Linear and Nonlinear Classifiers

Model Complexity and Size of Training Data

Dimensionality of Input Space

Supervised Learning in Bioinformatics

Support Vector Machines (SVMs)


Large Margin of Separation

Soft Margin of Separation

Kernel Functions

Applications of SVM in Bioinformatics

Gene Expression Analysis

Remote Protein Homology Detection

Bayesian Approaches

Bayes’ Theorem

Naive Bayes Classification

Handling of Prior Probabilities

Handling of Posterior Probability

Bayesian Networks


Capturing Data Distributions Using Bayesian Networks

Equivalence Classes of Bayesian Networks

Learning Bayesian Networks

Bayesian Scoring Metric

Application of Bayesian Classifiers in Bioinformatics

Binary Classification

Multiclass Classification

Computational Challenges for Gene Expression Analysis

Decision Trees

Tree Pruning

Ensemble Approaches


Unweighted Voting Methods

Confidence Voting Methods

Ranked Voting Methods


Seeking Prospective Classifiers to Be Part of the Ensemble

Choosing an Optimal Set of Classifiers

Assigning Weight to the Chosen Classifier

Random Forest

Application of Ensemble Approaches in Bioinformatics

Computational Challenges of Supervised Learning



Validation and Benchmarking

Introduction: Performance Evaluation Techniques

Classifier Validation

Model Selection

Challenges of Model Selection

Performance Estimation Strategies


Three-Way Split

k-Fold Cross-Validation

Random Subsampling

Performance Measures

Sensitivity and Specificity

Precision, Recall, and f-Measure

ROC Curve

Cluster Validation Techniques

The Need for Cluster Validation

External Measures

Internal Measures

Performance Evaluation Using Validity Indices

Silhouette Index (SI)

Davies-Bouldin and Dunn’s Index

Calinski-Harabasz (CH) Index

Rand Index



About the Authors

Sumeet Dua is an Upchurch endowed professor of computer science and interim director of computer science, electrical engineering, and electrical engineering technology in the College of Engineering and Science at Louisiana Tech University. He obtained his PhD in computer science from Louisiana State University in 2002. He has coauthored/edited 3 books, has published over 50 research papers in leading journals and conferences, and has advised over 22 graduate theses and dissertations in the areas of data mining, knowledge discovery, and computational learning in high-dimensional datasets. NIH, NSF, AFRL, AFOSR, NASA, and LA-BOR have supported his research. He frequently serves as a panelist for the NSF and NIH (over 17 panels) and has presented over 25 keynotes, invited talks, and workshops at international conferences and educational institutions. He has also served as the overall program chair for three international conferences and as a chair for multiple conference tracks in the areas of data mining applications and information intelligence. He is a senior member of the IEEE and the ACM. His research interests include information discovery in heterogeneous and distributed datasets, semisupervised learning, content-based feature extraction and modeling, and pattern tracking.

Pradeep Chowriappa is a research assistant professor in the College of Engineering and Science at Louisiana Tech University. His research focuses on the application of data mining algorithms and frameworks to biological and clinical data. Before obtaining his PhD in computer analysis and modeling from Louisiana Tech University in 2008, he pursued a yearlong internship at the Indian Space Research Organization (ISRO), Bangalore, India. He received his master’s in computer applications from the University of Madras, Chennai, India, in 2003 and his bachelor’s in science and engineering from Loyola Academy, Secunderabad, India, in 2000. His research interests include design and analysis of algorithms for knowledge discovery and modeling in high-dimensional data domains in computational biology, distributed data mining, and domain integration.

Subject Categories

BISAC Subject Codes/Headings:
COMPUTERS / Database Management / Data Mining
MATHEMATICS / Probability & Statistics / General
SCIENCE / Biotechnology