
Accelerating Discovery

Mining Unstructured Information for Hypothesis Generation, 1st Edition

By Scott Spangler

Chapman and Hall/CRC

270 pages | 122 B/W Illus.

Purchasing Options ($ = USD)
Hardback: 9781482239133
pub: 2015-10-09
$88.00 (list price $110.00; save ~$22.00)
eBook (VitalSource): 9780429157318
pub: 2015-09-18
from $28.98



Description

Unstructured Mining Approaches to Solve Complex Scientific Problems

As the volume of scientific data and literature increases exponentially, scientists need more powerful tools and methods to process and synthesize information and to formulate new hypotheses that are most likely to be both true and important. Accelerating Discovery: Mining Unstructured Information for Hypothesis Generation describes a novel approach to scientific research that uses unstructured data analysis as a generative tool for new hypotheses.

The author develops a systematic process for leveraging heterogeneous structured and unstructured data sources, data mining, and computational architectures to make the discovery process faster and more effective. This process accelerates human creativity by allowing scientists and inventors to more readily analyze and comprehend the space of possibilities, compare alternatives, and discover entirely new approaches.

Encompassing systematic and practical perspectives, the book provides the necessary motivation and strategies as well as a heterogeneous set of comprehensive, illustrative examples. It reveals the importance of heterogeneous data analytics in aiding scientific discoveries and furthers data science as a discipline.

Table of Contents

Introduction

Why Accelerate Discovery?

Scott Spangler and Ying Chen

THE PROBLEM OF SYNTHESIS

THE PROBLEM OF FORMULATION

WHAT WOULD DARWIN DO?

THE POTENTIAL FOR ACCELERATED DISCOVERY: USING COMPUTERS TO MAP THE KNOWLEDGE SPACE

WHY ACCELERATE DISCOVERY: THE BUSINESS PERSPECTIVE

COMPUTATIONAL TOOLS THAT ENABLE ACCELERATED DISCOVERY

ACCELERATED DISCOVERY FROM A SYSTEM PERSPECTIVE

ACCELERATED DISCOVERY FROM A DATA PERSPECTIVE

ACCELERATED DISCOVERY IN THE ORGANIZATION

CHALLENGE (AND OPPORTUNITY) OF ACCELERATED DISCOVERY

Form and Function

THE PROCESS OF ACCELERATED DISCOVERY

CONCLUSION

Exploring Content to Find Entities

SEARCHING FOR RELEVANT CONTENT

HOW MUCH DATA IS ENOUGH? WHAT IS TOO MUCH?

HOW COMPUTERS READ DOCUMENTS

EXTRACTING FEATURES

FEATURE SPACES: DOCUMENTS AS VECTORS

CLUSTERING

DOMAIN CONCEPT REFINEMENT

MODELING APPROACHES

DICTIONARIES AND NORMALIZATION

COHESION AND DISTINCTNESS

SINGLE AND MULTIMEMBERSHIP TAXONOMIES

SUBCLASSING AREAS OF INTEREST

GENERATING NEW QUERIES TO FIND ADDITIONAL RELEVANT CONTENT

VALIDATION

SUMMARY

Organization

DOMAIN-SPECIFIC ONTOLOGIES AND DICTIONARIES

SIMILARITY TREES

USING SIMILARITY TREES TO INTERACT WITH DOMAIN EXPERTS

SCATTER-PLOT VISUALIZATIONS

USING SCATTER PLOTS TO FIND OVERLAPS BETWEEN NEARBY ENTITIES OF DIFFERENT TYPES

DISCOVERY THROUGH VISUALIZATION OF TYPE SPACE

Relationships

WHAT DO RELATIONSHIPS LOOK LIKE?

HOW CAN WE DETECT RELATIONSHIPS?

REGULAR EXPRESSION PATTERNS FOR EXTRACTING RELATIONSHIPS

NATURAL LANGUAGE PARSING

COMPLEX RELATIONSHIPS

EXAMPLE: P53 PHOSPHORYLATION EVENTS

PUTTING IT ALL TOGETHER

EXAMPLE: DRUG/TARGET/DISEASE RELATIONSHIP

NETWORKS

CONCLUSION

Inference

CO-OCCURRENCE TABLES

CO-OCCURRENCE NETWORKS

RELATIONSHIP SUMMARIZATION GRAPHS

HOMOGENEOUS RELATIONSHIP NETWORKS

HETEROGENEOUS RELATIONSHIP NETWORKS

NETWORK-BASED REASONING APPROACHES

GRAPH DIFFUSION

MATRIX FACTORIZATION

CONCLUSION

Taxonomies

TAXONOMY GENERATION METHODS

SNIPPETS

TEXT CLUSTERING

TIME-BASED TAXONOMIES

KEYWORD TAXONOMIES

NUMERICAL VALUE TAXONOMIES

EMPLOYING TAXONOMIES

Orthogonal Comparison

AFFINITY

COTABLE DIMENSIONS

COTABLE LAYOUT AND SORTING

FEATURE-BASED COTABLES

COTABLE APPLICATIONS

EXAMPLE: MICROBES AND THEIR PROPERTIES

ORTHOGONAL FILTERING

CONCLUSION

Visualizing the Data Plane

ENTITY SIMILARITY NETWORKS

USING COLOR TO SPOT POTENTIAL NEW HYPOTHESES

VISUALIZATION OF CENTROIDS

EXAMPLE: THREE MICROBES

CONCLUSION

Networks

PROTEIN NETWORKS

MULTIPLE SCLEROSIS AND IL7R

EXAMPLE: NEW DRUGS FOR OBESITY

CONCLUSION

Examples and Problems

PROBLEM CATALOGUE

EXAMPLE CATALOGUE

Problem: Discovery of Novel Properties of Known Entities

ANTIBIOTICS AND ANTI-INFLAMMATORIES

SOS PATHWAY FOR ESCHERICHIA COLI

CONCLUSIONS

Problem: Finding New Treatments for Orphan Diseases from Existing Drugs

IC50:IC50

Example: Target Selection Based on Protein Network Analysis

TYPE 2 DIABETES PROTEIN ANALYSIS

Example: Gene Expression Analysis for Alternative Indications

NCBI GEO DATA

CONCLUSION

Example: Side Effects

Example: Protein Viscosity Analysis Using Medline Abstracts

DISCOVERY OF ONTOLOGIES

USING ORTHOGONAL FILTERING TO DISCOVER IMPORTANT RELATIONSHIPS

Example: Finding Microbes to Clean Up Oil Spills

ENTITIES

USING COTABLES TO FIND THE RIGHT COMBINATION OF FEATURES

DISCOVERING NEW SPECIES

ORGANISM RANKING STRATEGY

CHARACTERIZING ORGANISMS

CONCLUSION

Example: Drug Repurposing

COMPOUND 1: A PDE5 INHIBITOR

PPARα/γ AGONIST

Example: Adverse Events

FENOFIBRATE

PROCESS

CONCLUSION

Example: Discovering New P53 Kinases

AN ACCELERATED DISCOVERY APPROACH BASED ON ENTITY SIMILARITY

RETROSPECTIVE STUDY

EXPERIMENTAL VALIDATION

CONCLUSION

Conclusion and Future Work

ARCHITECTURE

FUTURE WORK

ASSIGNING CONFIDENCE AND PROBABILITIES TO ENTITIES, RELATIONSHIPS, AND INFERENCES

DEALING WITH CONTRADICTORY EVIDENCE

UNDERSTANDING INTENTIONALITY

ASSIGNING VALUE TO HYPOTHESES

TOOLS AND TECHNIQUES FOR AUTOMATING THE DISCOVERY PROCESS

CROWD SOURCING DOMAIN ONTOLOGY CURATION

FINAL WORDS

References appear at the end of most chapters.

About the Author

Scott Spangler is a principal data scientist, distinguished engineer, and master inventor in the Watson Innovations Group at the IBM Almaden Research Center. He has been involved with knowledge base and data mining research for the past 25 years. His recent work has applied Watson technology to help accelerate cancer research. He holds 45 patents and is the author of over 30 publications. He received a BS in mathematics from MIT and an MS in computer science from the University of Texas.

About the Series

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series


Subject Categories

BISAC Subject Codes/Headings:
BUS061000
BUSINESS & ECONOMICS / Statistics
COM012040
COMPUTERS / Programming / Games
COM021030
COMPUTERS / Database Management / Data Mining
COM037000
COMPUTERS / Machine Theory