Text Mining with Machine Learning

Principles and Techniques, 1st Edition

By Jan Žižka, František Darena, Arnošt Svoboda

CRC Press

400 pages | 6 Color Illus. | 40 B/W Illus.

Hardback ISBN: 9781138601826
Publication date: 2020-01-15

Description

This book provides a perspective on the application of machine learning-based methods to knowledge discovery from natural-language texts. Analysing various data sets yields conclusions that are not normally evident and that can be used for various purposes and applications. The book explains the principles of time-proven machine learning algorithms applied in text mining, together with step-by-step demonstrations of how to reveal the semantic content of real-world datasets using the popular R language and its implementations of machine learning algorithms. The book is aimed not only at IT specialists but also at a wider audience that needs to process large sets of text documents and has basic knowledge of the subject, e.g. e-mail service providers, online shoppers, librarians, etc.

The book starts with an introduction to text-based natural language data processing, its goals, and its problems. It focuses on machine learning, presenting various algorithms with their uses and possibilities, and reviews their strengths and weaknesses. Beginning with the initial data pre-processing, a reader can follow the steps provided in R, including the incorporation of various available plug-ins into the resulting software tool. A big advantage is that R also contains many libraries implementing machine learning algorithms, so readers can concentrate on the principal target without needing to implement the details of the algorithms themselves. To help make sense of the results, the book also explains the algorithms, which supports the final evaluation and interpretation of the results. The examples are demonstrated using real-world data from commonly accessible Internet sources.

Table of Contents

Preface

Introduction to Text Mining with Machine Learning

Introduction

Relation of Text Mining to Data Mining

The Text Mining Process

Machine Learning for Text Mining

Three Fundamental Learning Directions

Big Data

About This Book

Introduction to R

Installing R

Running R

RStudio

Writing and Executing Commands

Variables and Data Types

Objects in R

Functions

Operators

Vectors

Matrices and Arrays

Lists

Factors

Data Frames

Functions Useful in Machine Learning

Flow Control Structures

Packages

Graphics

Structured Text Representations

Introduction

The Bag-of-Words Model

The Limitations of the Bag-of-Words Model

Document Features

Standardization

Texts in Different Encodings

Language Identification

Tokenization

Sentence Detection

Filtering Stop Words, Common, and Rare Terms

Removing Diacritics

Normalization

Annotation

Calculating the Weights in the Bag-of-Words Model

Common Formats for Storing Structured Data

A Complex Example

Classification

Sample Data

Selected Algorithms

Classifier Quality Measurement

Bayes Classifier

Introduction

Bayes’ Theorem

Optimal Bayes Classifier

Naïve Bayes Classifier

Illustrative Example of Naïve Bayes

Naïve Bayes Classifier in R

Nearest Neighbors

Introduction

Similarity as Distance

Illustrative Example of k-NN

k-NN in R

Decision Trees

Introduction

Entropy Minimization-Based C5 Algorithm

C5 Tree Generator in R

Random Forest

Introduction

Random Forest in R

Adaboost

Introduction

Boosting Principle

Adaboost Principle

Weak Learners

Adaboost in R

Support Vector Machines

Introduction

Support Vector Machines Principles

SVM in R

Deep Learning

Introduction

Artificial Neural Networks

Deep Learning in R

Clustering

Introduction to Clustering

Difficulties of Clustering

Similarity Measures

Types of Clustering Algorithms

Clustering Criterion Functions

Deciding on the Number of Clusters

K-means

K-medoids

Criterion Function Optimization

Agglomerative Hierarchical Clustering

Scatter-Gather Algorithm

Divisive Hierarchical Clustering

Constrained Clustering

Evaluating Clustering Results

Cluster Labeling

A Few Examples

Word Embeddings

Introduction

Determining the Context and Word Similarity

Context Windows

Computing Word Embeddings

Aggregation of Word Vectors

An Example

Feature Selection

Introduction

Feature Selection as State Space Search

Feature Selection Methods

Term Elimination Based on Frequency

Term Strength

Term Contribution

Entropy-Based Ranking

Term Variance

An Example

References

Index

About the Authors

Jan Žižka is a consultant in machine learning and data mining. He has worked as a system programmer, developer of advanced software systems, and researcher. For the last 25 years, he has devoted himself to AI and machine learning, especially text mining. He has been a faculty member at a number of universities and research institutes. He has authored approximately 100 international publications.

František Darena is an associate professor and the head of the Text Mining and NLP group at the Department of Informatics, Mendel University, Brno. He has published numerous articles in international scientific journals, conference proceedings, and monographs, and is a member of editorial boards of several international journals. His research includes text/data mining, intelligent data processing, and machine learning.

Arnošt Svoboda is an expert programmer. His specialities include programming languages and systems such as R, Assembler, Matlab, PL/1, Cobol, Fortran, Pascal, and others. He started as a system programmer. For the last 20 years, Arnošt has also worked as a teacher and researcher at Masaryk University in Brno. His current interests are machine learning and data mining.

Subject Categories

BISAC Subject Codes/Headings:
COM021030
COMPUTERS / Database Management / Data Mining
COM037000
COMPUTERS / Machine Theory
MAT004000
MATHEMATICS / Arithmetic
SCI086000
SCIENCE / Life Sciences / General