1st Edition

Textual Data Science with R

ISBN 9781138626911
Published March 1, 2019 by Chapman and Hall/CRC
204 Pages 50 B/W Illustrations


Book Description

Textual Data Science with R comprehensively covers the main multidimensional methods in textual statistics, supported by a specially written R package. Methods discussed include correspondence analysis, clustering, and multiple factor analysis for contingency tables. Each method is illuminated by applications. The book is aimed at researchers and students in statistics, social sciences, history, literature and linguistics. It will be of interest to anyone from practitioners needing to extract information from texts to students in the field of massive data, where the ability to process textual data is becoming essential.

Table of Contents

1. Encoding: from a corpus to statistical tables

Textual and contextual data

Textual data

Contextual data

Documents and aggregate documents

Examples and notation

Choosing textual units

Graphical forms



Repeated segments

In practice


Unique spellings

Partially-automated preprocessing

Word selection

Word and segment indexes

The Life UK corpus: preliminary results

Verbal content through word and repeated segment indexes

Univariate description of contextual variables

A note on the frequency range

Implementation with the Xplortext package

In summary

2. Correspondence analysis of textual data

Data and goals

Correspondence analysis: a tool for linguistic data analysis

Data: a small example


Associations between documents and words

Profile comparisons

Independence of documents and words

The χ² test

Association rates between columns and words

Active row and column clouds

Row and column profile spaces

Distributional equivalence and the χ² distance

Inertia of a cloud

Fitting document and word clouds

Factorial axes

Visualizing rows and columns

Category representation

Word representation

Transition formulas

Superimposed representation of rows and columns

Interpretation aids

Eigenvalues and representation quality of the clouds

Contribution of documents and words to axis inertia

Representation quality of a point

Supplementary rows and columns

Supplementary tables

Supplementary frequency rows and columns

Supplementary quantitative and qualitative variables

Validating the visualization

Interpretation scheme for textual CA results

Implementation with Xplortext

Summary of the CA approach

3. Applications of correspondence analysis

Choosing the level of detail for analyses

Correspondence analysis on aggregate free text answers

Data and objectives

Word selection

CA on the aggregate table

Document representation

Word representation

Simultaneous interpretation of the plots

Supplementary elements

Supplementary words

Supplementary repeated segments

Supplementary categories

Implementation with Xplortext

Direct analysis

Data and objectives

The main features of direct analysis

Direct analysis of the culture question

Implementation with Xplortext

4. Clustering in textual analysis

Clustering documents

Dissimilarity measures between documents

Measuring partition quality

Document clusters in the factorial space

Partition quality

Dissimilarity measures between document clusters

The single-linkage method

The complete-linkage method

Ward's method

Agglomerative hierarchical clustering

Hierarchical tree construction algorithm

Selecting the final partition

Interpreting clusters

Direct partitioning

Combining clustering methods

Consolidating partitions

Direct partitioning followed by AHC

A procedure for combining CA and clustering

Example: joint use of CA and AHC

Data and objectives

Data preprocessing using CA

Constructing the hierarchical tree

Choosing the final partition

Contiguity-constrained hierarchical clustering

Principles and algorithm

AHC of age groups with a chronological constraint

Implementation with Xplortext

Example: clustering free text answers

Data and objectives

Data preprocessing

CA: eigenvalues and total inertia

Interpreting the first axes

AHC: building the tree and choosing the final partition

Describing cluster features

Lexical features of clusters

Describing clusters in terms of characteristic words

Describing clusters in terms of characteristic documents

Describing clusters using contextual variables

Describing clusters using contextual qualitative variables

Describing clusters using quantitative contextual variables

Implementation with Xplortext

Summary of the use of AHC on factorial coordinates coming from CA

5. Lexical characterization of parts of a corpus

Characteristic words

Characteristic words and CA

Characteristic words and clustering

Clustering based on verbal content

Clustering based on contextual variables

Hierarchical words

Characteristic documents

Example: characteristic elements and CA

Characteristic words for the categories

Characteristic words and factorial planes

Documents that characterize categories

Characteristic words in addition to clustering

Implementation with Xplortext

6. Multiple factor analysis for textual analysis

Multiple tables in textual analysis

Data and objectives

Data preprocessing

Problems posed by lemmatization

Description of the corpora data

Indexes of the most frequent words



Introduction to MFACT

The limits of CA on multiple contingency tables

How MFACT works

Integrating contextual variables

Analysis of multilingual free text answers

MFACT: eigenvalues of the global analysis

Representation of documents and words

Superimposed representation of the global and partial configurations

Links between the axes of the global analysis and the separate analyses

Representation of the groups of words

Implementation with Xplortext

Simultaneous analysis of two open-ended questions: impact of lemmatization


Preliminary steps

MFACT on the left and right: lemmatized or nonlemmatized

Implementation with Xplortext

Other applications of MFACT in textual analysis

MFACT summary

7. Applications and analysis workflows

General rules for presenting results

Analyzing bibliographic databases

Introduction to the lupus data

The corpus

Exploratory analysis of the corpus

CA of the documents × words table

The eigenvalues

Meta-keys and doc-keys

Analysis of the year-aggregate table

Eigenvalues and CA of the lexical table

Chronological study of drug names

Implementation with Xplortext

Conclusions from the study

Badinter's speech: a discursive strategy

Methods

Breaking up the corpus into documents

The speech trajectory unveiled by CA


Argument flow

Conclusions on the study of Badinter's speech

Implementation with Xplortext

Political speeches

Data and objectives



Data preprocessing

Lexicometric characteristics of the speeches and lexical table coding

Eigenvalues and Cramér's V

Speech trajectory

Word representation


Hierarchical structure of the corpus


Implementation with Xplortext

Corpus of sensory descriptions



Eight Catalan wines


Verbal categorization

Encoding the data


Statistical methodology

MFACT and constructing the mean configuration

Determining consensual words


Data preprocessing

Some initial results

Individual configurations

MFACT: directions of inertia common to the majority of groups

MFACT: representing words and documents on the first plane

Word contributions

MFACT: group representation

Consensual words





Mónica Bécue-Bertaut is an elected fellow of the International Statistical Institute and was named Chevalier des Palmes Académiques by the French Government. She taught statistics and data science at the Universitat Politècnica de Catalunya and gave numerous guest lectures on textual data science in different countries. Dr. Bécue-Bertaut has published several books (in French or Spanish) and book chapters (in English) on textual data science. She also participated in the design of software for textual data science, such as SPAD.T and Xplortext, the latter being an R package.


"Even though textual data science cannot be considered as the youngest sibling of other data science fields, there is still quite a big space to be filled with up-to-date textbooks describing and analyzing various methods and facets of this very interesting topic. In this book, Mónica Bécue-Bertaut tries to fill this gap, giving theoretical and practical instructions about one of the relatively little known, but powerful methods in textual data science–Correspondence Analysis (CA)... Extensive graphical images and visualizations represented by various types of plot and diagram are used throughout the material, which provides an even better aid to the reader for grasping the main ideas of the topic... separate mention should be drawn to the language used in the book. It is clear, simple, and even fun to read, providing an understandable way of covering complex topics... Mónica Bécue-Bertaut achieved a good blend of theory and practice in her book, which can be used as a handy resource for students and beginners in data science, as well as for specialists in textual data analysis."
- Gia Jgarkava, ISCB December 2019