Textual Data Science with R

1st Edition

By Mónica Bécue-Bertaut

Chapman and Hall/CRC

204 pages | 50 B/W Illus.

Hardback: 9781138626911
pub: 2019-03-01
eBook (VitalSource): 9781315212661
pub: 2019-03-11

Description

Textual Data Science with R comprehensively covers the main multidimensional methods in textual statistics, supported by a specially written R package. Methods discussed include correspondence analysis, clustering, and multiple factor analysis for contingency tables. Each method is illuminated by applications. The book is aimed at researchers and students in statistics, social sciences, history, literature and linguistics. It will be of interest to anyone from practitioners needing to extract information from texts to students in the field of massive data, where the ability to process textual data is becoming essential.

Table of Contents

1. Encoding: from a corpus to statistical tables

Textual and contextual data

Textual data

Contextual data

Documents and aggregate documents

Examples and notation

Choosing textual units

Graphical forms

Lemmas

Stems

Repeated segments

In practice

Preprocessing

Unique spellings

Partially-automated preprocessing

Word selection

Word and segment indexes

The Life UK corpus: preliminary results

Verbal content through word and repeated segment indexes

Univariate description of contextual variables

A note on the frequency range

Implementation with the Xplortext package

In summary

2. Correspondence analysis of textual data

Data and goals

Correspondence analysis: a tool for linguistic data analysis

Data: a small example

Objectives

Associations between documents and words

Profile comparisons

Independence of documents and words

The χ² test

Association rates between columns and words

Active row and column clouds

Row and column profile spaces

Distributional equivalence and the χ² distance

Inertia of a cloud

Fitting document and word clouds

Factorial axes

Visualizing rows and columns

Category representation

Word representation

Transition formulas

Superimposed representation of rows and columns

Interpretation aids

Eigenvalues and representation quality of the clouds

Contribution of documents and words to axis inertia

Representation quality of a point

Supplementary rows and columns

Supplementary tables

Supplementary frequency rows and columns

Supplementary quantitative and qualitative variables

Validating the visualization

Interpretation scheme for textual CA results

Implementation with Xplortext

Summary of the CA approach

3. Applications of correspondence analysis

Choosing the level of detail for analyses

Correspondence analysis on aggregate free text answers

Data and objectives

Word selection

CA on the aggregate table

Document representation

Word representation

Simultaneous interpretation of the plots

Supplementary elements

Supplementary words

Supplementary repeated segments

Supplementary categories

Implementation with Xplortext

Direct analysis

Data and objectives

The main features of direct analysis

Direct analysis of the culture question

Implementation with Xplortext

4. Clustering in textual analysis

Clustering documents

Dissimilarity measures between documents

Measuring partition quality

Document clusters in the factorial space

Partition quality

Dissimilarity measures between document clusters

The single-linkage method

The complete-linkage method

Ward's method

Agglomerative hierarchical clustering

Hierarchical tree construction algorithm

Selecting the final partition

Interpreting clusters

Direct partitioning

Combining clustering methods

Consolidating partitions

Direct partitioning followed by AHC

A procedure for combining CA and clustering

Example: joint use of CA and AHC

Data and objectives

Data preprocessing using CA

Constructing the hierarchical tree

Choosing the final partition

Contiguity-constrained hierarchical clustering

Principles and algorithm

AHC of age groups with a chronological constraint

Implementation with Xplortext

Example: clustering free text answers

Data and objectives

Data preprocessing

CA: eigenvalues and total inertia

Interpreting the first axes

AHC: building the tree and choosing the final partition

Describing cluster features

Lexical features of clusters

Describing clusters in terms of characteristic words

Describing clusters in terms of characteristic documents

Describing clusters using contextual variables

Describing clusters using contextual qualitative variables

Describing clusters using quantitative contextual variables

Implementation with Xplortext

Summary of the use of AHC on factorial coordinates coming from CA

5. Lexical characterization of parts of a corpus

Characteristic words

Characteristic words and CA

Characteristic words and clustering

Clustering based on verbal content

Clustering based on contextual variables

Hierarchical words

Characteristic documents

Example: characteristic elements and CA

Characteristic words for the categories

Characteristic words and factorial planes

Documents that characterize categories

Characteristic words in addition to clustering

Implementation with Xplortext

6. Multiple factor analysis for textual analysis

Multiple tables in textual analysis

Data and objectives

Data preprocessing

Problems posed by lemmatization

Description of the corpora data

Indexes of the most frequent words

Notation

Objectives

Introduction to MFACT

The limits of CA on multiple contingency tables

How MFACT works

Integrating contextual variables

Analysis of multilingual free text answers

MFACT: eigenvalues of the global analysis

Representation of documents and words

Superimposed representation of the global and partial configurations

Links between the axes of the global analysis and the separate analyses

Representation of the groups of words

Implementation with Xplortext

Simultaneous analysis of two open-ended questions: impact of lemmatization

Objectives

Preliminary steps

MFACT on the left and right: lemmatized or nonlemmatized

Implementation with Xplortext

Other applications of MFACT in textual analysis

MFACT summary

7. Applications and analysis workflows

General rules for presenting results

Analyzing bibliographic databases

Introduction to the lupus data

The corpus

Exploratory analysis of the corpus

CA of the documents × words table

The eigenvalues

Meta-keys and doc-keys

Analysis of the year-aggregate table

Eigenvalues and CA of the lexical table

Chronological study of drug names

Implementation with Xplortext

Conclusions from the study

Badinter's speech: a discursive strategy

Methods

Breaking up the corpus into documents

The speech trajectory unveiled by CA

Results

Argument flow

Conclusions on the study of Badinter's speech

Implementation with Xplortext

Political speeches

Data and objectives

Methodology

Results

Data preprocessing

Lexicometric characteristics of the speeches and lexical table coding

Eigenvalues and Cramér's V

Speech trajectory

Word representation

Remarks

Hierarchical structure of the corpus

Conclusions

Implementation with Xplortext

Corpus of sensory descriptions

Introduction

Data

Eight Catalan wines

Jury

Verbal categorization

Encoding the data

Objectives

Statistical methodology

MFACT and constructing the mean configuration

Determining consensual words

Results

Data preprocessing

Some initial results

Individual configurations

MFACT: directions of inertia common to the majority of groups

MFACT: representing words and documents on the first plane

Word contributions

MFACT: group representation

Consensual words

Conclusion

About the Author

Mónica Bécue-Bertaut is an elected fellow of the International Statistical Institute and was named Chevalier des Palmes Académiques by the French Government. She taught statistics and data science at the Universitat Politècnica de Catalunya and gave numerous guest lectures on textual data science in different countries. Dr. Bécue-Bertaut has published several books (in French or Spanish) and book chapters (in English) on this topic. She also participated in the design of software for textual data science, such as SPAD.T and Xplortext, the latter being an R package.

About the Series

Chapman & Hall/CRC Computer Science & Data Analysis

Subject Categories

BISAC Subject Codes/Headings:
BUS061000: BUSINESS & ECONOMICS / Statistics
COM021030: COMPUTERS / Database Management / Data Mining
MAT029000: MATHEMATICS / Probability & Statistics / General
REF000000: REFERENCE / General