212 Pages 50 B/W Illustrations
    by Chapman & Hall

    204 Pages 50 B/W Illustrations
    by Chapman & Hall

    Textual Statistics with R comprehensively covers the main multidimensional methods in textual statistics, supported by a specially written package in R. Methods discussed include correspondence analysis, clustering, and multiple factor analysis for contingency tables. Each method is illuminated by applications. The book is aimed at researchers and students in statistics, social sciences, history, literature and linguistics. The book will be of interest to anyone from practitioners needing to extract information from texts to students in the field of massive data, where the ability to process textual data is becoming essential.

    1. Encoding: from a corpus to statistical tables

    Textual and contextual data

    Textual data

    Contextual data

    Documents and aggregate documents

    Examples and notation

    Choosing textual units

    Graphical forms



    Repeated segments

    In practice


    Unique spellings

    Partially-automated preprocessing

    Word selection

    Word and segment indexes

    The Life UK corpus: preliminary results

    Verbal content through word and repeated segment indexes

    Univariate description of contextual variables

    A note on the frequency range

    Implementation with the Xplortext package

    In summary

    2. Correspondence analysis of textual data

    Data and goals

    Correspondence analysis: a tool for linguistic data analysis

    Data: a small example


    Associations between documents and words

    Profile comparisons

    Independence of documents and words

    The χ² test

    Association rates between columns and words

    Active row and column clouds

    Row and column profile spaces

    Distributional equivalence and the χ² distance

    Inertia of a cloud

    Fitting document and word clouds

    Factorial axes

    Visualizing rows and columns

    Category representation

    Word representation

    Transition formulas

    Superimposed representation of rows and columns

    Interpretation aids

    Eigenvalues and representation quality of the clouds

    Contribution of documents and words to axis inertia

    Representation quality of a point

    Supplementary rows and columns

    Supplementary tables

    Supplementary frequency rows and columns

    Supplementary quantitative and qualitative variables

    Validating the visualization

    Interpretation scheme for textual CA results

    Implementation with Xplortext

    Summary of the CA approach

    3. Applications of correspondence analysis

    Choosing the level of detail for analyses

    Correspondence analysis on aggregate free text answers

    Data and objectives

    Word selection

    CA on the aggregate table

    Document representation

    Word representation

    Simultaneous interpretation of the plots

    Supplementary elements

    Supplementary words

    Supplementary repeated segments

    Supplementary categories

    Implementation with Xplortext

    Direct analysis

    Data and objectives

    The main features of direct analysis

    Direct analysis of the culture question

    Implementation with Xplortext

    4. Clustering in textual analysis

    Clustering documents

    Dissimilarity measures between documents

    Measuring partition quality

    Document clusters in the factorial space

    Partition quality

    Dissimilarity measures between document clusters

    The single-linkage method

    The complete-linkage method

    Ward's method

    Agglomerative hierarchical clustering

    Hierarchical tree construction algorithm

    Selecting the final partition

    Interpreting clusters

    Direct partitioning

    Combining clustering methods

    Consolidating partitions

    Direct partitioning followed by AHC

    A procedure for combining CA and clustering

    Example: joint use of CA and AHC

    Data and objectives

    Data preprocessing using CA

    Constructing the hierarchical tree

    Choosing the final partition

    Contiguity-constrained hierarchical clustering

    Principles and algorithm

    AHC of age groups with a chronological constraint

    Implementation with Xplortext

    Example: clustering free text answers

    Data and objectives

    Data preprocessing

    CA: eigenvalues and total inertia

    Interpreting the first axes

    AHC: building the tree and choosing the final partition

    Describing cluster features

    Lexical features of clusters

    Describing clusters in terms of characteristic words

    Describing clusters in terms of characteristic documents

    Describing clusters using contextual variables

    Describing clusters using contextual qualitative variables

    Describing clusters using quantitative contextual variables

    Implementation with Xplortext

    Summary of the use of AHC on factorial coordinates coming from CA

    5. Lexical characterization of parts of a corpus

    Characteristic words

    Characteristic words and CA

    Characteristic words and clustering

    Clustering based on verbal content

    Clustering based on contextual variables

    Hierarchical words

    Characteristic documents

    Example: characteristic elements and CA

    Characteristic words for the categories

    Characteristic words and factorial planes

    Documents that characterize categories

    Characteristic words in addition to clustering

    Implementation with Xplortext

    6. Multiple factor analysis for textual analysis

    Multiple tables in textual analysis

    Data and objectives

    Data preprocessing

    Problems posed by lemmatization

    Description of the corpora data

    Indexes of the most frequent words



    Introduction to MFACT

    The limits of CA on multiple contingency tables

    How MFACT works

    Integrating contextual variables

    Analysis of multilingual free text answers

    MFACT: eigenvalues of the global analysis

    Representation of documents and words

    Superimposed representation of the global and partial configurations

    Links between the axes of the global analysis and the separate analyses

    Representation of the groups of words

    Implementation with Xplortext

    Simultaneous analysis of two open-ended questions: impact of lemmatization


    Preliminary steps

    MFACT on the left and right: lemmatized or nonlemmatized

    Implementation with Xplortext

    Other applications of MFACT in textual analysis

    MFACT summary

    7. Applications and analysis workflows

    General rules for presenting results

    Analyzing bibliographic databases

    Introduction to the lupus data

    The corpus

    Exploratory analysis of the corpus

    CA of the documents × words table

    The eigenvalues

    Meta-keys and doc-keys

    Analysis of the year-aggregate table

    Eigenvalues and CA of the lexical table

    Chronological study of drug names

    Implementation with Xplortext

    Conclusions from the study

    Badinter's speech: a discursive strategy

    Methods

    Breaking up the corpus into documents

    The speech trajectory unveiled by CA


    Argument flow

    Conclusions on the study of Badinter's speech

    Implementation with Xplortext

    Political speeches

    Data and objectives



    Data preprocessing

    Lexicometric characteristics of the speeches and lexical table coding

    Eigenvalues and Cramér's V

    Speech trajectory

    Word representation


    Hierarchical structure of the corpus


    Implementation with Xplortext

    Corpus of sensory descriptions



    Eight Catalan wines


    Verbal categorization

    Encoding the data


    Statistical methodology

    MFACT and constructing the mean configuration

    Determining consensual words


    Data preprocessing

    Some initial results

    Individual configurations

    MFACT: directions of inertia common to the majority of groups

    MFACT: representing words and documents on the first plane

    Word contributions

    MFACT: group representation

    Consensual words



    Mónica Bécue-Bertaut is an elected fellow of the International Statistical Institute and was named Chevalier des Palmes Académiques by the French Government. She taught statistics and data science at the Universitat Politècnica de Catalunya and offered numerous guest lectures on textual data science in different countries. Dr. Bécue-Bertaut published several books (in French or Spanish) and book chapters (in English) on this last topic. She also participated in the design of software related to textual data science, such as SPAD.T and Xplortext, the latter being an R package.

    "Even though textual data science cannot be considered as the youngest sibling of other data science fields, there is still quite a big space to be filled with up-to-date textbooks describing and analyzing various methods and facets of this very interesting topic. In this book, Mónica Bécue-Bertaut tries to fill this gap, giving theoretical and practical instructions about one of the relatively little known, but powerful methods in textual data science–Correspondence Analysis (CA)... Extensive graphical images and visualizations represented by various types of plot and diagram are used throughout the material, which provides an even better aid to the reader for grasping the main ideas of the topic... separate mention should be drawn to the language used in the book. It is clear, simple, and even fun to read, providing an understandable way of covering complex topics... Mónica Bécue-Bertaut achieved a good blend of theory and practice in her book, which can be used as a handy resource for students and beginners in data science, as well as for specialists in textual data analysis."
    - Gia Jgarkava, ISCB December 2019