1st Edition

Statistics in Corpus Linguistics Research: A New Approach

By Sean Wallis Copyright 2021
    382 Pages 135 B/W Illustrations
    by Routledge

    Traditional approaches focused on significance tests have often been difficult for linguistics researchers to visualise. Statistics in Corpus Linguistics Research: A New Approach breaks these significance tests down for researchers in corpus linguistics and linguistic analysis, promoting a visual approach to understanding the performance of tests with real data, and demonstrating how to derive new intervals and tests.

    This book discusses the ‘why’ behind the statistical model, giving readers greater facility in choosing their own methodologies. Accessibly written for those with little to no mathematical or statistical background, it explains the mathematical fundamentals of simple significance tests by relating them to confidence intervals. With sample datasets and easy-to-read visuals, this book focuses on practical issues, such as how to:

    • pose research questions in terms of choice and constraint;

    • employ confidence intervals correctly (including in graph plots);

    • select optimal significance tests (and what results mean);

    • measure the size of the effect of one variable on another;

    • estimate the similarity of distribution patterns; and

    • evaluate whether the results of two experiments significantly differ.

    Appropriate for anyone from the student just beginning their career to the seasoned researcher, this book is both a practical overview and a valuable resource.


    1 Why Do We Need Another Book on Statistics?

    2 Statistics and Scientific Rigour

    3 Why Is Statistics Difficult?

    4 Looking Down the Observer’s End of the Telescope

    5 What Do Linguists Need to Know About Statistics?


    A Note on Terminology and Notation

    Contingency Tests for Different Purposes

    PART 1


    1 What Might Corpora Tell Us About Language?

    1.1 Introduction

    1.2 What Might a Corpus Tell Us?

    1.3 The 3A Cycle

    1.4 What Might a Richly Annotated Corpus Tell Us?

    1.5 External Influences: Modal Shall / Will Over Time

    1.6 Interacting Grammatical Decisions: NP Premodification

    1.7 Framing Constraints and Interaction Evidence

    1.8 Conclusions

    PART 2

    Designing Experiments With Corpora

    2 The Idea of Corpus Experiments

    2.1 Introduction

    2.2 Experimentation and Observation

    2.3 Evaluating a Hypothesis

    2.4 Refining the Experiment

    2.5 Correlations and Causes

    2.6 A Linguistic Interaction Experiment

    2.7 Experiments and Disproof

    2.8 What Is the Purpose of an Experiment?

    2.9 Conclusions

    3 That Vexed Problem of Choice

    3.1 Introduction

    3.2 Parameters of Choice

    3.3 A Methodological Progression?

    3.4 Objections to Variationism

    3.5 Conclusions

    4 Choice Versus Meaning

    4.1 Introduction

    4.2 The Meaning of Very

    4.3 The Choice of Very

    4.4 Refining Baselines by Type

    4.5 Conclusions

    5 Balanced Samples and Imagined Populations

    5.1 Introduction

    5.2 A Study in Genre Variation

    5.3 Imagining Populations

    5.4 Multi-Variate and Multi-Level Modelling

    5.5 More Texts – or Longer Ones?

    5.6 Conclusions

    PART 3

    Confidence Intervals and Significance Tests

    6 Introducing Inferential Statistics

    6.1 Why Is Statistics Difficult?

    6.2 The Idea of Inferential Statistics

    6.3 The Randomness of Life

    6.4 Conclusions

    7 Plotting With Confidence

    7.1 Introduction

    7.2 Plotting the Graph

    7.3 Comparing and Plotting Change

    7.4 An Apparent Paradox

    7.5 Conclusions

    8 From Intervals to Tests

    8.1 Introduction

    8.2 Tests for a Single Binomial Proportion

    8.3 Tests for Comparing Two Observed Proportions

    8.4 Applying Contingency Tests

    8.5 Comparing the Results of Experiments

    8.6 Conclusions

    9 Comparing Frequencies in the Same Distribution

    9.1 Introduction

    9.2 The Single-Sample z Test

    9.3 Testing and Interpreting Intervals

    9.4 Conclusions

    10 Reciprocating the Wilson Interval

    10.1 Introduction

    10.2 The Wilson Interval of Mean Utterance Length

    10.3 Intervals on Monotonic Functions of p

    10.4 Conclusions

    11 Competition Between Choices Over Time

    11.1 Introduction

    11.2 The ‘S Curve’

    11.3 Boundaries and Confidence Intervals

    11.4 Logistic Regression

    11.5 Impossible Logistic Multinomials

    11.6 Conclusions

    12 The Replication Crisis and the New Statistics

    12.1 Introduction

    12.2 A Corpus Linguistics Debate

    12.3 Psychology Lessons?

    12.4 The Road Not Travelled

    12.5 What Does This Mean for Corpus Linguistics?

    12.6 Some Recommendations

    12.7 Conclusions

    13 Choosing the Right Test

    13.1 Introduction

    13.2 Tests for Categorical Data

    13.3 Tests for Other Types of Data

    13.4 Conclusions

    PART 4

    Effect Sizes and Meta-Tests

    14 The Size of an Effect

    14.1 Introduction

    14.2 Effect Sizes for Two-Variable Tables

    14.3 Confidence Intervals on ϕ

    14.4 Goodness of Fit Effect Sizes

    14.5 Conclusions

    15 Meta-Tests for Comparing Tables of Results

    15.1 Introduction

    15.2 Some Preliminaries

    15.3 Point and Multi-Point Tests for Homogeneity Tables

    15.4 Gradient Tests for Homogeneity Tables

    15.5 Gradient Tests for Goodness of Fit Tables

    15.7 Conclusions

    PART 5

    Statistical Solutions for Corpus Samples

    16 Conducting Research With Imperfect Data

    16.1 Introduction

    16.2 Reviewing Subsamples

    16.3 Reviewing Preliminary Analyses

    16.4 Resampling and p-Hacking

    16.5 Conclusions

    17 Adjusting Intervals for Random-Text Samples

    17.1 Introduction

    17.2 Recalibrating Binomial Models

    17.3 Examples With Large Samples

    17.4 Alternation Studies With Small Samples

    17.5 Conclusions

    PART 6

    Concluding Remarks

    18 Plotting the Wilson Distribution

    18.1 Introduction

    18.2 Plotting the Distribution

    18.3 Example Plots

    18.4 Further Perspectives on Wilson Distributions

    18.5 Alternative Distributions

    18.6 Conclusions

    19 In Conclusion


    A The Interval Equality Principle

    1 Introduction

    2 Applications

    3 Searching for Interval Bounds With a Computer

    B Pseudo-Code for Computational Procedures

    1 Simple Logistic Regression Algorithm With Logit-Wilson Variance

    2 Binomial and Fisher Functions





      Sean Wallis is Principal Research Fellow and Deputy Director of the Survey of English Usage at UCL.

      "Perfect for corpus linguists: an introduction to statistics written by one of them."

      Christian Mair, University of Freiburg, Germany