
Statistics in Corpus Linguistics Research

A New Approach, 1st Edition

By Sean Wallis


348 pages | 135 B/W Illus.

Purchasing Options ($ = USD):
Paperback: 9781138589384
pub: 2020-07-15
Available for pre-order. Item will ship after 15th July 2020
Hardback: 9781138589377
pub: 2020-07-15



Traditional significance-test approaches to statistics have often been difficult for linguistics researchers to visualise. Statistics in Corpus Linguistics Research: A New Approach breaks these significance tests down for researchers in corpus linguistics and linguistic analysis, promoting a visual approach to understanding the performance of tests with real data, and demonstrating how to derive new intervals and tests.

Software agnostic, this book discusses the "why" behind the statistical model, allowing readers a greater facility for choosing their own methodologies. Accessibly written for those with little to no mathematical or statistical background, it explains the mathematical fundamentals of simple significance tests by relating them to confidence intervals. With sample data sets and easy-to-read visuals, this book focuses on practical issues, such as how to:

• pose research questions in terms of choice and constraint,

• employ confidence intervals correctly (including in graph plots),

• select optimal significance tests (and what results mean),

• measure the size of the effect of one variable on another,

• estimate the similarity of distribution patterns, and

• evaluate whether the results of two experiments significantly differ.

Appropriate for anyone from the student just beginning their career to the seasoned researcher, this book is both a practical overview and a valuable resource.
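To give a flavour of the kind of method the book builds on, here is a minimal Python sketch (not taken from the book) of the Wilson score interval, the confidence interval that underpins several of its chapters. The function name and parameters are illustrative only:

```python
import math
from statistics import NormalDist

def wilson_interval(p, n, alpha=0.05):
    """Wilson score interval for a proportion p observed in n trials."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value, ~1.96 at 95%
    denom = 1 + z**2 / n                      # common denominator term
    centre = (p + z**2 / (2 * n)) / denom     # interval centre, shrunk towards 0.5
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - spread, centre + spread

# Example: 5 successes out of 10 trials (p = 0.5, n = 10)
lo, hi = wilson_interval(0.5, 10)  # roughly (0.237, 0.763)
```

Unlike the naive "Wald" interval, this interval never crosses 0 or 1 and behaves sensibly for small samples and skewed proportions.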


"Perfect for corpus linguists: an introduction to statistics written by one of them."

Christian Mair, University of Freiburg, Germany.

Table of Contents


1. Why do we need another book on statistics?

2. Statistics and scientific rigour

3. Why is statistics difficult?

4. Looking down the observer’s end of the telescope

5. What do linguists need to know about statistics?

6. Acknowledgments

A note on terminology and notation

Contingency tests for different purposes

Part 1. Motivations

1. What might corpora tell us about language?

1. Introduction

2. What might a corpus tell us?

3. The 3A cycle

3.1 Annotation, abstraction and analysis

3.2 The problem of representational plurality

3.3 ICECUP: a platform for treebank research

4. What might a richly annotated corpus tell us?

5. External influences: modal shall / will over time

6. Interacting grammatical decisions: NP premodification

7. Framing constraints and interaction evidence

7.1 Framing frequency evidence

7.2 Framing interaction evidence

7.3 Framing and annotation

7.4 Framing and sampling

8. Conclusions

Part 2. Designing Experiments with Corpora

2. The idea of corpus experiments

1. Introduction

2. Experimentation and observation

2.1 Obtaining data

2.2 Research questions and hypotheses

2.3 From hypothesis to experiment

3. Evaluating a hypothesis

3.1 The chi-square test

3.2 Extracting data

3.3 Visualising proportions, probabilities and significance

4. Refining the experiment

5. Correlations and causes

6. A linguistic interaction experiment

7. Experiments and disproof

8. What is the purpose of an experiment?

9. Conclusions

3. That vexed problem of choice

1. Introduction

1.1 The traditional ‘per million words’ approach

1.2 How did per million word statistics become dominant?

1.3 Choice models and linguistic theory

1.4 The vexed problem of choice

1.5 Exposure rates and other experimental models

1.6 What do we mean by ‘choice’?

2. Parameters of choice

2.1 Types of mutual substitution

2.2 Multi-way choices and decision trees

2.3 Binomial statistics, tests and time series

2.4 Lavandera’s dangerous hypothesis

3. A methodological progression?

3.1 Per million words

3.2 Selecting a more plausible baseline

3.3 Enumerating alternates

3.4 Linguistically restricting the sample

3.5 Eliminating non-alternating cases

3.6 A methodological progression

4. Objections to variationism

4.1 Feasibility

4.2 Arbitrariness

4.3 Oversimplification

4.4 The problem of polysemy

4.5 A complex ecology?

4.6 Necessary reductionism versus complex statistical models

4.7 Discussion

5. Conclusions

4. Choice versus meaning

1. Introduction

2. The meaning of very

3. The choice of very

4. Refining baselines by type

5. Conclusions

5. Balanced samples and imagined populations

1. Introduction

2. A study in genre variation

3. Imagining populations

4. Multi-variate and multi-level modeling

5. More texts – or longer ones?

6. Conclusions

Part 3. Confidence intervals and significance tests

6. Introducing inferential statistics

1. Why is statistics difficult?

2. The idea of inferential statistics

3. The randomness of life

3.1 The Binomial distribution

3.2 The ideal Binomial distribution

3.3 Skewed distributions

3.4 From Binomial to Normal

3.5 From Gauss to Wilson

3.6 Scatter and confidence

4. Conclusions

7. Plotting with confidence

1. Introduction

1.1 Visualising data

1.2 Comparing observations and identifying significant differences

2. Plotting the graph

2.1 Step 1. Gather raw data

2.2 Step 2. Calculate basic Wilson score interval terms

2.3 Step 3. Calculate the Wilson interval

2.4 Step 4. Plotting intervals on graphs

3. Comparing and plotting change

3.1 The Newcombe-Wilson interval

3.2 Comparing intervals: an illustration

3.3 What does the Newcombe-Wilson interval represent?

3.4 Comparing multiple points

3.5 Plotting percentage difference

3.6 Floating bar charts

4. An apparent paradox

5. Conclusions

8. From intervals to tests

1. Introduction

1.1 Binomial intervals and tests

1.2 Sampling assumptions

1.3 Deriving a Binomial distribution

1.4 Some example data

2. Tests for a single Binomial proportion

2.1 The single-sample z test

2.2 The 2 × 1 goodness of fit χ² test

2.3 The Wilson score interval

2.4 Correcting for continuity

2.5 The ‘exact’ Binomial test

2.6 The Clopper-Pearson interval

2.7 The log-likelihood test

2.8 A simple performance comparison

3. Tests for comparing two observed proportions

3.1 The 2 × 2 χ² and z test for two independent proportions

3.2 The z test for two independent proportions from independent populations

3.3 The z test for two independent proportions with a given difference in population means

3.4 Continuity-corrected 2 × 2 tests

3.5 The Fisher ‘exact’ test

4. Applying contingency tests

4.1 Selecting tests

4.2 Analysing larger tables

4.3 Linguistic choice

4.4 Case interaction

4.5 Large samples and small populations

5. Comparing the results of experiments

6. Conclusions

9. Comparing frequencies in the same distribution

1. Introduction

2. The single sample z test

2.1 Comparing frequency pairs for significant difference

2.2 Performing the test

3. Testing and interpreting intervals

3.1 The Wilson comparison heuristic

3.2 Visualising the test

4. Conclusions

10. Reciprocating the Wilson interval

1. Introduction

2. The Wilson interval of mean utterance length

2.1 Scatter and confidence

2.2 From length to proportion

2.3 An example: confidence intervals on mean length of utterance

2.4 Plotting the results

3. Intervals on monotonic functions of p

4. Conclusions

11. Competition between choices over time

1. Introduction

2. The ‘S curve’

3. Boundaries and confidence intervals

3.1 Confidence intervals for p

3.2 Logistic curves and Wilson intervals

4. Logistic regression

4.1 From linear to logistic regression

4.2 Logit-Wilson regression

4.3 Example 1: The decline of the to-infinitive perfect

4.4 Example 2: Catenative verbs in competition

4.5 Review

5. Impossible logistic Multinomials

5.1 Binomials

5.2 Impossible Multinomials

5.3 Possible hierarchical Multinomials

5.4 A hierarchical reanalysis of Example 2

5.5 The three-body problem

6. Conclusions

12. The replication crisis and the New Statistics

1. Introduction

2. A corpus linguistics debate

3. Psychology lessons?

4. The road not travelled

5. What does this mean for corpus linguistics?

6. Some recommendations

6.1 Recommendation 1: include a replication step

6.2 Recommendation 2: focus on large effects – and clear visualisations

6.3 Recommendation 3: play devil’s advocate

6.4 A checklist for empirical linguistics

7. Conclusions

13. Choosing the right test

1. Introduction

1.1 Choosing a dependent variable and baselines

1.2 Choosing independent variables

2. Tests for categorical data

2.1 Two types of contingency test

2.2 The benefits of simple tests

2.3 Visualising uncertainty

2.4 When to use goodness of fit tests

2.5 Tests for comparing results

2.6 Optimum methods of calculation

3. Tests for other types of data

3.1 t tests for comparing two independent samples of numeric data

3.2 Reversing tests

3.3 Tests for other types of variables

3.4 Quantisation

4. Conclusions

Part 4. Effect sizes and meta-tests

14. The size of an effect

1. Introduction

2. Effect sizes for two-variable tables

2.1 Simple difference

2.2 The problem of prediction

2.3 Cramér’s φ

2.4 Other probabilistic approaches to dependent probability

3. Confidence intervals on φ

3.1 Confidence intervals on 2 × 2 φ

3.2 Confidence intervals for Cramér’s φ

3.3 An example: Investigating grammatical priming

4. Goodness of fit effect sizes

4.1 Unweighted φp

4.2 Variance-weighted φe

4.3 Example: Correlating the present perfect

5. Conclusions

15. Meta-tests for comparing tables of results

1. Introduction

1.1 How not to compare test results

1.2 Comparing sizes of effect

1.3 Other meta-tests

2. Some preliminaries

2.1 Test assumptions

2.2 Statistical principles and correcting for continuity

2.3 Example data and notation

3. Point and multi-point tests for homogeneity tables

3.1 Reorganising contingency tables for 2 × 1 tests

3.2 The Newcombe-Wilson point test

3.3 The Gaussian point test

3.4 The multi-point test for r × c homogeneity tables

4. Gradient tests for homogeneity tables

4.1 The 2 × 2 Newcombe-Wilson gradient test

4.2 Cramér’s φ interval and test

4.3 r × 2 homogeneity gradient tests

4.4 Interpreting gradient meta-tests for large tables

5. Gradient tests for goodness of fit tables

5.1 The 2 × 1 Wilson interval gradient test

5.2 r × 1 goodness of fit gradient tests

6. Subset tests

6.1 Point tests for subsets

6.2 Multi-point subset tests

6.3 Gradient subset tests

6.4 Goodness of fit subset tests

7. Conclusions

Part 5. Statistical solutions for corpus samples

16. Conducting research with imperfect data

1. Introduction

2. Reviewing subsamples

2.1 Example 1: get vs. be passive

2.2 Subsampling and reviewing

2.3 Estimating the observed probability p

2.4 Performing a contingency test and extending to Multinomial dependent variables

3. Reviewing preliminary analyses

3.1 Example 2: embedded and sequential postmodifiers

3.2 Testing the worst-case scenario

3.3 Combining subsampling with worst-case analysis

3.4 Ambiguity and error

4. Resampling and p-hacking

5. Conclusions

17. Adjusting intervals for random-text samples

1. Introduction

2. Recalibrating Binomial models

3. Examples with large samples

3.1 Example 1: interrogative clause probability, ‘direct conversations’

3.2 Example 2: clauses per word, ‘direct conversations’

3.3 Uneven-size subsamples

3.4 Example 1 revisited, across ICE-GB

4. Alternation studies with small samples

4.1 Applying the method

4.2 Singletons, partitioning and pooling

4.3 Discussion

5. Conclusions

Part 6. Concluding remarks

18. Plotting the Wilson distribution

1. Introduction

2. Plotting the distribution

2.1 Calculating w(α) from the standard Normal distribution

2.2 Plotting points

2.3 Employing a delta approximation

3. Example plots

3.1 Sample size n = 10, observed proportion p = 0.5

3.2 Properties of Wilson areas

3.3 The effect of p tending to extremes

3.4 The effect of very small n

4. Further perspectives on Wilson distributions

4.1 Percentiles of the Wilson distributions

4.2 The logit Wilson distribution

5. Alternative distributions

5.1 Continuity-corrected Wilson distributions

5.2 Clopper-Pearson distributions

6. Conclusions

19. In conclusion


1. The interval equality principle

1. Introduction

1.1 Axiom

1.2 Functional notation

2. Applications

2.1 Wilson score interval

2.2 Wilson score interval with continuity-correction

2.3 Binomial

2.4 Log-likelihood and other significance test functions

3. Searching for interval bounds with a computer

2. Pseudo-code for computational procedures

1. Simple logistic regression algorithm with logit Wilson variance

1.1 Calculate the sum of squared errors e for known m and k

1.2 Find the optimum value of k by searching for the smallest error e for a given gradient m

1.3 Find optimum values of m and k by the method of least squares, also returning error e

1.4 Perform regression

2. Binomial and Fisher functions

2.1 Core functions

2.2 The Clopper-Pearson interval




About the Author

Sean Wallis is Principal Research Fellow and Deputy Director of the Survey of English Usage at UCL. He is an experienced computer programmer and researcher in artificial intelligence who developed the ICECUP research platform for parsed corpus linguistics, first published in 1998, and maintained and developed thereafter. His publications include Exploring natural language (2002, with G. Nelson and B. Aarts, John Benjamins) and The English verb phrase (2013, edited with J. Close, G. Leech and B. Aarts, CUP). He has published book chapters and journal articles on topics from artificial intelligence to statistics and corpus research methodology, and he has contributed to numerous corpus linguistics research articles. He runs a blog on statistics in corpus linguistics, corp.ling.stats. He is also the convenor of the Convention for Higher Education, a UK network dedicated to defending the academic integrity of the university system.

Subject Categories

BISAC Subject Codes/Headings:
LANGUAGE ARTS & DISCIPLINES / Linguistics / General