Anthony McEnery, Richard Xiao, Yukio Tono

Corpora Survey

Note: This survey is based on my (forthcoming) chapter "Well-known and influential corpora", written for A. Lüdeling, M. Kytö & A. McEnery (eds) Handbooks of Linguistics and Communication Science Volume Corpus Linguistics. Berlin: Mouton de Gruyter. If you know of a corpus that should be included here, I would be obliged if you could send me an introduction – Richard Xiao

Because of the size of this survey, we've split it up into the following 3 pages:

Sections 1-3

Sections 4-8

Sections 9-14

4. Corpora of the Brown family

Brown

Frown

LOB

Pre-LOB

FLOB

Kolhapur

ACE

WWC

LCMC

5. Synchronic corpora

5.1. The International Corpus of English

5.2. The Longman/Lancaster Corpus

5.3. The Longman Written American Corpus

5.4. The CREA corpus of Spanish

5.5. The LIVAC corpus of Chinese

6. Diachronic corpora

6.1. The Helsinki Corpus of English Texts

6.2. The ARCHER corpus

6.3. The Lampeter Corpus of Early Modern English Tracts

6.4. The Dictionary of Old English Corpus in Electronic Form

6.5. Early English Books Online

6.6. The Corpus of Early English Correspondence

6.7. The Zurich English Newspaper Corpus

6.8. The Innsbruck Computer Archive of Machine-Readable English Texts

6.9. The Corpus of English Dialogues

6.10. A Corpus of Late Eighteenth-Century Prose

6.11. A Corpus of Late Modern English Prose

7. Spoken corpora

7.1. The London-Lund Corpus

7.2. SEC, MARSEC and Aix-MARSEC

7.3. The Bergen Corpus of London Teenage Language

7.4. The Cambridge and Nottingham Corpus of Discourse in English

7.5. The Spoken Corpus of the Survey of English Dialects

7.6. The Intonational Variation in English Corpus

7.7. The Longman British Spoken Corpus

7.8. The Longman Spoken American Corpus

7.9. The Santa Barbara Corpus of Spoken American English

7.10. The Saarbrücken Corpus of Spoken English

7.11. The Switchboard Corpus

7.12. The Wellington Corpus of Spoken New Zealand English

7.13. The Limerick corpus of Irish English

7.14. The Hong Kong Corpus of Conversational English

8. Academic and professional English corpora

8.1. The Michigan Corpus of Academic Spoken English

8.2. The British Academic Spoken English corpus

8.3. The Reading Academic Text corpus

8.4. The Academic Corpus

8.5. The Corpus of Professional Spoken American English

8.6. The Corpus of Professional English

4. Corpora of the Brown family

The first modern corpus of English, the Brown University Standard Corpus of Present-day American English (i.e. the Brown corpus, see Kučera/Francis 1967), was built in the early 1960s for written American English. The population from which samples for this pioneering corpus were drawn was written English text published in the United States in 1961, while its sampling frame was a list of the collection of books and periodicals in the Brown University Library and the Providence Athenaeum. The target population was first grouped into 15 text categories, and 500 samples of approximately 2,000 words each were then drawn proportionally from these categories, totaling roughly one million words.

The Brown corpus was constructed with comparative studies in mind, in the hope of setting the standard for the preparation and presentation of further bodies of data in English or in other languages. This expectation has now been realized. Since its completion, the Brown corpus model has been followed in the construction of a number of corpora for synchronic and diachronic studies as well as for cross-linguistic contrast. Table 10 shows a brief comparison of these corpora.

Table 10: Corpora of the Brown family

Corpus    Language variety     Period            Samples  Words (million)
Brown     American English     1961              500      1
Frown     American English     1991-1992         500      1
LOB       British English      1961              500      1
Pre-LOB   British English      1931 +/- 3 years  500      1
FLOB      British English      1991-1992         500      1
Kolhapur  Indian English       1978              500      1
ACE       Australian English   1986              500      1
WWC       New Zealand English  1986-1990         500      1
LCMC      Mandarin Chinese     1991 +/- 3 years  500      1

As can be seen, these corpora are roughly comparable but sample different languages or language varieties. Their sampling periods are either similar, for the purposes of synchronic comparison, or separated by about three decades, for the purposes of diachronic comparison. For example, Brown and LOB (the Lancaster-Oslo-Bergen corpus of British English, see Johansson/Leech/Goodluck 1978) can be used to compare American and British English as used in the early 1960s. The updated versions of the two corpora, Frown (see Hundt/Sand/Skandera 1999) and FLOB (see Hundt/Sand/Siemund 1998), can be used to compare the two major varieties of English as used in the early 1990s. Other corpora of a similar sampling period, such as ACE (the Australian Corpus of English, also known as the Macquarie corpus), WWC (the Wellington Corpus of Written New Zealand English) and Kolhapur (the Kolhapur Corpus of Indian English), together with FLOB and Frown, allow for comparison of world Englishes. For diachronic studies, Brown vs. Frown on the one hand, and the Pre-LOB, LOB and FLOB corpora on the other, provide a reliable basis for tracking recent language change over 30-year periods. The LCMC corpus (the Lancaster Corpus of Mandarin Chinese, see McEnery/Xiao/Mo 2003), when used in combination with the FLOB/Frown corpora, provides a valuable resource for contrastive studies between Chinese and the two major varieties of English.

In comparing these corpora synchronically, caution must be exercised to ensure that the sampling periods are similar. For example, comparing the Brown corpus with FLOB would involve not only language varieties but also language change. Also, as the Brown model may have been modified slightly in some of these corpora, account must be taken of such variation in comparing these corpora across text category by normalizing the raw frequencies to a common basis. Table 11 compares the text categories and number of samples for each category in these corpora.

Table 11: Text categories in the corpora of the Brown family

Code  Text category                                 Brown  Frown  LOB  FLOB  Pre-LOB  Kolhapur  ACE  WWC  LCMC
A     Press reportage                               44     44     44   44    44       44        44   44   44
B     Press editorials                              27     27     27   27    27       27        27   27   27
C     Press reviews                                 17     17     17   17    17       17        17   17   17
D     Religion                                      17     17     17   17    17       17        17   17   17
E     Skills, trades and hobbies                    36     36     38   38    38       38        38   38   38
F     Popular lore                                  48     48     44   44    44       44        44   44   44
G     Biographies and essays                        75     75     77   77    77       70        77   77   77
H     Miscellaneous (reports, official documents)   30     30     30   30    30       37        30   30   30
J     Science (academic prose)                      80     80     80   80    80       80        80   80   80
K     General fiction                               29     29     29   29    29       59        29   29   29
L     Mystery and detective fiction                 24     24     24   24    24       24        15   24   24
M     Science fiction                               6      6      6    6     6        2         7    6    6
N     Western and adventure fiction                 29     29     29   29    29       15        8    29   29
P     Romantic fiction                              29     29     29   29    29       18        15   29   29
R     Humour                                        9      9      9    9     9        9         15   9    9
S     Historical fiction                            -      -      -    -     -        -         22   -    -
W     Women's fiction                               -      -      -    -     -        -         15   -    -

It can be seen from the table that the two American English corpora (Brown and Frown) have the same numbers of samples for each of the 15 text categories while the British English corpora share the same proportions. The two groups differ in the numbers of samples for categories E, F, and G. The WWC and LCMC corpora follow the model of FLOB. There are important differences between the Kolhapur corpus and others in both sampling periods and the proportions of text categories. The ACE corpus covers 17 text categories instead of 15. All of these differences should be taken into account when comparing these corpora.
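
The normalization to a common basis mentioned above can be sketched as follows. This is a minimal illustration rather than a prescribed procedure: the sample counts for categories E, F and G are taken from Table 11, the category size is approximated as the number of samples times 2,000 words, and the raw count of 120 is a hypothetical figure invented for the example.

```python
# Number of ~2,000-word samples per text category (from Table 11).
SAMPLES = {
    "Brown": {"E": 36, "F": 48, "G": 75},
    "LOB": {"E": 38, "F": 44, "G": 77},
}
WORDS_PER_SAMPLE = 2000  # approximate sample length in the Brown model


def per_million(raw_count, corpus, category):
    """Normalize a raw frequency to occurrences per million words."""
    category_size = SAMPLES[corpus][category] * WORDS_PER_SAMPLE
    return raw_count / category_size * 1_000_000


# Hypothetical raw count of some word in category F (popular lore).
brown_f = per_million(120, "Brown", "F")
lob_f = per_million(120, "LOB", "F")
print(f"Brown F: {brown_f:.1f} per million words")
print(f"LOB F:   {lob_f:.1f} per million words")
```

The same raw count yields a higher normalized frequency in LOB than in Brown because category F is smaller in LOB, which is why raw frequencies should not be compared directly across these categories.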

With the exceptions of the Pre-LOB corpus, which is under construction, and LCMC, which is distributed by the European Language Resources Association (ELRA), all of the corpora of the Brown family are available from the International Computer Archive of Modern and Medieval English (ICAME).

The corpora of the Brown family are balanced corpora representing a static snapshot of a language or language variety in a certain period. While they can be used for synchronic and diachronic studies, more appropriate resources for these kinds of research are synchronic and diachronic corpora, which will be introduced in the following two sections.

5. Synchronic corpora

While the corpora of the Brown family are generally good for comparing language varieties such as world Englishes, the results from such a comparison must be interpreted with caution when the corpora under examination were built for different periods or the Brown model has been modified. A more reliable basis for comparing language varieties is a synchronic corpus.

5.1. The International Corpus of English

A typical corpus of this type is the International Corpus of English (ICE), which is specifically designed for the synchronic study of world Englishes. The ICE corpus consists of a collection of twenty corpora of one million words each, each composed of written and spoken English produced during 1990-1994 in countries or regions in which English is a first or official language (e.g. Australia, Canada, East Africa, Hong Kong as well as Great Britain and the USA). As the primary aim of ICE is to facilitate comparative studies of English worldwide, each component follows a common corpus design as well as a common scheme for grammatical annotation to ensure direct comparability among the component corpora. All ICE corpora contain 500 texts of approximately 2,000 words each, sampled from a wide range of spoken (60%) and written (40%) genres, as shown in Table 12 (see Nelson 1996, 29-30).

Table 12: Corpus design of ICE

Spoken (300)
    Dialogues (180)
        Private (100): Conversations (90), Phone calls (10)
        Public (80): Class lessons (20), Broadcast discussions (20), Broadcast interviews (10), Parliamentary debates (10), Cross-examinations (10), Business transactions (10)
    Monologues (120)
        Unscripted (70): Commentaries (20), Unscripted speeches (30), Demonstrations (10), Legal presentations (10)
        Scripted (50): Broadcast news (20), Broadcast talks (20), Non-broadcast talks (10)
Written (200)
    Non-printed (50)
        Student writing (20): Student essays (10), Exam scripts (10)
        Letters (30): Social letters (15), Business letters (15)
    Printed (150)
        Academic (40): Humanities (10), Social sciences (10), Natural sciences (10), Technology (10)
        Popular (40): Humanities (10), Social sciences (10), Natural sciences (10), Technology (10)
        Reportage (20): Press reports (20)
        Instructional (20): Administrative writing (10), Skills/hobbies (10)
        Persuasive (10): Editorials (10)
        Creative (20): Novels (20)

The ICE corpora are marked up and annotated at various levels. In written texts, features of the original layout are marked, including sentence and paragraph boundaries, headings, deletions, and typographic features. Spoken texts are transcribed orthographically and are marked for pauses, overlapping strings, discourse phenomena such as false starts and hesitations, and speaker turns. The bibliographic markup, which gives a complete description (e.g. text category, date, and publisher) of each text, is stored in the corpus header of each file. The level of annotation differs across the ICE corpora. Some of them are POS tagged and parsed (e.g. the British component ICE-GB) while others are currently available as unannotated lexical corpora (e.g. the components for India, Singapore, the Philippines and New Zealand). The available components of ICE can be ordered from the corpus website.

5.2. The Longman/Lancaster Corpus

The Longman/Lancaster Corpus consists of about 30 million words of published English. British data takes up 50% and American data 40% while the other 10% represents other varieties such as Australian, African and Irish English. One half of the samples were selected randomly (microcosmic texts) and the other half selected by a panel of experts (selective texts). Most texts in the corpus are about 40,000 words long but no whole texts are used.

Both imaginative and informative text categories are included. Imaginative texts come from well-known literary works and works randomly sampled from books in print; informative texts come from the natural and social sciences, world affairs, commerce and finance, the arts, leisure, and so on. Imaginative texts are mainly works of fiction in book form while informative texts comprise books, newspapers and journals, unpublished material and ephemera. Four external criteria have been used in text selection (see Holmes-Higgin/Abidi/Ahmad 1994): region (language varieties), time (1900s-1980s), medium (books 80%, periodicals 13.3% and ephemera 6.7%), and level (literary, middle and popular for imaginative texts, and technical, lay and popular for informative texts). As part of the Longman Corpus Network, the Longman/Lancaster Corpus is not available for public access.

5.3. The Longman Written American Corpus

The Longman Written American Corpus currently contains over 100 million words of running texts taken from newspapers, journals, magazines, best-selling novels, technical and scientific writing, and coffee-table books. The design of the Longman Written American Corpus is based on the general design principles of the Longman/Lancaster Corpus and the written section of the BNC. The corpus is dynamically refined and keeps growing with the constant addition of new materials. Like the other components of the Longman Corpus Network, this corpus does not appear to allow public access.

5.4. The CREA corpus of Spanish

The CREA (Corpus de Referencia del Español Actual) is a corpus of standard varieties of Spanish. The corpus currently contains 133 million words sampled from a wide range of written (90%) and spoken (10%) text categories produced in all Spanish-speaking countries between 1975 and 1999 (divided into 5-year periods). The texts in the corpus are distributed evenly between Spain and America. The domains covered in the corpus include science and technology, social sciences, religion and thought, politics and economics, arts, leisure and ordinary life, health, and fiction.

The CREA was designed as a monitor corpus which is continually updated so that it always represents the last twenty-five years of the history of Spanish. New data is added proportionally to maintain the corpus balance and to ensure that the various trends in current Spanish are represented. Texts for 2000-2004 are currently being incorporated (Sánchez 2002).

The CREA corpus is marked in SGML. Bibliographic and taxonomic information is encoded in the corpus header of each file. For written texts, both structural (paragraph and page number) and intratextual (notes, formulas, tables, quotations, foreign words etc.) marks are encoded. For spoken texts, the markup scheme indicates structural (speech turns) and non-structural (overlapping, tottering, anacoluthon, etc.) marks (cf. Guerra 1998).

The modular structure of the CREA corpus allows for flexible searches using geographical, generic, temporal, and thematic criteria. The corpus is accessible on the Internet.

5.5. The LIVAC corpus of Chinese

The LIVAC (Linguistic Variation in Chinese Speech Communities) project started in 1993 with the aim of building a synchronous corpus for studying varieties of Mandarin Chinese. For this purpose, data has been collected regularly and simultaneously, once every four days since July 1995, from representative Mandarin Chinese newspapers and the electronic media of six Chinese speaking communities: Hong Kong, Taiwan, Beijing, Shanghai, Macau and Singapore. The contents of these texts typically include the editorial, and all the articles on the front page, international and local news pages, as well as features and reviews. The corpus is planned to cover a 10-year period between July 1995 and June 2005, capturing salient pre- and post-millennium evolving cultural and social fabrics of the diverse Chinese speech communities (Tsou/Tsoi/Lai et al 2000). The collection of materials from these diverse communities is synchronized with uniform calendar reference points so that all of the components are comparable. As of the end of 2003, the LIVAC corpus contained over 140 million Chinese characters, with 640,000 words in its dictionary. The corpus is expected to grow until the end of June 2005.

All of the corpus texts in LIVAC are segmented automatically and checked by hand. In addition to the corpus, a lexical database is derived from the segmented texts, which includes, apart from ordinary words, those expressing new concepts or undergoing sense shifts, as well as region-specific words from the six communities. The database is thus a rich resource for research into linguistics, sociolinguistics, and Chinese language and society.

As LIVAC captures the social, cultural, and linguistic developments of the six Chinese speaking communities within a decade, it allows for a wide range of comparative studies on linguistic variation in Mandarin Chinese. The corpus also provides an important resource for tracking lexical development such as the evolution of new concepts and their expressions in present-day Chinese. A sample of the corpus (data covering the period from July 1995 to June 1996) can be accessed using the online query system at the corpus site, which shows KWIC concordances as well as frequency distribution across the six speech communities.
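
For readers unfamiliar with KWIC (keyword in context) concordances, the idea can be sketched in a few lines of Python. This is a generic illustration and not the LIVAC query system: the function name `kwic`, the `width` parameter and the English sample sentence are all assumptions made for the example, and Chinese text would first have to be segmented into words, as described above.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical, already tokenized sample text.
text = "the corpus is large and the corpus keeps growing every year"
hits = kwic(text.split(), "corpus", width=2)
for left, kw, right in hits:
    # Right-align the left context so that the keywords line up.
    print(f"{left:>20}  {kw}  {right}")
```

Counting the hits separately for each speech community would give a frequency distribution of the kind the online system reports.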

6. Diachronic corpora

Another way to explore language variation is from a diachronic perspective using diachronic corpora. A diachronic (or historical) corpus contains texts from the same language gathered from different time periods. Typically that period is far more extensive than that covered by Brown/Frown and LOB/FLOB or a monitor corpus such as the Bank of English. Diachronic corpora are used to track changes in language evolution. This section introduces a number of corpora of this kind.

6.1. The Helsinki Corpus of English Texts

Perhaps the best-known historical corpus is the diachronic part of the Helsinki Corpus of English Texts (i.e. the Helsinki corpus), which consists of approximately 1.5 million words of English in the form of 400 text samples, dating from the 8th to 18th centuries. The corpus is divided into three periods (Old, Middle, and Early Modern English) and eleven subperiods, as shown in Table 13 (cf. Kytö 1996).

Table 13: Periods covered in the Helsinki Diachronic Corpus

Period                Subperiod       Words      Percent  Overall
Old English           I. -850         2,190      0.5
                      II. 850-950     92,050     22.3
                      III. 950-1050   251,630    60.9
                      IV. 1050-1150   67,380     16.3
                      Total           413,250    100      26.27%
Middle English        I. 1150-1250    113,010    18.6
                      II. 1250-1350   97,480     16.0
                      III. 1350-1420  184,230    30.3
                      IV. 1420-1500   213,850    35.1
                      Total           608,570    100      38.70%
Early Modern English  I. 1500-1570    190,160    34.5
                      II. 1570-1640   189,800    34.5
                      III. 1640-1710  171,040    31.0
                      Total           551,000    100      35.03%
Total                                 1,572,820           100%

In addition to the basic selection of texts as indicated in the table, there is a supplementary part in the Helsinki corpus, which focuses on regional varieties. This part consists of 834,000 words of Older Scots and 300,000 words of Old American English. While the primary selectional criteria are the dates of texts, the Helsinki corpus has sought to reflect socio-historical variation (e.g. author sex, age and social rank) and a wide range of text types (e.g. law, handbooks, science, trials, sermons, diaries, documents, plays, private and official correspondence, etc.) for each specific period. The textual markup scheme includes more than thirty genre labels, which indicate, whenever available, parameter values for the dialect and the level of formality of the text, the relationship between the writer and the receiver as well as the author's age, sex, and social rank (Rissanen 2000).

As the Helsinki corpus not only samples different periods covering one millennium but also encodes genre and sociolinguistic information, it allows researchers to go beyond simply dating and reporting language change by combining diachronic, sociolinguistic and genre studies. The Helsinki corpus can be ordered from ICAME or the Oxford Text Archive (OTA).
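
The totals and percentages in Table 13 can be recomputed from the subperiod word counts, as in the sketch below. The word counts are copied from the table; the variable names are ours. The recomputed shares agree with the published figures to within rounding (a few table values appear to have been adjusted so that each column sums to exactly 100).

```python
# Subperiod word counts copied from Table 13.
PERIODS = {
    "Old English": [2190, 92050, 251630, 67380],
    "Middle English": [113010, 97480, 184230, 213850],
    "Early Modern English": [190160, 189800, 171040],
}

period_totals = {name: sum(counts) for name, counts in PERIODS.items()}
corpus_total = sum(period_totals.values())

for name, counts in PERIODS.items():
    # Share of each subperiod within its period, in percent.
    shares = [round(100 * c / period_totals[name], 1) for c in counts]
    # Share of the whole period within the corpus, in percent.
    overall = round(100 * period_totals[name] / corpus_total, 2)
    print(f"{name}: {period_totals[name]:,} words, "
          f"subperiod shares {shares}, {overall}% of corpus")

print(f"Total: {corpus_total:,} words")
```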

6.2. The ARCHER corpus

ARCHER, an acronym for A Representative Corpus of Historical English Registers, contains 1.7 million words of data in the form of 1,037 texts sampled from seven 50-year historical periods covering Early Modern English (1650-1990). The corpus is designed as a balanced representation of seven written (e.g. journal-diaries, letters, fiction, news and science) and three speech-based (fictional conversation, drama and sermons-homilies) genres in British (two thirds of the corpus) and American (one third, data available only for the periods 1750-1799, 1850-1899 and 1950-1990) English. Each 50-year subcorpus includes 20,000-30,000 words per register, typically containing ten texts of approximately 2,000-3,000 words each (cf. Biber/Finegan/Atkinson 1994). ARCHER is tagged for grammatical/functional categories. It allows for a wide variety of investigations of recent linguistic change and of change in discourse and genre conventions. The corpus is presently being expanded with more American texts to make the American and British data comparable (see ARCHER 2). The expanded version will also enable a systematic diachronic comparison of the two varieties of English. However, because of copyright restrictions, ARCHER is not publicly available at the moment. Readers interested in using this corpus can contact Douglas Biber.

In addition to the Helsinki and ARCHER corpora, which cover many centuries, there are a number of well-known historical corpora focusing on a particular period or a specific domain or genre, which will be introduced in the following sections.

6.3. The Lampeter Corpus of Early Modern English Tracts

The Lampeter Corpus of Early Modern English Tracts is a balanced corpus covering one century between 1640 and 1740, which is divided into ten decades. Each decade consists of data sampled from six domains (religion, politics, economics/trade, science, law and miscellaneous). Two complete texts, ranging from 3,000 to 20,000 words, are included for each domain within each decade, totaling approximately 1.1 million words (Schmied 1994).

The Lampeter corpus is encoded in TEI-compliant SGML. The TEI header provides the framework for historical, sociolinguistic and stylistic investigations, including information regarding authors (name, age, sex, place of residence, education, social status, political affiliation), printers/publishers, place and date of print, publication format, text characteristics and bibliographical sources. As the corpus includes whole texts rather than smaller samples, it is also useful for the study of textual organization in Early Modern English. The Lampeter corpus can be ordered from ICAME or OTA.

6.4. The Dictionary of Old English Corpus in Electronic Form

The Dictionary of Old English Corpus in Electronic Form (DOEC, the 2000 release) contains 3,037 texts of Old English, totaling over three million words, in addition to two million words of Latin. The texts in the corpus are practically all extant Old English writings. The DOEC corpus includes at least one copy of each surviving text in Old English; where a text is significant because of dialect or date, more than one copy is included. These texts cover six text categories: poetry, prose, interlinear glosses, glossaries, runic inscriptions, and inscriptions in the Latin alphabet. In the prose category in particular, a wide range of text types is covered, including, for example, saints' lives, sermons, biblical translations, penitential writings, laws, charters and wills, records (of manumissions, land grants, land sales, land surveys), chronicles, a set of tables for computing the moveable feasts of the Church calendar and for astrological calculations, medical texts, prognostics (the Anglo-Saxon equivalent of the horoscope), charms (such as those for a toothache or for an easy labour), and even cryptograms (cf. the corpus website). The texts in the corpus are encoded in TEI-compliant SGML. The DOEC corpus can be ordered on CDs or accessed online by institutional site license at the corpus website. The web-based query system allows for searches by single words, word combinations, word proximity and bibliographic sources.

6.5. Early English Books Online

Early English Books Online (EEBO) is a joint effort launched in 1999 between the University of Michigan, Oxford University and ProQuest Information and Learning to create a full-text archive of Early English. From the first book published in English through the age of Spenser and Shakespeare, the EEBO collection now contains about 100,000 of over 125,000 titles listed in Pollard & Redgrave's Short-Title Catalogue (1475-1640) and Wing's Short-Title Catalogue (1641-1700) and their revised editions, as well as the Thomason Tracts (1640-1661) collection and the Early English Books Tract Supplement, covering a wide range of domains including, for example, English literature, history, philosophy, linguistics, theology, music, fine arts, education, mathematics and science. The remaining titles will be digitized and added to the database in the near future. The corpus can be accessed online at the EEBO website.

6.6. The Corpus of Early English Correspondence

The Corpus of Early English Correspondence (CEEC, the 1998 version) consists of 96 collections of ca. 6,000 personal letters written by 778 people (women accounting for 20%) between 1417 and 1681, totaling 2.7 million words. The corpus is accompanied by a sender database, which offers users easy access to various sociolinguistic variables, including writer age, gender, place of birth, education, occupation, social rank, domicile and the relationship with the addressee. CEEC is a balanced corpus which can be neatly divided into two parts, both covering chronologically fairly equal periods: the first from ca. 1417 to 1550 and the second from 1551 to 1680 (cf. Laitinen 2002). Table 14 shows the proportions in terms of writers' social ranks and domiciles (see Nevalainen 2000: 40). The CEEC corpus is currently being expanded with personal letters written between 1682 and 1800 to cover the 18th century.

Table 14: The CEEC corpus by rank and domicile

Rank             Percent    Domicile       Percent
Royalty          2.4        Court          7.8
Nobility         14.7       London         13.9
Gentry           39.3       East Anglia    17.1
Clergy           13.6       North          12.5
Professionals    11.2       Other regions  48.6
Merchants        8.4
Other nongentry  9.4

As copyright restrictions have prevented public access to the full release of the CEEC corpus, a CEEC sampler (CEECS), which represents the non-copyrighted materials included in CEEC, has been published by ICAME. The sampler reflects the structure of the full CEEC only in some respects. The period covered is nearly the same (1418-1680) and is divided into two parts: CEECS1 (246,055 words) covers the 15th and 16th centuries while CEECS2 (204,030 words) covers the 17th century. The sampler corpus consists of 23 collections of 1,147 letters with 194 informants, totaling 450,085 words. The CEEC sampler is available from ICAME or OTA.

6.7. The Zurich English Newspaper Corpus

The Zurich English Newspaper Corpus (ZEN) is a 1.2-million-word collection of newspapers in Early English, covering 120 years (from 1671 to 1791) of British newspaper history. To achieve a representative coverage, a wide variety of newspapers were included. Up to ten issues per newspaper were selected at ten-year intervals throughout the whole period. With the exception of stock market reports, lottery figures, long lists of names and poetry, the whole newspapers were included in the corpus. The news stories are grouped into two major categories: foreign news and home news, with each news category further classified according to its own text genre definition (cf. Fries/Schneider 2000). The corpus is split into four 30-year periods in order to track potential language change, as shown in Table 15 (see Schneider 2002: 202).

Table 15: The ZEN corpus

Section  Period     Words      Sentences
A        1670-1709  242,758    7,642
B        1710-1739  347,825    12,163
C        1740-1769  339,362    14,112
D        1770-1799  298,249    11,843
Total               1,228,194  45,760

The ZEN corpus is SGML-conformant. It not only allows for linguistic analysis of different types of news stories in the 17th and 18th centuries but also makes it possible to compare news texts in Early English with modern newspaper language. The ZEN query system allows restricted access to the online database.
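
As a small worked example of what the figures in Table 15 allow, the sketch below recomputes the totals and derives the mean number of words per sentence in each section. The counts are copied from the table; mean sentence length is simply one measure that can be derived from them, chosen here for illustration, and not an analysis reported for the corpus.

```python
# Word and sentence counts per section, copied from Table 15.
ZEN = {
    "A": {"period": "1670-1709", "words": 242758, "sentences": 7642},
    "B": {"period": "1710-1739", "words": 347825, "sentences": 12163},
    "C": {"period": "1740-1769", "words": 339362, "sentences": 14112},
    "D": {"period": "1770-1799", "words": 298249, "sentences": 11843},
}

total_words = sum(s["words"] for s in ZEN.values())
total_sentences = sum(s["sentences"] for s in ZEN.values())

# Mean sentence length (words per sentence) for each section.
mean_length = {
    section: s["words"] / s["sentences"] for section, s in ZEN.items()
}

for section, s in ZEN.items():
    print(f"{section} ({s['period']}): "
          f"{mean_length[section]:.1f} words per sentence")
print(f"Total: {total_words:,} words in {total_sentences:,} sentences")
```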

6.8. The Innsbruck Computer Archive of Machine-Readable English Texts

The Innsbruck Computer Archive of Machine-Readable English Texts (ICAMET) contains ca. 500 Middle English texts totaling 5.7 million words. The database comprises three parts, namely, the Prose Corpus (129 texts written during 1100-1500, accounting for two thirds of the total), the Letter Corpus (254 letters written during 1386-1688, arranged in diachronic order), and the Prose Varia Corpus (mainly translations or normalized versions of Middle English texts). An advantage of ICAMET is that the database consists of complete texts instead of extracts, which allows literary, historical and topical analyses of various kinds, particularly studies of cultural history (Marcus 1999). Nevertheless, copyright restrictions have limited public access to many prose texts in the corpus. A sampler containing half of the prose texts and all letters is available from ICAME.

6.9. The Corpus of English Dialogues

The Corpus of English Dialogues (CED) contains 1.3 million words of Early Modern English dialogue texts produced over a 200-year time span between 1560 and 1760. While the spoken language of the past is not directly accessible to modern speakers, it is recorded in speech-related texts. The CED corpus samples six such text categories: trial proceedings, witness depositions, drama, handbooks in dialogue form, fictional dialogues, and language teaching books (cf. Culpeper/Kytö 1997).

The focus on dialogue will allow insight into the nature of impromptu speech and interactive two-way communication in the Early Modern English period - aspects which have received little research attention. The CED corpus is currently under construction by the Universities of Lancaster and Uppsala.

6.10. A Corpus of Late Eighteenth-Century Prose

A Corpus of Late Eighteenth-Century Prose contains 30,000 words of unpublished letters transcribed from the originals dated from the period 1761-1790. The corpus is distributed in both plain text (extended ASCII) and HTML versions. The text version can be used with a concordancer while the HTML version facilitates viewing the corpus in a browser. The plain text version is marked up in the COCOA format, giving information on writer, date and page breaks, etc. The corpus is intended to complement major diachronic corpora like the Helsinki corpus, which stop in the early eighteenth century. Another aim of the corpus is to illustrate non-literary English and English relatively uninfluenced by prescriptivist ideas, in the belief that it might help with research into change in (ordinary, spoken) language in the late Modern English period (van Bergen/Denison 2004). The corpus is by no means uniform, nor is it balanced. Nevertheless, because of the nature of the material, it is of great use to both linguists and historians. The corpus can be ordered from the Oxford Text Archive, free of charge, for use in education and research.

6.11. A Corpus of Late Modern English Prose

A Corpus of Late Modern English Prose contains 100,000 words of informal private letters written by British writers between 1861 and 1919. All decades in this period are represented, with about 6,000 words for the decade 1880-1889, 13,000 words for 1890-1899 and 20,000 words for each of the other four decades. These blocks of texts are sampled from five sources.

Stored in seven extended (8-bit) ASCII text files, the corpus is marked up following the conventions used in the Helsinki corpus, with information on writer, recipient, relationship, date, genre, and page etc. encoded in COCOA-style brackets (see Denison 1994). The corpus can be ordered at no cost from the Oxford Text Archive.
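COCOA-style references are conventionally written as a category code followed by a value inside angle brackets, for example `<A SMITH JOHN>`. The sketch below shows one way such references might be separated from the running text before concordancing. The category codes (A, R, O, P) and the sample line are invented for illustration and are not taken from the corpus documentation, which should be consulted for the codes actually used.

```python
import re

# Matches a COCOA-style reference such as "<A SMITH JOHN>":
# one category letter, whitespace, then the value up to the closing bracket.
COCOA_REF = re.compile(r"<([A-Za-z])\s+([^<>]*)>")

def split_cocoa(text):
    """Return (metadata, body) for text containing COCOA-style references.

    metadata maps each category code to the list of values seen for it;
    body is the text with the references removed and whitespace normalized.
    """
    metadata = {}
    for code, value in COCOA_REF.findall(text):
        metadata.setdefault(code, []).append(value.strip())
    body = " ".join(COCOA_REF.sub("", text).split())
    return metadata, body

# Hypothetical sample: the codes and values are assumptions, not corpus data.
sample = "<A SMITH JOHN> <R BROWN MARY> <O 1885> Dear Mary, <P 12> I write in haste."
metadata, body = split_cocoa(sample)
print(metadata)
print(body)
```

Keeping the metadata apart from the body in this way means that word counts and concordance lines are not inflated by the markup itself.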

In addition to the diachronic corpora introduced in the previous sections, a number of databases are accessible online, for example, Michigan Early Modern English Materials, the Corpus of Middle English Prose and Verse (CME), the Middle English Collection (MidEng), and the Korpus of Early Modern Playtexts in English.

7. Spoken corpora

While general corpora like national corpora may contain spoken material, there are a number of well-known publicly available spoken English corpora, which will be introduced in this section.

7.1. The London-Lund Corpus

The London-Lund Corpus (LLC), the first electronic corpus of spontaneous language, is a corpus of spoken British English recorded between 1953 and 1987. The corpus derived from two projects: the Survey of English Usage (SEU) at University College London and the Survey of Spoken English (SSE) at Lund University. There are two versions of LLC: the original version, consisting of 87 transcripts from SSE totaling 435,000 words, and the complete version, which has been augmented by 13 supplementary transcripts from SEU. The full LLC corpus comprises 100 texts, each of 5,000 words, totaling half a million running words. A distinction is made between dialogue (e.g. face-to-face conversations, telephone conversations, and public discussion) and monologue (both spontaneous and prepared) in the organization of the corpus (cf. Greenbaum/Svartvik 1990). This textual information is encoded together with speaker information (e.g. gender, age, occupation). The texts in the corpus are transcribed orthographically, with detailed prosodic annotation. The LLC corpus is available from ICAME.

7.2. SEC, MARSEC and Aix-MARSEC

The Lancaster/IBM Spoken English Corpus (SEC) consists of approximately 53,000 words of spoken British English, mainly taken from radio broadcasts dating between 1984 and 1991. For a corpus of this size, it is impossible to include samples of every style of spoken English. The SEC corpus has been designed to cover speech categories suitable for speech synthesis, as shown in Table 16 (see Taylor/Knowles 1988).

Table 16: The SEC categories

| Code | Category | Words | Proportion |
|---|---|---|---|
| A | Commentary | 9066 | 17% |
| B | News broadcast | 5235 | 10% |
| C | Lecture aimed at general audience | 4471 | 8% |
| D | Lecture aimed at restricted audience | 7451 | 14% |
| E | Religious broadcast including liturgy | | |
| F | Magazine-style reporting | 4170 | 9% |
| G | Fiction | 7299 | 14% |
| H | Poetry | 1292 | 2% |
| J | Dialogue | 6826 | 13% |
| K | Propaganda | 1432 | 3% |
| M | Miscellaneous | 3352 | 6% |
| Total | | 52637 | c. 100% |

In the SEC corpus, efforts have been made to achieve a balance between the highly stylized texts (e.g. poetry, religious broadcast, propaganda) and dialogue, and between male and female speakers. Of the 53 speakers in the corpus, 17 are female, representing 30% of the corpus. The higher proportions of male speakers in the news and commentary categories reflect the tendency of the BBC to use mainly male speakers in these types of programmes.

SEC is available in orthographic, prosodic, grammatically tagged and treebank versions, which should prove most useful to researchers in speech synthesis or speech recognition. The corpus can be ordered from ICAME.

The Machine Readable Spoken English Corpus (MARSEC) is an extension of SEC in which the original acoustic recordings were digitalized, and word-level time-alignment between the transcripts and the acoustic signals was included. Tonetic stress marks were also converted into ASCII symbols to make the corpus machine-readable. The prosodically annotated word-level alignment files are available at the MARSEC website.

The Aix-MARSEC database is a further development of MARSEC. The database consists of two major components: the digitalized recordings from MARSEC and the annotations. Annotations have so far been undertaken at nine levels such as phonemes, syllables, words, stress feet, rhythm units, minor and major turn units. Two supplementary levels, the grammatical annotation by CLAWS and a Property Grammar system developed at Aix-en-Provence, are to be integrated soon (cf. Auran/Bouzon/Hirst 2004). The database, together with tools, is available under GNU GPL licensing at the Aix-MARSEC project site.

7.3. The Bergen Corpus of London Teenage Language

The Bergen Corpus of London Teenage Language (COLT) is the first large English corpus focusing on the speech of teenagers. It contains half a million words (about 55 hours of recording) of orthographically transcribed spontaneous teenage talk recorded in 1993 by 31 volunteer recruits from five socially different school boroughs. The speakers in the corpus are classified into six age groups: preadolescence (0-9 years old), early adolescence (10-13), middle adolescence (14-16), late adolescence (17-19), young adults (20-29) and older adults (30+). As the name of the corpus suggests, the core of the corpus represents teenagers. The early, middle and late adolescence groups account for 24%, 61% and 9% respectively, totaling 94% of the corpus. The older adult group, mostly parents and teachers, takes up 6%. As regards speaker gender, girls and boys contributed roughly the same amount of text: the male speakers about 51.8% (230,616 words) and the female speakers 48.2% (214,215 words). In terms of social class, only about 50% of the corpus material can be assigned a social group value. The material that has been classified is evenly distributed across the three social groups: high, middle, and low. While a wide range of settings is present in the COLT corpus, settings connected with school (48%) and home (32%) are the most common. Such speaker-specific information (speaker age, gender, social class, etc.) and conversation-specific information (location and setting) is encoded in the header of each corpus text. In the body of the text, paralinguistic features and non-verbal sounds are also marked up (cf. Haslerud/Stenström 1995).

The corpus constitutes part of the British National Corpus. In addition, COLT is released in both an orthographically transcribed (pure text) version and a tagged version (using the CLAWS C7 tagset). A prosodically annotated version (a representative selection amounting to approximately 150,000 words) is also available. The corpus is for non-commercial purposes and can be accessed online by registered users or ordered from ICAME.

7.4. The Cambridge and Nottingham Corpus of Discourse in English

The Cambridge and Nottingham Corpus of Discourse in English (CANCODE) is part of the Cambridge International Corpus (CIC, see Appendix). The corpus comprises five million words of transcribed spontaneous speech recorded in Britain and Ireland between 1994 and 2001, covering a wide variety of mostly informal settings: casual conversation, people working together, people shopping, people finding out information, discussions and many other types of interaction. As CANCODE is designed as a contextually and interactively differentiated corpus, the data has been carefully collected and sociolinguistically profiled with reference to a range of different speech genres and with an emphasis on everyday communication.

A unique feature of CANCODE is that the corpus has been coded with information pertaining to the relationship between the speakers: whether they are intimates (living together), casual acquaintances, colleagues at work, or strangers. For this purpose, CANCODE is organized along two main axes: context-type and interaction-type. Along the axis of context-type, on the cline from public to private, are the categories transactional, professional, socializing and intimate. Along the axis of interaction-type, on the cline from collaborative to non-collaborative, are the categories information provision, collaborative idea, and collaborative work. The interactions between the two axes, together with typical settings, are shown in Table 17 (see Carter/McCarthy 2004, 67). This coding allows users to look more closely at how different levels of familiarity (formality) affect the way in which people speak to each other. The corpus is not currently available to the public.

Table 17: CANCODE text types

| Context-type (rows) / Interaction-type (columns) | Information provision | Collaborative idea | Collaborative work |
|---|---|---|---|
| Transactional | commentary by museum guide | chatting with hairdresser | choosing and buying a television |
| Professional | oral report at group meeting | planning meeting at place of work | colleagues window-dressing |
| Socializing | telling jokes to friends | reminiscing with friends | friends working together |
| Intimate | partner relating the story of a film seen | siblings discussing their childhood | couple decorating a room |

7.5. The Spoken Corpus of the Survey of English Dialects

A corpus that was built specifically for the study of English dialects is the spoken corpus of the Survey of English Dialects (SED, see Beare/Scott 1999). The Survey of English Dialects was started in 1948 by Harold Orton at the University of Leeds. The initial work comprised a questionnaire-based survey of traditional dialects based on extensive interviews of about 1,000 people from 313 locations all over rural England. During the survey, a number of recordings were made as well as the detailed interviews. The recordings, which were made during 1948-1961, consist of about 60 hours of dialogue of people aged 60 or above talking about their memories, families, work and the folklore of the countryside from a century ago. Elderly people were chosen as subjects because they were most likely to speak the traditional, uncontaminated dialect of their area.

The spoken corpus derived from SED consists of transcripts of 314 recordings from 289 (out of the 313) SED localities in England, totaling roughly 800,000 running words. The original recordings were transcribed, with sound files linked to transcripts. The corpus is marked up in TEI-compliant SGML and POS-tagged using CLAWS.

While the spoken corpus of SED comprises data invariably produced by elderly people, as the survey was conducted nationwide, covering every county of England, it has, for the first time, made it possible to conduct a detailed study of the regional variation in English dialects on a national level. Also, as the data reflects a society which was different in many ways from that of today, the corpus is a valuable resource for dialectologists and historical linguists as well as historians. The CD-ROM of the spoken corpus is published by Routledge, London.

7.6. The Intonational Variation in English Corpus

The Intonational Variation in English (IViE) corpus was constructed for the investigation of cross-varietal and stylistic variation in British English intonation, focusing on nine urban varieties of English spoken in the British Isles, i.e. Belfast, Bradford, Cambridge, Cardiff, Dublin, Liverpool, Leeds, London, and Newcastle. The corpus comprises 36 hours of speech data in five different speaking styles: phonetically controlled sentences (statements, questions without morpho-syntactic markers, WH-questions, inversion questions, coordination structures), a read text (the fairy tale Cinderella), a retold version of the same text, a map task (find your way around a small town) and free conversations (on the assigned topic of smoking). The data was collected in urban secondary schools, and the speakers were 16 years old at the time when the recordings were made. A minimum of six male and six female speakers from each variety were recorded, though more speakers were included for some of the varieties, totaling 116 speakers in all (cf. Grabe/Post/Nolan 2001). The corpus is available free of charge for non-commercial use only. Orthographic and prosodic transcriptions, together with digitized sound files, can be ordered on CDs or downloaded from the corpus website.

7.7. The Longman British Spoken Corpus

The Longman British Spoken Corpus contains 10 million words of natural, spontaneous conversations from a representative sample of the population in terms of speaker age, gender, social group and region, and from the language of lectures, business meetings, after dinner speeches and chat shows. The design criteria are discussed in detail in Crowdy (1993). The Longman British Spoken Corpus is the first large scale attempt to collect spoken data in a systematic way. The corpus is part of the spoken section of the British National Corpus (see section 2.1).

7.8. The Longman Spoken American Corpus

The Longman Spoken American Corpus comprises five million words of spoken data collected from everyday conversations of more than 1,000 Americans of various age groups, levels of education, and ethnicity from over 30 US States. Equal numbers of participants were chosen from each region, and a balance was struck between the numbers of participants from rural and city areas within those regions. Recordings were made of four-hour chunks of the normal daily conversations of each participant over periods of at least four days. The participants were chosen to be representative for gender, age, ethnicity and education, as shown by the latest US demographic census statistics (Table 18, see Stern 1997). As part of the Longman Corpus Network, the Longman Spoken American Corpus is a property of the Longman publishers for in-house use only.

Table 18: Demographic distribution of the Longman Spoken American Corpus

| Variable | Proportions |
|---|---|
| Gender | Male: 50%; Female: 50% |
| Age | 18-24: 20%; 25-34: 20%; 35-44: 20%; 45-60: 20%; 60+: 20% |
| Ethnicity | White: 75%; Black: 13%; Hispanic: 8%; Asian: 4% |
| Education | Degree/Higher degree: 33%; College: 33%; High school: 33% |

7.9. The Santa Barbara Corpus of Spoken American English

The Santa Barbara Corpus of Spoken American English (SBCSAE) is based on hundreds of recordings of spontaneous speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects the many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, etc. (cf. Dubois/Chafe/Meyer et al 2000-2004).

The corpus is particularly useful for research into speech recognition as each speech file is accompanied by a transcript in which phrases are time stamped to allow them to be linked with the audio recording from which the transcription was produced. Personal names, place names, phone numbers, etc, in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances and the audio files have been filtered to make these portions of the recordings unrecognisable. The SBCSAE corpus is distributed by the LDC in five parts, the first three of which have been released to date.
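Time stamps of this kind allow a phrase to be located in the audio by simple arithmetic. The sketch below is only an illustration: it assumes a hypothetical tab-separated line layout (start time, end time, speaker, text, with times in seconds) and a hypothetical sample rate of 16,000 samples per second. The actual file layout and audio format should be checked against the documentation distributed with the corpus.

```python
from dataclasses import dataclass

@dataclass
class Phrase:
    start: float   # start time in seconds
    end: float     # end time in seconds
    speaker: str
    text: str

def parse_line(line):
    """Parse one line of the assumed layout: start<TAB>end<TAB>speaker<TAB>text."""
    start, end, speaker, text = line.rstrip("\n").split("\t", 3)
    return Phrase(float(start), float(end), speaker.strip(), text.strip())

def sample_span(phrase, sample_rate):
    """Convert a phrase's time stamps into (first, last) audio sample offsets."""
    return round(phrase.start * sample_rate), round(phrase.end * sample_rate)

# Invented example line; speaker label and wording are not corpus data.
line = "1.50\t3.25\tSPEAKER1\twell I suppose so\n"
phrase = parse_line(line)
span = sample_span(phrase, 16000)   # 16000 is an assumed sample rate
print(phrase.speaker, phrase.text, span)
```

With the sample offsets in hand, the matching stretch of audio can be cut out and played or analyzed alongside the transcribed phrase.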

7.10. The Saarbrücken Corpus of Spoken English

The Saarbrücken Corpus of Spoken English (SCoSE) consists of three parts: stories, jokes and interviews. The first two parts comprise excerpts transcribed from audio-taped talk recorded by researchers and students at Northern Illinois University and at Saarland University. Most of the excerpts come from real conversations among family members and friends, fellow students and colleagues. The third part includes transcripts of stories recorded in interviews with senior citizens aged 80 and older in a retirement community in Indianapolis, Indiana in the summer of 2002. In all of these parts, speech turns are indicated. The hard copy of the corpus (in PDF format), together with a description of transcription conventions, is available at no cost at the corpus site. The electronic copy of the corpus is downloadable and accessible online at Talkbank. Users can use the interface at the Talkbank site to generate three different versions to suit their research interests: the properly marked up text version, the XML version, and the HTML version in which each utterance is aligned with the audio recording.

7.11. The Switchboard Corpus

The Switchboard Corpus (SWB) is a corpus of 2,438 spontaneous telephone conversations, averaging 6 minutes in length, recorded in the early 1990s from 542 speakers of both sexes from every major dialect of American English. The transcripts total three million words (over 240 hours of recordings). Information relevant to speakers' sex, year of birth, education level and dialect region is available in the documentation accompanying the corpus. Table 19 shows the distribution of major sociolinguistic variables (see Godfrey/Holliman 1997).

Table 19: The Switchboard corpus

| Dialect | Speaker age | Speaker sex | Education |
|---|---|---|---|
| South Midland (155) | 20-29 (140) | Male (292) | High school - (14) |
| Western (85) | 30-39 (179) | Female (239) | College - (39) |
| North Midland (77) | 40-49 (112) | | College (309) |
| Northern (75) | 50-59 (87) | | College + (176) |
| Southern (56) | 60-69 (13) | | Unknown (4) |
| NYC (33) | | | |
| Mixed (26) | | | |
| New England (21) | | | |

As each transcript in the corpus is time-aligned at the word level, the corpus is useful for sociolinguistic studies as well as for speech recognition. The corpus is distributed by the LDC. It can also be downloaded from the Switchboard website or accessed via LDC Online.

7.12. The Wellington Corpus of Spoken New Zealand English

The Wellington Corpus of Spoken New Zealand English (WSC) comprises one million words of spoken New Zealand English in the form of 551 2,000-word extracts collected between 1988 and 1994 (99% of the data from 1990-1994, the exception being eight private interviews). A very stringent criterion was adopted to ensure the integrity of the New Zealand samples included in the corpus. Data was collected only from speakers who had lived in New Zealand since before the age of 10, had spent less than 10 years (or half their lifetime, whichever was greater) overseas, and had returned from any overseas trip more than one year before. The extracts are classified into 15 text categories covering a wide range of contexts in which each style of speech is found, as shown in Table 20 (cf. Holmes/Vine/Johnson 1998).

Table 20: Composition of the WSC corpus

| Category | Text category | Words |
|---|---|---|
| Monologue: Public scripted, broadcast | Broadcast news | 28,929 |
| | Broadcast monologue | 11,205 |
| | Broadcast weather | 3,641 |
| Monologue: Public unscripted | Sports commentary | 26,010 |
| | Judge's summation | 4,489 |
| | Lecture | 30,406 |
| | Teacher monologue | 12,496 |
| Dialogue: Private | Conversation | 500,363 |
| | Telephone conversation | 70,156 |
| | Oral history interview | 21,972 |
| | Social dialect interview | 31,058 |
| Dialogue: Public | Radio talkback | 84,321 |
| | Broadcast interview | 96,775 |
| | Parliamentary debate | 22,446 |
| | Transactions and meetings | 102,332 |
| Total | | 1,046,599 |

The formal speech section (12%) in the WSC corpus includes all monologue categories and parliamentary debate in the public dialogue category. The semi-formal section (13%) includes the three types of interview (both public and private). All of the other text categories make up the informal speech section (75%), with private conversation alone accounting for 50% of the corpus. In terms of speaker gender, women contributed 52% and men 48% of the final transcribed words, reflecting the New Zealand population balance. With regard to speaker age, data for the age group 20-24 accounts for more than 20% of the corpus, and the proportions for age groups 45-49 and 40-44 both exceed 10% while there is little data for those aged over 70. The distribution across different age groups generally mirrors the population structure in New Zealand. The corpus data also reflects the distribution of population across ethnic groups, with data collected for Pakeha accounting for 76%, and for Maori 18%. Every speech sample included in the corpus is described as fully as possible in terms of sociolinguistic variables such as the gender, age, regional origin, social class, level of education and occupation of its contributor.
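The word counts in Table 20 allow the register shares to be recomputed. The following is a minimal sketch that groups the text categories as described above; the grouping is one reading of that description and the variable names are illustrative. The shares it yields differ by up to about three percentage points from the rounded figures quoted, which may reflect design targets rather than final counts.

```python
# Word counts per text category, copied from Table 20.
WORDS = {
    "Broadcast news": 28_929,
    "Broadcast monologue": 11_205,
    "Broadcast weather": 3_641,
    "Sports commentary": 26_010,
    "Judge's summation": 4_489,
    "Lecture": 30_406,
    "Teacher monologue": 12_496,
    "Conversation": 500_363,
    "Telephone conversation": 70_156,
    "Oral history interview": 21_972,
    "Social dialect interview": 31_058,
    "Radio talkback": 84_321,
    "Broadcast interview": 96_775,
    "Parliamentary debate": 22_446,
    "Transactions and meetings": 102_332,
}

# Formal: all monologue categories plus parliamentary debate.
FORMAL = [
    "Broadcast news", "Broadcast monologue", "Broadcast weather",
    "Sports commentary", "Judge's summation", "Lecture",
    "Teacher monologue", "Parliamentary debate",
]
# Semi-formal: the three types of interview.
SEMI_FORMAL = [
    "Oral history interview", "Social dialect interview", "Broadcast interview",
]
# Informal: everything else.
INFORMAL = [c for c in WORDS if c not in FORMAL and c not in SEMI_FORMAL]

total = sum(WORDS.values())

def share(categories):
    """Percentage of the corpus contributed by the given text categories."""
    return 100 * sum(WORDS[c] for c in categories) / total

print(f"total words: {total:,}")
for label, group in [("formal", FORMAL), ("semi-formal", SEMI_FORMAL), ("informal", INFORMAL)]:
    print(f"{label}: {share(group):.1f}%")
print(f"private conversation: {share(['Conversation']):.1f}%")
```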

The unusually high proportion of private material and the rich sociolinguistic variation make the WSC corpus a valuable resource for research into informal spoken registers as well as for sociolinguistic studies. The corpus is available from ICAME.

7.13. The Limerick Corpus of Irish English

The Limerick Corpus of Irish English (L-CIE) comprises one million words in the form of 375 transcripts of naturally occurring conversations recorded in a wide variety of speech contexts throughout Ireland (excluding Northern Ireland). Speakers range from 14 to 78 years of age and there is an equal representation of male and female speakers. While the corpus consists mainly of casual conversation, it also has over 200,000 words of professional, transactional and pedagogic Irish English which, along with the casual conversation data, were carefully collected with reference to a range of different speech genres. The corpus has followed the design of CANCODE by organizing the corpus along the axes of context type and interaction type, as shown in Table 21 (cf. Farr/Murphy/O'Keeffe forthcoming).

Table 21: Design of the L-CIE corpus

| | Information provision | Collaborative idea | Collaborative task |
|---|---|---|---|
| Pedagogic | 80,253 words, e.g. linguistics lecture | 60,473 words, e.g. English poetry tutorial | 10,000 words, e.g. one-to-one computer lesson |
| Professional | 145,000 words, e.g. real-estate office talk | 100,000 words, e.g. team meeting | 60,000 words, e.g. waitresses washing dishes |
| Socialising | 50,000 words, e.g. describing a new bar | 54,356 words, e.g. friends discussing college | 30,000 words, e.g. friends assembling a bed |
| Intimate | 60,000 words, e.g. mother storytelling | 266,000 words, e.g. partners making holiday plans | 60,000 words, e.g. family preparing dinner |
| Transactional | 5,000 words, e.g. product presentation | 10,000 words, e.g. chatting in a taxi | 1,000 words, e.g. eye examination |

While it is not designed to be geographically representative (it does not include data from every county in the Republic of Ireland), the L-CIE corpus has developed a careful sociolinguistic classification scheme which facilitates inter-corpus comparisons, especially with regard to linguistic choices and the relationships that hold between the speakers. The corpus website allows online access by registered users.

7.14. The Hong Kong Corpus of Conversational English

The Hong Kong Corpus of Conversational English (HKCCE) comprises 50 hours of recordings made up of 130 separate conversations involving a total of 341 participants. The conversations range in length from 2 minutes 49 seconds to 1 hour 15 minutes, averaging about 23 minutes. The corpus is divided into four subcorpora (conversations, academic discourses, business discourses and public discourses), amounting to approximately 500,000 words. The recordings were made in the mid-1990s of conversations between Hong Kong Chinese and non-Cantonese speakers (mostly native speakers of English). Table 22 shows the distribution of the data across various design criteria (cf. Cheng/Warren 1999, 13-16).

Table 22: Design criteria of HKCCE

| Criterion | Type | Proportion |
|---|---|---|
| Gender | Male (Native speaker of English) | 34% |
| | Female (Native speaker of English) | 18% |
| | Male (Non-native English) | 24% |
| | Female (Non-native English) | 24% |
| Age | 18-29 | 40% |
| | 30-39 | 21% |
| | 40-49 | 27% |
| | 50-59 | 10% |
| | 60+ | 2% |
| Education | Form 5 (17 years) | 8% |
| | Form 7 (19 years) | 5% |
| | University | 82% |
| | Other | 5% |
| Domain | Education | 35% |
| | Business | 23% |
| | Administration | 11% |
| | Engineering | 8% |
| | Service sector | 7% |
| | Arts | 5% |
| | Law | 3% |
| | Media | 2% |
| | Airline industry | 2% |
| | Medical | 2% |
| | Not employed | 2% |
| Number of speakers | 2 | 58% |
| | 3 | 23% |
| | 4 | 10% |
| | 4+ | 9% |

The corpus has not only facilitated sociolinguistic research into the English spoken in Hong Kong, it has also made it possible to compare native and non-native spoken English. In addition to the orthographic transcription, the corpus is currently being annotated prosodically to enable researchers to examine the communicative role of intonation. The corpus has not been publicly released.

8. Academic and professional English corpora

As language may vary considerably across genre and domain, specialized corpora provide valuable resources for investigations in the relevant genres and domains. Unsurprisingly, there has recently been much interest in the creation and exploitation of specialized corpora in academic or professional settings. This section introduces a number of well-known English corpora of this kind.

8.1. The Michigan Corpus of Academic Spoken English

The Michigan Corpus of Academic Spoken English (MICASE) contains approximately 1.8 million words in the form of 152 transcripts of nearly 200 hours of recordings of 1,571 speakers, focusing on contemporary university speech within the domain of the University of Michigan. Table 23 shows the structure of the corpus (cf. MICASE Manual).

Table 23: The MICASE corpus

| Criterion | Distribution |
|---|---|
| Speaker gender | Male (46%); Female (54%) |
| Academic role | Faculty (49%); Students (44%) |
| Language status | Native speakers (88%); Non-native speakers (12%) |
| Academic division | Humanities & Arts (26%); Social Sciences & Education (25%); Biological & Health Sciences (19%); Physical Sciences & Engineering (21%); Other (9%) |
| Primary discourse mode | Monologue (33%); Panel (8%); Interactive (42%); Mixed (17%) |
| Speech event type | Advising (3.5%); Colloquia (8.9%); Discussion sections (4.4%); Dissertation defenses (3.4%); Interviews (0.8%); Labs (4.4%); Large lectures (15.2%); Small lectures (18.9%); Meetings (4.1%); Office hours (7.1%); Seminars (8.9%); Study groups (7.7%); Student presentations (8.5%); Service encounters (1.5%); Tours (1.3%); Tutorials (1.6%) |

In the MICASE corpus, speakers are divided into four age groups: 17-23, 24-30, 31-50, and 51+. In terms of academic role, they are classified into a number of categories: junior and senior undergraduates, junior and senior postgraduates, junior and senior faculty and researchers, etc. The language status can be native speaker (North American English), other native speaker (non-American English), near native speaker, and non-native speaker.

The MICASE corpus was originally marked up in TEI-compliant SGML. All of the SGML files have now been converted to the XML format in order to meet the requirements for further corpus development including a web-based search interface and the streaming web delivery of the sound recordings, synchronized with the transcripts. At present, only the orthographically transcribed version of the corpus is available, though future releases will include various kinds of annotations such as parts-of-speech, lemmas and discourse-pragmatic categories. The MICASE corpus can be searched online free of charge or ordered at a nominal fee at the corpus website.

8.2. The British Academic Spoken English corpus

The British Academic Spoken English (BASE) corpus, which is designed as a British counterpart to the MICASE, is under construction at the Universities of Reading and Warwick. The corpus currently comprises a collection of recordings and marked up transcripts of 160 lectures (63 from Reading and 97 from Warwick, totaling 127 recording hours) and 39 seminars (from Warwick, 32 hours). The lectures and seminars spread evenly across four subject areas, as shown in Table 24 (cf. the corpus website).

Table 24: Components of the BASE corpus

| Subject area | Lectures | Seminars |
|---|---|---|
| Arts and Humanities | 42 | 10 |
| Social Studies and Sciences | 40 | 11 |
| Physical Sciences | 40 | 8 |
| Life and Medical Sciences | 38 | 10 |
| Total | 160 | 39 |

Unlike MICASE, the BASE corpus only covers two types of speech event, lectures and seminars. Most of the recordings were made on digital video instead of audiotapes. At the moment, the majority of these recordings have been transcribed (157 lectures and 22 seminars) and marked up in TEI-compliant SGML (114 lectures and 3 seminars). The corpus will not only enable research into spoken academic English at the lexical and structural levels, it will also make it possible, when used in combination with MICASE, to compare academic spoken English in British and US university settings. When it is complete, the BASE corpus will be published on CD-ROM, with transcripts linked to edited video/audio files.

8.3. The Reading Academic Text corpus

The Reading Academic Text (RAT) corpus is a collection of academic texts written by academic staff and research students at the University of Reading. The initial corpus was composed of twenty research articles written by staff and a small number of PhD theses contributed by successful doctoral candidates in the Faculty of Agriculture, totaling nearly a million words. The theses included in the corpus are all written by native speakers. Since the corpus was created in 1995, the number of theses has increased from 8 to 38. The corpus is still expanding further to represent the discourses of a greater range of disciplines covering both the natural and social sciences as well as a wider range of text types including dissertations, projects, laboratory reports, and samples of textbook readings for Master's courses. In addition to the original files, the texts have been converted to an HTML version which allows the full text to be viewed in a browser and a plain text version used for linguistic analysis and for coding of the corpus. The RAT corpus has been used to study text construction practices in academic settings such as the organization of theses in different disciplines as well as the various uses of citations. At present, the access to the corpus is restricted to the staff and researchers at the School of Linguistics and Applied Language Studies of Reading University, though it is possible for other users to access the corpus on a Research Attachment arrangement. See the corpus site for contact details.

8.4. The Academic Corpus

The Academic Corpus is a written corpus of academic English developed at Victoria University of Wellington. The corpus contains approximately 3.5 million words, covering 28 subject areas from four faculty sections (arts, commerce, law, and science), as shown in Table 25 (cf. Coxhead 2000, 220).

Table 25: Subject areas in the Academic Corpus

| Faculty | Arts | Commerce | Law | Science | Total |
|---|---|---|---|---|---|
| Texts | 122 | 107 | 72 | 113 | 414 |
| Words | 883,214 | 879,547 | 874,723 | 875,846 | 3,513,330 |
| Subject areas | Education; History; Linguistics; Philosophy; Politics; Psychology; Sociology | Accounting; Economics; Finance; Industrial relations; Management; Marketing; Public policy | Constitutional law; Criminal law; Family law and medicolegal; International law; Pure commercial law; Quasi-commercial law; Rights and remedies | Biology; Chemistry; Computer science; Geography; Geology; Mathematics; Physics | |

Each of these faculty sections is divided into seven subject areas of ca. 125,000 words, totaling 875,000 words for each section. The corpus comprises 414 academic texts by more than 400 authors which were sampled from journal articles, book surveys, course workbooks, laboratory manuals, course notes and the Internet. With the exception of the 41 excerpts from the Brown corpus, 31 excerpts from LOB and 42 excerpts from the Wellington Corpus of Written New Zealand English, the texts from other sources were included in full (excluding bibliographies). The majority of the texts were written for an international audience, with 64% sourced in New Zealand, 20% in Britain, 13% in the United States, 2% in Canada and 1% in Australia. The texts were selected according to whether they were of suitable length (over 2,000 running words) and were representative of the academic genre in that they were written for an academic audience. Efforts have also been made to balance the corpus with respect to the number of short (2,000-5,000 words), medium-length (5,000-10,000 words) and long (over 10,000 words) texts in the four faculty sections.

The corpus has been used to develop the Academic Word List (AWL), which contains 570 word families (see Coxhead 2000) and is available at the AWL site.


8.5. The Corpus of Professional Spoken American English

The Corpus of Professional Spoken American English (CPSAE) was constructed from a selection of transcripts of interactions of various types occurring in professional settings, recorded during 1994-1998. The corpus contains two million words of speech involving over 400 speakers. The CPSAE has two main components. The first consists of transcripts (0.9 million words) of White House press conferences, which contain almost exclusively question-and-answer sessions in addition to some policy statements by politicians and White House officials. The second consists of transcripts (1.1 million words) of faculty meetings and committee meetings related to national tests, which involve statements and discussions as well as questions (see Barlow 1998).

The transcripts in the corpus have been marked up in a minimal but consistent way. The markup scheme is limited to two features: speech turns are indicated by giving the last name of the speaker (or VOICE if the name is unknown) in the <SP> element, and non-verbal events such as laughter are placed in brackets. Two versions of the corpus are available: a raw text version and an annotated version tagged with the Lancaster CLAWS tagger. Both versions can be ordered from the corpus website.
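Markup this minimal is easy to process with regular expressions. The Python sketch below is a rough illustration under stated assumptions: the exact line layout (a closed <SP>...</SP> element at the start of each turn), the use of square brackets for non-verbal events, and the sample utterance are all hypothetical, as the source does not show the transcript format in detail.

```python
import re

# Assumed layout: "<SP>NAME</SP> speech [event]" on one line per turn.
TURN_RE = re.compile(r"<SP>(?P<speaker>[^<]+)</SP>\s*(?P<speech>.*)")
# Assumed: non-verbal events appear in square brackets.
EVENT_RE = re.compile(r"\[([^\]]+)\]")


def parse_turn(line):
    """Split one transcript line into speaker, spoken text and events."""
    match = TURN_RE.match(line.strip())
    if match is None:
        return None  # not a speech turn under the assumed layout
    speech = match.group("speech")
    events = EVENT_RE.findall(speech)
    # Remove the bracketed events and normalize whitespace.
    text = " ".join(EVENT_RE.sub(" ", speech).split())
    return {
        "speaker": match.group("speaker").strip(),
        "text": text,
        "events": events,
    }


# Invented example line, not taken from the corpus.
print(parse_turn("<SP>VOICE</SP> Could you repeat that? [laughter]"))
```

Separating the bracketed events from the spoken words in this way keeps them out of word counts and concordance lines.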


8.6. The Corpus of Professional English

A much more ambitious project has been initiated by the Professional English Research Consortium (PERC), which aims to create a 100-million-word Corpus of Professional English (CPE). The corpus is expected to include both spoken and written discourse used by working professionals and professionals-in-training, covering a wide range of domains such as science, engineering, technology, law, medicine, finance and other professions. The CPE corpus is designed as a balanced representation of professional English, drawing on texts published between 1995 and 2001 in over 1,000 major review and research journals, trade magazines and textbooks in American and British English. Sources are selected on criteria such as the impact factors provided by Journal Citation Reports, together with other pertinent criteria.

The Corpus of Professional English is marked up in XML. Contextual information such as the author's name, title, publication year and journal title is stored in the corpus header. Structural information is also encoded to show paragraphs, sections, headings and similar features in written texts. Linguistic annotation such as POS and semantic tagging will be carried out on the corpus using tools developed at Lancaster University.
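To make the division between header and structural markup concrete, the sketch below reads a small XML document with Python's standard xml.etree.ElementTree module. The element names (text, header, author, title, year, journal, body, section, heading, p) and the sample content are invented placeholders; the source does not specify the actual CPE schema, so this should be read as one possible arrangement rather than the corpus format.

```python
import xml.etree.ElementTree as ET

# Invented sample document; element names are assumptions, not the CPE schema.
SAMPLE = """<text>
  <header>
    <author>A. N. Author</author>
    <title>An example article</title>
    <year>1999</year>
    <journal>Example Journal</journal>
  </header>
  <body>
    <section>
      <heading>Introduction</heading>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </section>
  </body>
</text>"""


def read_header(xml_string):
    """Return the contextual information in the header as a dictionary."""
    root = ET.fromstring(xml_string)
    header = root.find("header")
    return {child.tag: (child.text or "").strip() for child in header}


def count_paragraphs(xml_string):
    """Count paragraph elements anywhere in the document."""
    return len(ET.fromstring(xml_string).findall(".//p"))


print(read_header(SAMPLE))
print(count_paragraphs(SAMPLE))
```

Keeping contextual information in a header means that texts can be filtered by, say, publication year or journal without touching the body of each text.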

The CPE corpus can be used for linguistic research as well as for the development of educational resources, such as specialized dictionaries, handbooks, language tests, and other materials that will be useful to working professionals and professionals-in-training. The corpus, when completed, will be made available to consortium members for online access at the PERC website.


Copyright © 2006 Taylor & Francis Group plc