Anthony McEnery, Richard Xiao, Yukio Tono

Corpora Survey

Note: This survey is based on my (forthcoming) chapter "Well-known and influential corpora", written for A. Lüdeling, M. Kyto & A. McEnery (eds) Handbooks of Linguistics and Communication Science Volume Corpus Linguistics. Berlin: Mouton de Gruyter. If you know some corpus that should be included here, I would be obliged if you could send me an introduction – Richard Xiao

Because of the size of this survey, we've split it up into the following 3 pages:

Sections 1-3

Sections 4-8

Sections 9-14

1. Introduction

2. National corpora

2.1. The British National Corpus

2.2. The American National Corpus

2.3. The Polish National Corpus

2.4. The Czech National Corpus

2.5. The Hungarian National Corpus

2.6. The Russian Reference Corpus

2.7. The CORIS corpus

2.8. The Hellenic National Corpus

2.9. The German National Corpus

2.10. The Slovak National Corpus

2.11. The Modern Chinese Language Corpus

2.12. The Sejong Balanced Corpus

2.13. Other National corpora

3. Monitor corpora

3.1. The Bank of English

3.2. The Global English Monitor Corpus

1. Introduction

As corpus building is an activity that takes time and costs money, readers may wish to use ready-made corpora to carry out their work. However, as a corpus is always designed for a particular purpose, the usefulness of a ready-made corpus must be judged with regard to the purpose to which a user intends to put it. There are thousands of corpora in the world, but most of them were created for specific research projects and are thus not publicly available. While abundant corpus resources for languages other than English are also available now, this survey focuses upon major English corpora, which are grouped in terms of their primary uses so that readers will find it easier to choose corpus resources suitable for their particular research questions. Note, however, that overlaps are inevitable in our classification. It is used in this survey simply to give a better account of the primary uses of the relevant corpora.

2. National corpora

National corpora are normally general reference corpora which are supposed to represent the national language of a country. They are balanced with regard to genres and domains that typically represent the language under consideration. While an ideal national corpus should cover proportionally both written and spoken language, most existing national corpora and those under construction consist only of written data, as spoken data is much more difficult and expensive to capture than written data. This section introduces a number of major national corpora.

2.1. The British National Corpus

The first and best-known national corpus is perhaps the British National Corpus (BNC), which is designed to represent as wide a range of modern British English as possible so as to make it possible to say something about language in general (Burnard 2002, 56). The BNC comprises approximately 100 million words of written texts (90%) and transcripts of speech (10%) in modern British English. Written texts were selected using three criteria: domain, time and medium. Domain refers to the content type (i.e. subject field) of the text; time refers to the period of text production, while medium refers to the type of text publication such as books, periodicals or unpublished manuscripts. Table 1 summarizes the distribution of these criteria (see Aston/Burnard 1998, 29-30).

Table 1: Composition of the written BNC

| Domain | % | Date | % | Medium | % |
|---|---|---|---|---|---|
| Imaginative | 21.91 | 1960-74 | 2.26 | Book | 58.58 |
| Arts | 8.08 | 1975-93 | 89.23 | Periodical | 31.08 |
| Belief and thought | 3.40 | Unclassified | 8.49 | Misc. published | 4.38 |
| Commerce/Finance | 7.93 | | | Misc. unpublished | 4.00 |
| Leisure | 11.13 | | | To-be-spoken | 1.52 |
| Natural/pure science | 4.18 | | | Unclassified | 0.40 |
| Applied science | 8.21 | | | | |
| Social science | 14.80 | | | | |
| World affairs | 18.39 | | | | |
| Unclassified | 1.93 | | | | |

The spoken data in the BNC was collected on the basis of two criteria: demographic and context-governed. The demographic component is composed of informal encounters recorded by 124 volunteer respondents selected by age group, sex, social class and geographical region, while the context-governed component consists of more formal encounters such as meetings, lectures and radio broadcasts recorded in four broad context categories. The two components of spoken data complement each other, as many types of spoken text would not have been covered if demographic sampling techniques alone were used in data collection. Table 2 summarizes the composition of the spoken BNC. Note that in the table, the first two columns apply to both demographic and context-governed components while the third column refers to the latter component alone.

Table 2: Composition of the spoken BNC

| Region | % | Interaction type | % | Context-governed | % |
|---|---|---|---|---|---|
| South | 45.61 | Monologue | 18.64 | Educational/informative | 20.56 |
| Midlands | 23.33 | Dialogue | 74.87 | Business | 21.47 |
| North | 25.43 | Unclassified | 6.48 | Institutional | 21.86 |
| Unclassified | 5.61 | | | Leisure | 23.71 |
| | | | | Unclassified | 12.38 |

In addition to part-of-speech (POS) information, the BNC is annotated with rich metadata (i.e. contextual information) encoded according to the TEI guidelines, using ISO standard 8879. Because of its generality, as well as the use of internationally agreed standards for its encoding, the BNC corpus is a useful resource for a very wide variety of research purposes, in fields as distinct as lexicography, artificial intelligence, speech recognition and synthesis, literary studies and, of course, linguistics. There are a number of ways one can access the BNC corpus. It can be accessed online remotely using the BNC Online service or the BNCWeb interface. Alternatively, if a local copy of the corpus is available, the BNC can be explored using corpus exploration tools such as WordSmith (Scott 1999).

The current version of the full release of the BNC is BNC-2, the World Edition. This version has removed a small number of texts (fewer than 50) which restricted the worldwide distribution of the corpus. The BNC World has also corrected errors relating to mislabeled texts and indeterminate part-of-speech codes in the first version, and has included a classification system of genre labels developed by Lee (2001) at Lancaster. The World Edition is still marked up in TEI-compliant SGML, but an XML version of the corpus will be released shortly. As a prelude to the full release of the XML version, BNC Baby, a four-million-word subset of the BNC, was released in October 2004 together with the XML-aware corpus tool Xaira. BNC Baby was originally developed as a manageable subcorpus of the BNC for use in the language classroom, consisting of comparable samples of four kinds of English: unscripted conversation, newspapers, academic prose and written fiction (Berglund/Burnard/Wynne 2004).

The BNC model for achieving corpus balance and representativeness has been followed by a number of national corpus projects including, for example, the American National Corpus, the Polish National Corpus and the Russian Reference Corpus.

2.2. The American National Corpus

The American National Corpus (ANC) project was initiated in 1998 with the aim of building a corpus comparable to the BNC. While the ANC follows the general design of the BNC, there are differences with regard to its sampling period and text categories. The ANC only samples language data produced from 1990 onwards, whereas the sampling period for the BNC is 1960-1993. This time frame has enabled the ANC to cover text categories which have developed recently and thus were not included in the BNC, e.g. emails, web pages and chat room talks, as shown in Table 3. In addition to the BNC-like core, the ANC will also include specialized satellite corpora (cf. Reppen/Ide 2004, 106-107).

Table 3: Text categories in the ANC

| Channel | Text category | % |
|---|---|---|
| Written | Books (41% informative texts for various domains and 14% imaginative texts of various types) | 55 |
| | Newspapers, magazines and journals | 20 |
| | Electronic (emails, web pages etc) | 10 |
| | Miscellaneous (published and unpublished) | 5 |
| Spoken | Face-to-face/phone conversations, speech, meetings | 10 |

The ANC is encoded in XML, following the guidelines of the XML version of the Corpus Encoding Standard. The standalone annotation, in which the primary data and the annotations are kept in separate documents but linked with pointers, has enabled the corpus to be POS tagged using different tagsets (e.g. Biber's (1988) tags, the CLAWS C5/C7 tagsets (Garside/Leech/Sampson 1987) and the Penn tags (Marcus/Santorini/Marcinkiewicz 1993)) to suit the needs of different users.
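The idea of keeping primary data and annotations apart can be illustrated with a minimal Python sketch. It is not the ANC's actual XML format: the `Annotation` class, the character-offset pointers, the sample sentence and the tag assignments are all assumptions made for illustration. It only shows how one unmodified text can carry several independent tag layers.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Primary data: stored once and never modified by any annotation layer.
PRIMARY_TEXT = "The corpus is balanced."


@dataclass(frozen=True)
class Annotation:
    """A pointer into the primary text plus a label (hypothetical structure)."""
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    tag: str


# Two independent layers over the same text. The tag values are illustrative
# examples in the style of the Penn and CLAWS C5 tagsets, not corpus output.
PENN_LAYER = [
    Annotation(0, 3, "DT"),
    Annotation(4, 10, "NN"),
    Annotation(11, 13, "VBZ"),
    Annotation(14, 22, "JJ"),
    Annotation(22, 23, "."),
]
C5_LAYER = [
    Annotation(0, 3, "AT0"),
    Annotation(4, 10, "NN1"),
    Annotation(11, 13, "VBZ"),
    Annotation(14, 22, "AJ0"),
    Annotation(22, 23, "PUN"),
]


def resolve(text: str, layer: List[Annotation]) -> List[Tuple[str, str]]:
    """Follow each pointer back into the primary text and pair it with its tag."""
    return [(text[a.start:a.end], a.tag) for a in layer]


if __name__ == "__main__":
    print(resolve(PRIMARY_TEXT, PENN_LAYER))
    print(resolve(PRIMARY_TEXT, C5_LAYER))
```

Because each layer only points into the text, a user can load whichever tagset suits the task without touching the primary data or the other layers.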

The full release of the ANC is expected to be available in late 2005. The first release of the corpus, which contains 11.5 million words of written and spoken data (8.3 million words for writing and 3.2 million words for speech, but not balanced for genre), is now available from the Linguistic Data Consortium (LDC).

2.3. The Polish National Corpus

The Polish National Corpus (PNC) is under construction on the PELCRA (Polish and English Language Corpora for Research and Application) project, which is undertaken jointly by the Universities of Lodz and Lancaster. The project aims to develop a large, fully annotated reference corpus of native Polish, mirroring the BNC in terms of genres and its coverage of written and spoken language (Lewandowska-Tomaszczyk 2003, 106). A total of 130 million words of running text have been collected, and part of the data (30 million words) has been compiled into a balanced corpus, which covers genres and styles comparable in proportion to those included in the BNC. The PNC is TEI-compliant and is annotated for part-of-speech. Presently, a balanced PNC sampler, which contains 10 million words of both written and spoken data reflecting proportionally the text categories in the BNC, can be ordered from the PELCRA project site.

2.4. The Czech National Corpus

The Czech National Corpus (CNC) consists of two sections: synchronous and diachronic. Each section is designed to include written, spoken and dialectal components. As some of the components are currently hardly more than blueprints for future work (see Kučera 2002, 254), we will only introduce the written and spoken components in the synchronous section.

The written component of the synchronous section, which contains 100 million words, was completed in 2000 and thus named SYN2000. SYN2000 includes both imaginative (15%) and informative (85%) texts, each being divided into a number of text categories, as shown in Table 4 (see Kučera 2002, 247-248). The technical and specialized texts in the corpus proportionally cover nine domains: lifestyle (5.55%), technology (4.61%), social sciences (3.67%), arts (3.48%), natural sciences (3.37%), economics/management (2.27%), law/security (0.82%), belief/religion (0.74%) and administrative texts (0.49%).

Table 4: Design of SYN2000

| Major category | Genre | % |
|---|---|---|
| Imaginative (15%) | Fiction | 11.02 |
| | Poetry | 0.81 |
| | Drama | 0.21 |
| | Other literary texts | 0.36 |
| | Transitional text types | 2.6 |
| Informative (85%) | Journal | 60 |
| | Technical/specialized texts | 25 |

Table 5: Sampling frame of the Prague Spoken Corpus

| Criteria | Type | Proportion |
|---|---|---|
| Speaker sex | Male | 50% |
| | Female | 50% |
| Speaker age | 21-35 | 50% |
| | 35+ | 50% |
| Education level | Secondary school | 50% |
| | University | 50% |
| Discourse type | Formal | 50% |
| | Informal | 50% |

The spoken component of the synchronous section, the so-called Prague Spoken Corpus (PMK), contains 800,000 words of transcription of authentic spoken language sampled in a balanced way according to four sociolinguistic criteria: speaker sex, age, educational level and discourse type, as shown in Table 5. The data contained in the Prague Spoken Corpus consists exclusively of impromptu spoken language (roughly equivalent to the demographically sampled component in the BNC). Texts representing various blends of written and spoken language such as lectures, political speeches and play scripts are included in a special section in the written corpus (cf. Kučera 2002, 248, 253).

Both SYN2000 and the Prague Spoken Corpus are marked up in TEI-compliant SGML and tagged to show part-of-speech categories. SYN2000 is licensed free of charge for non-commercial use. A scaled-down version of SYN2000, PUBLIC, which contains 20 million words with the same genre distribution, is accessible online at the corpus website. The tagged version of the Prague Spoken Corpus will also be made publicly available in the near future.

2.5. The Hungarian National Corpus

The Hungarian National Corpus (HNC) is a balanced reference corpus of present-day Hungarian. The corpus contains 153.7 million words of texts produced from the mid-1990s onwards, which are divided into five subcorpora, each representing a written text type: media (52.7%), literature (9.43%), scientific texts (13.34%), official documents (12.95%) and informal texts (e.g. electronic forum discussion, 11.58%). The size of the literary subcorpus is expected to increase from the current 14.5 million words to approximately 40 million words (see Váradi 2002, 386). The HNC is encoded in SGML in compliance with Corpus Encoding Standard (CES) and annotated for part-of-speech. The corpus can be accessed free of charge after registration via the online query system at the corpus site.

2.6. The Russian Reference Corpus

The Russian Reference Corpus (BOKR) is designed as a Russian match for the BNC. The corpus contains 100 million words of modern Russian, following the general sampling frame of the BNC, as shown in Table 6 (see Sharoff 2004).

Table 6: Sampling frame of the Russian Reference Corpus

| Text category | Proportion |
|---|---|
| Spoken | 5% |
| Life (Imaginative texts in the BNC) | 30% |
| Natural sciences | 5% |
| Applied sciences | 10% |
| Social sciences | 12% |
| Politics (World affairs in the BNC) | 15% |
| Commerce | 5% |
| Arts | 5% |
| Religion and philosophy (Belief and thought in the BNC) | 3% |
| Leisure | 10% |

The BOKR corpus is encoded in TEI-compliant SGML and annotated for part-of-speech. As Russian is a highly inflected language, the technique used in annotating English corpora with complex POS tags is impractical for Russian, because it would entail thousands of tags, which would make corpus exploration ineffective, if not impossible. Hence in the Russian Reference Corpus, each word is annotated with a bundle of lexical and syntactic features such as part-of-speech, aspect, transitivity, voice, gender, number and tense. Separate features from the feature bundle associated with each word can be selected in a window in the query interface. The corpus is under construction and its final release is expected by the end of 2004 (cf. Sharoff 2004).
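A feature bundle of this kind can be pictured as a small key-value record per word, with a query selecting on any subset of the keys. The following Python sketch is only an illustration under that assumption: the feature names, their values and the `select` helper are hypothetical and do not reproduce the corpus's actual annotation scheme or query interface.

```python
from typing import Dict, List

# Each token carries a bundle of independent features instead of one
# monolithic tag. Feature names and values here are hypothetical.
Token = Dict[str, str]

TOKENS: List[Token] = [
    {"form": "книга", "pos": "noun", "gender": "fem", "number": "sg"},
    {"form": "читала", "pos": "verb", "aspect": "imperfective",
     "tense": "past", "gender": "fem", "number": "sg"},
    {"form": "прочитал", "pos": "verb", "aspect": "perfective",
     "tense": "past", "gender": "masc", "number": "sg"},
]


def select(tokens: List[Token], **features: str) -> List[Token]:
    """Return the tokens whose bundle matches every requested feature.

    Features that are not mentioned in the query are simply ignored,
    so a user can combine as few or as many as needed.
    """
    return [
        tok for tok in tokens
        if all(tok.get(name) == value for name, value in features.items())
    ]


if __name__ == "__main__":
    # Query on two features only; gender, number and tense are left open.
    for tok in select(TOKENS, pos="verb", aspect="perfective"):
        print(tok["form"])
```

Querying on separate features avoids enumerating every combination as its own tag, which is the practical problem the paragraph above describes.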

2.7. The CORIS corpus

The CORIS (Corpus di Italiano Scritto) corpus is a general reference corpus of present-day Italian. It contains 100 million words of written Italian sampled from six text categories, which constitute six subcorpora, as shown in Table 7.

Table 7: Components of the CORIS corpus

| Category | Subcategory | Proportion |
|---|---|---|
| Press | Newspapers, periodicals, supplement | 38% |
| Fiction | Novel, short stories | 25% |
| Academic prose | Human sciences, natural sciences, physics, experimental sciences | 12% |
| Legal and administrative prose | Legal, bureaucratic, administrative documents | 10% |
| Miscellaneous | Books on religion, travel, cookery, hobbies, etc. | 10% |
| Ephemeral | Letters, leaflets, instructions | 5% |

Unlike most national corpora, which are static sample corpora, the CORIS corpus follows a dynamic corpus model: it will be updated every two years by means of a built-in monitor corpus (Rossini Favretti/Tamburini/de Santis 2004). The current version of the corpus can be accessed online free of charge via a web-based query system at the corpus website.

2.8. The Hellenic National Corpus

The Hellenic National Corpus is a 32-million-word corpus of written Modern Greek sampled from several publication media covering various genres (articles, essays, literary works, reports, biographies etc.) and domains (economy, medicine, leisure, art, human sciences etc.) published from 1976 onwards. Of the four types of medium, books account for 15.75% of the total texts, newspapers 69.01%, periodicals 6.97% and miscellaneous (correspondence, electronic texts, ephemera, and hand-written/typed material) 8.27%. The text classification with regard to medium, genre and domain follows the PAROLE standards. This taxonomy information, together with the bibliographic information, is encoded in TEI-compliant SGML (cf. Hatzigeorgiu/Gavrilidou/Piperidis et al 2000, 1737). The corpus can be accessed online at the corpus site, where users can make queries concerning the lexicon, morphology, syntax and usage of Modern Greek (e.g. words, lemmas, part-of-speech categories or combinations of the three).

2.9. The German National Corpus

The German National Corpus is a product of the DWDS (Digital Dictionary of the 20th Century German Language) project. The corpus is divided into two parts, a 100-million-word balanced core and a much larger opportunistic subcorpus. This section introduces the core corpus, which is roughly comparable to the British National Corpus, covering the whole 20th century (1900-2000). Table 8 shows the text categories covered in the corpus.

Table 8: Design criteria of the German National Corpus

| Text category | Proportion |
|---|---|
| Literature | 25% |
| Journalistic prose | 25% |
| Scientific texts | 20% |
| Specialized texts (advert, manuals, etc) | 20% |
| Spoken (everyday language, televised debates, dialect, etc) | 10% |

The metadata such as genre information is encoded in XML. Linguistic annotation consists basically of lemmatization, part-of-speech and semantic annotation on the word level, as well as prepositional phrase and noun phrase recognition on the phrase level (Cavar/Geyken/Neumann 2000). The core corpus is available for online search at the corpus site after free-of-charge registration.

2.10. The Slovak National Corpus

The Slovak National Corpus is presently under construction. The project aims to create a 200-million-word corpus of the Slovak language. The first phase of the project has produced a corpus containing 30 million words of written texts published between 1990 and 2003, which will be expanded to other periods of the contemporary language (1955-2005) and to the target size in the second phase of the project (2003-2006). The final corpus will also include diachronic and dialectological texts.

At present the 30-million-word part of the corpus has been annotated with lemmatization, morphological and source (bibliographical and style-genre) information. Users can access the corpus using a simple online query system at the corpus website. More complex searches require the corpus manager, which supports regular expressions and can be downloaded at the same site.
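To give a sense of what a regular-expression search adds over a simple word lookup, here is a minimal Python sketch of a keyword-in-context (KWIC) search. It uses Python's standard `re` module on an invented English sample; the `kwic` function, the sample text and the pattern are assumptions for illustration and say nothing about the query syntax of the Slovak corpus manager itself.

```python
import re
from typing import List, Tuple


def kwic(text: str, pattern: str, width: int = 20) -> List[Tuple[str, str, str]]:
    """Return (left context, match, right context) for every regex match."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines


# Invented sample text standing in for corpus data.
SAMPLE = "The corpus was tagged. Two corpora were compared."

# One pattern covers both the singular and the plural form,
# which a plain word query would treat as two separate searches.
hits = kwic(SAMPLE, r"\bcorp(?:us|ora)\b", width=10)

if __name__ == "__main__":
    for left, node, right in hits:
        print(f"{left:>10} [{node}] {right}")
```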

We have so far introduced national corpora for European languages. The next two sections will introduce two national corpora of Asian languages, namely Chinese and Korean.

2.11. The Modern Chinese Language Corpus

The Modern Chinese Language Corpus (MCLC) is China's national corpus, built under the auspices of the National Language Committee of China. The corpus initially contained 70 million Chinese characters sampled systematically from texts of 1.4 billion characters produced during 1919-1992 (divided into five sampling periods), with the majority of texts produced after 1977; 1919 is generally considered the beginning of modern Chinese. Fresh data has been filtered in proportionally at the rate of 3.5 million characters annually since completion, so that the corpus currently contains 85 million characters. The corpus covers four text categories, which include more than 40 subcategories, as shown in Table 9. Most samples are approximately 2,000 characters in length, with the exception of samples taken from books, which may contain up to 10,000 characters. The digitized texts were proofread three times so that errors are less than 0.02% (see Wang 2001, 283).

Table 9: Components of the MCLC corpus

| Category | Subcategory | Proportion |
|---|---|---|
| Humanities and social sciences (8 subcategories) | Politics and laws, history, society, economics, arts, literature, military and physical education, life | 59.6% |
| Natural sciences (6 subcategories) | Mathematics and physics, biology and chemistry, astronomy and geography, oceanology and meteorology, agriculture and forestry, medical and health | 17.24% |
| Miscellaneous (6 subcategories) | Official documents, regulations, judicial documents, business documents, ceremonial speech, ephemera | 9.36% |
| Newspapers | | 13.79% |

A scaled down version of the corpus, the core, which contains 20 million characters proportionally sampled from the larger corpus, is tokenized and tagged with part-of-speech categories. The MCLC license can be purchased from the National Language Committee of China.

2.12. The Sejong Balanced Corpus

The 21st Century Sejong project was launched in 1998 as a ten-year development project to build various kinds of language resources including Korean corpora and Korean electronic dictionaries. One of the goals of the project is to construct a balanced national corpus (300 million words and phrases from modern Korean, spoken materials, North Korean language, words of foreign origin, etc.), comparable to the BNC. By 2003 a raw corpus of modern Korean had been compiled, containing 57 million words, with a further 75 million words of already existing electronic texts being processed and standardized. The corpus also includes around 3 million words of spoken data.

The markup scheme used in the Sejong Corpus is TEI-compliant. As of 2003, 10 million words have been morphologically annotated, 5.5 million words sense tagged, and 150,000 words treebanked (see Kang/Kim 2004, 1747). The corpus is accessible over the Internet after registration at the corpus site.

2.13. Other National corpora

In addition to those introduced above, there are a number of nation-level corpora which are either already available or under construction. They include, for example, the FRANTEXT Database, the Croatian National Corpus (30 million words), Korpus 2000 for Danish (28 million words) and the National Corpus of Irish (15 million words). A number of corpora representing other national languages are also under construction, including, for example, Norwegian (Choukri 2003), Dutch (Wittenburg/Brugman/Broeder 2000), Maltese (Dalli 2001), Basque (Aduriz/Aldezabal/Alegria et al 2003), Kurdish (Gautier 1998), Nepali (Glover 1998), Tamil (Malten 1998) and Indonesian (Riza 1999).

3. Monitor corpora

While most of the national corpora introduced in section 2 follow a static sample corpus model, there are also corpora which are constantly updated to track rapid language change, such as the development and the life cycle of neologisms. Corpora of this type are referred to as monitor corpora.

3.1. The Bank of English

The best-known monitor corpus is the Bank of English (BoE), which was initiated in 1991 on the COBUILD (Collins Birmingham University International Language Database) project. The corpus was designed to represent standard English as it was relevant to the needs of learners, teachers and other users, while also being of use to researchers in present-day English language. Written texts (75%) come from newspapers, magazines, fiction and non-fiction books, brochures, reports, and websites while spoken data (25%) consists of transcripts of television and radio broadcasts, meetings, interviews, discussions, and conversations. The majority of the material in the corpus represents British English (70%) while American English and other varieties account for 20% and 10% respectively. Presently the BoE contains 524 million words of written and spoken English. The corpus keeps growing with the constant addition of new material.

The BoE corpus is particularly useful for lexical and lexicographic studies, for example, tracking new words, new uses or meanings of old words, and words falling out of use. A 56 million word sampler of the corpus can be accessed online free of charge at the corpus website. Access to larger corpora is granted by special arrangement.
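The kind of tracking a monitor corpus supports can be sketched by comparing relative word frequencies in two time slices. The Python sketch below is a toy illustration only: the two slices are invented sentences, and the `per_million` and `emerging` helpers and the `min_ratio` threshold are assumptions, not part of any COBUILD tool.

```python
from collections import Counter
from typing import Dict, List


def per_million(tokens: List[str]) -> Dict[str, float]:
    """Relative frequency of each word, normalized per million tokens."""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens)
    return {word: n * 1_000_000 / total for word, n in counts.items()}


def emerging(earlier: List[str], later: List[str], min_ratio: float = 2.0) -> List[str]:
    """Words that are new in the later slice or whose relative frequency
    grew by at least min_ratio (a hypothetical threshold)."""
    old = per_million(earlier)
    new = per_million(later)
    return sorted(
        word for word, freq in new.items()
        if word not in old or freq / old[word] >= min_ratio
    )


# Invented toy data standing in for an earlier and a later corpus slice.
SLICE_EARLY = "the report was sent by post and the reply came by post".split()
SLICE_LATE = "the report was sent by email and the blog reply came by email".split()

if __name__ == "__main__":
    print(emerging(SLICE_EARLY, SLICE_LATE))
```

Normalizing per million tokens matters because a monitor corpus keeps growing, so raw counts from slices of different sizes are not directly comparable. Swapping the two arguments gives a rough list of words falling out of use.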

3.2. The Global English Monitor Corpus

Another corpus of the monitor type is the Global English Monitor Corpus, which was started in late 2001 as an electronic archive of the world's leading newspapers in English. The corpus aims at monitoring language use and semantic change in English as reflected in newspapers so as to allow for research into whether the English-language discourses in Britain, the United States, Australia, Pakistan and South Africa have changed in the same way or differently. As the Global English Monitor Corpus will monitor as accurately as conceivable all relevant changes of attitudes and beliefs, it will prove a useful tool not only for lexicographers, historical linguists and semanticists, but also for those interested in social and political studies all over the world. With its first results being available at the end of 2003, the corpus is expected to reach billions of words within a few years.

Copyright © 2006 Taylor & Francis Group plc