Anthony McEnery, Richard Xiao, Yukio Tono

Corpora Survey

Note: This survey is based on my (forthcoming) chapter "Well-known and influential corpora", written for A. Lüdeling, M. Kytö & A. McEnery (eds) Handbooks of Linguistics and Communication Science Volume Corpus Linguistics. Berlin: Mouton de Gruyter. If you know of a corpus that should be included here, I would be obliged if you could send me an introduction – Richard Xiao

Because of the size of this survey, we've split it up into the following 3 pages:

Sections 1-3

Sections 4-8

Sections 9-14

9. Parsed corpora

9.1. The Lancaster-Leeds Treebank

9.2. The Lancaster Parsed Corpus

9.3. The SUSANNE corpus

9.4. The CHRISTINE corpus

9.5. The LUCY corpus

9.6. ICE-GB

9.7. The Penn Treebank

9.8. Parsed historical corpora

10. Developmental and learner corpora

10.1. The Child Language Data Exchange System

10.2. The Louvain Corpus of Native English Essays

10.3. The Polytechnic of Wales corpus

10.4. The International Corpus of Learner English

10.5. The LINDSEI corpus

10.6. The Longman Learners' Corpus

10.7. The Cambridge Learner Corpus

10.8. Other learner corpora

11. Multilingual corpora

11.1. The Canadian Hansard Corpus

11.2. The English-Norwegian Parallel Corpus

11.3. The English-Swedish Parallel Corpus

11.4. The Oslo Multilingual Corpus

11.5. The ET10/63 and ITU/CRATER parallel corpora

11.6. The IJS-ELAN Slovene-English Parallel Corpus

11.7. The CLUVI parallel corpus

11.8. European Corpus Initiative Multilingual Corpus I

11.9. The MULTEXT corpora

11.10. The PAROLE corpora

11.11. Multilingual Corpora for Cooperation

11.12. The EMILLE Corpus

11.13. The BFSU Chinese-English Parallel Corpus

11.14. The Babel Chinese-English Parallel Corpus

11.15. Hong Kong Parallel Text

12. Non-English monolingual corpora

12.1. The COSMAS corpora

12.2. The CETEMPúblico Corpus

12.3. The INL corpora

12.4. The CEG corpus

12.5. The Scottish Corpus of Texts and Speech

12.6. The Prague Dependency Treebank

12.7. Academia Sinica Balanced Corpus

12.8. Sinica Treebank

12.9. Penn Chinese Treebank

12.10. Spoken Chinese Corpus of Situated Discourse

13. Well-known distributors of corpus resources

CSLU

ELRA

ELSNET

ICAME

OTA

TRACTOR

14. Conclusion

References

9. Parsed corpora

Parsing, also called treebanking, is a form of corpus annotation. It is independent of corpus design criteria. Hence, a corpus, whether balanced or specialized, whether written or spoken, can be syntactically parsed. However, as parsing is a demanding task that often requires human correction, parsed corpora are typically very small in size. Of the corpora we have introduced so far, only ICE-GB is parsed. This section introduces a number of well-known parsed corpora.

9.1. The Lancaster-Leeds Treebank

The Lancaster-Leeds Treebank is perhaps the first syntactically parsed corpus. The corpus is a subset of 45,000 words taken from all text categories in the LOB corpus which was parsed manually by Geoffrey Sampson using a specially devised surface-level phrase structure grammar compatible with the CLAWS word-tagging scheme (cf. Sampson 1987). The annotation scheme used in the Lancaster-Leeds Treebank, which consisted of 47 labels for daughter nodes (14 phrase and clause classes, 28 word classes and five classes of punctuation mark), represented surface grammar only, without indications of logical form. This hand-crafted treebank provided training data for the automatic probabilistic parser which was used to analyze the Lancaster Parsed Corpus. The corpus was not published but is available from UCREL at Lancaster University.

9.2. The Lancaster Parsed Corpus

The Lancaster Parsed Corpus (LPC) is a much larger sample of approximately 144,000 words taken from the LOB corpus that has been parsed. Except for categories M (science fiction, six samples) and R (humor, nine samples), which are all included, LPC takes the first 10 samples from each of the other 13 text categories in LOB, totaling 145 files which account for 13.29% of the full LOB corpus. Even in these 145 samples, longer sentences have been excluded from the parsed corpus because the parser was unable to process sentences over 20-25 words in length, with the result that the parsed corpus no longer contains LOB text extracts in their entirety. The errors resulting from automatic parsing were corrected by hand to ensure the corpus is reasonably error free (cf. Garside/Leech/Váradi 1992).

The Lancaster Parsed Corpus can be regarded as a treebank broadly representative of the syntax of written English across a great variety of styles and text types. It provides a testbed for wide-coverage general-purpose grammars and parsers of English and a valuable resource for quantitative linguistic studies of English syntax. The corpus is available through ICAME.

9.3. The SUSANNE corpus

The SUSANNE corpus (an acronym for surface and underlying structural analysis of natural English) is a 130,000-word sub-sample taken from the Brown corpus of American English that has been parsed. The parsed corpus comprises 64 text samples, with 16 taken from each of four text categories: A (press reportage), G (belles lettres, biography and memoir), J (learned writing) and N (adventure and Western fiction).

The parsing was largely undertaken manually in accordance with the SUSANNE analytic scheme developed by Geoffrey Sampson in collaboration with Geoffrey Leech on the basis of samples from written British and American English. In SUSANNE, a parse tree is represented as a bracketed string, with the labels of non-terminal nodes inserted between opening and closing brackets. There are three types of information in the parsing scheme: a form tag, a function tag and an index. The hierarchy of form tag ranks (word, phrase, clause and root) defines the shape of a parse tree. The function tags identify surface roles such as surface and logical subject, agent of passive, and time and place adjuncts. An index shows referential identity between nodes (cf. Sampson 1995).
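The idea of a parse tree stored as a labelled bracketed string can be illustrated with a minimal sketch. The notation below is a simplified assumption made for illustration (the label directly follows the opening bracket, and a colon separates a form tag from a function tag); it is not the actual SUSANNE file format, and the names Node, parse_bracketed, split_label and leaves are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One non-terminal node: a label plus child nodes or word strings."""
    label: str
    children: list = field(default_factory=list)


def parse_bracketed(text):
    """Parse a string such as '[S [Ns:s the cat] [V sat]]' into a Node tree."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "[":
            raise ValueError("expected an opening bracket")
        pos += 1
        node = Node(label=tokens[pos])  # assumed: label follows the bracket
        pos += 1
        while pos < len(tokens) and tokens[pos] != "]":
            if tokens[pos] == "[":
                node.children.append(read_node())
            else:
                node.children.append(tokens[pos])  # a word (terminal)
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing closing bracket")
        pos += 1  # consume the closing bracket
        return node

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


def split_label(label):
    """Split an assumed 'FORM:FUNCTION' label; the function tag may be empty."""
    form, _, function = label.partition(":")
    return form, function


def leaves(node):
    """Return the words of the tree in their original order."""
    words = []
    for child in node.children:
        words.extend(leaves(child) if isinstance(child, Node) else [child])
    return words


tree = parse_bracketed("[S [Ns:s the cat] [V sat]]")
print(tree.label, leaves(tree), split_label(tree.children[0].label))
```

A real reader for the corpus would need to follow the field layout and tag inventory described in the SUSANNE documentation, including the referential indices that this sketch ignores.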

The SUSANNE corpus was first released in 1992 and its latest version, Release 5, was published in 2000. Each successive release has corrected errors found in earlier releases. The latest release, together with the documentation accompanying the corpus, is distributed free of charge at Sampson's website.

9.4. The CHRISTINE corpus

The CHRISTINE corpus is a spoken counterpart to SUSANNE, developed by Geoffrey Sampson and his team. It is one of the first treebanks of spontaneous speech. The CHRISTINE analytic scheme includes explicit extensions to the SUSANNE annotation which are designed to handle speech phenomena such as pauses, discourse items and speech repairs. The first stage of CHRISTINE (CHRISTINE/I), which was released in 1999, is based on 40 extracts chosen at random from the demographically sampled component in the spoken BNC and other sources, totaling approximately 80,500 words of spoken data representing 147 identified speakers in addition to a great number of unidentifiable speakers. The information about speakers and the metadata originally contained in the BNC corpus header were converted into database files accompanying the corpus (cf. Sampson 2000).

The full version of the CHRISTINE corpus includes 66 further texts drawn from the spoken BNC and other sources. The overall proportion of the BNC data accounts for 50% of the full CHRISTINE corpus, with 40% from the London-Lund corpus and 10% from the Reading Emotional Speech Corpus (see Stibbard 2001 for a description). The full release also incorporates a minor change in the distribution of analytic information between the fields to make it more compatible with SUSANNE and easier to read. This version became available in 2000. At present only CHRISTINE/I can be downloaded at Sampson's website.

9.5. The LUCY corpus

The LUCY corpus is the third in Sampson's series of treebanks. This corpus represents written English in modern Britain, ranging from published prose to the less skilled writing of young adults, and spontaneous writing by children aged 9-12. To deal with writing of this latter type, the LUCY parsing scheme contains some further extensions to the SUSANNE scheme which can identify cases where an unskilled writer fails to put words together in a meaningful way (cf. Sampson 2003).

There are 239 text files in LUCY, amounting to 165,000 words. The corpus consists of three sections: polished writing (41 text files, 102,000 words), young adult writing (48 text files, 33,000 words), and child writing (150 files, 30,000 words). The polished texts are taken from both informative and imaginative categories in the written section of the British National Corpus. The young adult writing comprises three groups, namely, A-level general study scripts, access-course coursework, and first-year undergraduate essays. The child writing section is composed of material from the Nuffield corpus, a collection of writing by children aged between 9 and 12 years in 1965.

In addition to providing a valuable source of information on the realities of skilled written usage in modern Britain, LUCY holds the promise of supporting the study of the process through which English-speaking children acquire writing skills. The corpus was released in 2003. As the data from the BNC (about half of the corpus) is copyright protected, both a copyright-free (reduced) edition and an unreduced edition were prepared. The only difference is that in the copyright-free edition, for those files where copyright is an issue, the words of the original texts are replaced by abbreviations. While these abbreviations may be recoverable to human eyes, they are by no means recoverable computationally. The reduced edition is available from Sampson's website. The unreduced edition is only available to those who have purchased a copy of the BNC corpus.

9.6. ICE-GB

The British component of the International Corpus of English (ICE-GB) is the first corpus that has been completed in the ICE series. Like all of the ICE components, ICE-GB comprises 300 spoken and 200 written texts from 32 categories, amounting to one million words. As noted in section 5.1, this corpus is not only POS tagged but also fully parsed and hand checked. The corpus contains 83,394 parse trees, including 59,640 in the spoken part of the corpus. Each node in the tree is labelled with up to three types of information: word class/syntactic category, syntactic function and features (e.g. transitivity), the latter being optional (cf. Nelson/Wallis/Aarts 2002).

Unlike the SUSANNE, CHRISTINE and LUCY corpora, which come without retrieval software, ICE-GB is distributed together with a utility program, ICECUP, which allows very complex queries of various kinds, e.g. markup queries, exact and inexact grammatical node queries, text fragment queries, Fuzzy Tree Fragment (FTF) queries, and sociolinguistic variable queries.

The first full release of the corpus and ICECUP can be ordered on CD-ROM from the ICE-GB website. The ICE-GB sampler, which includes ICECUP and ten ICE-GB texts, is also available free of charge at the site. Release 2 will include the digitized speech recordings of the spoken part of the corpus, aligned with the text. This will allow researchers to hear the original source of what they see on-screen. In addition to the online help included in ICECUP, Nelson/Wallis/Aarts (2002) provides a comprehensive reference guide to both corpus and software.

9.7. The Penn Treebank

The Penn Treebank (PTB) is an example of skeleton parsing. Three releases of the treebank have so far been published by the LDC. The original release (Penn Treebank I, 1992) contains over 4.5 million words of American English data. The whole corpus is POS tagged while roughly three fifths of the data (about 2.9 million words) is parsed. All of this material has been corrected by hand after automatic processing. Table 26 shows the components of Penn Treebank I (cf. Marcus/Santorini/Marcinkiewicz 1993).

Table 26: Penn Treebank Release 1

Component                         Tagged words    Parsed words
Dow Jones news stories               3,065,776       1,061,166
Brown corpus retagged                1,172,041       1,172,041
Dept. of Energy abstracts              231,404         231,404
MUC-3 messages                         111,828         111,828
Library of America texts               105,652         105,652
IBM manuals                             89,121          89,121
Dept. of Agriculture bulletins          78,555          78,555
ATIS sentences                          19,832          19,832
WBUR radio transcripts                  11,589          11,589
Total                                4,885,798       2,881,188
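As a quick arithmetic check on Table 26, the component figures can be summed directly. The short sketch below uses only the per-component numbers given in the table; the variable names are illustrative.

```python
# (tagged words, parsed words) for each component listed in Table 26
components = {
    "Dow Jones news stories": (3_065_776, 1_061_166),
    "Brown corpus retagged": (1_172_041, 1_172_041),
    "Dept. of Energy abstracts": (231_404, 231_404),
    "MUC-3 messages": (111_828, 111_828),
    "Library of America texts": (105_652, 105_652),
    "IBM manuals": (89_121, 89_121),
    "Dept. of Agriculture bulletins": (78_555, 78_555),
    "ATIS sentences": (19_832, 19_832),
    "WBUR radio transcripts": (11_589, 11_589),
}

tagged_total = sum(tagged for tagged, _ in components.values())
parsed_total = sum(parsed for _, parsed in components.values())
parsed_share = parsed_total / tagged_total

print(f"Tagged words: {tagged_total:,}")   # 4,885,798
print(f"Parsed words: {parsed_total:,}")   # 2,881,188
print(f"Parsed share: {parsed_share:.0%}")  # 59%
```

The component rows sum to 4,885,798 tagged words and 2,881,188 parsed words, so the parsed material is roughly 59% of the tagged total. Only the Dow Jones component is partially parsed; every other component is parsed in full.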

Penn Treebank I applies a parsing scheme which is extended and modified on the basis of the Lancaster parsing scheme. While both annotation schemes employ a phrase structure grammar which covers noun, verb, adjective, adverbial and preposition phrases, the Lancaster scheme also distinguishes between different clause types such as adverbial clause, comparative clause, nominal clause and relative clause, whereas the Penn Treebank scheme differentiates between different types of wh-phrases (e.g. wh-noun, wh-adverb and wh-prepositional phrases). The latter also includes a variety of null elements which indicate, for example, the understood subject of an infinitive or imperative, and the zero variant of that in subordinate clauses.

Penn Treebank Release 2, which was published in 1995, features the new Treebank II bracketing style. The new bracketing style is designed to facilitate the extraction of simple predicate/argument structure (see Bies/Ferguson/Katz et al. 1994). Penn Treebank II contains one million words of 1989 Wall Street Journal material and a small sample of ATIS-3 material annotated in Treebank II style, in addition to a cleaned copy of the Release 1 material annotated in Treebank I style. Penn Treebank Release 3 (1999) includes tagged and parsed Switchboard transcripts which are also annotated for disfluency. The Penn Treebank can be ordered on CD-ROM from the LDC. The corpus is also searchable via LDC Online.

9.8. Parsed historical corpora

In addition to the treebanks of present-day English introduced above, this section introduces a number of parsed historical corpora. These corpora are largely based on the diachronic part of the Helsinki Corpus.

The Penn-Helsinki Parsed Corpus of Middle English version 2 (PPCME2) is a corpus of prose text samples of Middle English, annotated for syntactic structure to allow searching not only for words and word sequences but also for syntactic structure. Based on the Middle English section of the Helsinki corpus (with additions and deletions), PPCME2 comprises 55 text samples amounting to 1.3 million words. The annotation scheme for the corpus follows the basic formatting conventions of the Penn Treebank (Taylor 2000). PPCME2 is an improved and extended version of an earlier corpus, PPCME1, which was smaller (510,000 words) and which used a simpler annotation scheme (no POS tagging, no indication of the internal structure of noun phrases, less detailed annotation of several complex sentence and phrase types). Both versions of the corpus are available at the corpus website. PPCME1 is free for downloading while PPCME2 can be ordered on CDROM at a nominal cost. The corpus search tool, CorpusSearch, is freely available.

The York-Helsinki Parsed Corpus of Old English Poetry is a selection of poetic texts from the Old English Section of the Helsinki Corpus which have been annotated to facilitate searches on lexical items and syntactic structure. The corpus contains 71,490 words of Old English text samples ranging from 4,000 to 17,000 words in length. The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English is a selection of texts from the Old English Section of the Helsinki Corpus of English Texts. The corpus contains 106,210 words of Old English text samples, ranging from 5,000 to 10,000 words in length, which represent a range of dates of composition, authors and genres. A much larger corpus with much more detailed annotation is the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), which contains 1.5 million words of Old English prose texts taken from the Toronto Dictionary of Old English Corpus, formatted so that syntactic structure can be searched conveniently with a computer search engine. These corpora apply the PPCME2 annotation scheme. They are available at no cost for non-commercial use at the corpus website or via OTA.

10. Developmental and learner corpora

Two types of corpora are particularly relevant to language learning: developmental corpora and learner corpora. A learner corpus is a collection of the writing or speech produced by learners acquiring a second language (L2). The term is used here as opposed to a developmental corpus, which consists of data produced by children acquiring their first language (L1). This section introduces well-known corpora of these two types.

10.1. The Child Language Data Exchange System

The Child Language Data Exchange System (CHILDES) is an international database organized for the study of first and second language acquisition. The system consists of three parts: Codes for the Human Analysis of Transcripts (CHAT), Computerized Language Analysis (CLAN), and the database itself. The CHILDES database contains transcripts of data collected from children and adults who are learning both first and second languages. The total size of the database is now approximately 180 million characters (ca. 20 million words), covering 25 languages. The database is divided into six major components: English, non-English, narratives, language impairments, bilingual acquisition, and books. Some files have associated audio and video recordings. The transcripts from normal English-speaking children constitute about half of the total CHILDES database. All of the data is transcribed in the CHAT format and can be analyzed using the CLAN programs, which support four basic types of linguistic analysis: lexical analysis, morpho-syntactic analysis, discourse analysis, and phonological analysis (cf. MacWhinney 1995).

The CHILDES database has been used in a wide range of research on normal and abnormal child language. The database and computer programs are freely available for research at the CHILDES website.

10.2. The Louvain Corpus of Native English Essays

The Louvain Corpus of Native English Essays (LOCNESS) is a corpus of argumentative essays on a great variety of topics written by native British and American students (cf. Granger/Tyson 1996). The LOCNESS corpus comprises three parts: A-level essays by British pupils (114 essays, 60,209 words), essays by British university students (90 essays, 95,695 words), and essays by American university students (232 essays, 168,400 words), totaling 324,304 words. As the age group of these students is comparable to that of the non-native EFL students in the International Corpus of Learner English (ICLE, see section 10.4), LOCNESS provides control data for comparing the writing of native speakers and non-native learners. The corpus can be ordered from the Centre for English Corpus Linguistics (CECL) at the University of Louvain.

10.3. The Polytechnic of Wales corpus

The Polytechnic of Wales (POW) corpus contains 65,000 words of informal conversations of about 120 children aged 6 to 12, which were collected between 1978 and 1984 in South Wales. The children were selected in order to minimise any Welsh or other second language influence and divided into four groups of 30, each within three months of the ages 6, 8, 10, and 12. These groups were subdivided by sex (boys, girls) and socio-economic class (A, B, C, D). The corpus is fully parsed by hand using a Systemic Functional Grammar with rich syntactico-semantic categories, capable of handling raising, dummy subject clauses, ellipsis, and replacement strings (cf. Souter 1993). The corpus contains 11,396 parse trees in 184 files, each file with a reference header which identifies the age, sex and social class of the child, and whether the text is from a play session or an interview. Only the parsed corpus is available in machine-readable form via ICAME or OTA. The recorded tapes and 4-volume transcripts with intonation contours are available in hard copy from the British Library.

10.4. The International Corpus of Learner English

The first and best-known learner corpus is the International Corpus of Learner English (ICLE). The corpus comprises argumentative essays written by advanced learners of English, i.e. university students of English as a foreign language (EFL) in their 3rd or 4th year of study. The primary goal of ICLE is the investigation of the interlanguage of the foreign language learner (cf. Granger 2003).

ICLE version 1.1, published on CD-ROM in 2002, contained over 2.5 million words in the form of 3,640 texts of 500 to 1,000 words in length written by EFL learners from 11 mother tongue backgrounds, namely, Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish, and Swedish. The corpus is still expanding with additional subcorpora (each containing 200,000 words) of eight other L1 backgrounds including Brazilian, Chinese, Japanese, Lithuanian, Norwegian, Portuguese, South African (Setswana) and Turkish (see the ICLE website for the current state of affairs). The version of ICLE published on CD-ROM is not tagged for parts of speech or learner errors. An error-tagged and POS-tagged version of the corpus will be available in the near future.

In addition to allowing the comparison of the writing of learners from different backgrounds, the corpus can be used in combination with LOCNESS to compare native and learner English. The ICLE corpus is available for linguistic research but cannot be used for commercial purposes. Orders can be placed via i6Doc.

10.5. The LINDSEI corpus

The Louvain International Database of Spoken English Interlanguage (LINDSEI) is a spoken counterpart to ICLE. Each subcorpus represents an L1 background and comprises transcripts of fifty 15-minute interviews with 3rd and 4th year university students. The first component of LINDSEI contains transcripts of interviews with 30 female and 20 male French learners of English, totaling ca. 100,000 words. The database is currently being expanded with additional components representing other mother tongues including Bulgarian, Chinese, Italian, Japanese, Spanish, and Swedish. As most learner corpora have used written data only, this type of data allows new research into a wide range of features of oral interlanguage. See the LINDSEI website for the latest development of the corpus.

10.6. The Longman Learners' Corpus

The Longman Learners' Corpus contains ten million words of essays written during 1990-2002 by students of English at a range of levels of proficiency from 20 different L1 backgrounds. The elicitation tasks varied, ranging from in-class essays with or without the use of a dictionary to exam essays or assignments. Each script in the corpus is coded for the student's L1 background, proficiency level, text type (essay, letter, exam script, etc.), target variety (British, American or Australian English), and country of residence. This corpus has been designed to provide balanced and representative coverage for each of these categories (cf. Gillard/Gadsby 1998, 160). Taken as a whole it offers a multi-faceted picture of interlanguage, which can be explored in a variety of ways. The Longman Learners' Corpus is not POS tagged, but part of the corpus has been error-tagged manually, although this portion is only for internal use by the Longman publishers. The Longman Learners' Corpus is a commercial corpus, but it is also available for academic use. At present around 10 million words can be supplied. Users can also order a subcorpus for a certain proficiency level or L1 background. For details, see the Longman website.

10.7. The Cambridge Learner Corpus

As part of the Cambridge International Corpus (CIC), the Cambridge Learner Corpus (CLC) is a large collection of examples of English writing from learners of English all over the world. It contains over 20 million words and is expanding continually. The English in the CLC comes from anonymized exam scripts written by students taking Cambridge ESOL English exams worldwide. The corpus currently contains 50,000 scripts from 150 countries (100 different L1 backgrounds). Each script is coded with information about the student's first language, nationality, level of English, age, etc. Over eight million words (or about 25,000 scripts) have been coded for errors using the Learner Error Coding system developed by Cambridge University Press. CLC is a commercial corpus. Currently the corpus can only be accessed by authors and writers working for Cambridge University Press and by members of staff at Cambridge ESOL.

10.8. Other learner corpora

In addition to the corpora which cover multiple L1 backgrounds as introduced above, there are a number of learner corpora specific to one particular mother tongue.

The HKUST Corpus of Learner English is one such example. The corpus contains 25 million words of essays and exam scripts of upper-secondary and tertiary-level Chinese learners of English in Hong Kong (mainly Cantonese speakers). The average length of these essays is 1,000 words. The corpus is partly tagged for part of speech and learner error (see Milton/Chowdhury 1994). In addition to the corpus data, a number of computer programs such as AutoLANG and WordPilot have been developed for use with the corpus. These tools are freely downloadable at the website of HKUST. The HKUST learner corpus is available to the public for use in research on a collaborative basis.

The Chinese Learner English Corpus (CLEC) contains one million words of writing produced by Chinese learners of English at five proficiency levels: middle school students, junior and senior non-English majors, and junior and senior English majors. The five types of learners are equally represented in the corpus. The CLEC material includes writings for tests, guided writings and free writings. The corpus is not POS tagged, but it is fully annotated with learner errors using an annotation scheme which consists of 61 error types clustered in 11 categories (see Gui/Yang 2001). The CLEC corpus, together with a companion book, can be ordered from Shanghai Foreign Language Education Press (SFLEP). The corpus can also be searched online at the CLEC website.

The JEFLL (Japanese EFL Learner) corpus contains one million words. It has three components. The most important part is the written (ca. 400,000 words) and spoken (ca. 50,000 words) data produced by Japanese EFL learners from Years 7-12 in secondary schools. The second component is an L2 target language subcorpus which contains 150,000 words of English textbook material used in Japan. The third component is an L1 Japanese corpus which comprises 100,000 words of general Japanese writing as well as the same written essay topics used for eliciting English data. As the learner data in the JEFLL corpus is developmental in nature, the corpus can be used for interlanguage development studies. The roughly comparable L1 data also makes it possible to study the potential effect of L1 transfer/interference. The textbook material in the corpus is useful in studying the influence of textbooks on learner performance. JEFLL is POS tagged and partly tagged for learner errors. The corpus will be made publicly available for free online access via the Shogakukan Corpus Network when it is ready.

The Standard Speaking Test (SST) corpus, also known as the NICT JLE (Japanese Learner English) Corpus, contains one million words of error-tagged spoken English produced by Japanese learners. Based entirely upon the audio recordings of an English oral proficiency interview test called the Standard Speaking Test (SST), the corpus comprises 1,200 samples transcribed from 15-minute oral interview tests (around 300 hours of recording in total). This is the largest spoken learner corpus which has been built to date. The subjects are classified into nine SST proficiency levels, thus making it possible to compare speech across different learner proficiency groups. Two types of tagging have been used in the SST corpus: discourse tagging and error tagging. The tags are XML-compliant. More than 30 basic tags are used to mark up discourse phenomena in the learners' utterances, which are clustered into four main categories: tags for representing the structure of the entire transcription file, tags for the interviewee's profile, tags for speaker turns, and tags for representing utterance phenomena such as fillers and repetitions (see Izumi/Uchimoto/Isahara 2004, 34). The error tagging scheme consists of 47 tags. Each tag shows three types of information: part of speech, a grammatical/lexical rule, and a corrected form (cf. Izumi/Isahara 2004). The SST corpus CD and a companion book can be ordered from the publisher's website.
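Because the error tags are XML-compliant and each carries a part of speech, a rule and a corrected form, they can in principle be read with any XML parser. The following is a minimal sketch under stated assumptions: the wrapper element utt, the tag name n_num (read here as part of speech "n" plus rule "num") and the attribute crr for the corrected form are hypothetical choices made for illustration, and the sample sentence is invented. The corpus documentation should be consulted for the actual tag inventory.

```python
import xml.etree.ElementTree as ET

# Invented sample utterance with one hypothetical error tag.
SAMPLE = '<utt>I bought two <n_num crr="books">book</n_num> yesterday</utt>'


def extract_errors(utterance_xml):
    """List each tagged error with its assumed POS, rule and correction."""
    root = ET.fromstring(utterance_xml)
    errors = []
    for element in root.iter():
        if "crr" in element.attrib:  # assumed attribute for the corrected form
            pos, _, rule = element.tag.partition("_")
            errors.append({
                "pos": pos,
                "rule": rule,
                "error": element.text or "",
                "correction": element.attrib["crr"],
            })
    return errors


def corrected_text(utterance_xml):
    """Rebuild the utterance with corrections applied (flat tags only)."""
    root = ET.fromstring(utterance_xml)
    parts = [root.text or ""]
    for child in root:
        parts.append(child.get("crr", child.text or ""))
        parts.append(child.tail or "")
    return "".join(parts)


print(extract_errors(SAMPLE))
print(corrected_text(SAMPLE))
```

The sketch handles only error tags that sit directly inside the utterance element; nested error tags, which a real error-annotated transcript may contain, would need a recursive treatment.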

The Thai English Learner Corpus (TELC) currently contains 1.5 million words of writing by Thai learners of English. Two thirds of the material was taken from university entrance exams at the Institute for English Language Education (IELE, Assumption University) and one third came from writing by undergraduate students at various stages during the four-year EFL course. The corpus continues to grow with the constant addition of new data. The whole TELC corpus is tagged for part of speech and lemma. A demonstration version of the corpus can be accessed at the TELC website. The query system displays a maximum of 50 concordance lines, although it also indicates the total number of matches in the whole corpus.

The Uppsala Student English (USE) corpus contains 1.2 million words in the form of 1,489 essays written during 1999-2001 by 440 Swedish university students of English at three different levels, the majority in their first term of full-time studies. These essays were written out of class against a deadline of 2-3 weeks, with length limitations imposed (usually 700-800 words) and a suitable text structure suggested. There are a variety of essay types in the corpus, including evaluation, argumentation, and discussion. The corpus is available for non-commercial use only. It can be accessed at the USE site or ordered via OTA.

The Polish Learner English Corpus is designed by the PELCRA project (see section 2.3) as a half-a-million-word corpus of written learner data produced by Polish learners of English from a range of learner styles at different proficiency levels, from beginning learners to post-advanced learners (cf. Lewandowska-Tomaszczyk 2003, 107). The data was collected between 1998 and 2000 from the exam essays of Polish learners of English at the Institute of English Studies in Lodz and two teacher-training colleges affiliated with the University of Lodz. Each data file contains a TEI lite conformant header. The corpus is tagged using CLAWS with the standard C7 tagset. Learner errors are identified by comparing the questionable language portions in the learner corpus with materials from native English corpora (e.g. the BNC and ANC) on the one hand and the PELCRA corpus of native Polish on the other. Some sample files are available at the PELCRA project site (see Appendix).

The JPU (Janus Pannonius University) learner corpus contains 400,000 words of essays and research papers by advanced level Hungarian university students, which were collected from 1992 to 1998. JPU has five subcorpora: Postgraduate, Writing and Research Skills, Language Practice, Electives and Russian Retraining (cf. Horvath 1999). Only a small part of the corpus is currently accessible via the JPU corpus site.

11. Multilingual corpora

We have so far introduced major monolingual corpora of English and a number of other languages. This section introduces multilingual corpora. The term multilingual is used here in a broad sense to include bilingual corpora. Multilingual corpora can be parallel or comparable. Corpora of this kind are particularly useful in translation and contrastive studies.

11.1. The Canadian Hansard Corpus

The earliest and perhaps best-known parallel corpus is the Canadian Hansard Corpus, which consists of debates from the Canadian Parliament published in the country's official languages, English and French. While its content is limited to legislative discourse, the corpus covers a broad range of topics and styles, e.g. spontaneous discussion, written correspondence, as well as prepared speeches.

There are several versions of the Canadian Hansard parallel corpus. The USC version comprises 1.3 million pairs of aligned text chunks (i.e. sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament (1997-2000) with ca. 2 million words in English and French each. This version is freely downloadable at the USC site (USC Hansard, see Appendix). TransSearch offers an online service which allows subscribed users to access all of the Hansard texts from 1986 to February 2003 (approximately 235 million words). The LDC released a collection of Hansard parallel texts in 1995, covering a time span from the mid-1970s through 1988. This version is available on CD-ROM from the LDC. The Canadian Hansard Treebank, which contains 750,000 words of skeleton-parsed texts from proceedings in the Canadian Parliament, is available from UCREL at Lancaster University.

11.2. The English-Norwegian Parallel Corpus

The English-Norwegian Parallel Corpus (ENPC) is one of the earliest and best-known parallel corpora. The corpus is bi-directional in that it contains both original and translated texts in the two languages. ENPC consists of 100 original texts of between 10,000 and 15,000 words in length in English and Norwegian together with their corresponding translations in the two languages, totaling 2.6 million words. Unlike most parallel corpora, which are limited to a particular domain or text type, efforts have been made to balance the ENPC corpus. Both fiction (30 originals plus translations in each language) and non-fiction (20 originals plus translations in each language) texts are sampled. Fiction texts include children's fiction, detective fiction and general fiction. Non-fiction texts cover religion, social sciences, law, natural sciences, medicine, arts, and geography/history (see Johansson/Ebeling/Oksefjell 2002). ENPC is marked up in TEI-compliant SGML. The English texts in the corpus are POS tagged and lemmatized while the Norwegian part has also been tagged recently. The corpus is aligned at the sentence level. The ENPC corpus is available for non-commercial research. Registered users can access the corpus online. See the corpus homepage for details of registration.

11.3. The English-Swedish Parallel Corpus

The English-Swedish Parallel Corpus (ESPC) follows ENPC in its design. The corpus consists of 64 English text samples and their translations into Swedish and 72 Swedish text samples and their translations into English, amounting to 2.8 million words. The samples from each language have been drawn from two main text categories, fiction and non-fiction. The fiction categories include children's fiction, crime and mystery fiction, and general fiction while non-fiction texts cover memoirs and biography, geography, humanities, natural sciences, social sciences, applied sciences, legal documents, and prepared speech. The text types of the originals from both languages are comparable in terms of genre, subject matter, type of audience and register (cf. Altenberg/Aijmer/Svensson 2001). ESPC is aligned at the sentence level and marked up in TEI-compliant SGML. The corpus is for non-commercial research and only registered users can access the corpus. See the ESPC site for contact details.

11.4. The Oslo Multilingual Corpus

The Oslo Multilingual Corpus (OMC) is an extension of ENPC which covers, in addition to English and Norwegian, German, French, Swedish, Dutch, Finnish and Portuguese. The corpus is composed of many subcorpora that differ in composition with regard to languages and number of texts included. OMC is a corpus under construction. Apart from ENPC and ESPC, the corpus currently includes an English-German subcorpus (1.3 million words), a French-Norwegian subcorpus (0.5 million words), a German-Norwegian subcorpus (1.5 million words), a Norwegian-English-German subcorpus (289,230 words of Norwegian original texts, 419,500 words of English original texts, and 220,600 words of German original texts, plus the translations in the other two languages), an English-Dutch subcorpus (0.3 million words), an English-Norwegian-Portuguese subcorpus (0.6 million words), a Norwegian-French-German subcorpus (1.5 million words), a Norwegian-English-French-German subcorpus (1.7 million words), and an English-Finnish subcorpus (0.3 million words).

OMC has been constructed following the same principles as ENPC; and like ENPC, the corpus is coded and marked up in TEI-compliant SGML. The OMC corpus is for academic, non-commercial purposes but it can be accessed only by registered users. See the OMC homepage for the current status of the corpus.

11.5. The ET10/63 and ITU/CRATER parallel corpora

ET10/63 is a bilingual parallel corpus of English and French, containing ca. one million words of EC official documents on telecommunications in each language. The corpus is POS tagged and also lemmatized. This bilingual parallel corpus was extended to include Spanish in the Corpus Resources and Terminology Extraction (CRATER) project. The extension is thus named the CRATER parallel corpus, which contains one million words in each of the three languages. The corpus is sentence aligned and tagged for part of speech in all three languages (cf. Garside/Hutchinson/Leech et al. 1994). An expanded version of the CRATER corpus, CRATER 2, has increased the size of the English and French components of the parallel corpus from one million to 1.5 million words. Both versions of CRATER are available via ELRA. The corpus can also be accessed online or downloaded via FTP at the CRATER site.

11.6. The IJS-ELAN Slovene-English Parallel Corpus

The Slovene-English Parallel Corpus (IJS-ELAN) contains one million words from 15 terminology-rich bilingual texts produced in the 1990s. One half of the corpus (in terms of the text size) consists of 11 Slovene texts and their English translations while the other half comprises four English texts and their Slovene translations. The corpus is aligned at the sentence level (cf. Erjavec 2002). Two versions of the IJS-ELAN corpus are available, with one version marked up in TEI-compliant SGML and the other encoded in XML and lemmatized and POS tagged. Both versions are freely available for downloading at the corpus website, which also allows free online access.

11.7. The CLUVI parallel corpus

The CLUVI (Linguistic Corpus of the University of Vigo) parallel corpus is an open textual corpus of specialized registers (taken from fiction, computing, journalism and legal and administrative fields), totaling eight million words of running texts. The corpus currently comprises five main sections. They are the LEGA parallel corpus of Galician-Spanish legal texts (1.9 million words), the UNESCO parallel corpus of English-Galician-French-Spanish scientific-technical divulgation (1.8 million words), the TECTRA parallel corpus of English-Galician literary texts (1.3 million words), the FEGA parallel corpus of French-Galician literary texts (1.3 million words), and the LEGE-BI corpus of Basque-Spanish legal texts (2.4 million words). The corpus is being expanded with four additional sections: Galician-Spanish economy texts, English-Portuguese literary texts, English-Spanish literary texts, and English-Galician film subtitling (cf. Gomez Guinovart/Sacau Fontenla 2004). The CLUVI corpus is aligned at the sentence level and marked up in XML. The completed sections of the corpus are freely accessible at the CLUVI website.

11.8. European Corpus Initiative Multilingual Corpus I

European Corpus Initiative Multilingual Corpus I (ECI/MCI) was released in 1994 by ELSNET (see section 13). The corpus contains 98 million words of texts from 27 languages, covering most of the major European languages as well as some non-European languages such as Chinese, Japanese and Malay. The corpus has 48 components, 12 of which are parallel corpora composed of 2-9 subcorpora. It also includes a great diversity of text types such as newspapers, novels and stories, technical papers and dictionaries and wordlists, though most components are quite homogeneous in contents (cf. Thompson/Armstrong-Warwick/McKelvie et al. 1994).

ECI/MCI is marked up in TEI P2 conformant SGML, but the markup has been undertaken in such a way that users can also get easy access to the source text without markup. The corpus is available from ELSNET or the LDC.

11.9. The MULTEXT corpora

Multilingual Tools and Corpora (MULTEXT) is a series of projects whose aims are to develop standards and specifications for the encoding and processing of linguistic corpora, and to develop tools, corpora and linguistic resources embodying these standards. The multilingual corpus used for developing linguistic tools is the JOC (Official Journal of European Community) corpus, which comprises 40 files in five languages: English, German, Italian, Spanish and French. Of these, ten files in five languages (English, French, German, Spanish and Italian) are POS tagged and ten files in four language pairs (English-French, English-German, English-Italian and English-Spanish) are aligned at the sentence level. The corpus is conformant with the Corpus Encoding Standard. The availability of the corpus is unknown, but some samples can be downloaded at the MULTEXT website.

MULTEXT-East is a project which is intended to extend the scope of MULTEXT by transferring MULTEXT's expertise, methodologies, and tools to Central and Eastern European countries, thus enabling the extension and validation of these methodologies and tools on a new range of languages. The MULTEXT-East parallel corpus consists of the English original of George Orwell's Nineteen Eighty-Four (100,000 words) together with its translations into the nine project languages: Bulgarian, Czech, Estonian, Hungarian, Lithuanian, Romanian, Russian, Serbian, and Slovene. The corpus contains extensive CES-compliant headers and markup for document structure, sentences, and various sub-sentence annotations. The translations of Nineteen Eighty-Four are automatically POS tagged and sentence aligned with the English original, with the alignments validated by hand. The MULTEXT-East multilingual comparable corpus comprises a fiction subset and a news subset of at least 100,000 words each, for each of the six project languages (Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene). Each language component is comparable in terms of the number and size of texts. The multilingual comparable corpus is marked up in CES format with over 40 different elements (see Erjavec 2004). The parallel and comparable corpora, together with other MULTEXT-East language resources, are mounted on the Web. The corpora are restricted to research use only. Registered users can browse or download full resources. Registrations can be made on the MULTEXT-East website.

11.10. The PAROLE corpora

PAROLE (Preparatory Action for Linguistic Resources Organization for Language Engineering) represents a large-scale harmonized effort to create comparable text corpora and lexica for EU languages. Fourteen languages are involved in the PAROLE project: Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish. Corpora containing 20 million words and lexica containing 20,000 entries were constructed for each of these languages using the same design and composition principles during 1996-1998. These corpora all include specific proportions of texts from the categories book (20%), newspaper (65%), periodical (5%) and miscellaneous (10%) within a settled range.

The PAROLE corpora are marked up according to the CES-conformant PAROLE DTD (Document Type Definition). An equal proportion of the texts (up to 250,000 running words) in each PAROLE corpus were POS tagged according to a common PAROLE tagset and morpho-syntactic annotation standards. Part of the tagged data was validated: 50,000 words checked for maximum granularity and 200,000 for part of speech. For some PAROLE corpora, only a copyright-free subset is available to the public. The PAROLE corpora that are currently available are distributed by ELRA.

11.11. Multilingual Corpora for Cooperation

Multilingual Corpora for Cooperation (MLCC) is a corpus acquisition project which aims to collect a set of texts representing a substantial improvement in range, quantity and quality of corpus material available. The MLCC multilingual data consists of the Multilingual Parallel Corpus and the comparable Polylingual Document Collection. The parallel corpus comprises translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish. This corpus has two datasets, with one set taken from the Official Journal of the European Commission, C Series: Written Questions 1993, totaling approximately 10.2 million words (1.1 million words per language), and the other set taken from the Official Journal of the European Commission, Annex: Debates of the European Parliament 1992-1994, with 5-8 million words for each language. The comparable corpus includes financial newspaper articles from the early 1990s in six European languages: Dutch (8.5 million words), English (30 million words), French (10 million words), German (33 million words), Italian (1.88 million words), and Spanish (10 million words). The MLCC multilingual and parallel corpora are marked up in TEI-compliant SGML (cf. Armstrong/Kempen/McKelvie et al. 1998). The resources are available via ELRA.

We have so far introduced multilingual corpora of European languages. The following sections are concerned with corpora involving other languages.

11.12. The EMILLE Corpus

The EMILLE Corpus is a product of the Enabling Minority Language Engineering project which develops language resources for South Asian languages. Two versions of the EMILLE Corpus are available: the EMILLE/CIIL Corpus distributed free of charge for non-commercial research, and the EMILLE/Lancaster Corpus for commercial use only.

The EMILLE/CIIL Corpus consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual corpora, including both written and (for some languages) spoken data for fourteen South Asian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu. The EMILLE monolingual corpora contain approximately 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. The annotated component includes the Urdu monolingual and parallel corpora annotated for part of speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. The EMILLE/Lancaster Corpus has the same three components. This version differs from the EMILLE/CIIL Corpus only in its monolingual component, which covers seven South Asian languages (Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil, and Urdu), totaling approximately 58,880,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel and annotated components are the same as in the EMILLE/CIIL Corpus (cf. Baker/Hardie/McEnery et al. 2004).

The EMILLE Corpus is marked up using CES-compliant SGML, and encoded using Unicode. More information about the corpus is available on the EMILLE corpus site. Both versions of the corpus are distributed via ELRA.

11.13. The BFSU Chinese-English Parallel Corpus

The BFSU (Beijing Foreign Studies University) Chinese-English Parallel Corpus contains 30 million words. Presently it is the largest parallel corpus of English and Chinese. The corpus is composed of four subcorpora, i.e. Balanced Corpus, Translation Corpus, Bilingual Sentences Corpus and Corpus for Specific Purpose. The bidirectional parallel corpus includes both literary (fiction, prose and play scripts) and non-literary texts, which are sampled from 12 text categories covering three major domains: humanities, social sciences and natural sciences. The Chinese-English and English-Chinese texts account for 40% and 60% respectively while literary and non-literary texts account for 55% and 45% respectively. The BFSU parallel corpus is automatically sentence aligned and hand validated. It has been annotated in such a way as to allow concordances of words, phrases, collocations, and sentence patterns (cf. Wang 2004). The corpus is available from China National Research Centre for Foreign Language Education.

11.14. The Babel Chinese-English Parallel Corpus

The Babel Chinese-English Parallel Corpus contains 20 million Chinese characters and 10 million English words of bilingual texts sampled from a great variety of text categories including government documents, news, academic prose, fiction, play scripts, and speech. Babel is designed as a balanced corpus covering three styles (literature, practical writing and news), six fields (arts, business/economics, politics, science, sports, and society/culture), two modes (written, spoken), and four periods (ancient, early modern, modern, and contemporary for Chinese texts, and Old English, Middle English, Early Modern English and present-day English for English texts). Presently only contemporary/present-day written texts are included, and about 400,000 sentence pairs have been aligned (cf. Bai/Chang/Zhan 2002). The Babel parallel corpus is marked up in XML. Each document has two parts, the text header and the text body. The header part shows Chinese and English titles, author, translator, style, field, mode and period. The text body is annotated for paragraphs, aligned anchoring points, sentences, and words. The Chinese texts in the corpus are tokenized and POS tagged while the English texts are POS tagged and lemmatized. The completed part of the corpus can be accessed online at the Babel website.

11.15. Hong Kong Parallel Text

Hong Kong Parallel Text is a large parallel corpus released by the LDC in 2004. The corpus contains approximately 59 million English words and 49 million Chinese words (or 98 million Chinese characters). It consists of the updates of three parallel corpora published in 2000: Hong Kong Hansards, Hong Kong Laws, and Hong Kong News. The Hong Kong Hansards component contains excerpts from the Official Record of Proceedings of the Legislative Council of the HKSAR from October 1985 to April 2003, totaling 36,140,737 English words and 56,618,181 Chinese characters. The Hong Kong Laws component contains statute laws of Hong Kong in English and Chinese, constitutional instruments, national laws and other relevant instruments published by the Department of Justice of the HKSAR up to the year 2000, amounting to 8,396,243 English words and 14,868,621 Chinese characters. The Hong Kong News component contains press releases from the Information Services Department of the HKSAR between July 1997 and October 2003, amounting to 14,798,671 English words and 26,677,514 Chinese characters. All three components of the Hong Kong Parallel Text corpus are aligned at the sentence level. The English and Chinese texts are kept in separate files, with alignment indicated by corresponding sentence numbers. The corpus is available from the LDC.
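The component figures and the alignment-by-sentence-number scheme described above can be illustrated with a minimal Python sketch. The component sizes are those quoted in the text; the dictionary layout used for pairing sentences is a hypothetical in-memory representation chosen for illustration and is not the LDC file format.

```python
# Component sizes as quoted in the text: (English words, Chinese characters).
COMPONENTS = {
    "Hong Kong Hansards": (36_140_737, 56_618_181),
    "Hong Kong Laws": (8_396_243, 14_868_621),
    "Hong Kong News": (14_798_671, 26_677_514),
}


def totals(components):
    """Sum English words and Chinese characters over all components."""
    english = sum(en for en, _ in components.values())
    chinese = sum(zh for _, zh in components.values())
    return english, chinese


def pair_by_number(english, chinese):
    """Pair sentences that share a sentence number.

    Both arguments map a sentence number to the sentence text.
    This layout is an assumption made for the example only.
    """
    shared = sorted(english.keys() & chinese.keys())
    return [(n, english[n], chinese[n]) for n in shared]


if __name__ == "__main__":
    en_total, zh_total = totals(COMPONENTS)
    print(en_total, zh_total)  # 59335651 98164316
```

The sums agree with the rounded totals of 59 million English words and 98 million Chinese characters given for the whole corpus. Sentences that carry a number in only one of the two files are simply left out of the pairing in this sketch.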

12. Non-English monolingual corpora

We have so far been concerned with well-known and influential English corpora and multilingual corpora involving English, in addition to some national corpora. This section introduces a number of major monolingual corpora of other languages.

12.1. The COSMAS corpora

COSMAS (Corpus Search, Management and Analysis System) is a large collection of German text corpora developed at the Mannheim IDS (Institut für deutsche Sprache). With a size of almost two billion words, this is the world's largest, ever-growing collection of German online corpora for linguistic research. The collection covers a wide variety of sources, e.g. classic literary texts, national and regional newspapers, transcribed spoken language, morpho-syntactically annotated texts and several unique corpora.

The copyright-free part of the COSMAS collection (over 1.1 billion words) is publicly available free of charge for searching via the COSMAS online toolbox, which allows complex queries, collocation analysis, clustering, and virtual corpus composition. The COSMAS corpora are only available for non-commercial use and anonymous COSMAS sessions are limited to 60 minutes.

12.2. The CETEMPúblico Corpus

The CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público) corpus includes the text of around 2,600 editions of the Portuguese daily newspaper Público, written between 1991 and 1999, amounting to approximately 180 million words. The corpus is marked up in SGML. Having removed some repeated extracts from version 1.0, CETEMPúblico version 1.7 consists of over 1.5 million extracts. The first million words (8,043 extracts) have been parsed. This subset represents a balanced selection from the whole period (1991-1999) rather than early years alone. It also covers all of the categories included in the full corpus (cf. Santos/Rocha 2001). CETEMPúblico can be used for research and technological development, but direct commercial exploitation is not permitted. There are a number of ways to access the corpus: CDROM from the LDC, FTP download, and online access at the corpus website.

12.3. The INL corpora

The Institute for Dutch Lexicology (INL) has offered three corpora over the Web. The 5 Million Words Corpus 1994 has a varied composition. It comprises texts of present-day Dutch derived from 17 text sources dating from 1989-1994, including books, magazines, newspapers and TV broadcasts which cover topics such as journalism, politics, environment, linguistics, leisure and business/employment (see Kruyt 1995). The 27 Million Words Dutch Newspaper Corpus 1995 consists of newspaper texts derived from issues published in 1994-1995 by a major national newspaper, NRC (see Kruyt/Raaijmakers/van der Kamp et al. 1996). The 38 Million Words Corpus 1996 has three main components: a component with varied composition (books, magazines, newspaper texts, TV broadcasts, parliamentary reports, 1970-1995, 12.7 million words), a newspaper component (Meppeler Courant, 1992-1995, 12.4 million words), and a legal component (Dutch legal texts operative in 1989, with some dating back as early as 1814, 12.9 million words) (see Kruyt/Dutilh 1997). All three corpora are lemmatized and tagged for part of speech and users can define subcorpora using the parameters encoded therein. They are available for non-commercial research purposes only. Access to these corpora is free of charge but subject to an individual user agreement, which can be obtained from the INL website.

12.4. The CEG corpus

The CEG (Cronfa Electroneg o Gymraeg) corpus contains one million words of written Welsh prose. The corpus is designed as a Welsh parallel to the Brown and LOB corpora, consisting of five hundred 2,000-word samples selected from a representative range of text types to illustrate modern (mainly post-1970) Welsh prose writing. However, the text categories and their proportions in the corpus are different from those in Brown and LOB. The texts in CEG are grouped into two broad categories: factual prose and fiction. There are seven types of fiction such as novels and short stories while the factual prose is further divided into 22 categories such as various types of press material, administrative documents, academic texts and biography (see Ellis/O'Dochartaigh/Hicks et al. 2001).

The corpus is of value for lexical and syntactic analyses of modern Welsh prose. It is available as both raw and annotated texts. Annotations include lemmatization and POS tagging. Both versions are available at the CEG website.

12.5. The Scottish Corpus of Texts and Speech

The Scottish Corpus of Texts and Speech (SCOTS) is an ongoing project which aims to build a large electronic corpus of both written and spoken texts for the languages of Scotland. As a product of the first phase of the project, the current version of the corpus consists of 385 texts (around half a million words) contributed by 113 people, which include written, spoken and visual materials from a range of genres such as conversation, interviews, correspondence, poetry, fiction and prose, with much of the material being poetry or literary prose. The initial text collection includes Scots and Scottish English. The corpus is being expanded in the second phase of the project to extend to Gaelic and the non-indigenous community languages used in Scotland, with a target total size of four million words (800 texts), 20% of which will be spoken. The problem of genre balance will also be addressed.

The SCOTS corpus is marked up in SGML. Extensive sociolinguistic metadata is encoded, including, for example, resource type, text type, setting, medium, audience, text details, author/speaker details (gender, age, geographic region, education, occupation, religious background, languages used, etc.), and copyright information (see Anderson 2005). The current version of the SCOTS Corpus is not linguistically annotated, but the transcripts of spoken data are aligned with digital audio/video recordings. The available texts can be browsed and downloaded at the project site.

12.6. The Prague Dependency Treebank

The Prague Dependency Treebank (PDT, version 1.0) contains 1.8 million words of texts drawn from the Czech National Corpus (see section 2.4) which have been annotated morphologically and syntactically. Of the texts included in the treebank, general newspaper articles related to politics, sports, culture, hobbies, etc. account for 60%, economic news and analyses 20%, and popular science magazines 20%. PDT version 1.0 is marked up in SGML, which is migrated to XML in version 2.0. The annotation scheme consists of three levels. The morphological level assigns a lemma and a morphological tag to each token. The analytical level uses dependency syntax to annotate the structure of the parse tree and the analytical function of every node, which determines the relationship between the dependent node and its governing node one level higher in the tree. The highest level of annotation, the tectogrammatical level, uses the dependency framework to describe the linguistic meaning of a sentence (see Bohmova/Hajic/Hajicova et al. 2001). In version 1.0, only the first two levels have been annotated. The third level annotation is undertaken in PDT version 2.0. The same texts are annotated on all three levels, but the amount of annotated material decreases with the complexity of the levels, specifically about 1.8 million tokens on the morphological level, about 1.3 million tokens at the analytical level, and one million tokens on the tectogrammatical level. The Prague Dependency Treebank version 1.0 is available on CDROM from the LDC. It can also be accessed at the PDT website using an online tool which allows users to search and view parse trees.
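The first two annotation levels described above can be pictured with a small Python sketch. It is only an illustration of the idea, not the PDT file format: the field names, the placeholder morphological tags and the English sample sentence are all assumptions made for the example, and the analytical function labels (Pred, Sb, Obj) are used illustratively.

```python
from dataclasses import dataclass


@dataclass
class Node:
    index: int  # 1-based position of the token in the sentence
    form: str   # the token as written
    lemma: str  # morphological level: lemma
    tag: str    # morphological level: tag (placeholder values, not PDT tags)
    head: int   # analytical level: index of the governing node, 0 = root
    afun: str   # analytical level: analytical function of the node


def dependents(nodes, head_index):
    """Return the nodes directly governed by the node at head_index."""
    return [n for n in nodes if n.head == head_index]


def depth(nodes, node):
    """Count the steps from a node up to the artificial root (index 0)."""
    by_index = {n.index: n for n in nodes}
    steps = 1
    while node.head != 0:
        node = by_index[node.head]
        steps += 1
    return steps


# Hypothetical sample sentence, used only to exercise the structure.
SENTENCE = [
    Node(1, "Students", "student", "NOUN", 2, "Sb"),
    Node(2, "read", "read", "VERB", 0, "Pred"),
    Node(3, "books", "book", "NOUN", 2, "Obj"),
]

if __name__ == "__main__":
    for n in dependents(SENTENCE, 2):
        print(n.form, n.afun, depth(SENTENCE, n))
```

Each node stores the index of its governing node, which sits one level higher in the tree, so the whole parse tree can be recovered from the head field alone.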

12.7. Academia Sinica Balanced Corpus

Academia Sinica Balanced Corpus (ASBC) is the first annotated corpus of modern Chinese. The corpus is a representative sample of Mandarin Chinese as used in Taiwan. It contains five million words of texts sampled from different areas and classified according to five criteria: genre, style, mode, topic, and source. Table 27 (cf. Huang/Chen 1995/1998) shows the proportions of text categories in terms of these criteria.

Table 27: Composition of ASBC

Criterion | Proportions
Genre | Press reportage: 56.25%, Press review: 10.01%, Advert: 0.59%, Letter: 1.29%, Fiction: 10.12%, Essay: 8.48%, Biography and diary: 0.50%, Poetry: 0.29%, Quotes: 0.03%, Manual: 2.03%, Play script: 0.05%, Public speech: 8.19%, Conversation: 1.34%, Meeting minutes: 0.11%
Style | Narrative texts: 70.66%, Argumentative texts: 12.24%, Expository texts: 14.72%, Descriptive texts: 2.83%
Mode | Written: 90.14%, Written-to-be-read: 1.38%, Written-to-be-spoken: 0.82%, Spoken: 7.29%, Spoken-to-be-read: 0.35%
Topic | Philosophy: 8.68%, Natural science: 12.97%, Social science: 34.99%, Arts: 9.28%, General/leisure: 17.89%, Literature: 16.20%
Source | Newspaper: 31.28%, General magazine: 29.18%, Academic journal: 0.70%, Textbook: 4.08%, Reference book: 0.13%, Thesis: 1.36%, General book: 8.45%, Audio/video medium: 22.83%, Conversation/interview: 1.63%, Public speech: 0.25%

The values of these parameters, together with bibliographic information, are encoded at the beginning of each text in the corpus. The whole corpus is tagged for part of speech and a range of linguistic features such as nominalization and reduplication. The Sinica corpus is accessible online at the ASBC website using the query system which also allows users to define subcorpora.
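Readers who reuse the proportions in Table 27 may wish to check how closely each criterion adds up to 100%. The following Python sketch does this with the figures exactly as printed in the table; the variable and function names are illustrative only.

```python
# Percentages copied from Table 27, in the order in which they are listed.
TABLE_27 = {
    "Genre": [56.25, 10.01, 0.59, 1.29, 10.12, 8.48, 0.50,
              0.29, 0.03, 2.03, 0.05, 8.19, 1.34, 0.11],
    "Style": [70.66, 12.24, 14.72, 2.83],
    "Mode": [90.14, 1.38, 0.82, 7.29, 0.35],
    "Topic": [8.68, 12.97, 34.99, 9.28, 17.89, 16.20],
    "Source": [31.28, 29.18, 0.70, 4.08, 0.13, 1.36,
               8.45, 22.83, 1.63, 0.25],
}


def criterion_totals(table):
    """Return the sum of the listed percentages for each criterion."""
    return {name: round(sum(values), 2) for name, values in table.items()}


if __name__ == "__main__":
    for name, total in criterion_totals(TABLE_27).items():
        print(f"{name}: {total:.2f}%")
```

As printed, the totals fall within one percentage point of 100 but are not exactly 100, so the figures are best treated as approximate.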

12.8. Sinica Treebank

Sinica Treebank (version 2.1) contains 23 texts (290,114 words) extracted from the ASBC corpus, covering subject areas such as politics, travelling, sports, finance and society. There are 54,902 structural trees in the treebank. As in the Prague Dependency Treebank, the thematic relation between a predicate and an argument is marked in addition to grammatical categories in Sinica Treebank. Six non-terminal phrasal categories are annotated in the treebank: S (a complete tree headed by a predicate), VP (a verb phrase headed by a predicate), NP (a noun phrase headed by a noun), GP (a phrase headed by a locational noun or locational adjunct), PP (a prepositional phrase headed by a preposition), and XP (a conjunctive phrase headed by a conjunction). There are three different kinds of grammatical heads: Head, head and DUMMY. Head indicates a grammatical head in a phrasal category; head indicates a semantic head which does not simultaneously function as a syntactic head; and DUMMY indicates the semantic head(s) whose categorical or thematic identity cannot be locally determined. A total of 63 thematic roles are annotated in the treebank including, for example, agent, causer, condition and instrument for verbs, and time and location for nouns (see Huang/Chen/Chen et al. 2000). Sinica Treebank can be accessed online using the Web-based interface which allows users to search the treebank and view diagrammatical parse trees.

12.9. Penn Chinese Treebank

Penn Chinese Treebank (CTB version 4.0) contains 404,156 words in the form of 838 data files sampled from three newswire sources: 698 articles from the Xinhua News Agency (1994-1998), 55 articles from the Information Services Department of HKSAR (1997), and 80 articles from Sinorama magazine, Taiwan (1996-1998 & 2000-2001). The annotation format of CTB follows that of the Penn English treebank. The formal structural properties are represented with structural labels (such as NP, VP) in brackets while the functional properties are represented with functional labels such as -ADV, -TMP, and -SBJ. Six main grammatical relations are represented in the Chinese treebank, with complementation, adjunction and coordination represented structurally while predication, modification and apposition are represented non-configurationally (see Xue/Xia/Chiou et al. 2004). There are 15,162 parsed sentences in the treebank. The corpus is available from the LDC.
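The combination of bracketed structural labels and hyphenated functional labels can be illustrated with a minimal Python sketch. The reader below handles only simple, well-formed bracketings and is not the treebank's own tooling; the sample tree is hypothetical, with English placeholder words standing in for Chinese tokens.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves (words) are kept as plain strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip the opening bracket
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing bracket
        return (label, children)

    return node()


def split_label(label):
    """Split 'NP-SBJ' into ('NP', ['SBJ']); leave '-NONE-' style labels whole."""
    if label.startswith("-"):
        return label, []
    structural, *functional = label.split("-")
    return structural, functional


def functional_labels(tree):
    """Collect (structural, functional) pairs from every node of the tree."""
    label, children = tree
    structural, functional = split_label(label)
    found = [(structural, f) for f in functional]
    for child in children:
        if isinstance(child, tuple):
            found.extend(functional_labels(child))
    return found


# Hypothetical sample; not taken from the treebank.
SAMPLE = "(IP (NP-SBJ (NN economy)) (VP (NP-TMP (NT today)) (VV grows)))"

if __name__ == "__main__":
    print(functional_labels(parse_bracketed(SAMPLE)))
```

For the sample, the sketch reports that one NP carries the subject label and another the temporal label, while the remaining nodes carry structural labels only.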

12.10. Spoken Chinese Corpus of Situated Discourse

Spoken Chinese Corpus of Situated Discourse (SCCSD) is an ongoing project under the auspices of the Chinese Academy of Social Sciences which aims to collect 1,000 hours of recordings of Mandarin Chinese spoken in China. The corpus consists of three sub-corpora, one for workshop discourse, one for major dialects in China, and one for speeches. At present, 650 hours of audio and 150 hours of video recordings have been collected. The sampling frame for the societal discourse was established sociologically on the basis of a yellow book while the familial discourse was defined in terms of habitation and occupation, as shown in Table 28 (cf. Gu 2003).

Table 28: Discourse types in SCCSD

Category | Subcategory | Example
Societal | Major activities of organisation | government and political discourse, business discourse, educational and academic discourse, legal and mediatory discourse, mass media discourse, discourse of medicine and health, discourse of sports, public service discourse, public welfare discourse, religious and superstitious discourse
Societal | Activities common to organization | administrative discourse, banquet discourse, discourse of celebration and ceremony, discourse of entertainment and leisure, office discourse, political study discourse, telephone discourse
Societal | Special discourse | pathological discourse, criminal discourse, military discourse, miscellaneous
Familial discourse | Family discourse in a metropolis | family of high-ranking officials, family of entrepreneurs, family of businessmen, family of academics, family of white collar, family of blue collar, family of suburb farmers, family of immigrant labor
Familial discourse | Family discourse in a small town | family of academics, family of white collar

The corpus is presently being transcribed and annotated, with segmented audio/video chunks linked to the corresponding transcripts. When the corpus is completed, about 50-100 hours will be mounted on the SCCSD website and made available on the Internet in multimedia form.

13. Well-known distributors of corpus resources

While many corpora introduced in this survey are made available at individual project or corpus websites, there are a number of organisations which aim at creating, collecting and distributing corpus resources. The best-known of these include CSLU, ELRA/ELDA, ELSNET, ENABLER, ICAME, the LDC, OTA, and TELRI/TRACTOR.

CSLU (Centre for Spoken Language Understanding) is a research centre at Oregon Graduate Institute of Science and Technology (OGI) that focuses on spoken language technologies. The centre offers a range of products and services. For non-commercial purposes (educational, research, personal and evaluation), most products are freely available. Some products (generally source code) are also available for commercial use via a membership agreement. CSLU has created, collected and distributed speech corpora in over 20 languages for use in the area of voice processing. A description of the corpora currently distributed by the centre can be found at the CSLU website.

ELRA (the European Language Resources Association) is a non-profit organization established in 1995 with the goal of promoting the creation, validation, and distribution of language resources (LRs) for the Human Language Technology (HLT) community, and evaluating language engineering technologies. Many of these tasks are carried out by ELRA's operational body ELDA (Evaluations and Language resources Distribution Agency), which was set up to identify, classify, collect, validate and produce language resources. The language resources available from ELRA are classified into four major categories: spoken LRs (telephone/microphone recordings, speech related resources), written LRs (corpora, monolingual and multilingual lexicons), terminological resources (monolingual, bilingual and multilingual), and multimodal/multimedia LRs. See the ELRA catalogue for the available language resources.

ELSNET (European Network in Language and Speech) is a Europe-based forum which aims to advance human language technologies in a broad sense by bringing together Europe's key players in research, development, integration or deployment in the field of language and speech technology and neighboring areas. See the ELSNET resources page for the corpora made available or supported by ELSNET.

The ENABLER (European National Activities for Basic Language Resources) Network aims at improving cooperation among the national activities which provide language resources for their respective languages. ENABLER has worked in close collaboration with ELSNET to develop the Language Resources Roadmap and the Language Resources Landscape. Resources offered by ENABLER include written, spoken, and multimodal corpora as well as lexical resources. See the ENABLER catalogue for a list of available corpora.

ICAME (International Computer Archive of Modern and Medieval English) is an international organization of linguists and information scientists working with English corpora. The aim of the organization is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions. About 20 corpora amounting to 17 million words are currently available on CDs from ICAME.

The LDC (Linguistic Data Consortium) is an open consortium of universities, companies and government research laboratories which creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The LDC is the largest distributor of corpus resources, but most LDC resources are specialized corpora which are more geared towards language engineering than linguistic analysis. See the LDC catalogue for a list of available corpora.

OTA (Oxford Text Archive) is one of the oldest and best-known electronic text centres in the world. It works closely with members of the Arts and Humanities academic community to collect, catalogue, and preserve high-quality electronic texts for research and teaching. OTA currently distributes more than 2,500 resources in over 25 different languages, which include a great variety of language corpora in addition to electronic editions of works by individual authors, manuscript transcriptions and reference works. See the OTA catalogue for available resources.

TRACTOR is the TELRI (Trans-European Language Resources Infrastructure) Research Archive of Computational Tools and Resources, which aims at collecting, promoting, and making available monolingual and multilingual language resources and tools for the extraction of language data and linguistic knowledge, with a special focus on Central and Eastern European languages. The TRACTOR archive features monolingual and multilingual corpora as well as lexicons in a wide variety of languages, currently including Bulgarian, Croatian, Czech, Dutch, English, Estonian, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Romanian, Russian, Serbian, Slovak, Slovenian, Swedish, Turkish, Ukrainian and Uzbek. Resources distributed through TRACTOR are available for non-commercial use only, but TRACTOR aims to promote and foster commercial links between academic and industrial researchers.

14. Conclusion

This survey introduced well-known and influential corpora for various research purposes, including national corpora, monitor corpora, corpora of the Brown family, synchronic corpora, diachronic corpora, spoken corpora, academic/professional corpora, parsed corpora, developmental/learner corpora, multilingual corpora, and non-English monolingual corpora. This discussion, however, covers only a very small proportion of the available corpus resources. The classification used in this survey is for illustrative purposes only, and the distinctions drawn have been forced for the purpose of this introduction. It is not unusual to find that a given corpus is a blend of many of the features introduced here.

References

Aduriz, I./Aldezabal, I./Alegria, I./Arriola, J./Diaz de Ilarraza, A./Ezeiza, N./Gojenola, K. (2003), Finite State Applications for Basque. In: Proceedings of EACL'2003 Workshop on Finite-State Methods in Natural Language Processing. Budapest, 13-14 April 2003. Accessed online on 6 December 2004 at http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1044982346/publikoak/fs-eacl-aduriz.pdf.

Altenberg, B./Aijmer, K./Svensson, M. (2001), The English-Swedish Parallel Corpus (ESPC): Manual of enlarged version. Universities of Lund and Göteborg.

Anderson, W. (2005), The SCOTS Corpus: a resource for language contact study. In: Ureland, S. & Pugh, S. (eds) Symposium Logos Series Studies in Eurolinguistics Vol. 4.

Armstrong, S./Kempen, M./McKelvie, D./Petitpierre, D./Rapp, R./Thompson, H. (1998), HCRC Publication: Multilingual Corpora for Cooperation. In: LREC 1998 Proceedings.

Aston, G./Burnard, L. (1998), The BNC Handbook. Edinburgh: Edinburgh University Press.

Auran, C./Bouzon, C./Hirst, D. (2004), The aix-MARSEC project: an evolutive database of spoken British English. In: SP-2004, 561-564.

Bai, X./Chang, B./Zhan, W. (2002), Building a large Chinese-English parallel corpus. In: Huang, H. (ed) Proceedings of the National Symposium on Machine Translation 2002. Beijing: Electronic Industry Press, 124-131.

Baker, P./Hardie, A./McEnery, T./Xiao, R./Bontcheva, K./Cunningham, H./Gaizauskas, R./Hamza, O./Maynard, D./Tablan, V./Ursu, C./Jayaram, B./Leisher, M. (2004), Corpus linguistics and South Asian languages: Corpus creation and tool development. In: Literary and Linguistic Computing 19(4), 509-524.

Barlow, M. (1998), A Corpus of Spoken Professional American English. Houston, TX: Athelstan.

Beare, J./Scott, B. (1999), The Spoken Corpus of the Survey of English Dialects: language variation and oral history. In: Proceedings of ALLC/ACH 1999. Virginia.

Berglund, Y./Burnard, L./Wynne, M. (2004), BNC-baby: using corpora in the virtual classroom. In: Proceedings of the 6th Teaching and Language Corpora conference. Granada, July 2004.

Biber, D. (1988), Variation Across Speech and Writing. Cambridge: Cambridge University Press.

Biber, D./Finegan, E./Atkinson, D. (1994), ARCHER and its challenges: compiling and exploring a representative corpus of historical English registers. In: Fries, U., Tottie, G. & Schneider, P. (eds) Creating and Using English Language Corpora. Amsterdam: Rodopi, 1-14.

Bies, A./Ferguson, M./Katz, K./Schasberger, B. (1994), The Penn Treebank: annotating predicate-argument structure. In: ARPA Human Language Technology Workshop.

Bohmova, A./Hajic, J./Hajicova, E./Hladka, B. (2001), The Prague Dependency Treebank: Three-Level Annotation Scenario. In: Abeille, A. (ed) Treebanks: Building and Using Syntactically Annotated Corpora. Dordrecht: Kluwer.

Burnard, L. (2002), Where did we go wrong? A retrospective look at the British National Corpus. In: Kettemann, B. & Marko, G. (eds) Teaching and Learning by Doing Corpus Analysis: Proceedings of the Fourth International TALC. Amsterdam: Rodopi, 51-70.

Carter, R./McCarthy, M. (2004), Talking, creating: interactional language, creativity, and context. In: Applied Linguistics 25(1), 62-88.

Cavar, D./Geyken, A./Neumann, G. (2000), Digital Dictionary of the 20th Century German Language. In: Erjavec, T. & Gros, J. (eds) Proceedings of the Language Technologies Conference. Ljubljana, October 2000. Accessed online on 6 December 2004 at http://nl.ijs.si/isjt00/index-en.html.

Cheng, W./Warren, M. (1999), Facilitating a description of intercultural conversations: the Hong Kong Corpus of Conversational English. In: ICAME Journal 23, 5-20.

Choukri, K. (2003), Brief overview of activities in Europe. In: Proceedings of COCOSDA Workshop 2003. Geneva, August 2003.

Coxhead, A. (2000), A new academic word list. In: TESOL Quarterly 34(2), 213-238.

Crowdy, S. (1993), Spoken corpus design. In: Literary and Linguistic Computing 8(4), 259-265.

Culpeper, J./Kytö, M. (1997), Towards a corpus of dialogues, 1550-1750. In: Ramisch, H. & Wynne, K. (eds) Language in Time and Space: Studies in Honour of Wolfgang Viereck on the Occasion of his 60th Birthday. Stuttgart: Franz Steiner Verlag Stuttgart, 60-73.

Dalli, A. (2001), Interoperable extensible linguistics databases. In: Proceedings of IRCS Workshop on Linguistic Databases. Philadelphia, 13 December 2001. Accessed online on 6 December 2004 at http://www.ldc.upenn.edu/annotation/database/papers/.

Denison, D. (1994), A corpus of late Modern English prose. In: Kytö, M., Rissanen, M. & Wright, S. (eds) Corpora across the Centuries. Amsterdam: Rodopi, 7-16.

Dubois, J./Chafe, W./Meyer, C./Thompson, S. (2000-2004), Santa Barbara Corpus of Spoken American English Parts 1-3. Linguistic Data Consortium.

Ellis, N./O'Dochartaigh, C./Hicks, W./Morgan, M./Laporte, N. (2001), Cronfa Electroneg o Gymraeg (CEG): a 1 million word lexical database and frequency count for Welsh. Accessed online on 8 December 2004 at http://www.bangor.ac.uk/ar/cb/ceg/ceg_eng.html

English Language Institute (2003), MICASE Manual: The Michigan Corpus of Academic Spoken English (version 1.1). University of Michigan. Accessed online on 7 December 2004 at http://www.lsa.umich.edu/eli/micase/MICASE_MANUAL.pdf.

Erjavec, T. (2002), The IJS-ELAN Slovene-English Parallel Corpus. In: International Journal of Corpus Linguistics 7(1), 1-20.

Erjavec, T. (2004), MULTEXT-East Version 3: Multilingual Morpho-syntactic Specifications, Lexicons and Corpora. In: LREC 2004 Proceedings.

Farr, F./Murphy, B./O'Keeffe, A. (forthcoming), The Limerick Corpus of Irish English: design, description and application. In: Teanga 21.

Fries, U./Schneider, P. (2000), ZEN: Preparing the Zurich English Newspaper Corpus. In: Ungerer, F. (ed) English Media Texts: Past and Present. Amsterdam: John Benjamins, 1-24.

Garside, R./Leech, G./Váradi, T. (1992), Manual of Information for the Lancaster Parsed Corpus. Lancaster University.

Garside, R./Hutchinson, J./Leech, G./McEnery, A./Oakes, M. (1994), The exploitation of parallel corpora in projects ET10/63 and CRATER. In: New Methods in Language Processing. Manchester: UMIST, 108-115.

Garside, R./Leech, G./Sampson, G. (eds) (1987), The Computational Analysis of English: A Corpus-Based Approach. Harlow: Longman.

Gautier, G. (1998), Building a Kurdish language corpus. Paper presented at ICEMCO 98 6th International Conference and Exhibition on Multilingual Computing. Cambridge, April 1998. Accessed online on 6 December 2004 at http://ggautierk.free.fr/e/icem_98.htm.

Gillard, P./Gadsby, A. (1998), Using a learners' corpus in compiling ELT dictionaries. In: Granger, S. (ed) Learner English on Computer. London: Longman, 159-171.

Glover, W. (1998), Toward a Nepali national corpus. In: Yadava, P. & Kansakar, T. (eds) Lexicography in Nepal: Proceedings of the Institute on Lexicography, 1995. Kamaladi, Kathmandu: Royal Nepal Academy, 24-28.

Godfrey, J./Holliman, E. (1997), The Switchboard-1 Telephone Speech Corpus. Linguistic Data Consortium.

Gomez Guinovart, X./Sacau Fontenla, E. (2004), Parallel corpora for the Galician language: building and processing of the CLUVI (Linguistic Corpus of the University of Vigo). In: LREC 2004 Proceedings.

Grabe, E./Post, B./Nolan, F. (2001), The IViE Corpus. Department of Linguistics, University of Cambridge.

Granger, S. (2003), The International Corpus of Learner English: a new resource for foreign language learning and teaching and second language acquisition research. In: TESOL Quarterly 37(3), 538-546.

Granger, S./Tyson, S. (1996), Connector usage in the English essay writing of native and non-native EFL speakers of English. In: World Englishes 15(1), 17-27.

Greenbaum, S./Svartvik, J. (1990), The London-Lund Corpus of Spoken English. In: Svartvik, J. (ed) The London Lund Corpus of Spoken English: Description and Research [Lund Studies in English 82]. Lund: Lund University Press.

Gu, Y. (2003), Exploring multi-modal corpus segmentation and annotation. Talk given at the Corpus Research Group, Lancaster University on 8 December 2003.

Guerra, L. (1998), Research in language and literature: Old problems, new solutions. Paper presented at the conference of the Future of Humanities in the Digital Age. Bergen, 25-26 September 1998. Accessed online on 6 December 2004 at http://ultibase.rmit.edu.au/Articles/dec98/guerra1.htm.

Gui, S./Yang, H. (2002), Chinese Learner English Corpus. Shanghai: Shanghai Foreign Language Education Press.

Haslerud, V./Stenström, A. (1995), The Bergen Corpus of London Teenager Language (COLT). In: Leech, G., Myers, G. & Thomas, J. (eds) Spoken English on Computer. Transcription, Mark-up and Application. London: Longman, 235-242.

Hatzigeorgiu, N./Gavrilidou, M./Piperidis, S./Carayannis, G./Papakostopoulou, A./Spiliotopoulou, A./Vacalopoulou, A./Labropoulou, P./Mantzari, E./Papageorgiou, H./Demiros, I. (2000), Design and Implementation of the Online ILSP Greek Corpus. In: Proceedings of LREC 2000.

Holmes, J./Vine, B./Johnson, G. (1998), Guide to the Wellington Corpus of Spoken New Zealand English. Victoria University of Wellington.

Holmes-Higgin, P./Abidi, S./Ahmad, K. (1994), A description of texts in a corpus: Virtual and Real corpora. In: Martin, W., Meijs, W., Moerland, M., ten Pas, E., van Sterkenburg, P. & Vossen, P. (eds) EURALEX'94 Proceedings. Amsterdam: Vrije Universiteit, 390-402.

Horvath, J. (1999), Advanced Writing in English as a Foreign Language, A Corpus-based Study of Processes and Products. PhD thesis. Janus Pannonius University.

Huang, C./Chen, K. (1995/1998), CKIP Technical Report 95-02/98-04. Taipei: Academia Sinica.

Huang, C./Chen, F./Chen, K./Gao, Z./Chen, K. (2000), Sinica Treebank: design criteria, annotation guidelines, and on-line interface. In: Bagga, A., Pustejovsky, J. & Zadrozny, W. (eds) Proceedings of NAACL-ANLP 2000 Workshop: Syntactic and Semantic Complexity in Natural Language Processing Systems. Seattle.

Hundt, M./Sand, A./Siemund, R. (1998), Manual of Information to Accompany the Freiburg-LOB Corpus of British English (FLOB). Accessed online on 6 December 2004 at http://www.hit.uib.no/icame/flob/index.htm.

Hundt, M./Sand, A./Skandera, P. (1999), Manual of Information to Accompany the Freiburg-Brown Corpus of American English (Frown). Accessed online on 6 December 2004 at http://khnt.hit.uib.no/icame/manuals/frown/INDEX.HTM.

Izumi, E./Isahara, H. (2004), Investigation into language learners' acquisition order based on the error analysis of the learner corpus. Paper presented at IWLeL 2004. Tokyo, December 2004.

Izumi, E./Uchimoto, K./Isahara, H. (2004), SST speech corpus of Japanese learners' English and automatic detection of learners' errors. In: ICAME Journal 28, 31-48.

Johansson, S./Ebeling, J./Oksefjell, S. (2002), English-Norwegian Parallel Corpus: Manual. Oslo: University of Oslo.

Johansson, S./Leech, G./Goodluck, H. (1978), Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers. Oslo: University of Oslo.

Kang, B./King, H. (2004), Sejong Korean corpus in the making. In: Proceedings of LREC 2004. 1747-1750.

Kruyt, J. (1995), Nationale tekstcorpora in internationaal perspectief. In: Forum der Letteren 36(1), 47-58.

Kruyt, J./Dutilh, M. (1997), A 38 million words Dutch text corpus and its users. In: Lexikos 7 (Afrilex-reeks/series 7: 1997), 229-244.

Kruyt, J./Raaijmakers, S./van der Kamp, P./van Strien, R. (1996), Language resources for language technology. In: Proceedings of the first TELRI European Seminar. Tihany, 173-178.

Kučera, H./Francis, W. (1967), Computational Analysis of Present-day English. Providence: Brown University Press.

Kučera, K. (2002), The Czech National Corpus: principles, design, and results. In: Literary and Linguistic Computing 17(2), 245-257.

Kytö, M. (1996), Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding Conventions and Lists of Source Texts. Helsinki: University of Helsinki. Accessed online on 6 December 2004 at http://khnt.hit.uib.no/icame/manuals/HC/INDEX.HTM.

Laitinen, M. (2002), Extending the Corpus of Early English Correspondence to the 18th century. In: Helsinki English Studies 2002(2).

Lee, D. (2001), Genres, registers, text types, domains, and styles: clarifying the concepts and navigating a path through the BNC jungle. In: Language Learning and Technology 5(3), 37-72.

Lewandowska-Tomaszczyk, B. (2003), The PELCRA project state of art. In: Lewandowska-Tomaszczyk, B. (ed) Practical Applications in Language and Computers. Frankfurt: Peter Lang, 105-121.

MacWhinney, B. (1995), The CHILDES project: Tools for Analyzing Talk. Hillsdale, NJ: Erlbaum.

Malten, T. (1998), Tamil studies in Germany. Lecture given at Max Mueller Bhavan, Chennai on 17 March 1998. Accessed online on 6 December 2004 at http://www.uni-koeln.de/phil-fak/indologie/kolam/kolam2/malten.html.

Marcus, M. (1999), Manual of ICAMET (Innsbruck Computer-Archive of Machine-Readable English Texts). Innsbrucker Beitraege zur Kulturwissenschaft, Anglistische Reihe, vol. 7. Innsbruck: Leopold-Franzens-Universitaet Innsbruck, Institut fuer Anglistik.

Marcus, M./Santorini, B./Marcinkiewicz, M. (1993), Building a large annotated corpus of English: The Penn Treebank. In: Computational Linguistics 19, 313-330.

Marcus, M./Kim, G./Marcinkiewicz, M./MacIntyre, R./Bies, A./Ferguson, M./Katz, K./Schasberger, B. (1994), The Penn Treebank: Annotating predicate-argument structure. In: ARPA Human Language Technology Workshop.

McEnery, A./Xiao Z./Mo L. (2003), Aspect marking in English and Chinese: using the Lancaster Corpus of Mandarin Chinese for contrastive language study. In: Literary and Linguistic Computing 18(4), 361-378.

Milton, J./Chowdhury, N. (1994), Tagging the interlanguage of Chinese learners of English. In: Flowerdew, L. & Tong, A. (eds) Entering Text. Hong Kong: The Hong Kong University of Science and Technology, 127-143. 

Nelson, G. (1996), The design of the corpus. In: Greenbaum, S. (ed) Comparing English Worldwide: the International Corpus of English. Oxford: Clarendon Press, 27-35.

Nelson, G./Wallis, S./Aarts, B. (ed) (2002), Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.

Nevalainen, T. (2000), Gender differences in the evolution of standard English. In: Journal of English Linguistics 28(1), 38-59.

Reppen, R./Ide, N. (2004), The American National Corpus: overall goals and the first release. In: Journal of English Linguistics 32(2), 105-113.

Rissanen, M. (2000), The world of English historical corpora. In: Journal of English Linguistics 28(1), 7-20.

Riza, H. (1999), The Indonesia National Corpus and Information Extraction Project (INC-IX). Jakarta: BPP Teknologi.

Rossini Favretti, R./Tamburini, F./De Santis, C. (2004), A corpus of written Italian: a defined and a dynamic model. In: Wilson, A., Rayson, P. & McEnery, T. (eds) A Rainbow of Corpora: Corpus Linguistics and the Languages of the World. Munich: Lincom-Europa.

Sampson, G. (1987), The grammatical database and parsing scheme. In: Garside, R., Leech, G. & Sampson, G. (eds) The Computational Analysis of English. London: Longman, 82-96.

Sampson, G. (1995), English for the Computer: The SUSANNE Corpus and Analytic Scheme. Oxford: Clarendon Press.

Sampson, G. (2000), CHRISTINE Corpus, Stage I: Documentation. Sussex: University of Sussex.

Sampson, G. (2003), The LUCY Corpus: Documentation. Sussex: University of Sussex.

Sánchez, M. (2002), CREA: Reference corpora for current Spanish. In: Proceedings of Language Corpora: Present and Future. Donostia, 24-25 October 2002.

Santos, D./Rocha, P. (2001), Evaluating CETEMPúblico, a free resource for Portuguese. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, ACL'2001. Toulouse, 9-11 July 2001. 442-449.

Schmied, J. (1994), The Lampeter Corpus of Early Modern English Tracts. In: Kytö, M., Rissanen, M. & Wright, S. (eds) Corpora Across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora. St Catharine's College Cambridge, 25-27 March 1993. Amsterdam: Rodopi.

Schneider, P. (2002), Computer assisted spelling normalization of 18th century English. In: Peters, P., Collins, P. & Smith, A. (eds) New Frontiers of Corpus Research. Amsterdam: Rodopi, 199-214.

Scott, M. (1999), WordSmith Tools. Oxford: Oxford University Press.

Sharoff, S. (2004), Methods and tools for development of the Russian Reference Corpus. In: Archer, D., Wilson, A. & Rayson, P. (eds) Corpus Linguistics Around the World. Amsterdam: Rodopi.

Souter, C. (1993), Towards a standard format for parsed corpora. In: Aarts, J., Haan, P. & Oostdijk, N. (eds) English Language Corpora: Design, Analysis and Exploitation. Amsterdam: Rodopi, 197-214.

Stern, K. (1997), The Longman Spoken American Corpus: providing an in-depth analysis of everyday English. In: Longman Language Review Issue 3. Accessed online on 7 December 2004 at http://www.longman.com/dictionaries/llreview/r3stern.html.

Stibbard, R. (2001), Vocal Expression of Emotions in Non-laboratory Speech: An Investigation of the Reading/Leeds Emotion in Speech Project Annotation Data. PhD thesis. University of Reading.

Taylor, A. (2000), The Penn-Helsinki Parsed Corpus of Middle English 2. University of Pennsylvania.

Taylor, L./Knowles, G. (1988), Manual of Information to Accompany the SEC Corpus: The machine readable corpus of spoken English. University of Lancaster.

Thompson, H./Armstrong-Warwick, S./McKelvie, D./Petitpierre, D. (1994), Data in your Language: the ECI Multilingual Corpus 1. In: Proceedings of the International Workshop on Shareable Natural Language Resources. Nara, 1994.

Tsou, B./Tsoi, W./Lai, T./Hu, J./Chan, S. (2000), LIVAC, a Chinese synchronous corpus, and some applications. In: Proceedings of the ICCLC International Conference on Chinese Language Computing. Chicago. 233-238.

van Bergen, L./Denison, D. (2004), A corpus of late eighteenth century prose. In: Beal, J., Corrigan, K. & Moisl, H. (eds) Models and Methods in the Handling of Unconventional Digital Corpora vol. 2. Diachronic Corpora. Palgrave.

Váradi, T. (2002), The Hungarian National Corpus. In: Proceedings of the Third International Conference on Language Resources and Evaluation. Las Palmas, Spain. 385-389.

Wang, J. (2001), Recent progress in corpus linguistics in China. In: International Journal of Corpus Linguistics 6(2), 281-304.

Wang, K. (ed.) (2004), The Development of the Compilation and Application of Parallel Corpora. Beijing: Foreign Language Education and Research Press.

Wittenburg, P./Brugman, H./Broeder, D. (2000), Summary. In: Proceedings of LREC 2000 Pre-conference Workshop on Meta-Descriptions for Multi-media Language Resources.

Xue, N./Xia, F./Chiou, F./Palmer, M. (2004), The Penn Chinese TreeBank: phrase structure annotation of a large corpus. In: Natural Language Engineering 10(4), 1-30.

 


Copyright © 2006 Taylor & Francis Group plc