Using Corpora for Language Learning and Teaching

TESOL Press

TESOL Press Companion Webpage

Annotated List of Useful Corpora, Tools, and Resources

Large General Corpora
Small General Corpora
Spoken Corpora
Written English Corpora
Learner English Corpora
Corpus Search, Analysis, and Tagging Tools
Corpus-Based Language Learning Resources and Tools
Key Word Lists

Free corpora and corpus tools are marked with an asterisk (*).

Large General Corpora

British National Corpus* (BNC) was developed jointly by three publishers (Oxford University Press, Longman, and W & R Chambers), two universities (Oxford University and Lancaster University), and the British Library. It contains a little over 100 million running words of language data produced between 1980 and 1993 and is divided into seven registers or subcorpora: spoken, fiction, magazine, newspapers, academic writing, nonacademic writing, and miscellaneous. As such, BNC is an excellent source for studying contemporary British English. The corpus can be downloaded at http://ota.ox.ac.uk/desc/2554. It can also be accessed and searched online at several portals, including Mark Davies’s BYU portal, which is equipped with a powerful and user-friendly search engine that can perform many query functions, such as finding and generating collocations, keywords in context, and frequency lists.

Corpus of Contemporary American English* (COCA) is a continuously expanding corpus developed by Mark Davies. Currently, it contains 520 million words, with approximately 20 million words for each year from 1990 through 2015 (data are expected to be added at the same rate in future years). Like BNC, COCA contains several subcorpora, including spoken, fiction, magazine, newspaper, and academic writing. Its search engine is the same as the one used on Mark Davies’s BNC portal.

Corpus of Global Web-based English* (GloWbE) is another corpus provided by Mark Davies. It consists of 1.9 billion words gathered from 1.8 million web pages from 20 English-speaking countries, including both inner-circle English-speaking countries (e.g., Great Britain, United States) and outer-circle English-speaking countries (e.g., India, Singapore).

Small General Corpora

Brown Corpus of Standard American English*, developed in the 1960s, is the first of the modern, computer-readable, general corpora. It contains one million words from 500 texts of 2,000 words each covering 15 categories of texts, including news reports and fiction.

Lancaster-Oslo/Bergen (LOB) Corpus of British English was developed as a British English counterpart to the Brown Corpus, with exactly the same number of texts, the same text length, and the same number of categories.

Crown and CLOB corpora* were developed at Beijing Foreign Studies University. Mirroring the structure and size of the Brown and LOB corpora, with approximately one million running words each, the Crown and CLOB corpora are intended to provide sets of representative samples of contemporary American and British English, respectively.

Spoken Corpora

Corpus of Spoken Professional American English is a two-million-word corpus, with one million words coming from presentations and discussions at professional conferences and one million coming from question-and-answer sessions at White House press briefings and conferences.

Michigan Corpus of Academic Spoken English* (MICASE) is a two-million-word corpus developed by the English Language Institute at the University of Michigan. The corpus is composed of 152 transcripts that are well balanced in terms of academic disciplines and functions. It provides academic spoken English in various functions, including classroom instruction, group discussion, presentation, and advising sessions. Therefore, the corpus is of particular use for studying academic spoken English.

Santa Barbara Corpus of Spoken American English* is a 249,000-word corpus composed of transcribed naturally occurring spoken interactions from all over the United States. With conversational English as its data, this corpus is particularly useful for learners interested in conversational American English.

Spoken subcorpora of BNC and COCA* (freely available as noted above) are each a large spoken English corpus. Note, however, that while the spoken subcorpus of BNC covers different forms of spoken English, including both formal speeches and informal private conversations, the spoken subcorpus of COCA contains exclusively broadcast English.

Written English Corpora

British Academic Written English Corpus* (BAWE) is a corpus of academic writing by students who are native speakers of British English at both the undergraduate and master’s levels. It consists of 2,761 texts from arts and humanities, social sciences, life sciences, and physical sciences. BAWE is a valuable resource for the study of tertiary-level academic English writing.

Michigan Corpus of Upper-Level Student Papers* (MICUSP) is a written English corpus developed by the English Language Institute at the University of Michigan. The corpus has 2.6 million words made up of 829 A-graded papers written by fourth-year undergraduate students and graduate students across 16 disciplines. It is searchable by a variety of categories such as topic, genre, and discipline. As such, the corpus should be very useful for ESL/EFL writers learning to write academic papers.

Academic English subcorpora of BNC and COCA* (freely available as noted above) are each a written English corpus. Similarly, the fiction, magazine, and newspaper subcorpora in both BNC and COCA are also each a written English corpus, but they represent different writing genres and registers.

Business Letter Corpus* (BLC) is a corpus of business letters of both U.S. and U.K. samples developed by Yasumasa Someya in 2000 for his master’s project. The corpus consists of roughly one million words. It is searchable and useful for instructors and students of business English.

Learner English Corpora

International Corpus of Learner English (version 2, ICLEv2) is arguably the best-known learner corpus of EFL/ESL students’ writing. It was compiled by Granger and colleagues at the Université Catholique de Louvain. The ICLEv2 consists of argumentative and descriptive essays by higher intermediate and advanced EFL learners from 16 first language backgrounds, such as Chinese, Czech, Dutch, French, German, Japanese, and Russian. Thus, it is particularly valuable for comparing the language of EFL learners from different language backgrounds.

International Corpus Network of Asian Learners of English* (ICNALE) is a 1.2-million-word written corpus developed by Shin'ichiro Ishikawa. It contains essays written by EFL learners from more than 10 Asian countries.

Louvain International Database of Spoken English Interlanguage (LINDSEI) is a corpus of spoken English by EFL speakers of 11 languages, including Bulgarian, Chinese, French, German, Japanese, and Spanish. Developed by Gaëtanelle Gilquin, Sylvie De Cock, and Sylviane Granger at the Université Catholique de Louvain, the corpus is available for purchase on CD. It contains systematically collected, representative samples of spoken English by EFL learners and comes with a search program and a handbook.

Ten-thousand English Compositions of Chinese Learners* (TECCL Corpus) is a learner corpus of written English by Chinese EFL students developed at Beijing Foreign Studies University. It contains approximately 10,000 essays written by Chinese secondary school and college EFL students.

Corpus Search, Analysis, and Tagging Tools

AntConc* is a corpus search and analysis software program developed by Lawrence Anthony at Waseda University. AntConc provides modules for concordancing corpus queries, developing word lists and keyword lists, compiling lists of clusters/N-grams (formulaic expressions or clusters), and retrieving collocations of target words. Because of its various functions, AntConc is a valuable tool for analyzing corpus data for language teaching and learning purposes.
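To make two of these functions concrete, here is a minimal Python sketch of a keyword-in-context (KWIC) concordance and a cluster/N-gram count. It is not AntConc's code and does not reproduce its options; the function names (tokenize, kwic, ngrams), the simple tokenizing rule, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def ngrams(tokens, n=2):
    """Count contiguous clusters of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical sample text standing in for a real corpus file.
sample = "The corpus is large. The corpus is free, and the tool is free."
tokens = tokenize(sample)

# Print each hit with the search word aligned in the middle.
for left, node, right in kwic(tokens, "corpus", width=2):
    print(f"{left:>20} [{node}] {right}")

# Show the most frequent two-word clusters with their counts.
print(ngrams(tokens, 2).most_common(3))
```

A dedicated tool adds much more than this, such as sorting the context, wildcard searches, and statistical measures for collocations, but the core idea of aligning each hit with its neighboring words is the same.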

AntWordProfiler* is another program designed by Lawrence Anthony. AntWordProfiler calculates the number and coverage/percentage of words in a corpus that belong to the first and second 1,000 most frequent words of the General Service List (West, 1953) and to the Academic Word List (Coxhead, 2000). AntWordProfiler may be considered an alternative, with a modern interface, to the Range software developed by Professor Paul Nation.

CLAWS WWW Tagger* is a part-of-speech tagger. You can enter a corpus of up to 100,000 running words into its designated space, and it will tag your corpus for parts of speech for free. For a corpus larger than 100,000 words, you can purchase a full version of the tagger.

MonoConc Esy, MonoConc Pro, Collocate, and ParaConc are AntConc-like corpus analysis tools developed by Michael Barlow. MonoConc Pro 2.2 (the most current version of MonoConc Pro) has functions comparable to those of AntConc, such as identification of word collocations and wordlist keyword comparisons. MonoConc Esy is a simplified version (free for individual use) of MonoConc Pro; it does not have the advanced query and analysis functions found in MonoConc Pro 2.2. ParaConc is a tool for comparing data in two corpora of different languages.

Range* is a program by Professor Paul Nation for measuring the distribution of a word across different texts in a corpus. The program measures the number and coverage/percentage of words in a corpus. It is especially useful for determining to which frequency group of words a word belongs based on the various existing word lists available, for example, whether a word falls into the first 1,000, second 1,000, or fifth 1,000 most frequent words list (up to the 25th 1,000 words list) based on the 25 sublists of BNC/COCA words.
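The coverage calculation described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the Range program itself: the function name coverage_profile is hypothetical, the two toy bands stand in for real 1,000-word lists, and the sketch matches exact word forms, whereas real frequency lists typically group words into families.

```python
import re
from collections import Counter

def coverage_profile(text, bands):
    """Return {band name: (token count, percent of all tokens)}, plus an 'off-list' row."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for tok in tokens:
        # Assign each token to the first band that contains it.
        for name, words in bands.items():
            if tok in words:
                counts[name] += 1
                break
        else:
            counts["off-list"] += 1
    total = len(tokens)
    names = list(bands) + ["off-list"]
    if total == 0:
        return {name: (0, 0.0) for name in names}
    return {name: (counts[name], round(100 * counts[name] / total, 1)) for name in names}

# Hypothetical toy bands; real lists contain about 1,000 entries each.
bands = {
    "first 1,000": {"the", "is", "a", "of", "and", "in"},
    "second 1,000": {"tool", "list"},
}
text = "The corpus is a tool and the list is a sample of words in the corpus"
profile = coverage_profile(text, bands)

for name, (count, percent) in profile.items():
    print(f"{name:<14}{count:>4}{percent:>8}%")
```

A text with a high off-list percentage is likely to be difficult for learners who know only the most frequent words, which is why this kind of profile is useful when selecting or adapting reading materials.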

WordSmith Tools is an integrated set of corpus analysis programs developed by Mike Scott. Its most current version is Version 7. WordSmith provides search and analysis functions similar to those offered by AntConc and MonoConc Pro, such as data concordancing as well as developing word, key word, and cluster/N-gram lists. WordSmith also offers special functions not found in AntConc, one of which is ConcGram, a tool for retrieving concgrams (i.e., nonconsecutive clusters or N-grams; Cheng, Greaves, & Warren, 2006).
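The difference between a consecutive cluster and a nonconsecutive one can be illustrated with a rough Python sketch. It is a simplification under stated assumptions, not the ConcGram algorithm or WordSmith's implementation: it only counts unordered word pairs that occur within a fixed number of tokens of each other in the same sentence, and the function name pair_cooccurrences and the three sample sentences are hypothetical.

```python
from collections import Counter

def pair_cooccurrences(sentences, window=3):
    """Count unordered word pairs found within `window` tokens of each other in a sentence."""
    counts = Counter()
    for tokens in sentences:
        for i, first in enumerate(tokens):
            for second in tokens[i + 1:i + 1 + window]:
                if first != second:
                    # Sort the pair so that "play ... role" and "role ... play" share one key.
                    counts[tuple(sorted((first, second)))] += 1
    return counts

# Hypothetical tokenized sentences in which "play" and "role" are never adjacent.
sentences = [
    ["play", "a", "key", "role"],
    ["the", "role", "they", "play"],
    ["play", "an", "important", "role"],
]

wide = pair_cooccurrences(sentences, window=3)
adjacent_only = pair_cooccurrences(sentences, window=1)

print(wide[("play", "role")])           # pair found across intervening words, in either order
print(adjacent_only[("play", "role")])  # same pair when only neighboring words are compared
```

An ordinary two-word cluster search corresponds to the adjacent-only case and would miss the association between "play" and "role" in these sentences, while the wider window captures it regardless of word order or intervening words.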

Corpus-Based Language Learning Resources and Tools

Compleat Lexical Tutor* is a website developed by Tom Cobb for language learning and teaching. It provides a variety of resources (e.g., corpora such as Brown and BNC), tools (e.g., Range, VocabProfile), and corpus-based learning functions or activities (e.g., corpus-driven error correction and Concord Writer, which allows learners to write while assisted by lexico-grammatical information accessible online).

Word and Phrase* is a multi-function tool for vocabulary learning. It allows you to check the frequency of any word in the entire COCA, across its registers, and, if the Academic function is selected, across its various academic disciplines. You can also use it to check the vocabulary profile of any text you enter. By selecting the information desired, you can find out how many of the words in the text are in the first 500 words in the Academic Vocabulary List and how many are in the 500–3,000 range of the list. Also, by clicking on any word in the text, you can obtain its frequency information and concordance examples of its use in COCA. Moreover, you can check whether any string of words in the text is an established phrase or multi-word unit; if it is a phrase, then concordance examples of its use in COCA will be displayed.

WordNet* is an English vocabulary database provided by Princeton University. It can function as a dictionary that offers more detailed information than a typical dictionary, including corpus examples when you select the right display option. It is especially useful for understanding synonyms, including synonymous phrases such as phrasal verbs, because it provides very detailed information (with examples) about the relationships among the synonyms or hypernyms/hyponyms in a set. Generally, for language learning and teaching purposes, the “Show key senses” display option should be selected.

Key Word Lists

Academic Word List* (AWL) is a list of 570 highly frequent word families in academic English developed by Averil Coxhead in 2000. The words in the word families are those that occur in a wide range of academic texts. The AWL may be the most influential list of academic words and is a very useful word list for instructors and students of English for academic purposes.

Academic Vocabulary List* (AVL) is a list developed by Dee Gardner and Mark Davies in 2014. It contains over 3,000 word lemmas (not word families) that occur frequently across all the academic disciplines in the academic subcorpus of COCA. The AVL differs from the AWL in several ways. For information about their differences, visit the AVL website.

General Service List (GSL), developed by Michael West in 1953, may be the most influential list of high-frequency words of general English. It includes the first and second 1,000 most frequent words (or word families) in English, and it is of great use for English teaching and material development for beginner learners. However, it has been challenged for its age and subjective selection of words. Thus, a New General Service List (see below) has been compiled recently by Brezina and Gablasova (2015).

New General Service List (New-GSL), developed by Vaclav Brezina and Dana Gablasova in 2015, is composed of more than 2,000 high-frequency words in contemporary English. In contrast to its predecessor (GSL), the New-GSL was developed based on large sets of corpus data with rigid criteria for word extraction.

Mike Nelson's Business English Lexis Site* provides a series of business English word lists developed based on a business English corpus that Mike Nelson built for his doctoral dissertation research, such as “100 Most ‘Key’ Words in the Business English Corpus,” “Positive Business English Key Words,” and “Negative Business English Key Words.” The site also provides free downloadable teaching materials.