Corpus Factory Method | Sketch Engine

This page describes a corpus-building method that Sketch Engine no longer uses. Sketch Engine still contains older corpora built with this method, mainly the WaC corpora.
Sketch Engine now builds corpora using the method developed for the TenTen corpora and described here.

Corpus Factory is a method for developing large general-language corpora that can be applied to many languages.

Corpus Factory performs the following steps to collect a corpus for a language:

1. Download a Wikipedia dump and parse it to obtain a Wiki corpus
2. Generate a frequency list for the language from the Wiki corpus
3. Build queries from the mid-frequency words in the frequency list
4. Send the queries to Bing, Google or Yahoo and download the search hit pages
5. Clean the corpus
   - Remove boilerplate text (HTML tags and advertisements)
   - Using the Wiki frequency list, compute the ratio of frequent words to non-frequent words and determine whether a page contains continuous (i.e. meaningful) text
   - Remove duplicates
6. Tokenise and (if tools are available) lemmatise and part-of-speech tag
7. Load the corpus into our corpus query tool, Sketch Engine
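
The query-building and cleaning steps above can be illustrated with a minimal Python sketch. It is not the Corpus Factory implementation: the function names, the regex tokeniser, the parameters `skip_top`, `take`, `words_per_query` and `min_ratio`, and the exact-match duplicate check are all assumptions made for illustration. Downloading the Wikipedia dump and sending queries to a search engine are left out because they depend on external services.

```python
import random
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercased word tokens; a stand-in for a language-specific tokeniser."""
    return TOKEN_RE.findall(text.lower())


def frequency_list(texts):
    """Words sorted by descending frequency, ties broken alphabetically."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked]


def mid_frequency_words(freq_list, skip_top, take):
    """Skip the `skip_top` most frequent words, keep the next `take` as seeds.

    Both cut-offs are hypothetical and would be tuned per language.
    """
    return freq_list[skip_top:skip_top + take]


def build_queries(seeds, words_per_query, n_queries, rng):
    """Combine randomly chosen seed words into distinct search queries."""
    queries = set()
    for _ in range(n_queries * 20):  # bounded number of attempts
        if len(queries) >= n_queries:
            break
        queries.add(" ".join(sorted(rng.sample(seeds, words_per_query))))
    return sorted(queries)


def looks_like_running_text(page_text, frequent_words, min_ratio):
    """Keep a page if frequent / non-frequent tokens reaches `min_ratio`."""
    tokens = tokenize(page_text)
    if not tokens:
        return False
    frequent = sum(1 for tok in tokens if tok in frequent_words)
    non_frequent = len(tokens) - frequent
    if non_frequent == 0:
        return True  # every token is a frequent word
    return frequent / non_frequent >= min_ratio


def remove_duplicates(pages):
    """Drop pages whose token sequence was already seen (exact match only)."""
    seen, unique = set(), []
    for page in pages:
        key = " ".join(tokenize(page))
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique


if __name__ == "__main__":
    wiki = ["the cat sat on the mat", "the dog sat on the log"]
    freq = frequency_list(wiki)
    seeds = mid_frequency_words(freq, skip_top=1, take=4)
    print(build_queries(seeds, 2, 3, random.Random(0)))
```

The toy cut-offs in the usage example only make sense for a tiny input. On a real Wiki corpus the seed band and the ratio threshold would need to be chosen empirically, and near-duplicate detection would replace the exact-match check.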

Bibliography

Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A corpus factory for many languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.
