Corpus Factory Method | Sketch Engine

This page describes a corpus-building method that Sketch Engine no longer uses. Sketch Engine still contains older corpora built with this method, mainly the WaC corpora.
Sketch Engine now builds corpora using the method developed for the TenTen corpora and described here.

Corpus Factory is a method for developing large general-language corpora that can be applied to many languages.

Corpus Factory performs the following steps to collect a corpus of a language:

Download a Wikipedia dump and parse it to obtain a Wiki corpus
Generate a frequency list for the language from the Wiki corpus
Build queries from the mid-frequency words in the frequency list
Send the queries to Bing, Google or Yahoo, and download the search hit pages
Clean the corpus
- Remove boilerplate text (HTML tags and advertisements)
- Using the Wiki frequency list, compute the ratio of frequent words to non-frequent words and determine whether a page contains continuous (i.e. meaningful) text
- Remove duplicates
Tokenise and, if tools are available, lemmatise and part-of-speech tag the text
Load into our corpus query tool, Sketch Engine
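
The query-building and cleaning steps above can be illustrated with a minimal Python sketch. It is not the Corpus Factory implementation: every function name, the frequency band (`skip_top`, `take`), the three-word query length and the 0.3 threshold are assumptions chosen for illustration, and the filter uses the share of frequent words among all tokens as a simplified stand-in for the frequent to non-frequent ratio. Downloading, boilerplate removal and near-duplicate detection are left out.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Very rough tokeniser: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def frequency_list(wiki_text):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(tokenize(wiki_text)).most_common()


def mid_frequency_words(freq_list, skip_top=1000, take=5000):
    """Pick a band of mid-frequency words (band limits are assumptions)."""
    return [word for word, _ in freq_list[skip_top:skip_top + take]]


def build_queries(seed_words, n_queries, words_per_query=3, seed=0):
    """Combine random seed words into distinct multi-word queries."""
    if len(seed_words) < words_per_query:
        return []
    rng = random.Random(seed)
    seen = set()
    # Bounded number of attempts so the loop always terminates.
    for _ in range(n_queries * 10):
        if len(seen) >= n_queries:
            break
        seen.add(tuple(sorted(rng.sample(seed_words, words_per_query))))
    return [" ".join(combo) for combo in sorted(seen)]


def frequent_word_share(page_text, frequent_words):
    """Share of tokens on the page that are in the frequent-word set."""
    tokens = tokenize(page_text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in frequent_words) / len(tokens)


def looks_like_running_text(page_text, frequent_words, threshold=0.3):
    """Keep a page only if enough of its tokens are frequent words.

    The threshold is a hypothetical value, not taken from the source.
    """
    return frequent_word_share(page_text, frequent_words) >= threshold


def remove_exact_duplicates(pages):
    """Drop pages whose whitespace- and case-normalised text was seen before."""
    seen, unique = set(), []
    for page in pages:
        key = " ".join(page.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique
```

In use, the frequency list would come from the Wiki corpus, the resulting queries would be sent to a search engine, and each downloaded page would pass through `looks_like_running_text` and `remove_exact_duplicates` before tokenisation. The idea behind the filter is that running text in any language contains a high proportion of very common words, whereas navigation menus and link lists do not.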

Bibliography

Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A corpus factory for many languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.
