Corpus Factory is a method for developing large general-language corpora that can be applied to many languages.

Corpus Factory performs the following steps to collect a corpus for a language:

  • Download a Wikipedia dump and parse it to obtain a Wiki corpus
  • Generate a frequency list for the language from the Wiki corpus
  • Build queries from the mid-frequency words in the frequency list
  • Send the queries to Bing, Google or Yahoo, and download the search hit pages
  • Clean the corpus
    • Remove boilerplate text (HTML tags and advertisements)
    • Using the Wiki frequency list, compute the ratio of frequent to non-frequent words on each page to determine whether the page contains continuous (i.e. meaningful) text
    • Remove duplicates
  • Tokenise and, if tools are available, lemmatise and part-of-speech tag the text
  • Load into our corpus query tool, Sketch Engine
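
The frequency-list, query-building and frequent-word filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the Corpus Factory implementation: the function names, the regex tokeniser, the rank cut-offs (skip_top, take), the query length and the min_ratio threshold are all assumptions chosen for the example. Querying a search engine and downloading pages are left out, since they depend on the API of the engine used.

```python
import random
import re
from collections import Counter


def frequency_list(texts):
    """Count lower-cased word tokens over an iterable of plain-text documents."""
    counts = Counter()
    for text in texts:
        # Simplistic tokeniser (an assumption); real languages may need more.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def mid_frequency_words(counts, skip_top=1000, take=5000):
    """Return words ranked from skip_top to skip_top + take by frequency.

    Both cut-offs are illustrative values, not taken from the source.
    """
    ranked = [word for word, _ in counts.most_common()]
    return ranked[skip_top:skip_top + take]


def build_queries(seed_words, n_queries=10, words_per_query=3, rng=None):
    """Build queries by sampling distinct seed words at random."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    return [
        " ".join(rng.sample(seed_words, words_per_query))
        for _ in range(n_queries)
    ]


def looks_like_running_text(page_text, frequent_words, min_ratio=1.0):
    """Keep a page if frequent tokens >= min_ratio * non-frequent tokens.

    frequent_words is a set taken from the top of the Wiki frequency list;
    min_ratio is a hypothetical threshold that would need tuning.
    """
    tokens = re.findall(r"\w+", page_text.lower())
    if not tokens:
        return False
    frequent = sum(1 for token in tokens if token in frequent_words)
    non_frequent = len(tokens) - frequent
    # Multiply instead of dividing so a page with no rare words is safe.
    return frequent >= min_ratio * non_frequent
```

In use, frequency_list would be run over the parsed Wiki corpus, mid_frequency_words would supply the seed words for build_queries, and looks_like_running_text would be applied to each downloaded page after boilerplate removal.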

Bibliography

Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A corpus factory for many languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.