etTenTen: Corpus of the Estonian Web
The Estonian Web Corpus (etTenTen) is an Estonian-language corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 40 languages.
The Estonian Web 2019 corpus was crawled by the SpiderLing web spider from September 2019 to January 2020. The final corpus contains 500+ million words. It also includes semi-automatically detected text types such as news, blogs, discussion, education, …
Detailed information about TenTen corpora is available on the separate page Common TenTen corpora attributes.
Part-of-speech tagset
The etTenTen corpus was annotated with the Estonian NLTK tagger using the Estonian Filosoft tagset.
Overview of Estonian TenTen corpora
The Estonian web corpora have been crawled and processed several times over the years:
- Estonian Web corpus 2019 (etTenTen19) – 508 million words (September 2019 – January 2020; semi-automatically detected Text types)
- Estonian Web corpus 2017 (etTenTen17) – 658 million words (July–November 2017)
- Estonian Web corpus 2013 (etTenTen13) – 260 million words
etTenTen corpus in detail
The chart shows the distribution of the parts of speech in the Estonian Web corpus 2019.
Basic statistics of Estonian Web corpus 2019
Basic information
Item | Frequency |
Tokens | 622,999,541 |
Words | 508,447,009 |
Sentences | 41,819,737 |
Web pages | 2,535,829 |
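The counts in the table can be combined into a few derived figures, such as average sentence length or average page size. The following is a minimal Python sketch that uses only the four numbers listed above; the variable and function names are illustrative and not part of any Sketch Engine API. It assumes that the difference between tokens and words is made up of punctuation and other non-word tokens, which the page itself does not state.

```python
# Counts copied from the "Basic information" table (Estonian Web corpus 2019).
TOKENS = 622_999_541
WORDS = 508_447_009
SENTENCES = 41_819_737
WEB_PAGES = 2_535_829


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator as a float."""
    return numerator / denominator


# Share of tokens that are counted as words (assumption: the rest is
# punctuation and other non-word tokens).
words_per_token = ratio(WORDS, TOKENS)

# Average sentence length in tokens.
tokens_per_sentence = ratio(TOKENS, SENTENCES)

# Average size of a crawled web page.
words_per_page = ratio(WORDS, WEB_PAGES)
sentences_per_page = ratio(SENTENCES, WEB_PAGES)

print(f"words per token:     {words_per_token:.3f}")
print(f"tokens per sentence: {tokens_per_sentence:.2f}")
print(f"words per page:      {words_per_page:.1f}")
print(f"sentences per page:  {sentences_per_page:.2f}")
```

On these figures, roughly four out of five tokens are words, a sentence averages just under 15 tokens, and a page averages about 200 words.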
Tools to work with the Estonian Web corpora
A complete set of Sketch Engine tools is available for working with these Estonian corpora. The tools generate:
- word sketch – Estonian collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of Estonian nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
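To make three of the outputs above concrete (word lists, n-grams and concordance), here is a toy Python sketch run on a single invented Estonian sentence. It is not how Sketch Engine computes these results: the whitespace tokenizer, the function names and the sample sentence are all assumptions made for illustration, and a real corpus would rely on proper tokenization and the tagging described on this page.

```python
from collections import Counter

# Invented sample sentence (assumption, not taken from the corpus).
SAMPLE = "ma armastan eesti keelt ja ma armastan eesti kultuuri"


def tokenize(text: str) -> list[str]:
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def ngram_frequencies(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens: list[str], keyword: str, width: int = 2):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


tokens = tokenize(SAMPLE)
word_list = Counter(tokens)             # word list: frequency per word form
bigrams = ngram_frequencies(tokens, 2)  # n-grams with n = 2
hits = concordance(tokens, "eesti")     # keyword in context

print(word_list.most_common(3))
print(bigrams.most_common(2))
for left, keyword, right in hits:
    print(f"{left:>15} [{keyword}] {right}")
```

The word sketch, thesaurus and keyword tools depend on grammatical relations and reference corpora, so they are not approximated here.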
Changelog
etTenTen 2017 (February 2021)
- tagged with the Filosoft tagset (Estonian NLTK + Filosoft pipeline v.2)
etTenTen 2019 (January 2021)
- crawled by SpiderLing from September 2019 to January 2020
- 622 million tokens
- tagged with the Filosoft tagset (Estonian NLTK + Filosoft pipeline v.2)
- semi-automatically detected Text types
etTenTen 2017 (February 2018)
- crawled by SpiderLing from July to November 2017
- 807 million tokens
etTenTen 2013 (May 2017)
- new word sketches
etTenTen 2013 (May 2014)
- tagging & word sketches
- tagged with the Filosoft tagset (Estonian NLTK + Filosoft pipeline v.1)
etTenTen 2013 (March 2013)
- obtained from the web in January 2013
- 260 million words
- no tagging
Bibliography
TenTen corpora
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).