Arabic Trends: a daily-updated monitor corpus of news articles
The Arabic Trends corpus is a large-scale Arabic monitor corpus of news articles and other regularly updated web sources collected via RSS feeds (newsfeeds). As a timestamped corpus, it is updated daily, adding approximately 1–2 million words per day.
Its continuous growth makes it ideal for diachronic analysis, enabling users to track language change and word usage over time, identify Arabic trending words, and explore emerging vocabulary and neologisms using tools such as the timeline function show_chart .
Corpus size and growth
The Arabic Trends corpus has been continuously updated since 2014. New data is added twice a week, increasing the corpus by around 6 million words each time. As of March 2026, the Arabic Trends corpus exceeds 7 billion words, making it one of the largest Arabic corpora in Sketch Engine.
Arabic Trends corpus statistics:
| Number of words: | 7+ billion |
| Number of tokens: | 8.1+ billion |
| Number of sentences: | 285+ million |
| Number of documents: | 35+ million |
Linguistic annotationPart-of-speech tagset
The Arabic Trends corpus is part-of-speech tagged using the CAMeL NLP toolkit. The POS tagset used in the corpus is the CAMeL tagset, indicating parts of speech and grammatical categories. It also includes lemmatization, assigning each word form to its base form (lemma), which supports advanced corpus queries and linguistic analysis.
How is the Arabic Trends corpus updated?
New texts are collected every four hours. Twice a week (Tuesday and Friday), all newly collected texts are processed and added to the corpus. To ensure data quality, duplicate and near-duplicate texts are removed within each month. However, identical texts appearing in different months are preserved, allowing accurate temporal (timestamp-based) analysis.
Search the Arabic Trends corpus
Sketch Engine offers a range of tools to work with this Arabic Trends corpus from news articles.
Tools to work with the Arabic Trends corpus
A complete set of tools is available to work with this Arabic Trends corpus from news to generate:
- word sketch – Arabic collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word units
- word lists – lists of Arabic nouns, verbs, adjectives, etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- trends – diachronic analysis automatically identifies neologisms and changes in use
- text type analysis – statistics of metadata in the corpus
Bibliography
Arabic Trends corpus – Trends corpora
Herman, O., Jakubíček, M., Kraus, J., & Suchomel, V. (2025). From word of the year to word of the week: Daily-updated monitor corpora for 25 languages. In Proceedings of the eLex 2025 conference (pp. 44–61).
Herman, O., Kraus, J., & Suchomel, V. (2026). FeedFetcher: A resilient web feed downloader for corpus construction. In Proceedings of the Fifteenth International Conference on Language Resources and Evaluation (LREC 2026). (Forthcoming)
Corpus building
SUCHOMEL, Vít. Better Web Corpora For Corpus Linguistics And NLP. 2020. Available also from: https://is.muni.cz/th/u4rmz/. Doctoral thesis. Masaryk University, Faculty of Informatics, Brno. Supervised by Pavel RYCHLÝ.
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).
Trends – diachronic analysis
Adam Kilgarriff, Ondřej Herman, Jan Bušta, Pavel Rychlý and Miloš Jakubíček. DIACRAN: a framework for diachronic analysis. In Corpus Linguistics (CL2015), United Kingdom, July 2015.
Ondřej Herman and Vojtěch Kovář. Methods for Detection of Word Usage over Time. In Seventh Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2013. Brno: Tribun EU, 2013, pp. 79–85. ISBN 978-80-263-0520-0.
Genre annotation
SUCHOMEL, Vít. Genre Annotation of Web Corpora: Scheme and Issues. In Kohei Arai, Supriya Kapoor, Rahul Bhatia. Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1. Vancouver, Canada: Springer Nature Switzerland AG, 2021. s. 738-754. ISBN 978-3-030-63127-7. doi:10.1007/978-3-030-63128-4_55.
Use Sketch Engine in minutes
Generate collocations, frequency lists, examples in contexts, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.





