Arabic Trends: a daily-updated monitor corpus of news articles

The Arabic Trends corpus is a large-scale Arabic monitor corpus of news articles and other regularly updated web sources collected via RSS feeds (newsfeeds). As a timestamped corpus, it is updated daily, adding approximately 1–2 million words per day.

Its continuous growth makes it ideal for diachronic analysis, enabling users to track language change and word usage over time, identify Arabic trending words, and explore emerging vocabulary and neologisms using tools such as the timeline function show_chart .

Corpus size and growth

The Arabic Trends corpus has been continuously updated since 2014. New data is added twice a week, increasing the corpus by around 6 million words each time. As of March 2026, the Arabic Trends corpus exceeds 7 billion words, making it one of the largest Arabic corpora in Sketch Engine.

Arabic Trends corpus statistics:

Number of words: 7+ billion
Number of tokens: 8.1+ billion
Number of sentences: 285+ million
Number of documents: 35+ million

Linguistic annotationPart-of-speech tagset

The Arabic Trends corpus is part-of-speech tagged using the CAMeL NLP toolkit. The POS tagset used in the corpus is the CAMeL tagset, indicating parts of speech and grammatical categories. It also includes lemmatization, assigning each word form to its base form (lemma), which supports advanced corpus queries and linguistic analysis.

How is the Arabic Trends corpus updated?

New texts are collected every four hours. Twice a week (Tuesday and Friday), all newly collected texts are processed and added to the corpus. To ensure data quality, duplicate and near-duplicate texts are removed within each month. However, identical texts appearing in different months are preserved, allowing accurate temporal (timestamp-based) analysis.

Search the Arabic Trends corpus

Sketch Engine offers a range of tools to work with this Arabic Trends corpus from news articles.

Where to find Trends

The texts in the corpus are timestamped which enables users to use Trends, the diachronic analysis tool for detecting neologisms and studying word usage changes

Arabic Trends corpus – Dashboard

Tools to work with the Arabic Trends corpus

A complete set of tools is available to work with this Arabic Trends corpus from news to generate:

  • word sketch – Arabic collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywordsterminology extraction of one-word units
  • word lists – lists of Arabic nouns, verbs, adjectives, etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis automatically identifies neologisms and changes in use
  • text type analysis – statistics of metadata in the corpus

Arabic Trends corpus – Trends corpora

Herman, O., Jakubíček, M., Kraus, J., & Suchomel, V. (2025). From word of the year to word of the week: Daily-updated monitor corpora for 25 languages. In Proceedings of the eLex 2025 conference (pp. 44–61).

Herman, O., Kraus, J., & Suchomel, V. (2026). FeedFetcher: A resilient web feed downloader for corpus construction. In Proceedings of the Fifteenth International Conference on Language Resources and Evaluation (LREC 2026). (Forthcoming)

Corpus building

SUCHOMEL, Vít. Better Web Corpora For Corpus Linguistics And NLP. 2020. Available also from: https://is.muni.cz/th/u4rmz/. Doctoral thesis. Masaryk University, Faculty of Informatics, Brno. Supervised by Pavel RYCHLÝ.

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Trends – diachronic analysis

Adam Kilgarriff, Ondřej Herman, Jan Bušta, Pavel Rychlý and Miloš Jakubíček. DIACRAN: a framework for diachronic analysis. In Corpus Linguistics (CL2015), United Kingdom, July 2015.

Ondřej Herman and Vojtěch Kovář. Methods for Detection of Word Usage over Time. In Seventh Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2013. Brno: Tribun EU, 2013, pp. 79–85. ISBN 978-80-263-0520-0.

Genre annotation

SUCHOMEL, Vít. Genre Annotation of Web Corpora: Scheme and Issues. In Kohei Arai, Supriya Kapoor, Rahul Bhatia. Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1. Vancouver, Canada: Springer Nature Switzerland AG, 2021. s. 738-754. ISBN 978-3-030-63127-7. doi:10.1007/978-3-030-63128-4_55.

Other text corpora

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in contexts, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.