KSUCCA: King Saud University Corpus of Classical Arabic
The King Saud University Corpus of Classical Arabic (KSUCCA) is a language corpus made up of Classical Arabic texts dating from the 7th to the early 11th century. The corpus consists of 46 million words and was created as part of the Ph.D. work of Maha Alrabiah. It contains texts from a wide range of genres, such as Religion, Linguistics, Literature, Science, Sociology, and Biography, with a further division into subgenres.
Texts were lemmatized and POS-tagged by Yonatan Belinkov using the MADA tools from Columbia University. See the POS tagset description.
Tools to work with the Arabic KSUCCA corpus
A complete set of Sketch Engine tools is available to work with this corpus of Classical Arabic to generate:
- word sketch – Arabic collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of Arabic nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- keywords – terminology extraction of one-word units
- text type analysis – statistics of metadata in the corpus
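To make the word list and n-gram items above more concrete, here is a minimal Python sketch of how such frequency lists can be computed from a tokenized text. It is an illustration only, not Sketch Engine's implementation: the function name `ngram_frequencies` and the toy token sequence are hypothetical, and the tokens are placeholders rather than data from KSUCCA.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens in the sequence."""
    # zip over n shifted copies of the token list yields each n-gram as a tuple
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)


# Toy placeholder tokens (assumption: NOT taken from the corpus).
tokens = ["a", "b", "a", "b", "c"]

word_list = Counter(tokens)              # one-word frequency list
bigrams = ngram_frequencies(tokens, 2)   # frequency list of two-word units

# Print each list ordered from most to least frequent.
print(word_list.most_common())
print(bigrams.most_common())
```

On a real corpus the tokens would come from the lemmatized, POS-tagged text, so the same counting could be restricted to nouns, verbs, or adjectives only.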
Alrabiah, M., Al-Salman, A., & Atwell, E. S. (2013). The design and construction of the 50 million words KSUCCA. In Proceedings of WACL’2 Second Workshop on Arabic Corpus Linguistics (pp. 5-8). The University of Leeds.
Search the corpus of Classical Arabic
Sketch Engine offers a range of tools to work with the KSUCCA corpus.
Use Sketch Engine in minutes
Generating collocations, frequency lists, examples in context, and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.