Corpus of Classical Tibetan

The Annotated Corpus of Classical Tibetan (ACTib) is a corpus containing 170 million words of Classical Tibetan. Version 2.0 has been available in Sketch Engine since November 2020.

Corpus texts were taken from the e-text collection compiled by the Buddhist Digital Resource Center (Zenodo: http://doi.org/10.5281/zenodo.821218). The ACTib corpus was lemmatized and part-of-speech tagged. A word sketch grammar for the Tibetan language was also prepared; it enables users to explore the grammatical and collocational behavior of Tibetan words.

The corpus was built as part of the Tibetan in Digital Communication project. More information about the project and the authors' contact details can be found on the project page.

Part-of-speech tagging

The corpus was lemmatized and tagged using the TreeTagger tool by Helmut Schmid. The TreeTagger model was trained by Yeshe Tenley; the parameter file and training corpus are to be found here. The lexicon, corpus, and enumeration of tags in the training data come from Dr. Nathan Hill (https://soas.academia.edu/NathanWHill).
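For readers who want to work with tagged output outside Sketch Engine, the following is a minimal sketch of reading TreeTagger-style output in Python. It assumes the common three-column layout that TreeTagger prints when token and lemma output are enabled (word form, tag, lemma, separated by tabs); the actual layout of the downloadable ACTib files may differ. The sample lines, the transliterated forms standing in for Tibetan script, the tag names, and the helper `parse_treetagger` are all illustrative assumptions, not taken from the corpus.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    form: str   # word form as it appears in the text
    tag: str    # part-of-speech tag
    lemma: str  # dictionary form assigned by the tagger


def parse_treetagger(lines: Iterable[str]) -> Iterator[Token]:
    """Yield a Token for each well-formed 'form<TAB>tag<TAB>lemma' line."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and structural markup such as <s>
        yield Token(*parts)


# Invented sample in transliteration; forms and tag names are hypothetical.
SAMPLE = (
    "rgyal po\tn.count\trgyal po\n"
    "s\tcase.agn\tgis\n"
    "<s>\n"
    "smras\tv.past\tsmra\n"
    "rgyal po\tn.count\trgyal po\n"
)

tokens = list(parse_treetagger(SAMPLE.splitlines()))
tag_counts = Counter(t.tag for t in tokens)      # frequency of each tag
lemma_counts = Counter(t.lemma for t in tokens)  # frequency of each lemma
```

Lines that do not have exactly three tab-separated fields are skipped rather than raising an error, which keeps the reader tolerant of sentence markers; a stricter reader might prefer to report them instead.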

Availability

The corpus is accessible in Sketch Engine to all users, including trial users, and can be downloaded in its entirety from Zenodo.

DOI for Segmented version: forthcoming

DOI for POS-tagged version: https://doi.org/10.5281/zenodo.822537

ACTib 2.1

  • forthcoming

ACTib 2.0

  • 80 million words automatically segmented and POS-tagged (no manual correction)
  • word sketch grammar created for the Tibetan language

ACTib 1.0

  • initial size: 21 million words, automatically segmented and POS-tagged (no manual correction)

Garrett, Edward and Hill, Nathan W. and Kilgarriff, Adam and Vadlapudi, Ravikiran and Zadoks, Abel (2015) ‘The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries.’ Revue d’Etudes Tibétaines, 32. pp. 51-86.

Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878

Meelen, Marieke and Hill, Nathan W. (forthcoming) ‘Segmenting and POS tagging Classical Tibetan’ in Himalayan Linguistics.

Hill, Nathan W. and Meelen, Marieke (forthcoming) ‘Creating an Annotated Corpus of Classical Tibetan (ACTib)’.

Meelen, Marieke; Hill, Nathan; Handy, Christopher (2017b), The Annotated Corpus of Classical Tibetan (ACTib), Part II – POS-tagged version, based on the BDRC digitised text collection, tagged with the Memory-Based Tagger from TiMBL. (https://doi.org/10.5281/zenodo.822537).

Search the Tibetan corpus

Sketch Engine offers a range of tools to work with the Tibetan corpus.
