Corpus of Classical Tibetan
The Annotated Corpus of Classical Tibetan (ACTib) is a corpus containing 170 million words of Classical Tibetan. Version 2.0 has been available in Sketch Engine since November 2020.
Corpus texts were taken from the e-text collection compiled by the Buddhist Digital Resource Center (Zenodo: http://doi.org/10.5281/zenodo.821218). The ACTib corpus was lemmatized and part-of-speech tagged. A word sketch grammar for Tibetan was also prepared, which enables users to explore the grammatical and collocational behavior of Tibetan words.
The corpus was built as part of the Tibetan in Digital Communication project. More information about the project and contact details for the authors can be found on the project page.
Part-of-speech tagging
The corpus was lemmatized and tagged using the TreeTagger tool by Helmut Schmid. The TreeTagger model was trained by Yeshe Tenley; the parameter file and training corpus can be found here. The lexicon, corpus, and enumeration of tags in the training data come from Dr. Nathan Hill.
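When run with its -token and -lemma options, TreeTagger writes one token per line as tab-separated token, tag, and lemma fields. The following is a minimal sketch of reading data in that layout and counting tag frequencies. It assumes a three-column layout, which should be checked against the actual ACTib files before use. The function names are illustrative, and the sample tokens, tags, and lemmas are placeholders rather than real ACTib data or the real tagset.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class TaggedToken(NamedTuple):
    token: str
    tag: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> list[TaggedToken]:
    """Parse tab-separated token/tag/lemma lines.

    Lines that do not have exactly three fields (blank lines,
    structural markup, and so on) are skipped.
    """
    parsed = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        parsed.append(TaggedToken(*fields))
    return parsed


def tag_frequencies(tokens: Iterable[TaggedToken]) -> Counter:
    """Count how often each POS tag occurs."""
    return Counter(t.tag for t in tokens)


# Placeholder rows only: not real ACTib tokens, tags, or lemmas.
SAMPLE = [
    "word1\tTAG_A\tlemma1\n",
    "word2\tTAG_B\tlemma2\n",
    "\n",
    "word3\tTAG_A\tlemma1\n",
]

tokens = parse_tagged_lines(SAMPLE)
print(tag_frequencies(tokens).most_common())
```

For a real file, the same functions could be applied to an open file handle instead of the `SAMPLE` list, since a file object also yields lines one at a time.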
Availability
The corpus is accessible in Sketch Engine to all users, including trial users, and can also be downloaded in its entirety from Zenodo.
DOI for POS-tagged version: 10.5281/zenodo.3785070
Changelog
ACTib 2.1
- forthcoming
ACTib 2.0
- 80 million words automatically segmented and POS-tagged (no manual correction)
- word sketch grammar created for the Tibetan language
ACTib 1.0
- initial size of 21 million words automatically segmented and POS-tagged (no manual correction)
Bibliography
Garrett, Edward and Hill, Nathan W. and Kilgarriff, Adam and Vadlapudi, Ravikiran and Zadoks, Abel (2015) ‘The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries.’ Revue d’Etudes Tibétaines, 32. pp. 51-86.
Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878
Meelen, Marieke and Hill, Nathan W. (forthcoming) ‘Segmenting and POS tagging Classical Tibetan’ in Himalayan Linguistics.
Hill, Nathan W. and Meelen, Marieke (forthcoming) ‘Creating an Annotated Corpus of Classical Tibetan (ACTib)’.
How to cite?
Meelen, Marieke, & Roux, Élie. (2020). The Annotated Corpus of Classical Tibetan (ACTib) – Version 2.0 (Segmented & POS-tagged) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3951503
Search the ACTib corpus
Sketch Engine offers a range of tools to work with the Tibetan corpus.