NNC: Nepali National corpus
The Nepali National Corpus (NNC) is a Nepali corpus of 13 million words that are lemmatised and part-of-speech tagged. The full NNC comprises three types of corpora: a written corpus, a parallel corpus and a spoken corpus; the spoken corpus is not part of the NNC in Sketch Engine. The corpus was created within the NeLRaLEC project, funded by the Asia IT & C Programme of the European Commission. The corpus texts were PoS tagged and later lemmatised by Bal Krishna Bal from Language Technology Kendra and Andrew Hardie from Lancaster University.
The NNC is POS annotated using the Nelralec tagset.
Content in detail
The version of the Nepali National Corpus in Sketch Engine consists of two corpora.
1. written corpus (ca 11 million words), made up of two collections: the core corpus and the general collections.
The core corpus is a collection of Nepali written texts that matches, as far as possible, the dates, number of texts and genres of the international FLOB and FROWN corpora. That framework consists of 500 texts of 2,000 words each, from 15 different genres, published between 1990 and 1992. The framework is as follows:
|Table 1: Core sample framework (based on FLOB/FROWN corpora)|
|Category Label||Category Title||Number of samples|
|E||Skills, Trades and Hobbies||38|
|G||Belles Lettres, Biographies, Essays||77|
|K||Mystery and Detective Fiction||24|
|M||Adventure and Western||29|
|N||Romance and Love story||29|
The primary purpose of the core sample was to provide a match to other corpora created from the same sampling frame. However, some adaptations were made in selecting genres, because not all genres found in English writing (e.g. science fiction) existed in Nepali, owing to cultural and other East-West differences. Moreover, only 398 texts (instead of 500) could be collected for the Nepali core corpus: texts from some genres were not available for the 1991/92 time frame, when writing in Nepali was still very restricted and had only just started to broaden with the advent of liberalism after the restoration of democracy in the country.
The collected core corpus is presented in Table 2.
|Table 2: Collected core corpus (FLOB/FROWN target numbers of files in parentheses)|
|Category name||No. of files||No. of words|
|A (Press reportage)||33 (44)||66800|
|B (Press editorial)||23 (27)||46520|
|C (Press review)||6 (17)||12095|
|D (Religion)||13 (17)||26412|
|E (Skills, Trades and Hobbies)||29 (38)||58935|
|F (Popular lore)||32 (44)||64878|
|G (Belles Lettres, Biographies, Essays)||68 (77)||137873|
|H (Miscellaneous)||28 (30)||56680|
|J (Science)||56 (80)||113507|
|S (Fiction)||110 (126)||220874|
|Grand total||398 (500)||804574|
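The grand totals in the table above can be checked with a few lines of Python. The sketch below copies the per-category figures from the table and recomputes the totals; the variable names are illustrative, and reading the parenthesised numbers as sampling-frame targets is an assumption based on their sum of 500.

```python
# Figures copied from the table above:
# category -> (files collected, files targeted, words).
# Treating the parenthesised numbers as targets is an assumption.
core_corpus = {
    "A": (33, 44, 66800),
    "B": (23, 27, 46520),
    "C": (6, 17, 12095),
    "D": (13, 17, 26412),
    "E": (29, 38, 58935),
    "F": (32, 44, 64878),
    "G": (68, 77, 137873),
    "H": (28, 30, 56680),
    "J": (56, 80, 113507),
    "S": (110, 126, 220874),
}

collected = sum(c for c, _, _ in core_corpus.values())
targeted = sum(t for _, t, _ in core_corpus.values())
words = sum(w for _, _, w in core_corpus.values())

print(collected, targeted, words)               # 398 500 804574
print(f"coverage: {collected / targeted:.1%}")  # coverage: 79.6%
print(f"mean words per file: {words / collected:.0f}")
```

The recomputed sums agree with the grand total row, and the mean of roughly 2,000 words per file is consistent with the 2,000-word sample size of the framework.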
The internal structure of the core corpus is as follows:
|One text translated from Hindi and one text based on Sanskrit|
|Science and technology||3|
|Anthropology / culture||8|
|History / Archeology||5|
|Language and grammar||3|
|Law / politics||5|
|Business / economics / administration||6|
These 398 texts, totalling about 0.8 million words (804,574 according to the grand total above), were extracted from various books, journals, magazines and newspapers and digitized in Nepali Unicode. For computer processing, the texts were then manually marked up in XML, tagging the body, paragraphs, sentences and any foreign words. Each text was given an XML header containing its metadata or bibliographical details, such as book/article/issue title, author, publisher, publication date, publication place and name of the typist. Additional relevant XML tags were added automatically.
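As an illustration of the markup described above, the following Python sketch builds a small XML document with a header carrying bibliographic metadata and a body divided into paragraphs and sentences, with a foreign word marked inline. The element and attribute names (`text`, `header`, `body`, `p`, `s`, `foreign`, `lang` and the header fields) and all metadata values are hypothetical placeholders, not the actual NNC schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the real NNC markup scheme may differ.
text = ET.Element("text")

# Header: one child element per bibliographic field (placeholder values).
header = ET.SubElement(text, "header")
for field, value in [
    ("title", "EXAMPLE TITLE"),
    ("author", "EXAMPLE AUTHOR"),
    ("publisher", "EXAMPLE PUBLISHER"),
    ("pubDate", "EXAMPLE DATE"),
    ("pubPlace", "EXAMPLE PLACE"),
    ("typist", "EXAMPLE TYPIST"),
]:
    ET.SubElement(header, field).text = value

# Body: paragraph > sentence, with a foreign word tagged inside the sentence.
body = ET.SubElement(text, "body")
paragraph = ET.SubElement(body, "p")
sentence = ET.SubElement(paragraph, "s")
sentence.text = "यो एउटा "            # text before the foreign word
foreign = ET.SubElement(sentence, "foreign", lang="en")
foreign.text = "computer"
foreign.tail = " हो ।"                # text after the foreign word

xml_string = ET.tostring(text, encoding="unicode")
print(xml_string)
```

Keeping the metadata in a separate header element means the body can be processed as running text while the bibliographic details remain available for filtering by text type.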
A set of 112 part-of-speech (POS) tags was developed empirically to annotate the core corpus (for details see POS Tagset). This tagset was first used to annotate 160 files of the core corpus manually. Based on this manually tagged corpus, an automatic tagger called Unitag was developed at Lancaster University (LU) and has been used to tag the whole of the text corpus automatically, using a lexicon, rules and probabilistic generalizations. However, in line with our policy of technology transfer, we have been building our own part-of-speech tagger at MPP as part of a general morphological analyser for a range of uses.
The general collections in the NNC contain digitized written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers and authors. These texts, nearly 14 million words keyboarded in various fonts, have been converted to Unicode with a program called ‘Font Converter’, developed at the Bhashasanchar Project to convert non-Unicode fonts such as Kantipur, Preeti and Jag Himali into Unicode, and then tagged using XML markup and an automatic POS tagger.
The texts in the general collections are arranged according to their types.
1. Web texts (collected between March 2005 and May 2006)
These texts are classified first by web address and then by text type (for example, anthropology, art, business, crime, criticism, education, editorial, health, news, law, opinion, sport, politics, etc.) and publication date, e.g. kantipur-editorial-2061-12-15.
2. Books (69 books of different genres and sizes)
Books are identified by genre, title and publication date. For example, Alikhit by Dhruva Chandra Gautam is named ‘book-fiction-alikhit-2058’.
3. Newspapers/journals (complete text of a newspaper or journal, without classification)
This class contains texts from 94 issues of Himal Khabarpatrika. Each file is named after the publication and its publication date, e.g. himalkhabarpatrika-2057-05-01.
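The three naming conventions above can be told apart mechanically. The following Python sketch parses such names with regular expressions; the patterns are inferred only from the three example names quoted above, the field names (`source`, `text_type`, `genre`, `title`) are made up for illustration, and real file names in the corpus may not all fit.

```python
import re

# Patterns inferred from the three example names quoted above; this is an
# assumption, and the full corpus may contain names that do not fit them.
# Years such as 2061 appear to follow the Nepali (Bikram Sambat) calendar,
# so dates are kept as strings rather than converted.
DATE = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
PATTERNS = [
    ("book", re.compile(r"^book-(?P<genre>[^-]+)-(?P<title>.+)-(?P<year>\d{4})$")),
    ("web", re.compile(rf"^(?P<source>[^-]+)-(?P<text_type>[^-]+)-{DATE}$")),
    ("periodical", re.compile(rf"^(?P<source>[^-]+)-{DATE}$")),
]


def parse_name(name):
    """Return (kind, fields) for a general-collection file name, or None."""
    for kind, pattern in PATTERNS:
        match = pattern.match(name)
        if match:
            return kind, match.groupdict()
    return None


for example in (
    "kantipur-editorial-2061-12-15",
    "book-fiction-alikhit-2058",
    "himalkhabarpatrika-2057-05-01",
):
    print(parse_name(example))
```

The book pattern is tried first because its fixed `book-` prefix is the most specific; the web and periodical patterns differ only in whether a text type sits between the source and the date.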
2. parallel corpus (two genres: computing and national development; ca 3 million words)
Tools to work with the Nepali National corpus
A complete set of Sketch Engine tools is available to work with this Nepali National corpus to generate:
- word sketch – Nepali collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of Nepali nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- keywords – terminology extraction of one-word units
- text type analysis – statistics of metadata in the corpus
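To make two of the outputs listed above concrete, the sketch below computes a frequency-ordered word list and a bigram frequency list from a tokenised sample in plain Python. It only illustrates the idea and is not how Sketch Engine computes these lists; the two sample sentences are made up and are not taken from the corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over consecutive n-token tuples."""
    return zip(*(tokens[i:] for i in range(n)))


# Made-up toy sample ("Nepal is a beautiful country. Nepal is a small
# country."), tokenised on whitespace.
tokens = "नेपाल सुन्दर देश हो । नेपाल सानो देश हो ।".split()

# Word list: tokens ordered by frequency.
word_list = Counter(tokens).most_common()

# N-grams: here bigrams (n = 2), ordered by frequency.
bigrams = Counter(ngrams(tokens, 2)).most_common()

print(word_list[:4])
print(bigrams[:2])
```

In the corpus itself such counts can also be restricted by lemma or POS tag, since every token carries both annotations.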
Yadava, Yogendra P., Andrew Hardie, Ram Raj Lohani, Bhim N. Regmi, Srishtee Gurung, Amar Gurung, Tony McEnery, Jens Allwood, and Pat Hall. Construction and annotation of a corpus of contemporary Nepali. Corpora 3, no. 2 (2008): 213-225.
Part-of-speech tagset documentation
Hardie, A, Lohani, R, Regmi, B and Yadava, Y (2005). Categorisation for automated morphosyntactic analysis of Nepali: introducing the Nelralec Tagset (NT-01). Nelralec/Bhasha Sanchar Working Paper 2, pp. 171–198.
Part-of-speech tagging and lemmatisation
Hardie, A, Lohani, R and Yadava, YP (2011) Extending corpus annotation of Nepali: advances in tokenisation and lemmatisation. Himalayan Linguistics 10 (1): 151-165.
Yadava, Y.P., Hardie, A., Lohani R.R., Regmi B.N., Gurung, S., Gurung, A., McEnery, T., Allwood, J., and Hall, P. (2008). Construction and annotation of a corpus of contemporary Nepali. Corpora 3(2): 213-225.