The general collections in the NNC contain digitized written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers, and authors. These texts, totalling nearly 14 million words, were keyboarded in various fonts. They have been converted to Unicode with ‘Font Converter’, a tool developed at the Bhashasanchar Project for non-Unicode fonts such as Kantipur, Preeti, and Jag Himali, and then marked up in XML and tagged by an automatic POS tagger.
The texts in the general collections are arranged according to their types.
1. Web-texts (collected between March 2005 and May 2006)
These texts are classified first by web address, then by text type (for example, anthropology, art, business, crime, criticism, education, editorial, health, news, law, opinion, sport, politics, etc.) and publication date, e.g. kantipur-editorial-2061-12-15.
2. Books (69 books of different genres and sizes)
Books are identified by genre, title, and publication date. For example, alikhit by Dhruva Chandra Gautam is named ‘book-fiction-alikhit-2058’.
3. Newspaper/journal (complete text of a newspaper or a journal without classification)
This class contains texts from 94 issues of himalkhabar patrika. Each file is named after the publication and its publication date, e.g. himalkhabarpatrika-2057-05-01.
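The three naming patterns above can be read back mechanically. The following is a minimal sketch, not part of the NNC tooling: the function name `parse_nnc_name` and the returned field names are hypothetical, and it assumes that source names, text types, and genres contain no hyphens, and that names carry no file extension. Dates are kept as strings rather than parsed, since years such as 2061 do not appear to be Gregorian.

```python
import re
from typing import Dict

# Assumed pattern for books: book-<genre>-<title>-<year>
BOOK = re.compile(r"^book-(?P<genre>[^-]+)-(?P<title>.+)-(?P<year>\d{4})$")
# Assumed pattern for web and newspaper texts: <prefix>-<yyyy>-<mm>-<dd>
DATED = re.compile(
    r"^(?P<prefix>.+)-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$"
)


def parse_nnc_name(name: str) -> Dict[str, str]:
    """Split a general-collection name into its parts (hypothetical helper)."""
    book = BOOK.match(name)
    if book:
        return {"collection": "book", **book.groupdict()}

    dated = DATED.match(name)
    if dated is None:
        raise ValueError(f"unrecognised name: {name!r}")

    # Keep the date as the string found in the name; no calendar conversion.
    date = "-".join(dated.group("year", "month", "day"))
    parts = dated.group("prefix").split("-")

    if len(parts) == 1:
        # <name>-<date>: a whole newspaper or journal issue
        return {"collection": "newspaper", "source": parts[0], "date": date}
    if len(parts) == 2:
        # <source>-<text type>-<date>: a classified web text
        return {
            "collection": "web",
            "source": parts[0],
            "text_type": parts[1],
            "date": date,
        }
    raise ValueError(f"unrecognised name: {name!r}")


if __name__ == "__main__":
    for example in (
        "kantipur-editorial-2061-12-15",
        "book-fiction-alikhit-2058",
        "himalkhabarpatrika-2057-05-01",
    ):
        print(parse_nnc_name(example))
```

Because web texts and newspaper issues both end in a full date, the sketch tells them apart only by the number of hyphen-separated fields before the date; a source or text type that itself contains a hyphen would need a different rule.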