BulgarianNC: Bulgarian National Corpus

The Bulgarian National Corpus (BulgarianNC) is a Bulgarian corpus made up of texts collected from various sources such as scanned books, transcribed data and internet texts. The corpus is classified according to genre, domain and source type. It contains 419 million words in total (web and non-web parts combined).

In Sketch Engine, BulgarianNC is organised hierarchically as follows:

  • BulgarianNC_web: the web part of the Bulgarian National Corpus
  • BulgarianNC_nonweb: all texts except the web part
  • BulgarianNC_all: BulgarianNC_web and BulgarianNC_nonweb combined. This is a test case of our new feature, the Virtual Corpus or the Super Corpus

Part-of-speech tagset

The Bulgarian National Corpus is PoS-tagged using a Bulgarian tagset.

Tools to work with the Bulgarian corpus

A complete set of tools is available to work with this Bulgarian corpus to generate:

  • word sketch – Bulgarian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Bulgarian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • keywords – terminology extraction of one-word units
  • text type analysis – statistics of metadata in the corpus
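
As a rough illustration of what an n-gram frequency list contains, the sketch below counts runs of consecutive tokens in a tiny tokenised sample. It is not how Sketch Engine computes its lists; the function name ngram_frequencies and the sample sentence are hypothetical and are not taken from BulgarianNC.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens in a token list."""
    # zip over n shifted copies of the list yields each n-token window
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Toy sample (hypothetical, for illustration only)
tokens = "добър ден и добър вечер и добър ден".split()

unigrams = ngram_frequencies(tokens, 1)  # a plain word frequency list
bigrams = ngram_frequencies(tokens, 2)   # two-word units

# Print the bigrams sorted by descending frequency
for gram, count in bigrams.most_common():
    print(count, gram)
```

With n set to 1 the same function produces a simple word list organized by frequency; larger values of n give longer multi-word units.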

Search the Bulgarian National Corpus

Sketch Engine offers a range of tools to work with this Bulgarian corpus.

