BulgarianNC: Bulgarian National Corpus

The Bulgarian National Corpus (BulgarianNC) is a corpus of Bulgarian texts collected from sources such as scanned books, transcribed data, and internet texts. The texts are classified by genre, domain, and source type. The corpus contains 419 million words in total, across its web and non-web parts.

In Sketch Engine, BulgarianNC is organized hierarchically as follows:

Part-of-speech tagset

The Bulgarian National Corpus is PoS tagged using the following Bulgarian tagset.

Tools to work with the Bulgarian corpus

A complete set of tools is available to work with this Bulgarian corpus to generate:

  • word sketch – Bulgarian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Bulgarian nouns, verbs, adjectives, etc., organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • keywords – terminology extraction of one-word units
  • text type analysis – statistics of metadata in the corpus
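
To make three of the outputs listed above more concrete, here is a minimal Python sketch of a word list, an n-gram frequency list, and a concordance computed over a tiny made-up Bulgarian sample. It is an illustration only and not how Sketch Engine computes these results: the function names (`tokenize`, `word_list`, `ngrams`, `concordance`), the naive whitespace tokenizer, and the sample sentences are all assumptions introduced for this example.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Naive tokenizer (an assumption): split on whitespace,
    strip surrounding punctuation, lowercase."""
    tokens = []
    for raw in text.split():
        tok = raw.strip(PUNCT).lower()
        if tok:
            tokens.append(tok)
    return tokens

def word_list(tokens):
    """Word list: (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()

def ngrams(tokens, n):
    """N-grams: frequency of every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def concordance(tokens, keyword, width=3):
    """Concordance: each hit of keyword with up to `width` tokens
    of left and right context, as (left, keyword, right) tuples."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Made-up sample text, not taken from BulgarianNC.
sample = "Котката спи. Кучето спи. Котката спи."
tokens = tokenize(sample)

print(word_list(tokens))
print(ngrams(tokens, 2).most_common())
for left, kw, right in concordance(tokens, "кучето", width=2):
    print(f"{left} [{kw}] {right}")
```

A real corpus tool works on lemmatized, PoS-tagged text and on indexes rather than in-memory lists, so this sketch only conveys the shape of each result.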

Search the Bulgarian National Corpus

Sketch Engine offers a range of tools to work with this Bulgarian corpus.

Other text corpora

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.