Topic classification

Topics and genres in corpora

Topics and genres are text types (metadata) that enrich the corpus with information about the subject of the texts or their writing style. Sketch Engine uses topics and genres to restrict a search or analysis to one part of the corpus. All tools in Sketch Engine contain the text type selector, which should be […]

Case sensitive and insensitive searching

Case sensitive and insensitive corpus analysis

This blog post explains how to analyse corpora while either respecting or ignoring the difference between lowercase and uppercase. In other words, how to use Sketch Engine to type wifi and find wifi, WIFI, WiFi and Wifi, or to type WiFi and find only WiFi but none of the other variants.
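The two behaviours can be illustrated outside Sketch Engine with a few lines of plain Python. This is only a sketch of the idea, not Sketch Engine's implementation or API: the `find` function and the `tokens` list are hypothetical names made up for this example.

```python
# Hypothetical mini-corpus: one token per list item.
tokens = ["wifi", "WIFI", "WiFi", "Wifi", "router"]


def find(tokens, query, case_sensitive=False):
    """Return the tokens that match the query.

    Case-insensitive mode compares case-folded forms, so every
    capitalisation variant matches. Case-sensitive mode requires
    an exact match.
    """
    if case_sensitive:
        return [t for t in tokens if t == query]
    folded = query.casefold()
    return [t for t in tokens if t.casefold() == folded]


print(find(tokens, "wifi"))                       # all four variants
print(find(tokens, "WiFi", case_sensitive=True))  # only the exact form
```

The insensitive search compares a lowercased copy of each token, which is the same idea as searching a lowercase version of the corpus text instead of the original word forms.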

Corpus attributes

Words, tags, lemmas, lemposes, lowercase

When using Sketch Engine, every now and then the user comes across the word attribute and its values: words, tags, lemmas, lempos, lowercase and some others depending on the corpus and language. This blog post explains how these positional attributes, to use the correct terminology, work in Sketch Engine and how the user can benefit […]
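As a rough mental model, each token in a corpus can be pictured as a small record carrying several attribute values at once, and a search simply picks which attribute to compare against. The Python sketch below is an illustration under assumptions, not Sketch Engine's data model: the `Token` class, the `search` helper, the example tags and the `lemma-x` shape of the lempos value are all chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str   # the form as it appears in the text
    tag: str    # part-of-speech tag (illustrative tagset)
    lemma: str  # base form

    @property
    def lc(self):
        # Lowercased word form, used for case-insensitive matching.
        return self.word.lower()

    @property
    def lempos(self):
        # Assumed shape: lemma plus a one-letter part-of-speech suffix.
        return f"{self.lemma}-{self.tag[0].lower()}"


def search(tokens, attribute, value):
    """Return the word forms of tokens whose chosen attribute equals value."""
    return [t.word for t in tokens if getattr(t, attribute) == value]


sentence = [
    Token("Houses", "NNS", "house"),
    Token("house", "VBP", "house"),
    Token("people", "NNS", "people"),
]

print(search(sentence, "lemma", "house"))     # both forms of "house"
print(search(sentence, "lempos", "house-n"))  # the noun only
print(search(sentence, "lc", "houses"))       # matches despite the capital H
print(search(sentence, "word", "houses"))     # exact form: no match
```

The same query string can therefore return different results depending on which attribute it is matched against.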

Corpus from the web

Build a corpus from the web

The web is a great source of readily available textual data but also a limitless warehouse of spam, machine-generated content and duplicated content unsuitable for linguistic analysis. This may generate some uncertainty about the quality of the language included in the corpora from the web. At Sketch Engine, we are very well aware of the […]