What is a corpus?
A text corpus is a very large collection of text (often many billions of words) produced by real users of the language and used to analyse how words, phrases and language in general are used. It is used by linguists, lexicographers, social scientists, humanities scholars, experts in natural language processing and specialists in many other fields. A corpus is also used for generating various language databases used in software development, such as predictive keyboards, spell checkers, grammar correction, text/speech understanding systems, text-to-speech modules, machine translation systems and many others.
Types of text corpora
It is not possible to easily classify a corpus into a certain category. Instead, corpora can have features or properties which can be used to group them. The same corpus can have one or more of these features.
A monolingual corpus is the most frequent type of corpus. It contains texts in one language only. The corpus is usually tagged for parts of speech and is used by a wide range of users for various tasks from highly practical ones, e.g. checking the correct usage of a word or looking up the most natural word combinations, to scientific use, e.g. identifying frequent patterns or new trends in language. Sketch Engine contains hundreds of monolingual corpora in dozens of languages.
Parallel corpus, multilingual corpus
A parallel corpus consists of two or more monolingual corpora that are translations of each other. For example, a novel and its translation or a translation memory of a CAT tool could be used to build a parallel corpus. The texts in both languages need to be aligned, i.e. corresponding segments, usually sentences or paragraphs, need to be matched. The user can then search for all examples of a word or phrase in one language and the results will be displayed together with the corresponding sentences in the other language. This lets the user observe how the search word or phrase is translated.
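The alignment and search described above can be illustrated with a minimal sketch. The segment pairs, the function name `search_parallel` and the plain substring matching are assumptions made for illustration only; they do not show how Sketch Engine stores or queries parallel corpora.

```python
# Hypothetical aligned segment pairs (English, Czech).
# Each tuple holds one source sentence and its matched translation.
aligned_segments = [
    ("The house is old.", "Ten dům je starý."),
    ("She bought a new house.", "Koupila nový dům."),
    ("The weather is nice today.", "Dnes je hezké počasí."),
]


def search_parallel(segments, query):
    """Return every aligned pair whose source segment contains the query.

    Matching is a case-insensitive substring test, which is a simplification:
    a real corpus tool would match tokens or lemmas instead.
    """
    needle = query.lower()
    return [(src, tgt) for src, tgt in segments if needle in src.lower()]


hits = search_parallel(aligned_segments, "house")
for src, tgt in hits:
    # Show the source sentence next to its aligned translation.
    print(f"{src}  |  {tgt}")
```

Because every hit carries its aligned counterpart, the translations of the search word can be read off side by side.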
A comparable corpus is one corpus in a set of two or more monolingual corpora, typically each in a different language, built according to the same principles. The content is therefore similar and results can be compared between the corpora even though they are not translations of each other (and therefore, they are not aligned). When searching these corpora, users can take advantage of the fact that the corpora also share the same metadata. Examples of comparable corpora in Sketch Engine are the CHILDES corpora and various corpora made from Wikipedia. The Araneum corpora are comparable too.
A diachronic corpus is a corpus containing texts from different periods and is used to study the development or change in language. Sketch Engine allows searching the corpus as a whole or only including selected time intervals in the search. In addition, there is a specialized diachronic feature called Trends, which identifies words whose usage changes the most over the selected period of time.
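The idea of finding words whose usage changes over time can be sketched as follows. This is a deliberately simplified illustration that compares relative frequencies in the first and last period only; it is not the method used by the Trends feature, and the toy data and function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: period label -> tokens of the texts from that period.
texts_by_period = {
    "2000": "we sent a fax and a letter and a fax".split(),
    "2010": "we sent an email and a letter and a fax".split(),
    "2020": "we sent an email and an email and a message".split(),
}


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in one period."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


def rank_by_change(by_period):
    """Rank words by the change in relative frequency between the first and last period."""
    periods = sorted(by_period)
    first = relative_frequencies(by_period[periods[0]])
    last = relative_frequencies(by_period[periods[-1]])
    vocabulary = set(first) | set(last)
    change = {w: last.get(w, 0.0) - first.get(w, 0.0) for w in vocabulary}
    # Largest absolute change first; the sign tells rising from falling words.
    return sorted(change.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_by_change(texts_by_period)
for word, delta in ranking[:4]:
    print(f"{word}: {delta:+.2f}")
```

Relative rather than absolute frequencies are used so that periods of different sizes remain comparable.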
see also Trends – diachronic analysis
The opposite is a synchronic corpus whose texts come from the same point in time. It is a snapshot of language at one moment. The corpora of the TenTen family are such snapshots because their content is collected within a couple of months.
A static corpus (also called a reference corpus, although this term refers to something else in Sketch Engine) is a corpus whose development is complete. The content of the corpus does not change. Most corpora are static corpora. The benefit of a corpus that does not change is that the results of the analysis do not change either, which is important in many scenarios.
A monitor corpus is used to monitor change in language. It is a corpus which is regularly (or even continuously) updated: new texts are added as they are produced. The results of searches change because the corpus keeps growing.
The Timestamped corpus in Sketch Engine is an example of a monitor corpus.
A learner corpus is a corpus of texts produced by learners of a language. The corpus is used to study the mistakes and problems learners have when learning a foreign language. Sketch Engine allows for learner corpora to be annotated for the type of error and provides a special interface to search either for the error itself, for the error correction, for the error type or for a combination of the three options.
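The kind of search described above (by error, by correction, by error type, or by a combination) can be sketched with a small data structure. The field names, the error type labels and the `search_errors` function are hypothetical and do not reflect the annotation scheme or interface of Sketch Engine.

```python
from dataclasses import dataclass


@dataclass
class ErrorAnnotation:
    sentence: str      # the learner's original sentence
    error: str         # the erroneous text
    correction: str    # the suggested correction
    error_type: str    # hypothetical error category label


annotations = [
    ErrorAnnotation("He go to school every day.", "go", "goes", "verb-agreement"),
    ErrorAnnotation("I have seen him yesterday.", "have seen", "saw", "tense"),
    ErrorAnnotation("She don't like coffee.", "don't", "doesn't", "verb-agreement"),
]


def search_errors(items, error=None, correction=None, error_type=None):
    """Keep the annotations that match every criterion that was supplied."""
    results = []
    for item in items:
        if error is not None and item.error != error:
            continue
        if correction is not None and item.correction != correction:
            continue
        if error_type is not None and item.error_type != error_type:
            continue
        results.append(item)
    return results


# Search by error type only, then by a combination of error and error type.
agreement_errors = search_errors(annotations, error_type="verb-agreement")
combined = search_errors(annotations, error="go", error_type="verb-agreement")
```

Leaving a criterion as `None` means it is ignored, so the same function covers single-criterion and combined searches.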
see also Setting up a learner corpus
These corpora contain texts produced by learners of a language or by translators. The errors are annotated and can be used to study the types of errors that different groups of learners or translators make.
A specialized corpus contains texts limited to one or more subject areas, domains, topics etc. Such a corpus is used to study how the specialized language is used. The user can create specialized subcorpora from the general corpora in Sketch Engine.
A multimedia corpus contains texts which are enhanced with audio or visual materials or other types of multimedia content. For example, the spoken part of the British National Corpus in Sketch Engine has links to the corresponding recordings, which can be played from the Sketch Engine interface.
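Conceptually, building a specialized subcorpus amounts to selecting documents by their metadata. The sketch below assumes each document carries a hypothetical `topic` field; the data and the `build_subcorpus` function are illustrative only and are not part of Sketch Engine.

```python
# Hypothetical general corpus: each document has a 'topic' metadata field.
documents = [
    {"topic": "medicine", "text": "The patient was given antibiotics."},
    {"topic": "sport", "text": "The striker scored twice."},
    {"topic": "medicine", "text": "Symptoms include fever and cough."},
]


def build_subcorpus(docs, topics):
    """Select the documents whose topic is one of the requested topics."""
    wanted = set(topics)
    return [doc for doc in docs if doc["topic"] in wanted]


medical = build_subcorpus(documents, ["medicine"])
print(len(medical), "documents in the medical subcorpus")
```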
Other corpora can have videos where the corpus text is spoken or images which show the original manuscript or printed copy of the text.
See BNC, whose spoken part (in particular the subcorpus ‘Audio sentences mp3’) is also available in audio format and can be played directly in the Sketch Engine interface.