Corpus structures – parts of a corpus
A corpus is a very large collection of text that is used, together with suitable corpus management software such as Sketch Engine, to learn how language is used. It has become an indispensable tool for modern linguists and lexicographers.
In principle, a text corpus can consist of a single very long line of text. This is, however, impractical for many reasons. Dividing a corpus into smaller parts makes it possible to include or exclude certain parts when searching.
Similarly, it allows the user to count how many files, texts or documents contain a particular word or phrase, in order to check how evenly it is distributed across the corpus and to judge whether the word is in general use or limited to specific topics. It might also be useful to know whether a word tends to appear in long sentences (suggesting it might be a formal word) or in very short sentences (suggesting it tends to be used in informal spoken language). To make this possible, a corpus has to be equipped with marks or labels indicating the beginnings and ends of such parts. These marks or labels are called structure tags, and the parts of a corpus they mark are called structures. The most typical structures are documents (files), paragraphs and sentences.
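The two checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Sketch Engine computes these figures: the toy corpus, the <corpus> wrapper element, the crude tokenizer and the function names are all hypothetical, and a standard XML parser stands in for real corpus processing.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy corpus. The <corpus> wrapper is an assumption added only so
# that the text parses as a single XML document with several <doc> structures.
CORPUS = """
<corpus>
  <doc><p><s>My Bonnie lies over the ocean</s><s>My Bonnie lies over the sea</s></p></doc>
  <doc><p><s>Oh, bring back my Bonnie to me</s><s>The sea is calm</s></p></doc>
</corpus>
"""

PUNCTUATION = ".,;:!?"


def tokenize(text):
    """Crude tokenizer: split on whitespace, strip punctuation, lowercase."""
    tokens = (piece.strip(PUNCTUATION).lower() for piece in text.split())
    return [tok for tok in tokens if tok]


def document_frequency(root, word):
    """Count the <doc> structures that contain `word` at least once."""
    word = word.lower()
    count = 0
    for doc in root.iter("doc"):
        if any(word in tokenize(s.text or "") for s in doc.iter("s")):
            count += 1
    return count


def mean_sentence_length(root, word):
    """Average length, in tokens, of the <s> structures that contain `word`."""
    word = word.lower()
    lengths = []
    for s in root.iter("s"):
        tokens = tokenize(s.text or "")
        if word in tokens:
            lengths.append(len(tokens))
    return sum(lengths) / len(lengths) if lengths else 0.0


root = ET.fromstring(CORPUS.strip())
print(document_frequency(root, "sea"))    # 2: the word occurs in both documents
print(document_frequency(root, "ocean"))  # 1: the word occurs in one document only
print(mean_sentence_length(root, "sea"))  # 5.0: sentences of 6 and 4 tokens
```

Neither function could be written without the <doc> and <s> tags: on an unstructured stream of text there is nothing to count documents or measure sentences by.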
The built-in annotation tool allows adding metadata to documents easily.
While a corpus without structures remains usable in many respects, it is treated as one long continuous line of text. Searching for the word "look" followed by the word "up" at a distance of 1 to 3 words will also find instances where one word is at the end of one sentence and the other at the beginning of the following sentence. This might be unwanted behaviour. It is the structures that make it possible to take sentence boundaries into account.
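The difference can be illustrated with a minimal Python sketch. The sentences and the find_pairs helper are hypothetical, and the sketch assumes that "a distance of 1 to 3 words" means the second word occupies one of the three token positions directly after the first; it is not the search algorithm Sketch Engine uses.

```python
def find_pairs(tokens, first, second, max_dist=3):
    """Return (i, j) index pairs where `second` follows `first` by 1..max_dist tokens."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # Look at the next max_dist positions, stopping at the end of the list.
        for j in range(i + 1, min(i + max_dist + 1, len(tokens))):
            if tokens[j] == second:
                hits.append((i, j))
    return hits


# Hypothetical example data: three sentences, already tokenized and lowercased.
sentences = [
    ["take", "a", "look"],
    ["up", "the", "hill", "she", "went"],
    ["look", "it", "up", "later"],
]

# Without structures: the corpus is one continuous stream of tokens.
flat = [tok for sent in sentences for tok in sent]
flat_hits = find_pairs(flat, "look", "up")

# With structures: each sentence is searched separately, so a match
# cannot span a sentence boundary.
bounded_hits = [
    (n, i, j)
    for n, sent in enumerate(sentences)
    for i, j in find_pairs(sent, "look", "up")
]

print(len(flat_hits))     # 2: includes "look" + "up" across the first boundary
print(len(bounded_hits))  # 1: only "look it up" inside the third sentence
```

The extra hit in the unstructured search pairs the last word of the first sentence with the first word of the second, which is exactly the unwanted match described above.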
Corpus management software, Sketch Engine included, generally does not prescribe which structures should be included in a corpus or what they should look like. It is, however, advisable to include at least the basic set marking the beginning and end of each document, paragraph and sentence. By default, Sketch Engine will try to identify these three structures when content is uploaded and will supply the corresponding structure tags automatically.
The basic structure of a corpus with one document, two paragraphs and two sentences in each of the paragraphs might look like this:
<doc>
  <p>
    <s>My Bonnie lies over the ocean</s>
    <s>My Bonnie lies over the sea</s>
  </p>
  <p>
    <s>My Bonnie lies over the ocean</s>
    <s>Oh, bring back my Bonnie to me</s>
  </p>
</doc>
The indentation above is used purely for the reader’s convenience. The data can also be uploaded as one line of text and Sketch Engine will still process the structures correctly:
<doc><p><s>My Bonnie lies over the ocean</s><s>My Bonnie lies over the sea</s></p><p><s>My Bonnie lies over the ocean</s><s>Oh, bring back my Bonnie to me</s></p></doc>
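That the two layouts carry the same structures can be checked with a short Python sketch. It uses a standard XML parser purely as a stand-in, on the assumption that whitespace between tags is not significant; it says nothing about how Sketch Engine itself reads the data, and the outline helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# The same document in an indented layout and as a single line.
INDENTED = """
<doc>
  <p>
    <s>My Bonnie lies over the ocean</s>
    <s>My Bonnie lies over the sea</s>
  </p>
  <p>
    <s>My Bonnie lies over the ocean</s>
    <s>Oh, bring back my Bonnie to me</s>
  </p>
</doc>
"""

ONE_LINE = (
    "<doc><p><s>My Bonnie lies over the ocean</s>"
    "<s>My Bonnie lies over the sea</s></p>"
    "<p><s>My Bonnie lies over the ocean</s>"
    "<s>Oh, bring back my Bonnie to me</s></p></doc>"
)


def outline(xml_text):
    """Return one list of sentence texts per <p> structure in the document."""
    doc = ET.fromstring(xml_text.strip())
    return [
        [(s.text or "").strip() for s in p.iter("s")]
        for p in doc.iter("p")
    ]


# Both layouts yield two paragraphs with two sentences each.
print(outline(INDENTED) == outline(ONE_LINE))  # True
```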