Adding structures and metadata
Adding structures, structural attributes and values makes it possible to annotate a corpus, i.e. to add metadata to it. Document, paragraph and sentence structures are normally added automatically when a corpus is built in Sketch Engine, but other structures must be added manually if required.
If you are new to corpus annotation, you might like to read this blog post first.
Procedure in a nutshell
(If your corpus is in Sketch Engine, first download it.)
- Open the corpus in a plain text editor or annotation software.
- Add structures, attributes and values.
- Upload it to Sketch Engine. Attributes and values will be processed into text types automatically.
Terminology and format
Metadata can only be added to structures (document, sentence, paragraph, noun phrases and others that exist in the corpus or that the user introduces into the data). The structure must surround the text to be annotated.
To annotate a sentence, a sentence structure must mark the beginning and end of the sentence. The annotation is then added to the beginning of the structure.
An example of a sentence annotation:
<s direct_speech="yes" type="question">Have you had time to think it over?</s>
An example of a noun phrase annotation:
<s direct_speech="yes" type="affirm">I like <n_phrase type="noun-of-noun" words="5">the colour of your boots</n_phrase>.</s>
s and n_phrase are structure names
Structure names must be enclosed in angle brackets <> and can only use letters a-z, A-Z, numbers 0-9 and underscore (_).
yes, question, noun-of-noun and 5 are values
Values must be enclosed in plain (straight) double quotes; rounded typographic quotes are not allowed. Values can contain any characters, including accented characters. If a double quote is part of a value, it must be escaped with a backslash.
No spaces are allowed around the equal sign between an attribute and its value. The following is therefore incorrect:
<doc type = "spoken">
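The formatting rules above can be collected into a small helper that builds an opening tag. This is only a minimal sketch and not part of Sketch Engine: the function name open_tag and both patterns are assumptions made for illustration. Attribute names are restricted here to letters and underscore, following the description of attributes later in this section.

```python
import re

# Structure names: letters a-z, A-Z, digits 0-9 and underscore.
STRUCTURE_RE = re.compile(r"[A-Za-z0-9_]+")
# Attribute names: letters of the English alphabet or underscore.
ATTRIBUTE_RE = re.compile(r"[A-Za-z_]+")


def open_tag(structure, attributes=None):
    """Build an opening tag such as <s type="question"> (hypothetical helper)."""
    if not STRUCTURE_RE.fullmatch(structure):
        raise ValueError(f"invalid structure name: {structure!r}")
    parts = [structure]
    for attr, value in (attributes or {}).items():
        if not ATTRIBUTE_RE.fullmatch(attr):
            raise ValueError(f"invalid attribute name: {attr!r}")
        # A double quote inside a value is escaped with a backslash.
        escaped = str(value).replace('"', '\\"')
        # Plain double quotes, and no spaces around the equal sign.
        parts.append(f'{attr}="{escaped}"')
    return "<" + " ".join(parts) + ">"


print(open_tag("s", {"direct_speech": "yes", "type": "question"}))
```

A name such as noun-phrase is rejected with a ValueError because the hyphen is not an allowed character in a structure name.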
Documents uploaded to Sketch Engine are automatically surrounded by the document structure. Sentences are automatically recognized and surrounded by the sentence structure. Paragraph structures are only inserted automatically into web pages downloaded by Sketch Engine.
An example of annotation with structures but without attributes and values:
<doc>
    <p>
        <s>My Bonnie lies over the ocean</s>
        <s>My Bonnie lies over the sea</s>
    </p>
    <p>
        <s>My Bonnie lies over the ocean</s>
        <s>Oh, bring back my Bonnie to me</s>
    </p>
</doc>
The indentation can be used for the user’s convenience. White space between structures is ignored. The same data in one line will still be processed correctly:
<doc><p><s>My Bonnie lies over the ocean</s><s>My Bonnie lies over the sea</s></p><p><s>My Bonnie lies over the ocean</s><s>Oh, bring back my Bonnie to me</s></p></doc>
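The equivalence of the two layouts can be checked with a short script. This is a simplified illustration, not a description of how Sketch Engine processes the data: the function name is hypothetical, and it simply deletes whitespace that stands between two adjacent tags.

```python
import re


def strip_whitespace_between_structures(text):
    """Remove whitespace that sits only between two tags (illustrative helper)."""
    return re.sub(r">\s+<", "><", text.strip())


indented = """
<doc>
    <p>
        <s>My Bonnie lies over the ocean</s>
        <s>My Bonnie lies over the sea</s>
    </p>
</doc>
"""

one_line = (
    "<doc><p><s>My Bonnie lies over the ocean</s>"
    "<s>My Bonnie lies over the sea</s></p></doc>"
)

# Both layouts reduce to the same sequence of structures and text.
print(strip_whitespace_between_structures(indented) == one_line)
```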
Metadata consist of the attribute (the type of metadata, e.g. publication year) and the value (the actual metadatum, e.g. 1968). The attribute can be anything written in letters of the English alphabet or underscore (_). The attribute can be abbreviated and the corpus can be configured to present the user with a human-friendly name. E.g. the corpus can contain an abbreviated attribute name but the configuration file can be edited to show it in the interface as Year of publication.
An example of a corpus consisting of 2 files, with structures and structure attributes (metadata).
<doc pub="1970" lang="en">
    <p style="informal">
        <s><pers gender="female">Rebecca</pers> has worked with a full range of clients including <brand sect="automotive">BMW</brand> and <brand sect="air">Airbus</brand>.</s>
        <s> some text </s>
    </p>
    <p style="formal">
        <s>some text </s>
        <s>some text </s>
    </p>
</doc>
<doc pub="1977">
    <p style="informal">
        <s>some text </s>
        <s> some text </s>
    </p>
    <p style="informal">
        <s>some text </s>
        <s>some text </s>
    </p>
</doc>
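Before uploading, it can be useful to list which attributes and values an annotated file actually contains. The sketch below is one possible way to do that with regular expressions. The function name collect_metadata and the structure.attribute labels used as dictionary keys are choices made for this illustration, and the sample is an abridged version of the corpus above.

```python
import re
from collections import defaultdict

# An opening tag: a structure name followed by zero or more attr="value" pairs.
TAG_RE = re.compile(r'<([A-Za-z0-9_]+)((?:\s+[A-Za-z_]+="(?:[^"\\]|\\.)*")*)\s*>')
# A single attr="value" pair; a backslash-escaped quote may appear in the value.
ATTR_RE = re.compile(r'([A-Za-z_]+)="((?:[^"\\]|\\.)*)"')


def collect_metadata(text):
    """Map 'structure.attribute' to the set of values found in the text."""
    found = defaultdict(set)
    for tag in TAG_RE.finditer(text):
        structure, attrs = tag.group(1), tag.group(2)
        for attr in ATTR_RE.finditer(attrs):
            found[f"{structure}.{attr.group(1)}"].add(attr.group(2))
    return dict(found)


sample = (
    '<doc pub="1970" lang="en"><p style="informal">'
    '<s><pers gender="female">Rebecca</pers> has worked with '
    '<brand sect="automotive">BMW</brand> and '
    '<brand sect="air">Airbus</brand>.</s></p></doc>'
)

for key, values in sorted(collect_metadata(sample).items()):
    print(key, sorted(values))
```

Structures without attributes, such as the plain sentence structure in the sample, produce no entries, and closing tags are ignored.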
Document annotation tool
The built-in annotation tool allows adding metadata to documents easily.
Other annotation tools
The Sketch Engine interface only allows assigning metadata to documents. To insert or annotate other structures, use a plain text editor or an external annotation tool.
Annotation tools are usually designed for a specific annotation task. General-purpose annotation tools are not easy to find. The UAM Corpus Tool is worth trying.