Common corpus structures

It is generally practical to divide a corpus into smaller parts called structures. A corpus will typically contain these structures:

`<doc>`

is a document, which usually corresponds to a single web page or a single file uploaded to Sketch Engine. It can carry attributes such as URL/source (the source document), author, and date/crawl_date (the date of creation or the date the page was collected from the web).

`<p>`

is a paragraph. It can have various attributes, such as heading (the value “1” means the paragraph is a heading or caption). Paragraphs are only added if the source document is an HTML file containing <p> tags.

`<s>`

stands for a sentence.

`<g>`

is a “glue” tag. The glue is inserted between tokens which are normally displayed next to each other without a space in between, e.g. don’t is split into two tokens. It is used only to display such tokens as they are normally seen in written text.

The author may decide to include other structures or to use different names of the structures. The information about the structures used is given on the corpus details page.
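To make the role of these structures more concrete, here is a minimal Python sketch that rebuilds the running text of each sentence from a small annotated sample. It assumes a one-token-per-line layout in which structures appear as tags on their own lines and the glue tag is written as a self-closing `<g/>`. That layout, the sample data, the attribute values and the function name `sentences` are illustrative assumptions, not something taken from this page.

```python
import re

# Hypothetical sample: one token per line, structure tags on their own lines.
# The attribute values and the self-closing <g/> form are assumptions.
SAMPLE = """\
<doc url="https://example.com/page" author="unknown">
<p heading="1">
<s>
Corpus
structures
</s>
</p>
<p>
<s>
I
do
<g/>
n't
know
<g/>
.
</s>
</p>
</doc>
"""

# Matches an opening, closing or self-closing tag that fills a whole line.
TAG = re.compile(r"^<(/?)(\w+)([^>]*?)(/?)>$")


def sentences(vertical: str) -> list[str]:
    """Rebuild the surface text of each <s> structure, honouring glue tags."""
    result = []
    pieces = []   # text pieces of the sentence currently being assembled
    glue = False  # True when the next token attaches without a space
    for line in vertical.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TAG.match(line)
        if match:
            closing, name = match.group(1), match.group(2)
            if name == "g":
                glue = True
            elif name == "s" and closing:
                # End of a sentence: emit it and start a fresh one.
                result.append("".join(pieces))
                pieces = []
                glue = False
            continue  # other structures (<doc>, <p>) carry no text
        if pieces and not glue:
            pieces.append(" ")
        pieces.append(line)
        glue = False
    return result


print(sentences(SAMPLE))
```

In this sketch the `<doc>` and `<p>` tags only delimit larger units and are skipped, while `<g/>` suppresses the space that would otherwise be placed before the next token, so the two tokens of "don't" are displayed as one written word.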

Adding structures and metadata

Please refer to corpus annotation for details.
