It is generally practical to divide a corpus into smaller parts called structures. A corpus will typically contain these structures:

<doc>

is a document, which usually corresponds to a single web page or a file uploaded to Sketch Engine. It can have multiple attributes, such as URL/source (the source document), author, and date/crawl_date (the date of creation or the date of collection from the web).

<p>

is a paragraph. It can have various attributes, such as heading (value “1” means the paragraph is a heading/caption). Paragraphs are only added if the source document is an HTML file containing <p> HTML tags.

<s>

stands for a sentence.

<g>

is a “glue” tag. The glue is inserted between tokens which are normally displayed next to each other without a space in between, e.g. don’t is two tokens. It is used only to display such tokens as they are normally seen in written text.
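The effect of the glue tag on display can be sketched in a few lines of Python. This is only an illustration, not Sketch Engine's own code: the function name join_tokens is hypothetical, the glue is assumed to appear in the token stream as the string <g/>, and the split of don’t into "do" and "n't" is an assumed tokenization.

```python
# Assumption: the glue marker appears in the token stream as "<g/>".
GLUE = "<g/>"


def join_tokens(tokens):
    """Join tokens with single spaces, except where a glue marker sits between them."""
    parts = []
    glue_pending = False
    for tok in tokens:
        if tok == GLUE:
            # Remember that the next token must attach to the previous one.
            glue_pending = True
            continue
        if parts and not glue_pending:
            parts.append(" ")
        parts.append(tok)
        glue_pending = False
    return "".join(parts)


# Assumed tokenization: "don't" split into "do" + "n't", full stop as its own token.
tokens = ["I", "do", GLUE, "n't", "know", GLUE, "."]
print(join_tokens(tokens))  # prints: I don't know.
```

The glue carries no content of its own. It only records that no space belongs between two neighbouring tokens.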

<gap>

marks a stretch of text removed by one of our tools, typically during de-duplication or the removal of boilerplate (repetitive content found in HTML pages, such as navigation menus, short ads, legal text, etc.).

<a>

is a URL or text that was a link in the original document.

The corpus author may decide to include other structures or to use different names for the structures. Information about the structures used is given on the corpus details page.
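To see how the structures listed above might nest in practice, here is a minimal Python sketch that counts the structures in a small sample. It rests on assumptions: the sample layout (one token per line, structures written as XML-like tags), the attribute values, and the names SAMPLE, OPEN_TAG and count_structures are all invented for illustration and are not taken from any real corpus.

```python
import re
from collections import Counter

# Invented sample: one token per line, structures as XML-like tags (assumed layout).
SAMPLE = """\
<doc url="https://example.com/page" author="unknown">
<p heading="1">
<s>
Corpus
structures
</s>
</p>
<p>
<s>
I
do
<g/>
n't
know
<g/>
.
</s>
</p>
</doc>
"""

# Matches opening or self-closing tags such as <doc ...>, <p>, <g/>; not </s>.
OPEN_TAG = re.compile(r"^<([A-Za-z_][\w-]*)\b[^>]*>$")


def count_structures(text):
    """Count how many times each structure is opened in the text."""
    counts = Counter()
    for line in text.splitlines():
        match = OPEN_TAG.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


counts = count_structures(SAMPLE)
for name in ("doc", "p", "s", "g"):
    print(name, counts[name])
```

Because a corpus may use other structures or different names, a count like this is only a quick check. The corpus details page remains the reference for what a given corpus contains.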