It is generally practical to divide a corpus into smaller parts called structures. A corpus will typically contain these structures:

  • <doc> is a document, which usually corresponds to a single web page. It can have several attributes, such as URL/source (the source document), author, and date/crawl_date (the date the document was created or collected from the web).
  • <p> is a paragraph. It can have the attribute heading (the value “1” means the paragraph is a heading or caption).
  • <s> is a sentence.
  • <g> is a “glue” tag. It marks a boundary between two tokens that are not separated by a space, so its main purpose is displaying concordances correctly.
  • <gap> denotes a gap created by one of our tools, mostly as a result of de-duplication or boilerplate removal (cleaning HTML pages of navigation, short ads, etc.).
  • <a> is a URL or a piece of text that was a link in the original document.

The corpus author may decide to include other structures or to use different names for the structures. The structures used in a particular corpus are listed on its corpus details page.
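
To make the roles of these structures more concrete, here is a minimal Python sketch that reads a small sample and counts the structures it finds. It rests on several assumptions that are not stated above: the sample uses a one-item-per-line layout (one token or one tag per line), <g> and <gap> are written as self-closing tags, and the sample text, attribute values and the function name read_structures are all hypothetical. It is an illustration of the idea, not a description of how any particular tool processes corpora.

```python
import re
from collections import Counter

# Hypothetical sample: one token or one structure tag per line (assumed layout).
SAMPLE = """\
<doc url="http://example.com/page" author="unknown" date="2020-01-01">
<p heading="1">
<s>
Corpus
structures
</s>
</p>
<p>
<s>
Hello
<g/>
,
world
<g/>
!
</s>
<gap/>
</p>
</doc>
"""

# Matches an opening, closing or self-closing tag that fills a whole line.
TAG_RE = re.compile(r"^<(/?)([A-Za-z][\w-]*)((?:\s[^<>]*?)?)(/?)>$")
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def read_structures(text):
    """Count structures, collect <doc> attributes and rebuild sentence text."""
    counts = Counter()
    doc_attrs = {}
    sentences = []
    tokens = None   # pieces of the sentence being built, or None outside <s>
    glue = False    # True when the previous line was a <g/> tag

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TAG_RE.match(line)
        if match is None:
            # A plain token: add a space before it unless it is glued.
            if tokens is not None:
                if tokens and not glue:
                    tokens.append(" ")
                tokens.append(line)
            glue = False
            continue

        closing, name, attrs, _self_closing = match.groups()
        if closing:
            if name == "s" and tokens is not None:
                sentences.append("".join(tokens))
                tokens = None
            continue

        counts[name] += 1  # count each opening or self-closing tag once
        if name == "g":
            glue = True
        elif name == "s":
            tokens = []
            glue = False
        elif name == "doc":
            doc_attrs = dict(ATTR_RE.findall(attrs))

    return counts, sentences, doc_attrs


if __name__ == "__main__":
    counts, sentences, doc_attrs = read_structures(SAMPLE)
    print(dict(counts))
    print(sentences)
    print(doc_attrs)
```

In this sketch the glue tag is what lets "Hello", "," and "world" be rejoined as "Hello, world!" rather than "Hello , world !", which is the display purpose described for <g> above.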