1 Select the word sketch grammar to apply. Use the recommended option if in doubt. Selecting None will disable Word Sketches for this corpus.
2 If you want to use your own word sketch grammar, upload it here.
3 Select the term definition (term grammar). Use the default option if in doubt. Selecting None will disable term extraction for this corpus.
4 Choose the name of the structure that should surround the content of each file in the corpus. In the case of a corpus created from the web, the content of each web page will be enclosed in this structure. Use the default option if in doubt. If you know what you are doing, you can use another name, for example doc, document, text, page or site.
5 Tick to activate deduplication. When active, identical and very similar content will be identified and only one instance will be kept. Use 6 to indicate the level at which the content should be compared.
Available deduplication options
structure name for files – this is the structure set in 4 – if the content of two files, e.g. web pages, is identified as identical or very similar, one of them will be removed
p – paragraph – if two or more paragraphs anywhere in the corpus are identified as identical or very similar, only one will be kept and the rest will be deleted. This may result in a paragraph being removed from a text while the rest of the text is kept in the corpus with the paragraph missing
s – sentence – as above but at the sentence level
Note that structure names for paragraph or sentence might be different in each corpus. The dropdown might also contain additional structures if they exist in the corpus.
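To illustrate what comparing content at a given level means, here is a minimal Python sketch. It is not the tool's actual deduplication algorithm; it assumes a simple similarity measure (Jaccard overlap of word trigrams) and an arbitrary threshold of 0.5. The names shingles, jaccard and deduplicate are hypothetical and chosen only for this example.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams in text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Very short units are represented by a single tuple of all their words.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(units, threshold=0.5, n=3):
    """Keep the first instance of each unit; drop later identical or similar ones.

    `units` is a list of strings at the chosen level (files, paragraphs or
    sentences). The threshold is an assumption made for this illustration.
    """
    kept = []
    kept_shingles = []
    for unit in units:
        current = shingles(unit, n)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # identical or very similar to a unit that was already kept
        kept.append(unit)
        kept_shingles.append(current)
    return kept


paragraphs = [
    "The quick brown fox jumps over the lazy dog.",
    "A completely different paragraph about corpus linguistics.",
    "The quick brown fox jumps over the lazy dog!",
    "The quick brown fox jumps over the lazy cat.",
]
result = deduplicate(paragraphs)
for paragraph in result:
    print(paragraph)
```

With these assumed settings, only the first and second paragraphs are kept: the third is identical once punctuation is ignored, and the fourth differs by a single word. Applying the same function to whole files or to sentences corresponds to choosing a different level in 6.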
7 This is a complete list of the structures found in the corpus. Tick the ones which should be kept. The unticked ones will be converted to corpus text and will be treated as words/tokens. If in doubt, keep all of them ticked.
It is recommended that you keep at least the g (glue) structure and the structures for sentences and paragraphs (s and p in the screenshot).
8 Click Compile to start the corpus compilation.