Create a new corpus from files

Sketch Engine is a tool to build a corpus by downloading content from the web or by uploading files. The latter is covered on this page.

A corpus can be built by combining both methods. More data can be added at any later point to make the corpus larger.

Who can access my data?

Sketch Engine is not a public cloud. Texts you upload will be stored in your personal space in your account. Other users cannot access your texts.

You can, however, choose to grant access to individually selected users by sharing the corpus. If your account belongs to a site licence (multi-user account), you can grant access to all other members of the same site licence. Sharing never happens automatically: you have to take an explicit action for it to happen.

How to create a corpus by uploading files

There are 3 ways to reach the corpus building tool:

  • on the corpus dashboard, click NEW CORPUS
  • on the select corpus advanced screen, click NEW CORPUS
  • open the corpus selector at the top of each screen and click CREATE CORPUS

Sketch Engine also supports building parallel corpora from aligned texts, which follows a separate set of steps.

In the corpus building interface:

  • type a name for your new corpus, select the language, optionally provide a description and click NEXT
  • select the I have my own texts option
  • drag and drop the files or select them from your hard drive
  • multiple files can be uploaded as one zip archive
  • click the help icons to learn about the options and settings

This process can be repeated to make the corpus larger or can be combined with building from the web.

How to optimize your corpus

You can optimize your corpus by adding a corpus description and labels, changing text types, etc.

Supported formats

The complete list of supported file formats includes:

  • .doc, .docx, .htm, .html, .tei, .tmx, .txt, .vert, .xml
  • .pdf (scanned images must be OCRed before uploading)
  • .xls, .xlsx, .tmx, .xlf/.xliff, .ods (for parallel corpora only)
  • .zip, .tar.gz (to upload a large number of files at once)
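
When many files need uploading, it can help to bundle them into one zip archive first. The following is a minimal Python sketch, not part of Sketch Engine: the function name `bundle_for_upload` and the folder layout are hypothetical, and the extension set simply copies the general-purpose formats listed above (the parallel-only formats are left out).

```python
import zipfile
from pathlib import Path

# General-purpose extensions copied from the list above.
# Formats accepted for parallel corpora only are deliberately omitted.
SUPPORTED = {
    ".doc", ".docx", ".htm", ".html", ".tei", ".tmx",
    ".txt", ".vert", ".xml", ".pdf",
}


def bundle_for_upload(source_dir, archive_path):
    """Zip every file under source_dir that has a supported extension.

    Returns the archive member names in the order they were added.
    """
    source = Path(source_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            # Compare extensions case-insensitively so "REPORT.PDF" is kept.
            if path.is_file() and path.suffix.lower() in SUPPORTED:
                name = path.relative_to(source).as_posix()
                archive.write(path, arcname=name)
                added.append(name)
    return added
```

Calling `bundle_for_upload("my_texts", "upload.zip")` would write `upload.zip` and return the names of the files it contains, so anything skipped because of its extension is easy to spot before uploading.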

An XML file can also be used if you upload it as plain text, but it should contain only text with structural mark-up (such as document or paragraph boundaries, or document metadata). More complex XML will not be processed correctly. Here is a sample of XML text that would be processed correctly:

<xml>
    <doc author="Jan" title="Example doc 1">
        <p>This is a paragraph.</p>
        <p>This is another paragraph.</p>
    </doc>
    <doc author="Jan" title="Example doc 2">
        <p>I will add some more text here.</p>
    </doc>
</xml>
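
A file in this shape can also be generated rather than typed by hand. The sketch below uses Python's standard `xml.etree.ElementTree` module to produce the same structure as the sample; the function name `build_corpus_xml` and its input format are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def build_corpus_xml(documents):
    """Build XML with only structural mark-up: <doc> and <p> elements.

    documents: list of (metadata, paragraphs) pairs, where metadata is a
    dict of attribute names to string values and paragraphs is a list of
    strings.
    """
    root = ET.Element("xml")
    for metadata, paragraphs in documents:
        # Metadata becomes attributes on <doc>, as in the sample above.
        doc = ET.SubElement(root, "doc", metadata)
        for text in paragraphs:
            ET.SubElement(doc, "p").text = text
    # Serialising through ElementTree escapes &, < and > in the text.
    return ET.tostring(root, encoding="unicode")


xml_text = build_corpus_xml([
    ({"author": "Jan", "title": "Example doc 1"},
     ["This is a paragraph.", "This is another paragraph."]),
])
```

Building the file through a serialiser instead of string concatenation keeps the mark-up well-formed even when the text itself contains characters such as `&` or `<`.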

PDF files are first converted into plain text before the corpus is created. This conversion is still an unsolved problem in computer science, so some PDFs may not be processed correctly, especially those with multiple columns, headers and footers, or words split at the end of lines.
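
To illustrate the split-word problem, here is a rough Python sketch of how text already extracted from a PDF could be cleaned before being uploaded as a .txt file. It is not how Sketch Engine converts PDFs; the function name `rejoin_split_words` is hypothetical, and the rule is a simple heuristic that can also wrongly join genuine hyphenated compounds that happen to break at a line end.

```python
import re

# A word fragment, a hyphen at the very end of a line, then a continuation
# starting with a lowercase ASCII letter on the next line.
_SPLIT_WORD = re.compile(r"(\w+)-\n([a-z]\w*)")


def rejoin_split_words(text):
    """Join words that were hyphenated across a line break."""
    return _SPLIT_WORD.sub(r"\1\2", text)
```

Requiring a lowercase continuation leaves cases such as "Anglo-" followed by "Saxon" untouched, at the cost of missing split words in languages whose letters fall outside the ASCII range.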

Configuration template (for advanced users): Instead of the default template for the selected language, you may select a custom configuration template. If you want to upload a vertical file, this option is typically the only way to create a corpus (unless you are uploading a vertical file in the same format as the default one in Sketch Engine). A template can be created in My Sketch Engine -> My Templates, and its main purpose is to define what attribute appears in which column in the vertical file. Below is an example of a basic template for a vertical file containing three tab-separated columns: word, lemma, and tag.

ATTRIBUTE "word" {
}
ATTRIBUTE "lemma" {
}
ATTRIBUTE "tag" {
}
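
For orientation, the sketch below shows what a matching vertical file could look like when produced from Python: one token per line, with the columns in the same order as the attributes in the template (word, lemma, tag) and separated by tabs. The function name `to_vertical` and the example tag values are assumptions for illustration, not part of Sketch Engine.

```python
def to_vertical(tokens):
    """Render (word, lemma, tag) triples as tab-separated vertical text.

    The column order must match the order of the ATTRIBUTE entries in the
    template: word first, then lemma, then tag.
    """
    lines = []
    for word, lemma, tag in tokens:
        for field in (word, lemma, tag):
            # A tab or newline inside a field would shift the columns.
            if "\t" in field or "\n" in field:
                raise ValueError(f"field contains a tab or newline: {field!r}")
        lines.append("\t".join((word, lemma, tag)))
    return "\n".join(lines) + "\n"


# Hypothetical tokens; the tag values are placeholders.
vertical_text = to_vertical([
    ("Dogs", "dog", "NNS"),
    ("bark", "bark", "VBP"),
])
```

The resulting text could be saved with a .vert extension and uploaded together with the custom template.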

Some tools (Word Sketch, WS Difference, Thesaurus, term extraction, ...) will not be available for corpora created from vertical files.