Create a new corpus from files
Sketch Engine is a tool to build a corpus by downloading content from the web or by uploading files. The latter is covered on this page.
A corpus can be built by combining both methods. More data can be added at any later point to make the corpus larger.
How to create a corpus by uploading files
There are 3 ways to reach the corpus building tool:
- on the corpus dashboard, click NEW CORPUS
- on the select corpus advanced screen, click NEW CORPUS
- open the corpus selector at the top of each screen and click CREATE CORPUS
Sketch Engine also supports building parallel corpora from aligned texts; that process has its own set of steps.
In the corpus building interface:
- type a name for your new corpus, select the language, optionally provide a description and click NEXT
- select the I have my own texts option
- drag and drop the files or select them from your hard drive
- multiple files can be uploaded as one zip archive
- click on the help icons to learn about the options and settings
This process can be repeated to make the corpus larger or can be combined with building from the web.
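Since multiple files can be uploaded as one zip archive, it may help to build that archive with a script. The following is a minimal sketch, not part of Sketch Engine itself: the function name zip_texts and the folder layout are assumptions made for illustration, and only the Python standard library is used.

```python
import zipfile
from pathlib import Path


def zip_texts(source_dir, archive_path):
    """Pack every regular file under source_dir into one zip archive.

    Returns the archive member names in the order they were written.
    """
    source = Path(source_dir)
    archive = Path(archive_path).resolve()
    members = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            # Skip directories, and skip the archive itself if it lives in source_dir.
            if path.is_file() and path.resolve() != archive:
                name = path.relative_to(source).as_posix()
                zf.write(path, arcname=name)
                members.append(name)
    return members
```

Calling zip_texts("my_texts", "my_texts.zip") (both names hypothetical) would produce a single archive that can then be dragged into the upload area.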
How to optimize your corpus
Find out how to optimize your corpus by adding a corpus description, labels, changing text types, etc.
Supported formats
The complete list of supported file formats includes:
.doc, .docx, .htm, .html, .tei, .tmx, .txt, .vert, .xml
.pdf (scanned images must be OCRed before uploading)
.xls, .xlsx, .tmx, .xlf/.xliff, .ods (for parallel corpora only)
.zip, .tar.gz (to upload a large number of files at once)
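Before uploading a large batch, file names could be checked against the list above. This is only a sketch: the extension sets are copied from this page, while the function name classify and the category labels are assumptions made for illustration. Because .tmx appears in both the general list and the parallel-only list, the sketch reports it as general.

```python
from pathlib import Path

# Extensions copied from the list on this page.
GENERAL = {".doc", ".docx", ".htm", ".html", ".tei", ".tmx",
           ".txt", ".vert", ".xml", ".pdf"}
PARALLEL_ONLY = {".xls", ".xlsx", ".tmx", ".xlf", ".xliff", ".ods"}


def classify(filename):
    """Return 'archive', 'general', 'parallel-only' or 'unsupported'."""
    name = filename.lower()
    # .tar.gz is a double extension, so test it before taking the last suffix.
    if name.endswith(".tar.gz") or name.endswith(".zip"):
        return "archive"
    suffix = Path(name).suffix
    if suffix in GENERAL:
        return "general"
    if suffix in PARALLEL_ONLY:
        return "parallel-only"
    return "unsupported"
```

The check looks at the file name only; it cannot tell whether a PDF contains scanned images that still need OCR.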
An XML file can also be uploaded as plain text, but it should only contain text with structural mark-up (such as document or paragraph boundaries and document metadata). More complex XML will not be processed correctly. Here is a sample of XML text that would be processed correctly:
<xml>
<doc author="Jan" title="Example doc 1">
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</doc>
<doc author="Jan" title="Example doc 2">
<p>I will add some more text here.</p>
</doc>
</xml>
With regard to PDF files, please bear in mind that they are first converted into plain text to create a corpus. This conversion is still an unsolved problem in computer science, especially for PDF files with multiple columns, headers/footers, or words split at the end of lines, all of which may not be processed correctly.
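Returning to the XML sample above: when the texts start out as plain strings, a file with the same simple structure could be generated with a short script. This is a sketch under assumptions, not an official tool: the function name build_corpus_xml and the input dictionary layout are invented for illustration, and the element and attribute names simply mirror the sample.

```python
import xml.etree.ElementTree as ET


def build_corpus_xml(documents):
    """Build XML with <doc> and <p> elements, mirroring the sample above.

    documents is a list of dicts with the keys 'author', 'title'
    and 'paragraphs' (a list of strings).
    """
    root = ET.Element("xml")
    for doc in documents:
        doc_el = ET.SubElement(
            root, "doc", {"author": doc["author"], "title": doc["title"]}
        )
        for paragraph in doc["paragraphs"]:
            # ElementTree escapes &, < and > in the text on serialization.
            ET.SubElement(doc_el, "p").text = paragraph
    return ET.tostring(root, encoding="unicode")
```

The result is returned as one string without line breaks between elements; it contains only structural mark-up, which is the kind of XML this page describes as being processed correctly.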
Configuration template - advanced users
Instead of the default template for the selected language, you may select a custom configuration template. If you want to upload a vertical file, this option is typically the only way to create a corpus (unless the vertical file uses the same format as the default one in Sketch Engine). A template can be created in My Sketch Engine -> My Templates, and its main purpose is to define which attribute appears in which column of the vertical file. Below is an example of a basic template for a vertical file containing three tab-separated columns: word, lemma, and tag.
ATTRIBUTE "word" {
}
ATTRIBUTE "lemma" {
}
ATTRIBUTE "tag" {
}
Some tools (Word Sketch, WS Difference, Thesaurus, term extraction, ...) will not be available for corpora created from vertical files.
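To make the three-column layout concrete, the sketch below writes already-annotated tokens as one token per line with word, lemma and tag separated by tabs, matching the column order in the template above. Several things here are assumptions for illustration: the function name to_vertical, the input layout, the example tags, and the <doc> and <s> structure lines, which may need matching entries in the template.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(documents):
    """Serialize annotated documents as vertical text.

    documents is a list of dicts with the keys 'title' and 'sentences';
    each sentence is a list of (word, lemma, tag) tuples.
    """
    lines = []
    for doc in documents:
        lines.append("<doc title=%s>" % quoteattr(doc["title"]))
        for sentence in doc["sentences"]:
            lines.append("<s>")
            for word, lemma, tag in sentence:
                for value in (word, lemma, tag):
                    # A tab or newline inside a value would shift the columns.
                    if "\t" in value or "\n" in value:
                        raise ValueError("invalid character in %r" % value)
                lines.append("\t".join((word, lemma, tag)))
            lines.append("</s>")
        lines.append("</doc>")
    return "\n".join(lines) + "\n"
```

The returned string could be saved with a .vert extension and uploaded together with the custom template.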




