Sketch Engine can handle texts in any language, including languages that are not supported explicitly. The number of available features depends on the script the language uses. Scripts fall into two groups: whitespace scripts and non-whitespace scripts.
A whitespace script separates words with a space, a paragraph break or a similar character that appears as white space on screen or in print. Typical examples are languages written in the Latin, Cyrillic or Arabic scripts. Many scripts of India also belong to this category.
A non-whitespace script does not separate words with whitespace; typical examples are Chinese, Japanese or Thai. Texts in these scripts that have been transliterated into a whitespace script can use the same functionality as whitespace scripts.
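The difference between the two script types can be illustrated with a minimal Python sketch. It is not Sketch Engine's universal tokenizer, which presumably also handles punctuation and other details; the function names `whitespace_tokens` and `character_tokens` and the sample strings are assumptions made for this illustration only.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split text on runs of whitespace (spaces, tabs, line breaks)."""
    return text.split()


def character_tokens(text: str) -> list[str]:
    """Treat every non-whitespace character as a separate unit."""
    return [ch for ch in text if not ch.isspace()]


latin = "Sketch Engine handles many languages"
chinese = "我喜欢语言学"

# A whitespace script yields one item per word.
print(whitespace_tokens(latin))

# A non-whitespace script yields a single undivided item,
# so word-level features have nothing to work with.
print(whitespace_tokens(chinese))

# Character-level processing is still possible.
print(character_tokens(chinese))
```

Splitting on whitespace gives usable word units only for the first string, which is why word-level features depend on the script.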
|feature||whitespace script||non-whitespace script|
|tokenization||YES, with a universal tokenizer||NO *)|
|concordance search||YES at word level or character level, regex allowed; NO lemma search or POS search||YES but only at character level, regex allowed; a concordance for a string of characters can be generated, no other searches are available *)|
can be calculated from a concordance or via word sketches
|word lists||YES||NO *)|
|Word Sketch||YES, a universal word sketch grammar is used; users can write their own word sketch grammar to suit their needs||NO *)|
|Create corpus from the web||YES **)||YES **)|
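As a rough illustration of a character-level concordance with a regular expression, the following Python sketch finds every match of a pattern in a text and returns it with a fixed number of characters of context on each side. It only mimics the idea and is not how Sketch Engine implements concordance search; the function `char_concordance`, its `width` parameter and the sample text are hypothetical.

```python
import re


def char_concordance(text: str, pattern: str, width: int = 5) -> list[tuple[str, str, str]]:
    """Return (left context, match, right context) for each regex match.

    Context is measured in characters, not words, so the function works
    on text that has no whitespace between words.
    """
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines


sample = "我喜欢语言学。他也喜欢语言。"

# The pattern matches "语言" optionally followed by "学".
for left, hit, right in char_concordance(sample, "语言学?", width=3):
    print(f"{left} [{hit}] {right}")
```

Because the search works on raw character strings, it needs no tokenization, which matches the situation described for non-whitespace scripts in the table.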