-
CAT tool
A CAT tool (computer-assisted translation tool) is software that helps translators keep terminology consistent across their translation jobs. It also aids the translation process by suggesting, or automatically translating, passages (segments) which the translator has already translated in the past. Data exported from CAT tools (translation memories) can be used to build a parallel corpus in Sketch Engine or uploaded for bilingual term extraction. Extracted terms can be exported as a term base (TBX) and uploaded back to the CAT tool. Parallel user corpora in Sketch Engine can be downloaded as a translation memory (TMX) and uploaded to the CAT tool. -
cluster
a process of creating groups of words in the thesaurus or word sketch. Words are grouped according to their shared collocational behavior. See more in the Clustering Neighbours documentation. -
collocate
a part of a collocation that is not the node. A collocate is dependent on the node. The collocate strong and the node wind make up the collocation strong wind. Examples of collocations (collocate + node): strong wind, icy wind, cold wind. -
collocation
A collocation is a sequence or combination of words that occur together more often than would be expected by chance (from Wikipedia, Collocation). A collocation, e.g. fatal error, typically consists of a node (error) and a collocate (fatal). Collocations can have different strengths. For example, nice house is a weak collocation because both nice and house can combine with lots of other words. On the other hand, the Opera House is a strong collocation because it is very typical for opera to occur next to house and, at the same time, opera does not combine with many other words. In Sketch Engine, the tool to use for collocations is the word sketch. The strength of a collocation is expressed by the logDice score. -
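The logDice score can be computed from three frequencies. The following is a minimal sketch, assuming the commonly published definition logDice = 14 + log2(2 · f(node, collocate) / (f(node) + f(collocate))). The function name `log_dice` and all frequencies are made up for illustration and do not come from a real corpus.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return logDice for two words that co-occur f_xy times.

    f_x and f_y are the total frequencies of the two words.
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical frequencies, for illustration only.
# A weak pair: both words are frequent and rarely occur together.
weak = log_dice(f_xy=50, f_x=100_000, f_y=200_000)
# A strong pair: one word occurs almost only next to the other.
strong = log_dice(f_xy=900, f_x=1_000, f_y=200_000)

print(round(weak, 2), round(strong, 2))
print(strong > weak)
```

Under this definition the score cannot exceed 14, which is reached only when the two words never occur without each other.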
comparable corpus [ corpus-types ]
A comparable corpus is a corpus consisting of texts from the same domain in more than one language. In contrast to a parallel corpus, the texts are not translations of each other; they only belong to the same domain and share the same metadata. An example of a comparable corpus is a corpus made from Wikipedia. -
compile
Corpus compilation refers to processing the corpus data (text) with the tools available for the language and converting the text into a corpus. Only a compiled corpus can be searched. See corpus compilation. -
concordance [ feature ]
-
concordancer [ feature ]
A concordancer is a tool (a piece of software) which searches a text corpus and displays a concordance. A concordancer is one of the features in Sketch Engine which allows for simple corpus searches as well as queries involving complex criteria that search for grammatical or lexical structures. see also concordance -
CoNLL format
CoNLL format is a variant of the vertical format that represents a syntactic parse tree. Compared with plain vertical, it has extra columns describing the syntactic structure of the words within the sentence, i.e. id, head, deprel. The number and position of these extra columns may vary depending on the specific CoNLL format.
- id is the position of the current word in the sentence (the 1st column)
- head is the parent node id of the current word (the 5th column)
- deprel contains the information about the relation by which the current node and parent node are connected (the 6th column)
<s>
1	Dropping	drop-v	VBG	14	advcl
2	down	down-x	RP	1	prt
3	abaft	abaft-i	IN	1	prep
4	the	the-x	DT	5	det
5	bridge	bridge-n	NN	3	pcomp
6	,	,-x	,	14	punct
7	the	the-x	DT	9	det
8	first	first-j	JJ	9	amod
9	thing	thing-n	NN	14	subj
10	to	to-x	TO	11	infmark
11	come	come-v	VB	9	infmod
12	into	into-i	IN	11	prep
13	view	view-n	NN	12	pcomp
14	was	be-v	VBD	0	ROOT
15	the	the-x	DT	16	det
16	funnel	funnel-n	NN	14	arg1
17	.	.-x	.	14	punct
</s>
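As a rough illustration of how the id, head and deprel columns encode the tree, the sketch below parses the example sentence and follows the head links. It assumes the six-column layout shown in this entry (id, word, lemma, tag, head, deprel) separated by whitespace; other CoNLL variants place the columns differently. The names `Token`, `parse_conll` and `path_to_root` are hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    word: str
    lemma: str
    tag: str
    head: int  # id of the parent token, 0 marks the root
    deprel: str


SAMPLE = """\
<s>
1 Dropping drop-v VBG 14 advcl
2 down down-x RP 1 prt
3 abaft abaft-i IN 1 prep
4 the the-x DT 5 det
5 bridge bridge-n NN 3 pcomp
6 , ,-x , 14 punct
7 the the-x DT 9 det
8 first first-j JJ 9 amod
9 thing thing-n NN 14 subj
10 to to-x TO 11 infmark
11 come come-v VB 9 infmod
12 into into-i IN 11 prep
13 view view-n NN 12 pcomp
14 was be-v VBD 0 ROOT
15 the the-x DT 16 det
16 funnel funnel-n NN 14 arg1
17 . .-x . 14 punct
</s>
"""


def parse_conll(text: str) -> list[Token]:
    """Parse six-column lines; simplification: any line starting with '<' is treated as a tag."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("<"):
            continue
        tok_id, word, lemma, tag, head, deprel = line.split()
        tokens.append(Token(int(tok_id), word, lemma, tag, int(head), deprel))
    return tokens


def path_to_root(token_id: int, by_id: dict[int, Token]) -> list[str]:
    """Follow head links upwards from one token until the root is reached."""
    path = []
    current = by_id[token_id]
    while True:
        path.append(current.word)
        if current.head == 0:
            return path
        current = by_id[current.head]


tokens = parse_conll(SAMPLE)
by_id = {t.id: t for t in tokens}
root = next(t for t in tokens if t.head == 0)
children = [t.word for t in tokens if t.head == root.id]

print(root.word, children)
print(path_to_root(13, by_id))
```

The root is the token whose head is 0, and every other token reaches it by repeatedly moving to its head.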
see also vertical, building word sketches from parsed corpora -
cooccurrence [ text-analysis ]
Cooccurrence, or co-occurrence, expresses how often two terms from a corpus occur alongside each other in a certain order. It usually indicates words which together create a new meaning. Such combinations are called phrasemes or multi-word expressions, e.g. black sheep or get on. Sketch Engine helps to find such words using the word sketch tool or the collocation search. Read more about further tools for text analysis. -
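To make "how often two terms occur alongside each other in a certain order" concrete, the following sketch counts ordered pairs of adjacent words in a toy text. The text and the naive whitespace tokenisation are assumptions for illustration only; the Sketch Engine tools named in this entry work on tokenised, annotated corpora rather than on a simple counter like this one.

```python
from collections import Counter

# Toy text, invented for the example.
text = "the black sheep left and the black sheep came back to the black cat"
words = text.split()

# Pair every word with its right-hand neighbour; order matters.
pair_counts = Counter(zip(words, words[1:]))

print(pair_counts[("black", "sheep")])  # 2
print(pair_counts[("sheep", "black")])  # 0
```

Because the pairs are ordered, black sheep and sheep black are counted separately.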
corpus
A corpus is a large collection of authentic texts used for studying language or generating linguistic data. Modern corpora contain texts whose total length is billions or tens of billions of words. A corpus is usually annotated (= words are labelled with information about the part of speech and grammatical category). The terms corpus, text corpus and language corpus are interchangeable. Using a corpus for any type of linguistic or language-oriented work ensures that the outcomes reflect the real use of the language. more on corpora» -
corpus architect
an intuitive tool inside Sketch Engine for creating corpora from documents or the Web which does not require any expert knowledge. See the create your own corpus page. -
corpus manager
a program used to manage text corpora, i.e. to build, edit, annotate and search corpora. Sketch Engine is the user interface to the corpus manager Manatee. -
CQL
The Corpus Query Language (CQL) is a query language used to set criteria for complex searches which cannot be carried out using the standard user interface controls. The criteria may include words or lemmas but also tags and other attributes, text types or structures. Conditions can be set for optional tokens or token repetition. -
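For illustration, here are a few queries in the bracket notation used by CQL, held as Python strings. The helper `token` is hypothetical and only assembles the notation. The attribute names (`word`, `lemma`, `tag`) and especially the tag values depend on the corpus and its tagset, so the concrete values are assumptions.

```python
def token(**conditions: str) -> str:
    """Build one CQL token expression, e.g. [lemma="go" & tag="V.*"].

    Called without arguments it returns [], which matches any token.
    """
    body = " & ".join(f'{attr}="{value}"' for attr, value in conditions.items())
    return f"[{body}]"


# A lemma followed by a token whose tag starts with N (tag values are tagset-specific).
lemma_then_noun = token(lemma="strong") + " " + token(tag="N.*")

# Token repetition: zero to two arbitrary tokens may stand between the two lemmas.
with_gap = token(lemma="make") + " " + token() + "{0,2} " + token(lemma="decision")

# An optional token: the middle token may or may not be present.
optional_middle = token(word="the") + " " + token(tag="J.*") + "? " + token(lemma="wind")

for query in (lemma_then_noun, with_gap, optional_middle):
    print(query)
```

Each bracket pair describes one token, and the quantifiers after a bracket (`?`, `{0,2}`) express optional or repeated tokens.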
CSV
CSV (comma-separated values) is a type of plain text document used for saving tabular data. It is accepted by a large variety of applications and is therefore ideal for exporting Sketch Engine results to be used in other software. CSV can be opened directly in Microsoft Excel, Open Office, Google Documents and many others.
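As a small illustration of why CSV travels well between programs, the sketch below reads tabular data with Python's standard `csv` module. The column names and rows are invented for the example and do not reproduce the layout of an actual Sketch Engine export.

```python
import csv
import io

# Invented sample data; a real file would be opened with open(path, newline="").
sample = io.StringIO(
    "collocate,frequency\n"
    "strong,120\n"
    "icy,45\n"
    "cold,80\n"
)

# DictReader uses the first row as column names and yields one dict per row.
rows = list(csv.DictReader(sample))

total = sum(int(row["frequency"]) for row in rows)
most_frequent = max(rows, key=lambda row: int(row["frequency"]))

print(total, most_frequent["collocate"])
```

All values arrive as strings, so numeric columns have to be converted explicitly, as done with `int` here.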