A token is the smallest unit into which a corpus is divided. Typically, each word form and each punctuation mark (comma, full stop, …) is a separate token (but "don't" in English consists of 2 tokens). Corpora therefore contain more tokens than words. Spaces between words are not tokens. A text is divided into tokens by a tool called a tokenizer, which is often specific to each language.
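The rules above can be illustrated with a minimal tokenizer sketch. This is an assumption for illustration only, not the tokenizer actually used by Sketch Engine: it splits off punctuation and treats the English clitic "n't" as its own token, so "don't" yields 2 tokens.

```python
import re

def tokenize(text):
    # Illustrative English tokenizer (a simplified sketch, not Sketch Engine's
    # real tool). The pattern, in order of preference:
    #   \w+(?=n't)  word characters directly before "n't" (e.g. "do" in "don't")
    #   n't         the clitic "n't" as a separate token
    #   \w+         any other run of word characters
    #   [^\w\s]     a single punctuation character; spaces are never tokens
    pattern = r"\w+(?=n't)|n't|\w+|[^\w\s]"
    return re.findall(pattern, text)

print(tokenize("But don't stop."))
# → ['But', 'do', "n't", 'stop', '.']
```

Note that the sentence has four words but five tokens, and that the spaces between words produce no tokens at all.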