A token is the smallest unit into which a corpus is divided. A token normally refers to:

  • a word form: going, trees, Mary, twenty-five
  • punctuation: comma, dot, question mark, quotes…
  • a number: 50,000…
  • abbreviations, product names: 3M, i600, XP, FB…
  • anything else between spaces

The general principle applies to all languages, but language-specific features may be handled in a special way.


For example, don’t in English consists of 2 tokens: do + n’t.

Verbs with pronominal clitics in Spanish, Italian, French, Portuguese etc. count as one token (Spanish dárselo is 1 token, even though it consists of dar + se + lo).
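The two language-specific cases above can be illustrated with a minimal sketch. It is not the tokenizer of any particular corpus tool: the function name split_word, the lang codes and the single n’t rule are assumptions made for illustration only.

```python
import re

# Hypothetical rule: an English word ending in the negative clitic "n't"
# (straight or curly apostrophe) is split into stem + clitic.
_NT_RULE = re.compile(r"^(?P<stem>\w+)(?P<clitic>n['’]t)$", re.IGNORECASE)


def split_word(word: str, lang: str) -> list[str]:
    """Return the tokens for one space-delimited word.

    Only English ("en") has a splitting rule in this sketch; every other
    language keeps the word as a single token, which matches the treatment
    of verbs with pronominal clitics described above.
    """
    if lang == "en":
        match = _NT_RULE.match(word)
        if match:
            return [match.group("stem"), match.group("clitic")]
    return [word]


print(split_word("don’t", "en"))    # ['do', 'n’t']
print(split_word("dárselo", "es"))  # ['dárselo']
```

A real tokenizer would carry many more rules per language; the point here is only that the same function may split a word in one language and leave a comparable form intact in another.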

Corpora contain more tokens than words because punctuation marks and other non-word items are also counted as tokens. Spaces are not tokens. A text is divided into tokens by a tool called a tokenizer, which is often specific to each language.
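As a rough illustration of why the token count exceeds the word count, the sketch below uses a simple regular-expression tokenizer. The pattern, the function names tokenize and count_words, and the rule that a word is any token containing a letter are all assumptions for this example, not the behaviour of a specific tokenizer.

```python
import re

# Illustrative pattern only; real tokenizers are usually language-specific.
TOKEN_RE = re.compile(
    r"\d+(?:[.,]\d+)+"   # numbers with separators, such as 50,000
    r"|\w+(?:-\w+)*"     # word forms and similar strings: going, twenty-five, 3M
    r"|[^\w\s]"          # any single punctuation mark
)


def tokenize(text: str) -> list[str]:
    """Split text into tokens; whitespace is skipped and never becomes a token."""
    return TOKEN_RE.findall(text)


def count_words(tokens: list[str]) -> int:
    """Count tokens that contain at least one letter (assumed definition of a word)."""
    return sum(1 for token in tokens if any(ch.isalpha() for ch in token))


tokens = tokenize("Mary planted 50,000 trees, twenty-five of them today.")
print(tokens)
print(len(tokens), count_words(tokens))  # 10 tokens, 7 words
```

The comma, the full stop and the number 50,000 are tokens but not words, which accounts for the difference between the two counts.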