A token is the smallest unit that a corpus consists of. A token normally refers to:
- a word form: going, trees, Mary, twenty-five…
- punctuation: comma, dot, question mark, quotes…
- a number: 50,000…
- abbreviations, product names: 3M, i600, XP, FB…
- anything else between spaces
There are two types of tokens: words and nonwords. Because nonwords such as punctuation also count as tokens, a corpus contains more tokens than words. Spaces are not tokens. A text is divided into tokens by a tool called a tokenizer, which is often specific to each language.
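To make the difference between tokens and words concrete, here is a minimal sketch of a tokenizer in Python. It is not the tokenizer used for any particular corpus: the regular expression, the names TOKEN_RE, tokenize and is_word, and the rule that a word is a token starting with a letter are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a run of letters/digits that may contain internal
# hyphens, commas or dots (as in "twenty-five" or "50,000"), or else any
# single non-space character such as a punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:[-,.]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into tokens; spaces are discarded, not returned."""
    return TOKEN_RE.findall(text)

def is_word(token):
    """Assumed rule: a word starts with a letter, anything else is a nonword."""
    return token[0].isalpha()

tokens = tokenize("Mary bought 50,000 trees, twenty-five at a time.")
words = [t for t in tokens if is_word(t)]
nonwords = [t for t in tokens if not is_word(t)]

print(len(tokens), len(words))  # 10 tokens, 7 words
print(nonwords)                 # ['50,000', ',', '.']
```

The sentence yields 10 tokens but only 7 words, because the number and the two punctuation marks are tokens without being words.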
These general principles apply to all languages but some language-specific features may be handled differently. Here are some examples:
- don’t in English consists of 2 tokens: do + n’t.
- Verbs with pronominal clitics in Spanish, Italian, French, Portuguese etc. count as one token (Spanish dárselo is 1 token, even though it consists of dar + se + lo).
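The two examples above can be sketched as language-specific rules layered on one base pattern. This is only an illustration under stated assumptions: BASE_RE, tokenize_en and tokenize_es are hypothetical names, and the English rule handles nothing beyond the n’t contraction.

```python
import re

# Hypothetical base pattern: letters/digits with optional internal hyphens
# or apostrophes (straight or typographic), or a single punctuation mark.
BASE_RE = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

def tokenize_en(text):
    """English sketch: split a trailing n't / n’t into its own token."""
    tokens = []
    for tok in BASE_RE.findall(text):
        low = tok.lower()
        if len(tok) > 3 and (low.endswith("n't") or low.endswith("n’t")):
            tokens.extend([tok[:-3], tok[-3:]])  # don’t -> do + n’t
        else:
            tokens.append(tok)
    return tokens

def tokenize_es(text):
    """Spanish sketch: verbs with clitics are left as a single token."""
    return BASE_RE.findall(text)

print(tokenize_en("I don’t know."))    # ['I', 'do', 'n’t', 'know', '.']
print(tokenize_es("Quiero dárselo."))  # ['Quiero', 'dárselo', '.']
```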
How to check tokenization
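The wordlist works on tokens only, so it can be used to check how a string was tokenized. Search for the string using the wordlist. If it is found, it is one token. If it is not found, it is not a single token in that corpus: either it was split into several tokens or it does not occur in the corpus at all.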
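The check can be pictured as a lookup in a frequency list of tokens. The sketch below is a simplified stand-in, not the actual wordlist tool: build_wordlist and is_single_token are hypothetical helpers, and the tiny token list stands in for an already tokenized corpus.

```python
from collections import Counter

def build_wordlist(tokens):
    """Count how often each token occurs (a toy stand-in for a wordlist)."""
    return Counter(tokens)

def is_single_token(candidate, wordlist):
    """True if the candidate string occurs in the wordlist as one token."""
    return wordlist[candidate] > 0

# Assumed output of a tokenizer for "I don’t know. Quiero dárselo."
corpus_tokens = ["I", "do", "n’t", "know", ".", "Quiero", "dárselo", "."]
wordlist = build_wordlist(corpus_tokens)

print(is_single_token("dárselo", wordlist))  # True: stored as one token
print(is_single_token("n’t", wordlist))      # True: n’t is its own token
print(is_single_token("don’t", wordlist))    # False: split into do + n’t
```

A False result only shows that the string is not stored as one token; a string that never occurs in the corpus gives the same result.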