A token is the smallest unit that a corpus consists of. A token normally refers to:
- a word form: going, trees, Mary, twenty-five…
- punctuation: comma, dot, question mark, quotes…
- numbers: 50,000…
- abbreviations, product names: 3M, i600, XP, FB…
- anything else between spaces
There are two types of tokens: words and nonwords. Corpora contain more tokens than words. Spaces are not tokens. A text is divided into tokens by a tool called a tokenizer, which is often specific to each language.
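The principles above can be illustrated with a minimal sketch of a tokenizer. It is not the tokenizer used by any particular corpus tool: the regular expression, the function names `tokenize` and `is_word`, and the rule that a word is a token starting with a letter are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a run of word characters, optionally joined by
# hyphens, dots or commas (as in "twenty-five" or "50,000"), or else any
# single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:[-.,]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into tokens; whitespace separates tokens but is never one."""
    return TOKEN_PATTERN.findall(text)


def is_word(token):
    """Assumption for this sketch: a word is a token that starts with a letter."""
    return token[0].isalpha()


tokens = tokenize("Mary bought 50,000 trees, twenty-five each.")
words = [t for t in tokens if is_word(t)]

print(tokens)
print(len(tokens), "tokens,", len(words), "words")
```

In this sketch the comma, the full stop and the number are tokens but not words, which is why the token count is higher than the word count.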
These general principles apply to all languages but some language-specific features may be handled differently. Here are some examples:
- don't in English consists of 2 tokens: do + n't.
- Verbs with pronominal clitics in Spanish, Italian, French, Portuguese etc. count as one token (Spanish dárselo is 1 token, even though it consists of dar + se + lo).
How to check tokenization
The wordlist works on tokens only, so it can be used to check how a string was tokenized. Search for the string using the wordlist. If it is found, it is one token. If it is not found, it is not one token.
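The same check can be sketched in code. The example below is only a stand-in for a real wordlist tool: the token list is invented, `build_wordlist` and `is_single_token` are hypothetical helper names, and the wordlist is modelled as a simple frequency count over tokens.

```python
from collections import Counter


def build_wordlist(corpus_tokens):
    """Model a wordlist as a frequency count over the corpus tokens."""
    return Counter(corpus_tokens)


def is_single_token(candidate, wordlist):
    """A string counts as one token only if it appears in the wordlist."""
    return wordlist[candidate] > 0


# Invented, already tokenized sample: "don't" was split into "do" + "n't",
# while the Spanish form "dárselo" was kept as one token.
corpus_tokens = ["I", "do", "n't", "know", ".", "Quiero", "dárselo", "."]
wordlist = build_wordlist(corpus_tokens)

for candidate in ["don't", "n't", "dárselo"]:
    print(candidate, is_single_token(candidate, wordlist))
```

The check is only meaningful for strings that actually occur in the corpus: a string that never occurs is also missing from the wordlist.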