token

A token is the smallest unit that a corpus consists of. A token is normally one of the following:

a word form: going, trees, Mary, twenty-five…
punctuation: comma, dot, question mark, quotes…
numbers: 50,000…
abbreviations*, product names: 3M, i600, XP, e.g., etc., FB …
anything else between spaces

There are two types of tokens: words and nonwords. Because punctuation and other nonwords also count as tokens, a corpus contains more tokens than words. Spaces are not tokens. A text is divided into tokens by a tool called a tokenizer, which is often specific to each language.

*If an abbreviation contains a dot, the dot is included as part of the token. For example, ‘e.g.’ counts as a single token.
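
The principles above can be illustrated with a minimal Python sketch. It is not the tokenizer used by any particular corpus tool: the abbreviation list, the regular expression and the function names are assumptions made for illustration, and the rule that a word is a token starting with a letter is likewise an assumption.

```python
import re

# Hypothetical abbreviation list; a real tokenizer would use a
# language-specific list.
ABBREVIATIONS = ["e.g.", "i.e.", "etc."]

TOKEN_RE = re.compile(
    "|".join(re.escape(a) for a in ABBREVIATIONS)  # abbreviations keep their dots
    + r"|\d+(?:,\d{3})+"    # numbers with thousands separators, e.g. 50,000
    + r"|\w+(?:-\w+)*"      # word forms and names, e.g. trees, twenty-five, 3M
    + r"|[^\w\s]"           # any single punctuation mark
)

def tokenize(text):
    """Split text into tokens; spaces are skipped and never become tokens."""
    return TOKEN_RE.findall(text)

def is_word(token):
    """Assumption for this sketch: a word is a token that starts with a letter."""
    return token[0].isalpha()

tokens = tokenize("Mary bought 50,000 trees, e.g. oaks.")
# tokens == ['Mary', 'bought', '50,000', 'trees', ',', 'e.g.', 'oaks', '.']
words = [t for t in tokens if is_word(t)]
```

In this example the sentence yields 8 tokens but only 5 words, which shows why a corpus contains more tokens than words.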

Exceptions

These general principles apply to all languages but some language-specific features may be handled differently. Here are some examples:

don’t in English consists of 2 tokens: do + n’t.
Verbs with pronominal clitics in Spanish, Italian, French, Portuguese etc. count as one token (Spanish dárselo is 1 token, even though it consists of dar + se + lo).
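
A language-specific rule of this kind could be sketched in Python as follows. The function name and the regular expression are hypothetical; the sketch only covers English negative contractions and leaves every other token, such as Spanish dárselo, unchanged.

```python
import re

# Matches English negative contractions written with ' or ’, e.g. don't, isn’t.
NEGATION_RE = re.compile(r"^(\w+)(n['’]t)$", re.IGNORECASE)

def split_english_negation(token):
    """Return [stem, n't] for an English negative contraction, else [token]."""
    match = NEGATION_RE.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]

english = split_english_negation("don’t")    # ['do', 'n’t']
spanish = split_english_negation("dárselo")  # ['dárselo']
```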

How to check tokenization

The wordlist works on tokens only, so it can be used to check how a string was tokenized. Search for the string using the wordlist. If it is found, it is one token. If it is not found, it is not one token.
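
The logic of this check can be sketched in Python. This is only an illustration of the idea, not the wordlist tool itself: the token list is a made-up example of tokenizer output, and both function names are hypothetical.

```python
from collections import Counter

def build_wordlist(tokens):
    """Count how often each token occurs, like a simple frequency wordlist."""
    return Counter(tokens)

def is_single_token(candidate, wordlist):
    """A string is one token only if it appears as an entry in the wordlist."""
    return wordlist[candidate] > 0

# Made-up tokenizer output for the text: I don’t know, e.g. why.
corpus_tokens = ["I", "do", "n’t", "know", ",", "e.g.", "why", "."]
wordlist = build_wordlist(corpus_tokens)

found_abbreviation = is_single_token("e.g.", wordlist)   # True
found_contraction = is_single_token("don’t", wordlist)   # False
```

Here e.g. is found because it was stored as a single token, while don’t is not found because it was stored as do and n’t.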

See also

word

nonword

word form
