The type/token ratio, often shortened to TTR, is a simple measure of lexical diversity. A TTR value can only be interpreted by comparing it with the TTR of a different text (corpus). The corpus with the higher TTR contains a greater variety of words than the other corpus. In other words, its authors use more different words, or a richer vocabulary, than the authors of the texts in the other corpus.
The term types refers to the number of distinct words. The term tokens refers to the number of all words. This sentence:
We met face to face.
contains 4 types but 5 tokens because face occurs twice.
The TTR is calculated by dividing types (the number of different words) by tokens (the total number of words). The result is multiplied by 100 to display the value as a percentage.
types ÷ tokens × 100 = TTR
The TTR of the example sentence is 4 ÷ 5 × 100 = 80%.
More examples:
We met on a bus. ⇢ 5 ÷ 5 × 100 = 100%
A birch is a European tree. ⇢ 5 ÷ 6 × 100 = 83%
Aquí está uno de los mejores hoteles de lujo de Madrid. ⇢ 9 ÷ 11 × 100 = 82%
Im Frühjahr sind es Himbeeren, im Herbst Äpfel. ⇢ 7 ÷ 8 × 100 = 88%
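The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the way Sketch Engine counts types and tokens. The tokenizer is an assumption: a word is any run of letters or digits, punctuation is ignored, and words are lowercased so that A/a and Im/im count as one type, which matches the counts in the examples above. The function names tokenize and type_token_ratio are made up for this sketch.

```python
import re
import unicodedata


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (assumed tokenization rule)."""
    # Normalise so that accented letters are single characters matched by \w.
    normalized = unicodedata.normalize("NFC", text)
    return re.findall(r"\w+", normalized.lower())


def type_token_ratio(text: str) -> float:
    """Return types / tokens * 100 for the given text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    types = set(tokens)  # distinct words
    return len(types) / len(tokens) * 100


sentences = [
    "We met face to face.",
    "We met on a bus.",
    "A birch is a European tree.",
    "Aquí está uno de los mejores hoteles de lujo de Madrid.",
    "Im Frühjahr sind es Himbeeren, im Herbst Äpfel.",
]
for sentence in sentences:
    print(f"{type_token_ratio(sentence):5.1f}%  {sentence}")
```

A different tokenization rule (for example, keeping case or counting punctuation as tokens) would give different counts, so TTR values are only comparable when both texts are counted the same way.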
In practice, it does not make much sense to compute the TTR of a single sentence. Normally, TTR is calculated for much larger stretches of text, such as documents or whole corpora. The texts being compared should be of identical or very similar size. This is explained further below.
TTR in Sketch Engine
Sketch Engine does not calculate TTR but provides type and token counts so that TTR can be calculated manually. The values are found on the corpus info page.
types ⇢ look for DASHBOARD - CORPUS INFO - LEXICON SIZES - word
tokens ⇢ look for DASHBOARD - CORPUS INFO - COUNTS - Words
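Once the two counts have been read from the corpus info page, the manual calculation is a single division. The sketch below assumes the counts are typed in by hand; it does not call any Sketch Engine API, and the function name ttr_from_counts is hypothetical. The sample values are the Brown corpus counts quoted further down this page.

```python
def ttr_from_counts(types: int, tokens: int) -> float:
    """Return the TTR as a percentage from a type count and a token count."""
    if tokens <= 0:
        raise ValueError("token count must be positive")
    if types < 0:
        raise ValueError("type count cannot be negative")
    return types / tokens * 100


# Counts copied by hand from the corpus info page (Brown corpus figures).
brown_types = 53_048
brown_tokens = 1_007_299
print(f"TTR = {ttr_from_counts(brown_types, brown_tokens):.1f}%")
```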
Limitations of TTR
The TTR depends strongly on the length of the text. If the texts are extremely short, such as the sentences above, the values are close to 100%. As the texts get longer, the TTR drops dramatically. Look at these examples:
| text or corpus | types | tokens | TTR |
|---|---|---|---|
| a paragraph from Wikipedia | 53 | 77 | 69% |
| a page from Wikipedia (main text only) | 293 | 660 | 44% |
| a longer page from Wikipedia (without references) | 5,372 | 18,123 | 30% |
| a Project Gutenberg book | 6,811 | 66,022 | 10% |
| a longer Project Gutenberg book | 20,677 | 213,473 | 9.7% |
| a very small corpus (Brown) | 53,048 | 1,007,299 | 5.3% |
| a small corpus (BNC) | 724,893 | 96,134,547 | 0.75% |
| a standard corpus (English Web 2008) | 8,871,742 | 2,759,340,513 | 0.32% |
| a large corpus (English Web 2012) | 27,894,538 | 11,191,860,036 | 0.25% |
| a very large corpus (Timestamped JSI corpus) | 68,250,885 | 60,409,480,489 | 0.11% |
A larger corpus will nearly always have a lower TTR. This does not mean that the texts come from authors with a smaller vocabulary. It simply means that producing more text does not require proportionally more different words (types). Instead, the same types are used again and again, and new types are introduced only occasionally.
Longer texts do contain more types because longer texts typically include more topics and, therefore, more types are needed to speak about them. However, new types are introduced at a far lower rate than the rate at which the text grows. In the table above, BNC has about 95 times as many tokens as Brown but only about 14 times as many types.
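The effect can be observed on any text by recomputing the TTR as more and more of the text is read. The sketch below is only an illustration: the function name running_ttr and the sample sentence are made up, and the tokens are assumed to be already split and lowercased.

```python
def running_ttr(tokens: list[str], checkpoints: list[int]) -> list[tuple[int, float]]:
    """Return (tokens read, TTR in %) at each checkpoint position.

    Checkpoints beyond the end of the token list are silently skipped.
    """
    wanted = sorted({c for c in checkpoints if c > 0})
    seen: set[str] = set()  # types encountered so far
    results: list[tuple[int, float]] = []
    next_index = 0
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        if next_index < len(wanted) and position == wanted[next_index]:
            results.append((position, len(seen) / position * 100))
            next_index += 1
    return results


# Made-up sample text; real use would pass the tokens of a whole document.
sample = "the cat sat on the mat and the dog sat on the rug".split()
for tokens_read, ttr in running_ttr(sample, [4, 8, 13]):
    print(f"{tokens_read:3d} tokens  TTR = {ttr:5.1f}%")
```

Even in this tiny sample the TTR falls as the text grows, because words such as the, sat and on are reused while new words arrive more slowly than tokens.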
Practical implications
- The TTR should only be used to compare corpora of the same size (= texts of the same length).
- The value itself cannot be interpreted on its own. It has to be compared to another TTR.
- TTR is easier to interpret with very small corpora. For example:
— comparing the TTR of two novels of similar size can indicate the variety of language or richness of vocabulary of the authors
— comparing the TTR of the language of two children can indicate the difference in the size of their vocabulary
— comparing the TTR of language produced by language learners may be indicative of their language level or their learning progress
- TTR is difficult to interpret with larger corpora. Referring to the statistics above, BNC shows a TTR of 0.75% while English Web 2012 shows 0.25%, i.e. one third of that. However, English Web 2012 contains 27.9 million types while BNC contains only 0.7 million. At least 27 million types, i.e. most of the types in English Web 2012, do not appear in BNC at all because BNC is so small and does not cover the same variety of topics, genres and contexts. Although BNC has the higher TTR, it feels wrong to say that the language in English Web 2012 is less varied or less rich. It is questionable how TTR should be interpreted and whether it is useful in this case.
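The first implication, comparing only texts of the same size, can be approximated in code by cutting both texts down to the same number of tokens before computing the TTR. This is a sketch under assumptions: keeping only the first tokens of the longer text is one simple way to equalise size and is not prescribed by this page, the tokenizer is the same simplified one as in the earlier illustration, and the names compare_at_equal_size and ttr are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (assumed tokenization rule)."""
    return re.findall(r"\w+", text.lower())


def ttr(tokens: list[str]) -> float:
    """Return types / tokens * 100 for a non-empty token list."""
    return len(set(tokens)) / len(tokens) * 100


def compare_at_equal_size(text_a: str, text_b: str) -> tuple[int, float, float]:
    """Truncate both texts to the shorter token count and return
    (tokens used, TTR of text_a, TTR of text_b)."""
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    size = min(len(tokens_a), len(tokens_b))
    if size == 0:
        raise ValueError("both texts must contain at least one word token")
    return size, ttr(tokens_a[:size]), ttr(tokens_b[:size])


size, ttr_a, ttr_b = compare_at_equal_size(
    "We met face to face.",
    "We met on a bus and then we met on a train.",
)
print(f"compared on {size} tokens: {ttr_a:.1f}% vs {ttr_b:.1f}%")
```

Without the truncation, the longer text would be penalised simply for being longer, which is the length effect described in the limitations above.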