What is an n-gram?
An n-gram (also called a multi-word unit or MWU) is a sequence of items (numbers, digits, words, letters, etc.). In the context of text corpora, n-grams typically refer to sequences of words. A unigram is one word, a bigram is a sequence of two words, a trigram is a sequence of three words, and so on. The items inside an n-gram need not have any relation between them apart from the fact that they appear next to each other.
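As an illustration of the definition, here is a minimal Python sketch that slides a window of n items over a list of tokens. The function name ngrams and the whitespace tokenisation are assumptions made for this example; real corpus tools tokenise far more carefully.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    # A sequence of k tokens has k - n + 1 n-grams (none if k < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Naive lowercase + whitespace tokenisation, for illustration only.
sentence = "The office building was demolished yesterday"
tokens = sentence.lower().split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

for gram in bigrams:
    print(" ".join(gram))
```

A sentence of six words yields six unigrams, five bigrams and four trigrams, because each window simply starts one token later than the previous one.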
What’s the difference between n-grams and collocations?
n-grams are items (words) next to each other
collocations are words with a relation between them but not necessarily next to each other
The sentence “The office building was demolished yesterday.” contains 5 bigrams:
- the office
- office building
- building was
- was demolished
- demolished yesterday
but only 2 collocations:
- office building
- to demolish a building
A collocation can be an n-gram if the words are found immediately next to each other.
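The difference can be sketched in Python by contrasting strict adjacency with co-occurrence inside a small window. This is only a rough proxy: real collocation extraction works on lemmas and grammatical relations and uses association scores, none of which is modelled here. The helper names and the window size of 3 are assumptions for this example.

```python
def positions(tokens, word):
    """Indexes at which `word` occurs in `tokens`."""
    return [i for i, token in enumerate(tokens) if token == word]


def adjacent(tokens, first, second):
    """True if `second` immediately follows `first` (i.e. they form a bigram)."""
    return any(
        i + 1 < len(tokens) and tokens[i + 1] == second
        for i in positions(tokens, first)
    )


def cooccur(tokens, a, b, window=3):
    """True if `a` and `b` occur within `window` tokens of each other, in any order."""
    return any(
        0 < abs(i - j) <= window
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


tokens = "the office building was demolished yesterday".split()

# "office building" is both a bigram and a plausible collocation.
office_building_is_bigram = adjacent(tokens, "office", "building")

# "building ... demolished" is not a bigram, but the words still co-occur.
building_demolished_is_bigram = adjacent(tokens, "building", "demolished")
building_demolished_cooccur = cooccur(tokens, "building", "demolished")
```

Here the pair office + building passes the adjacency test, while building + demolished only passes the window test, which mirrors the distinction drawn above.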
The study of n-grams is important for machine translation, where frequent n-grams can be translated as chunks with word forms that reflect the surrounding items in the n-gram rather than as a sequence of isolated items. It is also important in language learning, where frequent n-grams can be learnt as chunks rather than constructed from the individual items each time the student needs to use them.
Generating a list of most frequent n-grams
First, choose a corpus and then click on Word List in the left menu. Here you can choose the attribute to be searched (Search attribute). The important thing is to tick “use n-grams” and set the value of n (the default is 2, the maximum is 6). Clicking the “Make Word List” button below then shows the n-grams according to the selected options.
Creating n-grams can take several tens of seconds (especially 5- or 6-grams in large corpora).
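Outside the interface, the idea behind such a list can be sketched in a few lines of Python. This is not how Sketch Engine computes its word lists; it is a minimal illustration that assumes the corpus is already split into sentences and tokens, and that n-grams do not cross sentence boundaries. The function name ngram_frequencies and the toy corpus are invented for the example.

```python
from collections import Counter


def ngram_frequencies(sentences, n=2):
    """Count every n-gram in a list of tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        # Windows are taken per sentence, so no n-gram spans two sentences.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus, already tokenised (an assumption for illustration).
corpus = [
    "at the end of the day".split(),
    "at the end of the week".split(),
    "by the end of the day".split(),
]

freq = ngram_frequencies(corpus, n=2)

# Print the three most frequent bigrams with their counts.
for gram, count in freq.most_common(3):
    print(" ".join(gram), count)
```

Every position in every sentence contributes one n-gram for each value of n, which gives a sense of why longer n-grams in large corpora take noticeably longer to generate.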
(1) The word list can be generated from the whole corpus or from a subcorpus only. Select the subcorpus here; you can also get information about the subcorpus or create a new one from text types.
(2) Select what you want to count: word forms, lemmas or something else. The list of options depends on how the corpus is annotated but will generally include these options:
attributes: word form, tag, lempos, lempos-lc, lemma, word form (lowercase), lemma-lc
word sketch: terms, collocations
text types: text types depend on the corpus selected and will be different for each corpus
(3) Tick this option to calculate frequencies of n-grams.
(4) When ticked, “at the end” will be grouped under “at the end of” because the 3-gram “at the end” is a sub-n-gram of the 4-gram “at the end of”.
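The grouping described in (4) can be approximated with the Python sketch below. It assumes that an n-gram is nested under a longer one whenever it appears inside it as a contiguous run; the tool itself may apply further criteria (for example frequency) that are not modelled here. The function names contains and nest_ngrams are hypothetical.

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def nest_ngrams(ngrams):
    """Group each n-gram under the longer top-level n-grams that contain it."""
    groups = {}
    # Top-level n-grams are those not contained in any longer n-gram.
    for gram in ngrams:
        has_parent = any(
            len(other) > len(gram) and contains(other, gram) for other in ngrams
        )
        if not has_parent:
            groups[gram] = []
    # Attach every remaining n-gram to each top-level n-gram that contains it.
    for gram in ngrams:
        for top in groups:
            if len(top) > len(gram) and contains(top, gram):
                groups[top].append(gram)
    return groups


grams = [
    ("at", "the", "end"),
    ("at", "the", "end", "of"),
    ("in", "the", "end"),
]
nested = nest_ngrams(grams)
```

With this input, the 3-gram at the end is listed under the 4-gram at the end of, while in the end stays at the top level because no longer n-gram in the list contains it.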
See the Word list page for detailed information on creating word lists.