Automatic POS annotation
Due to the size of modern corpora, the only viable option is automatic annotation. The tool that performs the tagging is called a POS tagger, or simply a tagger. Modern taggers can reach an accuracy of up to 98%, and their mistakes are typically limited to phenomena of lesser interest such as misspelt words, rare usage or interjections (e.g. yuppeeee might be tagged incorrectly). Ambiguity also poses a problem: in the sentence Time flies., it is difficult to tell whether it consists of noun + verb or verb + noun, the latter meaning Use a stopwatch to measure (the movement of) insects. :-) Despite these inaccuracies, modern tools annotate the vast majority of a corpus correctly, and the mistakes they make hardly ever cause problems when using the corpus.
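The Time flies. ambiguity can be made concrete with a toy lexicon that lists every tag a word may receive. This is only an illustrative sketch (the words, tags and the LEXICON lookup are invented for the example, not taken from any real tagger or tagset):

```python
from itertools import product

# Toy lexicon: each word maps to the set of tags it can take.
LEXICON = {
    "time": {"NOUN", "VERB"},   # "time" can also be a verb ("to time a race")
    "flies": {"VERB", "NOUN"},  # "flies" is a verb form or a plural noun
}

def readings(sentence):
    """Enumerate every tag combination the lexicon allows."""
    words = [w.lower().strip(".") for w in sentence.split()]
    options = [sorted(LEXICON.get(w, {"UNK"})) for w in words]
    return [list(zip(words, combo)) for combo in product(*options)]

for r in readings("Time flies."):
    print(r)  # four readings, including noun+verb and verb+noun
```

A tagger's job is to pick the single most plausible combination, which is exactly where context and training data come in.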
To develop an automatic POS tagger, a sample of manually annotated training data (at least 1 million words) is needed. The tagger uses it to “learn” how the language should be tagged, also taking the context of each word into account in order to assign the most appropriate POS tag. Automatic taggers can only be as good as their training data: if the training data contain errors or inconsistencies originating from low annotator agreement, data annotated by such a tagger will reflect the same problems.
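The “learning” step can be sketched in miniature: count how often each word receives each tag in the annotated data, then tag new text with each word's most frequent tag. This is a deliberately simplified sketch with invented toy data; real taggers train on millions of words and use far richer context than this:

```python
from collections import Counter, defaultdict

# Toy "training data": (word, tag) pairs standing in for a manually
# annotated corpus (a real one would have a million words or more).
TRAINING = [
    ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
    ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
    ("a", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
]

def train(pairs):
    """Count how often each word receives each tag."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return counts

def tag(model, words, default="NOUN"):
    """Pick each word's most frequent tag; unknown words get a default."""
    return [(w, model[w].most_common(1)[0][0] if w in model else default)
            for w in words]

model = train(TRAINING)
print(tag(model, ["the", "dog", "sleeps"]))
```

The sketch also shows why training-data quality matters: if the annotated pairs were inconsistent, the counts (and hence every tagging decision) would inherit that inconsistency.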
Taggers for different languages are often mutually unrelated tools, each using different approaches, algorithms, programming languages and configurations. Apart from these, there are also tools that can be trained to process more than one language: the core software stays the same, but a different language model is used for each language.
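The “one engine, many language models” design can be sketched as a shared tagging class that is instantiated with language-specific data. The class, the per-language lookup tables and the words in them are all invented for illustration; real language models are statistical and far more complex than a word list:

```python
# Shared core software: the tagging logic is language-independent.
class Tagger:
    def __init__(self, model):
        self.model = model  # language-specific data, swapped per language

    def tag(self, words, default="X"):
        return [(w, self.model.get(w.lower(), default)) for w in words]

# Two toy "language models" plugged into the same engine.
ENGLISH_MODEL = {"the": "DET", "dog": "NOUN"}
GERMAN_MODEL = {"der": "DET", "hund": "NOUN"}

en = Tagger(ENGLISH_MODEL)   # same software ...
de = Tagger(GERMAN_MODEL)    # ... different language model
print(en.tag(["the", "dog"]))
print(de.tag(["der", "Hund"]))
```

Keeping the engine and the model separate means adding a new language requires only new training data, not new software.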