NCI: New Corpus for Ireland
The New Corpus for Ireland (NCI) is a language corpus developed as part of the set-up phase of a project for a new English-to-Irish Dictionary (NEID). The project is under the direction of Foras na Gaeilge, a public body responsible for the promotion of the Irish language.
The corpus was collected in three main ways:
- incorporating existing corpora
- contacting publishers, authors, newspaper companies etc. to request permission to use their texts
- collecting data from the web.
In Sketch Engine, the project is composed of two separate corpora:
- 30-million-word corpus of Irish
- 200-million-word corpus of English, including Hiberno-English (the variety of English spoken in Ireland)
The project page is available at http://focloir.sketchengine.co.uk/run.cgi/index
The Irish part of the NCI was processed by the morphological analyzer/generator for Irish (Uí Dhonnchadha) with the following POS tagset. The English part of the NCI was tagged by TreeTagger using the Penn Treebank tagset.