NCI: New Corpus for Ireland
The New Corpus for Ireland (NCI) is a language corpus developed as part of the set-up phase of a project for a new English-to-Irish Dictionary (NEID). The project is under the direction of Foras na Gaeilge, a public body responsible for the promotion of the Irish language.
The corpus was collected in three main ways:
- incorporating existing corpora
- contacting publishers, authors, newspaper companies etc. to request permission to use their texts
- collecting data from the web.
In Sketch Engine, the project is composed of two separate corpora:
- 30-million corpus of Irish
- 200-million corpus of English including Hiberno-English (the variety of English that is spoken in Ireland)
The project page is available at http://focloir.sketchengine.co.uk/run.cgi/index
The Irish part of the NCI was processed by the morphological analyzer/generator for Irish (Uí Dhonnchadha), which assigns tags from its own POS tagset. The English part of the NCI was tagged by TreeTagger using the Penn Treebank tagset.
Tools to work with the New Corpus for Ireland
A complete set of Sketch Engine tools is available for working with the NCI corpus. They can be used to generate:
- word sketch – English and Irish collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of English and Irish nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
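To make the last two tools more concrete, the following is a minimal Python sketch of what an n-gram frequency list and a keyword-in-context concordance compute. It is only an illustration on an invented sample sentence: the function names (`tokenize`, `ngram_frequencies`, `concordance`) are hypothetical, it does not use or reproduce Sketch Engine's implementation, and it does not access the NCI itself.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and return its word tokens (Unicode letters only)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def ngram_frequencies(tokens, n=2):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Invented sample text, not taken from the corpus.
sample = "The weather is grand today, and the weather was grand yesterday."
tokens = tokenize(sample)

bigrams = ngram_frequencies(tokens, n=2)
kwic = concordance(tokens, "weather", window=2)

for gram, count in bigrams.most_common(3):
    print(count, gram)
for left, keyword, right in kwic:
    print(f"{left:>10} [{keyword}] {right}")
```

On a real corpus the same idea is applied to hundreds of millions of tokens, which is why a corpus tool relies on precomputed indexes rather than a linear scan like the one sketched here.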
Kilgarriff, Adam, Michael Rundell, and Elaine Uí Dhonnchadha. "Efficient corpus development for lexicography: building the New Corpus for Ireland." Language Resources and Evaluation 40.2 (2006): 127-152.