The aim of these instructions is to ensure that, for every corpus, it is obvious what its data sources are, what its configuration is and which compiled indices belong to it. Together these guarantee the integrity and reproducibility of the results obtained.
The preferred directory hierarchy is as follows:
- one directory containing three subdirectories for corpus vertical files, corpus configuration (registry) files and corpus compiled data, e.g.:
    /corpora/
    /corpora/vert/
    /corpora/registry/
    /corpora/manatee/
- each corpus should have a registry file in the registry directory and two directories with exactly the same name as that file, one in the directory for vertical files and one in the directory for compiled data, e.g.:
    /corpora/registry/mycorpus   (configuration)
    /corpora/vert/mycorpus/      (source data and vertical files)
    /corpora/manatee/mycorpus/   (compiled data)
- each of these directories should contain only the corresponding files, nothing else
- for each corpus, the configuration file should give the full path to the source vertical file in the VERTICAL directive and the path to the corpus data directory in the PATH directive, so that running encodevert -c <CORPUS> at any moment reproduces the same data.
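
The layout rules above can be checked mechanically. The following is a minimal sketch, not part of the corpus tools: the function names read_directives and check_corpus are hypothetical, and it assumes that each directive sits on its own line as a name followed by a value, optionally in double quotes, and that VERTICAL holds a plain file path rather than a command.

```python
import re
from pathlib import Path

# Assumed line shape: DIRECTIVE value, with the value optionally in double quotes.
_DIRECTIVE = re.compile(r'^\s*(PATH|VERTICAL)\s+"?([^"\n]+?)"?\s*$')


def read_directives(registry_text):
    """Return the first PATH and VERTICAL values found in a registry file's text."""
    found = {}
    for line in registry_text.splitlines():
        match = _DIRECTIVE.match(line)
        if match and match.group(1) not in found:
            found[match.group(1)] = match.group(2)
    return found


def check_corpus(root, name):
    """Return a list of layout problems for corpus `name` under `root` (empty if none)."""
    root = Path(root)
    registry = root / "registry" / name
    vert_dir = root / "vert" / name
    data_dir = root / "manatee" / name

    if not registry.is_file():
        return [f"missing registry file: {registry}"]

    problems = []
    for directory in (vert_dir, data_dir):
        if not directory.is_dir():
            problems.append(f"missing directory: {directory}")

    directives = read_directives(registry.read_text(encoding="utf-8"))

    # PATH must point at the compiled data directory of this corpus.
    path = directives.get("PATH")
    if path is None:
        problems.append("no PATH directive")
    elif Path(path) != data_dir:
        problems.append(f"PATH is {path}, expected {data_dir}")

    # VERTICAL must be a full path to a file inside the corpus' vertical directory.
    vertical = directives.get("VERTICAL")
    if vertical is None:
        problems.append("no VERTICAL directive")
    elif not Path(vertical).is_absolute() or Path(vertical).parent != vert_dir:
        problems.append(f"VERTICAL is {vertical}, expected a full path inside {vert_dir}")

    return problems
```

Calling check_corpus("/corpora", "mycorpus") returns an empty list when the three locations exist and both directives point where the rules above say they should. The sketch does not check the rule that each directory contains only the corresponding files.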