Size and balance
There is no minimum corpus size, but to produce usable results the corpus should contain a reasonable amount of data for each period. At present, our smallest corpus with trends is the Anthology Reference Corpus with 38 million words. A corpus of 1 million words is unlikely to produce usable results.
The balance of periods is also important. If most texts fall into only a few of the periods covered, the results will be biased towards those periods.
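One informal way to check balance before enabling trends is to compute each period's share of the total word count. The sketch below is only an illustration: the function name, the input format (pairs of period and word count) and the sample figures are hypothetical and are not part of the tool itself.

```python
from collections import Counter


def period_shares(docs):
    """Return {period: share of all words} for (period, word_count) pairs.

    `docs` is a hypothetical input format chosen for this sketch.
    """
    totals = Counter()
    for period, word_count in docs:
        totals[period] += word_count

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}

    # Sort by period so the output reads chronologically.
    return {period: count / grand_total for period, count in sorted(totals.items())}


# Made-up figures: one period holds most of the data.
docs = [("2001", 900_000), ("2002", 50_000), ("2003", 50_000)]
shares = period_shares(docs)

for period, share in shares.items():
    print(f"{period}: {share:.1%}")
```

A distribution in which one or two periods dominate suggests that the trends will mostly reflect those periods.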
Configuring a corpus for use with Trends
Data must be annotated with timestamps. The timestamp attribute is user-defined (e.g. pub_date). Most corpora have only one timestamp attribute, but the same corpus can contain several, and the user can select which timestamp should be used for the computation. Not every document needs a timestamp, but trends will be computed only from documents that have one.
The beginning of the configuration file must contain the definition of the timestamp attribute, for example:
The number of attribute values should not exceed 500. All values must have the same number of characters, and the units must be ordered from the longest time period (e.g. year) to the shortest (e.g. day). Non-numerical characters will not produce an error but will be ignored.
Examples of valid values
All values within the same corpus must have the same format.
2004Mar14 – the month will be lost because non-numerical characters are ignored
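The rules for timestamp values can be checked before the corpus is compiled. The following is a minimal sketch, not the tool's own validation: the function names are hypothetical, the limit of 500 is assumed to apply to distinct values, and "ignored" is assumed to mean that non-numerical characters are simply dropped.

```python
import re

MAX_VALUES = 500  # limit stated in the text; assumed to mean distinct values


def normalise(value):
    """Drop non-numerical characters (assumption: this is what 'ignored' means)."""
    return re.sub(r"\D", "", value)


def check_timestamps(values):
    """Return a list of problems found in an iterable of timestamp strings."""
    problems = []
    distinct = set(values)

    if len(distinct) > MAX_VALUES:
        problems.append(f"{len(distinct)} distinct values (limit {MAX_VALUES})")

    # All values must have the same number of characters.
    lengths = {len(v) for v in distinct}
    if len(lengths) > 1:
        problems.append(f"values differ in length: {sorted(lengths)}")

    return problems


print(normalise("2004Mar14"))                # the month name is dropped
print(check_timestamps(["2004", "2005"]))    # consistent format
print(check_timestamps(["2004", "200503"]))  # mixed lengths are reported
```

Under these assumptions, 2004Mar14 is reduced to 200414, which is why the month is lost.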
If you have problems setting things up, please ask for assistance.