The Timestamped web corpus is an English diachronic web corpus whose first version was compiled and completed in early 2013. It comprises content posted on particular feeds that were discovered during 2012–2013 using the technique described in Feed Corpus: An Ever Growing Up-To-Date Corpus (Minocha et al., 2014). The content was then downloaded from the internet using smart crawling techniques. The documents in the corpus contain the following meta fields.
A list of attributes in the corpus
- “meta” – Contains meta information, such as the headings, of the feed from which the content was taken.
- “tld” – Contains the top-level domain of the content URL.
- “quarter” – Contains the quarter in which the content was posted, e.g. 2012q1 means the first quarter of 2012.
- “month” – Contains the month along with the year in which the content was posted, e.g. 2013-01 refers to January 2013.
- “content_url” – Contains the URL from which the content was downloaded.
- “time” – Contains the timestamp of when the content URL was posted on the feed.
- “feed_source_url” – Contains the URL of the source feed where the content URL was posted.
- “domain” – Contains the domain to which the content_url belongs.
- “year” – Contains the year in which the content was posted, e.g. 2012 or 2013.
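
How the date and domain attributes listed above relate to one another can be sketched in Python. This is only an illustration under stated assumptions, not the pipeline used to build the corpus: the function name derive_meta is hypothetical, the posting time is assumed to be available as a datetime object (the stored format of “time” is not specified here), the top-level domain is taken naively as the last label of the host name, and the full host name is used for “domain” because it is not stated whether a leading “www.” is kept.

```python
from datetime import datetime
from urllib.parse import urlparse


def derive_meta(content_url: str, posted: datetime) -> dict:
    """Derive the date and domain attributes for one document (hypothetical helper)."""
    # Assumption: "domain" is the full host name of the content URL.
    host = urlparse(content_url).hostname or ""
    # Assumption: "tld" is simply the last dot-separated label of the host.
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
    quarter = (posted.month - 1) // 3 + 1
    return {
        "content_url": content_url,
        "domain": host,
        "tld": tld,
        "year": f"{posted.year:04d}",
        "month": f"{posted.year:04d}-{posted.month:02d}",
        "quarter": f"{posted.year:04d}q{quarter}",
    }


# Placeholder URL and timestamp, not taken from the corpus.
example = derive_meta("http://www.example.com/news/item.html",
                      datetime(2013, 1, 15, 9, 30))
print(example["month"], example["quarter"], example["tld"])
```

With the placeholder input above, the “month” value is 2013-01 and the “quarter” value is 2013q1, matching the formats described in the attribute list.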
To support searches by lemma and part of speech, the corpus has been annotated with lemmas and PoS tags using TreeTagger; see the tagset documentation.
Minocha, Akshay, Siva Reddy, and Adam Kilgarriff (2014). Feed Corpus: An Ever Growing Up-To-Date Corpus. In Proceedings of the Eighth Web as Corpus Workshop (WAC-8), ACL SIGWAC.