frWaC: French corpus from the .fr domain
The frWaC corpus is a French text corpus collected from the .fr domain, using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus consists of French websites with a total size of 1.3 billion words.
The corpus texts were POS-tagged with TreeTagger using its French tagset.
Tools to work with the French web corpus
A complete set of Sketch Engine tools is available to work with this French frWaC corpus to generate:
- word sketch – French collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of French nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- keywords – terminology extraction of one-word units
- text type analysis – statistics of metadata in the corpus
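To make two of the tools listed above more concrete, here is a minimal Python sketch of what an n-gram frequency list and a concordance compute. It is an illustration only and not Sketch Engine's implementation: the function names `ngram_frequencies` and `concordance`, the `width` parameter, and the toy sentence are all assumptions made for this example, and the sample text is not taken from frWaC.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit (hypothetical helper)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy sample for illustration; a real corpus would be tokenized and lemmatized first.
tokens = "le chat dort et le chat mange".split()

bigrams = ngram_frequencies(tokens, 2)
print(bigrams.most_common(1))

for left, kw, right in concordance(tokens, "chat", width=2):
    print(f"{left:>12} [{kw}] {right}")
```

On a corpus of this size, such counts are normally read from precomputed indexes rather than recomputed per query, so the sketch only shows the shape of the results.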
version 1.1 (2012/04/13)
- retagged with UTF-8 TreeTagger models to fix lemmatization
- improved sentence segmentation
- POS tagged and lemmatized with the TreeTagger tool
- 100-million-word corpus
- gathered using a list of URLs provided by Serge Sharoff (the University of Leeds) as described in A Corpus Factory for Many Languages
BARONI, Marco, et al. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation, 2009, 43.3: 209-226.
Use Sketch Engine in minutes
Generating collocations, frequency lists, examples in context, and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.