The New Model Corpus is a ~100-million-word domain corpus built from web data in 2008. For more information, see the attachments below.
Text types
Genres
| Genre | # documents |
|---|---|
| blog | 13,957 |
| news | 12,388 |
| general | 10,216 |
| business | 1,433 |
| speech (subtitles) | 1,088 |
| medical | 516 |
| law | 451 |
| fiction | 123 |
Web top level domains
| TLD | # documents |
|---|---|
| com | 15,954 |
| uk | 12,077 |
| org | 2,852 |
| net | 944 |
| edu | 379 |
| gov | 237 |
| ca | 154 |
| us | 104 |
| au | 94 |
| ie | 92 |
| info | 30 |
| other | 116 |
| unknown | 7,139 |
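As a quick consistency check, the two tables above can be totalled and turned into percentage shares. The sketch below is only illustrative: the dictionary names and the `shares` helper are hypothetical, and the figures are copied from the tables rather than read from the corpus itself.

```python
# Document counts copied from the "Genres" table above.
genre_counts = {
    "blog": 13957,
    "news": 12388,
    "general": 10216,
    "business": 1433,
    "speech (subtitles)": 1088,
    "medical": 516,
    "law": 451,
    "fiction": 123,
}

# Document counts copied from the "Web top level domains" table above.
tld_counts = {
    "com": 15954,
    "uk": 12077,
    "org": 2852,
    "net": 944,
    "edu": 379,
    "gov": 237,
    "ca": 154,
    "us": 104,
    "au": 94,
    "ie": 92,
    "info": 30,
    "other": 116,
    "unknown": 7139,
}


def shares(counts):
    """Return each category's share of the total, in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


genre_total = sum(genre_counts.values())
tld_total = sum(tld_counts.values())

if __name__ == "__main__":
    print("documents by genre:", genre_total)
    print("documents by TLD:  ", tld_total)
    for name, pct in shares(genre_counts).items():
        print(f"{name:20s} {pct:5.1f}%")
```

Both tables sum to 40,172 documents, which suggests that each document is assigned exactly one genre and one top-level domain (with "unknown" covering documents whose domain could not be determined).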
Attachments
Further information about the New Model Corpus.




