The length of the useful top is heavily influenced by the size of the corpus and by the number of occurrences of the words in the corpus. A multi-billion-word corpus will easily contain millions of occurrences of the most frequent words, which will produce thesaurus entries of dozens of relevant synonyms. A 100-million-word corpus will only produce a handful of relevant words for the most frequent keywords, and thesauri for less frequent words will produce hardly any useful results. A 1-million-word corpus is not likely to produce any usable thesaurus.
Corpora and speed
Sketch Engine focusses on producing the largest possible corpora, with a target size of billions of words, so that the computations produce high-quality results. Generating a thesaurus takes only a couple of seconds because the system works with precalculated data. Precalculation was the only way to ensure that processing 3 million occurrences of beautiful and 2 million occurrences of amazing takes only a second or two.
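The idea of working with precalculated data can be illustrated with a minimal sketch. It is not Sketch Engine's implementation: the window-based co-occurrence counts, the cosine similarity, the toy token list and all names (build_context_vectors, precompute_thesaurus, lookup) are assumptions made purely for illustration. The point is only that the expensive comparison of all words happens once, ahead of time, so that a later query is a simple table lookup.

```python
from collections import Counter, defaultdict
from math import sqrt


def build_context_vectors(tokens, window=2):
    """Count, for each word, the words found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def precompute_thesaurus(tokens, top_n=3):
    """Expensive step, run once: score every word against every other word."""
    vectors = build_context_vectors(tokens)
    table = {}
    for word, vec in vectors.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in vectors.items()
            if other != word
        ]
        # Highest similarity first; ties broken alphabetically for stable output.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        table[word] = scored[:top_n]
    return table


# Hypothetical toy corpus; a real corpus would hold billions of tokens.
TOY_TOKENS = (
    "the beautiful day the amazing day the beautiful view the amazing view".split()
)

# Built ahead of time, before any user query arrives.
THESAURUS = precompute_thesaurus(TOY_TOKENS)


def lookup(word):
    """Cheap step, run per query: a dictionary access, no corpus scan."""
    return THESAURUS.get(word, [])
```

Calling lookup("beautiful") returns the stored list of (word, score) pairs without touching the token list again, which is why query time does not grow with the number of occurrences in the corpus.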
Your own automatic thesaurus
Sketch Engine features an automatic corpus building tool that will convert any uploaded text into a corpus with a thesaurus and word sketches (for languages where these features are supported). It will even find relevant texts on the internet and include them in the corpus. No technical skills are required. Users can thus work with a thesaurus based on their own data. It is important to bear in mind, however, that the quality is heavily dependent on the corpus size, and there is little chance of generating a quality thesaurus from a corpus smaller than 100 million words. The use of the multi-billion-word corpora in Sketch Engine is recommended instead.
Sketch Engine currently hosts 400+ text corpora in 90+ languages. Of these, 63 languages have at least one corpus where word sketches and the thesaurus are supported, and it is an ongoing project to bring this functionality to even more languages.