Users with a local installation of Sketch Engine can run the following commands on Linux.
Overview of all command line tools
| addsatfiles | dumpstructrng | lscbgr | mkhatlex | ocd-mkhwds-plain |
| | dumpthes | lsclex | mkhatsort | ocd-mkhwds-terms |
| biterms | dumpwmap | lscngr | mkisrt | ocd-mkthes |
| calctrends | dumpwmrev | lsfsa | mklcm | ocd-mkwsi |
| compilecorp | dumpws | lsfsa_intersect | mklex | par2tokens |
| concinfo | encodevert | lsfsa_left_intersect | mknormattr | parencodevert |
| corpconfcheck | extrms | lskw | mknorms | parmkdynattr |
| corpdatacheck | filterquery2attr | lslex | mkregexattr | parse2wmap |
| corpcheck | filterwm | lslexarf | mksizes | parws |
| corpinfo | freqs | lsslex | mkstats | registry_edit |
| corpquery | genbgr | lswl | mksubc | sconll2sketch |
| corpus4fsa | genfreq | manateesrv | mkthes | sconll2wmap |
| decodevert | genhist | maplexrev | mktrends | setupbonito |
| devirt | genngr | mkalign | mkvirt | ske |
| dumpalign | genterms | mkbgr | mkwc | sortuniq |
| dumpattrrev | genws | mkbidict | mkwmap | sortws |
| dumpattrtext | hashws | mkdrev | mkwmrank | terms2fsa |
| dumpbits | lex2fsa | mkdtext | ngr2fsa | tokens2dict |
| dumpdrev | lexonomyCreateEntries | mkdynattr | ngrsave | vertfork |
| dumpdtext | lexonomyMakeDict | mkfsa | ocd-mkcoll | virtws |
| dumpfsa | lsalsize | mkfsalex | ocd-mkdefs | wm2terms |
| dumplevel | lsbgr | mkhatfsa | ocd-mkdict | wm2thes |
| | | | ocd-mkgdex | ws2fsa |
Command line tools for n-grams
There are a number of utilities available in Finlib/Manatee that make it easy to efficiently generate and store n-grams from corpora. The utilities can be divided into three groups depending on their features:
Generating bigrams from a compiled corpus (genbgr, mkbgr, lsbgr, lscbgr)
Features:
- bigram generation, storing and viewing from a compiled corpus
- no corpus size limit
Usage:
The genbgr and mkbgr tools are used for generating and storing bigrams, respectively:
genbgr CORPUS ATTR MINFREQ | mkbgr BGRFILE
where CORPUS is the registry name/path of the corpus, ATTR is the attribute that should be used for generating the bigrams, MINFREQ is the minimum frequency of a bigram and BGRFILE is the prefix for the bigram files, usually ATTR.bgr.
For viewing of stored bigrams, use the lsbgr tool:
lsbgr BGRFILE [FIRST_ID]
where BGRFILE is the same path as given above and the optional FIRST_ID argument selects the first bigram ID that will be shown (otherwise all bigrams are listed).
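Conceptually, this pipeline counts adjacent pairs of attribute IDs and keeps those above the frequency threshold. A minimal Python sketch of the idea (illustrative only, not the actual genbgr/mkbgr implementation, which streams and stores the data in binary files):

```python
from collections import Counter

def count_bigrams(ids, min_freq):
    """Count adjacent token-ID pairs, keep those at or above min_freq,
    and return them sorted -- the same three columns (id1, id2, freq)
    that lsbgr prints."""
    counts = Counter(zip(ids, ids[1:]))
    return [(a, b, f) for (a, b), f in sorted(counts.items()) if f >= min_freq]

# Toy token-ID stream, e.g. "the cat sat on the cat"
print(count_bigrams([0, 1, 2, 3, 0, 1], 1))
# → [(0, 1, 2), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
```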
Example:
>genbgr susanne word 1 | mkbgr word.bgr
mkbgr word.bgr[1]: stream sorted, #parts: 1
mkbgr word.bgr[2]: temporary files renamed
>ls | grep word.bgr
word.bgr.cnt
word.bgr.idx
>lsbgr word.bgr | head -10
0 1 1
0 14 1
0 16 2
0 23 3
0 25 6
0 33 2
0 40 2
0 49 1
0 52 1
0 66 3
The three columns are the attribute IDs of the two tokens forming the bigram and the frequency of this bigram. To convert an attribute ID into the corresponding string, use the lsclex tool:
>echo -e '14\n1' | lsclex -n susanne word
14 election
1 Fulton
The lscbgr tool prints bigram strings directly and offers more options:
lscbgr
Lists corpus bigrams
usage: lscbgr [OPTIONS] CORPUS_NAME [FIRST_ID]
-p ATTR_NAME corpus positional attribute [default word]
-n BGR_FILE_PATH path to data files
[default CORPPATH/ATTR_NAME.bgr]
-f lists frequencies of both tokens
-s t|mi|mi3|ll|ms|d compute statistics:
t T score
mi MI score
mi3 MI^3 score
ll log likelihood
ms minimum sensitivity
d logDice
Example:
>lscbgr -f -n word.bgr susanne | head
The Fulton 1074 14 1
The election 1074 36 1
The " 1074 2311 2
The place 1074 73 3
The jury 1074 27 6
The City 1074 29 2
The charge 1074 17 2
The September 1074 4 1
The charged 1074 18 1
The Mayor 1074 19 3
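The statistics selectable with -s correspond to standard association scores from the collocation literature. A Python sketch using the textbook definitions (the exact formulas used by lscbgr may differ in detail; f is the bigram frequency, f1/f2 the token frequencies, n the corpus size):

```python
import math

def t_score(f, f1, f2, n):
    # T score: (observed - expected) / sqrt(observed)
    return (f - f1 * f2 / n) / math.sqrt(f)

def mi(f, f1, f2, n):
    # MI score: log2 of observed over expected co-occurrence
    return math.log2(f * n / (f1 * f2))

def mi3(f, f1, f2, n):
    # MI^3: MI with the joint frequency cubed
    return math.log2(f ** 3 * n / (f1 * f2))

def min_sensitivity(f, f1, f2):
    # minimum sensitivity: smaller of the two conditional probabilities
    return min(f / f1, f / f2)

def log_dice(f, f1, f2):
    # logDice = 14 + log2(2 * f / (f1 + f2))
    return 14 + math.log2(2 * f / (f1 + f2))

# "The Fulton" from the example above: f=1, f1=1074, f2=14
print(round(log_dice(1, 1074, 14), 2))  # → 4.91
```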
Generating n-grams from a compiled corpus (genngr, lscngr)
Features:
- concurrent n-gram generation (for any n), storing and viewing from a compiled corpus
- corpus size up to 2 billion tokens (larger corpora may be processed, but only the first 2 billion tokens will be used)
Usage:
The genngr tool is used for generating and storing n-grams, the lscngr tool for viewing them:
genngr CORPUS ATTR MINFREQ NGRFILE
The parameters of genngr have the same semantics as those of genbgr/mkbgr above; the prefix path is usually ATTR.ngr.
lscngr [OPTIONS] CORPUS_NAME
Options can be set as follows:
-p ATTR_NAME corpus positional attribute (default: word)
-n NGR_FILE_PATH n-grams data file path
-f lists frequencies
-d STRUCT.ATTR print STRUCT duplicates according to ATTR
-m MIN_NGRAM minimum n-gram size (default: 3)
Example:
>genngr susanne word 1 word.ngr
Preparing text
Creating suffix array
Creating LCP array
Saving LDIs
>ls | grep word.ngr
word.ngr.freq
word.ngr.lex
word.ngr.lex.idx
word.ngr.mm
word.ngr.rev
word.ngr.rev.cnt
word.ngr.rev.cnt64
word.ngr.rev.idx
>lscngr -f -n word.ngr susanne | head -10
2 3,4 The jury said | it 2 3 7
2 2,3 The grand | jury 2 6 9
2 3,3 The other , 8 7 195
3 3,3 The fact that 5 27 53
2 3,3 The fact is 5 2 53
2 2,3 The purpose | of 2 7 18
2 3,3 The man was 5 6 169
2 4,4 The Charles Men , 5 2 5
5 2,3 The Charles | Men 5 5 25
2 3,3 The New York 3 24 69
The semantics of the columns in the output listed above are as follows:
- n-gram frequency
- minimum, maximum length of the n-gram
- the first 20 tokens of the n-gram; a vertical bar (“|”) is printed after the word at the minimum-length position
The following columns are listed only with the -f option. Given an n-gram as the concatenation of strings x y z (x and z being single tokens, y the middle part):
- frequency of the x y (n-1)-gram
- frequency of the y z (n-1)-gram
- frequency of the y (n-2)-gram
If the optional -d STRUCT.ATTR option is given, a list of these structure attributes is printed in addition to the above output, saying which structures share a common n-gram (n being 40 by default; it can be set to a larger value using -m).
E.g.
lscngr -m 100 -f -d bncdoc.id bnc2
prints
>646#624>HHM HHK
at the end, saying that documents 646 and 624 (with IDs “HHM” and “HHK”) share a common 100-gram.
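The duplicate test behind -d can be sketched in a few lines of Python: two documents are reported together when their n-gram sets intersect. This is a conceptual illustration only, not lscngr's implementation (which works over the compiled suffix-array data):

```python
def ngram_set(tokens, n):
    """All n-grams of length n in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def share_ngram(doc_a, doc_b, n):
    """True if the two documents share at least one common n-gram --
    the condition lscngr -d reports for structure pairs."""
    return bool(ngram_set(doc_a, n) & ngram_set(doc_b, n))

a = "to be or not to be that is the question".split()
b = "whether to be or not to be is unclear".split()
print(share_ngram(a, b, 5))  # → True ("to be or not to" occurs in both)
```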
Generating n-grams from a vertical file (ngrsave)
Features:
- concurrent n-gram generation (for any n up to the given maximum) from a vertical file
- direct storing in a text file
- no corpus size limit
Usage:
The ngrsave utility generates the n-grams from a vertical file and stores them in a single text file:
usage: ngrsave VERT_FILE SAVE_FILE STOPLIST_FILE [DOC_SEPARATOR NGRAM_SIZE IGNORE_PUNC]
or
ngrsave -c CORPUS ATTR SAVE_FILE STOPLIST_FILE [DOC_STRUCTURE NGRAM_SIZE IGNORE_PUNC]
Prints all n-grams that occurred at least twice in the input VERT_FILE
STOPLIST_FILE textfile with one stopword per line, n-grams will not contain any stopwords
(use - as STOPLIST_FILE for omitting it)
VERT_FILE input vertical file to be processed, use - for standard input
CORPUS corpus registry filename
ATTR attribute name
SAVE_FILE textfile where the output will be written
DOC_SEPARATOR line prefix, e.g. '
Example:
>cut -f1 susanne.vert | ngrsave - susanne.ngrsave -
>head susanne.ngrsave.out
that there be a line through P which meets g 2 130 130
the case in which g is a curve on a 2 130 130
was stored at ° in a tube equipped with a 2 123 123
be a line through P which meets g in points 2 130 130
at ° in a tube equipped with a break seal 2 123 123
there be a line through P which meets g in 2 130 130
He handed the bayonet to Dean and kept the pistol 2 136 136
were allowed to stand at room temperature for 1 hr 2 126 126
case in which g is a curve on a quadric 2 130 130
requires that there be a line through P which meets 2 130 130
The output contains all n-grams that occurred at least twice.
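The core idea, collecting n-grams of a range of sizes, skipping those containing stopwords, and keeping those occurring at least twice, can be sketched in Python (an in-memory toy version; ngrsave itself processes vertical files of unlimited size):

```python
from collections import Counter

def ngrams_at_least_twice(tokens, stopwords, min_n, max_n):
    """Count all n-grams of length min_n..max_n that contain no stopword
    and return those occurring at least twice, as ngrsave does conceptually."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if not any(w in stopwords for w in gram):
                counts[gram] += 1
    return {" ".join(g): f for g, f in counts.items() if f >= 2}

print(ngrams_at_least_twice("a b c a b c".split(), set(), 2, 3))
# → {'a b': 2, 'b c': 2, 'a b c': 2}
```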
Selected command line tools in more detail:
corpinfo
Prints basic information about a given corpus.
Usage: corpinfo [OPTIONS] CORPNAME
-d dump whole configuration
-p print corpus directory path
-s print corpus size
-w print corpus wordform counts
-g OPT print configuration value of option OPT
corpquery
Prints a concordance for a given query
Usage: corpquery CORPUSNAME QUERY [ OPTIONS ]
Options:
-r ATTR reference attribute
(default: None)
-c LEFT,RIGHT | BOTH left and right or both context length
(default: 15)
-h LIMIT maximum number of results
(default: -1)
-a ATTR1,ATTR2,... comma separated list of attributes to be shown
(default: word,lemma,tag)
-s STR1,STR2... comma separated list of structures to be shown
(use struct.attr or struct.* to show structure attributes; default: s,p,doc)
-g GDEX_CONF use GDEX with a given GDEX_CONF configuration file
(default: None; use - for default configuration)
use -h to set the result size (default: 100)
-m GDEX_MODULE_DIR GDEX module path (directory with gdex.py or gdex_old.py)
lsclex
Lists the lexicon of a given corpus attribute
usage: lsclex [-snf] CORPUS ATTR
-s str2id -- strings from stdin translate to IDs
-n id2str -- IDs from stdin translate to strings
-f print frequencies
lsslex
Lists the number of tokens for all structure attribute values
usage: lsslex CORPNAME STRUCTNAME STRUCTATTR
example: lsslex bnc bncdoc alltyp
freqs
Prints frequencies of words in a given context of a given query
usage: freqs CORPUSNAME 'QUERY' 'CONTEXT' LIMIT
the default CONTEXT is 'word -1'; the default LIMIT is 1
examples: freqs susanne '[lemma="house"]' 'word -1'
freqs susanne '[lemma="run"]' 'word/i 0 tag 0 lemma 1' 2
freqs susanne '[lemma="test"] []? [tag="NN.*"]' 'word/i -1>0' 0
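The default 'word -1' context means: for every query hit, look at the token one position to the left and count how often each value occurs. A Python sketch of that counting step (a toy illustration over an in-memory token list, not the freqs implementation):

```python
from collections import Counter

def context_freqs(tokens, match_positions, offset=-1):
    """Count the token at a fixed offset from each match position --
    the idea behind freqs' default 'word -1' context."""
    return Counter(tokens[p + offset] for p in match_positions
                   if 0 <= p + offset < len(tokens))

toks = "the house and a house near the house".split()
hits = [i for i, w in enumerate(toks) if w == "house"]
print(context_freqs(toks, hits).most_common())  # → [('the', 2), ('a', 1)]
```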
corpcheck
Checks the validity of various corpus attributes and the correctness of compiled corpus data. Any issues found are reported in a clear, human-readable format on standard error.
Usage: corpcheck CORPNAME




