forked from turian/common-scripts
kheremos/common-scripts
local/ - Files of local interest, e.g. with fixed hostnames
README.txt - This file
all-xml-to-json.sh - For every XML file in the command-line, convert it to JSON.
boilerpipe-stdin-urls-to-mongo.py - Run every sys.stdin URL through Boilerpipe (or diffbot), and store in a MongoDB.
citeseer-get.pl - Fetch PDFs from citeseer.
cumulative.py - Output a cumulative sum for each line in the input file.
delexicalize-low-frequency-words.py - Delexicalize all words with freq less than minfreq to *UNKNOWN*
dumpdb.py - Dump the MongoDB
enscript-landscape-all.pl - Enscript all files listed in @ARGV in landscape mode.
filter-json.py - Filter JSON in sys.stdin to find only docs that match each regex with at least one field value.
from-one-line-per-word-to-one-line-per-sentence.py - Read one-line-per-word and convert to one-line-per-sentence.
grep-json.py - Filter JSON in sys.stdin to find only docs that match each regex against raw JSON.
grep-json-by-field.py - Filter JSON in sys.stdin to find only docs that match each regex with at least one field value.
join-json.py - For each JSON file in sys.argv, join them and output to stdout.
lines-with-funny-characters.pl - Print lines with funny characters
lines-with-no-funny-characters.pl - Print lines without funny characters
load-directory-of-textfiles-into-mongodb.py - For all files recursively in a subdir, load them into a MongoDB with a certain field name.
load-json-into-mongodb.py - Load JSON from stdin into a MongoDB
htmldecode.pl - Decode HTML entities, e.g. &lt; becomes <
htmlencode.pl - Encode HTML entities, e.g. < becomes &lt;
html2text - Convert HTML to text
mongodb-count.py - Count the number of entries in a mongodb collection.
mongodb-field-lengths.py - Print MongoDB field length and field, for every row.
mongodb-remove-field.py - Remove every occurrence of some field, for every row, in MongoDB.
mongodb-remove-short-fields.py - Remove every occurrence of some field if it is shorter than some length, for every row, in MongoDB.
mongodb-to-lucene.py - Read all mongo docs, and insert them into Lucene.
one-sentence-per-line-to-json.py - For line in stdin, convert it to a JSON dict with key: "content" and value: line.
page-count.pl - For each file (usually .ps or .pdf) specified in stdin, count the number of pages in the file
print-all.pl - For each file (.ps or .pdf) specified as a command-line argument, print the file to a random printer.
ptb/one-sentence-per-line.pl - Output one PTB sentence per line, using PTB tagged/ files.
read-xml-mysqldump.py - Read in the XML mysqldump from sys.stdin.
remove-funny-characters.pl - Remove any funny character
remove-nonascii-characters.pl - Remove non-ASCII characters
remove-non-utf10-characters.pl - Remove non-UTF 1.0 characters
remove-non-utf11-characters.pl - Remove non-UTF 1.1 characters
sample.pl - Sample and print only a certain percentage of input lines.
shuffle/shuffle.sh - Shuffle lines of stdin
sort-curves.py - Sort gnuplot curves
tokenizer.sed - Penn Treebank tokenizer.
tokenize-English.pl - Word Tokenizer for English by Al-Onaizan and Melamed.
tsv-to-json.py - Read TSV from stdin and output as JSON.
unichars - List characters for one or more properties (by Tom Christiansen)
untokenize - Detokenize Penn Treebank formatted text.
vowpal-to-libsvm.py - Convert a vowpal-wabbit file in stdin to libsvm.
words-integers-mapfile.py - Create an integer mapfile for the words in textfile.
words-to-integers.py - Convert words to integers, according to the mapping in mapfile.
xmlmysqldump.py - Read in the XML mysqldump from sys.stdin.
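
Most of the scripts above are small line-oriented filters. As a rough illustration only (this is not the code of the scripts in this repository), the behaviour described for one-sentence-per-line-to-json.py and cumulative.py could be sketched as below. The function names, the one-JSON-object-per-line output, the stripping of the trailing newline, and the skipping of blank lines are all assumptions made for the sketch.

```python
import json
from typing import Iterable, Iterator


def sentences_to_json(lines: Iterable[str]) -> Iterator[str]:
    """Wrap each input line in a JSON dict with the single key "content".

    Assumption: the trailing newline is dropped and one JSON object is
    produced per input line; the real script may format output differently.
    """
    for line in lines:
        yield json.dumps({"content": line.rstrip("\n")})


def cumulative(lines: Iterable[str]) -> Iterator[float]:
    """Yield the running sum of the numbers found one per input line.

    Assumption: blank lines are skipped and every other line parses as a float.
    """
    total = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        total += float(line)
        yield total
```

Either function accepts any iterable of lines, so passing sys.stdin and printing each yielded value would give a stdin-to-stdout filter in the same spirit as the scripts listed above.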
About
Common scripts, mainly for text processing and experimental control