GitHub - turian/common-scripts: Common scripts, mainly for text processing and experimental control



local/                      - Files of local interest, e.g. with fixed
                            hostnames

README.txt                  - This file

all-xml-to-json.sh          - For every XML file in the command-line,
                            convert it to JSON.

boilerpipe-stdin-urls-to-mongo.py
                            - Run every URL read from sys.stdin through
                            Boilerpipe (or diffbot), and store the result
                            in a MongoDB.

citeseer-get.pl             - Fetch PDFs from CiteSeer.

cumulative.py               - Output a cumulative sum for each line in
                            the input file.
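
A minimal Python sketch of the cumulative-sum idea, not the script's actual
source. It assumes one number per line and skips blank lines; the function
name cumulative_sums is made up for illustration.

```python
from itertools import accumulate


def cumulative_sums(lines):
    """Return an iterator over the running total after each numeric line."""
    # Assumption: one number per line; blank lines are skipped.
    values = (float(line) for line in lines if line.strip())
    return accumulate(values)


# In-memory example; a real script would pass sys.stdin instead.
for total in cumulative_sums(["1", "2.5", "4"]):
    print(total)  # prints 1.0, then 3.5, then 7.5
```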

delexicalize-low-frequency-words.py
                            - Replace every word whose frequency is below
                            minfreq with *UNKNOWN*
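
A rough sketch of delexicalization under stated assumptions: words are
split on whitespace, counting is case-sensitive, and the whole input fits
in memory. The function name delexicalize is hypothetical and the real
script may differ.

```python
from collections import Counter

UNKNOWN = "*UNKNOWN*"


def delexicalize(sentences, minfreq):
    """Replace every word seen fewer than minfreq times with UNKNOWN."""
    # First pass: count word frequencies over the whole input.
    tokenized = [sentence.split() for sentence in sentences]
    counts = Counter(word for words in tokenized for word in words)
    # Second pass: rewrite rare words, keeping sentence boundaries.
    return [
        " ".join(word if counts[word] >= minfreq else UNKNOWN for word in words)
        for words in tokenized
    ]
```

With minfreq=2, the input ["the cat sat", "the dog sat"] keeps "the" and
"sat" and replaces "cat" and "dog".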

dumpdb.py                   - Dump the MongoDB

enscript-landscape-all.pl   - Enscript all files listed in @ARGV in
                            landscape mode.

filter-json.py              - Filter JSON in sys.stdin to find only docs
                            that match each regex with at least one
                            field value.

from-one-line-per-word-to-one-line-per-sentence.py
                            - Read one-line-per-word and convert to
                            one-line-per-sentence.

grep-json.py                - Filter JSON in sys.stdin to find only docs
                            that match each regex against raw JSON.

grep-json-by-field.py       - Filter JSON in sys.stdin to find only docs
                            that match each regex with at least one
                            field value.
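
The matching rule above (every regex must match at least one field value)
could be sketched as follows. This assumes one JSON object per input line
and only inspects top-level string values; both are assumptions, and the
function names are made up for illustration.

```python
import json
import re


def matches_all(doc, patterns):
    """True if every pattern matches at least one string field value."""
    values = [v for v in doc.values() if isinstance(v, str)]
    return all(any(p.search(v) for v in values) for p in patterns)


def grep_json_by_field(lines, regexes):
    """Yield the docs (one JSON object per line) that match every regex."""
    patterns = [re.compile(r) for r in regexes]
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        if matches_all(doc, patterns):
            yield doc
```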

join-json.py                - For each JSON file in sys.argv, join them
                            and output to stdout.

lines-with-funny-characters.pl
                            - Print lines with funny characters

lines-with-no-funny-characters.pl
                            - Print lines without funny characters

load-directory-of-textfiles-into-mongodb.py
                            - For all files recursively in a subdir, load
                            them into a MongoDB with a certain field name.

load-json-into-mongodb.py   - Load JSON from stdin into a MongoDB

htmldecode.pl               - Decode HTML entities, e.g. &lt; becomes <

htmlencode.pl               - Encode HTML entities, e.g. < becomes &lt;
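
For comparison, the same encode/decode round trip in Python's standard
library. This is only an illustration of the behavior described above,
not a port of the Perl scripts.

```python
import html

# Encode: special characters become entities.
encoded = html.escape("a < b")
print(encoded)  # prints: a &lt; b

# Decode: entities become characters again.
decoded = html.unescape(encoded)
print(decoded)  # prints: a < b
```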

html2text                   - Convert HTML to text

mongodb-count.py            - Count the number of entries in a MongoDB
                            collection.

mongodb-field-lengths.py    - Print MongoDB field length and field,
                            for every row.

mongodb-remove-field.py     - Remove every occurrence of some field,
                            for every row, in MongoDB.

mongodb-remove-short-fields.py
                            - Remove every occurrence of some field if it
                            is shorter than some length, for every row,
                            in MongoDB.

mongodb-to-lucene.py        - Read all mongo docs, and insert them
                            into Lucene.

one-sentence-per-line-to-json.py
                            - For each line in stdin, convert it to a JSON
                            dict with key: "content" and value: line.
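
A small sketch of the line-to-JSON conversion. It assumes the trailing
newline is stripped and one JSON object is written per output line; the
function name lines_to_json is hypothetical.

```python
import json


def lines_to_json(lines):
    """Yield one JSON object string per input line, under the key "content"."""
    for line in lines:
        # Assumption: the trailing newline is not part of the sentence.
        yield json.dumps({"content": line.rstrip("\n")})


# In-memory example; a real script would iterate over sys.stdin.
for record in lines_to_json(["Hello world.\n"]):
    print(record)  # prints: {"content": "Hello world."}
```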

page-count.pl               - For each file (usually .ps or .pdf)
                            specified in stdin, count the number of
                            pages in the file.

print-all.pl                - For each file (.ps or .pdf) specified
                            as a command-line argument, print the
                            file to a random printer.

ptb/one-sentence-per-line.pl    - Output one PTB sentence per line,
                            using PTB tagged/ files.

read-xml-mysqldump.py       - Read in the XML mysqldump from sys.stdin.

remove-funny-characters.pl  - Remove any funny character

remove-nonascii-characters.pl   - Remove non-ASCII characters

remove-non-utf10-characters.pl  - Remove non-UTF 1.0 characters

remove-non-utf11-characters.pl  - Remove non-UTF 1.1 characters

sample.pl                   - Sample and print only a certain percentage
                            of input lines.
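
One way to sample a percentage of lines, sketched in Python rather than
Perl. It keeps each line independently with probability percent / 100, so
the output size is only approximately that percentage. Whether sample.pl
works this way is an assumption; the names here are illustrative.

```python
import random


def sample_lines(lines, percent, rng=None):
    """Yield each line independently with probability percent / 100."""
    rng = rng or random.Random()
    for line in lines:
        # rng.random() is in [0, 1), so percent=100 keeps every line
        # and percent=0 keeps none.
        if rng.random() * 100 < percent:
            yield line
```

Passing a seeded generator, e.g. rng=random.Random(0), makes the sample
repeatable.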

shuffle/shuffle.sh          - Shuffle lines of stdin

sort-curves.py              - Sort gnuplot curves

tokenizer.sed               - Penn Treebank tokenizer.

tokenize-English.pl         - Word Tokenizer for English by Al-Onaizan
                            and Melamed.

tsv-to-json.py              - Read TSV from stdin and output as JSON.
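
A possible shape for the TSV-to-JSON conversion. It assumes the first row
holds column names and that each remaining row becomes one JSON object per
output line; the script's real input and output conventions are not
documented here, so treat this as a guess.

```python
import csv
import json


def tsv_to_json(lines):
    """Yield one JSON object string per TSV data row."""
    # Assumption: the first row is a header naming the columns.
    # QUOTE_NONE treats quote characters as ordinary text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield json.dumps(row)
```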

unichars                    - List characters for one or more properties
                            (by Tom Christiansen)

untokenizer                 - Detokenize Penn Treebank formatted text.

vowpal-to-libsvm.py         - Convert a vowpal-wabbit file in stdin
                            to libsvm.

words-integers-mapfile.py   - Create an integer mapfile for the words
                            in textfile.

words-to-integers.py        - Convert words to integers, according to
                            the mapping in mapfile.
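
The two word/integer scripts above work as a pair: build a mapping, then
apply it. A minimal in-memory sketch, assuming ids start at 0 and are
assigned in order of first appearance (the real mapfile format is not
specified here, and the function names are hypothetical):

```python
def build_mapping(words):
    """Assign consecutive integer ids in order of first appearance."""
    mapping = {}
    for word in words:
        # Only unseen words receive a new id.
        mapping.setdefault(word, len(mapping))
    return mapping


def words_to_integers(line, mapping):
    """Replace each whitespace-separated word with its integer id."""
    return " ".join(str(mapping[word]) for word in line.split())


mapping = build_mapping("the cat saw the dog".split())
print(words_to_integers("the dog saw the cat", mapping))  # prints: 0 3 2 0 1
```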

xmlmysqldump.py             - Read in the XML mysqldump from sys.stdin.