Common scripts, mainly for text processing and experimental control

turian/common-scripts

local/                      - Files of local interest, e.g. with fixed
                            hostnames

README.txt                  - This file

all-xml-to-json.sh          - For every XML file on the command line,
                            convert it to JSON.

boilerpipe-stdin-urls-to-mongo.py
                            - Run every sys.stdin URL through Boilerpipe
                            (or diffbot), and store in a MongoDB.

citeseer-get.pl             - Fetch PDFs from citeseer.

cumulative.py               - Output a cumulative sum for each line in
                            the input file.

delexicalize-low-frequency-words.py
                            - Delexicalize all words with frequency less
                            than minfreq, mapping them to *UNKNOWN*.

dumpdb.py                   - Dump the MongoDB

enscript-landscape-all.pl   - Enscript all files listed in @ARGV in
                            landscape mode.

filter-json.py              - Filter JSON in sys.stdin to find only docs
                            that match each regex with at least one
                            field value.

from-one-line-per-word-to-one-line-per-sentence.py
                            - Read one-line-per-word and convert to
                            one-line-per-sentence.

grep-json.py                - Filter JSON in sys.stdin to find only docs
                            that match each regex against raw JSON.

grep-json-by-field.py       - Filter JSON in sys.stdin to find only docs
                            that match each regex with at least one
                            field value.

join-json.py                - For each JSON file in sys.argv, join them
                            and output to stdout.

lines-with-funny-characters.pl
                            - Print lines with funny characters

lines-with-no-funny-characters.pl
                            - Print lines without funny characters

load-directory-of-textfiles-into-mongodb.py
                            - For all files recursively in a subdir, load
                            them into a MongoDB with a certain field name.

load-json-into-mongodb.py   - Load JSON from stdin into a MongoDB

htmldecode.pl               - Decode HTML entities, e.g. &lt; becomes <

htmlencode.pl               - Encode HTML entities, e.g. < becomes &lt;

html2text                   - Convert HTML to text

mongodb-count.py            - Count the number of entries in a mongodb
                            collection.

mongodb-field-lengths.py    - Print MongoDB field length and field,
                            for every row.

mongodb-remove-field.py     - Remove every occurrence of some field,
                            for every row, in MongoDB.

mongodb-remove-short-fields.py
                            - Remove every occurrence of some field if it
                            is shorter than some length, for every row,
                            in MongoDB.

mongodb-to-lucene.py        - Read all mongo docs, and insert them
                            into Lucene.

one-sentence-per-line-to-json.py
                            - For each line in stdin, convert it to a JSON
                            dict with key: "content" and value: line.

page-count.pl               - For each file (usually .ps or .pdf)
                            specified in stdin, count the number of
                            pages in the file

print-all.pl                - For each file (.ps or .pdf) specified
                            as a command-line argument, print the
                            file to a random printer.

ptb/one-sentence-per-line.pl    - Output one PTB sentence per line,
                            using PTB tagged/ files.

read-xml-mysqldump.py       - Read in the XML mysqldump from sys.stdin.

remove-funny-characters.pl  - Remove any funny character

remove-nonascii-characters.pl   - Remove non-ASCII characters

remove-non-utf10-characters.pl  - Remove non-UTF 1.0 characters

remove-non-utf11-characters.pl  - Remove non-UTF 1.1 characters

sample.pl                   - Sample and print only a certain percentage
                            of input lines.

shuffle/shuffle.sh          - Shuffle lines of stdin

sort-curves.py              - Sort gnuplot curves

tokenizer.sed               - Penn Treebank tokenizer.

tokenize-English.pl         - Word Tokenizer for English by Al-Onaizan
                            and Melamed.

tsv-to-json.py              - Read TSV from stdin and output as JSON.

unichars                    - List characters for one or more properties
                            (by Tom Christiansen)

untokenize                  - Detokenize Penn Treebank formatted text.

vowpal-to-libsvm.py         - Convert a vowpal-wabbit file in stdin
                            to libsvm.

words-integers-mapfile.py   - Create an integer mapfile for the words
                            in textfile.

words-to-integers.py        - Convert words to integers, according to
                            the mapping in mapfile.

xmlmysqldump.py             - Read in the XML mysqldump from sys.stdin.
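
To give a feel for the kind of stream processing listed above, here is a
minimal Python sketch of two of the simpler entries: wrapping each input line
in a JSON dict under the key "content" (as described for
one-sentence-per-line-to-json.py) and emitting a running total per line (as
described for cumulative.py). This is not the code from those scripts. The
function names lines_to_json and cumulative are hypothetical, and the details
the descriptions leave open (stripping the trailing newline, skipping blank
lines, parsing numbers as floats) are assumptions.

```python
import json
from itertools import accumulate


def lines_to_json(lines):
    """Yield one JSON object per input line, stored under the key "content".

    Assumption: the trailing newline is not part of the stored value.
    """
    for line in lines:
        yield json.dumps({"content": line.rstrip("\n")})


def cumulative(lines):
    """Return an iterator over the running total of one number per line.

    Assumption: blank lines are skipped and values are parsed as floats.
    """
    values = (float(line) for line in lines if line.strip())
    return accumulate(values)
```

Both functions accept any iterable of lines, so passing sys.stdin gives the
filter-style behaviour the entries describe, and the results can be printed
one per line.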
