local/ - Files of local interest, e.g. with fixed
hostnames
README.txt - This file
all-xml-to-json.sh - For every XML file on the command line,
convert it to JSON.
boilerpipe-stdin-urls-to-mongo.py
- Run every sys.stdin URL through Boilerpipe
(or diffbot), and store in a MongoDB.
citeseer-get.pl - Fetch PDFs from citeseer.
cumulative.py - Output a cumulative sum for each line in
the input file.
delexicalize-low-frequency-words.py
- Delexicalize every word with frequency below
minfreq, replacing it with *UNKNOWN*
dumpdb.py - Dump the MongoDB
enscript-landscape-all.pl - Enscript all files listed in @ARGV in
landscape mode.
filter-json.py - Filter JSON in sys.stdin to find only docs
that match each regex with at least one
field value.
from-one-line-per-word-to-one-line-per-sentence.py
- Read one-line-per-word and convert to
one-line-per-sentence.
grep-json.py - Filter JSON in sys.stdin to find only docs
that match each regex against raw JSON.
grep-json-by-field.py - Filter JSON in sys.stdin to find only docs
that match each regex with at least one
field value.
join-json.py - For each JSON file in sys.argv, join them
and output to stdout.
lines-with-funny-characters.pl
- Print lines with funny characters
lines-with-no-funny-characters.pl
- Print lines without funny characters
load-directory-of-textfiles-into-mongodb.py
- For all files recursively in a subdir, load
them into a MongoDB with a certain field name.
load-json-into-mongodb.py - Load JSON from stdin into a MongoDB
htmldecode.pl - Decode HTML entities, e.g. &lt; becomes <
htmlencode.pl - Encode HTML entities, e.g. < becomes &lt;
html2text - Convert HTML to text
mongodb-count.py - Count the number of entries in a MongoDB
collection.
mongodb-field-lengths.py - Print MongoDB field length and field,
for every row.
mongodb-remove-field.py - Remove every occurrence of some field,
for every row, in MongoDB.
mongodb-remove-short-fields.py
- Remove every occurrence of some field if it
is shorter than some length, for every row,
in MongoDB.
mongodb-to-lucene.py - Read all MongoDB docs, and insert them
into Lucene.
one-sentence-per-line-to-json.py
- For each line in stdin, convert it to a JSON
dict with key: "content" and value: line.
page-count.pl - For each file (usually .ps or .pdf)
specified in stdin, count the number of
pages in the file
print-all.pl - For each file (.ps or .pdf) specified
as a command-line argument, print the
file to a random printer.
ptb/one-sentence-per-line.pl - Output one PTB sentence per line,
using PTB tagged/ files.
read-xml-mysqldump.py - Read in the XML mysqldump from sys.stdin.
remove-funny-characters.pl - Remove any funny character
remove-nonascii-characters.pl - Remove non-ASCII characters
remove-non-utf10-characters.pl - Remove non-UTF 1.0 characters
remove-non-utf11-characters.pl - Remove non-UTF 1.1 characters
sample.pl - Sample and print only a certain percentage
of input lines.
shuffle/shuffle.sh - Shuffle lines of stdin
sort-curves.py - Sort gnuplot curves
tokenizer.sed - Penn Treebank tokenizer.
tokenize-English.pl - Word Tokenizer for English by Al-Onaizan
and Melamed.
tsv-to-json.py - Read TSV from stdin and output as JSON.
unichars - List characters for one or more properties
(by Tom Christiansen)
untokenize - Detokenize Penn Treebank formatted text.
vowpal-to-libsvm.py - Convert a vowpal-wabbit file in stdin
to libsvm.
words-integers-mapfile.py - Create an integer mapfile for the words
in textfile.
words-to-integers.py - Convert words to integers, according to
the mapping in mapfile.
xmlmysqldump.py - Read in the XML mysqldump from sys.stdin.
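
Example sketch. To give a feel for the kind of line-oriented
transformation several of the scripts above describe, here is a minimal
Python sketch of three of them: wrapping a line in a JSON dict with key
"content" (as one-sentence-per-line-to-json.py is described), a running
sum over lines (as cumulative.py is described), and a word-to-integer
mapping (as words-integers-mapfile.py is described). This is not the code
of those scripts. The function names, the stripping of the trailing
newline, the one-number-per-line input, and the first-appearance ID scheme
are all assumptions made for illustration.

```python
import itertools
import json


def line_to_json(line):
    """Wrap one input line in a JSON dict under the key "content".

    Assumption: the trailing newline is dropped before encoding.
    """
    return json.dumps({"content": line.rstrip("\n")})


def cumulative(lines):
    """Return an iterator of running totals.

    Assumption: each non-blank line holds exactly one number.
    """
    return itertools.accumulate(float(line) for line in lines if line.strip())


def build_word_map(words):
    """Map each distinct word to an integer ID.

    Assumption: IDs start at 0 and follow order of first appearance;
    the real mapfile format may differ.
    """
    mapping = {}
    for word in words:
        mapping.setdefault(word, len(mapping))
    return mapping
```

In a script, each helper would typically be driven by sys.stdin, for
example: for line in sys.stdin: print(line_to_json(line)).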