A simple format that can be viewed in excel. Really easy to see in text editor, contains only one table. Lets look at an example CSV file. Here is some data on expenditures in the U.S. House of Representatives:
https://projects.propublica.org/represent/expenditures
Grab a file
wget https://projects.propublica.org/congress/assets/staffers/2017Q1-house-disburse-detail.csv
If running wget tells you "command not found", install it using brew install wget
on Mac or sudo apt-get install wget
for Ubuntu.
Take a look
head 2017Q1-house-disburse-detail.csv
wc -l 2017Q1-house-disburse-detail.csv
Find something
cat 2017Q1-house-disburse-detail.csv | grep WATER
Where did the header go?
curl "https://projects.propublica.org/congress/assets/staffers/2017Q1-house-disburse-detail.csv" | egrep -i "water|bioguide" > water.txt
cat water.txt| csvstat
https://csvkit.readthedocs.io/en/1.0.1/. This program is its both a command line interface (CLI) and a python package. It helps parse CSVs cleanly.
pip3 install csvkit
Now we can search within a column
cat 2017Q1-house-disburse-detail.csv | csvgrep -c OFFICE -m WATER > spending_by_waters_office.csv
cat spending_by_waters_office.csv |csvstat
cat 2017Q1-house-disburse-detail.csv | csvgrep -c PURPOSE -m WATER > spending_on_water.csv
cat spending_on_water.csv | csvstat
Hmm...who is spending so much on water?
cat spending_on_water.csv | grep "40333.33"
in2csv roster.xlsx | csvstat
in2csv roster.xlsx | csvcut -c 9 | tail -n +2 | awk '{gsub("\"", "")}1' | cut -d'|' -f 1 | sed 's/^[ \t]*//;s/[ \t]*$//' | sort | uniq -c
- Reproducability
- Portability
- Transparancy
- To be obnioxious
- Simplicity and ease of use. Once you're comfortable with this, its easier sometimes to use the command line for simple cleaning operations before piping to file and starting the real analysis.