Why this data?

The CDCR (California Department of Corrections and Rehabilitation) releases monthly reports on the number of people in state prisons around California. These reports also include the designed capacity of each prison, and how the current population compares to that capacity. California prisons have had extreme overcrowding issues for a long time; see, e.g., Brown v. Plata, the US Supreme Court case that finally sparked a concerted effort to reduce overcrowding.

Unfortunately, the reports provided by the CDCR are only available in PDF format, with one PDF provided per month. Analyzing those numbers across many months and years is very difficult in that format. This repository changes that by parsing the monthly PDF reports from 1996 to the present day and providing the data in one CSV.

The monthly, per-prison data is available in data/monthly_cdcr_population.csv, or on Google Sheets.

The data look something like this:

head data/monthly_cdcr_population.csv | column -t -s,
year  month  institution_name                  population_felons  civil_addict  total_population  designed_capacity  percent_occupied  staffed_capacity  source_pdf_name
1996  01     VSP (VALLEY SP)                   2294               0             2294              1980               115.9             1980              TPOP1Ad9601.pdf
1996  01     SCC (SIERRA CONSERVATION CENTER)  322                0             322               320                100.6             320               TPOP1Ad9601.pdf
1996  01     NCWF (NO CAL WOMEN'S FACIL)       786                4             790               400                197.5             760               TPOP1Ad9601.pdf
1996  01     CCWF (CENTRAL CA WOMEN'S FAC)     2846               13            2859              2004               142.7             3224              TPOP1Ad9601.pdf
...
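
As a rough illustration of how the columns relate, the sketch below recomputes percent_occupied as total_population divided by designed_capacity and reports rows where the two disagree. The column names come from the sample above; the function names, the tolerance value, and the assumption that every row holds plain numeric values are mine, not part of this repository.

```python
import csv


def _to_float(value):
    """Return value as a float, or None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def find_mismatches(lines, tolerance=0.15):
    """Return (institution_name, reported, recomputed) for rows whose
    reported percent_occupied differs from a recomputed value.

    `lines` is any iterable of CSV text lines, such as an open file.
    The tolerance is a hypothetical choice, a bit over one rounding step.
    """
    mismatches = []
    for row in csv.DictReader(lines):
        population = _to_float(row.get("total_population"))
        capacity = _to_float(row.get("designed_capacity"))
        reported = _to_float(row.get("percent_occupied"))
        # Skip rows that cannot be checked (missing values or zero capacity).
        if population is None or reported is None or not capacity:
            continue
        recomputed = round(100 * population / capacity, 1)
        if abs(reported - recomputed) > tolerance:
            mismatches.append((row["institution_name"], reported, recomputed))
    return mismatches


# Two rows copied from the sample above, used here instead of the real file.
SAMPLE = [
    "year,month,institution_name,population_felons,civil_addict,"
    "total_population,designed_capacity,percent_occupied,"
    "staffed_capacity,source_pdf_name",
    "1996,01,VSP (VALLEY SP),2294,0,2294,1980,115.9,1980,TPOP1Ad9601.pdf",
    "1996,01,NCWF (NO CAL WOMEN'S FACIL),786,4,790,400,197.5,760,TPOP1Ad9601.pdf",
]
print(find_mismatches(SAMPLE))
```

To run the same check against the full data set, pass an open file instead of SAMPLE, for example `open("data/monthly_cdcr_population.csv", newline="")`.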

I've gone through a number of the PDFs by hand to double-check that the numbers are correct, but if you spot mistakes or otherwise think something is wrong, please create a GitHub issue. If you're not familiar with GitHub and want to report a bug, please reach out via email ([email protected]).

Raw PDFs

Data come from the PDFs of monthly archives at: https://www.cdcr.ca.gov/research/monthly-total-population-report-archive/

The PDFs themselves are downloaded and checked into this repository under data/raw_monthly_pdfs/. The names of the PDFs have not been changed. PDFs from before 2019 were downloaded by running a script (datacleaning/scrape_from_cdcr.py). The naming of files and directories changed somewhat starting in 2019, so to add a new PDF, go to the website, manually download the newest month's report, and save it to the data/raw_monthly_pdfs/ directory.

Parsing the PDFs

The PDFs are parsed using tools in the datacleaning directory in the root of this repo. The parsed output is stored at data/monthly_cdcr_population.csv.

To re-parse / re-generate that CSV, run:

python datacleaning/bulk_parse_pdfs.py --verbose
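
After regenerating the CSV, one quick sanity check is to sum total_population per month and look for months that are missing or wildly off. The sketch below is one way to do that using only the standard library. The function name is hypothetical, and it assumes every row is a single institution; if the file also contains subtotal rows, they would be counted twice.

```python
import csv
from collections import defaultdict


def monthly_totals(lines):
    """Sum total_population for each (year, month) pair.

    `lines` is any iterable of CSV text lines, such as an open file.
    Rows with missing or non-numeric values are skipped.
    """
    totals = defaultdict(int)
    for row in csv.DictReader(lines):
        try:
            key = (int(row["year"]), int(row["month"]))
            population = int(row["total_population"])
        except (KeyError, TypeError, ValueError):
            continue
        totals[key] += population
    # Sort chronologically so gaps between months are easy to spot.
    return dict(sorted(totals.items()))


# Example usage against the regenerated file (path taken from this README):
#
#     with open("data/monthly_cdcr_population.csv", newline="") as f:
#         for (year, month), total in monthly_totals(f).items():
#             print(year, month, total)
```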

Running tests

There are tests! Run them with nose:

nosetests
