The official publication of the Supreme Court is the bound volumes of the United States Reports. On the Supreme Court's website, there are PDFs of the bound volumes dating from 1991.
There are a couple of problems with these bound volumes:
- They're old. The most recent one is from the 2008 term.
- They're PDFs. Large (several-hundred-page), unstructured PDFs.
There are a couple of great things about these bound volumes:
- They have all of the cases from a given term AND all of the orders from the term.
- They come with some small data sets near the end.
- They have a hand-rolled index at the very end.
- In some situations, they have touching little artifacts about the Court (such as the recent retirement letters of Justice Souter).
So, the goal is to parse these PDFs and produce some beautiful XML/HTML. A word of caution, though, based on an earlier effort of mine: because there are headers, footers, and footnotes, it is not enough to simply run pdftotext on the bound volumes. Some clustering will likely be needed. With that, let's begin.
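
To make the header and footer problem concrete, here is a minimal Python sketch of one possible first pass. It is not the clustering approach mentioned above, just a simpler frequency heuristic: it assumes the text came from pdftotext (which separates pages with a form feed character), and it drops the first or last line of a page when that line, with digits masked out, recurs on at least half of the pages. The function names, the `min_fraction` threshold, and the sample text are all made up for illustration; the real running heads in the bound volumes may well vary from page to page and need something smarter.

```python
import re
from collections import Counter


def split_pages(text):
    """Split pdftotext output into pages (pages are separated by form feeds)."""
    return [page for page in text.split("\f") if page.strip()]


def normalize(line):
    """Collapse whitespace and mask digits so 'Page 12' and 'Page 13' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split()))


def strip_running_lines(pages, min_fraction=0.5):
    """Drop first/last lines that recur on at least min_fraction of the pages.

    This is a hypothetical heuristic, not a full solution: it ignores
    footnotes and only ever looks at one line at the top and bottom.
    """
    # Keep only the non-blank lines of each page.
    page_lines = [[ln for ln in page.splitlines() if ln.strip()] for page in pages]

    # Count each normalized first/last line once per page.
    counts = Counter()
    for lines in page_lines:
        if lines:
            counts.update({normalize(lines[0]), normalize(lines[-1])})

    threshold = min_fraction * len(page_lines)
    cleaned = []
    for lines in page_lines:
        if lines and counts[normalize(lines[0])] >= threshold:
            lines = lines[1:]   # looks like a running header
        if lines and counts[normalize(lines[-1])] >= threshold:
            lines = lines[:-1]  # looks like a running footer
        cleaned.append("\n".join(lines))
    return cleaned


if __name__ == "__main__":
    # Made-up sample standing in for real pdftotext output.
    sample = (
        "Sample Running Header 1\nFirst body.\n\f"
        "Sample Running Header 2\nSecond body.\n\f"
        "Sample Running Header 3\nThird body.\n"
    )
    for page in strip_running_lines(split_pages(sample)):
        print(page)
```

On the made-up sample, the three header lines are removed and only the body lines remain. Footnotes are untouched by this sketch, which is part of why position-based clustering is likely to be needed.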