Joshua Tauberer edited this page Nov 4, 2013 · 11 revisions

Overview

Bill text is sourced from the U.S. Government Printing Office, which is the legislative-branch agency tasked with publishing documents.

For every bill, there may be zero or more prints of the bill. A print is in general a snapshot of the text of the bill at a given time. Each print is assigned a status code from a kind-of-but-not-really fixed set of status codes. When a House bill or resolution is introduced, its first print is for the "ih" (Introduced in House) status code.

Prints typically occur when a bill is introduced, after votes on passage, and when a bill is sent to the president (enrollment). Prints are always at least a day behind, and may sometimes not be published for days or weeks after the activity occurred. And because prints only occur after major action, there is typically no new bill text published as amendments are adopted. See more in the documentation on FDSys.

Data Formats

Bill text comes in several formats:

  • PDF. The Government Printing Office converts all bills to PDF. These are available in the GPO FDSys Congressional Bills (BILLS) collection starting with the 103rd Congress. There is no comprehensive bill text available before the 103rd Congress.

  • Plain Text. A plain text version of each bill can also be found alongside any PDF. The plain text from GPO is pretty good. It omits line numbers, which are hard to ignore in the PDF text layer, and it doesn't hyphenate words that happen to be broken across lines in the print form, which is very handy for search indexing. So when you need plain text, use the plain text from GPO and not the text layer of the PDF. Note that GPO calls these files HTML, but they're HTML wrappers around plain text.

  • XML. Starting with approximately the 111th Congress, bills have been drafted in XML. The XML drafting process began a few Congresses earlier, but it wasn't initially comprehensive. The documentation for the XML format is at xml.house.gov. These XML files can be found on GPO FDSys in the Congressional Bills (BILLS) collection and also in undocumented directories such as http://thomas.loc.gov/home/gpoxmlc113/. GPO also has a "bulk data" prototype for House bill XML, but it's entirely redundant with the other sources and, because it excludes Senate bills, has strictly less information.

  • HTML. Prior to the 111th Congress, the only comprehensive source of semi-structured bill text data was the HTML rendition of bill text as it appeared on THOMAS. GovTrack has scraped this starting with the 103rd Congress (the earliest that bill text is available). We don't currently make it available through this project.

  • In addition, a "MODS" file, which is an XML file, is available for all bills on GPO FDSys. These files contain fairly detailed metadata about the bill. One interesting component is a list of citations found in the bill.
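As noted in the Plain Text item above, GPO's "HTML" files are thin wrappers around plain text. If you download one directly rather than through our scrapers, a minimal sketch for recovering the text might look like the following. It assumes the wrapper holds the bill text in a single <pre> element; the function name is hypothetical and not part of this project.

```python
import html
import re


def unwrap_gpo_html(page: str) -> str:
    """Return the text inside the first <pre>...</pre> element, entities decoded.

    Assumption: the file is an HTML wrapper around one <pre> block.
    If no <pre> element is found, the input is returned unchanged.
    """
    match = re.search(r"<pre[^>]*>(.*?)</pre>", page, flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        return page
    # Decode entities such as &lt; and &amp; back into plain characters.
    return html.unescape(match.group(1))


sample = "<html><body><pre>\nA BILL\nTo amend &lt;the Act&gt;.\n</pre></body></html>"
print(unwrap_gpo_html(sample))
```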

File layout

Our scrapers store bill text like this:

data/[congress]/bills/[bill_type]/[bill_type][number]/text-versions/[status_code]/document.[format]

For instance, the directory:

113/bills/hr/hr1237/text-versions/ih

is for bill text information for the ih status of H.R. 1237 in the 113th Congress. This directory may contain:

document.txt (plain text version; UTF-8 encoded)
document.xml (XML version)
mods.xml (MODS metadata file)
data.json (metadata extracted from the MODS file in a more convenient JSON format)
catoxml.xml (Cato Institute Deepbills file)
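The layout above is regular enough to build paths programmatically. This is a small sketch, not code from the scrapers; the function name and the default root of "data" are assumptions made for illustration.

```python
from pathlib import Path


def text_version_dir(congress: int, bill_type: str, number: int,
                     status_code: str, root: str = "data") -> Path:
    """Build the text-versions directory for one print of a bill."""
    return (Path(root) / str(congress) / "bills" / bill_type
            / f"{bill_type}{number}" / "text-versions" / status_code)


# The "ih" print of H.R. 1237 in the 113th Congress.
version_dir = text_version_dir(113, "hr", 1237, "ih")
print(version_dir.as_posix())
print((version_dir / "document.txt").as_posix())
```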

Here's data.json:

{
  "bill_version_id": "hr1237-113-ih", 
  "issued_on": "2013-03-18", 
  "urls": {
    "html": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/html/BILLS-113hr1237ih.htm", 
    "pdf": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/pdf/BILLS-113hr1237ih.pdf", 
    "unknown": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/content-detail.html", 
    "xml": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/xml/BILLS-113hr1237ih.xml"
  },
  "version_code": "ih"
}

In general, the way to find the current or most recent text of a bill is to look through all of the text-versions directories (for each status code), read data.json, and select the status directory that has the most recent issued_on date. The status codes do not reliably have an order. (That's not to say there isn't a pattern to it, but it's hard or impossible to always know from a status code which is most recent.)
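The selection procedure just described can be sketched as follows. The function names are hypothetical. The sketch assumes every data.json has an issued_on value in YYYY-MM-DD form, as in the example above, so plain string comparison orders the dates correctly.

```python
import json
from pathlib import Path
from typing import List, Optional


def load_versions(bill_dir: Path) -> List[dict]:
    """Read data.json from every status directory under text-versions."""
    versions = []
    for data_file in sorted((bill_dir / "text-versions").glob("*/data.json")):
        with data_file.open(encoding="utf-8") as f:
            versions.append(json.load(f))
    return versions


def pick_latest(versions: List[dict]) -> Optional[dict]:
    """Return the version with the most recent issued_on date, or None."""
    if not versions:
        return None
    # ISO dates (YYYY-MM-DD) sort chronologically as strings.
    return max(versions, key=lambda v: v["issued_on"])


# Usage, assuming the bill has been scraped into the layout above:
# latest = pick_latest(load_versions(Path("data/113/bills/hr/hr1237")))
# if latest is not None:
#     print(latest["version_code"], latest["issued_on"])
```

If two prints share the same issued_on date, this returns whichever one was read first, so callers that care about ties need their own tie-breaking rule.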

Scrapers

We have four scrapers that produce bill text files:

  • Bill text via FDSys: fdsys.py for the GPO BILLS collection. This gathers GPO's actual bill text data, from the 103rd Congress (1993) to the present. It's smart about updating only changed files.

  • Bill metadata via FDSys: bill_versions.py. This downloads MODS files only and creates the data.json files only. It's not smart. It downloads everything that it hasn't yet seen and doesn't update any existing MODS files if they happen to be updated upstream.

  • Bill text via the Statutes at Large: statutes.py for the GPO STATUTE collection. This extracts bill information from GPO's Statutes at Large collection, which covers enacted bills and agreed-to concurrent resolutions from 1951 to the present. Since there is better bill data starting in 1993, this scraper should only be used up to 1992. (The scraper also extracts bill metadata. As noted on the bill documentation page, better bill metadata comes from THOMAS starting in 1973.)

  • Cato Institute Deepbills Project XML: deepbills.py for the Cato Institute Deepbills Project, which creates better XML for bills starting with the 113th Congress (2013).

Bill text via FDSys

The fdsys.py scraper provides infrastructure for various GPO FDSys-fetching tasks, including fetching bill text.

FDSys is organized by collection (bills, committee reports, etc.) and year.

To download all bill text for the 112th Congress, run:

./run fdsys --collections=BILLS --congress=112 --store=pdf,mods,xml,text

Here we're requesting the PDF, MODS, XML, and plain text versions. You'll get them in the file layout as described above.

Running the command again will smartly update changed files by scanning through FDSys's sitemaps for changed sitemaps and changed files.

All arguments are optional. Without --store, the script just updates a local copy of the sitemap files in cache/fdsys/sitemap/YEAR/COLLECTION.xml. Use --cached to force the use of cached files and not hit the network except when a file does not exist. Use --force to download all files anew.

Using fdsys.py for other collections

This scraper can be used to smartly fetch any collection in FDSys. The stored files for other collections (besides bills) are organized in a more generic way: in data/fdsys/COLLECTION/YEAR/PKGID. The PKGID is the package identifier for the file on FDSys. For instance:

./run fdsys --collections=STATUTE --year=1982 --store=mods
data/fdsys/STATUTE/1982/STATUTE-96/mods.xml

The collections argument can take a comma-separated list of collections. To get a list of collection names, use:

./run fdsys --list-collections

Bill metadata via FDSys

A separate scraper called bill_versions.py will download just the mods.xml files (if they aren't already downloaded) and will create just the data.json files.

./run bill_versions --congress=112
data/112/bills/hjres/hjres6/text-versions/ih/mods.xml
data/112/bills/hjres/hjres6/text-versions/ih/data.json

Use bill_versions if you only need the MODS and JSON files, or to update the JSON files in bulk if our data format changes.

Bill text via the Statutes at Large

The statutes.py scraper extracts bill metadata and bill text from the U.S. Statutes at Large, which is the final compilation of enacted bills and agreed-to concurrent resolutions that is published after the end of each Congress. We can use the Statutes at Large to fill in bill text for enacted bills and agreed-to concurrent resolutions from 1951 to 1992, since there is no comprehensive source of bill text in that period. See the documentation on Bills for how to run this scraper to generate bill metadata.

First download the Statutes at Large from GPO:

./run fdsys --collections=STATUTE --store=pdfs --granules

Then run this scraper:

./run statutes --volumes=87-106 --textversions --extracttext

This processes all downloaded statutes files in the period of time for which normal bill text is not available, and it saves bill text files, e.g.:

data/82/bills/hr/hr1/text-versions/enr/data.json
data/82/bills/hr/hr1/text-versions/enr/document.txt

Of course, we only get bill text for the enr (Enrolled) status of bills.

When --extracttext is given, each PDF is converted to text using "pdftotext -layout" and the result is stored in the usual place for bill text in plain text format (as indicated above). The text file is UTF-8 encoded (as usual) and has form-feed characters marking page breaks.
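Because page breaks are marked with form-feed characters, the extracted text can be split back into pages. This is a minimal sketch with a hypothetical function name. It drops one trailing empty chunk, on the assumption that the text may end with a form feed after the last page.

```python
from typing import List


def split_pages(text: str) -> List[str]:
    """Split extracted text into pages on form-feed characters."""
    pages = text.split("\f")
    # A form feed at the very end would otherwise yield an empty final page.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# Usage, assuming the file exists in the layout described earlier:
# with open("data/82/bills/hr/hr1/text-versions/enr/document.txt", encoding="utf-8") as f:
#     pages = split_pages(f.read())
#     print(len(pages), "pages")
```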

If --linkpdf is given, then hard links are created from where the PDF should be for bill text to where the PDF has been downloaded in the fdsys directory, i.e.:

data/82/bills/hr/hr1/text-versions/enr/document.pdf
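A hard link makes document.pdf a second name for the already-downloaded file, so no extra disk space is used. The sketch below illustrates the idea with os.link and is not the scraper's own code. The function name is hypothetical, and the caller supplies the location of the downloaded PDF, since that path is not spelled out here.

```python
import os
from pathlib import Path


def link_pdf(downloaded_pdf: Path, version_dir: Path) -> Path:
    """Hard-link a downloaded PDF into a text-versions directory as document.pdf."""
    version_dir.mkdir(parents=True, exist_ok=True)
    target = version_dir / "document.pdf"
    # Skip if the link (or any file with that name) is already in place.
    if not target.exists():
        os.link(downloaded_pdf, target)
    return target
```

Hard links only work within a single filesystem, so this assumes the fdsys download directory and the bill data directory live on the same volume.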

You can also use --volume=65, --volumes=65-86, --year=1951, or --years=1951-1972 to limit which files are created.

Cato Institute Deepbills Project XML

deepbills.py TBD
