Bill Text
Bill text is sourced from the U.S. Government Printing Office, which is the legislative-branch agency tasked with publishing documents.
For every bill, there may be zero or more prints of the bill. A print is in general a snapshot of the text of the bill at a given time. Each print is assigned a status code from a kind-of-but-not-really fixed set of status codes. When a House bill or resolution is introduced, its first print is for the "ih" (Introduced in House) status code.
Prints typically occur when a bill is introduced, after votes on passage, and when a bill is sent to the president (enrollment). Prints are always at least a day behind, and may sometimes not be published for days or weeks after the activity occurred. And because prints only occur after major action, there is typically no new bill text published as amendments are adopted. See more in the documentation on FDSys.
Bill text comes in several formats:
- PDF. The Government Printing Office converts all bills to PDF. These are available in the GPO FDSys Congressional Bills (BILLS) collection starting with the 103rd Congress. There is no comprehensive bill text available before the 103rd Congress.
- Plain Text. A plain text version of each bill can also be found alongside any PDF. The plain text from GPO is pretty good. It omits line numbers, which are hard to ignore in the PDF text layer, and it doesn't hyphenate words that happen to be broken across lines in the print form, which is very handy for search indexing. So when using plain text, use the plain text from GPO and not the text layer of the PDF. Note that GPO calls these files HTML, but they're HTML wrappers around plain text.
- XML. Starting with approximately the 111th Congress, bills have been drafted in XML. The XML drafting process began a few Congresses earlier, but it wasn't initially comprehensive. The documentation for the XML format is at xml.house.gov. These XML files can be found on GPO FDSys in the Congressional Bills (BILLS) collection and also in undocumented directories such as http://thomas.loc.gov/home/gpoxmlc113/. GPO also has a "bulk data" prototype for House bill XML, but it's entirely redundant with the other sources and, because it excludes Senate bills, has strictly less information.
- HTML. Prior to the 111th Congress, the only comprehensive source of semi-structured bill text data was the HTML rendition of bill text as it appeared on THOMAS. GovTrack has scraped this starting with the 103rd Congress (the earliest that bill text is available). We don't currently make it available through this project.

In addition, a "MODS" file (which is an XML file) is available for all bills on GPO FDSys. These files contain fairly detailed metadata about the bill. One interesting component is a list of citations found in the bill.
Our scrapers store bill text like this:
data/[congress]/bills/[bill_type]/[bill_type][number]/text-versions/[status_code]/document.[format]
For instance, the directory:
113/bills/hr/hr1237/text-versions/ih
is for bill text information for the ih status of H.R. 1237 in the 113th Congress. This directory may contain:
document.txt (plain text version; UTF-8 encoded)
document.xml (XML version)
mods.xml (MODS metadata file)
data.json (metadata extracted from the MODS file in a more convenient JSON format)
catoxml.xml (Cato Institute Deepbills file)
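As a rough illustration of the layout above, the following sketch builds the path to one document file. The helper names (text_version_dir, document_path) and the data_root parameter are made up for this example and are not part of the scrapers.

```python
from pathlib import Path


def text_version_dir(congress, bill_type, number, status_code, data_root="data"):
    """Return the directory that holds one print (status code) of a bill.

    Follows the layout described above:
    data/[congress]/bills/[bill_type]/[bill_type][number]/text-versions/[status_code]
    """
    bill_id = f"{bill_type}{number}"
    return (
        Path(data_root)
        / str(congress)
        / "bills"
        / bill_type
        / bill_id
        / "text-versions"
        / status_code
    )


def document_path(congress, bill_type, number, status_code, fmt, data_root="data"):
    """Return the path of document.[format] for one print of a bill."""
    directory = text_version_dir(congress, bill_type, number, status_code, data_root)
    return directory / f"document.{fmt}"


# The "ih" print of H.R. 1237 in the 113th Congress, XML version.
print(document_path(113, "hr", 1237, "ih", "xml").as_posix())
# prints: data/113/bills/hr/hr1237/text-versions/ih/document.xml
```

A file at that path only exists if the corresponding scraper has stored that format for that print.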
Here's data.json:
{
"bill_version_id": "hr1237-113-ih",
"issued_on": "2013-03-18",
"urls": {
"html": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/html/BILLS-113hr1237ih.htm",
"pdf": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/pdf/BILLS-113hr1237ih.pdf",
"unknown": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/content-detail.html",
"xml": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/xml/BILLS-113hr1237ih.xml"
},
"version_code": "ih"
}
In general, the way to find the current or most recent text of a bill is to look through all of the text-versions directories (one for each status code), read data.json in each, and select the status directory that has the most recent issued_on date. The status codes do not reliably have an order. (That's not to say there isn't a pattern to it, but it's hard or impossible to always know from a status code which is most recent.)
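The selection procedure just described could be sketched roughly as follows. The function name latest_text_version is hypothetical, and the sketch assumes every data.json has an issued_on value in the YYYY-MM-DD form shown in the example, so that string comparison matches date order.

```python
import json
from pathlib import Path


def latest_text_version(bill_dir):
    """Return (status_code, metadata) for the most recently issued print.

    bill_dir is a bill directory such as data/113/bills/hr/hr1237.
    Returns None when no text-versions/*/data.json file is found.
    """
    versions = []
    for data_file in Path(bill_dir).glob("text-versions/*/data.json"):
        with open(data_file, encoding="utf-8") as f:
            meta = json.load(f)
        issued_on = meta.get("issued_on")
        if issued_on:
            # The status code is the name of the directory holding data.json.
            versions.append((issued_on, data_file.parent.name, meta))

    if not versions:
        return None

    # ISO-style dates (YYYY-MM-DD) sort correctly as plain strings.
    _, status_code, meta = max(versions, key=lambda v: v[0])
    return status_code, meta
```

If two prints share the same issued_on date, this sketch returns whichever one it happened to read first; it does not try to break the tie using the status code.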
We have four scrapers that produce bill text files:
- Bill text via FDSys: fdsys.py for the GPO BILLS collection. This gathers GPO's actual bill text data, from the 103rd Congress (1993) to the present. It's smart about updating only changed files.
- Bill metadata via FDSys: bill_versions.py. This downloads MODS files only and creates the data.json files only. It's not smart: it downloads everything that it hasn't yet seen and doesn't update any existing MODS files if they happen to be updated upstream.
- Bill text via the Statutes at Large: statutes.py for the GPO STATUTE collection. This extracts bill information from GPO's Statutes at Large collection, which covers enacted bills and agreed-to concurrent resolutions from 1951 to the present. Since there is better bill data starting in 1993, this scraper should only be used up to 1992. (The scraper also extracts bill metadata. As noted on the bill documentation page, better bill metadata comes from THOMAS starting in 1973.)
- Cato Institute Deepbills Project XML: deepbills.py for the Cato Institute Deepbills Project, which creates better XML for bills starting with the 113th Congress (2013).
The fdsys.py scraper provides infrastructure for various GPO FDSys-fetching tasks, including fetching bill text.
FDSys is organized by collection (bills, committee reports, etc.) and year.
To download all bill text for the 112th Congress, run:
./run fdsys --collections=BILLS --congress=112 --store=pdf,mods,xml,text
Here we're requesting the PDF, MODS, XML, and plain text versions. You'll get them in the file layout as described above.
Running the command again will smartly update changed files by scanning through FDSys's sitemaps for changed sitemaps and changed files.
All arguments are optional. Without --store, the script just updates a local copy of the sitemap files in cache/fdsys/sitemap/YEAR/COLLECTION.xml. Use --cached to force the use of cached files and not hit the network except when a file does not exist. Use --force to download all files anew.
This scraper can be used to smartly fetch any collection in FDSys. The stored files for other collections (besides bills) are organized in a more generic way: in data/fdsys/COLLECTION/YEAR/PKGID. The PKGID is the package identifier for the file on FDSys. For instance:
./run fdsys --collections=STATUTE --year=1982 --store=mods
data/fdsys/STATUTE/1982/STATUTE-96/mods.xml
The collections argument can take a comma-separated list of collections. To get a list of collection names, use:
./run fdsys --list-collections
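For scripted use, the options above can be assembled programmatically. This is only a sketch: fdsys_command and generic_package_dir are hypothetical helpers, they use only the flags and the data/fdsys/COLLECTION/YEAR/PKGID layout mentioned in this section, and they do not run the scraper themselves.

```python
from pathlib import Path


def fdsys_command(collections=None, congress=None, year=None, store=None,
                  cached=False, force=False):
    """Build an argument list for ./run fdsys from the documented options."""
    args = ["./run", "fdsys"]
    if collections:
        # --collections takes a comma-separated list of collection names.
        args.append("--collections=" + ",".join(collections))
    if congress is not None:
        args.append(f"--congress={congress}")
    if year is not None:
        args.append(f"--year={year}")
    if store:
        args.append("--store=" + ",".join(store))
    if cached:
        args.append("--cached")
    if force:
        args.append("--force")
    return args


def generic_package_dir(collection, year, pkgid, data_root="data"):
    """Where files for non-bill collections are stored, per the layout above."""
    return Path(data_root) / "fdsys" / collection / str(year) / pkgid


cmd = fdsys_command(["BILLS"], congress=112, store=["pdf", "mods", "xml", "text"])
print(" ".join(cmd))
print((generic_package_dir("STATUTE", 1982, "STATUTE-96") / "mods.xml").as_posix())
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the project's root directory, where the run script lives.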
A separate scraper called bill_versions.py will download just the mods.xml files (if they aren't already downloaded) and will create just the data.json files:
./run bill_versions --congress=112
data/112/bills/hjres/hjres6/text-versions/ih/mods.xml
data/112/bills/hjres/hjres6/text-versions/ih/data.json
Use bill_versions if you only need the MODS and JSON files, or to update the JSON files in bulk if our data format changes.
statutes.py
TBD
deepbills.py
TBD