Some birthdays are inconsistent in legislators-historical.yaml
#490
Comments
I just looked at one (Huffman) and it appears that the birthday is correct but the terms aren't - there are term records that are completely bogus (wrong century, wrong type, wrong state - they seem to be duplicates of terms for this other person). There could be a bigger problem here than just the terms identified in your table. |
I see what you mean. For Huffman, the Senate data looks correct but not the HoR.
Another one, William Mayrant, appears to use the resignation date as the birthday: October 21, 1816. The page says "served until his resignation on October 21, 1816". Does the scraper parse the literal text on the page? |
It was a one-time scrape... almost a decade ago. We don't have an established process now to update from bioguide, although there are some scripts that get part of the way. Birthdays should be relatively easy to rescrape. But terms aren't.
So... I'm not sure what the best approach is to fixing this comprehensively. You stumbled on a hard one!
Might be best to start with writing a script that looks for anomalous terms.
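A minimal sketch of what such a script could look like, assuming each legislator's `terms` list has already been loaded from the YAML into plain dicts with ISO-formatted `start` and `end` strings. The function name `find_anomalous_terms` and the 40-year gap threshold are hypothetical choices for illustration, not something the project defines.

```python
from datetime import date

# Hypothetical threshold: a gap this long between consecutive terms
# is treated as suspicious and flagged for manual review.
MAX_GAP_YEARS = 40


def find_anomalous_terms(terms, max_gap_years=MAX_GAP_YEARS):
    """Return (original_index, reason) pairs for suspicious terms.

    `terms` is a list of dicts with ISO 'start' and 'end' strings,
    which is assumed to match the layout of the YAML term records.
    """
    flagged = []
    # Sort by start date but remember each term's position in the file.
    ordered = sorted(
        enumerate(terms),
        key=lambda pair: date.fromisoformat(pair[1]["start"]),
    )
    prev_end = None
    for index, term in ordered:
        start = date.fromisoformat(term["start"])
        end = date.fromisoformat(term["end"])
        if end < start:
            flagged.append((index, "ends before it starts"))
        if prev_end is not None:
            gap_years = (start - prev_end).days / 365.25
            if gap_years > max_gap_years:
                flagged.append(
                    (index, f"starts {gap_years:.0f} years after the previous term ended")
                )
        prev_end = end
    return flagged
```

Run against the four Huffman terms quoted in this thread, this flags the boundary between the 1813-1815 House term and the 1945 Senate term. It only marks where the record stops being plausible; deciding which side of the gap is bogus still needs a human or a second source.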
…On July 6, 2017 7:15:37 PM EDT, Brian Lee Yung Rowe ***@***.***> wrote:
I see what you mean. For Huffman, the Senate data looks correct but not
the HoR.
```
terms:
- type: rep
  start: '1809-05-22'
  end: '1811-03-03'
  state: NJ
  district: -1
  party: Federalist
- type: rep
  start: '1811-11-04'
  end: '1813-03-03'
  state: NJ
  district: -1
  party: Federalist
- type: rep
  start: '1813-05-24'
  end: '1815-03-03'
  state: NJ
  district: -1
  party: Federalist
- type: sen
  start: '1945-01-03'
  end: '1947-01-03'
  state: OH
  class: 1
  party: Democrat
```
Another one, [William
Mayrant](http://bioguide.congress.gov/scripts/biodisplay.pl?index=M000288),
appears to use the resignation date as the birthday: October 21, 1816.
The page says "served until his resignation on October 21, 1816". Does
the scraper parse the literal text on the page?
|
I was going to suggest building a NER model to detect and distinguish between birthdays and terms, but it looks like the salient information can be scraped from the table of search results instead. That would be easy to extend to all congresses. The only limitation is that the birth and death dates from this approach are years rather than full dates. Here's an R snippet:
|
Running that to at least detect big errors (birthday years and term years) would be useful, even if we can't automatically fill in the complete date. (This script should be able to get complete birthdays, or it might be responsible for the error.) |
Here's a script for getting birth dates from Wikidata (and the resulting data), in case we want to use that to help check for errors: https://gist.github.com/bycoffe/3f19b94a35785fd766a29b7454f38018 (This is the script I used to fix a similar issue with death dates/term end dates here in #475) |
I noticed that the birthdays for the following legislators are after their start dates in Congress. This is in `legislators-historical.yaml`. Not sure if it's a parse issue or a source data issue.
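For anyone wanting to reproduce the check, here is a rough sketch of the comparison. It assumes each record has been loaded into a dict with a `bio.birthday` string, an `id.bioguide` string, and a `terms` list with ISO `start` dates; the function name `birthday_problems` is made up for this example, and the age cutoff is only a heuristic (25 is the constitutional minimum age for a Representative).

```python
from datetime import date

# Constitutional minimum age for a Representative; used here only as a
# heuristic cutoff, so hits are candidates for review, not proven errors.
MIN_AGE_YEARS = 25


def birthday_problems(legislators, min_age_years=MIN_AGE_YEARS):
    """Return (bioguide_id, reason) pairs for implausible birthdays.

    The record layout (bio.birthday, id.bioguide, terms[].start) is an
    assumption about how the YAML parses into Python dicts.
    """
    problems = []
    for legislator in legislators:
        birthday = legislator.get("bio", {}).get("birthday")
        terms = legislator.get("terms", [])
        if not birthday or not terms:
            continue  # nothing to compare
        born = date.fromisoformat(birthday)
        first_start = min(date.fromisoformat(t["start"]) for t in terms)
        label = legislator.get("id", {}).get("bioguide", "?")
        if born >= first_start:
            problems.append((label, "birthday on or after first term start"))
        elif (first_start - born).days / 365.25 < min_age_years:
            problems.append((label, "younger than minimum age at first term"))
    return problems
```

The first branch catches the case described in this issue, where the recorded birthday falls after the legislator entered Congress. The second branch is looser and will also surface birthdays that are merely off by a decade or two.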