-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect number of trailing commas when last field(s) are empty #24
Comments
Hey, thanks for opening an issue! At the time I wrote the bulk of the code, this was the intended behavior. There are some errant filings with incorrect numbers of columns per line. To make the code as accurate as possible, I made it just output as many fields as are observed in the filing (and if there were more columns than headers, the remaining would just be printed for completeness). This quite literally reflects what is actually in the filing itself (and could be useful if you care to, say, detect filings with incorrect numbers of columns). If you rerun FastFEC with the Whether this is the correct behavior may warrant rethinking. Does having fewer columns break downstream tooling for you? Do you think it's more accurate to put in trailing commas for missing fields (in this case, what should be done for extra fields in a given row)? Maybe we could expose an option to pad missing fields with commas, or make this the default behavior. I'm curious to hear your/others' thoughts! |
I'd vote for outputting the same number of commas as the header, because I suspect some import processes will choke otherwise. I vaguely remember participating in the discussion Dylan's talking about. WinRed seems to output one more separator than you'd expect, for instance. But I feel like a consistent number of commas would be the correct normalization. |
Hey @freedmand, thanks for the detailed response! I would also advocate for padding out to the correct number of commas regardless of the contents of the fecfile. I discovered this when attempting to use the |
Great, that makes sense! In the case of too many fields in a given row (i.e. more fields than header columns), do you think just truncating would make sense Andrew/Chris? We could make that the default and then expose a raw option to preserve the source filing as closely as possible |
I'd maybe draw a distinction between empty excess fields (like WinRed) and non-empty. Empty you can eat no problem. Non-empty typically means something has gone wrong and it might be useful to output those or error just so we catch it. I could see an argument for truncation regardless, though, but even then it should warn. |
So, essentially:
|
That sounds great. I think having, by default, a set of CSVs that have a guaranteed number fields on all rows would be best here, though I agree it makes sense to have a "raw" option as well as warnings where helpful. |
Fixes washingtonpost#24 Before, we printed exactly what was in the .fec file. If a row had more fields or fewer fields than we expected from the schema, we just printed it as is. This broke downstream tooling, such as loading the .csv's into databases. Now, by default we - pad short rows with empty fields - truncate long rows You can get back to the old behavior by setting the `raw` flag to True. By default it is False. Note that this could be BREAKING. This also adjusts the warnings a bit: BEfore, you got a warning for every extra field in a row. Now you only get one warning per row, and we print out the full row (even though that row is a bit mangled by the csv parser as it removes quotes and delimiters) The tests only currently test the default behavior. A follow up should adjust how we define test cases. Currently, we expect a 1:1 correspondence between an input .fec file and an output. But really we want a 1:N relationship, where one .fec file can generate multiple outputs depending on the options passed. That will require updating our test definition format.
I have this fixed in #58. Please @chriszs @afischer take a look and give @freedmand some help. |
FYI, in the new version of DuckDB's CSV parser, they added an option that makes the parser more resilient to messy CSV data such as this. I am able to parse these CSVs with their missing trailing commas by using the
So this issue is a lot less pressing for me now. |
FastFEC export seems to be missing a trailing comma in lines that have one or more empty items at the end of a row.
Using homebrew version of fastfec on a M1 MacBook Pro running macOS Montery 12.4.
For example, you can reproduce this by running
fastfec 876050 fastfec_output/
and checking the header.csv (should be an additional trailing comma afterreport_number
002), SB28A.csv (42 fields in line items vs 43 in header) or SB23.csv (43 fields in line items vs 44 in header)header.csv:
The text was updated successfully, but these errors were encountered: