Incorrect number of trailing commas when last field(s) are empty #24

afischer · 2022-07-12T22:58:16Z

FastFEC export seems to be missing a trailing comma in lines that have one or more empty items at the end of a row.

Using the Homebrew version of fastfec on an M1 MacBook Pro running macOS Monterey 12.4.

For example, you can reproduce this by running fastfec 876050 fastfec_output/ and checking header.csv (there should be an additional trailing comma after report_number 002), SB28A.csv (42 fields in line items vs. 43 in the header), or SB23.csv (43 fields in line items vs. 44 in the header).

header.csv:

record_type,ef_type,fec_version,soft_name,soft_ver,report_id,report_number,comment
HDR,FEC,8.0,Microsoft Navision 3.60 - AVF Consulting,1.00,FEC-840327,002
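One way to confirm the mismatch is to compare each row's field count against the header. The following is a minimal sketch using only Python's standard `csv` module; the function name `field_count_mismatches` is made up for illustration and is not part of FastFEC.

```python
import csv
import io
from collections import Counter


def field_count_mismatches(csv_text):
    """Return (header_width, Counter of row widths that differ from the header)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Tally only the rows whose width differs from the header's width.
    mismatches = Counter(
        len(row) for row in reader if len(row) != len(header)
    )
    return len(header), mismatches


# The header.csv contents quoted above.
sample = (
    "record_type,ef_type,fec_version,soft_name,soft_ver,"
    "report_id,report_number,comment\n"
    "HDR,FEC,8.0,Microsoft Navision 3.60 - AVF Consulting,1.00,FEC-840327,002\n"
)

width, mismatches = field_count_mismatches(sample)
print(width, dict(mismatches))  # prints: 8 {7: 1}
```

For the sample above, the header has 8 fields and the single data row has 7, which is the missing trailing comma described in this report.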


freedmand · 2022-07-18T07:24:25Z

Hey, thanks for opening an issue! At the time I wrote the bulk of the code, this was the intended behavior. There are some errant filings with incorrect numbers of columns per line. To make the code as accurate as possible, I made it just output as many fields as are observed in the filing (and if there were more columns than headers, the remaining would just be printed for completeness). This quite literally reflects what is actually in the filing itself (and could be useful if you care to, say, detect filings with incorrect numbers of columns).

If you rerun FastFEC with the --warn parameter, e.g. fastfec --warn 876050 fastfec_output/, it will show many warning messages corresponding to these column mismatches.

Whether this is the correct behavior may warrant rethinking. Does having fewer columns break downstream tooling for you? Do you think it's more accurate to put in trailing commas for missing fields (in this case, what should be done for extra fields in a given row)? Maybe we could expose an option to pad missing fields with commas, or make this the default behavior. I'm curious to hear your/others' thoughts!

chriszs · 2022-07-18T17:18:27Z

I'd vote for outputting the same number of commas as the header, because I suspect some import processes will choke otherwise. I vaguely remember participating in the discussion Dylan's talking about. WinRed seems to output one more separator than you'd expect, for instance. But I feel like a consistent number of commas would be the correct normalization.

afischer · 2022-07-18T17:24:50Z

Hey @freedmand, thanks for the detailed response! I would also advocate for padding out to the correct number of commas regardless of the contents of the fecfile. I discovered this when attempting to use the COPY / FROM postgres command to easily pull a form into a database.

freedmand · 2022-07-18T17:29:34Z

Great, that makes sense! In the case of too many fields in a given row (i.e. more fields than header columns), do you think just truncating would make sense, Andrew/Chris? We could make that the default and then expose a raw option to preserve the source filing as closely as possible.

chriszs · 2022-07-18T17:32:05Z

I'd maybe draw a distinction between empty excess fields (like WinRed) and non-empty. Empty you can eat no problem. Non-empty typically means something has gone wrong and it might be useful to output those or error just so we catch it. I could see an argument for truncation regardless, though, but even then it should warn.

freedmand · 2022-07-18T17:42:01Z

So, essentially:

- by default, always output exactly as many fields as headers
- don't warn at all if a row has too few fields, or has too many fields but the extra ones are empty
- warn by default otherwise; have a silent option to suppress warnings
- have a non-default raw option to output exactly what's in the filing even if it doesn't match the number of header cols
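The rules above could be sketched roughly as follows. This is a hypothetical Python illustration, not FastFEC's actual implementation: the `normalize_row` function and its parameters are assumptions made for the example, and the silent option is left out for brevity.

```python
import warnings


def normalize_row(fields, n_header, raw=False):
    """Pad or truncate a row to n_header fields (hypothetical helper).

    With raw=True the row is returned exactly as it appeared in the filing.
    """
    if raw:
        return list(fields)

    # Warn only when truncation would drop actual data; empty extras
    # and short rows are normalized silently.
    extra = fields[n_header:]
    if any(value != "" for value in extra):
        warnings.warn(f"dropping non-empty extra fields: {extra!r}")

    padded = list(fields[:n_header])
    padded += [""] * (n_header - len(padded))  # no-op if already full width
    return padded


print(normalize_row(["HDR", "FEC"], 3))            # short row is padded
print(normalize_row(["HDR", "FEC", "8.0", ""], 3))  # empty extra is dropped
print(normalize_row(["HDR", "FEC"], 3, raw=True))  # raw row is untouched
```

Every row returned with `raw=False` has exactly `n_header` fields, which is the guarantee that strict importers need.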

afischer · 2022-07-27T13:28:56Z

That sounds great. I think having, by default, a set of CSVs that have a guaranteed number of fields on all rows would be best here, though I agree it makes sense to have a "raw" option as well as warnings where helpful.

Fixes washingtonpost#24

Before, we printed exactly what was in the .fec file. If a row had more or fewer fields than we expected from the schema, we just printed it as is. This broke downstream tooling, such as loading the .csv files into databases.

Now, by default we
- pad short rows with empty fields
- truncate long rows

You can get back to the old behavior by setting the `raw` flag to True. By default it is False. Note that this could be BREAKING.

This also adjusts the warnings a bit: Before, you got a warning for every extra field in a row. Now you only get one warning per row, and we print out the full row (even though that row is a bit mangled by the csv parser, as it removes quotes and delimiters).

The tests currently only test the default behavior. A follow-up should adjust how we define test cases. Currently, we expect a 1:1 correspondence between an input .fec file and an output. But really we want a 1:N relationship, where one .fec file can generate multiple outputs depending on the options passed. That will require updating our test definition format.

NickCrews · 2023-04-13T19:31:12Z

I have this fixed in #58. Please @chriszs @afischer take a look and give @freedmand some help.

NickCrews · 2023-11-25T22:27:04Z

FYI, in the new version of DuckDB's CSV parser, they added an option that makes the parser more resilient to messy CSV data such as this. I am able to parse these CSVs with their missing trailing commas by using the null_padding option:

import duckdb

# p is the path to one of the FastFEC output CSVs
conn = duckdb.connect()
conn.read_csv(p, all_varchar=True, quotechar='"', null_padding=True)

So this issue is a lot less pressing for me now.

freedmand added the enhancement label Aug 21, 2022

NickCrews mentioned this issue Dec 16, 2022

BUG: (maybe?) Missing trailing commas from output #47

Closed

NickCrews mentioned this issue Apr 13, 2023

Refactor many things, fix extra/missing commas #58

Open
