Only parse Schedule A itemizations #45

NickCrews · 2022-08-16T03:55:58Z

Hi! Thanks for this great utility.

I only care about the Schedule A itemizations. In some multi-gigabyte .fec files, the non-Schedule A entries can take up more than half of the file, and so really slow down parsing.

Can we add some options to only parse particular itemizations?

In the meantime, I do this. Do you see any problems with it? For example, are Schedule A itemizations always going to come before other schedules?

# filter_fec.sh

# We only want the individual contributions from an FEC file. We don't want
# the other itemizations; they can be gigabytes and slow down parsing.

# From the FEC file format documentation:

# The first record of every electronic file that is submitted to the FEC must be an
# HDR record that precedes the main body of the ASCII CSV (comma separated values) data.
# The second record will be a "cover" record for the particular filing, (for example,
# an F3 or an F3X record for a FEC-3 or FEC-3X electronic report). An unlimited number
# of Schedule records (examples: SA, SB, SC/ ...) can follow the first two records of
# an FEC electronic report file. (Electronic files are usually assigned the file
# suffix ".fec".)

# So as soon as we see a line starting with "SB", "SC", or "SD", we stop.
# From https://stackoverflow.com/a/8940829/5156887
awk '/^(SB|SC|SD)/ { exit } { print }'

and use it as curl https://docquery.fec.gov/dcdev/posted/13360.fec | filter_fec.sh | fastfec 13360

freedmand · 2022-08-21T15:12:41Z

Hi @NickCrews, thanks for the question. I agree this is an important and useful feature to add. I'll see how easy it is to add a flag to pass a regex form filter. Would something like --form-filter make sense as a flag name?

freedmand · 2022-08-21T15:13:36Z

> In the meantime, I do this, do you see any problems with it? Like are schedule A itemizations always going to come before other schedules?

I think that will mostly work but I have observed out-of-order forms in the past (very rare). @chriszs may have more insight

chriszs · 2022-08-21T19:01:37Z

Dylan's correct. Order is not guaranteed, though files are often ordered that way. For multi-gigabyte files, the limiting factor tends to be download speed. Filtering form types in FastFEC would speed up parsing, but it wouldn't bail halfway through the download as this does, so it wouldn't have much of an impact on the overall time. Speeding up the download using aria2c -x 4 and then filtering to ^SA might be safer and more effective.
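
For illustration, here is a minimal sketch of that download-then-filter approach. It does not stop at the first non-SA line, so it does not depend on record order. The function name keep_schedule_a is hypothetical, and the sketch assumes the HDR and cover records are the first two lines of the file (as in the format description quoted above) and should be kept for FastFEC to parse.

```shell
#!/bin/sh
# Order-independent Schedule A filter (a sketch, not part of FastFEC).
# Assumption: lines 1 and 2 are the HDR and cover records; keep them,
# plus every line whose form type starts with "SA".
keep_schedule_a() {
    awk 'NR <= 2 || /^SA/'
}

# Hypothetical usage (not run here): download the whole file first,
# using up to 4 connections, then filter and parse it.
#   aria2c -x 4 https://docquery.fec.gov/dcdev/posted/13360.fec
#   keep_schedule_a < 13360.fec | fastfec 13360
```

Unlike the early-exit script, this reads the whole file, so any time saved comes from parsing less, not from downloading less.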

NickCrews · 2022-08-21T23:43:09Z

Thank you for the responses. That makes sense that we can't rely on order, darn. And I can see how, if we need to download the whole file anyway, skipping parsing won't gain much speed. I guess it would save some disk space. So this isn't a must-have for me; if you aren't interested in supporting it, I wouldn't be heartbroken.

I would say that I might prefer explicit table names instead of a regex, since there aren't that many options. (Unless I'm wrong and there are a lot?)
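
As a rough sketch of what filtering by explicit names could look like outside FastFEC, the hypothetical helper below takes schedule prefixes as arguments and does plain prefix matching instead of a regex. It is not an existing FastFEC option, and it assumes the first two lines are the HDR and cover records.

```shell
#!/bin/sh
# Hypothetical helper: keep the first two records plus any line that
# begins with one of the schedule names given as arguments (e.g. SA SB).
filter_schedules() {
    awk -v names="$*" '
        BEGIN { n = split(names, wanted, " ") }   # explicit list of prefixes
        NR <= 2 { print; next }                   # assumed HDR and cover records
        {
            for (i = 1; i <= n; i++) {
                # index(...) == 1 means the line starts with this prefix
                if (index($0, wanted[i]) == 1) { print; next }
            }
        }
    '
}

# Hypothetical usage (not run here):
#   filter_schedules SA < 13360.fec | fastfec 13360
```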

freedmand added the enhancement label Aug 21, 2022
