-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pandas performance & automated release #144
Conversation
Codecov ReportPatch and project coverage have no change.
Additional details and impacted files@@ Coverage Diff @@
## main #144 +/- ##
=======================================
Coverage 93.09% 93.09%
=======================================
Files 8 8
Lines 594 594
=======================================
Hits 553 553
Misses 41 41
☔ View full report in Codecov by Sentry. |
@@ -84,9 +84,11 @@ def _dedupe_newer_report_wins(df: pd.DataFrame, primary_key: list[str]) -> pd.Da | |||
old_index = df.index.names | |||
return ( | |||
df.reset_index() | |||
.sort_values("report_date") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can sort_values()
before doing the groupby because according to the groupby docs "Groupby preserves the order of rows within each group."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potentially non-deterministic outcome here is that if there are duplicate values of report_date
within a group then which row of data ends up being "last" might not always be the same. Do we think there can ever be multiple filings within a group that have the same date?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
For some reason, I made this issue in the PUDL repo instead: catalyst-cooperative/pudl#2822 but the short of it is "we need to use metadata from rssfeed
to sort them better."
.groupby(unique_cols) | ||
.apply(lambda df: df.sort_values("report_date").ffill().iloc[-1]) | ||
.last() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
groupby.last()
selects the last non-null value from each column which I think is the intention of ffill().iloc[-1]
-- since that will forward fill the last non-null value within the group, and then select the last element, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect! That's the exact intent & also this conveys what's going on way better!
####################################################################################### | ||
# Software Package Build & Release | ||
####################################################################################### | ||
[testenv:build] | ||
description = Prepare Python source and binary packages for release. | ||
basepython = python3 | ||
skip_install = false | ||
extras = | ||
dev | ||
commands = | ||
bash -c 'rm -rf build/* dist/*' | ||
python -m build | ||
|
||
[testenv:testrelease] | ||
description = Do a dry run of Python package release using the PyPI test server. | ||
basepython = python3 | ||
skip_install = false | ||
extras = | ||
dev | ||
commands = | ||
{[testenv:build]commands} | ||
twine check dist/* | ||
twine upload --verbose --repository testpypi --skip-existing dist/* | ||
|
||
[testenv:release] | ||
description = Release the package to the production PyPI server. | ||
basepython = python3 | ||
skip_install = false | ||
extras = | ||
dev | ||
commands = | ||
{[testenv:build]commands} | ||
twine check dist/* | ||
twine upload --verbose --skip-existing dist/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed because it takes place in the release
workflow now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wooohoo!
@@ -19,7 +19,7 @@ def test_extractor_scripts(script_runner, ep): | |||
|
|||
The script_runner fixture is provided by the pytest-console-scripts plugin. | |||
""" | |||
ret = script_runner.run(ep, "--help", print_result=False) | |||
ret = script_runner.run([ep, "--help"], print_result=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a switch from deprecated syntax.
v
is pushed.