PDEP-15: Reject PDEP-10 #58623
Open
lithomas1 wants to merge 12 commits into pandas-dev:main from lithomas1:reject-pdep10
+83
−0
Commits
- 98eb85a PDEP-10: Change status to rejected (lithomas1)
- 5e451db Split out into new pdep (lithomas1)
- 2af5632 remove pdep-10 changes (lithomas1)
- 6e4efe5 Apply suggestions from code review (lithomas1)
- 45754bf Apply suggestions from code review (lithomas1)
- 7833637 update a little (lithomas1)
- 1ccca56 Merge branch 'reject-pdep10' of github.com:lithomas1/pandas into reje… (lithomas1)
- 1b3bdee minor update (lithomas1)
- e52e2e7 Merge branch 'main' of github.com:pandas-dev/pandas into reject-pdep10 (lithomas1)
- fef0c92 Merge branch 'main' into reject-pdep10 (lithomas1)
- e5de753 small updates (lithomas1)
- c159851 Merge branch 'reject-pdep10' of github.com:lithomas1/pandas into reje… (lithomas1)
@@ -0,0 +1,46 @@
# PDEP-15: Do not require PyArrow as a required dependency (for pandas 3.0)

- Created: 8 May 2024
- Status: Under Discussion
- Discussion: [#58623](https://github.com/pandas-dev/pandas/pull/58623)
              [#52711](https://github.com/pandas-dev/pandas/pull/52711)
              [#52509](https://github.com/pandas-dev/pandas/issues/52509)
              [#54466](https://github.com/pandas-dev/pandas/issues/54466)
- Author: [Thomas Li](https://github.com/lithomas1)
- Revision: 1

## Abstract

This PDEP supersedes PDEP-10, which stipulated that PyArrow should become a required dependency
for pandas 3.0. After reviewing the feedback posted
on the feedback issue [#54466](https://github.com/pandas-dev/pandas/issues/54466), we, the members of
the core team, have decided against moving forward with PDEP-10 for pandas 3.0.

The primary reasons for rejecting PDEP-10 are twofold:

1) Requiring pyarrow as a dependency causes installation problems.
    - Pyarrow does not fit, or has a hard time fitting, in space-constrained environments
      such as AWS Lambda and WASM, due to its large size of around 40 MB for a compiled wheel
      (which is larger than pandas' own wheel sizes).
    - Installation of pyarrow is not possible on some platforms. We provide support for some
      less widely used platforms such as Alpine Linux (and there is third party support for pandas in
      pyodide, a WASM distribution of Python), both of which pyarrow does not provide wheels for.

    While both of these reasons are mentioned in the drawbacks section of PDEP-10, at the time the PDEP
    was written, we underestimated the impact this would have on users and downstream developers.

2) Many of the benefits presented in PDEP-10 can be realized even with pyarrow as an optional dependency.

    For example, as detailed in PDEP-14, it is possible to create a new string data type with the same semantics
    as our current default object string data type, but that allows users to experience faster performance and memory savings
    compared to the object strings (if pyarrow is installed).

While we've decided not to move forward with requiring pyarrow in pandas 3.0, the rejection of PDEP-10
does not mean that we are abandoning pyarrow support and integration in pandas. We, as the core team, still believe
that adopting support for pyarrow arrays and data types in more of pandas will lead to greater interoperability with the
ecosystem and better performance for users. Furthermore, many of the drawbacks, such as the large installation size of pyarrow
and the lack of support for certain platforms, can be solved, and potential solutions have been proposed for them, allowing us
to potentially revisit this decision in the future.

However, at this point in time, it is clear that we are not ready to require pyarrow
as a dependency in pandas.
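Point 2 of the PDEP text depends on pyarrow being detected at runtime instead of being required at install time. The sketch below shows that general pattern using only the standard library. The names `has_module` and `pick_string_storage` are hypothetical and chosen for illustration; they are not pandas APIs, and this is not how pandas or PDEP-14 implement the choice.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module named `name` can be found."""
    return importlib.util.find_spec(name) is not None


def pick_string_storage() -> str:
    """Hypothetical helper: pick a storage backend label for string data.

    Prefers the pyarrow-backed option when pyarrow is importable and
    falls back to plain Python objects otherwise.
    """
    return "pyarrow" if has_module("pyarrow") else "python"


if __name__ == "__main__":
    print("string storage:", pick_string_storage())
```

With this shape, the same code path works whether or not the optional dependency is present, and users who want the faster backend only need to install it.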
I think this really exaggerates the problem on AWS. AWS has long distributed its own AWS SDK for pandas library (formerly called awswrangler) which uses pyarrow to better integrate with many of its services (ex: pyarrow is used for high performance data exports to/from AWS Redshift, rather than using a traditional ODBC driver)
The issue here is really just scoped to users that don't want to use the AWS Lambda Managed Layer, but instead want to build their environment from scratch, assumedly without the AWS SDK for pandas. Even then, it may not be a current issue given the drastic reductions in the binary size of pyarrow through both pip and conda
I think it's only conda that's been drastically reduced - install pyarrow and pandas in a fresh venv and it already hits 295 MB. And the wheel size on PyPI is still ~40MB, so we'd be noticeably increasing the load on PyPI by making PyArrow required
My current stance is: if the desire was to make PyArrow dtypes the default for all dtypes, then ok, maybe that'd be justified. But probably not if it's just for the sake of strings, for which I think that `pip install pandas[pyarrow]` in all instructions should be enough
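Installed-size figures like the ones quoted in this thread can be checked locally. A rough sketch, assuming the package was installed with pip so that its file list is recorded in the distribution metadata; `installed_size_mb` is a hypothetical helper name. It counts only the files owned by the named distribution, so dependencies such as numpy have to be measured separately.

```python
from importlib import metadata
from pathlib import Path
from typing import Optional


def installed_size_mb(dist_name: str) -> Optional[float]:
    """Return the on-disk size in MB of an installed distribution.

    Returns None when the distribution is not installed.
    """
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    total = 0
    # dist.files is None when the installer recorded no file list.
    for entry in dist.files or []:
        path = Path(entry.locate())
        if path.is_file():
            total += path.stat().st_size
    return total / (1024 * 1024)


if __name__ == "__main__":
    for name in ("pandas", "pyarrow", "numpy"):
        print(name, installed_size_mb(name))
```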
Sounds good to me.
I'll add this to the PDEP.
Out of curiosity - is this something that the PyPI maintainers have mentioned as a problem?
If so, the plot twist is then that we should really be pushing AWS users towards using the pre-provided image, rather than building their own from scratch. I believe that would forgo hitting PyPI altogether, and even if it doesn't, it is still smaller than what people are building themselves (see #54466 (comment))
Wouldn't we then be asking our users to be dependent on the AWS people updating their pre-provided image whenever we created a new release of `pandas` or the arrow team did a new release of `pyarrow`?
Yea I think that is generally how lambda works, even with the overall Python version. It's not quite an "anything goes" type of execution environment
There's this post from 2021, in which they write about the monthly PyPI bill being almost 2 million US dollars
https://dustingram.com/articles/2021/04/14/powering-the-python-package-index-in-2021/
I'm not too worried about this - PyPI's bill is subsidized anyway. Of course, it is nice to reduce our load on PyPI, but I don't think we are close to being the worst offenders here (those would probably be something like tensorflow and pytorch), and it's important to keep in mind that a lot of people have pyarrow installed for whatever reason anyway.
I would mostly only be concerned about an increase in size in our own pandas package (since PyPI does limit the total size of all packages uploaded by a project, and raising the limit is a manual and annoying process)
I've made another go at clarifying point 1, incorporating the feedback here.
PTAL @WillAyd @MarcoGorelli
Thanks for the feedback, but I still disagree with calling the AWS-provided layer a "workaround" - essentially it is the canonical approach to solve an issue that has existed for years with pandas and lambda functions.
A quick Google search for something like "how to run pandas on aws lambda" yields a slew of conflicting results on how to get this to work. If the AWS-provided layer is a workaround, then what are we calling the proper approach?
https://stackoverflow.com/questions/36054976/pandas-aws-lambda
https://stackoverflow.com/questions/53824556/how-to-install-numpy-and-pandas-for-aws-lambdas
https://medium.com/swlh/how-to-add-python-pandas-layer-to-aws-lambda-bab5ea7ced4f
https://medium.com/@johnnymao/how-to-use-pandas-in-your-aws-lambda-function-c3ce29f6f189
https://medium.com/@shimo164/lambda-layer-to-use-numpy-and-pandas-in-aws-lambda-function-8a0e040faa18
https://www.youtube.com/watch?v=1UDEp90S9h8