Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDEP-15: Reject PDEP-10 #58623
base: main
Are you sure you want to change the base?
PDEP-15: Reject PDEP-10 #58623
Changes from 2 commits
98eb85a
5e451db
2af5632
6e4efe5
45754bf
7833637
1ccca56
1b3bdee
e52e2e7
fef0c92
e5de753
c159851
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we could learn from this process is what caused this to change our minds? These issues were discussed leading up to the acceptance of PDEP-10.
The way this is written I think reads more as "we discovered this after the fact" instead of "we decided that X amount of negative feedback on these points was enough to revert". I think there is some value to future PDEPs to set expectations around the latter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Within the context of recent conversation I don't think this comment about AWS is true. AWS distributes an official pandas image for lambda which already includes pyarrow, pandas, and NumPy. This is all required by their own "AWS SDK on pandas" library.
The issue more finely scoped I think is that the default wheel installation via pip into a lambda image exceeds the 256 MB limit. Either using the official AWS provided image or using miniconda should not exceed the space limits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally don't find this point very convincing. Saying
Many of the benefits
but then following it up with one bullet point seems to miss the mark - what are the other many benefits that we don't need pyarrow for? Without pyarrow users are forgoing:On the larger roadmap of pandas this moves us away from tighter Arrow integration, which means we move further away from Arrow compute algorithms / joins and the larger ecosystem of tools that includes streaming, query optimizers, planners, data engines, etc...
I think this argument in its current form is saying "we don't need a car because we have a horse and buggy"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PDEP-10, there are 3 benefits listed
So, IMO, this argument is accurate, in that most of the benefits in PDEP-10 can be made possible (for those user that have pyarrow installed) without making pyarrow required.
The future benefits of Arrow are very compelling, but decisions on making a dependency required should be based on immediate and not future benefits. Like I said before, it is easy to reconsider this decision in a years time if those future benefits are materialize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you think points 1 and 3 are possible without pyarrow then the alternatives for that should be laid out in this PDEP, at least at a super high level. I'm assuming point 1 refers to the nanoarrow POC I was sharing; point 3 requires reimplementing the conversions that pyarrow already has. (I personally don't think building either of those from scratch is a good long term solution but it can at least be discussed)
For point 2 how do you know those are niche applications? Its easy to dismiss things that don't exist today as not worthwhile, but I get the feeling that there could be plenty of use cases for the aggregate types, since they have a natural fit with many of the Python containers.
On interoperability the long term prospects for the dataframe interchange protocol seem dubious, and we have even discussed moving that out of pandas (see #56732).
The Arrow interchange protocol can be used by any library that needs to work with Arrow data - there is no limit to it being used by other dataframe libraries. It provides a standardized API so that third parties don't need to hack into our internals, which is a direct benefit for us. It also works in two directions - we can be a consumer just as much as a producer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also wanted to point out that arrow has a decimal128 and decimal256 type which is especially useful for financial calculations where floating point inaccuracies cannot be tolerated, and the arrow decimal types are an extremely significant improvement over using object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will update and add a note in the PDEP when I get time again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PDEP 14 does not change performance or memory savings if you do not have pyarrow installed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a note in parentheses at the end of that sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you push this up? I don't see anything in parentheses.
The way I am interpreting this now is "we don't need/care for pyarrow strings because we have always had a string data type using Python strings" - is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the PDEP-15 text, and forgot to remove the PDEP-10 changes.
I've removed the PDEP-10 changes now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this really exaggerates the problem on AWS. AWS has long distributed its own AWS SDK for pandas library (formerly called awswrangler) which uses pyarrow to better integrate with many of its services (ex: pyarrow is used for high performance data exports to/from AWS Redshift, rather than using a traditional ODBC driver)
The issue here is really just scoped to users that don't want to use the AWS Lambda Managed Layer, but instead want to build their environment from scratch, assumedly without the AWS SDK for pandas. Even then, it may not be a current issue given the drastic reductions in the binary size of pyarrow through both
pip andcondaThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's only conda that's been drastically reduced - install pyarrow and pandas in a fresh venv and it already hits 295 MB. And the wheel size on PyPI is still ~40MB, so we'd be noticeably increasing the load on PyPI by making PyArrow required
My current stance is: if the desire was to make PyArrow dtypes the default for all dtypes, then ok, maybe that'd be justified. But probably not if it's just for the sake of strings, for which I think that
pip install pandas[pyarrow]
in all instructionsshould be enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
I'll add this to the PDEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity - is this something that the PyPi maintainers have mentioned as a problem?
If so, the plot twist is then that we should really be pushing AWS users towards using the pre-provided image, rather than building their own from scratch. I believe that would forgo hitting PyPi altogether, and even if it doesn't, it is still smaller than what people are building themselves (see #54466 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we then be asking our users to be dependent on the AWS people updating their pre-provided image whenever we created a new release of
pandas
or the arrow team did a new release ofpyarrow
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I think that is generally how lambda works, even with the overall Python version. It's not quite an "anything goes" type of execution environment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's this post from 2021, in which they write about the monthly PyPI bill being almost 2 million US dollars
https://dustingram.com/articles/2021/04/14/powering-the-python-package-index-in-2021/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too worried about this - PyPI's bill is subsidized anyways. Of course, it is nice to reduce our load on PyPI, but I don't think we are close to being the worse offenders here (those would probably be something like tensorflow and pytorch), and it's important to keep in mind that a lot of people have pyarrow installed for whatever reason anyways.
I would mostly only be concerned about an increase in size in our own pandas package (since PyPI does limit the total size of all packages uploaded by a project, and raising the limit is a manual and annoying process)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made another go at clarifying point1, incorporating the feedback here.
PTAL @WillAyd @MarcoGorelli
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, but I still disagree with calling the AWS-provided layer a "workaround" - essentially it is the canonical approach to solve an issue that has existed for years with pandas and lambda functions.
A quick google search for something how "how to run pandas on aws lambda" yields a slew of conflicting results on how to get this to work. If the AWS-provided layer is a workaround, then what are we calling the proper approach?
https://stackoverflow.com/questions/36054976/pandas-aws-lambda
https://stackoverflow.com/questions/53824556/how-to-install-numpy-and-pandas-for-aws-lambdas
https://medium.com/swlh/how-to-add-python-pandas-layer-to-aws-lambda-bab5ea7ced4f
https://medium.com/@johnnymao/how-to-use-pandas-in-your-aws-lambda-function-c3ce29f6f189
https://medium.com/@shimo164/lambda-layer-to-use-numpy-and-pandas-in-aws-lambda-function-8a0e040faa18
https://www.youtube.com/watch?v=1UDEp90S9h8