Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add spark-connect tests suite and run existing tests with SC #241

Closed
1 of 3 tasks
SemyonSinchenko opened this issue Jul 12, 2024 · 22 comments
Closed
1 of 3 tasks
Assignees
Labels
1.0 Quinn 1.0 release CI CI/CD
Milestone

Comments

@SemyonSinchenko
Copy link
Collaborator

Feature Type

  • Adding new functionality to quinn

  • Changing existing functionality in quinn

  • Removing existing functionality in quinn

Problem Description

We need to test quinn against both connect and classic.

Feature Description

.

Additional Context

This one may be used as an example: https://github.com/pyspark-ai/pyspark-ai/blob/master/run_spark_connect.sh

@SemyonSinchenko SemyonSinchenko added 1.0 Quinn 1.0 release CI CI/CD labels Jul 12, 2024
@SemyonSinchenko SemyonSinchenko self-assigned this Jul 14, 2024
@nijanthanvijayakumar
Copy link
Contributor

nijanthanvijayakumar commented Jul 14, 2024

Hi @SemyonSinchenko - are you okay if I give this one a try?

Would need some guidance, but would like give a try of the first draft myself.

@SemyonSinchenko
Copy link
Collaborator Author

@nijanthanvijayakumar This one is little tricky. It requires to download and run Spark-Connect server before running tests and also use SparkSession.remote(...). You may check how I did it for pyspark-ai:

  1. https://github.com/pyspark-ai/pyspark-ai/blob/master/run_spark_connect.sh
  2. https://github.com/pyspark-ai/pyspark-ai/blob/master/.github/workflows/build_and_test.yml#L33
  3. https://github.com/pyspark-ai/pyspark-ai/blob/master/tests/test_spark_connect.py

@SemyonSinchenko
Copy link
Collaborator Author

Feel free to ask me if you face any problem with it.

@nijanthanvijayakumar
Copy link
Contributor

Feel free to ask me if you face any problem with it.

Thank you; will definitely do. A few initial questions:

  1. Does the version have to be locked to v3.5.1?
  2. Also does all of the functions have to be tested with Spark-Connect?

@SemyonSinchenko
Copy link
Collaborator Author

Feel free to ask me if you face any problem with it.

Thank you; will definitely do. A few initial questions:

  1. Does the version have to be locked to v3.5.1?
  2. Also does all of the functions have to be tested with Spark-Connect?
  1. I would like to take a version from environment variable
  2. We do need to cover all the cases but it could be cool to cover each function at least once

BUT! To avoid very big PRs that are hard to review, please focus at first on CI/CD and suite. You may create a single test for that.

@MrPowers MrPowers added this to the 1.0 release milestone Jul 14, 2024
@nijanthanvijayakumar

This comment was marked as resolved.

@nijanthanvijayakumar
Copy link
Contributor

nijanthanvijayakumar commented Jul 15, 2024

#241 (comment)

@SemyonSinchenko - an update on the above comment

  1. I was able to fix that by using pyarrow=1.16.x. I have updated the pyproject.toml with the latest list of libraries.
  2. I was able to get this spark-connect test working in CI 🥳. Here's the link to the GitHub Actions on my fork.

I will raise a PR for you to review. Please let me know your comments on that.

@SemyonSinchenko
Copy link
Collaborator Author

@nijanthanvijayakumar I think you may continue to work in this issue. If you can, try to run all the existing tests with Spark Connect. Feel free to ask me, if you face any issue.

@SemyonSinchenko SemyonSinchenko changed the title feat: Add spark-connect tests suite feat: Add spark-connect tests suite and run existing tests with SC Jul 15, 2024
@nijanthanvijayakumar
Copy link
Contributor

Thank you @SemyonSinchenko . Will work on integrating the test cases and create a new PR.

@nijanthanvijayakumar
Copy link
Contributor

@SemyonSinchenko - sorry I have been away from my computer for a while. I was able to fix the column_to_list function under the quinn/dataframe_helpers.py, which was one of the tests that were failing.

However, there are a couple of other test cases/functions that were failing too. Following are those:

  • test_array_choice
  • test_sort_struct_nested_with_arraytypes_nullable
| =========================== short test summary info ============================
| FAILED tests/test_functions.py::test_array_choice - TypeError: Unsupported Data Type Column
| FAILED tests/test_transformations.py::test_sort_struct_nested_with_arraytypes_nullable - chispa.schema_comparer.SchemasNotEqualError: 

@SemyonSinchenko
Copy link
Collaborator Author

@nijanthanvijayakumar Thank you! May you post a more detailed stacktrace? Or even run tests in CI in your fork an give me a link to the logs?

@nijanthanvijayakumar
Copy link
Contributor

nijanthanvijayakumar commented Aug 4, 2024

@SemyonSinchenko
Copy link
Collaborator Author

@nijanthanvijayakumar It seems to me that you found an interesting behavior. Let's pause it for now.

@nijanthanvijayakumar
Copy link
Contributor

Thank you @SemyonSinchenko; tried fixing this for hours and couldn't. Keen to know/learn more about the behaviour here.

@SemyonSinchenko
Copy link
Collaborator Author

It seems to me that is a bug in PySpark itself.
@nijanthanvijayakumar Are you in 'mrpowers' slack channel, related to this project? If so, may you say ur nickname

@nijanthanvijayakumar
Copy link
Contributor

nijanthanvijayakumar commented Aug 4, 2024

It seems to me that is a bug in PySpark itself.

@nijanthanvijayakumar Are you in 'mrpowers' slack channel, related to this project? If so, may you say ur nickname

@SemyonSinchenko - good to know. I'm not in that channel, but would love to be part of it. Thanks.

@SemyonSinchenko
Copy link
Collaborator Author

@nijanthanvijayakumar Give me your email, please. I will ask @MrPowers to contact you

@nijanthanvijayakumar
Copy link
Contributor

Hello @SemyonSinchenko . Hope you got my email address (noticed your thumbs up earlier and I assumed you had got it)? I have removed it from this chat for now, just for privacy reasons.

@SemyonSinchenko
Copy link
Collaborator Author

@nijanthanvijayakumar FYI apache/spark@536445c

May you mark this function as "non-working" if the version of spark is less than 4.0 (or less than 3.5.2 for 3.5-branch) and throw an exception, like "This function is not working in Connect environments for spark less than 3.5.2"?

@nijanthanvijayakumar
Copy link
Contributor

@nijanthanvijayakumar FYI apache/spark@536445c

May you mark this function as "non-working" if the version of spark is less than 4.0 (or less than 3.5.2 for 3.5-branch) and throw an exception, like "This function is not working in Connect environments for spark less than 3.5.2"?

Hi @SemyonSinchenko - if my understanding is right, do you want me to do the following?

  1. In the link you had shared, comment This function is not working for Spark-Connect v3.5.2 and less and,
  2. Update the test_array_choice and test_sort_struct_nested_with_arraytypes_nullable test cases to throw an exception saying This function is not working on Spark-Connect environments for Spark version less than 3.5.2?

@SemyonSinchenko
Copy link
Collaborator Author

I mean you should mark functions in chispa as non-workin in connect. These functions (array_choice, etc.) should trow an exception if u are trying to call them in Connect env from spark version less than 3.5.2

@nijanthanvijayakumar
Copy link
Contributor

nijanthanvijayakumar commented Aug 6, 2024

@nijanthanvijayakumar FYI apache/spark@536445c

May you mark this function as "non-working" if the version of spark is less than 4.0 (or less than 3.5.2 for 3.5-branch) and throw an exception, like "This function is not working in Connect environments for spark less than 3.5.2"?

Thanks for the guidance @SemyonSinchenko . I have updated the test cases and the actual function implementation accordingly. Here's the link to the PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.0 Quinn 1.0 release CI CI/CD
Projects
None yet
Development

No branches or pull requests

3 participants