Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assertColumnEquality reuses assertLargeDatasetEquality #35

Closed
wants to merge 3 commits into from

Conversation

snithish
Copy link
Contributor

@MrPowers to address #25 what can be done is reuse reuse all the logic we have in DatasetComparer as a POC I have converted assertColumnEquality to resue assertLargeDatasetEquality. This make for a clean codebase with optimal code reuse. I understand the ArrayPrettyPrint is the USP, but I believe we can achieve this in other ways.

@snithish snithish requested a review from MrPowers August 25, 2018 04:03
@MrPowers
Copy link
Collaborator

@snithish - The assertLargeDatasetEquality method is a lot slower than assertSmallDataFrameEquality and assertColumnEquality. See this blog post where I talk about how replacing assertLargeDataFrameEquality with assertSmallDataFrameEquality reduced the test suite runtime by 31%.

I am all for code reuse, but the main purpose of this library is to create helper methods for tests that run as fast as possible. Any thoughts on how we can have the best of both worlds? 😎

@snithish
Copy link
Contributor Author

snithish commented Aug 27, 2018

@MrPowers agreed, did you find any performance difference between assertSmallDataFrameEquality and assertColumnEquality? I believe the performance kick comes from the collect we do in both of them, so they should be equally performant, in which case we will have two broad categories of equality assertions, one which performs a collect and one without.

@MrPowers
Copy link
Collaborator

@snithish - I think assertSmallDataFrameEquality and assertColumnEquality are pretty much the same performance wise (don't have any benchmarking to back that up, just my guess).

I agree that we should have two broad categories of equality assertions: the ones that use collect and the ones don't. assertSmallDataFrameEquality and assertColumnEquality should share the same underlying code. assertLargeDataFrameEquality and assertLargeColumnEquality (doesn't exist yet) should share the same underlying code.

Or maybe we should just get rid of assertLargeDataFrameEquality all together... I never use it. I don't know, maybe other folks use it.

Please feel free to tear up the codebase as you see fit 😉 I am always down for improvements. All I care about are tests that run really, really fast and error messages that are beautiful. Anything that gets those results will make me happy. I know you'll be able to work your magic and write some beautifully designed code that gets all the results and uses nice Scala code.

@snithish
Copy link
Contributor Author

@MrPowers I will try to simplify it down to asserting with and without collect. We can benchmark and decide after that. Should have something up by the end of this week.

@snithish
Copy link
Contributor Author

snithish commented Sep 9, 2018

@MrPowers finally got to refactoring it to use small dataset equality for column equality, have a look at let me know your thoughts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants