v1 milestones & release #2

Open
MrPowers opened this issue Apr 3, 2021 · 5 comments

Comments

@MrPowers
Contributor

MrPowers commented Apr 3, 2021

I'm really excited about this project!

Think about the features that'll be included in the "initial public release". Once all the initial features are built, ping me, and I'll make a commit that makes a compelling sell in the project README.

Once the README is updated, I'll start marketing the project to try to get users and feedback on the code.

Does that sound like a good plan? I'm definitely interested in seeing this project grow & get a lot of users!

@alfonsorr
Member

Hey @MrPowers, I'm very happy about your interest 😄

I'm still refactoring the code to get a first usable version ready. I expect to have all types except structs included, and the idea is to have the basic map-style functionality (withColumn, filter, select, drop, etc.), but typed.

The idea for now is to keep the syntax very close to the Spark API. An example would be something like:

df.withColumn("new_col", getInt("c1") + getInt("c2"))
df.withColumn("new_col", getInt("c1") + getTimestamp("c2")) // won't compile

Any errors at runtime will be accumulated, so if c1 and c2 are not integers, a single error will be thrown saying that both selected columns are invalid.
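
Roughly, the accumulation could work the way cats' ValidatedNec does. Here is a minimal, self-contained sketch of the idea (getInt here is a stand-in that checks a fake schema, not the real implementation):

import cats.data.ValidatedNec
import cats.implicits._

type Errors[A] = ValidatedNec[String, A]

// Stand-in for a typed column lookup: succeeds only if the column
// exists in the (fake) schema with the expected type.
def getInt(schema: Map[String, String])(name: String): Errors[String] =
  schema.get(name) match {
    case Some("int") => name.validNec
    case Some(other) => s"column '$name' is $other, expected int".invalidNec
    case None        => s"column '$name' not found".invalidNec
  }

val schema = Map("c1" -> "string", "c2" -> "timestamp")
val lookup = getInt(schema) _

// mapN accumulates both failures instead of stopping at the first one.
val result = (lookup("c1"), lookup("c2")).mapN(_ + " + " + _)
// => Invalid(...) carrying both error messages at once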

I will try to have some basic functionality ready in the next few days to show you.

@MrPowers
Copy link
Contributor Author

MrPowers commented Apr 5, 2021

@alfonsorr - that sounds like a good first implementation. I like the idea of making this lib a "minimalistic, performant way to write typesafe Spark code". It can have these selling points:

  • it allows for typesafe programming with compile-time checks
  • it's just as performant as regular Spark DataFrames (unlike Datasets)
  • it can be used in conjunction with "regular Spark code"

Bringing typesafe programming to the Spark-Scala community will be huge!
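
To make the first and third points concrete, here is a rough sketch of the kind of design that would deliver them (TypedCol, getInt, and getTimestamp are hypothetical stand-ins, not this library's actual API): a phantom-typed wrapper over a plain Spark Column rejects mismatched types at compile time, yet unwraps to a normal Column so it mixes freely with regular DataFrame code.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// The type parameter T exists only at compile time; at runtime this is
// just a plain Spark Column, so there is no Dataset-style encoding cost.
final case class TypedCol[T](untyped: Column) {
  def +(other: TypedCol[T]): TypedCol[T] = TypedCol(untyped + other.untyped)
}

def getInt(name: String): TypedCol[Int] = TypedCol(col(name))
def getTimestamp(name: String): TypedCol[java.sql.Timestamp] = TypedCol(col(name))

def addTotals(df: DataFrame): DataFrame =
  df.withColumn("total", (getInt("c1") + getInt("c2")).untyped) // plain Spark again
// getInt("c1") + getTimestamp("c2") would be rejected at compile time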

Let me know when you're finished with the basic prototype and I'll try it out. No rush. Definitely excited!

@jserranohidalgo
Member

Awesome selling points :)

My only possible caveat is that the message sounds too strong. DataFrames are dynamically typed, and doric expressions won't avoid that: compile-time checks may succeed and we may still get typing errors at runtime, right?

Things might be different if we could start from some kind of ValidatedDataFrame[T]. Dynamic typing errors could still happen, of course, but they would be caught in advance, so we could say that execution is guaranteed to succeed provided that the validation checks on the accompanying DataFrame pass. I'm not sure whether this kind of ValidatedDataFrame would actually be useful, though. Maybe it would be enough to constrain the scope of type-safety, in a footnote, to well-formed Spark column expressions or something like that, leaving your selling points intact.
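
For the sake of discussion, a rough sketch of what such a ValidatedDataFrame[T] might look like (all names are hypothetical, and a real version would derive the expected schema from T instead of hard-coding one column). The private constructor means the only way to obtain one is through the upfront check, which is what would license the "guaranteed at runtime" claim:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.IntegerType

// A DataFrame whose schema has already been checked against T.
final class ValidatedDataFrame[T] private (val df: DataFrame)

object ValidatedDataFrame {
  // Stand-in check: only verifies that an Int column "id" exists; a real
  // version would compare the full schema derived from T.
  def validate[T](df: DataFrame): Either[String, ValidatedDataFrame[T]] =
    df.schema.find(_.name == "id") match {
      case Some(f) if f.dataType == IntegerType =>
        Right(new ValidatedDataFrame[T](df))
      case Some(f) =>
        Left(s"column 'id' has type ${f.dataType}, expected IntegerType")
      case None =>
        Left("column 'id' not found")
    }
}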

Thanks for your involvement, @MrPowers!

@alfonsorr
Member

I've opened a few issues covering the elements pending for a first release and created a project in GitHub to keep track of them.

@MrPowers
Contributor Author

@alfonsorr - I checked the issues and the project and it looks like you're making great progress. Ping me when the v1 stuff is done, so I can try out the project and provide feedback. Can't wait!!
