[Feature Request][YAML]: KafkaIO for YAML #28664
Comments
@robertwb -- Jeff mentioned you might be working on this. Is that true? Otherwise, Ferran [ @ffernandez92 ] can take that on.
I've thought about this some, but haven't actually started any work here. It'd be great if Ferran takes this on. It'd probably make sense to follow the same pattern as the PubSub one (namely, taking a "format" and "schema" parameter to convert to and from Row objects). (I don't know if Kafka messages have the equivalent of PubSub attributes.) Also, in this case there's no Python implementation, so no need for a Python variant. Just do it all in Java.
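For concreteness, here is a minimal sketch of the configuration shape this suggests, mirroring the PubSub pattern. `KafkaReadConfig`, the field names, and the "raw"-only parser are all hypothetical, not the provider's actual API:

```java
import java.io.Serializable;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.Row;

// Hypothetical configuration mirroring the PubSub pattern: "format" selects
// a parser and "schema" describes the Rows that parser should emit.
public class KafkaReadConfig implements Serializable {
  private final String format; // e.g. "raw", "json", "avro"
  private final String schema; // schema text, used to build json/avro parsers

  public KafkaReadConfig(String format, String schema) {
    this.format = format;
    this.schema = schema;
  }

  // Turns the configured format into a bytes -> Row parser. Only "raw" is
  // sketched here; json/avro would parse the payload against `schema`.
  public SerializableFunction<byte[], Row> buildParser(Schema rowSchema) {
    switch (format) {
      case "raw":
        return payload ->
            Row.withSchema(rowSchema).withFieldValue("payload", payload).build();
      default:
        throw new IllegalArgumentException("Unsupported format: " + format);
    }
  }
}
```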
Assigning myself for now. I was unable to assign @ffernandez92 . Are special permissions needed for that? FYI --> We still have some things this week to address, AND, we need to get a machine set up to be able to run/test things, so it might be more than a week. @anyone message here if any questions.
Not sure what the permissions requirements are; it looks like the assignee needs to comment on the issue first? https://github.com/apache/beam/blob/master/CONTRIBUTING.md#share-your-intent . In any case, the intent is clear, and I'm looking forward to this contribution.
I've been pondering how to approach this issue. Initially, we can think about utilizing the "raw" type, where the Row object would consist solely of the payload represented as bytes. As for the schema, for the time being, we can reuse the _create_parser function. If we decide to enhance it later on, we might consider supporting additional types like Avro or Proto. Nevertheless, I believe it's best to keep it straightforward and allow the user to handle such complexity downstream through custom transformations, possibly in Java or Python. Any opinions here?
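In code, the "raw" idea is just a single-field Row schema. A minimal sketch (the names `RawFormat`, `RAW_SCHEMA`, and `payload` are illustrative):

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

public class RawFormat {
  // For the "raw" type, each Kafka message becomes a one-field Row holding
  // the payload as bytes; any real parsing happens downstream.
  public static final Schema RAW_SCHEMA =
      Schema.builder().addByteArrayField("payload").build();

  public static Row toRow(byte[] payload) {
    return Row.withSchema(RAW_SCHEMA).withFieldValue("payload", payload).build();
  }
}
```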
Supporting the "raw" type as bytes certainly is the most flexible and makes sense as a first pass. I do think we'll want to support json and avro (at least), similar to the PubSub transforms. I just noticed that we already have https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaReadSchemaTransformProvider.java which is actually pretty complete. We'd probably want to add an option for the "raw" format that would pass the bytes through directly, as well as an option to add the key in if desired (similar to how the attributes can be appended as extra fields for PubSub).
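A sketch of what the opt-in key field might look like (hypothetical names; the nullable `key` field mirrors how PubSub attributes are appended as extra fields):

```java
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

public class RawWithKey {
  // When the user asks for the key, append it as an extra nullable bytes
  // field next to the payload.
  public static final Schema RAW_WITH_KEY_SCHEMA =
      Schema.builder()
          .addByteArrayField("payload")
          .addNullableField("key", Schema.FieldType.BYTES)
          .build();

  public static Row toRow(KafkaRecord<byte[], byte[]> record) {
    return Row.withSchema(RAW_WITH_KEY_SCHEMA)
        .withFieldValue("payload", record.getKV().getValue())
        .withFieldValue("key", record.getKV().getKey())
        .build();
  }
}
```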
And we'd also want to add error handling capabilities.
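One common shape for this in Beam is a multi-output DoFn that routes parse failures to a dead-letter output instead of failing the bundle. A sketch, with the tag names and error-Row shape invented for illustration:

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;

public class ParseWithErrors extends DoFn<byte[], Row> {
  public static final TupleTag<Row> OUTPUT = new TupleTag<Row>() {};
  public static final TupleTag<Row> ERRORS = new TupleTag<Row>() {};

  // Illustrative error-row shape: the raw payload plus the failure message.
  public static final Schema ERROR_SCHEMA =
      Schema.builder().addByteArrayField("payload").addStringField("error").build();

  private final SerializableFunction<byte[], Row> parser;

  public ParseWithErrors(SerializableFunction<byte[], Row> parser) {
    this.parser = parser;
  }

  @ProcessElement
  public void processElement(@Element byte[] payload, MultiOutputReceiver out) {
    try {
      out.get(OUTPUT).output(parser.apply(payload));
    } catch (Exception e) {
      out.get(ERRORS).output(
          Row.withSchema(ERROR_SCHEMA)
              .withFieldValue("payload", payload)
              .withFieldValue("error", String.valueOf(e.getMessage()))
              .build());
    }
  }
}
```

It would be applied with ParDo.of(new ParseWithErrors(parser)).withOutputTags(ParseWithErrors.OUTPUT, TupleTagList.of(ParseWithErrors.ERRORS)), with the errors collection wired to wherever dead-letter records should go.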
FYI --> We're most interested in protobuf and json [ since we don't currently use avro ].
+1, protobuf support would be great to have too.
Thanks for the KafkaProvider, @robertwb. I ran a quick test adding the necessary elements to
As a test, I ran the following YAML:
But it shows the following error:
Most likely I'm missing something. When I registered the transform as "full python" under
Yes, Java is the right way to go. Looks like you hit #28775 (the Java <-> Python side is still not as fully vetted/tested).
Thanks! It worked. Most likely I'll create a different provider, since I don't think this one covers everything we need. Besides that, I dropped a comment here: #28775 . I had to fix line 173 as well.
What is the motivation to provide a different provider vs. extend this one? (These are not really used yet, but were prototyped in anticipation of something like YAML.)
Oh, I see! I thought it might have been utilized elsewhere, so I was concerned that an extension or modification could potentially break backward compatibility. However, I think extending is fine.
Hoping that we wind up with tests in place so we can confidently modify/extend without worry. :-)
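In that spirit, here's a minimal JUnit sketch against the hypothetical RawFormat helper from earlier in the thread (not an actual test in the repo):

```java
import static org.junit.Assert.assertArrayEquals;
import static org.junit.Assert.assertEquals;

import org.apache.beam.sdk.values.Row;
import org.junit.Test;

public class RawFormatTest {
  @Test
  public void rawFormatWrapsPayloadAsBytes() {
    byte[] payload = new byte[] {1, 2, 3};
    Row row = RawFormat.toRow(payload);
    // The "raw" path should wrap the payload untouched in a one-field Row.
    assertEquals(RawFormat.RAW_SCHEMA, row.getSchema());
    assertArrayEquals(payload, row.getBytes("payload"));
  }
}
```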
I was trying to run the previous code in Dataflow but I'm facing some issues; maybe you guys can help.
1 - I created the expansion service jar:
2 - Ran the following command:
I got the following error:
I tried to roll back my code and use the original code, but I got the same result. My expectation here is that I'm not building this the right way (most likely it has to do with the AutoValues, but I'm not sure). Is there any gradlew command I should be using?
Disregard this last comment... I forgot to add the Kafka client while building the expansion service... It works as expected now.
What would you like to happen?
We'd like Beam YAML to support Kafka :-)
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components