-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a simple validation transform to yaml. #32956
Conversation
R: @Polber We could call this "ValidateWithSchema" if we think that's not too verbose and other types of validation may be added in the future. |
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
The YAML and prism failures are pre-existing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple minor comments
if validator_ptr[0] is None: | ||
validator_ptr[0] = jsonschema.validators.validator_for(json_schema)( | ||
json_schema) | ||
validator_ptr[0].validate(convert(row)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was recently doing something similar in Python and was informed about PEP 3104
if validator_ptr[0] is None: | |
validator_ptr[0] = jsonschema.validators.validator_for(json_schema)( | |
json_schema) | |
validator_ptr[0].validate(convert(row)) | |
nonlocal validator_ptr | |
if not validator_ptr: | |
validator_ptr = jsonschema.validators.validator_for(json_schema)( | |
json_schema) | |
validator_ptr.validate(convert(row)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes. Good call.
if not json_schema: | ||
return lambda x: None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably this never happens since json_schema is a required parameter to the Validate transform, but wouldn't this also imply that if a json_schema is not given, it will pass silently?
Nit, but perhaps having an error or warn log on compilation here instead would make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This handles the degenerate case where the schema is {}
, which in json schema means "anything goes."
Though one could ask why one would even have this transform at all, it's quite possible the schema is provided elsewhere and we want to handle this case gracefully (similar to how empty lists are handled despite being degenerate).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok I was unaware of the {}
case - thanks for the explanation!
if not weak_schema: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my other comment, but wouldn't this pass silently? I'm not sure how it would reach this state in the first place, but if the incoming PCollection does not have a schema (perhaps if preceded by a transform that does not output Row?), this would pass validation even if given a json schema to validate against.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, weak_schema could be {}
. This comes up in practice if we don't know anything about this part of the input (e.g. it's Any
), which is also the fallback for beam_type_to_json_type
.
@robertwb I missed this before. I think either works. The Validate transform could also have different parameters for different validation cases. But I think in general it never hurts to be descriptive, so I suppose I am in favor of "ValidateWithSchema". Follows the ReadFromX, WriteToX, MapToFields, etc. cadence |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #32956 +/- ##
============================================
+ Coverage 57.41% 57.51% +0.10%
Complexity 1475 1475
============================================
Files 968 969 +1
Lines 154224 154380 +156
Branches 1076 1076
============================================
+ Hits 88546 88790 +244
+ Misses 63477 63389 -88
Partials 2201 2201
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.