Proposal: Make format validate by default #1520

Open
gregsdennis opened this issue Jun 17, 2024 · 59 comments

@gregsdennis
Member

There's a long and sticky history around format.

  1. Going back all the way to Draft 01, format has never required validation.
  2. Whether to support format validation has always been the decision of the implementation.
  3. The extent to which formats are validated has also been the decision of the implementation.

The result of all of this is that implementation support for validation has been spotty at best. Despite the JSON Schema specs referencing very concretely defined formats (by referencing other specs), implementations that do support validation don't all support each format equally. This has been the primary driving force behind keeping format as an opt-in validation.

With 2019-09, we decided that it was time to give the option of format validation to the schema author. They could enable validation by using a meta-schema which listed the Format Vocabulary with a true value, which meant, "format validation is required to process this schema."

In 2020-12, we further refined this by offering two separate vocabularies, one that treats the keyword as an annotation and one that treats it as an assertion. The argument was that the behavior of a keyword shouldn't change based on whether the vocabulary was required or not.
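For readers who haven't used the 2020-12 mechanism, here is a minimal sketch of what opting in to assertion looks like, written as Python data for convenience. The vocabulary URIs are the ones published for 2020-12. The `$id` and the helper name `with_format_assertion` are made up for illustration, and the `allOf` that a complete meta-schema would also need is left out.

```python
# Vocabulary URIs published for draft 2020-12.
FORMAT_ANNOTATION = "https://json-schema.org/draft/2020-12/vocab/format-annotation"
FORMAT_ASSERTION = "https://json-schema.org/draft/2020-12/vocab/format-assertion"


def with_format_assertion(vocabularies: dict) -> dict:
    """Return a copy of a $vocabulary map that swaps annotation for assertion."""
    updated = {
        uri: required
        for uri, required in vocabularies.items()
        if uri != FORMAT_ANNOTATION
    }
    updated[FORMAT_ASSERTION] = True  # true means "required to process the schema"
    return updated


# The $vocabulary map of the standard 2020-12 meta-schema.
standard_vocabularies = {
    "https://json-schema.org/draft/2020-12/vocab/core": True,
    "https://json-schema.org/draft/2020-12/vocab/applicator": True,
    "https://json-schema.org/draft/2020-12/vocab/unevaluated": True,
    "https://json-schema.org/draft/2020-12/vocab/validation": True,
    FORMAT_ANNOTATION: True,
    "https://json-schema.org/draft/2020-12/vocab/meta-data": True,
    "https://json-schema.org/draft/2020-12/vocab/content": True,
}

custom_meta_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/meta/format-assertion",  # hypothetical URI
    "$vocabulary": with_format_assertion(standard_vocabularies),
    # A complete meta-schema would also reference the vocabulary
    # meta-schemas in an "allOf"; omitted to keep the sketch short.
}
```

A schema author would then point `$schema` at that custom meta-schema to ask for format validation.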

However, the fact remains that our users consistently report (via questions in Slack, GitHub, and StackOverflow) that they expect format to validate. (The most recent case I can think of was only last week, in .Net's effort to build a short-term solution for schema generation from types.)

This consistency in user expectations leads me to believe that we should officially make format an assertion keyword and strictly enforce it by moving the appropriate tests into the required section of the Test Suite.

(Personally, I'm not passing all of the optional format tests, so I'll have to do some work to get there or document why they're not supported.)

@gregsdennis gregsdennis added this to the stable-release milestone Jun 17, 2024
@gregsdennis gregsdennis added the agenda For the OCWM agenda label Jun 17, 2024
@mwadams

mwadams commented Jun 18, 2024

FWIW I am of like mind.

(Corvus.JsonSchema does pass all the optional format tests, so that's easy for me to say!)

@jdesrosiers
Member

I agree that users expect that format will assert and it's probably best that the expected behavior is the default. I say that reluctantly considering that my implementation currently doesn't support format-assertion at all and I'm not excited about the effect that a requirement will have on my bundle size.

@jviotti
Member

jviotti commented Jun 20, 2024

I also agree users expect format to perform assertions and are almost always confused when they first see that it doesn't. However, if we bring back format as an assertion, how do we prevent the issues that caused it to become an annotation by default?

As described in https://json-schema.org/draft/2019-09/release-notes#format-vocabulary, it was very hard to make implementations interoperable on the validation they should perform and to what degree, and these inconsistencies sometimes ended up being more painful and confusing than format not being an assertion. Keep in mind that some formats are specifications on their own that can be very tricky to validate (like URIs?).

Sounds like if we go in this direction, we should not only make the currently optional tests required, but also extend them a lot more, which might be tricky given that some formats don't even have an "official" test suite of their own (again like URIs?)

Maybe a middle ground would be to make format an assertion while also being a bit more prescriptive about, e.g., the expected regular expressions that implementations should use as a minimum bar?

@gregsdennis
Member Author

how do we prevent the issues that caused it to become an annotation by default

It was always an annotation (or before annotations were a thing, it just wasn't validated).

@gregsdennis
Member Author

I agree having some minimal requirement set for each format is probably warranted.

@gregsdennis
Member Author

Another question this raises (as highlighted by @jdesrosiers' #1510 (comment)) is how we want implementations to handle formats they don't understand. Should implementations automatically pass validation for these or fail them?

I think an implementation failing validation (or maybe even refusing to process the schema) gives a more expected outcome when compared with an implementation that does support the format. Getting a pass from an implementation that doesn't know the format when one that does fails feels more wrong to me.

Thoughts?

@jdesrosiers
Member

Sounds like if we go in this direction, we should not only make the currently optional tests required, but also extend them a lot more

Agreed. I think the biggest concern was the inconsistency between implementations. The test suite can address that issue. Maybe it's not perfect, but it can get better over time as we find edge cases. I think the only way making format assert by default is a viable option is if we're fairly strict about what that validation behavior has to be. If we keep the lax requirements of the past, we'll just end up with the same problems.

@jdesrosiers
Member

Should implementations automatically pass validation for these or fail them?

I have different opinions depending on how we end up defining what asserting on format requires. If we're strict about how each format must validate (within reason), then I think implementations should refuse to process a schema with a format it doesn't know. To be clear, I mean it should be considered an invalid schema, it should not evaluate to false. But, if we stick with the current anything-goes requirements, then an unknown format should return true.
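To make the three options on the table concrete (pass, fail, or treat the schema as invalid), here is a rough sketch of a format check with a switchable policy. Everything in it is an assumption for illustration: the names `UnknownFormatPolicy`, `UnsupportedFormatError` and `check_format` are invented, and the one-entry registry stands in for whatever formats an implementation really knows.

```python
import ipaddress
from enum import Enum


class UnknownFormatPolicy(Enum):
    PASS = "pass"              # unknown format: instance is considered valid
    FAIL = "fail"              # unknown format: instance is considered invalid
    REJECT_SCHEMA = "reject"   # unknown format: the schema itself is unusable


class UnsupportedFormatError(Exception):
    """Hypothetical error meaning 'invalid schema', not 'validation false'."""


def is_ipv4(value: str) -> bool:
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


# Toy registry with a single known format.
KNOWN_FORMATS = {"ipv4": is_ipv4}


def check_format(name: str, instance: object, policy: UnknownFormatPolicy) -> bool:
    if not isinstance(instance, str):
        return True  # formats here only apply to strings
    checker = KNOWN_FORMATS.get(name)
    if checker is not None:
        return checker(instance)
    if policy is UnknownFormatPolicy.PASS:
        return True
    if policy is UnknownFormatPolicy.FAIL:
        return False
    raise UnsupportedFormatError(f"unknown format: {name!r}")
```

The difference between `FAIL` and `REJECT_SCHEMA` is the one drawn above: the first produces a normal `false` result, the second never produces a result at all.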

@SorinGFS

IMO format should exist in both states all the time, not enabled by some sort of config or vocabulary. The reason is that there are plenty of situations when we need both in the same schema! As it stands now, draft 2020 has format as annotation and schemas may use pattern to validate a format, so it may use both states of format, but if you are going to make format validate by default this option goes away!
When I analyzed about 700 schemas from schemastore I noticed that surprisingly format was rarely used, probably due to ambiguity around this keyword. Making format validate by default would raise the ambiguity around the keyword to the highest level. Moreover, making format validate by default would be a massive breaking change for plenty of schemas that didn't expect the format to be validated. And, json-schema has defined patterns for format validation, but those formats are not the same across various languages which are supporting different standards. Since json-schema advertises that it is a language agnostic standard I consider that json-schema should not decide which patterns should be used to validate formats. All this being said, I consider that format-validation should be completely removed, format keyword should stay as annotation and users should be encouraged to use pattern! This approach would eliminate the ambiguity forever.
But, if you won't accept the above solution I think at least you should create a second keyword for format:

  • if format validation is chosen to be the default (which I think is the worst decision) then the second keyword can be formatName or formatStandard, and the meta-schema validation for this keyword should be an enum limited to the names defined here
  • if format annotation is chosen to be the default then the second keyword can be hasFormat or formatted

@gregsdennis
Member Author

if you are going to make format validate by default this option goes away

No, it doesn't. People can still use x-format (or any x- keyword) if they must have an annotation.

Making format validate by default would raise the ambiguity around the keyword to the highest level.

Making it always behave either way would definitively remove ambiguity. Making it validate would align with users' expectations.

If you're referring to the level of support for each format offered by different implementations, other comments in this issue address that by saying we need more rigorous (and non-optional) testing.

making format validate by default would be a massive breaking change for plenty of schemas

It probably would be a breaking change for many schemas that don't declare $schema, yes. This is why we recommend using it.

On the other hand, it will fix the multitude of schemas that exist in the wild which expect format validation.

It's also going to be a burden for many tooling maintainers because a lot of them don't support these formats fully. (I'm one of them.)

those formats are not the same across various languages which are supporting different standards. Since json-schema advertises that it is a language agnostic standard I consider that json-schema should not decide which patterns should be used to validate formats.

JSON Schema very clearly defines the specifications for each format, and they're all language-agnostic (except maybe regex which declares ECMA-262). If tools are implementing other specifications, then they are not conforming to JSON Schema and need to be fixed anyway.

format keyword should stay as annotation and users should be encouraged to use pattern

Not all formats can be well-represented with regex, which is the only validation pattern provides. Further, not everyone agrees on the regexes to use for each format. IMO forcing people to use pattern doesn't fix anything and may actually make the problem worse.

@gregsdennis
Member Author

@Julian what would it take to get Bowtie to report on formats, maybe even just locally, so we can get some rough numbers?

@awwright
Member

awwright commented Jun 22, 2024

Indeed there's a great need to have a "format" keyword that validates. The difficulty of going about with this before has been a couple things:

  • Reverse compatibility, some usage of "format" is intended to be annotation-only.

  • A validating "format" [potentially] rejects all types of values, instead of within a single type (like most keywords). This makes use with multi-type schemas (e.g. "type": ["integer", "string"]) more difficult. This point motivated my earlier solution Type specific validating formats (stringFormat, numberFormat) #1391.

Even when unknown keywords are ignored, validation keywords typically cause errors when their value is outside the permitted values, so I'd expect an error. Similarly, a validating "format" is not too different from a $ref that has predefined names—you're referencing some external, arbitrary validator, and if you don't know what that is, that's an error.

@SorinGFS

@gregsdennis

Not all formats can be well-represented with regex, which is the only validation pattern provides. Further, not everyone agrees on the regexes to use for each format. IMO forcing people to use pattern doesn't fix anything and may actually make the problem worse.

Can you give some examples? And if a format wouldn't be well-represented with regex how would that format be validated by implementers?

@gregsdennis
Member Author

A validating "format" rejects all types of values, instead of within a single type (like most keywords). - @awwright

Could you clarify this? Are you saying that a validating "email" format would reject a number? I can't find such a requirement. In fact the (optional) test suite verifies that format ignores value types to which the format doesn't apply.

Maybe I'm misunderstanding you.

Can you give some examples? - @SorinGFS

email & uri can't be completely represented by a regex. Some rather large regexes get close, but I haven't seen a perfect one.

And if a format wouldn't be well-represented with regex how would that format be validated by implementers? - @SorinGFS

These generally need parsers. The spec says that it's expected that an implementation will rely on established functionality to perform the validation.
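The examples named here are email and uri, which are too large to sketch in a comment. As a smaller stand-in (my substitution, not one of the formats mentioned above), the same gap shows up with dates: a regex can check the shape of a date but not whether the day exists, so a parser from the standard library does the real work. The function names are invented for this sketch.

```python
import re
from datetime import date

# Shape check only: four digits, two digits, two digits.
DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def looks_like_date(value: str) -> bool:
    """Regex-only check: accepts anything with the right shape."""
    return DATE_SHAPE.fullmatch(value) is not None


def is_real_date(value: str) -> bool:
    """Shape check plus calendar check via the standard library parser."""
    if not looks_like_date(value):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False
```

With these two functions, "2023-02-30" satisfies the regex but is rejected by the parser, while "2024-02-29" passes both.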

@SorinGFS

SorinGFS commented Jun 22, 2024

@gregsdennis

These generally need parsers. The spec says that it's expected that an implementation will rely on established functionality to perform the validation.

Well, I think this is precisely the point where this concern goes into ambiguity: the same format: "uri" could be interpreted differently from one implementer to another, and from user's perspective I think this shouldn't be acceptable. I think json-schema should stick to things that are giving similar results in any context.

email & uri can't be completely represented by a regex. Some rather large regexes get close, but I haven't seen a perfect one.

As for the email part, using pattern is exactly the right solution, because the schema author has the authority to decide which pattern their own application expects.

@awwright
Member

Could you clarify this? Are you saying that a validating "email" format would reject a number? I can't find such a requirement. In fact the (optional) test suite verifies that format ignores value types to which the format doesn't apply.

So you're right, most of the formats won't reject e.g. numbers. There's still a related problem if I have a schema like { "type": ["number", "string"] }: defining separate formats for the numbers and strings. Suppose I want to say numbers are unix timestamps, and strings are RFC3339 dates. While if/then (or oneOf) is a solution, selecting on "type" is supposed to have another solution: multiple keywords that don't cross "type" boundaries.

More specifically: if it's possible to break down a keyword into multiple parts, because someone might want one to reject and not the other, then separate keywords are justified. "type" is the most common instance of this.
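A rough sketch of how per-type keywords could behave for the timestamp/date case, assuming the stringFormat and numberFormat names from #1391. None of this is specified anywhere: the "unix-time" format name and its rule are placeholders invented for the example, and the evaluator ignores every other keyword, including "type".

```python
import re
from datetime import date

DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_date(value: str) -> bool:
    if DATE_SHAPE.fullmatch(value) is None:
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def is_unix_time(value) -> bool:
    # Placeholder rule invented for this sketch: whole and non-negative.
    return value >= 0 and float(value).is_integer()


STRING_FORMATS = {"date": is_date}
NUMBER_FORMATS = {"unix-time": is_unix_time}  # hypothetical format name


def check_typed_formats(schema: dict, instance: object) -> bool:
    """Each keyword only ever looks at instances of its own type."""
    if isinstance(instance, str) and "stringFormat" in schema:
        return STRING_FORMATS[schema["stringFormat"]](instance)
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(instance, (int, float)) and not isinstance(instance, bool)
    if is_number and "numberFormat" in schema:
        return NUMBER_FORMATS[schema["numberFormat"]](instance)
    return True


schema = {
    "type": ["number", "string"],
    "stringFormat": "date",
    "numberFormat": "unix-time",
}
```

Because neither keyword crosses a type boundary, the two constraints sit side by side without an if/then or oneOf wrapper.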

@jdesrosiers
Member

making format validate by default would be a massive breaking change

I think everyone has recognized this is a breaking change, but I'm glad this was brought up because we haven't stated explicitly the consequences of that fact. After the stable release, breaking changes won't be allowed, so if we want to make this change, it would have to be before the release.

format keyword should stay as annotation and users should be encouraged to use pattern

In addition to Greg's response to this, there are other problems with relying on pattern. Regular expressions can be hard to write and hard to read. That makes them error prone, hard to maintain, and impossible to produce useful error messaging for. Being able to use "format": "email" rather than an incomprehensible pattern is a better experience for schema authors as well as schema consumers... as long as there's a strict and enforced definition of how "format": "email" should validate.

@SorinGFS

Being able to use "format": "email" rather than an incomprehensible pattern is a better experience for schema authors

I totally disagree on this one, and I think others should express their opinion on this matter. As for me, I will always prefer pattern as schema author because it gives me the opportunity to tweak the pattern until I get the desired result, and from then on I will have the guarantee that I will get precisely what I intended in my app... while the format keyword is fixed! And despite the fact that format is fixed... the results across implementers would vary!
In fact, I think this should be a wider discussion, not about the format keyword itself, but about how far the concern of json-schema should go. IMHO json-schema should respect a simple rule in this matter: no concern for matters that cannot produce similar results among implementers under any circumstances! All those concerns should go to application level where the schema author has the ability to require exactly what he needs in his application. The results of the implementers for the same schema must be identical by design, meaning that json-schema should only provide capabilities that cannot be interpreted differently.

@SorinGFS

This comment was marked as off-topic.

@SorinGFS

This comment was marked as off-topic.

@SorinGFS

This comment was marked as off-topic.

@jdesrosiers
Member

@SorinGFS I think we're talking about two different things. If I understand you correctly, you understand format to be a kind of semantic identifier. "format": "date" would indicate that the value represents a date. One schema might represent a date as a string and another might represent a date as a number, but they're both semantically a date. Given that interpretation, your arguments make perfect sense.

But, that's not the way format is defined in the spec. Every format has very specific expectations defined by a standard. For example, "date" is defined as a string that conforms to RFC 3339, section 5.6. If you use "format": "date" where you expect a timestamp, you're using format in a way that isn't correct according to the spec. All the defined formats are the same. They aren't open-ended. They all have very specific definitions that aren't open to interpretation by schema authors.

@SorinGFS

This comment was marked as off-topic.

@gregsdennis
Member Author

gregsdennis commented Jun 24, 2024

data type for date is number

This is an incorrect assumption. Different systems handle dates differently. Some (like many spreadsheets) internally represent dates as numbers.

.Net uses the DateTime struct, an object with multiple fields. When serializing a DateTime to JSON, .Net chose to represent the date as an ISO 8601 string.

Data interchange for date values should only be made as number, and how that date value transported as number is displayed should be completely another concern.

This is a valid opinion, but it is not an industry standard or best practice.

JSON Schema (long ago) made a decision that the ideal way to represent dates in JSON should be strings per RFC 3339. My guess is that, at the time, JSON Schema was published under IETF, and they wanted to use an IETF standard for date representation. Since then, the majority of the internet (based on my experience, which is admittedly .Net heavy) seems to have decided they like ISO 8601 better. (Happy to be proven wrong.) So if anything, we should be changing format: date to check for ISO 8601 dates.


But that's not what this discussion is about.

We're not discussing any format specifically. We're discussing whether the format keyword in general should validate. I'm marking any of your comments that do not address this specifically as off-topic.

In the future, please keep discussions on topic. If you have something new to discuss, please open a new discussion. If it's related, add a link.

@SorinGFS

@jdesrosiers @gregsdennis

IMHO, as a rule of thumb, anywhere json-schema requires an external library to function, that is a clear indication that it has exceeded its competence area. Json-schema must work using the basic capabilities of a language, with zero dependencies. That is not the case with format: date, date-time, time, duration, which need an external library to work.

@SorinGFS

JSON Schema (long ago) made a decision that the ideal way to represent dates in JSON should be strings per RFC 3339. My guess is that, at the time, JSON Schema was published under IETF, and they wanted to use an IETF standard for date representation. Since then, the majority of the internet (based on my experience, which is admittedly .Net heavy) seems to have decided they like ISO 8601 better. (Happy to be proven wrong.) So if anything, we should be changing format: date to check for ISO 8601 dates.

That is another indication that json-schema exceeded its competence area by entering a land of 'choices and preferences' at the expense of precision! 😄 Since json-schema, as you said, 'long ago' chose to use RFC 3339, things have changed: that was just a 'proposed standard' and... it has now been updated by RFC 9557

@gregsdennis
Member Author

That makes sense. Thanks for the connection.

@karenetheridge
Member

if you are going to make format validate by default this option goes away

No, it doesn't. People can still use x-format (or any x- keyword) if they must have an annotation.

...or they can define their own metaschema, with the format-assertion vocabulary removed and format-annotation added. (Or, we can formally publish a metaschema variant with this change in it already. You may recall that we've previously discussed publishing a draft2020-12 metaschema with format-assertion enabled; and then with this change, for the next spec release we would have the same thing only in reverse: the "main" schema has format-assertion turned on, and the "secondary" schema has it turned off).

@gregsdennis
Member Author

or they can define their own metaschema, with the format-assertion vocabulary removed and format-annotation added

This isn't an option anymore. We're removing vocabularies. They're being extracted into the new Feature Life Cycle as a proposal. That means the specification can't rely on vocabs anymore.

There is no format-assertion or format-annotation vocab. There's only the format keyword now.

@karenetheridge
Member

A validating "format" [potentially] rejects all types of values, instead of within a single type (like most keywords). This makes use with multi-type schemas (e.g. "type": ["integer", "string"]) more difficult.

We could explicitly state in the spec that while a format may choose to only match strings, it MUST NOT reject other data types for which it is not possible to have valid data (that is: you can have a format that matches against both string and number values, but it cannot produce a "valid":false result for objects, arrays etc. Or to rephrase, a "valid":false result may only be produced for a particular data type so long as there is at least one instance of that type for which "valid":true would be produced).
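One way to read that rule in code, as a sketch only: each format declares which instance types it can ever accept, and the evaluator short-circuits to valid for everything else. The `FormatChecker` and `evaluate_format` names are invented here, and the ipv4 entry is just a convenient example backed by the standard library.

```python
import ipaddress
from typing import Callable, NamedTuple, Tuple


class FormatChecker(NamedTuple):
    applies_to: Tuple[type, ...]       # Python types the format can accept
    check: Callable[[object], bool]    # only called for applicable instances


def evaluate_format(checker: FormatChecker, instance: object) -> bool:
    # bool is a subclass of int in Python, so treat it as its own JSON type.
    if isinstance(instance, bool) and bool not in checker.applies_to:
        return True
    # A type the format can never accept must not produce "valid": false.
    if not isinstance(instance, checker.applies_to):
        return True
    return checker.check(instance)


def is_ipv4(value: object) -> bool:
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


ipv4 = FormatChecker(applies_to=(str,), check=is_ipv4)
```

Under this reading, `evaluate_format(ipv4, 42)` and `evaluate_format(ipv4, ["10.0.0.1"])` are both valid, and only strings can fail.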

@karenetheridge
Member

This isn't an option anymore. We're removing vocabularies.

I did not have the impression that this had been decided yet.

@gregsdennis
Member Author

This isn't an option anymore. We're removing vocabularies.

I did not have the impression that this had been decided yet.

It's been in PR for a month, and it was part of an ADR at the beginning of May.

@gregsdennis
Member Author

gregsdennis commented Aug 1, 2024

I've re-read the conversation above, and given the objections/hesitations expressed by @Julian (about avoiding breaking changes) and @awwright (about format's applicability to specific types), I think the best way forward is to:

  1. Retain format as an annotation only (permanently, without the option to make it validate*).
  2. Introduce new [type]Format keywords which validate (using the Feature Life Cycle). This feature could also potentially mark format as deprecated.

* Implementations are still going to do what they want, which likely will include an option to validate format. However, I don't think the spec needs to (or should) sanction format validation.

Does anyone object to this? (👍/👎 voting)

If we agree, I'll create two new issues for each of these tasks so that they can be addressed in separate PRs. (We can probably keep #1391 for (2).)

cc: @json-schema-org/spec-team

@gregsdennis
Member Author

gregsdennis commented Aug 12, 2024

Another alternative: a new keyword called semanticType that fulfills the annotation. This would allow format to validate, as users expect.

(👍/👎 voting)

@mwadams

mwadams commented Aug 13, 2024

I'm torn.

I am tempted by stringFormat, numberFormat being the validating keywords because it also addresses the type: [string, number] issue (though personally I don't like the array form of type and feel you should be doing composition properly, lazybones 😀).

But it doesn't address the existing expectations of format.

And to comply with the existing expectations of format requires a breaking change.

I think I prefer to avoid the breaking change, and introduce new keywords with new semantics. Not least because they address a second issue.

@jviotti
Member

jviotti commented Aug 13, 2024

I vote for not introducing a breaking change, keeping format as an annotation while introducing something like stringType to do string related validation.

This also gives us the freedom to choose which string types we will standardise in the next version instead of standardising all of them. For example, date-time and uuid are simple to describe without ambiguity for implementors (which was the problem before 2019-09 for some of them), while e.g. email is a lot more ambiguous.

On naming, I get semanticType, but I think being more direct about the fact that these keywords only affect strings (such as stringType) would help users.

@mwadams

mwadams commented Aug 13, 2024

On naming, I get semanticType, but I think being more direct about the fact that these keywords only affect strings (such as stringType) would help users.

I agree.

@gregsdennis
Member Author

gregsdennis commented Aug 13, 2024

So it looks like we have the following choice:

  • introduce a new keyword that validates
  • fix user expectations, but it's technically a break from previous versions

Personally, I don't think that people will know that there's a new keyword, whether the new one validates or annotates, no matter how much we advertise it. (Evidenced by the continued questions around additional* where unevaluated* was the introduced solution.) My feeling is that people expect format to validate, so it should validate, even if that's technically a breaking change from how previous versions worked.

Moreover, it's only a breaking change in schemas that don't declare $schema because only those will be processed under new rules. If a schema declares $schema, then it's declaring the ruleset by which it should be interpreted, and that means it can't break. We've been telling people to use $schema for years for this very reason.
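To spell out why a declared $schema shields a schema from the change, here is a small sketch of dialect dispatch. The 2020-12 URI is real; `NEXT_DIALECT` is a made-up placeholder for whatever a future release would be identified by, and `format_asserts` is an invented helper name.

```python
DRAFT_2020_12 = "https://json-schema.org/draft/2020-12/schema"
NEXT_DIALECT = "https://example.com/hypothetical-next-dialect"  # made up

# Whether "format" asserts under each dialect. 2020-12 annotates by
# default; the entry for the next dialect is the proposal in this issue.
FORMAT_ASSERTS_BY_DIALECT = {
    DRAFT_2020_12: False,
    NEXT_DIALECT: True,
}


def format_asserts(schema: dict, default_dialect: str) -> bool:
    """Pick the ruleset from $schema, falling back to the tool's default."""
    dialect = schema.get("$schema", default_dialect)
    return FORMAT_ASSERTS_BY_DIALECT[dialect]


pinned = {"$schema": DRAFT_2020_12, "type": "string", "format": "email"}
unpinned = {"type": "string", "format": "email"}
```

The pinned schema behaves the same no matter which default a tool ships with. Only the unpinned one changes behavior when the default moves to the newer dialect.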

I'd like to get our users' feelings on this. We need to ask something like what they expect the validation outcome of this to be:

// schema
{
  "type": "string",
  "format": "email"
}

// instance
"definitely not an email"

(As I went to put something up on reddit, reddit decided to start having problems.)


The primary problem with leaving format alone is that, from my experience, users expect it to validate. When it doesn't they come here and ask why their schema isn't working. Then we have a bunch of "Per the spec, format doesn't validate" answers.

Why not give them what they want?

@GREsau

GREsau commented Aug 13, 2024

If format is made to validate, then my biggest concern is how validators should handle unknown formats. The formats standardised by JSON schema are fairly limited, so I find it useful to use "non-standard" formats in some cases, which consumers may ignore/respect/transform at their leisure. However, several consumers of my JSON schema generator for Rust have encountered issues (or at the very least, been confused) with non-standard formats. This isn't helped by some validators rejecting schemas with unknown formats, despite the very clear "An implementation MUST NOT fail validation or cease processing due to an unknown format attribute" in the spec...

While many non-standard formats could be replaced by a pattern for validation purposes, one of JSON Schema's greatest strengths IMO is its human-readability, and I would broadly consider

{
  "type": "string",
  "format": "semver"
}

easier to work with than

{
  "type": "string",
  "pattern": "^(?P<major>0|[1-9]\\d*)\\.(?P<minor>0|[1-9]\\d*)\\.(?P<patch>0|[1-9]\\d*)(?:-(?P<prerelease>(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$"
}

So my personal preference would be to make format validate by default, and keep the current "implementations may support custom formats and must not reject unknown formats" behaviour. Some validators will inevitably continue to ignore this, causing user confusion, but frankly I would say that's the cost they pay for diverging from the documented standard.

My biggest concern then would be about adding new formats to future versions of JSON Schema. We would surely want this to be considered a backward-compatible change, and it would be unless a validator were to have implemented a formerly-custom format differently to the newly-standardised spec.

I think that would be an acceptable risk, under the assumption that a new format would not be standardised without first performing due diligence of considering the likelihood of causing consumer breakage. Presumably this would be done by continuing what's largely already done today - facilitating public discussion, soliciting feedback from implementors etc.

Then whenever a new format is being considered for standardisation, in the likely event that implementors generally ignore and/or have consistent rules for validation of the format, then it could be added with no problems.

Otherwise, in the event of implementations having inconsistent validation rules for a format, it would need to be determined whether it's worth standardising it anyway and accepting the potential breakage (e.g. if it's accepted that those implementations are performing validation incorrectly due to misunderstanding the format), or else avoiding breakage by choosing a different name for the format.

@gregsdennis
Member Author

how validators should handle unknown formats

Absolutely a good callout. This was brought up recently to me by someone else offline while discussing it. No answers; just the concern.

As a first pass at this, I'd say that the spec would require validation of the formats it defines and immediate passing validation for non-spec formats. An implementation must not provide validation for custom formats unless explicitly configured to do so. (I'd want to state what an implementation MUST NOT do rather than state what it MAY do. Implementations are going to do what they want anyway, so it's not useful to explicitly state what they're allowed to do. See also #1509.)

This would provide a standard expected behavior for a compliant implementation "out of the box", but also allows for users to specify their own formats.
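A sketch of that first-pass behavior, under my own assumptions about the shape of the API: spec-defined formats always validate, unknown formats pass, and a custom format only validates once the user registers it. `FormatRegistry` and its methods are invented names, and the semver check is deliberately simplified (three dot-separated numbers).

```python
import ipaddress
import re


def is_ipv4(value: str) -> bool:
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


def is_simple_semver(value: str) -> bool:
    # Simplified on purpose: MAJOR.MINOR.PATCH with no pre-release or build.
    return re.fullmatch(r"[0-9]+[.][0-9]+[.][0-9]+", value) is not None


class FormatRegistry:
    def __init__(self, builtin: dict):
        self._builtin = dict(builtin)  # formats defined by the spec
        self._custom = {}              # formats the user opted in to

    def register_custom(self, name: str, check) -> None:
        """Explicit configuration step; nothing custom validates without it."""
        self._custom[name] = check

    def is_valid(self, name: str, instance: object) -> bool:
        if not isinstance(instance, str):
            return True
        check = self._builtin.get(name) or self._custom.get(name)
        if check is None:
            return True  # non-spec format, not configured: pass immediately
        return check(instance)


registry = FormatRegistry({"ipv4": is_ipv4})
```

Out of the box, `registry.is_valid("semver", "not-a-version")` passes. After `registry.register_custom("semver", is_simple_semver)` the same call fails.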

one of JSON Schema's greatest strengths IMO is its human-readability

100%. This was stated above, but the example provides a lot of clarity. Thanks.

@jviotti
Member

jviotti commented Aug 14, 2024

Personally, I don't think that people will know that there's a new keyword, whether the new one validates or annotates, no matter how much we advertise it.

This is a good point. I fully agree.

My feeling is that people expect format to validate, so it should validate, even if that's technically a breaking change from how previous versions worked.

So I generally agree with this. The only thing I'm concerned about is that format at the moment has grown to support a pretty rich set of formats (19 of them in 2020-12), and for a few of them it would be extremely hard to get validators to agree exactly on how to validate them. Plus if we don't solve the latter, I fear we will end up in the same ambiguous validation problem that resulted in format becoming just an annotation to begin with.

Not necessarily proposing we should, but for example, I'd feel much better making format an assertion if we shorten the list of formats to the ones that are clear how to validate in every case (i.e. unlike email), and double down on the JSON Schema Test Suite for covering them properly, with a lot more test cases than now.

@gregsdennis
Member Author

gregsdennis commented Aug 14, 2024

The only thing I'm concerned about is that format at the moment has grown to support a pretty rich set of formats (19 of them in 2020-12), and for a few of them it would be extremely hard to get validators to agree exactly on how to validate them.

Currently, the requirement isn't ambiguous. Each format references some other specification which identifies its strict requirements.

However, some of them can be a pain to implement fully, and even when using a 3rd-party-lib to support the format, there are no guarantees of support.

I think the solution here is to:

  1. continue to identify the specifications that give the requirement
  2. for complex formats (e.g. email, uri) identify some minimum subset of requirements that MUST be supported.

We actually already do this for regex support (Core 6.4) for pattern and patternProperties:

Furthermore, given the high disparity in regular expression constructs support, schema authors SHOULD limit themselves to the following regular expression tokens:

  • individual Unicode characters, as defined by the JSON specification [RFC8259];
  • simple character classes ([abc]), range character classes ([a-z]);
  • complemented character classes ([^abc], [^a-z]);
  • simple quantifiers: "+" (one or more), "*" (zero or more), "?" (zero or one), and their lazy versions ("+?", "*?", "??");
  • range quantifiers: "{x}" (exactly x occurrences), "{x,y}" (at least x, at most y, occurrences), {x,} (x occurrences or more), and their lazy versions;
  • the beginning-of-input ("^") and end-of-input ("$") anchors;
  • simple grouping ("(...)") and alternation ("|").

(I'd like to rephrase some of this to make it more a requirement on implementations rather than schema authors, but that's #1509.)

I think this is a good solution that's easily testable and removes a lot of edge cases from hard requirement.
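
To make the "easily testable" point concrete, here is a minimal sketch of a lint that flags a pattern when it steps outside the token list quoted above. It is a heuristic text scan, not a regular-expression parser, and the function name and labels are made up for illustration; it can misreport escaped backslashes or characters inside character classes.

```python
import re

# Constructs that are not on the recommended token list quoted above.
# Heuristic only: the probes scan the pattern text and do not parse it.
OUTSIDE_SUBSET = [
    (re.compile(r"\(\?"), "group extension"),          # (?:...), (?=...), (?P<n>...)
    (re.compile(r"\\[dDwWsSbB]"), "shorthand escape"),  # \d, \w, \s, \b, ...
    (re.compile(r"\\[1-9]"), "backreference"),          # \1, \2, ...
]


def outside_recommended_subset(pattern: str) -> list[str]:
    """Return labels for constructs that go beyond the recommended tokens."""
    return [label for probe, label in OUTSIDE_SUBSET if probe.search(pattern)]


if __name__ == "__main__":
    for candidate in (r"^[a-z]+(foo|bar){2,3}$", r"^(?:0|[1-9]\d*)$", r"(a)\1"):
        print(candidate, outside_recommended_subset(candidate))
```

A minimum-subset definition for a format like email or uri could be checked in the test suite in a similar spirit: a fixed list of constructs that every implementation has to handle.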

@awwright
Member

If format is made to validate, then my biggest concern is how validators should handle unknown formats.

It should be handled the same way any unknown value on a keyword is handled, with an error. An unknown $ref is not ignored, so why ignore an unknown format?

My biggest concern then would be about adding new formats to future versions of JSON Schema. We would surely want this to be considered a backward-compatible change

There is a problem in that adding a format is not backwards compatible if pre-existing validators ignore the new format: It would create a situation where new validators will reject certain inputs, but the pre-existing validators ignore the format, and accept the same input.

If a format is too new and authoring a schema using it would produce an error in too many validators, then perhaps there should be a way to signal that a few different alternatives are equivalent, e.g. you say

"format": "semver"
(which is equivalent to "format": "http://example.com/format/semver")
(which is equivalent to "pattern": "^(?P<major>0|[1-9]\\d*)\\.(?P<minor>0|[1-9]\\d*)\\.(?P<patch>0|[1-9]\\d*)(?:-(?P<prerelease>(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$")
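
As a rough illustration of what the pattern alternative accepts, the sketch below runs it through Python's `re` module. The pattern is the one above with the JSON string escaping removed. The `(?P<name>...)` group syntax is Python's; ECMA-262 spells named groups `(?<name>...)`, so the pattern would need adjusting before it could be used in a `pattern` keyword. The helper name `parse_semver` is made up for this example.

```python
import re

# The pattern from the comment above, unescaped from its JSON string form.
SEMVER_PATTERN = re.compile(
    r"^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)"
    r"(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)


def parse_semver(value: str):
    """Return the named groups if value matches the pattern, else None."""
    # fullmatch avoids Python's "$ also matches before a trailing newline".
    match = SEMVER_PATTERN.fullmatch(value)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for candidate in ("1.2.3", "1.0.0-alpha.1+build.5", "01.2.3", "1.2"):
        print(candidate, parse_semver(candidate))
```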

@gregsdennis
Member Author

@awwright, yeah, that makes sense to me.

Considering that, I would update what I said before to "the spec would require validation of the formats it defines and error for non-spec formats."

This still leaves the door open for implementations to support custom formats, and it still requires users to explicitly opt in to that behavior, indicating they understand the risk of future-version collision.
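
A minimal sketch of that behavior, assuming a hypothetical `FormatRegistry` class (none of these names come from the spec or from any existing implementation): spec-defined formats are always checked, unknown formats raise an error, and a custom format only takes effect after an explicit registration call.

```python
import re
from typing import Callable, Dict

FormatCheck = Callable[[str], bool]


class UnknownFormatError(ValueError):
    """Raised when a schema names a format with no registered check."""


class FormatRegistry:
    """Hypothetical registry: spec formats built in, custom formats opt-in."""

    def __init__(self, spec_formats: Dict[str, FormatCheck]) -> None:
        self._checks = dict(spec_formats)
        self._spec_names = frozenset(spec_formats)

    def register_custom(self, name: str, check: FormatCheck) -> None:
        # The explicit call is the user's opt-in; spec names stay reserved.
        if name in self._spec_names:
            raise ValueError(f"{name!r} is defined by the spec")
        self._checks[name] = check

    def validate(self, name: str, value: str) -> bool:
        try:
            check = self._checks[name]
        except KeyError:
            # Unknown formats are an error rather than a silent pass.
            raise UnknownFormatError(name) from None
        return check(value)


# Stand-in for the spec's format list; only "uuid" is sketched here.
UUID = re.compile(r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}")
registry = FormatRegistry({"uuid": lambda s: UUID.fullmatch(s) is not None})
```

With this shape, `registry.validate("semver", "1.2.3")` raises `UnknownFormatError` until the user calls `registry.register_custom("semver", ...)`, which is the opt-in step.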

@jviotti
Member

jviotti commented Aug 15, 2024

for complex formats (e.g. email, uri) identify some minimum subset of requirements that MUST be supported.

This is a great start, but I think it doesn't fully solve the problem. Some "permissive" implementations will stay at the lower bar, some may go very strict, and many might be somewhere in between. At that point, schemas stop being fully interoperable between implementations even within the same dialect.
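
To illustrate the concern, here is a small sketch with two made-up email checks standing in for a "lower bar" implementation and a stricter one. Neither is meant to be RFC-accurate; the point is only that the same instances can get different verdicts.

```python
from typing import Callable, Dict, Iterable, List

FormatCheck = Callable[[str], bool]


def lenient_email(value: str) -> bool:
    """Stand-in for a lower-bar check: something on each side of an '@'."""
    local, sep, domain = value.partition("@")
    return bool(local) and sep == "@" and bool(domain)


def stricter_email(value: str) -> bool:
    """Stand-in for a stricter check: no spaces, dotted domain, one '@'."""
    local, sep, domain = value.partition("@")
    if sep != "@" or not local or " " in value:
        return False
    return "." in domain and "@" not in domain


def disagreements(
    checks: Dict[str, FormatCheck], instances: Iterable[str]
) -> List[str]:
    """Return the instances on which the given checks do not all agree."""
    return [
        instance
        for instance in instances
        if len({check(instance) for check in checks.values()}) > 1
    ]


if __name__ == "__main__":
    corpus = ["user@example.com", "user@localhost", "a b@example.com", "no-at-sign"]
    checks = {"lenient": lenient_email, "stricter": stricter_email}
    print(disagreements(checks, corpus))
```

Every instance that such a comparison reports is a schema/instance pair that validates in one implementation and fails in another, even within the same dialect.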

If there was a way for us to better define the exact requirements and not only the lower bar, even for complex ones, I'd be 100% sold on this proposal.

@gregsdennis
Member Author

I'm open to suggestions, but I don't think that being 100% exact is possible with formats like email.

The point of the minimum requirements isn't really for implementations. It's for users to know where that line is. As long as they stay within the "minimum support" bounds, they're guaranteed to be interoperable.

(We state it as a requirement of the implementation, but it's an assurance for users.)

@awwright
Member

I agree JSON Schema formats should include arbitrarily complicated formats, e.g. context free languages and even Turing-complete languages. However note most of the formats specified are regular, and validators shouldn't be required to support formats more complicated than regular. I think the only exception is internationalized domain names.

Also, email and uri are not only regular, but quite simple once some technical requirements are considered. The regular expression for email addresses is around 220 characters when you factor out the byte length limits, and URIs are similar if you factor out the IPv6 address validation. (This is one of a couple active areas of research for me at the moment, I'm working on a website to generate regular expressions for common protocols and formats for all different programming languages.)

@jviotti
Member

jviotti commented Aug 16, 2024

Could we suggest in the spec specific regular expressions like the ones @awwright is mentioning? i.e. a SHOULD as a lower bar?

@mwadams

mwadams commented Aug 16, 2024

I do pass all the optional format stuff, and I've got a suite of "fairly good" regexes that deal 'test-passingly' (I hesitate to say "correctly") with some of the more complex internationalisation requirements along with some "encoding/decoding" you have to do to check "a bit more properly".

I can describe the algorithms and regexes in a way that might be helpful for implementers.

@gregsdennis
Member Author

i.e. a SHOULD as a lower bar?

The lower bar needs to be a MUST. Complete support would be the SHOULD.

@jdesrosiers
Member

Sorry for being late to the discussion. I want to start by talking about the duality of the format keyword. It really does two things. The one we're primarily talking about here is syntactic validation. But, it's also used for semantic identification as well. "format": "date" could be used to identify a value as semantically referring to a date or it could be used to validate a string that conforms to the RFC 3339 specification. When we discuss whether format should be an assertion or an annotation, it's largely a push and pull over which of these identities format should have. format is defined very heavily toward the syntactic side and that's why I think it should be an assertion. However, I still think we need to provide some solution for users who want a semantic keyword, which is where the semanticType suggestion comes from.
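
The two identities can be sketched side by side. The helper names and the shape of the result dictionary below are invented for illustration and do not reflect any implementation's output format. The syntactic check follows the RFC 3339 full-date form (YYYY-MM-DD with a real calendar day). One known limitation of this sketch is that Python's `datetime.date` rejects year 0000, which the grammar allows.

```python
import re
from datetime import date

# RFC 3339 full-date shape: four-digit year, two-digit month and day.
FULL_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_rfc3339_full_date(value: str) -> bool:
    """Syntactic check: YYYY-MM-DD that names a real calendar day."""
    match = FULL_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # rejects e.g. 2024-02-30 (and year 0000)
    except ValueError:
        return False
    return True


def evaluate_date_format(value: str, assert_format: bool) -> dict:
    """Annotation mode always passes; assertion mode can fail the instance."""
    if assert_format and not is_rfc3339_full_date(value):
        return {"valid": False, "annotations": {}}
    return {"valid": True, "annotations": {"format": "date"}}


if __name__ == "__main__":
    print(evaluate_date_format("next Tuesday", assert_format=False))
    print(evaluate_date_format("next Tuesday", assert_format=True))
```

In annotation mode the value "next Tuesday" is still identified as a date semantically; only assertion mode rejects it on syntax.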

I agree with the sentiment that a new keyword will go relatively unknown. One solution to that problem is to deprecate format and replace it with two new keywords. When people discover that format is gone, they'll go looking for its replacement and find two keywords to choose from and be able to choose the one that suits their needs.

Another option that occurred to me is to use the same x- convention we're introducing for keywords. It would have a very similar meaning. You can use them arbitrarily and they have no validation behavior.

I really like the idea of requiring extension formats that need to validate to be URIs. That eliminates the possibility of us introducing a breaking change by adding new formats. So, we could allow unknown formats to be ignored without a breaking change concern. However, I think there are other reasons not to ignore unknown formats. The biggest reason is consistency. Users should expect consistent results from different validators unless they specifically opt-in to something custom. If one implementation understands https://semver.org and another doesn't, they could give different results. There should only be different results if the user specifically configures it.

Regarding the validation of things like the email format, I think people often read too much into what "valid" implies. It just means that it's syntactically valid according to RFC 5321. That's unambiguous and straightforward for anyone to implement. That doesn't tell you whether an email sent to that location will succeed. Many email servers are more lax than the RFC, so there are email addresses that exist in the wild that work, but aren't valid according to the RFC. I don't think JSON Schema needs to care about any of that. All we should be concerned with is whether the email conforms to the RFC defined syntax. None of the ambiguous stuff should be in scope.
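
As a rough sketch of what "syntactically valid" means here, the check below covers only part of the RFC 5321 Mailbox rule: a Dot-string local part followed by a Domain made of letter-digit-hyphen labels. Quoted-string local parts, address literals, and the length limits are deliberately left out, so this is a subset and not a complete implementation. All names are made up for the example.

```python
import re

# atext characters (RFC 5322), used by the RFC 5321 Atom rule.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# sub-domain: starts and ends with a letter or digit, hyphens only inside.
SUB_DOMAIN = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"

# Dot-string "@" Domain. Quoted-string and address-literal are not covered.
MAILBOX_SUBSET = re.compile(
    rf"{ATEXT}+(?:\.{ATEXT}+)*@{SUB_DOMAIN}(?:\.{SUB_DOMAIN})*"
)


def is_mailbox_subset(value: str) -> bool:
    """True if value fits the Dot-string@Domain subset of the grammar."""
    return MAILBOX_SUBSET.fullmatch(value) is not None


if __name__ == "__main__":
    samples = (
        "user@example.com",
        "first.last+tag@sub.example.org",
        ".user@example.com",
        "user..name@example.com",
        "user@-example.com",
    )
    for candidate in samples:
        print(candidate, is_mailbox_subset(candidate))
```

Nothing in the check says whether mail to the address would be delivered; it only answers the syntax question, which is the scope argued for above.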

@jviotti
Member

jviotti commented Aug 16, 2024

I can describe the algorithms and regexes in a way that might be helpful for implementers.

@mwadams Can you link me to those? Happy to be the guinea pig on this if you agree, as I don't support format at all at the moment. Maybe that will give me more context for this discussion too.

Regarding the validation of things like the email format, I think people often read too much into what "valid" implies. It just means that it's syntactically valid according to RFC 5321. That's unambiguous and straightforward for anyone to implement. That doesn't tell you whether an email sent to that location will succeed. Many email servers are more lax than the RFC, so there are email addresses that exist in the wild that work, but aren't valid according to the RFC. I don't think JSON Schema needs to care about any of that. All we should be concerned with is whether the email conforms to the RFC defined syntax. None of the ambiguous stuff should be in scope.

Sounds great. I'd still like to see more testing for these formats in the official test suite. I think most implementers just look at the basic coverage we have there for, e.g., email and never read the RFC 5321 grammar, leading to the problems I was mentioning before: some attentive implementations would adhere fully to the RFC grammar and other implementations only do enough for the test suite, and then you pass a schema across them and it doesn't quite work the same.

@mwadams

mwadams commented Aug 16, 2024

I agree - I'm not super attentive to the grammar if I pass the tests (though most of our tests are derived from the relevant RFCs and are quite tricky to pass!)

I'll see about pulling something together.

Projects
Status: In Discussion