jsonlines.org and ndjson.org #22

Open
max-mapper opened this issue Jan 29, 2017 · 18 comments

@max-mapper

hey I noticed http://ndjson.org/ and http://jsonlines.org/ are very similar, I was just wondering if maybe they could link to each other to reduce confusion? I like both names personally and use them interchangeably

cc @chrisdew

@wardi
Owner

wardi commented Jan 29, 2017

This site links to ndjson from http://jsonlines.org/on_the_web/ and ndjson.org links back here from its footer. Is that not sufficient?

@karmakaze

In the interest of converging on a single standard, would it not be beneficial for these two sites to co-ordinate and agree on items, and ideally just be one .org? Having two similar sites each promoting an emerging 'standard' with differences gives a sense it's not ready for interoperation.

@glensc

glensc commented Mar 3, 2021

created similar issue in the ndjson repo:

@pekkaklarck

pekkaklarck commented Mar 23, 2021

👍 for just one standard and one web site. Based on a quick look there aren't any real differences except for the extension (.jsonl vs .ndjson). Having a common extension would make it more likely that editors and IDEs support this format without extra configuration.

@onacit

onacit commented Feb 23, 2022

Cross-linking is not sufficient. Anyone would google the two names with "vs.", which might lead here.

I'm sure the author was not confused. :) But I'm confused.

> This format is specified at ndjson.org and documented at the JSON Lines website.

- https://en.wikipedia.org/wiki/JSON_streaming

@jsejcksn

jsejcksn commented Jul 6, 2022

Observation: If repository issue activity is any metric for discoverability, then JSON Lines has an advantage.

> In the interest of converging on a single standard, would it not be beneficial for these two sites to co-ordinate and agree on items, and ideally just be one .org? Having two similar sites each promoting an emerging 'standard' with differences gives a sense it's not ready for interoperation.

I understand that, historically, there were spec differences due to potential ambiguity (UTF-8 encoding, required JSON data on every line, etc.), but it seems as though they are now aligned. At this point in time, are there any remaining spec differences? And are there any other issues which are preventing convergence (e.g. copyright credit, etc.)? I think the community will greatly benefit from a single, unified standard with an RFC, registered IANA media type, etc. The involved parties appear to be reasonable and responsive. Can we make this happen?

@wardi
Owner

wardi commented Jul 6, 2022

I prefer the name "JSON lines" because that seemed like the obvious name to me :-) but, the ndjson folks did go the extra mile and write a spec.

If we're fully aligned I like the idea of settling on a single name. Is there an unbiased measure we can use for deciding?

@jsejcksn

jsejcksn commented Jul 6, 2022

> Is there an unbiased measure we can use for deciding?

@wardi Names are names and will always be arbitrary/subjective. 😅 I think it's just up to the party that submits the RFC and registers. IMO, a unified standard with either name is better than two ambiguously identical alternatives.

@remram44

remram44 commented Jul 6, 2022

The ndjson repo hasn't seen any maintainer activity in years. That makes it both impossible to pick this and have them redirect, and a bad idea to pick them and redirect from here.

@pekkaklarck

pekkaklarck commented Oct 20, 2022

The owner of the ndjson domain seems to be fine going forward with jsonlines.org.

@stokito

stokito commented Feb 22, 2023

This is a mess. Let's finally reach a decision. My proposal is to take the existing JSON Text Sequences spec, RFC 7464, and extend it: add a jsonl file extension and make both the RS character and the LF optional.

A good overview of all the streaming formats: https://en.wikipedia.org/wiki/JSON_streaming

  • Plain concatenated JSON: every opening bracket must be paired with a closing bracket. There is no spec for this.
  • NDJSON: separator \n (LF), parsers also accept \r\n. File ext: .ndjson, MIME: application/x-ndjson
  • JSON Lines: separator \n, parsers also accept \r\n. File ext: .jsonl, MIME: none
  • RFC 7464: file ext: none, MIME: application/json-seq, which is registered with IANA. Additionally, it uses an RS character:

Its basic idea is to have "unambiguous JSON" that is resilient to many forms of damage, such as truncation, multiple writers incorrectly configured to write to the same file, corrupted JSON, etc. An example sequence:

    ␞{"d":"2014-09-22T21:58:35.270Z","value":6}␤
    ␞{"d":"2014-09-22T21:59:15.117Z","value":12}␤

From the spec:

 Phillip Hallam-Baker proposed the use of JSON text sequences for
 logfiles and pointed out the need for resynchronization.  Stephen
 Dolan created https://github.com/stedolan/jq, which uses something
 like JSON text sequences (with LF as the separator between texts on
 output, and requiring only such whitespace as needed to disambiguate
 on input). Carsten Bormann suggested the use of ASCII RS, and Joe
 Hildebrand suggested the use of LF in addition to RS for
 disambiguating top-level number values.
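
For illustration, here is a minimal sketch of how a reader could resynchronize on the RS character after a damaged record. The function name `parse_json_seq` and the truncated middle record are made up for this example, and the sketch ignores the RFC's extra rule about top-level numbers that are not followed by whitespace.

```python
import json

RS = "\x1e"  # ASCII Record Separator, written before every JSON text in RFC 7464


def parse_json_seq(data: str):
    """Yield (ok, value) per RS-framed record; ok is False for damaged records."""
    for chunk in data.split(RS):
        if not chunk.strip():
            continue  # nothing before the first RS or between two RS characters
        try:
            yield True, json.loads(chunk)  # json.loads tolerates the trailing LF
        except json.JSONDecodeError:
            yield False, chunk  # damaged record: report it and keep going


# Hypothetical input: the middle record was cut off by a truncated write.
sample = (
    RS + '{"d":"2014-09-22T21:58:35.270Z","value":6}\n'
    + RS + '{"d":"2014-09-22T21:59:15'
    + RS + '{"d":"2014-09-22T21:59:15.117Z","value":12}\n'
)
results = list(parse_json_seq(sample))
```

Because every record starts with RS, the truncated record costs only itself: the reader picks up again at the next RS without having to guess where the damage ends.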

So basically, in the simplest case, when I know that the data is not corrupted, I can simply use concatenated JSON. I can use line separators too, and they'll just be ignored, as whitespace is in ordinary JSON.
The only requirement is that the parser accepts multiple documents.

Example 1:

{"id":1}{"id":2}

Example 2: two documents but formatted with a newline

{
  "id":1
}
{
  "id":2
}
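
A parser that accepts multiple documents can be sketched with Python's standard `json.JSONDecoder.raw_decode`, which returns the decoded value together with the index where it stopped. The helper name `iter_concatenated` is only an illustration, and the sketch assumes uncorrupted input, as in Examples 1 and 2.

```python
import json


def iter_concatenated(text: str):
    """Yield each JSON document found back to back in text."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it by hand
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value


example_1 = '{"id":1}{"id":2}'
example_2 = '{\n  "id":1\n}\n{\n  "id":2\n}\n'
docs_1 = list(iter_concatenated(example_1))
docs_2 = list(iter_concatenated(example_2))
```

Both inputs yield the same two documents, since newlines between and inside documents are treated as ordinary JSON whitespace.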

If the JSON may be corrupted, a newline can be used as a separator. But then it may be hard to tell whether a newline was used just for formatting or to split two documents.

Example 3: the first document is broken and has no closing bracket, but the \n still allows splitting them

{"id":
{"id":2}

Example 4: the first doc is broken, then comes a newline, and the second doc is formatted across several lines

{"id":
{
  "id": 2,
  "props": {
    "prop1": 1,
    "prop2": 2
  }
}

But visually we can still tell where the first doc ends and the second starts.
And we can use a simple rule: the sequence \n{ starts the next document, i.e. a { at the start of a line without indentation. When the \n is followed by some spaces, the current document continues until its closing bracket is found.
I think this simple rule should work almost always. In any case, indented JSON makes little sense for JSON streaming and is not expected.
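
A rough sketch of that rule, assuming documents are objects and that any nested `{` is indented. The helper names are hypothetical and this is not part of any of the specs discussed here.

```python
import json


def split_on_unindented_brace(text: str):
    """Split text into candidate documents at every line that starts with '{'."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("{") and current:
            chunks.append("\n".join(current))  # an unindented '{' opens a new document
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def parse_resilient(text: str):
    """Return (documents, broken_chunks) using the newline-then-'{' rule."""
    good, broken = [], []
    for chunk in split_on_unindented_brace(text):
        if not chunk.strip():
            continue
        try:
            good.append(json.loads(chunk))
        except json.JSONDecodeError:
            broken.append(chunk)  # keep the damaged text for inspection
    return good, broken


# Example 4: a broken first document followed by a multi-line second document.
example_4 = (
    '{"id":\n'
    '{\n  "id": 2,\n  "props": {\n    "prop1": 1,\n    "prop2": 2\n  }\n}\n'
)
good, broken = parse_resilient(example_4)
```

The rule fails when a producer pretty-prints without indentation, because a nested object would then also begin with `{` in the first column.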

If I need top-level values, the RS may be used optionally.
In any case, it is up to the producer to decide whether to use the RS or not.
A parser may simply be configured to require the RS if it expects top-level values or broken data, e.g. when it needs "unambiguous JSON". That is, this should be an option of the format, not a requirement.
As for me, the RS at the beginning still makes little sense for disambiguation, because with threading issues you may just get lines intermixed. It looks like overengineering. But it probably came from real-world usage and problems, so I'm not sure.

@nicowilliams you are the author of RFC 7464. Please give us your thoughts. Is it possible to publish errata for the spec?

cc: @hoegertn @finnp @wardi

Related: the idea of using application/json-seq as the MIME type for JSONL was already discussed in #19

The file extension: both ndjson and jsonl are easy to google. jsonl files are easier to pronounce, easier to read at first sight, and they'll also sort more naturally alongside existing json files.
The MIME type is json-seq, so a jsons file extension would be more appropriate, but it may cause confusion in conversation. So IMHO the existing jsonl is the better choice.

@wardi
Owner

wardi commented Feb 22, 2023

@stokito updating RFC 7464 as you describe sounds good to me.

@sp4ce
Collaborator

sp4ce commented May 5, 2023

Could we include the MIME type application/jsonl, which seems to be used already by others and is suggested in #19?

@sp4ce
Collaborator

sp4ce commented May 6, 2023

@stokito

Going by the issue in #65 (comment), I don't think jsonlines is moving in any direction that would allow incomplete records, empty lines, or other kinds of line breaks that don't separate valid JSON records.

I am not sure amending RFC 7464 would be valid in that context. The examples you gave seem to allow that.

To me, streaming JSON is a whole other problem. I think jsonlines is about a succession of valid JSON records, like a succession of API calls for batching input or reading some process results (we've been using it with Amazon Comprehend to manage training corpora, for example, or the recognition job inputs)
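
To make that strict reading concrete, here is a minimal sketch of a reader that treats every line as one complete JSON record and rejects anything else. The function name `read_json_lines` and the exact error messages are illustrative assumptions, not taken from the jsonlines.org text.

```python
import json


def read_json_lines(text: str):
    """Parse text as JSON Lines, rejecting empty lines and invalid records."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # one trailing newline after the last record is fine
    records = []
    for number, line in enumerate(lines, start=1):
        if line.endswith("\r"):
            line = line[:-1]  # tolerate \r\n line endings on input
        if not line.strip():
            raise ValueError(f"line {number}: empty line is not a record")
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: invalid JSON ({exc.msg})") from exc
    return records


records = read_json_lines('{"id":1}\r\n{"id":2}\n')
```

Under this reading, the broken-record and pretty-printed examples earlier in the thread are errors rather than something the reader recovers from.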

@ciscorucinski

Imagine the file extension being a format like .lines.json or .stream.json

Taking inspiration from:

  • Kotlin Gradle files (.gradle.kts)
  • Compressed tarballs (.tar.gz)

.stream.json keeps with the idea that .x.y means it is a y file but for x

@remram44

remram44 commented Aug 24, 2023

The difference is that a .gradle.kts file is a valid .kts file, and a .tar.gz is a valid .gz file. A lines.json is not a valid JSON file, since it contains multiple JSON objects. It needs to be split before it yields valid JSON documents.

So .json.lines would make more sense if anything.
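
A quick way to see the point, using only Python's standard `json` module (the sample content is made up):

```python
import json

lines_file = '{"id":1}\n{"id":2}\n'

try:
    json.loads(lines_file)
    whole_file_is_json = True
except json.JSONDecodeError:
    # the parser stops after the first object and reports the rest as extra data
    whole_file_is_json = False

# Splitting first is what yields valid JSON documents.
records = [json.loads(line) for line in lines_file.splitlines()]
```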

@ciscorucinski

Point taken.

I would still be for .json.stream. It's a higher-level concept that fits all current json streaming formats (I mean the concept is already called streaming).

> A good overview of all the streaming formats: https://en.wikipedia.org/wiki/JSON_streaming
>
>   • Plain concatenated JSON: every opening bracket must be paired with a closing bracket. There is no spec for this.
>   • NDJSON: separator \n (LF), parsers also accept \r\n. File ext: .ndjson, MIME: application/x-ndjson
>   • JSON Lines: separator \n, parsers also accept \r\n. File ext: .jsonl, MIME: none
>   • RFC 7464: file ext: none, MIME: application/json-seq, which is registered with IANA. Additionally, it uses an RS character.

Anyways, just throwing this out. Glad this concept has been seen. Seems like all the emoji reactions like the concept, but would just prefer it swapped around. I'm completely down for that.

@remram44

I would rather see an extension that specifically says which it is. We don't use .img for PNG, JPG, BMP, and TIF. Similarly I think those 4 (well, 3) different formats should have different extensions.
