This repository has been archived by the owner on Aug 5, 2024. It is now read-only.

Specify indexing/length units #83

Open
dmsnell opened this issue Jan 7, 2020 · 0 comments

This is a great library! Unfortunately, it has been ambiguous about what input it accepts and what it outputs. That is, while we know that it's "character based," we don't have a definition of "character." The Lua library even makes it clear that, since Lua is unaware of Unicode, it will treat content "as a series of bytes, not a series of characters."

This ambiguity has caused numerous problems for folks wanting to interchange delta strings, and it gets us into tricky situations when dealing with emoji and other characters that are encoded as surrogate pairs in UTF-16.

Consider this example:

A: 🅰🅰
B: 🅰a🅰

We can all agree that what happened is that an a was inserted between the two existing 🅰 characters.

Some libraries produce this delta: =1\t+a\t=1

  • Python3
  • Python2 when compiled in wide mode

Most libraries produce this delta: =2\t+a\t=2

  • Python2 when compiled in narrow mode
  • JavaScript
  • Objective-C
  • Java

I didn't check the others. This seems like enough to highlight the disparity in indexing and length calculations.
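The two results can be reproduced with a minimal Python 3 sketch. The helper names below (code_point_length, utf16_length, insertion_delta) are made up for illustration and are not part of any of the libraries; the sketch only shows how the same edit yields different numbers depending on how length is measured.

```python
# Hypothetical helpers for illustration only; not the library's API.

A = "\U0001F170"  # the 🅰 character, which lies outside the Basic Multilingual Plane


def code_point_length(text: str) -> int:
    """Length in Unicode code points (what len() returns for a Python 3 str)."""
    return len(text)


def utf16_length(text: str) -> int:
    """Length in UTF-16 code units (two bytes per unit, no byte-order mark)."""
    return len(text.encode("utf-16-le")) // 2


def insertion_delta(prefix: str, inserted: str, suffix: str, measure) -> str:
    """Build a keep/insert/keep delta using the given length function.

    Assumption: the inserted text needs no escaping, which holds for "a".
    """
    return "\t".join([f"={measure(prefix)}", f"+{inserted}", f"={measure(suffix)}"])


print(repr(insertion_delta(A, "a", A, code_point_length)))  # '=1\t+a\t=1'
print(repr(insertion_delta(A, "a", A, utf16_length)))       # '=2\t+a\t=2'
```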

I propose a new non-breaking change to indicate what the index and length values are measured in.

In my own work in #80 I discovered that clients decode a blank insertion group in fromDelta() without any trouble.

Therefore I propose that we send blank insertion groups at the front of a delta to indicate what the indexing and length values correspond to.

There are only three realistic measurement units:

  • Unicode code points (probably what would have been most ideal to use from the start)
  • UTF-16 code units (because most platforms and languages use this internally)
  • bytes (because that's the most agnostic way of measuring this)

In addition we should point out that the legacy behavior is to not report measurement units.

In my proposal we'd stick a number of empty insertion groups at the front of a delta to indicate which of those measurement units we want, in the order above:

  • one group to indicate Unicode code points (since Unicode is the nominal way to think about text here)
  • two groups to indicate UTF-16 code units (since these are two-byte units)
  • three groups to indicate bytes (because I don't know what to do about Lua other than to make it obvious)
  • no groups to indicate an unreported measurement (identical to all existing deltas)

Measurement units     Delta
Unicode code points   +\t=1\t+a\t=1
UTF-16 code units     +\t+\t=2\t+a\t=2
Bytes                 +\t+\t+\t=4\t+a\t=4
Unspecified           one of the above without the prefix

Note that these diffs should (might?) work in all existing libraries to produce the same result as they would without the leading + groups. However, this gives us a chance to update fromDelta() to support the denoted measurement units and then we can slowly migrate the client libraries to support returning their deltas in a requested unit.
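To make the prefix convention concrete, here is a rough Python sketch of how a decoder might detect the unit before handing the rest of the delta to the existing logic. Everything in it is an assumption made for illustration: the function names, the unit labels, and the choice to reject more than three leading empty groups are not defined by the proposal or by any of the libraries.

```python
from typing import Tuple

# Hypothetical mapping from the number of leading empty insertion groups
# to the measurement unit, following the order proposed above.
UNIT_BY_PREFIX_COUNT = {
    0: "unspecified",
    1: "code_points",
    2: "utf16_code_units",
    3: "bytes",
}
PREFIX_COUNT_BY_UNIT = {unit: n for n, unit in UNIT_BY_PREFIX_COUNT.items()}


def split_unit_prefix(delta: str) -> Tuple[str, str]:
    """Return (unit, remaining delta) by counting leading empty "+" groups."""
    tokens = delta.split("\t")
    count = 0
    while count < len(tokens) and tokens[count] == "+":
        count += 1
    if count not in UNIT_BY_PREFIX_COUNT:
        # Assumption: more than three empty groups has no defined meaning.
        raise ValueError(f"unrecognized unit prefix: {count} empty insertion groups")
    return UNIT_BY_PREFIX_COUNT[count], "\t".join(tokens[count:])


def add_unit_prefix(delta: str, unit: str) -> str:
    """Prepend the empty insertion groups that announce the given unit."""
    return "\t".join(["+"] * PREFIX_COUNT_BY_UNIT[unit] + [delta])


print(split_unit_prefix("+\t+\t=2\t+a\t=2"))  # ('utf16_code_units', '=2\t+a\t=2')
print(repr(add_unit_prefix("=1\t+a\t=1", "code_points")))  # '+\t=1\t+a\t=1'
```

A delta with no prefix passes through unchanged and is reported as unspecified, which matches the legacy behavior described above.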
