No new string implementations on the web: allow re-use of JS engine's strings #54

Open
annevk opened this issue Dec 2, 2022 · 13 comments

Comments

@annevk
Member

annevk commented Dec 2, 2022

I think this requirement is somewhat shortsighted. It seems probable that in time JS engines can have mostly UTF-8-backed strings, only using WTF-16 when string indexing is used and such. Tying Wasm to what JS engines do today seems rather unfortunate.

@jakobkummerow
Collaborator

I don't understand the concern here: this proposal is carefully designed to not tie Wasm to what JS engines do today.

If the proposal was to simply add WTF-16 strings to Wasm, then that would be a very valid concern.

But the design around encoding-agnostic stringref (with encoding-specific "views" to cater to the needs of various source languages) is specifically intended to let engines do "whatever they want" (in particular, use one or more internal string encodings of their choice) under the hood.

I think it's a reasonable goal to allow (not force!) any combined JS+Wasm engines to use the same internal string representation for both kinds of programs, regardless of the encoding(s) that such engines choose to use for their strings.

Please clarify where you see a danger of baking currently-typical JS implementation choices into the Wasm design.

@annevk
Member Author

annevk commented Dec 2, 2022

I see. But also, requirement 1 is "Zero-copy passing of strings from JavaScript to WebAssembly & back". How would that work for Latin1 strings for instance?

And I think you end up baking it in due to the expected performance.

@dcodeIO
Contributor

dcodeIO commented Dec 2, 2022

I don't quite see how the Latin1 optimization, which is merely (a subset of) WTF-16 with the high zero byte omitted, would interfere with zero-copy? Regardless of whether an engine uses the optimization or not, what's passed around remains an abstract reference. Makes me wonder, though, if there is value to allow a language to store to/load from linear memory in Latin1, in case a language shares the optimization and has a use case to preserve it in linear memory. But that's probably a different topic.

Performance depends, of course. A JS engine today might use Latin1/WTF-16, a JS engine tomorrow WTF-8 with breadcrumbs. That choice is left to the engine, though, since both cases are covered by the proposal and abstracted over. In a sense, this proposal appears to already provide for the evolutionary path you seem to have in mind.
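As an illustration of the Latin1 point above (Latin1 being WTF-16 with the zero high byte dropped), here is a minimal Rust sketch. The function names are hypothetical and are not part of the proposal or of any engine; real engines do this internally on their own string objects.

```rust
/// Widen a Latin1 buffer to 16-bit code units by restoring the zero high byte.
/// Every Latin1 byte maps to the code unit with the same numeric value.
fn latin1_to_wtf16(latin1: &[u8]) -> Vec<u16> {
    latin1.iter().map(|&b| u16::from(b)).collect()
}

/// Narrow 16-bit code units to Latin1 when every unit fits in one byte.
/// Returns None as soon as a code unit above 0xFF is found.
fn wtf16_to_latin1(units: &[u16]) -> Option<Vec<u8>> {
    units
        .iter()
        .map(|&u| if u <= 0xFF { Some(u as u8) } else { None })
        .collect()
}
```

Either way, the value passed across the JS/Wasm boundary stays an abstract reference; only the engine needs to know which of the two layouts backs it.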

@jakobkummerow
Collaborator

Yes, this proposal was definitely designed with the possibility of JS engines adopting UTF/WTF-8 strings in mind.

Regarding the requirements, I kind of read them all with an implicit "make it possible to ..." prefix. So the key point of requirement 1 is that a design that necessarily always requires a copy at the JS/Wasm boundary is considered inadequate. That doesn't mean that every single passing of a string in every single implementation must be zero-copy, just that the design allows sufficiently optimized engines to make most string passings zero-copy.

This is related to the general point of expected performance: given that there are multiple existing string encodings in the world with non-trivial mappings between them and lots of existing code that assumes one particular encoding, it's unrealistic to expect that all encoding conversions can be avoided. The most obvious counter-example is: when a Wasm module compiled from a WTF-16 based language is passing strings back and forth with another module compiled from a UTF-8 based language, then there will be some unavoidable friction: at some point, encodings will have to be converted. But each individual module doesn't know in advance with whom it will be exchanging strings. The point of this proposal's fundamental design is to allow engines to minimize the number of required conversions: if the encodings align (i.e. the engine's internal representation as well as all modules involved in the current application use the same encoding, regardless of which encoding that is), then no conversions at all need to happen. If different components assume different encodings, then some minimal number of conversions will need to happen -- that's unavoidable, no matter how stringrefs are designed. (In particular, "all stringrefs use WTF-16" and "all stringrefs use WTF-8" would both be strictly worse than the status quo in this regard.)
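A toy model of the "encodings align" argument, as a Rust sketch. It assumes a deliberately simplified world in which every hop between two components with different encodings costs exactly one eager conversion; the `Encoding` type and `conversions_needed` function are hypothetical names for illustration only, and a real engine could do better by converting lazily or caching several representations.

```rust
/// Encodings a component (a module or the engine itself) might assume.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Encoding {
    Wtf8,
    Wtf16,
}

/// Count the conversions along the path a string takes, e.g.
/// [producing module, engine, consuming module].
/// A conversion is counted whenever two neighbours disagree.
fn conversions_needed(path: &[Encoding]) -> usize {
    path.windows(2).filter(|pair| pair[0] != pair[1]).count()
}
```

Under this model, a path where every component agrees costs zero conversions regardless of which encoding they agree on, while a fixed "everything is WTF-16" or "everything is WTF-8" rule forces a conversion on every component that assumes the other encoding.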

@Pauan

Pauan commented Dec 5, 2022

> That doesn't mean that every single passing of a string in every single implementation must be zero-copy, just that the design allows sufficiently optimized engines to make most string passings zero-copy.

I don't see how that's possible. There are so many encoding differences between languages (and JS runtimes) that things are unlikely to line up perfectly, so I don't think a host can ensure zero-copy in most situations. It will be more of a happy surprise if it happens to work, but it can't be expected or relied upon.

If a language wants to guarantee that they are using JavaScript strings (and thus guarantee zero-copy transfers to the host) then they should use externref (which can be trivially converted into stringref). And unlike UTF-8 or UTF-16, this is guaranteed to always be zero-cost when sending or receiving from the host.

@jakobkummerow
Collaborator

> I don't see how that's possible.

The key instrument is letting engines perform conversions on demand (="lazily", as late as possible). Then, for example, if a string originates in JS and is passed into Wasm and back, it never needs to be converted (even without use of externref). If a string is assembled as the concatenation of two existing strings, it never needs to be converted. If it's constructed using one of the string.new_* instructions in this proposal, engines can immediately create it in whichever internal format they can later hand over to JS without conversions at the boundary -- or they can first allocate it in a different format if they think that's better, and wait and see if this particular string actually is ever sent to JS, and only then perform the conversion on demand. They can even choose to keep more than one representation of a string around.
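A minimal Rust sketch of the "convert on demand" idea described above. It is not how V8 or any other engine implements strings; `LazyString` and its methods are hypothetical names, and the sketch uses lossy UTF-8 conversion where a real engine would need WTF-8 to preserve lone surrogates.

```rust
/// Hypothetical engine-side string: keeps the representation it was created
/// with and materializes a second one only when somebody asks for it.
struct LazyString {
    utf16: Vec<u16>,      // representation the string originated in
    utf8: Option<String>, // filled in on first UTF-8 access, then reused
    conversions: usize,   // how many conversions actually ran
}

impl LazyString {
    fn from_utf16(units: Vec<u16>) -> Self {
        LazyString { utf16: units, utf8: None, conversions: 0 }
    }

    /// Handing the string back in its original form needs no conversion.
    fn as_utf16(&self) -> &[u16] {
        &self.utf16
    }

    /// The UTF-8 form is produced on first use and cached afterwards.
    /// (Lossy here; a real engine would keep lone surrogates via WTF-8.)
    fn as_utf8(&mut self) -> &str {
        if self.utf8.is_none() {
            self.conversions += 1;
            self.utf8 = Some(String::from_utf16_lossy(&self.utf16));
        }
        self.utf8.as_deref().unwrap()
    }

    fn conversions(&self) -> usize {
        self.conversions
    }
}
```

A string that only round-trips through `as_utf16` never pays for a conversion; one that is read as UTF-8 many times pays once, at the cost of holding two representations in memory.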

Time will tell whether my phrasing that "most" passings can avoid conversion is overly optimistic and "some" will turn out to be more accurate. With the existing amount of implementation experience (we have stringrefs implemented in V8, and J2Wasm is using them), I remain optimistic. I'm very curious to hear about experiences with different scenarios (other source languages, other engine implementations).

Generally speaking, implementing strings is a game of tradeoffs. Any given design will have certain limitations, that's obvious and unavoidable. Our task is to find a design that works well enough in practice to be useful -- it is impossible to find a "perfect" design in the sense that it would have no drawbacks and no limitations. The current proposal seems to be working well, given the data we have so far. If you have concrete suggestions for improvements, I'm all ears.

@Pauan

Pauan commented Dec 5, 2022

> Then, for example, if a string originates in JS and is passed into Wasm and back, it never needs to be converted (even without use of externref).

Wouldn't that require runtime information, so the host knows what the type is?

Also, that sounds really iffy to me... for example, consider a simple Rust function that accepts a String and returns that same String:

pub fn identity(x: String) -> String {
    x
}

It does not do any operations on the string, it just returns it back as-is.

And let's suppose that function is called with a JavaScript string. The Rust code will compile to Wasm code which takes a stringref and then copies it into linear memory (as UTF-8), and then copies the UTF-8 string back into a stringref, thus it performs two O(n) copies.
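The two copies can be made visible with a small simulation. This is only a sketch: a real stringref is an opaque reference, modelled here as a buffer of UTF-16 code units, and `identity_via_boundary` is a hypothetical stand-in for the glue code a toolchain might generate.

```rust
/// The function from the example above: it moves the String, no copy.
pub fn identity(x: String) -> String {
    x
}

/// Hypothetical boundary glue: decode the host string into linear-memory
/// UTF-8, call the Rust function, then build a fresh host string.
fn identity_via_boundary(host_string: &[u16]) -> Vec<u16> {
    // First O(n) copy: host string -> Rust String (UTF-8).
    let owned: String = String::from_utf16_lossy(host_string);
    // No copy here, ownership just moves through the function.
    let result = identity(owned);
    // Second O(n) copy: Rust String -> new host string.
    result.encode_utf16().collect()
}
```

The returned buffer has the same contents as the input but lives in a different allocation, which is the cost being discussed.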

I think it will be very hard to actually optimize it so that it doesn't do any copying (assume a more complicated case than just the identity function). Although the example I gave is for Rust, I think the same thing applies to most languages, since they have their own internal string representation which they have to convert into when receiving a stringref.

It might be possible for some languages to use stringref directly (without any conversions), in which case you are right that it could be zero-copy in some situations, but I still think that would require some runtime type information in the general case. And that is functionally the same as using externref, as I suggested above.

> Generally speaking, implementing strings is a game of tradeoffs. Any given design will have certain limitations, that's obvious and unavoidable.

I agree, and I think trying to optimize for the zero-copy case causes massive downsides for many languages, I don't think it's feasible in general. So I would be cautious about trying to guarantee too much with regards to zero-copy.

The important thing is interop: languages should be able to seamlessly use strings from any other language (and the host). That desire is fulfilled by this proposal, but at the cost of copying. I think that's a reasonable tradeoff.

@rossberg
Member

rossberg commented Dec 5, 2022

@Pauan, agreeing with what you're saying, except that seamless language interop is explicitly not a design goal for core Wasm either -- that's mostly impossible in general for many of the same reasons. Interop is what the interface types / component layer is intended for.

@jakobkummerow
Collaborator

> Wouldn't that require runtime information, so the host knows what the type is?

Yes. When JS is involved, everything requires runtime type information, because there is no static typing on the JS side.

> The Rust code will compile to Wasm code which takes a stringref and then copies it into linear memory (as UTF-8)

Yeah, that's not useful. I don't really know Rust, so I don't know whether it's feasible (or a good idea) to compile Rust's String to Wasm stringref, and I wouldn't bet on it. I'm more familiar with C++, and I certainly don't expect a C++ std::string to be compiled to a Wasm stringref. Maybe most modules compiled from these languages will simply ignore the existence of stringref (just like they'll ignore the existence of WasmGC, and possibly several other proposals).
That said, a possible way how C-like languages could benefit from stringref is by a class ExternalString (or similar) that's provided by emscripten and gets special treatment when compiling to Wasm, specifically for use in purpose-built interop code, so that Wasm modules at least gain the ability to say "I want to pass a string with the following contents here" (in a way that's much more convenient and robust than exporting their entire memory and saying "there's a string with xyz encoding starting at this offset").

> It might be possible for some languages to use stringref directly

Fully agree. Since stringrefs are GC'ed and immutable, this is an interesting option in particular for source languages that also have GC'ed immutable strings, such as anything running on the JVM, or on the CLR, or many functional languages, or script-y languages such as Python, Dart, etc.
It is much less likely that linear-memory languages (like C/C++/Rust) can compile their strings to stringref by default.

> I think trying to optimize for the zero-copy case causes massive downsides for many languages

Care to elaborate? And what alternative would you have in mind?

@Pauan

Pauan commented Dec 5, 2022

@rossberg

> Interop is what the interface types / component layer is intended for.

Even without interface types, programs can pass a stringref to other Wasm modules (which can be generated from any source language). The only requirement is that the Wasm module uses stringref at the boundary of the module. For example, I imagine wasm-bindgen will use stringref when converting a Rust String into a JS string.

So it does provide a seamless and standardized way of transferring strings. Interface types might make that workflow easier and/or more efficient, but stringref alone already solves the issue of interop, simply because it is standardized.


@jakobkummerow

> That said, a possible way how C-like languages could benefit from stringref is by a class ExternalString (or similar) that's provided by emscripten and gets special treatment when compiling to Wasm, specifically for use in purpose-built interop code.

Yes, that is what JsString does in Rust: it is a Rust type that is simply a reference to a JS string. It is equivalent to using externref, and thus it is zero-copy.

However, it's not very useful, because you can only call JS methods, you can't use any Rust methods on it, so in practice it is immediately converted into a Rust String (which is an O(n) cost).

> Care to elaborate? And what alternative would you have in mind?

If you desire to have zero-copy in the majority of cases, you have to consider that currently the most popular compile-to-Wasm languages are Rust, C++, and AssemblyScript (because they lack a runtime, and so they produce very small code). The only one of those languages which can take advantage of zero-copy is AssemblyScript.

So in practice most strings will require copying, and trying to solve that is a very non-trivial problem. I don't have any alternative, because I don't think it's a problem that is worth trying to solve. When languages use different representations, they must copy; that is a simple fact of life. I think the proposal is fine as-is; I'm just skeptical of the claims about ubiquitous zero-copy.

@rossberg
Member

rossberg commented Dec 5, 2022

@Pauan, yes, you'll need to agree on a common format at the boundary (i.e., specify an ABI), which you then need to copy to and from on both sides, because it will generally coincide with neither. A built-in string type doesn't provide any advantage for this purpose over picking any old array type. In particular, you'll need none of the built-in string operations.

A built-in Wasm string type would only make sense if it could actually be used by many languages internally. But as you say, that is impractical given the diversity of implementations and requirements.

@annevk
Member Author

annevk commented Dec 7, 2022

If language interop isn't a goal, I don't see why we'd cater to lone surrogates. They're a legacy leftover that has no place in new environments.

@jakobkummerow
Collaborator

Efficient interop with a web embedding is a goal, so we have to deal with the possibility of lone surrogates.

Also, as discussed above it's desirable to make stringrefs useful for as many source languages as possible (knowing that that won't be all languages), and some of them support lone surrogates. So there might be existing programs that (intentionally or accidentally) use lone surrogates, and people might now want to compile these programs to Wasm; it would be quite unfortunate (and, in fact, might well block deployments) if the Wasm version didn't behave the same as the original (or possibly even didn't work at all) because Wasm disallowed lone surrogates.
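To make the behavioural difference concrete, here is a short Rust sketch using only standard-library calls. It shows what happens to a 16-bit string containing a lone surrogate when the target string type only accepts well-formed Unicode; the helper names `decode_strict` and `decode_lossy` are hypothetical.

```rust
/// Strict decoding: rejects ill-formed UTF-16 such as a lone surrogate.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy decoding: replaces each lone surrogate with U+FFFD,
/// so the original code units cannot be recovered afterwards.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

A program ported from a language that tolerates lone surrogates would therefore either fail or silently change its data if the Wasm string type behaved like either of these helpers.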

5 participants