No new string implementations on the web: allow re-use of JS engine's strings #54
Comments
I don't understand the concern here: this proposal is carefully designed to not tie Wasm to what JS engines do today. If the proposal was to simply add WTF-16 strings to Wasm, then that would be a very valid concern. But the design around encoding-agnostic I think it's a reasonable goal to allow (not force!) any combined JS+Wasm engines to use the same internal string representation for both kinds of programs, regardless of the encoding(s) that such engines choose to use for their strings. Please clarify where you see a danger of baking currently-typical JS implementation choices into the Wasm design. |
I see. But also, requirement 1 is "Zero-copy passing of strings from JavaScript to WebAssembly & back". How would that work for Latin1 strings for instance? And I think you end up baking it in due to the expected performance. |
I don't quite see how the Latin1 optimization, which is merely (a subset of) WTF-16 with the high zero byte omitted, would interfere with zero-copy? Regardless of whether an engine uses the optimization or not, what's passed around remains an abstract reference. Makes me wonder, though, if there is value in allowing a language to store to/load from linear memory in Latin1, in case a language shares the optimization and has a use case to preserve it in linear memory. But that's probably a different topic. Performance depends on the engine, of course. A JS engine today might use Latin1/WTF-16, a JS engine tomorrow WTF-8 with breadcrumbs. That choice is left to the engine, though, since both cases are covered by the proposal and abstracted over. In a sense, this proposal appears to already provide for the evolutionary path you seem to have in mind. |
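To make the "WTF-16 with the high zero byte omitted" point concrete, here is a minimal sketch in Rust. The function names `latin1_to_utf16` and `utf16_to_latin1` are made up for illustration and are not part of the proposal or of any engine; the sketch only shows that widening is a per-unit zero extension and that narrowing is possible exactly when every code unit fits in one byte.

```rust
/// Widen Latin1 bytes to 16-bit code units.
/// Each Latin1 byte has the same numeric value as its Unicode code point,
/// so widening only restores the omitted high zero byte.
fn latin1_to_utf16(bytes: &[u8]) -> Vec<u16> {
    bytes.iter().map(|&b| u16::from(b)).collect()
}

/// Try to narrow 16-bit code units back to Latin1.
/// Returns `None` as soon as one code unit does not fit into a single byte.
fn utf16_to_latin1(units: &[u16]) -> Option<Vec<u8>> {
    units
        .iter()
        .map(|&u| if u <= 0xFF { Some(u as u8) } else { None })
        .collect()
}
```

Either representation describes the same sequence of code units, which is why an abstract reference can hide the choice from the Wasm program.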
Yes, this proposal was definitely designed with the possibility of JS engines adopting UTF/WTF-8 strings in mind. Regarding the requirements, I kind of read them all with an implicit "make it possible to ..." prefix. So the key point of requirement 1 is that a design that necessarily always requires a copy at the JS/Wasm boundary is considered inadequate. That doesn't mean that every single passing of a string in every single implementation must be zero-copy, just that the design allows sufficiently optimized engines to make most string passings zero-copy. This is related to the general point of expected performance: given that there are multiple existing string encodings in the world with non-trivial mappings between them and lots of existing code that assumes one particular encoding, it's unrealistic to expect that all encoding conversions can be avoided. The most obvious counter-example is: when a Wasm module compiled from a WTF-16 based language is passing strings back and forth with another module compiled from a UTF-8 based language, then there will be some unavoidable friction: at some point, encodings will have to be converted. But each individual module doesn't know in advance with whom it will be exchanging strings. The point of this proposal's fundamental design is to allow engines to minimize the number of required conversions: if the |
I don't see how that's possible. There are so many encoding differences between languages (and JS runtimes), so it's unlikely that things will line up perfectly, so I think it's just not possible for a host to ensure zero-copy in most situations. It will be more of a happy surprise if it happens to work, but it can't be expected or relied upon. If a language wants to guarantee that they are using JavaScript strings (and thus guarantee zero-copy transfers to the host) then they should use |
The key instrument is letting engines perform conversions on demand (="lazily", as late as possible). Then, for example, if a string originates in JS and is passed into Wasm and back, it never needs to be converted (even without use of Time will tell whether my phrasing that "most" passings can avoid conversion is overly optimistic and "some" will turn out to be more accurate. With the existing amount of implementation experience (we have stringrefs implemented in V8, and J2Wasm is using them), I remain optimistic. I'm very curious to hear about experiences with different scenarios (other source languages, other engine implementations). Generally speaking, implementing strings is a game of tradeoffs. Any given design will have certain limitations, that's obvious and unavoidable. Our task is to find a design that works well enough in practice to be useful -- it is impossible to find a "perfect" design in the sense that it would have no drawbacks and no limitations. The current proposal seems to be working well, given the data we have so far. If you have concrete suggestions for improvements, I'm all ears. |
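As an illustration of "conversions on demand", here is a small Rust sketch of a string that stores 16-bit code units and only builds a UTF-8 view when somebody asks for one. `LazyString` and its methods are hypothetical names; this is not how V8 or any other engine is known to implement stringrefs. As a simplification, lone surrogates are replaced by U+FFFD, which a real WTF-8 view would not do.

```rust
use std::cell::{Cell, OnceCell};

/// Hypothetical engine-side string: stored as 16-bit code units,
/// with a UTF-8 view that is built only on first request.
struct LazyString {
    units: Vec<u16>,
    utf8: OnceCell<String>,
    conversions: Cell<u32>,
}

impl LazyString {
    fn from_units(units: Vec<u16>) -> Self {
        LazyString { units, utf8: OnceCell::new(), conversions: Cell::new(0) }
    }

    /// Handing the string around only exposes a reference: no conversion.
    fn units(&self) -> &[u16] {
        &self.units
    }

    /// Convert on first use, then reuse the cached result.
    /// Simplification: lone surrogates become U+FFFD here.
    fn as_utf8(&self) -> &str {
        self.utf8.get_or_init(|| {
            self.conversions.set(self.conversions.get() + 1);
            String::from_utf16_lossy(&self.units)
        })
    }

    /// Number of encoding conversions performed so far.
    fn conversions(&self) -> u32 {
        self.conversions.get()
    }
}
```

A string that enters and leaves without anyone calling `as_utf8` is never converted, and repeated UTF-8 consumers pay for the conversion once.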
Wouldn't that require runtime information, so the host knows what the type is? Also, that sounds really iffy to me... for example, consider a simple Rust function that accepts a String and returns that same String:

    pub fn identity(x: String) -> String {
        x
    }

It does not do any operations on the string, it just returns it back as-is. And let's suppose that function is called with a JavaScript string. The Rust code will compile to Wasm code which takes a I think it will be very hard to actually optimize it so that it doesn't do any copying (assume a more complicated case than just the identity function). Although the example I gave is for Rust, I think the same thing applies to most languages, since they have their own internal string representation which they have to convert into when receiving a It might be possible for some languages to use
I agree, and I think trying to optimize for the zero-copy case causes massive downsides for many languages, I don't think it's feasible in general. So I would be cautious about trying to guarantee too much with regards to zero-copy. The important thing is interop: languages should be able to seamlessly use strings from any other language (and the host). That desire is fulfilled by this proposal, but at the cost of copying. I think that's a reasonable tradeoff. |
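The copying in the identity example can be sketched roughly as follows. The glue functions `from_host`, `to_host`, and `round_trip` are hypothetical stand-ins for whatever a toolchain would generate, and the host string is modelled as a plain slice of 16-bit code units rather than a real stringref.

```rust
use std::string::FromUtf16Error;

/// Hypothetical boundary glue: the host hands over 16-bit code units,
/// while Rust code wants an owned UTF-8 `String`.
fn from_host(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units) // allocates and transcodes: copy #1
}

/// Hypothetical glue in the other direction.
fn to_host(s: &str) -> Vec<u16> {
    s.encode_utf16().collect() // allocates and transcodes: copy #2
}

/// The function from the example above: it never inspects the string.
pub fn identity(x: String) -> String {
    x
}

/// Full round trip: two transcoding copies, even though `identity`
/// itself does no work on the string.
fn round_trip(units: &[u16]) -> Result<Vec<u16>, FromUtf16Error> {
    let s = from_host(units)?;
    Ok(to_host(&identity(s)))
}
```

Under these assumptions the result is equal to the input for well-formed strings, but it is a fresh copy, and input containing a lone surrogate is rejected at the first conversion.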
@Pauan, agreeing with what you're saying, except that seamless language interop is explicitly not a design goal for core Wasm either -- that's mostly impossible in general for many of the same reasons. Interop is what the interface types / component layer is intended for. |
Yes. When JS is involved, everything requires runtime type information, because there is no static typing on the JS side.
Yeah, that's not useful. I don't really know Rust, so I don't know whether it's feasible (or a good idea) to compile Rust's
Fully agree. Since stringrefs are GC'ed and immutable, this is an interesting option in particular for source languages that also have GC'ed immutable strings, such as anything running on the JVM, or on the CLR, or many functional languages, or script-y languages such as Python, Dart, etc.
Care to elaborate? And what alternative would you have in mind? |
Even without interface types, programs can pass a So it does provide a seamless and standardized way of transferring strings. Interface types might make that workflow easier and/or more efficient, but
Yes, that is what However, it's not very useful, because you can only call JS methods, you can't use any Rust methods on it, so in practice it is immediately converted into a Rust String (which is an
If you desire to have zero-copy in the majority of cases, you have to consider that currently the most popular compile-to-Wasm languages are Rust, C++, and AssemblyScript (because they lack a runtime, and so they produce very small code). The only one of those languages which can take advantage of zero-copy is AssemblyScript. So in practice most strings will require copying, and trying to solve that is a very non-trivial problem. I don't have any alternative, because I don't think it's a problem that is worth trying to solve. When languages use different representations, they must copy, that is a simple fact of life. I think the proposal is fine as-is, I'm just skeptical of the claims about ubiquitous zero-copy. |
@Pauan, yes, you'll need to agree on a common format at the boundary (i.e., specify an ABI), which you then need to copy to and from on both sides, because it will generally coincide with neither. A built-in string type doesn't provide any advantage for this purpose over picking any old array type. In particular, you'll need none of the built-in string operations. A built-in Wasm string type would only make sense if it could actually be used by many languages internally. But as you say, that is impractical given the diversity of implementations and requirements. |
If language interop isn't a goal, I don't see why we'd cater to lone surrogates. They're a legacy leftover that has no place in new environments. |
Efficient interop with a web embedding is a goal, so we have to deal with the possibility of lone surrogates. Also, as discussed above it's desirable to make stringrefs useful for as many source languages as possible (knowing that that won't be all languages), and some of them support lone surrogates. So there might be existing programs that (intentionally or accidentally) use lone surrogates, and people might now want to compile these programs to Wasm; it would be quite unfortunate (and, in fact, might well block deployments) if the Wasm version didn't behave the same as the original (or possibly even didn't work at all) because Wasm disallowed lone surrogates. |
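A short Rust sketch of why lone surrogates matter for "behaves the same as the original": a strict UTF-16 decoder rejects them and a lossy one silently changes the content. `decode_strict` and `decode_lossy` are illustrative wrapper names; `String::from_utf16` and `String::from_utf16_lossy` are the standard library calls they wrap.

```rust
/// Strict decoding: fails if the input contains a lone surrogate.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy decoding: every lone surrogate is replaced by U+FFFD,
/// so the original code units cannot be recovered afterwards.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// True if decoding lossily and re-encoding gives back the same code units.
fn survives_round_trip(units: &[u16]) -> bool {
    decode_lossy(units).encode_utf16().collect::<Vec<u16>>() == units
}
```

A program that stores `[0x0061, 0xD83D, 0x0062]` in a string would either fail or observe different contents after such a conversion, which is the kind of behavioral difference the comment above is concerned about.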
I think this requirement is somewhat shortsighted. It seems probable that in time JS engines can have mostly UTF-8-backed strings, only using WTF-16 when string indexing is used and such. Tying Wasm to what JS engines do today seems rather unfortunate.