Context locals redux #66
To summarize the difference between thread-local globals and this design, it seems that thread-local globals could be lowered to context locals by moving all the thread-local globals of a module into the context for that module's declared functions, then inserting glue code to switch contexts on all cross-module calls where the caller and callee contexts differ. A benefit of thread-local contexts is then that no switching is required on cross-module calls where the contexts do not differ. As an optimization, a thread-local global implementation could potentially elide the implicit context switch in the analogous case where the callee's thread-local globals are all imported from the caller or vice versa, although checking this condition would still be some amount of work. The other benefit of thread-local contexts is that there are no shared-to-unshared edges involved, although this benefit will be moot if users end up needing strong thread-bound data references anyway. The primary downside compared to thread-local globals is that thread-local contexts make cross-producer calls more difficult to produce because the thread-local contexts of each producer become part of their function signature ABI. Another downside is that these cross-producer calls remain expensive in the case that the separately produced modules are merged. Does that all sound correct?
The most fundamental disadvantage of context locals seems to be that they are inherently unmodular, as implementation details of local state leak into interfaces. Moreover, IIUC this leakage is transitive, so it inevitably requires whole-program knowledge to be able to funnel everything through to everywhere. Cross-producer calls are just a special case of this very general problem. This is further elevated by the fact that there is no mechanism to abstract or parameterise over the details of the TLS (like e.g. a monad type would in functional programming with state passing). Would a module ever be able to change or extend its TLS without breaking all clients? I believe some such abstraction capability would be the bare minimum to make this approach scale, at least if we ever want Wasm to be able to express libraries.

The other big problem I see is that this interacts poorly (read: not at all) with stack switching. If a caller A1 with TLS calls B, which then suspends and perhaps gets resumed by A2 which has different TLS(*), how would one update the "current" contexts in the suspended call chain of B? This is the full problem of dynamic scoping surfacing, and how to make it compose correctly with other effects.

(*) Not just different values, but possibly different context shape.

(As an aside, can we avoid describing the problem in terms of JavaScript? Either it is generally relevant to host interaction, then it should be described as such, or it is not, then frankly it has no place in Wasm.)
In a whole-program/closed compilation scenario, all functions would have compatible context locals, so no switching would be needed. More generally, it would be possible to come up with an ABI for separate compilation using indirections like putting thread IDs and structs containing host functions inside context locals instead of direct TLS data and top-level host functions. It becomes a toolchain game - you get a more composable ABI (avoiding switches) by adding indirections, or you can accept the cost of switches in exchange for an ABI with fewer indirections in the TLS.
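A rough sketch of the kind of indirect ABI meant here, using the strawman `context` syntax from the proposal sketch later in this thread (the `$host-fns` type and its fields are hypothetical):

```wat
;; All producers agree on one indirect context shape, so cross-module calls
;; need no context switch; TLS data lives behind the thread ID, and host
;; functions live behind a single struct reference.
(type $host-fns (struct (field $log funcref) (field $now funcref)))

(func $f
  (context
    i32               ;; thread ID, indexes user-managed TLS tables/memories
    (ref $host-fns))  ;; host functions, instead of direct top-level imports
  ;; ...
)
```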
Thread-bound data references alone aren't enough for unshared host calls, which you also get from context locals. I agree that if we get both thread-bound data and functions (both with sufficient efficiency/inline-ability), the only remaining benefit of context locals is TLS analogous to thread-local globals.
Yes, although see above regarding my point about composable ABIs. My impression is that cross-producer calls already need very careful ABI coordination.
If desired, you can avoid the need to switch contexts in the merged module by also merging the contexts by concatenating them and shifting relevant access indices. Think of this as like concatenating the globals of the modules. If there is a good enough ABI it would even be possible to de-duplicate.
I don't agree that they're unmodular - context locals can effectively be desugared into regular function arguments (although this would be far less efficient) plus some additional type system cleverness to make continuations work better.
The choice on whether/how to abstract TLS is made at the toolchain/ABI level. For example the context locals could just be used to hold a thread ID, with all TLS state managed in tables and memories (and thus a module could easily extend them without changing the shape of the context). Context locals are a very general feature.
One of the main motivations of this proposal is to allow exactly this kind of switching. I'd emphasise again though that the toolchain/ABI has a choice as to how compositional its contexts are. In the "normal" case, I'd expect A1 and A2 to have compatible contexts, but the capability of switching is there for general compositionality.
The motivation based on successful compilation of TLS is totally host-agnostic - it's purely about successfully preserving the semantics of source languages being compiled to Wasm. Thread-local globals are the alternative solution for this problem. The fact that context locals also allow host interaction where the host can only provide unshared functions (and I'll unashamedly hold up JS as the key example) is an important additional motivation, but the TLS motivation stands on its own.
To do that, though, we don't need much from Wasm. We can code up a thread-id-indexed map in user space and pass down a single reference (which ought to be fairly cheap). The only argument for making it primitive would be performance. But if we still need indirections through untyped maps, then is there a sufficient win?
Well, sure, but that doesn't make it modular — emulating global state with function arguments isn't modular either.
That sounds like you are still assuming some form of whole-program compilation or whole-program linking. I don't see how any of this can work with separate compilation and regular, let alone dynamic, linking, except by using untyped maps for contexts, bypassing most of this feature.
Perhaps I don't understand how the checking is supposed to work. How would the language detect that a suspend/resume/switch switches to a continuation that (currently) expects a different context? Is the check static or dynamic? If the former, wouldn't context types have to bleed into continuation and function reference types everywhere? If it's dynamic, where is the information about the current context type of a stack stored? Does a context switch write the current type to that stack somewhere, for the stack switch to retrieve it and perform the check?

And I don't understand how the implementation can be tied to instances. When we have shared functions and instances, then there can be multiple functions originating from the same instance but running in different threads, active at the same time, such that each of them has to see a different copy of the TLS at the same time. AFAICS, TLS-style state has to be tied to stacks, not instances — for both threads and stack switching to work correctly.

Moreover, the OP implies that this feature avoids the problem of "ambitious schemes" for initialisation of TLS. But how? Where is the TLS context initialised for a new thread? Doesn't that require all thread creation points to know (and be able to access in user code) all TLS initialisers? How would this work with external thread creation, how with internal thread creation? How would this not introduce cyclic dependencies between modules in general?

Sorry if I'm being dense. :)
I'm not convinced. AFAICS, the only real choice toolchains are given with this feature is between unmodular (whole/closed-program) or untyped (map lookup), and both are already possible without it.
The thread ID itself still needs to be held by the thread somehow - without context locals you would need to either directly propagate the ID through execution as a function argument, or support a thread-local host call (e.g. …).
Maybe the missing piece here is the ability for a module to optionally bind (some?) context locals at export-time (restricted to const expressions only), so that the exported function doesn't need host intervention (through `context_bind`).

EDIT: the interaction with stack switching would require some thought, but I believe it would actually be ok for bound context locals to be captured in a continuation - the type of the resulting bound function just can't be `shared`.

I'm thinking of context locals as part of the interface of the module, like imports and function arguments. I agree that their use should be kept to a minimum, in the same way that global state via function arguments should be avoided as much as possible. However if we don't get thread-local globals and functions, unless we consider an additional feature like context locals, the only option we have left is a …
Yes, the context locals (if declared and unbound) would be part of the function type, like function arguments. All checks are static.
Sorry, the "ambition" here is referring to issues we've discussed around supporting thread-local globals as Wasm language-features in the runtime. Smaller features like context locals push the engineering of this into user-space. The question of how to run TLS initialisers is really hard in general, both in the runtime and the userspace. If a module wants to do TLS in a totally transparent way for separate compilation purposes, every entry into the module's code (or alternatively every TLS access) needs to be guarded by a "has my TLS on this thread been initialised yet" check. It's just a question of whether this check is done by the runtime as part of Wasm's semantics, or in userspace. The Wasm runtime is essentially always forced into this worst-case scenario, but userspace Wasm code could make optimisations to this scheme based on ABI knowledge, toolchain coordination, etc.
With the features we have today, the only way to do TLS and unshared host access is to thread everything through as function arguments, and even this doesn't work if we want to do work-stealing with `shared` continuations.
To try and make this concrete, I've been imagining Emscripten C++ could have a single ABI mandated context of:
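In the strawman syntax, roughly (types illustrative):

```wat
(context
  i32         ;; shadowStackPointer: very hot, mutable
  i32         ;; tlsBase: pointer to the current C++ module's TLS in linear memory
  externref)  ;; webUnsharedFunctions: untyped JS map of unshared web functions
```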
The first field is very hot and also mutable (which makes it not a good candidate for passing by param). The second field would be a pointer into linear memory where the current C++ source module's TLS data can be found. The third field would be an untyped JS map containing all the unshared web functions. All C++ functions can agree on these parameters, and the hottest values are typed, while the least hot value (the odd case of web unshared functions) is untyped. You should be able to separately compile C++ functions to use this type and link them together.

As part of this, we are pushing the responsibility for updating the context locals into user space. Context locals are just providing the fast mutable storage that's scoped to a call stack. So the tlsBase would need to be updated when one C++ source module calls into another, but not necessarily the shadowStackPointer or webUnsharedFunctions. This is identical to the work the engine would have to do if we had TLS globals in the spec, but the engine knows less than the toolchain IMO, and so we'd do a poorer job at it.
One small note, this may instead need to be something like a …
I think the main argument in favor of thread-local globals is that it avoids changing the toolchains' existing core wasm ABI. If thread-local-globals were free and didn't have the highly non-trivial implementation implications we've been discussing, I'd be in favor of it. But if there is a lower-level mechanism that is less magic with a more predictable cost model, I think we should do that (it is the wasm way) and context-locals seem to fit the bill.

There's also a vaguely anti-modular aspect of thread-local globals that I think is concerning: if I'm calling a module that I want to treat as a black box, the identity of the thread I'm calling on really shouldn't matter -- I should be able to call that module on any thread I want. But if the engine is implicitly creating mutable storage locations for me at the boundary that are tied to the identity of the calling thread, the caller's thread identity now matters in a way that really feels like it's breaking some sort of encapsulation boundary (or parametricity property) that you'd naturally expect -- it's like it's an implicit function argument that you can't avoid passing. It also means that implicitly-created thread-local storage locations have an ambiguous lifetime without a good point to call a destructor (in fixed-thread-pool scenarios, this might seem fine, but once you have a …).

Considering these problems in a cross-component setting (where we're intentionally aiming to be cross-ABI, cross-language, with black-box reuse), the right answer seems to be to treat each cross-component call into core wasm as-if it was on a fresh thread (regardless of the caller's actual thread identity), so that TLS never gets reused and can always be eagerly destroyed -- anything else leads to leaks or requires ad hoc gross hacks. I know this is a component-level argument so maybe it doesn't directly apply at the core wasm level but, IIRC, it sounded earlier like @tlively came to a similar realization in a totally different context; I'd be interested to hear more about that.
Yes, I'm totally on board with avoiding the problems of "true" thread-local globals. It appears we are all on the same page that the alternative is some form of dynamic scoping. However, my impression is that context locals are an attempt to provide dynamic scoping "cheaply", but cutting so many corners that the result does not interact correctly with other features, while also leaking heavily into types and interfaces, and hence would only work in narrow cases.

In particular, I'm still puzzled how tying context locals to instances can behave correctly with threading or stack switching. I'm pretty sure it can't. But if they are connected to stacks instead, then the mechanism is more closely related to what @tlively presented to the stacks group for dynamic scoping a while ago, with all its implications. I doubt that we'll still get much benefit out of declaring context locals at that point.
Oh, this is a good point. My implementation sketches above don't work if the instance is shared across threads, because everything just gets clobbered. @lukewagner @eqrion did you have an implementation scheme in mind that I've not correctly reproduced here? Would we need a second reserved slot for the context that's separate from the instance? This wouldn't necessarily regress existing code, since existing functions without contexts wouldn't need this slot.

EDIT: @tlively @rossberg do you have a link to the previous dynamic scoping presentation, or a brief description?
Agreed that neither the context (nor a pointer to the context) can be stored in a shared instance. Because it's fixed size, the context can be stored at the point it's created on the stack (as part of the trampoline that enters wasm from the host, or in the stack frame of wasm code that performs `context.switch`).

There is a tradeoff engines have to make, though, between:

- keeping the context in memory and threading a pointer to it through calls, or
- pinning the context's fields directly in registers.
The former has less register pressure but more indirection. There's also hybrids that avoid pinning registers inside function bodies (letting the register allocator do its thing instead). Roughly the same tradeoff exists for the thread-local globals, I should add -- I think this aspect of the designs is the same.
Oh, so you also meant to return the current context everywhere. That makes more sense, but how would that work with context.switch? That would have to back up the previous context and restore it upon exit. Where else would it save it but in the stack? And how would it get updated there when the outer context is changed in the meantime, e.g. through a suspend/resume? Concretely, consider the following pseudo code:
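(A sketch of the kind of code in question, assuming a single-`i32` context, a block-form `context.switch`, and a hypothetical `$print` import; the code performing the final resume is assumed to have a current context holding 3:)

```wat
(func $f (context i32)              ;; entered with a current context holding 1
  (call $print (context.get 0))     ;; prints 1
  (context.switch (i32.const 2)     ;; block-scoped switch to a context holding 2
    (call $print (context.get 0))   ;; prints 2
    (suspend $yield)                ;; resumed later by code whose context holds 3
    (call $print (context.get 0))   ;; dynamic scoping: 2 -- threaded context: 3
  )
  (call $print (context.get 0))     ;; dynamic scoping: 3 -- threaded context: 3
)
```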
With a correct implementation of dynamic scoping this has to print 1,2,2,3. How can that be implemented by just threading a single context pointer?
One thing I didn't accurately reproduce from @eqrion's original pitch is that the exit point of the `context.switch` block requires the previous context to be restored explicitly (e.g. from values the user saved in locals), rather than being saved and restored by the engine.
@conrad-watt, I see, but that seems odd. Then in my example, …

Not sure I follow how this helps with unshared edges, since the user saving the context in locals induces the same edges as the engine doing it internally would? I agree that the interaction with exceptions is also an issue, either for the engine or for the user.

Regarding type checking, either way, this all essentially requires an effect type system (or technically, rather a co-effect system, but that is largely interchangeable): every instruction, function, and continuation type would have to be annotated with its context assumptions. That is quite a heavyweight and intrusive change. And even with the fairly small context that @eqrion sketched above it would substantially increase the size of all function types, unless there is a way to factor out the context definition, e.g., with a new form of type definition that can be referenced from different function types.

But when the context has to be put on all function types anyways, there is no need to have a new form of context local declaration. Instead, the body's initial context would be determined by the function's own type, like with parameters.
I think I described this in a messy way in the OP, but this is what I was envisaging - the context (if declared) is actually a new component of the function type (the "declaration" becomes part of the type, like the "declaration" of parameters in the text format). With this explanation, does the block version of `context.switch` make more sense?

EDIT: …
The user is simply prevented from saving …
No, that is unrelated. If it doesn't preserve the binding structure but just destructively assigns, then it shouldn't be a block but simply an assignment. The shape of the context could be changed regardless, that's just strong update, and effect typing could handle that just fine. We'd already need to enforce that the shape of the context is the same for all join points of branches, so getting rid of the block doesn't change much. The logic would be similar to how we handle uninitialised locals.
Why can't that likewise be a type error if saved implicitly?
How is this made safe in the presence of exceptions? With the block approach I can see how a …

EDIT: another way of thinking about this might be to have a "call with different context + finaliser" instruction, rather than a block-level `context.switch`.
If it's just a regular assignment, then it doesn't even need any additional feature, because it could be handled by encoding a regular finally as usual with try-catch_all-rethrow in user space, like elsewhere. If, on the other hand, it is proper dynamic scoping, then the engine would implement it internally. It is only with this weird block construct that it is a problem, I think.
Given the substantial and cross-cutting complexity of introducing an entire effect system just for this, couldn't we make context access dynamically checked? That would be very straightforward by reusing tags: you'd define the shape of a context as a tag, and there would be 3 simple instructions to manipulate it:
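(Instruction names and shapes illustrative:)

```wat
(tag $ctx (param i32 externref))  ;; a tag defines the shape of one kind of context

;; context.switch $ctx -- pop values matching $ctx and install them, together
;;                        with the tag, as the current context (destructive update)
;; context.get $ctx i  -- check that the current tag is $ctx (else trap), then
;;                        push field i of the current context
;; context.set $ctx i  -- check that the current tag is $ctx (else trap), then
;;                        pop a value into field i of the current context
```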
That's it, no extension to the type system or validation, no new declaration, no block structure, completely orthogonal to the rest of the language. The only cost is one check per get/set (comparing the expected against the current tag), which is well within the bounds of what we have been willing to accept in other places.
That seems to only work if we add a full inter-block effect type system for this feature (with join annotations and so on, given our previous decisions). The switch to the new context and the later restoring of the old context need to be tightly paired (since they correspond to crossing an ABI/compilation unit boundary), so I still think a structured block or call-level instruction is a feasible solution and far less disruptive. Maybe the call-level instruction is less objectionable?
This might be an acceptable solution, but it's obviously preferable to avoid checks if we can get away with it!
Honestly, I think static checking is gonna be a rabbit hole and its complexity an order of magnitude too high for it to be justified for a corner-case feature like TLS. Especially without measurements, given that the dynamic check is quite optimisable: it can easily be hoisted and subsumed in straightline code. It's only necessary before the first context access in a function, after a suspend/resume, and after a call to a target that might have switched it (which in practice typically means only after calls to imports or indirect calls). So it should be really cheap.
I've been thinking more about the dynamic tag idea. I'm coming around to it :) At the very least it's something that seems quick to prototype and there are clear levers that can be pulled to determine how bad the overhead of the dynamic check is (similar to our cast benchmarking in GC). I think we would still want a …

What do others think?
No comment on whether the static checking is feasible or not, I'm thinking over the back and forth you had with Conrad on it.

(1) The main motivation for the static checking is the very hot "shadow stack pointer" in linear memory languages. Mutable access to that happens very frequently, and the dynamic cast behavior would be really unfortunate if we could avoid it. For other thread/task/realm local things, I could believe that some amount of dynamic checking is acceptable.

(2) The other major motivation for a "context" feature in the VM was a solution to the problem of how to invoke unshared functions from a shared continuation (unshared continuations could use params to thread all unshared state). This worked by letting the context hold unshared values, but only accessed within some sort of barrier to prevent the unshared values from leaking, then inheriting a new context (with unshared values from that thread) when resumed on another thread.

For problem (1), it seems like we could split that off and ask toolchains to change their ABI and thread through their own "context" linear memory pointer as their first parameter. That would point to the shadow stack pointer (and any other state they want). This would have a cost for the extra param, but if it subsumes the VM passing an implicit context param it might be net neutral. The static-typed context would be nicer, but doing it in user space could work.

For problem (2), the dynamic context feature that Andreas sketched could solve this problem as well, and I'm less worried about the cost of dynamic checking in this case. If we're inherently doing a dispatch to some thread-local JS/host function, I don't think there is a way to avoid some dynamic checking cost.

So, maybe VM dynamic context + user-mode static context better solves this use-case?
Just commenting on Andreas's code example above, with the semantics I had in mind, the output would be 1,2,3,3 because there is just a single "current context" threaded through all call/suspend/resume/return control flow. I think that means we don't want to think of `context.switch` as establishing a scoped binding at all.
@lukewagner, given that the instruction is supposed to be able to switch to a context of different shape (and back), that interpretation would be unsound. But I think we're on the same page that its block structure is inappropriate if the intended semantics and implementation is destructive update.
I'm still surprised by the resistance to a block instruction (or alternatively, I should emphasise again, a callsite-scoped …).

FWIW I would interpret the code above as printing 1,2,3,X (with a caveat about the X below). The required shape of the context for the continuation of … So for example:
… is well-typed and prints 1,2,3,X, but the final resume would be a type error if the line immediately above was …
@conrad-watt, I don't follow. How can the last …
Oh, sorry - I was focussing on the third print and just autopiloted everything else. I've edited my comment to correct my example and explicitly add the context type for …
How is that clearer than or in any other way preferable to:
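(Presumably something like the following straight-line form, with the exact instructions illustrative: save the old context, destructively switch, call, restore.)

```wat
context.get 0     ;; save the old (single-i32) context in a local
local.set $old
i32.const 2
context.switch    ;; destructively install the callee's context
call $g           ;; the cross-ABI call
local.get $old
context.switch    ;; explicitly restore the previous context
```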
The latter looks simple because we're dealing with straight line code, but as soon as we involve blocks, loops, and exceptions, explicit join annotations for contexts are needed. We can't even get away with the "forgetful" intra-block semantics of non-nullable locals - the annotations have to appear everywhere for soundness. Since the only purpose of context switching is to facilitate a call to a module with an incompatible ABI, this seems overly disruptive to the language. If everyone is on board with going all the way to join annotations, I guess I wouldn't be totally against the idea.
Actually, I guess we could have an intra-block semantics, if the type system forces user code to restore the previous context before each block exit. Is this something you were considering?
Well, an all-out (co-)effect system is already overly disruptive to the language, this would almost be a minor detail at that point. (Conceptually, those context annotations on blocks would even come for free, since block types are function types, which would already need to be enriched with contexts.)
Yes, the block nesting is something I considered. It's no more restrictive than the syntactic nesting of a block-like context.switch. But it also seems like an unnecessary restriction.
Oh, but then we'd definitely need block annotations anyway to make the example expressible, otherwise the jump to the corresponding handler at the resume site could not be typed flexibly enough. As I said, it's a rabbit hole.
Ah, that's a good point. I guess this means I agree the block …
So I've been talking through the implications of the full effect system rabbit hole with @titzer, and it really does seem quite scary - in particular the interaction with "legacy" functions that don't have a context declared. Think about typing a call from a function with a declared context, to a function without a declared context. How do you ensure that, when the called function returns, the context is in the right shape? It seems like you need to either interpret "no declared context" in some kind of ambitious polymorphic way, or alternatively prevent functions with no context from calling …

This leads me back to the block …

I think this leaves me hoping that we can find a way to reasonably implement thread-local globals and functions. @tlively mentioned that he would try to gather some V8 feedback, which I'm now crossing my fingers over. If this doesn't work out, I think my second inclination would be towards the dynamic approach @rossberg sketched here, although I'd be worried that engines might heavily lean on speculative optimisations to facilitate fast accesses/inlining.

EDIT: if anyone feels the block …
I'm probably missing something but: if a function with a declared context calls a function without a declared context, isn't the latter equivalent to a function with a declared-empty context? Thus, before the call, the caller has to switch to an empty context and then on return, the context is known to be empty and so the caller has to switch back to their original context.

Amending what I said earlier, I see how this needs a block instruction for switching but, importantly, I think the semantics of this block isn't dynamic scoping or algebraic effects at all; it's typing an implicit function parameter and result -- it's nothing you couldn't polyfill by threading a …

You also mention "extremely inflexible typing"; how does this problem arise in practice? My assumption here has been that, because the toolchain ABI is going to mostly fix a single context for all functions, the only switches that need to happen will be at coarse-grained boundaries between code compiled with separate ABIs and thus we don't need to do any fancy polymorphic things like you might need to do with a source-language-level effect system. (That being said, if necessary, I could imagine that we could allow prefix-subtyping of contexts and then a dynamic block-scoped downcast with the same semantics as-if it were a GC struct ref.)
Yes, with a block …
I personally agree with this perspective, but it certainly feels like a lower level of compositionality than we've previously (collectively) accepted in Wasm features. As one concrete example, as a natural consequence of the typing rules you can't even …
Actually I realise that I've failed to hold some of our previous discussion in this thread in my head. First, the semantics for …

On the other hand, if you assume a semantics where the parent context is implicitly saved, this saves you in the "legacy" case, but is problematic for shared continuations since you might capture an unshared thing.
I agree that the typed approach is scary — but the reasons you considered are not even the real problem. It sounds like you expect that we could get away with using unannotated functions/blocks/tags/references/tables/etc in a polymorphic manner. I'm pretty sure that won't fly. As I see it we'd certainly need to annotate everything everywhere, and that's not practical and has terrible composability. I'd make a bet that no attempt to partially hide away the annotations is gonna work properly, block switch or not. With regard to types it's all or nothing.

@conrad-watt, what you observe above is that the contravariant nature of explicit context restore would immediately destroy even the simplest form of subtype polymorphism on contexts. Not that subtype polymorphism is expressive enough in the first place, as we can learn from the practical failures of effect systems. And the problem of avoiding unwanted capture in the implicit case looks related to the problems the Scala folks ran into with their latest capability-based attempt of typing effects, which requires less annotation but then is too weak to prevent unwanted effect capture unless you start statically tracking "capture sets" everywhere.
Yes, I think it's reasonable to assume that, like …
@lukewagner we'd still have to work out how to deal with the "fork" I outlined above.

If switch block exit explicitly restores the parent context values, then interoperation between "context-annotated" and "legacy" (current) functions seems to break. I think you'd have to brutally spec that …

If switch blocks instead implicitly save and restore the parent context, the obvious semantics for this is incompatible with shared continuations. I think you'd have to interpret …
My two cents: I was intrigued by the notion that we could statically type contexts because it allows various engine optimizations that are basically impossible to do otherwise. For example, working back from the machine code that I'd like to get for certain patterns, statically-typed contexts allow defining what are essentially a set of callee-saved registers across the scope of functions that share a context. That is useful, e.g. to implement an interpreter or state machine that is spread over many functions (think dozens or hundreds) but nevertheless shares a large amount of common state (i.e. the interpreter or state-machine state). A statically-typed context could be register-allocated to a fixed set of registers without an inter-procedural analysis. Since contexts are effectively tied to stacks, they represent a new type of storage that is difficult to emulate efficiently another way.

That said, working through some of the type system issues with @conrad-watt over the past few days, I am starting to agree with @rossberg that polymorphic effect typing might end up being a quagmire.
@conrad-watt If we stick with the intuition that contexts are threaded into and out of each call as-if by a |
All typing issues aside, what's missing in this discussion is somebody actually working through a formulation of a scheduler implementing green threads and TLS, based on (some variant of) stack switching combined with this feature. I suspect that won't be possible with some of the restrictions that have been suggested.

I also suspect that any block-based or otherwise well-bracketed context semantics is inherently incompatible with the use of a symmetric direct stack switch for implementing scheduler-less green threads, like some folks envision it. Moreover, my guess is that such green threads will want some kind of contextref to be able to switch between threads with minimal cost. Unless contexts have proper dynamic scope, then (and I believe only then) this isn't needed.
@lukewagner ok, I think interpreting existing functions as having an empty context, with no fancy subtyping or polymorphism, works. I think you would still need explicit context annotations on all blocks and tags (to ensure things like inappropriate …

@rossberg I think if we went all the way down the route of switch block (with annotations on blocks, functions, and tags), the whole toolchain would most often just fix a single context across all the functions it knows about, like the ABI @eqrion sketched above. Anything much more complex would likely fail to type (especially in the presence of exceptions/continuations). The boundary between toolchains/ABIs can be crossed with an explicit switch block, but the type system would severely restrict how control flow can go across this boundary. So in your example the scheduler and all the threads scheduled on it would need to agree on a context shape (or at least a very coarse-grained switching discipline) ahead of time as part of a toolchain/linking step - it wouldn't be possible to express a fully generic scheduler (over possible contexts) in pure Wasm (is this a goal?).
@conrad-watt, I do indeed think it should ultimately be possible to write a context-generic scheduler, but even with only a single shape you somehow need to save & restore the contexts when switching threads. I can see how that should work with an indirection through a central scheduler (though I suspect you'd still want contextref to make that cheap), but the devil might be in the details. It's less obvious with a direct stack switch, given a block-like context switch. In both cases satisfying the structure and/or typing may necessitate redundant context switches, which also seems undesirable. And of course the real tough nut is how to make these solutions scale to multiplexing green threads across multiple hardware threads, i.e., work stealing.
(the below assumes the "explicit restore on exit" interpretation of block context switch, to avoid the issue of capturing the parent context inappropriately)

If you have context types in the tags/continuation types, is there a problem? The typing rule would say that the continuation you want to switch to must have a context type matching your current context. With a delimited handler, you'd likely need to ensure that the current context in-scope at the handler matches the declared context of all the tags it's handling, which is a brutal type system restriction, but not problematic for the "one shape" case.
@conrad-watt, consider scheduler-less thread switching. How would a block-scoped context switch work? The switched-to thread does not return the way it's entered. It is sort of like a tail-switch. Hence, there is no obvious extent for a block-like construct. There is no restore point at all. You keep switching to the next context in a round robin fashion. Even with a scheduler, you probably don't want to restore the previous context redundantly but just switch to the next one right away. This is not a typing problem, but a basic mismatch with the way context switching would be structured. Types only make it more obvious that something is at odds and that this approach is too narrow.

Semantically, it is no question that this problem is equivalent to establishing proper dynamic scoping. The threaded context idea could be good as a lower-level primitive, but only as long as it actually is able to emulate dynamic scoping correctly and efficiently — that should be the litmus test.
Thinking about it a bunch more, I think a "context" is a linear struct value that: …
Because of the linearity, this basically matches the engine-internal context structures that exist today (under various names) that are allocated on entry to wasm (e.g., in the stack frame of the entry trampoline) and are propagated through all calls by pointer. The new thing is allowing guest code to store guest fields inside this context structure. This understanding suggests some tweaks to the proposal as presented above:

(1) Instead of having a block-like instruction for changing the current context type, we instead track the current context type as part of (or paired with) the current operand-stack type such that it can be strongly updated by any instruction. Thus, I think we can have a simple non-compound `context.switch` instruction.

(2) Function/effect types should be able to declare two context types: one incoming (as part of parameters) and one outgoing (as part of results). If you don't declare one, it's equivalent to declaring an empty context. Thus a function can declare different incoming and outgoing context types (and I'm even aware of a meaningful use case for having them be different, viz. around context initialization).

(Together, I think the above 2 tweaks address some of the incongruencies Andreas was pointing out above.)

(3) Because the expected implementation technique wants to allocate a single fixed-sized context structure in the activation record of the entry trampoline (and thread a pointer to it through all transitive calls), we need a low implementation limit on the size of context type so that …

(4) Due to the above-mentioned low upper bound on context size, the right way to achieve our primary goal of …

(5) As a consequence of the above, the context type won't need to contain all the thread-local functions as …
My understanding is that @tlively and Google folks are currently experimenting with full-fat thread-local globals. In response to concerns about the implementation feasibility of this approach, @lukewagner and @eqrion came up with an alternative design for "context locals". After some further discussion on what these should look like, this issue is an attempt to present a refreshed design for the context locals feature incorporating the iterations that happened in those discussions (e.g. #42). The sketch below assumes the "no capturing on suspension" variant, and is agnostic as to whether we have separate `shared-suspendable` and `shared-nonsuspendable`.

Background

To support useful compilation to `shared` functions, we need a mechanism for thread-local storage (to accurately compile source-level TLS), and a mechanism for JS interaction (since JS functions are `nonshared`, they can't be imported and called in the normal way inside `shared` functions).

Wasm-level thread-local globals solve the former problem, but require ambitious schemes for initialisation and garbage-collection. JS-API thread-local functions solve the latter problem (and can be used to simulate thread-local storage with a "get_thread_id" function), but pose similar garbage collection issues.
Context locals
Context locals aim to provide a basic mechanism for solving both problems. Conceptually, context locals represent storage that is local to the current Wasm call stack. This is resonant with the way engines already fix a "current instance" when entering a Wasm call stack. If a Wasm call stack is suspended and resumed elsewhere (including in another thread), the context locals at the suspension point are not captured - instead the resumed continuation inherits the context locals of the resumption point (with a type check to ensure the shape matches).
These qualities mean that it is safe to put JS functions into context locals, and call them even from shared code. Context locals can also be used to implement thread-local storage, although some additional care must be taken when crossing JS boundaries or making use of shared continuations.
Brief instruction set sketch
Extend function types with a new kind of local declaration - a sequence of types representing that function's "context" (this could also be declared tag-style in an earlier section as in #42).
e.g.
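(A plausible rendering in the strawman text format:)

```wat
(func $f (context (ref $t1) (ref $t2)) (param i32) (result i32)
  ;; the body can use context.get/context.set on context locals 0 and 1
  ...
)
```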
This function declares a context of type `(ref $t1) (ref $t2)`, made up of two context locals.

For simplicity's sake, we'll assume a separate instruction set for interacting with context locals rather than using the existing `local.*` instructions, but a combined scheme may be possible. These will be:

- `context.get`
- `context.set`
- `context.call` (like `get` + `ref.call`, not strictly necessary but useful in some circumstances [see below])

which work as expected. One note - in `shared-suspendable` functions we'll still need something like the `shared-barrier` mechanism to allow `nonshared` results of `context.get` etc to be manipulated, but `context.call` would in principle be permissible even without such a barrier.

We also need a block instruction for switching to a new context - `(context.switch t* ... end)`, or alternatively a call instruction that simultaneously switches contexts. This allows functions with mismatching contexts to call each other, and the cost of switching contexts is explicitly represented. A function which declares a context can only be called if its declared context is a subtype of (or for MVP, equivalent to?) the current context. Functions which do not declare a context can still be called from any other function. A `context.cast` block or call instruction for recovering a context subtype at runtime could be considered, but this would require contexts to preserve RTT information, which is an additional overhead.

JS-API

When a function has a declared context, the context must be bound before a function can be called. Functions with contexts, when exported to JS, have an extra `context_bind` (bikeshed name) method to accomplish this, which takes the values to be bound as the function's context, and returns a Wasm function that appears to have no context. Shared functions with unbound contexts can be postMessage'd, but the `context_bind` method on such a function either always returns an unshared function, or alternatively only returns a shared function if all the context parameters are shareable. The intent is that if the context contains any JS function or object, it should be rebound separately in each Worker that wants to call the function.

The reasoning for this separate bind step is to facilitate the compile-time specialisation that V8 has indicated they want to lean heavily on for performance. Due to lazy compilation, when a bound function is called for the first time, relevant `context.call` instructions can be specialised to the known value of the provided JS function. Since this code is only entered through the bound function, deopt checks are only necessary at boundaries where the context may change (e.g. initial JS entrypoints, and `context.switch` instructions). Pleasantly, no deopt checks are needed when repeatedly calling an already bound function - only when attempting to call the same function in another instantiation/binding.

The idea is that 99% of the time (including in situations with JS->Wasm re-entrancy) you're just calling already-bound Wasm functions.
Implementation sketch
EDIT: These proposed implementations are not correct, due to issues if the instance is shared across threads. Reader beware!
Here are two possible approaches - the space for the context is allocated inline with the instance, or the context is a separate allocation referenced by the current instance.
inline
When compiling the module and allocating the instance, find the largest context declared across all functions of the module and allocate that much extra space in the instance. When entering a context (e.g. through a call or resumption), copy the relevant values into this space (guaranteed to be enough space for every possible context). This has the advantage of making context accesses fast, but the context locals must be recopied when there is a cross-instance call (although this can be a wholesale memcpy rather than a per-member iteration).
separate allocation
Each instance has space only for a reference to the current context, which is a separate allocation. This has the advantage of not requiring a copy upon cross-instance call, but adds indirections to context access.
Example
(sorry if the syntax is minorly wrong or otherwise undercooked)
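(A minimal sketch in the spirit of the design above; module shape and names are illustrative. The host would bind `console.log` into the context via `context_bind` before calling:)

```wat
(module
  (type $log (func (param i32)))
  ;; one context local, holding an unshared host function
  (func (export "main") (context (ref $log)) (param i32)
    (context.call 0 (local.get 0))  ;; invoke the bound console.log
  )
)
```

On the JS side, something like `instance.exports.main.context_bind(console.log)` would then produce the bound function to call.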