Specify behavior around context loss and error reporting. #744

inexorabletash · 2024-07-29T19:36:08Z

Based on @mingmingtasd's work in the Chromium prototype implementation.

A lost promise attribute is added to MLContext, which resolves when the context is lost, and provides an implementation-defined message explaining the reason. Synchronous and asynchronous actions depending on the context will fail if the promise is settled.

This also modifies the omnipresent "has built" tests on MLGraphBuilder methods to be a "can build" test which also checks that the builder's context is not lost.
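To make the description above concrete, here is a minimal TypeScript sketch of a lost promise plus a "can build" check. It is a mock, not the WebNN API: `MockContext`, `MockBuilder`, `lose()`, `canBuild()`, the `LostInfo` shape, and the local `InvalidStateError` class (standing in for an "InvalidStateError" DOMException) are all assumptions made for illustration.

```typescript
// Hypothetical mock; none of these class or method names come from the spec.
class InvalidStateError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvalidStateError";
  }
}

type LostInfo = { message: string };

class MockContext {
  // Resolves (never rejects) once the context is lost.
  readonly lost: Promise<LostInfo>;
  private lostFlag = false;
  private resolveLost: (info: LostInfo) => void = () => {};

  constructor() {
    this.lost = new Promise<LostInfo>((resolve) => {
      this.resolveLost = resolve;
    });
  }

  get isLost(): boolean {
    return this.lostFlag;
  }

  // Stand-in for whatever makes the context unavailable.
  lose(message: string): void {
    if (this.lostFlag) return;
    this.lostFlag = true;
    this.resolveLost({ message });
  }
}

class MockBuilder {
  private hasBuilt = false;
  constructor(private context: MockContext) {}

  // "Can build": the builder has not built AND its context is not lost.
  private canBuild(): boolean {
    return !this.hasBuilt && !this.context.isLost;
  }

  // Synchronous method: throws when the builder cannot build.
  // Returns the name only to keep the mock small.
  input(name: string): string {
    if (!this.canBuild()) throw new InvalidStateError("builder cannot build");
    return name;
  }

  // Asynchronous method: rejects rather than throws.
  build(): Promise<string> {
    if (!this.canBuild()) {
      return Promise.reject(new InvalidStateError("builder cannot build"));
    }
    this.hasBuilt = true;
    return Promise.resolve("graph");
  }
}
```

The sketch keeps an explicit boolean next to the promise purely for readability; whether the spec should do the same is a separate question.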

For #477

inexorabletash · 2024-07-29T19:42:03Z

Some points for discussion:

This uses the internal slot [[lost]] as both (1) the promise value and (2) how to check if a context "is lost". Maybe an explicit boolean internal slot would be clearer, albeit more verbose?
In the implementation, the "lost" state is equivalent to the context's mojo remote being disconnected. This is only checked explicitly when (1) creating an MLGraphBuilder and (2) creating an MLBuffer. This spec change does the former, but MLBuffer is not in the spec yet so that's not present. I did add an explicit check in compute(); I think that'll happen in the implementation indirectly because the dispatch would fail, but an explicit check seemed like a good idea?
The MLContext section of the spec could be reorganized a bit. Separate PR?

@mingmingtasd and @reillyeon could you do an initial review?

index.bs

mingmingtasd · 2024-07-30T01:18:00Z

Thanks! @inexorabletash @reillyeon
I am also working on a Chromium CL to expose MLContext::destroy to make a context lost.

In the implementation, the "lost" state is equivalent to the context's mojo remote being disconnected. This is only checked explicitly when (1) creating an MLGraphBuilder and (2) creating an MLBuffer. This spec change does the former, but MLBuffer is not in the spec yet so that's not present.

I will check for context loss in all of the synchronous and asynchronous actions that depend on the context in my Chromium CL.

reillyeon · 2024-07-30T01:32:09Z

Something this PR doesn't do is specify the behavior around rejecting in-flight asynchronous operations.

inexorabletash · 2024-07-30T17:35:24Z

Something this PR doesn't do is specify the behavior around rejecting in-flight asynchronous operations.

Thoughts on how to do that and to what level of detail?

in compute(), is that handled by execute graph returning error, or do we need more steps?
in build() it looks like failure other than "does not support a requested feature" isn't covered.

One generic approach for both of those would be to replace: "Queue an ML task with global to resolve promise with ..." with:

Queue an ML task with global to run these steps:

If context is lost, then reject promise with an InvalidStateError.

Resolve promise with ...

... which I think covers the script-observable behavior, but not that the async steps internally should fail.
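The generic approach above can be sketched in TypeScript roughly as follows. This is a model under stated assumptions, not spec text: `FakeContext`, `settleInTask`, and the local `InvalidStateError` class are hypothetical names, and `setTimeout` stands in for "queue an ML task".

```typescript
// Hypothetical stand-in for an "InvalidStateError" DOMException.
class InvalidStateError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvalidStateError";
  }
}

// Hypothetical context with nothing but a lost flag.
class FakeContext {
  isLost = false;
}

// Settles the returned promise from a queued task. The lost check runs
// inside that task, immediately before the promise would be resolved.
function settleInTask<T>(context: FakeContext, work: Promise<T>): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    work.then((value) => {
      // setTimeout models queuing a task on the event loop.
      setTimeout(() => {
        if (context.isLost) {
          reject(new InvalidStateError("context is lost"));
          return;
        }
        resolve(value);
      }, 0);
    }, reject);
  });
}
```

As the comment notes, this only models what script can observe. The underlying work still runs to completion here, and only its result is discarded.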

fdwr · 2024-08-01T00:50:45Z

@RafaelCintron

reillyeon · 2024-08-02T18:37:47Z

There are two separate considerations: what happens to the promise returned by a method and what happens to an asynchronous operation itself. Operations like build(), compute() and dispatch() aren't abortable (or at least, the specification should not require that they are aborted, only that whether or not they are aborted is not visible to script). We can however be explicit that the promises returned by any methods on an object are synchronously rejected when the destroy() method is called or the context is lost. We could either specify this exactly the way the Chromium implementation works, by having each object hold a list of all promises and reject them, or we could add a note to the "in parallel" steps which says "when the context is lost or this is destroyed, reject promise and potentially abort these steps". The definition of "potentially abort" is the tricky one because it has to consider cases like destroying a single buffer that is part of a larger set of pending operations which should still be able to complete.
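A minimal sketch of the first option (each object holds a list of its pending promises and rejects them) might look like the following. It is not the Chromium code: `TrackedContext`, `track()`, `pendingCount`, and the local `InvalidStateError` class are hypothetical names chosen for illustration.

```typescript
// Hypothetical stand-in for an "InvalidStateError" DOMException.
class InvalidStateError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvalidStateError";
  }
}

type Rejecter = (reason: Error) => void;

class TrackedContext {
  private pending = new Set<Rejecter>();
  private lost = false;

  // Wraps an operation so its script-visible promise is tracked until it settles.
  track<T>(operation: Promise<T>): Promise<T> {
    if (this.lost) {
      return Promise.reject(new InvalidStateError("context is lost"));
    }
    return new Promise<T>((resolve, reject) => {
      this.pending.add(reject);
      const done = () => this.pending.delete(reject);
      operation.then(
        (value) => { done(); resolve(value); },
        (error) => { done(); reject(error); }
      );
    });
  }

  // Rejects every tracked promise synchronously. The operations themselves
  // are not aborted; a late result is ignored because the wrapper promise
  // has already settled.
  destroy(): void {
    this.lost = true;
    this.pending.forEach((reject) => reject(new InvalidStateError("context destroyed")));
    this.pending.clear();
  }

  get pendingCount(): number {
    return this.pending.size;
  }
}
```

Because the wrapper promise settles first, script cannot tell whether the underlying operation was aborted or simply finished later.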

huningxin · 2024-08-05T14:17:05Z

@inexorabletash

in compute(), is that handled by execute graph returning error, or do we need more steps?

I suppose the following step of execute graph could handle the device lost error:

Issue a compute request to graph.[[implementation]] given name and inputResources and wait for completion.

If that returns an error, then return an "OperationError" DOMException.

A question is if the error is device lost, should it run the "context-lost" steps? Like the Chromium prototype does in HandleComputationFailure(). Or we can just assume the "context-lost" steps are triggered by "When an MLContext context is no longer available to fulfill requests" asynchronously?

in build() it looks like failure other than "does not support a requested feature" isn't covered.

Or we could just say "If that returns an error, then queue an ML task with global to reject promise with an 'OperationError' DOMException" similar to compute()?

One generic approach for both of those would be to replace: "Queue an ML task with global to resolve promise with ..." with:

Queue an ML task with global to run these steps:

If context is lost, then reject promise with an InvalidStateError.

Resolve promise with ...

The new steps may never run, because the algorithm may already have been aborted (by "abort these steps") if a previous step failed.

bbernhar · 2024-08-06T16:41:01Z

We can however be explicit that the promises returned by any methods on an object are synchronously rejected when the destroy() method is called or the context is lost.

+1. The MLContext.destroy() steps could be specified as follows:

Script timeline:

Immediately reject all pending promises made off |this| context.
Issue steps for |this| context on the device/queue timeline.

Device/queue timeline:

Wait for async operations on the device to complete.
Then lose |this| context.

Note: the implementation is always free to abort pending async operations immediately (and release buffers).
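The two timelines above can be modeled in TypeScript as a rough sketch. Everything here is an assumption for illustration: `TimelineContext`, `submit()`, the string-valued `lost` promise, and the local `InvalidStateError` class are hypothetical, and device work is represented by ordinary promises.

```typescript
// Hypothetical stand-in for an "InvalidStateError" DOMException.
class InvalidStateError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvalidStateError";
  }
}

class TimelineContext {
  // Resolves with a plain message string to keep the model small.
  readonly lost: Promise<string>;
  private resolveLost: (message: string) => void = () => {};
  private pendingRejecters: Array<(reason: Error) => void> = [];
  private deviceWork: Array<Promise<void>> = [];
  private destroyed = false;

  constructor() {
    this.lost = new Promise<string>((resolve) => {
      this.resolveLost = resolve;
    });
  }

  // Submits device work and returns the script-visible promise for it.
  submit(work: Promise<void>): Promise<void> {
    if (this.destroyed) {
      return Promise.reject(new InvalidStateError("context destroyed"));
    }
    // Track completion (success or failure) for the device timeline.
    this.deviceWork.push(work.catch(() => {}));
    return new Promise<void>((resolve, reject) => {
      this.pendingRejecters.push(reject);
      work.then(resolve, reject);
    });
  }

  destroy(): void {
    if (this.destroyed) return;
    this.destroyed = true;
    // Script timeline: immediately reject all pending promises.
    this.pendingRejecters.forEach((reject) =>
      reject(new InvalidStateError("context destroyed"))
    );
    this.pendingRejecters = [];
    // Device/queue timeline: wait for outstanding work, then lose the context.
    Promise.all(this.deviceWork).then(() => {
      this.resolveLost("destroy() was called");
    });
  }
}
```

In this model the lost promise resolves only after outstanding device work has drained, while promises returned from `submit()` reject as soon as `destroy()` is called.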

inexorabletash · 2024-08-07T19:47:26Z

A question is if the error is device lost, should it run the "context-lost" steps? Like the Chromium prototype does in HandleComputationFailure(). Or we can just assume the "context-lost" steps are triggered by "When an MLContext context is no longer available to fulfill requests" asynchronously?

I think this would be script observable, i.e. the order in which the lost and compute() promises settle. So we should spec it explicitly.
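A small TypeScript model shows why the ordering is observable. It is a sketch under assumptions, not WebNN code: `simulateFailure` is a hypothetical function, and the two promises stand in for `context.lost` and the promise returned by compute().

```typescript
// Hypothetical model of an implementation that hits a device-lost error
// during compute() and has to settle two promises.
function simulateFailure(lostFirst: boolean): Promise<string[]> {
  const order: string[] = [];
  let resolveLost: (message: string) => void = () => {};
  let rejectCompute: (reason: Error) => void = () => {};

  const lost = new Promise<string>((resolve) => { resolveLost = resolve; });
  const compute = new Promise<void>((_resolve, reject) => { rejectCompute = reject; });

  // Script registers its handlers before either promise settles.
  const observedLost = lost.then(() => { order.push("lost"); });
  const observedCompute = compute.catch(() => { order.push("compute"); });

  if (lostFirst) {
    // Run the "context lost" steps first, then fail compute().
    resolveLost("device lost");
    rejectCompute(new Error("OperationError"));
  } else {
    // Fail compute() first, then run the "context lost" steps.
    rejectCompute(new Error("OperationError"));
    resolveLost("device lost");
  }

  return Promise.all([observedLost, observedCompute]).then(() => order);
}
```

Handlers run in the order their promises were settled, so the two strategies yield different sequences that a page could depend on.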

mingmingtasd · 2024-08-19T05:59:03Z

The definition of "potentially abort" is the tricky one because it has to consider cases like destroying a single buffer that is part of a larger set of pending operations which should still be able to complete.

@reillyeon @huningxin @bbernhar I think the DirectML backend has already considered this: an MLBuffer can be destroyed prior to Compute() and Dispatch(), but its resource will be kept alive anyway until all the pending GPU work is done. For example, if you destroy an MLBuffer used by Dispatch() before the Dispatch() is done, the Dispatch() can still complete, but you can't then call ReadBuffer() to read back the results. That seems OK and as expected?

reillyeon · 2024-08-19T14:42:04Z

I think we've settled on the idea that destroying an MLBuffer will only make readBuffer() fail. Pending compute tasks will still complete.

Destroying an MLContext (or context loss) is the equivalent of calling destroy() on all the builders, graphs and buffers created by the context. We haven't yet decided the semantics of destroying a graph but it will probably similarly allow pending compute tasks to complete.
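The semantics described here can be sketched in TypeScript as a toy model. All names are assumptions: `FakeBuffer`, `FakeContext`, `dispatchDouble()`, and the local `InvalidStateError` class are hypothetical, and `dispatchDouble()` returns a promise only so that the example can observe completion.

```typescript
// Hypothetical stand-in for an "InvalidStateError" DOMException.
class InvalidStateError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvalidStateError";
  }
}

class FakeBuffer {
  private destroyed = false;
  // Backing storage; pending work captures it, so it outlives destroy().
  constructor(readonly storage: number[]) {}

  destroy(): void {
    this.destroyed = true;
  }

  // The only operation that fails once the buffer is destroyed.
  readBuffer(): Promise<number[]> {
    if (this.destroyed) {
      return Promise.reject(new InvalidStateError("buffer is destroyed"));
    }
    return Promise.resolve(this.storage.slice());
  }
}

class FakeContext {
  private buffers: FakeBuffer[] = [];

  createBuffer(data: number[]): FakeBuffer {
    const buffer = new FakeBuffer(data.slice());
    this.buffers.push(buffer);
    return buffer;
  }

  // Toy "dispatch": doubles each input element into the output storage later.
  dispatchDouble(input: FakeBuffer, output: FakeBuffer): Promise<void> {
    const src = input.storage;
    const dst = output.storage;
    return new Promise<void>((resolve) => {
      setTimeout(() => {
        for (let i = 0; i < src.length; i++) dst[i] = src[i] * 2;
        resolve();
      }, 0);
    });
  }

  // Destroying the context is modeled as destroying every buffer it created.
  destroy(): void {
    this.buffers.forEach((buffer) => buffer.destroy());
  }
}
```

Destroying the output buffer right after `dispatchDouble()` leaves the pending work free to finish, but a later `readBuffer()` on that buffer rejects.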

mingmingtasd · 2024-08-20T04:51:56Z

I think we've settled on the idea that destroying an MLBuffer will only make readBuffer() fail. Pending compute tasks will still complete.

Destroying an MLContext (or context loss) is the equivalent of calling destroy() on all the builders, graphs and buffers created by the context. We haven't yet decided the semantics of destroying a graph but it will probably similarly allow pending compute tasks to complete.

I have submitted a CL to expose MLGraph::destroy: https://chromium-review.googlesource.com/c/chromium/src/+/5799069

Specify behavior around context loss and error reporting.

aa2a8b8

Based on @mingmingtasd's work in the Chromium prototype implementation. For webmachinelearning#477

reillyeon suggested changes Jul 29, 2024

inexorabletash added 4 commits July 29, 2024 14:48

Add link for 'settled'

7456eff

Add 'is lost' check to build()

01cf420

Add 'is not lost' dfn alias

1149c8d

Add "can (not) build" defn, fix throwing vs. rejecting for methods

d2ec3f4

reillyeon approved these changes Jul 29, 2024

mingmingtasd approved these changes Jul 30, 2024

Merge branch 'refs/heads/review' into context-lost

057280f

Merge branch 'refs/heads/review' into context-lost

e4fa318

inexorabletash marked this pull request as draft August 6, 2024 19:29

Merge branch 'refs/heads/review' into context-lost

68c5b97

inexorabletash added 3 commits August 7, 2024 13:05

Make build() and compute() explicitly invoke context lost steps

61ff07d

reword to reduce some indenting

164d3e4

restore missing algorithm class

8f5decf
