This repository has been archived by the owner on Aug 11, 2023. It is now read-only.
Improve backend-firmware clock synchronization #424
Labels:
- `domain: backend` (on the backend)
- `domain: firmware` (on the firmware)
- `stage: dev` (on/for a development version)
- `type: fix` (iterations on existing features or infrastructure)
- `work: complicated` (the situation is complicated, i.e. known unknowns; good practices used)
Currently it's unclear whether physical (wall-clock) time synchronization will keep working past 40 days of ventilator unit uptime without a system restart. Additionally, clock synchronization is only implemented for the event log, and it's not clear whether that implementation can be generalized so that clocks are synchronized consistently across different messages or message pairs. It may be a better design to have an explicit request/response pair for system-level clock synchronization rather than message-level clock synchronization.
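One plausible reason long-uptime synchronization could break is overflow of a fixed-width firmware uptime counter (for instance, a 32-bit millisecond counter wraps after about 49.7 days). The issue doesn't specify the actual failure mode, so the sketch below is an assumption: it shows how the backend could unwrap such a counter into a monotonically increasing value, provided readings arrive more often than once per wrap period. The class and names are hypothetical.

```python
# Hypothetical sketch: unwrapping a 32-bit millisecond uptime counter so that
# wall-clock reconstruction keeps working after the counter overflows.
# Assumption: the firmware reports uptime as an unsigned 32-bit ms counter.
WRAP = 2**32  # counter period in ms (~49.7 days)


class UptimeUnwrapper:
    """Tracks wrap-arounds, assuming successive readings arrive less than
    one full wrap period apart."""

    def __init__(self) -> None:
        self._last_raw = None  # previous raw counter reading
        self._wraps = 0  # number of wrap-arounds observed so far

    def unwrap(self, raw_ms: int) -> int:
        """Converts a raw (wrapping) reading into a monotonic uptime in ms."""
        if self._last_raw is not None and raw_ms < self._last_raw:
            # The counter went backwards: assume exactly one wrap occurred.
            self._wraps += 1
        self._last_raw = raw_ms
        return self._wraps * WRAP + raw_ms
```

With unwrapped uptimes, a single firmware-to-backend offset stays valid across counter overflows instead of jumping by a full wrap period.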
The current clock synchronization implementation uses an improvised algorithm; it is not necessarily as accurate as possible, but it is probably good enough. It may be better to use a published (but still simple) clock synchronization algorithm, such as the one used in PTP (see further discussion here), with the firmware acting as the time server (since it needs to produce timestamped messages before it can receive the date from the backend, and since any timestamped messages it sends are effectively T1 broadcasts).

The synchronization error of these algorithms equals half the difference between the delay of sending to the remote peer and the delay of receiving from it, so the upper bound on the error in the time-offset estimate is half the round-trip time between the peers. If we put these messages on the Event Synchronization protocol, that delay is variable (depending on the number of other active events to be sent) and not necessarily symmetric. In the best case, where no other events are active on either peer, the round-trip time may range from 0 ms to 60 ms, for an error bound of +/- 30 ms, which is good enough for us; but in challenging conditions with many simultaneously active events, the delay may reach 1 s or more, which is really bad. So we should probably ignore/cancel synchronization attempts whose round-trip time exceeds some threshold (e.g. 30 ms, if we find that we don't end up discarding too many synchronization attempts). If this isn't an option, then we can't layer Clock Synchronization over State Synchronization; instead we'd need to generate the clock synchronization timestamps right before the state gets sent by State Synchronization.
Due to the way State Synchronization and Event Synchronization work, on startup the backend will receive timestamped messages before it has the actual time offset from the firmware. The backend could initialize the delay to some estimate (e.g. 30 ms) to compute an initial, inaccurate offset, and then, once it has properly measured the delay via an additional request/response pair, refine the estimates of both the delay and the time offset.
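The bootstrap-then-refine idea above could look like the following sketch. The class, method names, and the 30 ms default are illustrative assumptions; the point is only that one-way timestamped messages yield a provisional offset using a guessed delay, and a later request/response measurement replaces both the delay and the offset.

```python
# Illustrative sketch of bootstrapping the clock offset from a guessed
# one-way delay, then refining it once a measured delay is available.
from typing import Optional


class OffsetEstimator:
    def __init__(self, initial_delay_ms: float = 30.0) -> None:
        # Assumed one-way delay until a real measurement arrives.
        self._delay_ms = initial_delay_ms
        self._offset_ms: Optional[float] = None

    def on_firmware_timestamp(self, firmware_ms: int, backend_ms: int) -> None:
        """One-way timestamped message: offset = receive - send - delay."""
        if self._offset_ms is None:
            self._offset_ms = backend_ms - firmware_ms - self._delay_ms

    def on_measured_delay(self, measured_delay_ms: float,
                          firmware_ms: int, backend_ms: int) -> None:
        """A request/response pair measured the real delay; recompute."""
        self._delay_ms = measured_delay_ms
        self._offset_ms = backend_ms - firmware_ms - self._delay_ms

    def to_backend_time(self, firmware_ms: int) -> float:
        """Converts a firmware timestamp to (estimated) backend time."""
        assert self._offset_ms is not None, "no offset estimated yet"
        return firmware_ms + self._offset_ms
```

The initial offset is only as good as the guessed delay, but it lets the backend assign plausible wall-clock times to early messages instead of dropping them.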