HAL_ETH_ReleaseTxPacket() does not test gState and races with itself and TX functions when accessing TX descriptors and counters #72
Labels
bug
Something isn't working
eth
Ethernet-related issue or pull-request
hal
HAL-LL driver-related issue or pull-request.
internal bug tracker
Issue confirmed and reported into a ticket in the internal bug tracking system
HAL_ETH_ReleaseTxPacket() does NOT check gState before execution AT ALL.
https://github.com/STMicroelectronics/stm32h7xx_hal_driver/blob/ceda3ceeca2ee77a76d2de2ee6b560ad87c90a48/Src/stm32h7xx_hal_eth.c#L1400-L1413
It just runs and modifies TX descriptors, decrements buffer counters and reads state from &heth into local variables.
This buffered state can become stale/outdated if HAL_ETH_ReleaseTxPacket() gets preempted and other threads issuing calls to HAL_ETH_Transmit_IT() or HAL_ETH_Transmit() get scheduled instead.
Among other things HAL_ETH_ReleaseTxPacket() contains the decrement of
dmatxdesclist->BuffersInUse
.https://github.com/STMicroelectronics/stm32h7xx_hal_driver/blob/ceda3ceeca2ee77a76d2de2ee6b560ad87c90a48/Src/stm32h7xx_hal_eth.c#L1477
for which the increment has been guarded by a critical section
https://github.com/STMicroelectronics/stm32h7xx_hal_driver/blob/ceda3ceeca2ee77a76d2de2ee6b560ad87c90a48/Src/stm32h7xx_hal_eth.c#L3333-L3340
So somebody suspected there might be race conditions there but failed to follow through ;)
You really, really, really need to lock this properly with atomic Test-and-Set of gState, because even the read-modify-write of TX descriptor OWN bits is potentially racy (see #71 #70 and #69)
This is not a theoretical bug! Under high incoming packet load (~8000 packets per second RX) there are frequent preemptions by the input thread which runs with realtime priority (code generated by CubeIDE).
This causes the TX calls and TX packet releases in other threads to be preempted and scheduled randomly and thrashes the data in &heth to a point where the system would not send TX packets any more. Under low packet load this can still occur and leads to unstable TX operation where systems will stop transmitting data after multiple days when these race conditions randomly occur but with much lower probability.
The text was updated successfully, but these errors were encountered: