athenad memory leak #34079

sshane · 2024-11-21T21:18:41Z

It was using 15% of memory, causing low memory events. I had not rebooted the device in a few days iirc and had uploaded many rlogs.

I tried to attach with memray, but instead the process just crashed so I didn't get any information about this. @deanlee could this be similar to the leak you found in athena previously?

deanlee · 2024-11-22T05:24:58Z

It's hard to identify the exact issue, as the code is unclear and difficult to understand. However, multiple factors could be causing the resource leak. One possibility is that the athenad's implementation might cause garbage collector to accumulate objects over time, leading to memory issues during long runs.
additionally, thread race conditions related to queue and other global variable (like cur_upload_items, cancelled_uploads,...) usage, as well as bad practices like not calling task_done() after queue.get(), could also contribute to issues.

deanlee · 2024-11-23T15:53:32Z

The following issues and fixes address potential resource leaks and long-term accumulation risks:

Memory Leaks in Python requests: A common issue occurs when requests keep references to response objects in memory unnecessarily, leading to memory leaks by preventing garbage collection. athenad: fix memory leak by closing Response objects #34101
Unclosed WebSocket: Not closing the WebSocket after a main loop restart can cause resource leaks by keeping connections open. athenad: close websocket before starting next loop iteration #34085
Thread Safety in Upload Handling: Lack of proper synchronization for upload_queue, cancelled_uploads, and cur_upload_items can lead to race conditions, resulting in thread termination, main loop restarts, and inconsistent or corrupted data, as well as queue bloat. athenad: fix thread safety issues in upload handing #34084
Thread Safety in ws_manage: The Python socket object is not thread-safe. A mutex should be used to protect socket.setoption while ws_send or ws_recv are sending or receiving messages, preventing race conditions and ensuring safe multithreaded operations.

sshane · 2024-12-05T21:39:32Z

This is still happening after uploading files

sshane · 2024-12-05T23:26:10Z

I see this after uploading rlogs:

lock                            120       +47
cell                            828       +47
builtin_function_or_method     1685       +43
list                           1776       +38
tuple                          4715       +30
ReferenceType                  2736       +30
Event                            75       +15
socket                           24       +15
Condition                        54       +14
deque                            62       +14
method                          109       +13
function                       7974       +13
SSLContext                       14        +8
WebSocket                        13        +8
sock_opt                         13        +8
frame_buffer                     13        +8
continuous_frame                 13        +8
SplitResult                      14        +8
SSLSocket                        14        +8
_SSLSocket                       14        +8

deanlee · 2024-12-06T05:14:40Z

The issue may not just be with the upload process—WebSocket instance growth could indicate deeper resource management problems.

deanlee · 2024-12-06T09:18:46Z

@sshane: Do we really need the ws_manage thread? It doesn’t handle exceptions, and changing socket options while the WebSocket is sending or receiving could cause race conditions or undefined behavior. If socket options need to be set, a better approach would be to close the connection, reconnect, and configure the socket before restarting ws_recv/ws_send threads.Also, storing sock = ws.sock at the start of the thread and assuming it's valid throughout the loop may not be safe.

It may not be directly related to this issue, but I recommend removing this thread until we have a more robust solution.

deanlee · 2024-12-06T11:11:44Z

The current athenad implementation lacks protections for concurrent data access in a multi-threaded environment. Besides PR #34084, which fixes race conditions for global variables, the UploadQueueCache class has multi-threading issues. It is accessed in both uploadFilesToUrls and upload_handler, which run in different threads, but UploadQueueCache is not thread-safe.

deanlee · 2024-12-07T16:33:33Z

@sshane: The current implementation does not preserve the order between received commands and their responses. Responses may be sent out of order.

Commands are placed in the recv_queue and processed by multiple jsonrpc_handler threads, with responses added to the send_queue after handling. Due to the multithreaded nature, the order of responses may differ from the order of received commands.

Is this behavior by design, or should it be considered an issue?

jakethesnake420 · 2024-12-13T00:45:33Z

@sshane: The current implementation does not preserve the order between received commands and their responses. Responses may be sent out of order.

Commands are placed in the recv_queue and processed by multiple jsonrpc_handler threads, with responses added to the send_queue after handling. Due to the multithreaded nature, the order of responses may differ from the order of received commands.

Is this behavior by design, or should it be considered an issue?

The order of responses does not matter if handled by the backend server correctly since the JsonRPC spec handles this. The 'requester' sends a request with a unique ID which could just be a ns timestamp for example. When the 'responder' (athena in this case) responds, it will also send the ID in the payload. https://www.jsonrpc.org/specification#response_object. This is done automatically so its opaque if you just look at the athena code.

Commas connect backend server should be doing the generation of pseudo unique ids and buffering the responses in a hashmap. Although as seen here https://github.com/commaai/connect/blob/79880f1203269eb145bc87863808e6cfca797ed1/src/actions/files.js#L147
it looks like the id is always zero so it might be worth double checking the backend code that it handles this properly.

sshane added the bug label Nov 21, 2024

sshane added this to the 0.9.8 milestone Nov 21, 2024

deanlee mentioned this issue Nov 24, 2024

athenad: fix memory leak by closing Response objects #34101

Merged

sshane closed this as completed in #34101 Dec 3, 2024

sshane reopened this Dec 5, 2024

sshane changed the title ~~Possible athenad memory leak~~ athenad memory leak Dec 12, 2024

sshane mentioned this issue Dec 12, 2024

athenad: add threshold to update upload progress #34219

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

athenad memory leak #34079

athenad memory leak #34079

sshane commented Nov 21, 2024

deanlee commented Nov 22, 2024 •

edited

Loading

deanlee commented Nov 23, 2024 •

edited

Loading

sshane commented Dec 5, 2024

sshane commented Dec 5, 2024

deanlee commented Dec 6, 2024

deanlee commented Dec 6, 2024 •

edited

Loading

deanlee commented Dec 6, 2024 •

edited

Loading

deanlee commented Dec 7, 2024 •

edited

Loading

jakethesnake420 commented Dec 13, 2024 •

edited

Loading

athenad memory leak #34079

athenad memory leak #34079

Comments

sshane commented Nov 21, 2024

deanlee commented Nov 22, 2024 • edited Loading

deanlee commented Nov 23, 2024 • edited Loading

sshane commented Dec 5, 2024

sshane commented Dec 5, 2024

deanlee commented Dec 6, 2024

deanlee commented Dec 6, 2024 • edited Loading

deanlee commented Dec 6, 2024 • edited Loading

deanlee commented Dec 7, 2024 • edited Loading

jakethesnake420 commented Dec 13, 2024 • edited Loading

deanlee commented Nov 22, 2024 •

edited

Loading

deanlee commented Nov 23, 2024 •

edited

Loading

deanlee commented Dec 6, 2024 •

edited

Loading

deanlee commented Dec 6, 2024 •

edited

Loading

deanlee commented Dec 7, 2024 •

edited

Loading

jakethesnake420 commented Dec 13, 2024 •

edited

Loading