-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQLite Crash on deleteEvents #183
Comments
Anonymized crash log from customer:
|
Seems that the closeDB has been called before the dispatch worker thread processes the deleteEvent enqueued block has been executed, meaning the delete statement has been finalized, and would cause the segmentation fault seen in the callstack. Previously I was thinking this could be due to a race condition due to client calls to closeDB, but that isn't actually exposed to clients. Seems that closeDB could be called if a sqlite call fails for whatever reason. Not sure if the keen client has logging enabled by default, but if it could be enabled that might shed some light on if/what is failing and causing the db to be closed. |
We never experienced the crashes in development and don't have logging enabled in production (and wouldn't be able to access those logs anyway). Or even better - could you check your server logs? Perhaps we see there your servers response and then can understand what happens in |
I checked our server logs for your account a few days ago. There were no errors on writes in the time period we're looking at, so the response should have been a standard |
but the responsedata would also contain collectionNames - and just wondering if anything could go wrong here: if a collectionName returned by your API, wouldn't be found in the passed |
:( Our logs are unfortunately not that verbose as far as what is in the responses. We'd have to fill the every data center everywhere with |
@rist not sure if that could potentially lead to the state we're seeing that causes the crash, but the crash itself seems to be because the SQLite database used internally has been closed by the time the internal delete operation is processed through the KIODBStore queue. We might investigate what previous SQLite operation failed which is the only way the database gets closed. @josephwegner would it be possible that a failure is generated on that level due to a nil eventId passed to the SDK? I imagine there are checks in place for that, or otherwise wouldn't lead to a SQLite failure? |
Ok, if the theory ist that the DB was closed before we execute some delete statement - why do you check if the DB is open at the beginning of The DB could have been closed in the meantime |
Since the offending line is Almost all the code paths that lead to Also, if we fail to delete events from the local store, I don't know if there's code in KeenClient designed to handle that, so it might be the case that such events, which succeeded in being pushed, get pushed again later due to a failure upon attempting to delete them (@josephwegner mentioned this, and it's mentioned in Issue #170 and Issue #156). One would need a callback set on KIODBStore so that it could Either way, it's important to not just try to mask the original failure which caused the |
yes #170 is definitely an issue for us as well - we see lots of duplicated event |
@baumatron has worked on creating some potential repro scenarios and identified a few issues that could be addressed around how DB corruption is handled. We could try to address those as well as how local delete failures are handled and how DB closed is handled by dispatched blocks as described above. Some combination of these patches could ameliorate the situation. Since none of that identifies the root cause, we'd still love to introduce some sort of better logging. @baumatron and I were tossing around the idea of adding some callback logging interface the client code could opt into in order to use a third-party logging solution that actually posts logs to somewhere useful. That could be turned on by client code for certain clients in order to help track down the root cause. Does that sound interesting @rist ? |
Here's a bit of an update. The general issue here is that the db gets closed all over the place when encountering anything considered to be an error from the db, which is a bit dubious to begin with, but causes issues because this closing of the db isn't synchronized with code elsewhere that is responsible for opening the db again. So you can get into some call in KIODBStore that checks to be sure the db is open, then drops a block in the queue that will do some db operation, in the meantime some other operation "fails" and closes the db, and then the queued block executes and crashes since the db is now closed. Since I last updated the bug, a couple of issues have been discovered and fixed where this "failure" condition is triggered erroneously, meaning the db is closed more often than it should be, making this issue observed more frequently. Those fixes should help mitigate this in the short term (#196, #200). A long term fix would be to do an audit of how failures are handled in KIODBStore. I believe that in general these failures should not be closing the db, but instead surfacing a meaningful error to the SDK client when a call can't be handled. In cases where an error case should be explicitly handled because KIODBStore knows how to deal with it, such as is the case for existing corruption handling code-paths, that should all be done immediately where the error is encountered instead of partially where the error is encountered and then handled by checkOpenDB in random places salted throughout the code. I would suggest checkOpenDB gets removed and a single explicit initialization takes it's place, and all methods should be able to assume the db is open if called and should throw an exception if that isn't the case. |
A customer reported an issue with Keen's interaction with SQLite causing their app to crash often. Given a crash log, it appears that the crash originates in
[KIODBStore deleteEvent]
, or more specifically in thekeen_io_sqlite3_bind_int64
statement.They also reported an issue with duplicate events, similar to #170. This makes sense, because if we fail to delete an event from SQLite after uploading, it will be uploaded again after the app resumes.
The text was updated successfully, but these errors were encountered: