The problem

We frequently discard and re-fetch from scratch almost all the data from the server: the list of streams and their names, colors, etc.; the list of other users and their names, avatars, etc.; how many unread messages the user has and where; and so on.
Specifically, we do this:
(a) each time the app starts up when not already running;
(b) each time the user switches which account they're actively looking at;
(c) and each time the app becomes able to connect to the server again after 10 minutes of not being able to do so.
In particular, this applies when the device has been in airplane mode, or otherwise without a connection, for 10 minutes.
It might also apply when the app is put in the background, or the device is put to sleep or locked, if those cause the operating system to stop letting the app run and make network requests; I'm not sure to what extent mobile OSes currently do that in either of those situations.
What this looks like for the user is:
In cases (a) and (c), we show the stale data we have, along with a "Connecting..." banner at the top of the screen.
In addition to that behavior (which is intended, as long as we have the underlying issue that we don't have current data), there are some glitches if the user navigates to the message list while we're still connecting: the message list cycles through a glitchy sequence of states, and if the user starts writing a draft message it can get lost. (I believe this is #5152, "Just-opened narrow sometimes flickers 'no messages' before fetch has started".)
In case (b), once they switch accounts we just show the loading screen until we get the new server data. (We don't have any data, even stale, for the other account.)
The lack of data also means we can't offer features like showing, on the list-of-accounts screen, the number of unreads for each account. (This is an old feature request: #893, "Add unread counts to account view".)
A key technical point driving this behavior: in order to keep using a given set of server data, we need an active event queue on the Zulip server that tracks changes to that data. Without such a queue, the data unavoidably goes out of date, and there's no reliable way to keep it internally consistent.
A related technical point: if we have a set of server data that's old, but also have an active event queue for it, then we can bring the data up to date by just fetching the events. In a large organization, this is far less data than a refetch from scratch, so it may be much faster to acquire.
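To make that concrete, here's a rough TypeScript sketch (not the app's actual code) of the two paths: catching up from an existing event queue, versus re-fetching everything and opening a new one. The `QueueState` shape and helper names are made up for illustration; the endpoints and parameters are the documented Zulip API (GET /api/v1/events, POST /api/v1/register).

```ts
// Rough sketch only -- not the app's actual code.
type QueueState = { queueId: string; lastEventId: number };

async function get(realm: string, auth: string, path: string, params: Record<string, string>) {
  const response = await fetch(`${realm}/api/v1/${path}?${new URLSearchParams(params)}`, {
    headers: { Authorization: auth },
  });
  return response.json();
}

async function refresh(realm: string, auth: string, queue: QueueState | null) {
  if (queue) {
    // Cheap path: the queue may still be alive, so ask only for the events
    // since `lastEventId` and apply them to the data we already have.
    const resp = await get(realm, auth, 'events', {
      queue_id: queue.queueId,
      last_event_id: String(queue.lastEventId),
      dont_block: 'true', // return immediately instead of long-polling
    });
    if (resp.result === 'success') {
      return { kind: 'events' as const, events: resp.events };
    }
    // Otherwise the queue has presumably expired on the server; fall through.
  }
  // Expensive path: re-fetch everything from scratch and open a fresh queue.
  const response = await fetch(`${realm}/api/v1/register`, {
    method: 'POST',
    headers: { Authorization: auth, 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({ apply_markdown: 'true' }).toString(),
  });
  const initial = await response.json();
  return {
    kind: 'initial' as const,
    queue: { queueId: initial.queue_id, lastEventId: initial.last_event_id },
    state: initial,
  };
}
```

The expensive path is what we do today on every cold start; the cheap path is what the goals below aim to make the common case.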
The goals
The main goal here is:
When the user opens the app, we should typically have fresh up-to-date server data to show them.
(For this issue, we'll stick to doing so for one account at a time. Doing so for multiple accounts is #5005.)
This means that the user will immediately see the Zulip messages they have, rather than sit there waiting for them to load.
To accomplish this, we should:
Keep the event queue between runs of the app, so that if the old event queue is still active, we pick up where we left off (see the sketch after this list).
Keep the user's server data up to date in the background, by polling the event queue in between times that the app is in the foreground. (For example, we could trigger this when we get a notification, and fetch the messages in that conversation. We could then go further by occasionally sending background notifications, and having the client update the set of unreads and perhaps fetch some messages.)
Thus far, we'd have great behavior but only if the user has used the app in the last 10 minutes -- longer than that, and the event queue on the server would have expired so we'd be forced to reload from scratch after all. To go further, we should:
Keep the event queue alive longer on the server.
An hour or two would be enough to help significantly for someone using the app regularly during the day.
A day would let the app remain fresh overnight. In combination with background updates, this could keep it fresh indefinitely for a user in a busy realm.
To keep the app's data up to date for users in all kinds of realms, we could keep event queues alive for a longer period, like a week, while sending "heartbeat" background notifications at a slightly shorter interval to prompt the app to check in.
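As a sketch of the first sub-goal above, keeping the event queue between runs mostly means remembering the queue_id and last_event_id the server gave us. This rough TypeScript illustration assumes React Native's community AsyncStorage package purely for concreteness; the real app persists its Redux store instead, and the key and shapes here are hypothetical.

```ts
// Rough sketch, not the real implementation.
import AsyncStorage from '@react-native-async-storage/async-storage';

type QueueState = { queueId: string; lastEventId: number };

const QUEUE_KEY = 'eventQueueState'; // hypothetical storage key

// Call after every successful /events poll, so the next launch knows exactly
// which events have already been applied to the persisted data.
async function saveQueueState(queue: QueueState): Promise<void> {
  await AsyncStorage.setItem(QUEUE_KEY, JSON.stringify(queue));
}

// Call at startup: if this returns a queue and the server still has it, we can
// show the persisted data immediately and catch up from events (as in the
// earlier sketch); only if the queue has expired do we reload from scratch.
async function loadQueueState(): Promise<QueueState | null> {
  const raw = await AsyncStorage.getItem(QUEUE_KEY);
  return raw === null ? null : (JSON.parse(raw) as QueueState);
}
```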
Two of these three changes -- keeping the event queue between runs, and keeping it alive longer on the server -- are very small code changes in themselves. The main work to be done for them is in dealing with certain risks and challenges:
The fact that we regularly re-load all server data from scratch helps us paper over many small gaps and bugs in how we handle Zulip events (and other fetches of data) to maintain that data over time. See in particular:
I think we can cheerfully start keeping data somewhat longer than we do today with the existing state of these bugs. But we should crank the duration up gradually, and we should spend some time resolving those issues one by one before we take things to a point where many users don't routinely get any from-scratch reloads at all.
Similarly, today if a user encounters buggy behavior in the Zulip app, they can force-quit and relaunch it, and if the bug had to do with how we maintain server data then that will clear the issue by causing a reload from scratch.
I think it's OK to just give that up -- I can't recall the last time I used this workaround myself, nor the last time we heard from a user doing so.
Event queues consume RAM on the server.
This consumption builds up as things happen in the user's realm, and gets cleared out when the client polls for events and acknowledges their receipt. If the queue sticks around longer between getting polled by the client, then it consumes more RAM -- plus, more queues may be alive at a time. So longer-lived queues will have a resource cost on the server.
I don't have a clear sense of how big that cost looks quantitatively. Some things we could do to mitigate it include:
When an event queue gets long and/or old, we could downgrade it to a more compact form, one that's still enough to reliably bring the state back up to date (but potentially with the client making additional requests when it returns, to fill in the details). This idea is #3916, "Support downgrading to a long-lived event queue", and zulip#12926, "events: Add basic downgradeable event queue support".
I believe the events fetched on a given round of polling don't get dropped from the queue right away, but only on the next round -- because that's when the client tells the server it has indeed received those events. This is good for reliability, but makes the queues longer in steady state than they could be.
When polling for events after a long period, we could make a second poll promptly afterward, to let the server drop those events from the queue; see the sketch below. (It might even be useful to add a feature to the API for the client to purely acknowledge events, without asking the server to send it any new ones.)
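Here's a rough sketch of that double-poll idea, reusing the hypothetical `QueueState` and `get` helpers from the earlier sketch; the real mechanics would need to fit into the app's normal polling loop.

```ts
// The first request fetches the backlog; the second, non-blocking request
// exists only to report the new `last_event_id`, which lets the server prune
// the acknowledged events from the queue now rather than at the next long-poll.
async function catchUpAndAck(realm: string, auth: string, queue: QueueState): Promise<QueueState> {
  const resp = await get(realm, auth, 'events', {
    queue_id: queue.queueId,
    last_event_id: String(queue.lastEventId),
    dont_block: 'true',
  });
  if (resp.result !== 'success' || resp.events.length === 0) {
    return queue;
  }
  const lastEventId: number = resp.events[resp.events.length - 1].id;
  // ...apply resp.events to the local data here...

  // Acknowledge promptly. (Any events this second poll happens to return would
  // still need to be handled by the normal polling loop; omitted here.)
  await get(realm, auth, 'events', {
    queue_id: queue.queueId,
    last_event_id: String(lastEventId),
    dont_block: 'true',
  });
  return { ...queue, lastEventId };
}
```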
But before optimizing, we should measure. That is:
We should get a sense of the existing cost of the event queues, and of how that cost varies when they last longer.