Skip to content

Slow sometimes to replace event queue after expiry #809

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
gnprice opened this issue Jul 13, 2024 · 10 comments
Open

Slow sometimes to replace event queue after expiry #809

gnprice opened this issue Jul 13, 2024 · 10 comments
Labels
a-api Implementing specific parts of the Zulip server API a-sync Event queue; retry; local echo; races
Milestone

Comments

@gnprice
Copy link
Member

gnprice commented Jul 13, 2024

When the app resumes after having been in the background for a while, or after the device was offline, its event queue on the server may have expired. (The server does this after 10 minutes of not getting any get-events requests for that queue.) Normally this is fine because we just make a new one: that was #185 / #466.

That mechanism does work, because I regularly see it working. But at least some of the time, it doesn't. I observed it failing to work this evening; details below.

I think the next step for debugging this issue is: keep an eye out for the symptom, and if you see it happen then try to get promptly to a computer — like within a few minutes — to take a look at the log. It's likely there will be something telltale there, even on a release build; if one of us happens to catch it on a debug build, there will almost certainly be good information there.

Related issues

Detailed report

Specifically, earlier this evening I opened up the app after it was in the background, and for a long time it didn't recover a live event queue. I replied to a message, and the message didn't appear in the existing message list, but did appear when I navigated to another message list that should contain it — which is the telltale symptom of a lack of an event queue. (Given the absence of #465.)

I was at home, on wifi, and had no trouble loading web pages while sitting at the same spot before and after this.

I left the app open for a while, with the device unlocked on the table, because I was curious how long the situation would persist. It lasted for at least 20 minutes: I kept looking down at the screen occasionally to check, and it was still at the "Combined feed" message list and still didn't have the new message.

On the other hand it didn't last forever! A while after that, I looked and the new message had indeed appeared. I'm about 97% sure I hadn't done anything that would cause it to appear short of a restored event queue: didn't navigate out of the screen and back in, and didn't kill the app and restart. I have some doubt only because it seems strange that an issue would cause the app to take over 20 minutes to do this, and yet not just prevent it completely from succeeding.

@gnprice gnprice added the a-api Implementing specific parts of the Zulip server API label Jul 13, 2024
@gnprice gnprice added this to the Launch milestone Jul 13, 2024
@gnprice
Copy link
Member Author

gnprice commented Aug 14, 2024

This morning I observed what might be the same issue. Specifically:

  • I looked at my phone for the first time after waking up.
  • There was a notification. I tapped it to open.
  • The app opened, and navigated to the conversation.
  • It showed old data, immediately (no loading spinner) — so the app must have been still in memory.
  • But the data stayed old. It's been that way for at least 10 minutes now as I write this; for most of that time, the device has been awake with the app on the screen.

One new piece of information is that the loading indicator from #852 hasn't appeared. So that should help narrow down what's happening.


Another is that I went promptly over to my computer and looked at logcat. Here's what logcat showed, filtered to the com.zulip.flutter package:

2024-08-14 06:33:43.585  8352-8352  ActivityThread          com.zulip.flutter                    E  Package [com.google.android.gms] reported as REPLACED, but missing application info. Assuming REMOVED.
2024-08-14 06:33:43.587  8352-8352  ActivityThread          com.zulip.flutter                    E  Package [com.google.android.gms] reported as REPLACED, but missing application info. Assuming REMOVED.
2024-08-14 09:04:52.074  8352-8352  VRI[MainActivity]       com.zulip.flutter                    D  visibilityChanged oldVisibility=true newVisibility=false

So those first two entries are from when I was still asleep, before I observed the issue. And the last one is from shortly after I observed it. A similar "visibilityChanged" line appears again if I swipe the app into the background.

In other words, there's nothing in logcat — at least nothing logged by our package. Sometimes logcat has relevant entries logged by system packages, but I didn't capture those; by the time I thought to broaden the logcat filter a couple of minutes later, the log buffer was overrun by its normal background noise.

It's likely that if this build were logging all the things we log in debug mode, then there'd be something interesting in it. So one possible next step would be to have the release build on my device start logging more. This requires a bit of care because I don't want a normal release build (the ones we publish for users) to log all the things that our debug builds log with debugLog.

I've also made the logcat buffer on my device a lot bigger — it turns out that, very sensibly, the default buffer size is small because most users will never read it, but with adb you can just set it to some other value. It was 256 KiB and I made it 32 MiB:

$ adb logcat -g
main: ring buffer is 256 KiB (241 KiB consumed, 782 KiB readable), max entry is 5120 B, max payload is 4068 B
system: ring buffer is 256 KiB (247 KiB consumed, 850 KiB readable), max entry is 5120 B, max payload is 4068 B
crash: ring buffer is 256 KiB (64 KiB consumed, 51 KiB readable), max entry is 5120 B, max payload is 4068 B
kernel: ring buffer is 256 KiB (253 KiB consumed, 1 MiB readable), max entry is 5120 B, max payload is 4068 B

$ adb logcat -b main -G 32M

$ adb logcat -g
main: ring buffer is 32 MiB (248 KiB consumed, 778 KiB readable), max entry is 5120 B, max payload is 4068 B
system: ring buffer is 256 KiB (253 KiB consumed, 803 KiB readable), max entry is 5120 B, max payload is 4068 B
crash: ring buffer is 256 KiB (64 KiB consumed, 51 KiB readable), max entry is 5120 B, max payload is 4068 B
kernel: ring buffer is 256 KiB (253 KiB consumed, 1 MiB readable), max entry is 5120 B, max payload is 4068 B

So that'll make it easier to catch any relevant lines that other packages may be emitting, as well as any that we add (or if I happen to spot the issue while using a debug build, then the log lines we'd already be emitting in that case).

@gnprice
Copy link
Member Author

gnprice commented Aug 14, 2024

One new piece of information is that the loading indicator from #852 hasn't appeared. So that should help narrow down what's happening.

Rereading [UpdateMachine.poll], I think what this has to mean is probably that we threw an exception from within PerAccountStore.handleEvent.

That represents a bug, but we can and should recover from it: we should show the user an error dialog (for the beta), and then replace the store the same way we do when the event queue expires. (As described by #563.)

@gnprice
Copy link
Member Author

gnprice commented Aug 14, 2024

This morning I observed what might be the same issue

… Nope, false alarm. The message that was in the notification doesn't appear in a fresh reload either, nor in web. Presumably the message was just deleted.

Still this is a thing we should do, though, in case we do get an exception in handleEvent:

we can and should recover from it: we should show the user an error dialog (for the beta), and then replace the store the same way we do when the event queue expires.

And now my logcat buffer is bigger, which will help in any future debugging on my device.

@gnprice
Copy link
Member Author

gnprice commented Aug 21, 2024

And now my logcat buffer is bigger, which will help in any future debugging on my device.

Well, sadly this didn't stick: the buffer size is back to 256 kiB.

There's a setting at Developer settings > Logger buffer sizes, which I've now gone and set to "8M per log buffer", the largest option. These SO answers (both on the same question) were helpful:
https://stackoverflow.com/a/32094784
https://stackoverflow.com/a/57657874

After changing that, getprop persist.logd.size seems to reflect it:

$ adb shell getprop persist.logd.size
8388608

but adb logcat -g doesn't — still 256 KiB. Maybe the settings UI neglects to update the live value, and only sets that "persist" value which IIUC gets used on boot.

So I also ran adb logcat -G 32M, and now adb logcat -g reflects that.

@PIG208

This comment was marked as off-topic.

@gnprice
Copy link
Member Author

gnprice commented Aug 21, 2024

Let's continue the logcat thread in chat:
https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/adb.20logcat/near/1928685

@gnprice gnprice added the a-sync Event queue; retry; local echo; races label Oct 22, 2024
@gnprice
Copy link
Member Author

gnprice commented Oct 22, 2024

I don't recall having seen this a second time yet, since my original report in July.

(I had a second report above in August but then found at #809 (comment) that it was unrelated.)

This report earlier this month by @PIG208 might be the same issue, though:

so it's possible we've now seen it twice. → Hmm, scratch that — on rereading, I don't think the symptoms are consistent with being the same issue. In the original report above, new events for messages weren't arriving as seen in the message list. In #979, when a new event was provoked, it sounds like the message list promptly updated as usual.

@gnprice
Copy link
Member Author

gnprice commented Oct 22, 2024

Anyway. Having not seen this happen again, I'll push it out to the post-launch milestone, for the same reasons as I just described for that other similar-flavored issue: #979 (comment)

I think this won't be actionable until we hear more reports of it happening, so I'll push it out to the post-launch milestone. If we do see or hear of it again, this thread will help us aggregate those reports.

@gnprice
Copy link
Member Author

gnprice commented Oct 23, 2024

Well, having said that just yesterday:

This morning @chrisbobbe encountered what seems like potentially the same issue, as described at #514 (comment) .

Then just a few minutes ago, I also saw those same symptoms! In my case:

  • I picked up my phone and went back to the app, which had been in the background for perhaps an hour or two.
  • I don't remember whether there was a loading indicator at all; I think there wasn't. (Our loading indicators are new since this issue was filed; that was Show "Connecting…" banner (or equivalent) when server data is stale #465.) In any case if there was one, it went away, and the app appeared caught up.
  • But the mark-as-read button wasn't appearing in a message list, even though there were unreads there. (That's something that we should prevent happening in a different way: Update unreads data on fetching messages #649. But until then, it's a symptom of not having a functioning event queue.)
  • And then to confirm I wasn't getting events, I added 👍 reactions to a couple of messages. The reactions reached the server, but didn't show up locally.

Edit: Also the version I was running had a previous revision of #868 in it, which would show the same notices as the later merged version of #868 does plus a toast when replacing a stale event queue. I don't think I got any toasts when I went back to the app.

@gnprice
Copy link
Member Author

gnprice commented Oct 23, 2024

In the case Chris saw:

  • A bit more detail is available, because the app was a debug build. It was running in the iOS simulator, on his laptop.
  • Looking at the debug log, it shows requests yesterday afternoon and events arriving, up to a message event whose timestamp is late afternoon. Chris believes that's right around when he last interacted with the simulator; the laptop remained awake, at least for a while, but the simulator wouldn't have had focus.
  • Just after that event arrives, there's a getEvents request, as there would normally be.
  • But then there's never a further log line that looks like that getEvents request completing — neither an event showing it completed successfully, nor the log lines we print in any of the error paths in the UpdateMachine.poll loop. So it seems like the request's future never got completed, not even completed with an error.
  • That remained true even after several hours of Chris using the app today. So there were plenty of later log lines showing requests (like getMessages or addReaction); just no errors or anything else about the fate of that getEvents request from yesterday.

OTOH I guess one difference from my original report above is that on that occasion:

On the other hand it didn't last forever! A while after [the 20-minute mark], I looked and the new message had indeed appeared. I'm about 97% sure I hadn't done anything that would cause it to appear short of a restored event queue: didn't navigate out of the screen and back in, and didn't kill the app and restart. I have some doubt only because it seems strange that an issue would cause the app to take over 20 minutes to do this, and yet not just prevent it completely from succeeding.

That suggests these could be separate issues. Then again, as I said there, that belated recovery is so puzzling that I'm not sure it wasn't some kind of error in my observation.

Anyway, I'll bump the priority of that issue that tracks adding a timeout here:

because that may effectively resolve this issue for us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a-api Implementing specific parts of the Zulip server API a-sync Event queue; retry; local echo; races
Projects
Status: No status
Development

No branches or pull requests

2 participants