[Questions] Question about behavior of non durable classic queues #12829

Rmarian · 2024-11-27T10:01:53Z

Community Support Policy

I have read RabbitMQ's Community Support Policy
I agree to provide all relevant information (versions, logs, rabbitmq-diagnostics output, detailed reproduction steps)

RabbitMQ version used

3.13.7 or older

Erlang version used

26.2.x

Operating system (distribution) used

Red Hat

How is RabbitMQ deployed?

RPM package

Steps to deploy RabbitMQ cluster

Containerized RabbitMQ with docker

Steps to reproduce the behavior in question

1. Set up a three-node cluster.
2. Declare a non-durable classic queue on node 1 and connect a consumer to it.
3. Isolate node 1 from the rest of the cluster.
4. Rejoin the node to the cluster.
5. Observe that the queue sometimes disappears.

What problem are you trying to solve?

Hello team,

I want to open a discussion about a situation we encountered recently with RabbitMQ.

Setup:

RabbitMQ 3.13.6 with Khepri enabled.
Three-node cluster.
We are using non durable classic queues to consume messages from an exchange.

At some point, the cluster experienced a short-lived partition event (it lasted about 12 seconds, I think) in which one node was isolated from the other two.

Oct 13, 2024 @ 05:33:34.862 +00:00	2024-10-13 05:33:34.861488+00:00 [info] <0.553.0> rabbit on node 'rabbit@rabbitmq-1' down	rabbitmq-2
Oct 13, 2024 @ 05:33:35.073 +00:00	2024-10-13 05:33:35.072958+00:00 [info] <0.688.0> rabbit on node 'rabbit@rabbitmq-1' down	rabbitmq-3
Oct 13, 2024 @ 05:33:34.497 +00:00	2024-10-13 05:33:34.497548+00:00 [error] <0.189.0> ** Node 'rabbit@rabbitmq-2' not responding **	rabbitmq-1
Oct 13, 2024 @ 05:33:34.498 +00:00	2024-10-13 05:33:34.497693+00:00 [error] <0.263.0> ** Node 'rabbit@rabbitmq-3' not responding **	rabbitmq-1

After the partition was resolved, we noticed that one of the clients connected to the temporarily partitioned node had stopped receiving messages.

Upon further examination, we saw that the queue was deleted during the partition recovery phase as can be seen from the logs:

Oct 13, 2024 @ 05:33:34.869 +00:00	2024-10-13 05:33:34.868824+00:00 [info] <0.553.0> 1 transient queues from an old incarnation of node ... deleted in 0.007234s

But the client consumer was not notified about this.

To us this poses a risk of losing messages, and we would like an opinion on how to handle this case if possible.

Is this actually a feature or a bug? Tested with RabbitMQ 4.0 and the behavior is the same.

What would you recommend we do?

Thanks in advance,
Radu.

Answered by mkuratczyk

mkuratczyk · 2024-11-27T10:23:41Z

Maintainer

This is one of the reasons transient non-exclusive queues have been deprecated:
https://www.rabbitmq.com/blog/2021/08/21/4.0-deprecation-announcements#removal-of-transient-non-exclusive-queues

They are still allowed in 4.0 by default, but you can configure RabbitMQ to not allow them:

deprecated_features.permit.transient_nonexcl_queues = false

What you are observing is most likely a race condition. The intended order of events would be:

1. The queue exists and everything works.
2. A network partition triggers the deletion of a non-durable queue.
3. The application reconnects and redeclares the queue.
4. Everything works again.

However, sometimes the application redeclares the queue first and only then do the nodes delete it, which means they delete the new (post-partition) incarnation of the queue.
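The ordering problem can be illustrated with a toy model. This is not RabbitMQ's actual implementation, only a sketch that assumes the delayed cleanup removes transient queues by name; `ToyBroker`, `run`, and the queue name `events` are hypothetical.

```python
class ToyBroker:
    """Toy model: queues are tracked by name only (an assumption for illustration)."""

    def __init__(self):
        self.queues = {}        # queue name -> incarnation counter
        self._incarnation = 0

    def declare(self, name):
        """Create (or recreate) a queue under the given name."""
        self._incarnation += 1
        self.queues[name] = self._incarnation
        return name

    def cleanup_by_name(self, stale_names):
        """Delete every queue whose name is on the stale list."""
        for name in stale_names:
            self.queues.pop(name, None)


def run(order):
    """Replay the two post-partition steps in the given order."""
    broker = ToyBroker()
    broker.declare("events")        # queue exists before the partition
    stale = ["events"]              # recorded when the partition is detected
    for step in order:
        if step == "cleanup":
            broker.cleanup_by_name(stale)
        elif step == "redeclare":
            broker.declare("events")    # application reconnects and redeclares
    return "events" in broker.queues


intended = run(["cleanup", "redeclare"])    # queue survives
raced = run(["redeclare", "cleanup"])       # new incarnation is deleted
print(intended, raced)                      # prints: True False
```

In the raced order the cleanup cannot tell the old incarnation from the new one, because both share the same name.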

Exclusive server-named queues are the solution since their new incarnation would be named differently.
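A minimal sketch of that approach, assuming a Python client whose channel object follows the `queue_declare` / `queue_bind` / `basic_consume` shape of pika's `BlockingChannel`. The helper name `subscribe_with_exclusive_queue` is hypothetical. Passing an empty queue name asks the broker to generate one, so each redeclared queue gets a fresh name.

```python
def subscribe_with_exclusive_queue(channel, exchange, routing_key, on_message):
    """Declare a server-named exclusive queue, bind it, and start consuming.

    `channel` is assumed to behave like pika's BlockingChannel.
    Returns the broker-generated queue name.
    """
    # An empty name lets the broker pick a unique name for this incarnation.
    result = channel.queue_declare(queue="", exclusive=True)
    queue_name = result.method.queue

    # Route messages from the exchange into the freshly named queue.
    channel.queue_bind(queue=queue_name, exchange=exchange, routing_key=routing_key)

    # Manual acknowledgements are in effect here, so `on_message` should ack.
    channel.basic_consume(queue=queue_name, on_message_callback=on_message)
    return queue_name
```

The application would call this after every (re)connect rather than reusing a remembered queue name, since an exclusive queue is tied to the connection that declared it.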

8 replies

michaelklishin Nov 27, 2024
Maintainer

One reasonable request would be to introduce a protocol extension that would notify clients when a node finds itself in a minority (with Khepri, in 4.0.x), and when it rejoins. Clients then would be able to reconnect.

Another, much more drastic, option would be to disconnect all clients in one or both of those cases, but I don't expect it to be popular.

You cannot get a queue deletion notification from a majority partition, or for an event that happened a long time ago, in particular in a client on the minority side.

What you can do is periodically verify that the non-replicated queue you use is still there, using a queue.declare on a one-off channel with passive property set to true.
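A sketch of such a check, assuming a connection object shaped like pika's `BlockingConnection`; the helper name `queue_exists` is hypothetical. It relies on the broker answering a passive declare for a missing queue with a 404 channel error, which pika surfaces as an exception carrying `reply_code`.

```python
def queue_exists(connection, queue_name):
    """Return True if `queue_name` exists, False if the broker answers 404.

    `connection` is assumed to behave like pika's BlockingConnection.
    A failed passive declare closes the channel it was issued on, so a
    one-off channel is used to leave the consumer's own channel untouched.
    """
    channel = connection.channel()
    try:
        channel.queue_declare(queue=queue_name, passive=True)
        return True
    except Exception as exc:
        # With pika this is expected to be ChannelClosedByBroker with
        # reply_code 404 (NOT_FOUND); anything else is re-raised unchanged.
        if getattr(exc, "reply_code", None) == 404:
            return False
        raise
    finally:
        # The broker has already closed the channel in the 404 case.
        if channel.is_open:
            channel.close()
```

A consumer could run this on a timer from the thread that owns the connection and redeclare, rebind, and resubscribe whenever it returns `False`.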

fruetschi Dec 11, 2024

Hi,

I'm on the same dev team as @Rmarian. I wanted to point out that our biggest issue with the current behavior is not that the queues are deleted on the node in the minority, but rather that, after the network partition resolves, the clients subscribed to those queues are not notified in any way that the queues are gone.
This leads to a situation where a service that only consumes messages from a queue does not notice at all that it needs to react and redeclare the queue. The system enters a state in which it does not work, and this is not easily noticeable.

The aforementioned workaround of periodically attempting passive queue declarations would probably work; however, in a distributed system with many services, such a heartbeat may have a negative performance impact.

In addition, we are developing a platform, so this would mean imposing a requirement on all our users to implement such a heartbeat mechanism in a variety of languages. As mentioned before, we work in the safety-critical domain and suffer from broken queue bindings after transient network partitions when using Mnesia (similar to #4237), and our biggest hope was that Khepri would resolve this issue.
However, with the current behavior, it would immediately trigger the next potential safety problem since the behavior of RMQ 4.0 with Khepri changed in some regards.

Fixing this single issue on the RMQ side would allow us to move forward with adopting RMQ 4.0 with Khepri, so it would be great if you could reconsider whether there is a way to mitigate it. I believe others could be hit by this as well.

Thanks,
Dominik

mkuratczyk Dec 12, 2024
Maintainer

I can reproduce this behaviour, and I agree that this is something we should change if possible.

fruetschi Dec 16, 2024

Thanks for re-testing and confirming! I have only now found an explicit statement about this in the docs: https://www.rabbitmq.com/docs/reliability#cancel-notification

Do you have an idea about when this could possibly be fixed?

mkuratczyk Dec 16, 2024
Maintainer

I created an issue for this: #12949

michaelklishin · 2024-11-27T12:05:32Z

Maintainer

3.13.x is a release series out of support.

0 replies
