Small test program based on the examples in this repo:
#include <boost/redis/connection.hpp>
#include <boost/asio/co_spawn.hpp>
#include <boost/asio/consign.hpp>
#include <boost/asio/detached.hpp>
#include <boost/asio/redirect_error.hpp>
#include <boost/asio/steady_timer.hpp>
#include <boost/asio/use_awaitable.hpp>
#include <chrono>
#include <iostream>
#include <memory>
namespace asio = boost::asio;
using boost::redis::request;
using boost::redis::ignore;
using boost::redis::config;
using boost::redis::connection;
using boost::redis::logger;
using boost::asio::awaitable;
using boost::asio::detached;
using boost::asio::consign;
awaitable<void> run_publisher(std::shared_ptr<connection> conn)
{
asio::steady_timer timer(co_await asio::this_coro::executor);
while (conn->will_reconnect())
{
request req;
req.push("PUBLISH", "channel", "some data");
boost::system::error_code ec;
co_await conn->async_exec(req, ignore, asio::redirect_error(asio::use_awaitable, ec));
if (ec)
{
std::clog << std::chrono::system_clock::now()
<< " PUBLISH failed with message: " << ec.message()
<< std::endl;
}
timer.expires_after(std::chrono::seconds(2));
co_await timer.async_wait(asio::redirect_error(asio::use_awaitable, ec));
}
}
awaitable<void> co_main(config cfg)
{
cfg.health_check_interval = std::chrono::seconds(30);
auto conn = std::make_shared<connection>(co_await asio::this_coro::executor);
conn->async_run(cfg, {logger::level::debug}, consign(detached, conn));
co_await run_publisher(conn);
conn->cancel();
}
Using a 30-second health check interval, which, as indicated in this comment, will be the default in a future version.
After restarting Redis I typically see a delay of 3x to 4x the health check interval before the lost connection is detected and a reconnect is attempted.
When Redis is restarted, the run-op completes almost immediately. This log shows that the lost connection is only detected almost 2 minutes later:
2024-07-26 10:54:51.1038269 (Boost.Redis) run-all-op: resolve addresses 127.0.0.1:6379
2024-07-26 10:54:51.1085491 (Boost.Redis) run-all-op: connected to endpoint 127.0.0.1:6379
2024-07-26 10:54:51.1090630 (Boost.Redis) writer-op: 130 bytes written.
2024-07-26 10:54:51.1110903 (Boost.Redis) reader-op: 973 bytes read.
2024-07-26 10:54:51.1117493 (Boost.Redis) hello-op: Success
2024-07-26 10:54:53.1252447 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:54:53.1268255 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:54:55.1416191 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:54:55.1428586 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:54:57.1517312 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:54:57.1527005 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:54:59.1575930 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:54:59.1581645 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:55:01.1650597 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:55:01.1659037 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:55:03.1787981 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:55:03.1792210 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:55:05.1378655 (Boost.Redis) reader-op: End of file
2024-07-26 10:55:05.1382268 (Boost.Redis) reader-op: EOF received. Exiting ...
2024-07-26 10:55:05.1386657 (Boost.Redis) writer-op: canceled (3). Exiting ...
2024-07-26 10:55:05.1389448 (Boost.Redis) run-op: The operation completed successfully (reader), The operation completed successfully (writer)
2024-07-26 10:56:51.1219054 (Boost.Redis) check-timeout-op: Response has no value. Exiting ...
2024-07-26 10:56:51.1228789 (Boost.Redis) ping_op: error/cancelled (1).
2024-07-26 10:56:51.1236270 (Boost.Redis) check-health-op: The I/O operation has been aborted because of either a thread exit or an application request (async_ping), Pong timeout. (async_check_timeout).
2024-07-26 10:56:51.1239811 (Boost.Redis) runner-op: The operation completed successfully (async_run_all), Pong timeout. (async_health_check) The operation completed successfully (async_hello).
2024-07-26 10:56:51.1243785 (Boost.Redis) Connection lost.
2024-07-26 10:56:52.1326278 (Boost.Redis) run-all-op: resolve addresses 127.0.0.1:6379
2024-07-26 10:56:52.1333834 (Boost.Redis) run-all-op: connected to endpoint 127.0.0.1:6379
2024-07-26 10:56:52.1337355 (Boost.Redis) writer-op: 130 bytes written.
2024-07-26 10:56:52.1354276 (Boost.Redis) reader-op: 973 bytes read.
2024-07-26 10:56:52.1358929 (Boost.Redis) hello-op: Success
2024-07-26 10:56:54.1395472 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:56:54.1408228 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:56:56.1418841 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:56:56.1428140 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:56:58.1477065 (Boost.Redis) writer-op: 45 bytes written.
I added a small change to logger::write_prefix to get the timestamps:
void logger::write_prefix()
{
std::clog << std::chrono::system_clock::now() << " ";
if (!std::empty(prefix_))
std::clog << prefix_;
}
Setting cancel_if_not_connected = true in the request makes async_exec fail with the expected "not connected" error:
2024-07-26 11:17:43.6053426 PUBLISH failed with message: Not connected.
2024-07-26 11:17:43.6057837 (Boost.Redis) run-all-op: resolve addresses 127.0.0.1:6379
2024-07-26 11:17:43.6074385 (Boost.Redis) run-all-op: connected to endpoint 127.0.0.1:6379
2024-07-26 11:17:43.6078098 (Boost.Redis) writer-op: 85 bytes written.
2024-07-26 11:17:43.6107975 (Boost.Redis) reader-op: 969 bytes read.
2024-07-26 11:17:43.6115954 (Boost.Redis) hello-op: Success
2024-07-26 11:17:45.6173984 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:17:45.6179272 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 11:17:47.6251202 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:17:47.6257825 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 11:17:49.6389446 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:17:49.6398523 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 11:17:49.9818766 (Boost.Redis) reader-op: End of file
2024-07-26 11:17:49.9822220 (Boost.Redis) reader-op: EOF received. Exiting ...
2024-07-26 11:17:49.9825654 (Boost.Redis) writer-op: canceled (3). Exiting ...
2024-07-26 11:17:49.9827326 (Boost.Redis) run-op: The operation completed successfully (reader), The operation completed successfully (writer)
2024-07-26 11:17:51.6439209 PUBLISH failed with message: Not connected.
2024-07-26 11:17:53.6537792 PUBLISH failed with message: Not connected.
2024-07-26 11:17:55.6638643 PUBLISH failed with message: Not connected.
2024-07-26 11:17:57.6701620 PUBLISH failed with message: Not connected.
2024-07-26 11:17:59.6732612 PUBLISH failed with message: Not connected.
...
2024-07-26 11:19:42.0683707 PUBLISH failed with message: Not connected.
2024-07-26 11:19:43.6180784 (Boost.Redis) check-timeout-op: Response has no value. Exiting ...
2024-07-26 11:19:43.6187335 (Boost.Redis) ping_op: error/cancelled (1).
2024-07-26 11:19:43.6193823 (Boost.Redis) check-health-op: The I/O operation has been aborted because of either a thread exit or an application request (async_ping), Pong timeout. (async_check_timeout).
2024-07-26 11:19:43.6198053 (Boost.Redis) runner-op: The operation completed successfully (async_run_all), Pong timeout. (async_health_check) The operation completed successfully (async_hello).
2024-07-26 11:19:43.6200351 (Boost.Redis) Connection lost.
2024-07-26 11:19:44.0833046 PUBLISH failed with message: Not connected.
2024-07-26 11:19:44.6271292 (Boost.Redis) run-all-op: resolve addresses 127.0.0.1:6379
2024-07-26 11:19:44.6283678 (Boost.Redis) run-all-op: connected to endpoint 127.0.0.1:6379
2024-07-26 11:19:44.6290895 (Boost.Redis) writer-op: 85 bytes written.
2024-07-26 11:19:44.6397801 (Boost.Redis) reader-op: 969 bytes read.
2024-07-26 11:19:44.6403232 (Boost.Redis) hello-op: Success
2024-07-26 11:19:46.0852712 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:19:46.0861008 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 11:19:48.1011338 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:19:48.1018705 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 11:19:50.1027101 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:19:50.1036486 (Boost.Redis) reader-op: 4 bytes read.
mzimbres commented on Aug 3, 2024
Thanks for the detailed analysis (and sorry for the delayed response). I am not yet sure what the best fix is.
Yeah, it takes 1min52s more precisely. My analysis below shows the following.
I am not sure yet, but the only place where I can still adjust to reduce the delay is how long the check-health operation waits. At the moment it is 2 * ping_interval; in principle it could be reduced to ping_interval + a small number, so that instead of 2min you would experience 1min30s.
VemundH commented on Aug 5, 2024
The design of the health checks looks OK. I think the actual problem is the wait_for_all here: redis/include/boost/redis/detail/runner.hpp, line 96 in e8dd4d6.
When async_run_all completes, the health check can no longer succeed and should be canceled.
mzimbres commented on Aug 17, 2024
Good point, perhaps wait_for_one_error is the correct token here; I will check.
Improves reconnection responsiveness.
VemundH commented on Aug 19, 2024
Testing with wait_for_one_error did not make any difference. Maybe that is because the server disconnect is not considered an error?
VemundH commented on Aug 19, 2024
As a simple test I added conn_->cancel(operation::health_check); here: redis/include/boost/redis/detail/runner.hpp, line 158 in e8dd4d6.
Then I get:
The reasoning is that no more reads and writes will be attempted after run_all_op completes, so the health check can no longer succeed. Canceling it instead of waiting for the timeout removes the long delay before the reconnect logic is started. I have not familiarized myself enough with the code and design to see whether this change has any negative side-effects.
anarthal commented on Sep 1, 2024
Been out for a few days. Without being a Boost.Redis expert, I'd say @mzimbres' diagnosis is mainly correct. But bear in mind that from this log line:
This op is completing successfully. I don't know under which circumstances the other ops may complete, but I'd say it makes sense to cancel the other operations once one of them exits.
Instead of conn_->cancel(operation::health_check); I think using asio::experimental::wait_for_one() is better, as it provides exactly these semantics. The degree of nesting of async ops makes testing quite difficult, so I'd suggest exploring the option of replacing some of them with a sans-io state machine, which usually leads to more testable code.
mzimbres commented on Sep 8, 2024
Back from summer vacation ...
@VemundH I couldn't reproduce this problem so far; I tried running Redis manually, under systemd, and in docker as you explained in the PR. But I think you are correct that, since EOF is not considered an error, the parallel group linked above has to wait until the health check returns an error. I have changed this in the PR here. Can you please try again?
VemundH commented on Sep 9, 2024
@mzimbres I have tested with the changes in the PR and now the disconnect is detected when EOF is received. Thanks!
mzimbres commented on Sep 9, 2024
Correct, that is achieved in the PR by making EOF an error and by using wait_for_one_error instead of wait_for_all. The problem with that set of operations is that two of them are long-running (async_run and the health check) while the other is short-lived (the RESP3 handshake), so if we used wait_for_one instead, the handshake finishing would trigger cancellation of the other two.
At first I tried to launch all operations in the same group, i.e. read, write, handshake, health_check. That reduces the nesting of ops but bloated the connection class, which is why I split it into two classes. I agree, however, that there is still room for improvement.