Conversation

dugenkui03
Contributor

@dugenkui03 dugenkui03 commented Apr 1, 2022

Make sure to set the target branch to develop

What is the purpose of the change

XXXXX

Brief changelog

#4098

Verifying this change

XXXX

Follow this checklist to help us incorporate your contribution quickly and easily. Note: it would be helpful if you could complete the following five checklist items (the last one is optional) before asking the community to review your PR.

  • Make sure there is a Github issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a Github issue. Your pull request should address just this issue, without pulling in other changes - one PR resolves one issue.
  • Format the pull request title like [ISSUE #123] Fix UnknownException when host config not exist. Each commit in the pull request should have a meaningful subject line and body.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Write the necessary unit tests (over 80% coverage) to verify your logic; prefer mocks where cross-module dependencies exist. If a new feature or significant change is committed, remember to add an integration test in the test module.
  • Run mvn -B clean apache-rat:check findbugs:findbugs checkstyle:checkstyle to make sure basic checks pass. Run mvn clean install -DskipITs to make sure unit-test pass. Run mvn clean test-compile failsafe:integration-test to make sure integration-test pass.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@coveralls

Coverage Status

Coverage decreased (-0.05%) to 51.903% when pulling fc20d00 on dugenkui03:patch-7 into e3b6748 on apache:develop.

@codecov-commenter

Codecov Report

Merging #4099 (fc20d00) into develop (cc478bd) will increase coverage by 0.41%.
The diff coverage is 100.00%.

@@              Coverage Diff              @@
##             develop    #4099      +/-   ##
=============================================
+ Coverage      47.46%   47.88%   +0.41%     
- Complexity      4909     5000      +91     
=============================================
  Files            633      634       +1     
  Lines          42317    42533     +216     
  Branches        5544     5574      +30     
=============================================
+ Hits           20086    20367     +281     
+ Misses         19744    19666      -78     
- Partials        2487     2500      +13     
Impacted Files Coverage Δ
...he/rocketmq/client/trace/AsyncTraceDispatcher.java 79.12% <100.00%> (+0.90%) ⬆️
...va/org/apache/rocketmq/store/FlushDiskWatcher.java 81.25% <0.00%> (-9.38%) ⬇️
...in/java/org/apache/rocketmq/test/util/MQAdmin.java 38.88% <0.00%> (-5.56%) ⬇️
...ache/rocketmq/common/stats/MomentStatsItemSet.java 39.13% <0.00%> (-4.35%) ⬇️
.../java/org/apache/rocketmq/acl/common/AclUtils.java 76.25% <0.00%> (-3.60%) ⬇️
...g/apache/rocketmq/broker/util/ServiceProvider.java 46.42% <0.00%> (-3.58%) ⬇️
...etmq/client/latency/LatencyFaultToleranceImpl.java 49.35% <0.00%> (-3.22%) ⬇️
...n/java/org/apache/rocketmq/store/RunningFlags.java 31.11% <0.00%> (-2.23%) ⬇️
...pache/rocketmq/store/MultiPathMappedFileQueue.java 92.30% <0.00%> (-1.93%) ⬇️
...and/acl/ClusterAclConfigVersionListSubCommand.java 20.40% <0.00%> (-1.82%) ⬇️
... and 53 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc478bd...fc20d00. Read the comment docs.

Contributor

@ltamber ltamber left a comment


+1

@dongeforever
Member

@dugenkui03 I found another issue.
In AsyncAppenderRequest, the max message size is 128000.
So if the byte size of the accumulated TraceContext exceeds 128000, batching no longer reduces RPCs to improve performance; it only adds delay.

So the algorithm may need to consider the byte size too: if the accumulated byte size is large enough, the send should be scheduled immediately.

BTW, the data topic generated by "sendTraceData" is not used in "sendTraceDataByMQ".

Maybe we need to clarify the design of the topic mechanism for trace.

@dugenkui03
Contributor Author

dugenkui03 commented Apr 14, 2022

@dugenkui03 I found another issue. In AsyncAppenderRequest, the max message size is 128000. So if the byte size of the accumulated TraceContext exceeds 128000, batching no longer reduces RPCs to improve performance; it only adds delay.

So the algorithm may need to consider the byte size too: if the accumulated byte size is large enough, the send should be scheduled immediately.

BTW, the data topic generated by "sendTraceData" is not used in "sendTraceDataByMQ".

Maybe we need to clarify the design of the topic mechanism for trace.

@dongeforever Thanks for your suggestion. I traced the code and found that we need to do more to improve performance (such as reducing the total number of trace messages sent). The only benefit of this PR is that it makes the timing of submitting AsyncAppenderRequest clearer; it is hard to say it really improves performance.

Detailed description (originally in Chinese)

Thanks for the suggestion and the reminder. Following your hint, I traced the code and found that the data in a submitted AsyncAppenderRequest task is not sent to the server in a single call; it is further grouped by topic+regionId, and each group is sent in batches bounded by maxMsgSize.

The current AsyncAppenderRequest reporting logic and the topic+regionId grouping do not really help data reporting, because the topic the data is ultimately sent to is decided by AsyncTraceDispatcher#accessChannel.

This PR makes the timing of submitting AsyncAppenderRequest clearer, but truly optimizing performance requires more changes. Below is an analysis of the current sending algorithm and an optimization proposal.

Current logic

[diagram: current trace-sending flow]

  1. Business threads call AsyncTraceDispatcher#append to add a TraceContext to the task metadata queue traceContextQueue.
  2. The worker thread loops, pulling TraceContext objects from traceContextQueue, wraps them into an AsyncAppenderRequest task, and hands it to the traceExecutor thread pool.
  3. AsyncAppenderRequest#sendTraceData converts each TraceContext into a size-measurable TraceTransferBean and groups the beans by topic+regionId into a Map<String, List<TraceTransferBean>>.
  4. The TraceTransferBean groups are sent in batches bounded by maxMsgSize; AsyncTraceDispatcher#accessChannel decides whether they go to rmq_sys_TRACE_DATA_${regionId} or to the trace message topic (RMQ_SYS_TRACE_TOPIC by default).

The problem with the current logic is that the worker splits the data into unnecessary List<TraceTransferBean> chunks via AsyncAppenderRequest, and AsyncAppenderRequest#sendTraceData then performs an equally unnecessary grouping.
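The four steps above can be sketched in simplified Java. This is an illustrative sketch, not the actual RocketMQ code: the class names CurrentFlowSketch and Bean and the byte accounting are hypothetical stand-ins for TraceTransferBean and its topic+regionId key. It shows why every distinct key in a small batch costs at least one send per loop iteration.

```java
import java.util.*;

// Sketch of the current flow: each AsyncAppenderRequest re-groups its small
// batch by topic+regionId and sends every non-empty group, however little
// data it holds.
public class CurrentFlowSketch {
    static final int MAX_MSG_SIZE = 128_000; // limit cited in the discussion

    static class Bean {                      // stand-in for TraceTransferBean
        final String key;                    // stand-in for topic + regionId
        final String data;
        Bean(String key, String data) { this.key = key; this.data = data; }
    }

    // Step 3: group one request's batch by topic+regionId.
    static Map<String, List<Bean>> group(List<Bean> batch) {
        Map<String, List<Bean>> grouped = new HashMap<>();
        for (Bean b : batch) {
            grouped.computeIfAbsent(b.key, k -> new ArrayList<>()).add(b);
        }
        return grouped;
    }

    // Step 4: each group is sent in chunks bounded by MAX_MSG_SIZE,
    // so every distinct key costs at least one send per request.
    static int countSends(Map<String, List<Bean>> grouped) {
        int sends = 0;
        for (List<Bean> beans : grouped.values()) {
            int size = 0;
            for (Bean b : beans) {
                if (size > 0 && size + b.data.length() > MAX_MSG_SIZE) {
                    sends++;                 // chunk is full: send and restart
                    size = 0;
                }
                size += b.data.length();
            }
            if (size > 0) sends++;           // flush the tail of the group
        }
        return sends;
    }

    public static void main(String[] args) {
        List<Bean> batch = Arrays.asList(
            new Bean("topicA_region1", "x"),
            new Bean("topicB_region1", "y"));
        // Two keys in one tiny batch already cost two sends per loop iteration.
        System.out.println(countSends(group(batch))); // prints 2
    }
}
```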

Optimization proposal

[diagram: proposed trace-sending flow]

  1. The worker thread groups data by topic into a HashMap<String, TraceDataSegment>; each TraceDataSegment holds a list of TraceTransferBean, the accumulated data size, and the time the first TraceTransferBean in the current list was stored.
  2. When data is added to a TraceDataSegment, it checks whether the list has reached maxMsgSize; if so, the data is sent and the TraceDataSegment is reset to its initial state.
  3. Every pollingTimeMil milliseconds the worker thread iterates over all TraceDataSegment instances and checks whether their data has been held for more than waitTimeThresholdMil milliseconds; if so, all of it is sent, so that data is never held too long without being sent.

The overall idea of the algorithm is to send data once it reaches the size threshold maxMsgSize or a time threshold, which avoids the excess requests caused by the previous inappropriate grouping. A preliminary implementation is in #4180.
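The three steps above can be sketched in Java. This is a minimal sketch of the idea, not the code merged in #4180: the Sender interface, the String beans, and the class name TraceDataSegmentSketch are illustrative assumptions; only TraceDataSegment, maxMsgSize, and waitTimeThresholdMil come from the proposal.

```java
import java.util.*;

// Per-topic segment that flushes when the accumulated size reaches
// maxMsgSize, or when a periodic sweep finds the first entry older than
// waitTimeThresholdMil.
public class TraceDataSegmentSketch {
    interface Sender { void send(String topic, List<String> beans); }

    static class TraceDataSegment {
        final String topic;
        final int maxMsgSize;
        final Sender sender;
        final List<String> beans = new ArrayList<>();
        int byteSize = 0;
        long firstBeanAddTime = 0;

        TraceDataSegment(String topic, int maxMsgSize, Sender sender) {
            this.topic = topic; this.maxMsgSize = maxMsgSize; this.sender = sender;
        }

        void addBean(String bean, long nowMillis) {
            if (beans.isEmpty()) firstBeanAddTime = nowMillis; // step 1: remember first-entry time
            beans.add(bean);
            byteSize += bean.getBytes().length;
            if (byteSize >= maxMsgSize) flush();               // step 2: size-triggered send
        }

        // Called by the worker's sweep every pollingTimeMil milliseconds.
        void flushIfStale(long nowMillis, long waitTimeThresholdMil) {
            if (!beans.isEmpty() && nowMillis - firstBeanAddTime >= waitTimeThresholdMil) {
                flush();                                       // step 3: time-triggered send
            }
        }

        void flush() {
            sender.send(topic, new ArrayList<>(beans));
            beans.clear();
            byteSize = 0;                                      // reset to the initial state
        }
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        TraceDataSegment seg = new TraceDataSegment("RMQ_SYS_TRACE_TOPIC", 10,
            (topic, beans) -> sent.addAll(beans));
        seg.addBean("aaaa", 0);        // 4 bytes, below threshold: buffered
        seg.addBean("bbbbbbb", 1);     // total 11 >= 10: flushed by size
        seg.addBean("c", 2);
        seg.flushIfStale(500, 100);    // held 498 ms >= 100: flushed by time
        System.out.println(sent);      // prints [aaaa, bbbbbbb, c]
    }
}
```

Either trigger alone is insufficient: size-only flushing delays sparse topics indefinitely, while time-only flushing can exceed the 128000-byte message limit raised earlier in this thread.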

Contributor

@lwclover lwclover left a comment


I think there's no need to be so precise. The logic is a little complicated.

@dugenkui03
Contributor Author

I think there's no need to be so precise. The logic is a little complicated.

The latest code is under review in pr-4180. I actually think this optimization is quite meaningful. A simple analysis:

Suppose 100 producers, each sending trace messages to a different traceTopic, submit trace messages at a rate of 1 msg/ms. With the original algorithm, every loop iteration ends with 100 network IOs, because the messages must go to 100 different traceTopics; over 100 ms, 100 iterations produce 10,000 network IOs. The optimal approach is to group the messages produced during those 100 ms by target traceTopic and then send them, which takes about 100 sends rather than 10,000.

The optimized algorithm instead continuously distributes the data for each target traceTopic into that topic's cache object, a TraceDataSegment, until the segment's data size reaches the configured threshold or its holding time does. As for the implementation, once the logic is laid out clearly it takes very little code.
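As a back-of-the-envelope check of the figures above (a trivial hypothetical sketch; the class and method names are not from RocketMQ):

```java
// 100 producers, each targeting a distinct traceTopic at 1 msg/ms,
// observed over a 100 ms window.
public class IoCountSketch {
    // Original algorithm: every loop iteration (one per ms) sends once per topic.
    static int perLoopSends(int topics, int loops) {
        return topics * loops;
    }

    // Batched algorithm: one send per topic for the whole window.
    static int batchedSends(int topics) {
        return topics;
    }

    public static void main(String[] args) {
        System.out.println(perLoopSends(100, 100)); // prints 10000
        System.out.println(batchedSends(100));      // prints 100
    }
}
```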

duhenglucky pushed a commit that referenced this pull request Apr 27, 2022
…yncTraceDispatcher` (#4180)

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage
@dongeforever
Member

@dugenkui03 Since the related PR #4180 has been merged, I will close this PR.
Feel free to reopen it if you find any other problems.

Thanks for your contribution.

@duhenglucky duhenglucky added this to the 4.9.4 milestone May 19, 2022
GenerousMan pushed a commit to GenerousMan/rocketmq that referenced this pull request Aug 12, 2022
…in `AsyncTraceDispatcher` (apache#4180)

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage