Conversation

dugenkui03
Contributor

@dugenkui03 dugenkui03 commented Apr 1, 2022

Make sure to set the target branch to develop

What is the purpose of the change

XXXXX

Brief changelog

#4098

Verifying this change

XXXX

Follow this checklist to help us incorporate your contribution quickly and easily. Note: it would be helpful if you could complete the following five checklist items (the last one is optional) before asking the community to review your PR.

  • Make sure there is a Github issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a Github issue. Your pull request should address just this issue, without pulling in other changes - one PR resolves one issue.
  • Format the pull request title like [ISSUE #123] Fix UnknownException when host config not exist. Each commit in the pull request should have a meaningful subject line and body.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Write the necessary unit tests (over 80% coverage) to verify your logic; prefer mocks where cross-module dependencies exist. If a new feature or significant change is committed, remember to add an integration test in the test module.
  • Run mvn -B clean apache-rat:check findbugs:findbugs checkstyle:checkstyle to make sure basic checks pass. Run mvn clean install -DskipITs to make sure unit-test pass. Run mvn clean test-compile failsafe:integration-test to make sure integration-test pass.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@coveralls

Coverage Status

Coverage decreased (-0.05%) to 51.903% when pulling fc20d00 on dugenkui03:patch-7 into e3b6748 on apache:develop.

@codecov-commenter

Codecov Report

Merging #4099 (fc20d00) into develop (cc478bd) will increase coverage by 0.41%.
The diff coverage is 100.00%.

@@              Coverage Diff              @@
##             develop    #4099      +/-   ##
=============================================
+ Coverage      47.46%   47.88%   +0.41%     
- Complexity      4909     5000      +91     
=============================================
  Files            633      634       +1     
  Lines          42317    42533     +216     
  Branches        5544     5574      +30     
=============================================
+ Hits           20086    20367     +281     
+ Misses         19744    19666      -78     
- Partials        2487     2500      +13     
Impacted Files Coverage Δ
...he/rocketmq/client/trace/AsyncTraceDispatcher.java 79.12% <100.00%> (+0.90%) ⬆️
...va/org/apache/rocketmq/store/FlushDiskWatcher.java 81.25% <0.00%> (-9.38%) ⬇️
...in/java/org/apache/rocketmq/test/util/MQAdmin.java 38.88% <0.00%> (-5.56%) ⬇️
...ache/rocketmq/common/stats/MomentStatsItemSet.java 39.13% <0.00%> (-4.35%) ⬇️
.../java/org/apache/rocketmq/acl/common/AclUtils.java 76.25% <0.00%> (-3.60%) ⬇️
...g/apache/rocketmq/broker/util/ServiceProvider.java 46.42% <0.00%> (-3.58%) ⬇️
...etmq/client/latency/LatencyFaultToleranceImpl.java 49.35% <0.00%> (-3.22%) ⬇️
...n/java/org/apache/rocketmq/store/RunningFlags.java 31.11% <0.00%> (-2.23%) ⬇️
...pache/rocketmq/store/MultiPathMappedFileQueue.java 92.30% <0.00%> (-1.93%) ⬇️
...and/acl/ClusterAclConfigVersionListSubCommand.java 20.40% <0.00%> (-1.82%) ⬇️
... and 53 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc478bd...fc20d00. Read the comment docs.

Contributor

@ltamber ltamber left a comment


+1

@dongeforever
Member

@dugenkui03 I found another issue.
In AsyncAppenderRequest, the max message size is 128000.
So if the byte size of the accumulated TraceContext exceeds 128000, batching no longer reduces RPCs to improve performance; it only adds delay.

So the algorithm may need to consider the byte size too: if the accumulated byte size is large enough, the send should be scheduled immediately.

BTW, the data topic generated by "sendTraceData" is not used in "sendTraceDataByMQ".

Maybe we need to clarify the design of the topic mechanism for trace.

@dugenkui03
Contributor Author

dugenkui03 commented Apr 14, 2022

@dugenkui03 I found another issue. In AsyncAppenderRequest, the max message size is 128000. So if the byte size of the accumulated TraceContext exceeds 128000, batching no longer reduces RPCs to improve performance; it only adds delay.

So the algorithm may need to consider the byte size too: if the accumulated byte size is large enough, the send should be scheduled immediately.

BTW, the data topic generated by "sendTraceData" is not used in "sendTraceDataByMQ".

Maybe we need to clarify the design of the topic mechanism for trace.

@dongeforever Thanks for your suggestion. I traced the code and found that we need to do more to improve performance (such as reducing the total number of trace messages sent). The only benefit of this PR is that it makes the timing of submitting AsyncAppenderRequest clearer; it is hard to say it really improves performance.

Detailed description (originally in Chinese)

Thanks for the suggestion and the reminder. Following your hint, I traced the code and found that the data in a submitted AsyncAppenderRequest task is not sent to the server in a single call; it is further grouped by topic+regionId, and each group is sent in batches bounded by maxMsgSize.

The current AsyncAppenderRequest reporting logic and the topic+regionId grouping do not really help data reporting, because the topic the data is ultimately sent to is decided by AsyncTraceDispatcher#accessChannel.

This PR makes the timing of submitting AsyncAppenderRequest clearer, but truly optimizing performance requires more changes. Below is an analysis of the current sending algorithm and an optimization proposal.

Current logic

[diagram: current trace-sending flow]

  1. Business threads call AsyncTraceDispatcher#append to add a TraceContext to the task metadata queue traceContextQueue.
  2. The worker thread loops, pulling TraceContext objects from traceContextQueue, wraps them into an AsyncAppenderRequest task, and hands it to the traceExecutor thread pool.
  3. AsyncAppenderRequest#sendTraceData converts each TraceContext into a size-measurable TraceTransferBean and groups the beans by topic+regionId into a Map<String, List<TraceTransferBean>>.
  4. The TraceTransferBean groups are sent in batches bounded by maxMsgSize; AsyncTraceDispatcher#accessChannel decides whether they go to rmq_sys_TRACE_DATA_${regionId} or to the trace message topic (RMQ_SYS_TRACE_TOPIC by default).

The problem with the current logic is that the worker splits the data into unnecessary List<TraceTransferBean> chunks via AsyncAppenderRequest, and AsyncAppenderRequest#sendTraceData then performs an equally unnecessary grouping.
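The four steps above can be sketched in simplified Java. This is an illustrative sketch, not the actual RocketMQ code: the class names CurrentFlowSketch and Bean and the byte accounting are hypothetical stand-ins for TraceTransferBean and its topic+regionId key. It shows why every distinct key in a small batch costs at least one send per loop iteration.

```java
import java.util.*;

// Sketch of the current flow: each AsyncAppenderRequest re-groups its small
// batch by topic+regionId and sends every non-empty group, however little
// data it holds.
public class CurrentFlowSketch {
    static final int MAX_MSG_SIZE = 128_000; // limit cited in the discussion

    static class Bean {                      // stand-in for TraceTransferBean
        final String key;                    // stand-in for topic + regionId
        final String data;
        Bean(String key, String data) { this.key = key; this.data = data; }
    }

    // Step 3: group one request's batch by topic+regionId.
    static Map<String, List<Bean>> group(List<Bean> batch) {
        Map<String, List<Bean>> grouped = new HashMap<>();
        for (Bean b : batch) {
            grouped.computeIfAbsent(b.key, k -> new ArrayList<>()).add(b);
        }
        return grouped;
    }

    // Step 4: each group is sent in chunks bounded by MAX_MSG_SIZE,
    // so every distinct key costs at least one send per request.
    static int countSends(Map<String, List<Bean>> grouped) {
        int sends = 0;
        for (List<Bean> beans : grouped.values()) {
            int size = 0;
            for (Bean b : beans) {
                if (size > 0 && size + b.data.length() > MAX_MSG_SIZE) {
                    sends++;                 // chunk is full: send and restart
                    size = 0;
                }
                size += b.data.length();
            }
            if (size > 0) sends++;           // flush the tail of the group
        }
        return sends;
    }

    public static void main(String[] args) {
        List<Bean> batch = Arrays.asList(
            new Bean("topicA_region1", "x"),
            new Bean("topicB_region1", "y"));
        // Two keys in one tiny batch already cost two sends per loop iteration.
        System.out.println(countSends(group(batch))); // prints 2
    }
}
```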

Optimization proposal

[diagram: proposed trace-sending flow]

  1. The worker thread groups data by topic into a HashMap<String, TraceDataSegment>; each TraceDataSegment holds a list of TraceTransferBean, the accumulated data size, and the time the first TraceTransferBean in the current list was stored.
  2. When data is added to a TraceDataSegment, it checks whether the list has reached maxMsgSize; if so, the data is sent and the TraceDataSegment is reset to its initial state.
  3. Every pollingTimeMil milliseconds the worker thread iterates over all TraceDataSegment instances and checks whether their data has been held for more than waitTimeThresholdMil milliseconds; if so, all of it is sent, so that data is never held too long without being sent.

The overall idea of the algorithm is to send data once it reaches the size threshold maxMsgSize or a time threshold, which avoids the excess requests caused by the previous inappropriate grouping. A preliminary implementation is in #4180.
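The three steps above can be sketched in Java. This is a minimal sketch of the idea, not the code merged in #4180: the Sender interface, the String beans, and the class name TraceDataSegmentSketch are illustrative assumptions; only TraceDataSegment, maxMsgSize, and waitTimeThresholdMil come from the proposal.

```java
import java.util.*;

// Per-topic segment that flushes when the accumulated size reaches
// maxMsgSize, or when a periodic sweep finds the first entry older than
// waitTimeThresholdMil.
public class TraceDataSegmentSketch {
    interface Sender { void send(String topic, List<String> beans); }

    static class TraceDataSegment {
        final String topic;
        final int maxMsgSize;
        final Sender sender;
        final List<String> beans = new ArrayList<>();
        int byteSize = 0;
        long firstBeanAddTime = 0;

        TraceDataSegment(String topic, int maxMsgSize, Sender sender) {
            this.topic = topic; this.maxMsgSize = maxMsgSize; this.sender = sender;
        }

        void addBean(String bean, long nowMillis) {
            if (beans.isEmpty()) firstBeanAddTime = nowMillis; // step 1: remember first-entry time
            beans.add(bean);
            byteSize += bean.getBytes().length;
            if (byteSize >= maxMsgSize) flush();               // step 2: size-triggered send
        }

        // Called by the worker's sweep every pollingTimeMil milliseconds.
        void flushIfStale(long nowMillis, long waitTimeThresholdMil) {
            if (!beans.isEmpty() && nowMillis - firstBeanAddTime >= waitTimeThresholdMil) {
                flush();                                       // step 3: time-triggered send
            }
        }

        void flush() {
            sender.send(topic, new ArrayList<>(beans));
            beans.clear();
            byteSize = 0;                                      // reset to the initial state
        }
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        TraceDataSegment seg = new TraceDataSegment("RMQ_SYS_TRACE_TOPIC", 10,
            (topic, beans) -> sent.addAll(beans));
        seg.addBean("aaaa", 0);        // 4 bytes, below threshold: buffered
        seg.addBean("bbbbbbb", 1);     // total 11 >= 10: flushed by size
        seg.addBean("c", 2);
        seg.flushIfStale(500, 100);    // held 498 ms >= 100: flushed by time
        System.out.println(sent);      // prints [aaaa, bbbbbbb, c]
    }
}
```

Either trigger alone is insufficient: size-only flushing delays sparse topics indefinitely, while time-only flushing can exceed the 128000-byte message limit raised earlier in this thread.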

Contributor

@lwclover lwclover left a comment


I think there's no need to be so precise. The logic is a little complicated.

@dugenkui03
Contributor Author

I think there's no need to be so precise. The logic is a little complicated.

The latest code is under review in pr-4180. I actually think this optimization is quite meaningful. A simple analysis:

Suppose 100 producers, each sending trace messages to a different traceTopic, submit trace messages at a rate of 1 msg/ms. With the original algorithm, every loop iteration ends with 100 network IOs, because the messages must go to 100 different traceTopics; over 100 ms, 100 iterations produce 10,000 network IOs. The optimal approach is to group the messages produced during those 100 ms by target traceTopic and then send them, which takes about 100 sends rather than 10,000.

The optimized algorithm instead continuously distributes the data for each target traceTopic into that topic's cache object, a TraceDataSegment, until the segment's data size reaches the configured threshold or its holding time does. As for the implementation, once the logic is laid out clearly it takes very little code.
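As a back-of-the-envelope check of the figures above (a trivial hypothetical sketch; the class and method names are not from RocketMQ):

```java
// 100 producers, each targeting a distinct traceTopic at 1 msg/ms,
// observed over a 100 ms window.
public class IoCountSketch {
    // Original algorithm: every loop iteration (one per ms) sends once per topic.
    static int perLoopSends(int topics, int loops) {
        return topics * loops;
    }

    // Batched algorithm: one send per topic for the whole window.
    static int batchedSends(int topics) {
        return topics;
    }

    public static void main(String[] args) {
        System.out.println(perLoopSends(100, 100)); // prints 10000
        System.out.println(batchedSends(100));      // prints 100
    }
}
```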

duhenglucky pushed a commit that referenced this pull request Apr 27, 2022
…yncTraceDispatcher` (#4180)

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage
@dongeforever
Member

@dugenkui03 Since the related PR #4180 has been merged, I will close this PR.
Feel free to reopen it if you find any other problems.

Thanks for your contribution.

@duhenglucky duhenglucky added this to the 4.9.4 milestone May 19, 2022
GenerousMan pushed a commit to GenerousMan/rocketmq that referenced this pull request Aug 12, 2022
…in `AsyncTraceDispatcher` (apache#4180)

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage

* Optimized the performance of sending traceMessage