Skip to content

Conversation

David-Engel
Copy link
Contributor

@David-Engel David-Engel commented Sep 20, 2024

Address the failure in the DNN scenario described in #2400.

In the MSF parallel connect path, socket connections can fail quickly (as opposed to timing out), resulting in the first socket connect task being the "success" task that closes all other sockets. This happens because in a VNN, there is no listener at the other end of inactive IP addresses and socket connects take a while before they time out. In a DNN, the gateway responds with an RST on the socket for inactive IP addresses and the RST may come in before the connection completes on the active IP address. This means a "success" socket connect task (success just meaning ConnectAsync finished) will disconnect all other attempted sockets.

The change here is to define success as the ConnectAsync task finished and the socket.Connected property is true.

If this addresses the DNN issue, this simple fix would be a good candidate for backport. There are other code changes that could be made to eliminate tasks on this sync path. I have some drafted up, but the changes would require much more testing.

@JRahnama
Copy link
Contributor

I can ask the client to test with this set of changes. Hope they do not need it to be signed.

Copy link

codecov bot commented Sep 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.05%. Comparing base (cb11317) to head (0059548).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2877      +/-   ##
==========================================
- Coverage   72.06%   72.05%   -0.01%     
==========================================
  Files         299      299              
  Lines       61457    61457              
==========================================
- Hits        44290    44285       -5     
- Misses      17167    17172       +5     
Flag Coverage Δ
addons 92.90% <ø> (ø)
netcore 76.00% <100.00%> (+<0.01%) ⬆️
netfx 70.24% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cheenamalhotra
Copy link
Member

@JRahnama can you confirm if this has been tested by client? Thanks.

@JRahnama
Copy link
Contributor

@JRahnama can you confirm if this has been tested by client? Thanks.

I have provided the package and waiting for client's response.

@David-Engel
Copy link
Contributor Author

I don't think this fix will actually work. Testing #2915 with a mock server that sends a RST after connect, MDS moves on before the RST even registers on the socket. The RST might never even register on the socket without using a socket.Poll (or other read/write). Even adding socket.Poll in the parallel logic only gets me ~30ms grace between connect and RST before MDS will pass the connected socket on up the stack. I'm very hesitant to add a wait here, watching for the endpoint to send a delayed RST.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants