ESP8266: Avoid duplicate data sends #12157
@michalpasztamobica, thank you for your changes. |
Data not written to ESP8266 yet and ESP8266 replying busy, this is the most appropriate condition to return
if (!_parser.recv(">")) {
_parser.remove_oob("OK");
if (_busy) {
if (_ok_received) {
tr_debug("send(): _ok_received.");
goto RETRY;
} else if (_parser.recv("OK")) {
tr_debug("send(): parser found OK");
goto RETRY;
}
}
tr_debug("send(): Didn't get \">\"");
ret = NSAPI_ERROR_WOULD_BLOCK;
goto END;
}
Data written to ESP8266 but failed, it is OK to return
if (_parser.write((char *)data, (int)amount) < 0) {
tr_debug("Failed to write data");
ret = NSAPI_ERROR_DEVICE_ERROR;
}
// The "Recv X bytes" is not documented.
int bytes_confirmed;
if (!_parser.recv("Recv %d bytes", &bytes_confirmed)) {
tr_debug("Bytes not confirmed.");
ret = NSAPI_ERROR_DEVICE_ERROR;
} else if (bytes_confirmed != amount) {
tr_debug("Error: confirmed %d bytes, but expected %d.", bytes_confirmed, amount);
ret = NSAPI_ERROR_DEVICE_ERROR;
}
Converting error code to
// error hierarchy, from low to high
if (_busy) {
ret = NSAPI_ERROR_WOULD_BLOCK;
tr_debug("send(): Modem busy. ");
}
if (ret == NSAPI_ERROR_DEVICE_ERROR) {
ret = NSAPI_ERROR_WOULD_BLOCK;
tr_debug("send(): Send failed.");
}
e48af34 to 02050c5
Thanks for your review, @ccli8 . Indeed, the code makes more sense with the three Regarding the This makes some sense to me, but this would be a breaking change, so I would much appreciate more feedback... |
No. Either the send call accepts the data for sending, or it doesn't. The precise error code doesn't affect that.
"In progress" as a return code does exist, but it's not applicable for data transfer calls. That is used for "connect" where a "0" return means "I am connected", so there needs to be a separate return for "I have now started connecting, but I'm not connected yet" in the non-blocking case. ("Would block" isn't used because it HAS accepted the connect request, and it's now ongoing. "Would block" would mean it HADN'T accepted the connect request, and state hadn't changed).
For the data transfer calls, all UDP and TCP's non-error responses mean "I've accepted the data for sending". So every non-negative return already has the meaning "in progress". There's no completion to indicate.
Read all the discussion about send return codes here: #12083
This is quite similar. From that discussion, if you believe the modem hasn't accepted the data, then "would block" makes sense for TCP. It's better to use "no memory" for UDP, and/or do some internal retries. (You can't return anything other than "would block" for TCP, as any other error indicates connection breakage. It's a reliable transport, so can't return errors except for the pseudo-error "would block", unless the connection breaks.)
If your belief that the modem didn't hear you turns out to be wrong for UDP, then you can get a duplicate packet, but that should(TM) be harmless. UDP is allowed to duplicate packets - UDP apps should(TM) be tolerant.
If that belief turns out to be wrong for TCP, then the connection is corrupted. If you ever find out (can you?) that you accidentally got the modem to send something twice, you should treat the connection as reset. Do a "force reset" close (if you can) to the modem - if you can't, just close it - and give the application a "connection reset" error.
Now, I don't really follow what's going on here, but if you're treating the "recvd X bytes" message as the "modem has got the data" completion, then, yes, you don't need to necessarily wait for "SEND OK". There's no requirement for UDP that you hang around to see if the data was transmitted. A packet is allowed to be dropped later down the chain after an initial "OK" response. For TCP, a problem would at least need to be signalled later. It would be acceptable to return "OK" on a send that later failed as long as that failure triggered an error from the next call. The standard "OK" from TCP only ever means "I put it in the send buffer" in a normal implementation, it's never meant it actually got sent. You only really find out if buffered data was sent if you complete a clean close handshake. The only issue is that if the "SEND OK" is what completes the AT command, you do presumably need to not attempt to send anything else before you get it? |
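To make the return-code rule from the two comments above concrete, here is a minimal sketch. The enum values and the helper name `map_not_accepted` are hypothetical stand-ins chosen for illustration; the real `NSAPI_ERROR_*` constants are defined by Mbed OS and this is not the driver's code.

```cpp
#include <cassert>

// Stand-in result codes for illustration only (assumed values, not Mbed's).
enum SketchError {
    SKETCH_WOULD_BLOCK = -1,      // TCP: data not accepted, caller may retry
    SKETCH_NO_MEMORY = -2,        // UDP: data not accepted
    SKETCH_CONNECTION_RESET = -3  // TCP: stream can no longer be trusted
};

enum SketchProto { SKETCH_TCP, SKETCH_UDP };

// Hypothetical helper: what send() could report when the driver believes
// the modem did not accept the data.
// `stream_corrupted` means the driver found out that data may have gone
// out twice on a TCP connection.
inline SketchError map_not_accepted(SketchProto proto, bool stream_corrupted)
{
    if (proto == SKETCH_UDP) {
        // A duplicated datagram is tolerable, so there is no "reset" case.
        return SKETCH_NO_MEMORY;
    }
    // For TCP, any error other than "would block" signals a broken connection.
    return stream_corrupted ? SKETCH_CONNECTION_RESET : SKETCH_WOULD_BLOCK;
}
```

The point of keeping this in one place is that TCP only ever reports "would block" while the connection is still usable.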
I just submitted another commit which adjusts the situation to what @kjbracey-arm suggested: return WOULD_BLOCK for TCP or NO_MEMORY for UDP and throw other errors when connection should be considered broken. I also found that we were silently truncating buffers larger than 2048 B, claiming we have sent the whole packet, so I fixed this, too. (Our tests did not test this scenario, not even sure it's a real one). At the same time ESP8266 is now able to handle a partial send ( @ccli8 , ESP8266's serial mechanism and documentation do not allow for a 100% reliable transportation. I think the current code is the most reasonable situation. I extended the timeout that waits for |
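The truncation fix mentioned above can be illustrated with a small sketch. The 2048-byte limit comes from the comment; the constant and function names are hypothetical, and the sketch assumes a stream (TCP) socket where reporting a shorter count is acceptable.

```cpp
#include <cstddef>

// Assumed per-command limit, taken from the comment above.
static const size_t SKETCH_MAX_SEND_SIZE = 2048;

// Hypothetical helper: how many bytes a single send attempt will actually
// pass to the modem. The caller must report THIS number back to the
// application, not the originally requested amount.
inline size_t clamp_send_amount(size_t requested)
{
    return requested > SKETCH_MAX_SEND_SIZE ? SKETCH_MAX_SEND_SIZE : requested;
}
```

Returning the clamped count lets the application see the partial send and submit the remainder itself.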
@michalpasztamobica This doesn't avoid duplicate tcp data send. My concern is that According to your above comment, Code snippet below with
// error hierarchy, from low to high
if (_busy) {
ret = NSAPI_ERROR_WOULD_BLOCK;
tr_debug("send(): Modem busy. ");
}
if (ret == NSAPI_ERROR_DEVICE_ERROR) {
ret = NSAPI_ERROR_WOULD_BLOCK;
tr_debug("send(): Send failed.");
}
3135124 to 61409da
Update code and description to avoid duplicate packets (if the |
// The "Recv X bytes" is not documented.
if (!_parser.recv("Recv %d bytes", &bytes_confirmed)) {
    tr_debug("send(): Bytes not confirmed.");
    ret = NSAPI_ERROR_DEVICE_ERROR;
This line should be removed because TCP and UDP are the only protocols supported.
// ESP8266 ACKed data over serial, but did not ACK over TCP or report any error.
_prev_send_ok_pending = true;
_parser.oob("SEND OK", callback(this, &ESP8266::_oob_send_ok_received));
I see OOB handler added here but I don't see one removed anywhere in the code.
We are now checking if ESP8266 has confirmed receiving data over serial port with an undocumented (but existing) "Recv x bytes" message. Next we are explicitly waiting for an official "SEND OK".
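As an illustration of the serial acknowledgement check described above, here is a minimal standalone sketch using `sscanf`. The function name is hypothetical, and the driver itself uses its AT parser rather than this code; the sketch only shows the comparison between confirmed and expected byte counts.

```cpp
#include <cstdio>

// Hypothetical helper: returns true only if `line` is a "Recv N bytes"
// message and N equals the number of bytes we wrote to the modem.
inline bool serial_ack_matches(const char *line, int expected)
{
    int confirmed = 0;
    // sscanf returns the number of converted fields; we need exactly one.
    if (std::sscanf(line, "Recv %d bytes", &confirmed) != 1) {
        return false;  // not an acknowledgement at all
    }
    return confirmed == expected;
}
```

A mismatch or a missing message would then be treated as a serial-level failure, separate from the later `SEND OK`.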
61409da to 953196c
@VeijoPesonen , @ccli8 thanks for review, I hope I addressed both your remarks. Regarding the oob adding/removal. Regarding the status instead of flags - I agree with you, @ccli8 and I added a variable to handle this (although a bit different than you suggested). There is one more thing I noticed. With waiting for serial ACK and then for |
@michalpasztamobica With Now that
bool ESP8266::close(int id)
{
    //May take a second try if device is busy
    for (unsigned i = 0; i < 2; i++) {
        _smutex.lock();
        if (_parser.send("AT+CIPCLOSE=%d", id)) {
@AnttiKauppila , @VeijoPesonen , @SeppoTakalo we need your judgement with the last open issue in this PR. No need to read through the whole discussion, don't worry ;-). The last issue we have left is that currently
I think this is a bug and that we should return @ccli8 thinks that Can you please help us decide which is the right way here? |
Oh my.. ESP never ceases to amaze me... When Socket::close() fails... does it mean that you could continue to use the socket? If you allow that, you just open a new can of worms.. Socket has only one error code, that allows you to continue using it, the I would even claim that DEVICE_ERROR is a type of error where right procedure would be to shut down the whole network interface and restart it. It should not be returned from device, if the conditions allow it to proceed. But, unfortunately, it has been used in cases where there is no other generic error codes available. But for the close.. I would still allow user to call Another reason not to expect anyone retrying |
Thanks a lot, @SeppoTakalo |
@michalpasztamobica I've removed |
1. Fix 'spurious close' by adding close() in open(). 'spurious close' gets frequent and cannot ignore when send() changes to asynchronous. User can retry open() until 'spurious close' gets true.
2. Allow only one actively sending socket because:
   (1) ESP8266 AT packets 'SEND OK'/'SEND FAIL' are not associated with socket ID. No way to tell them.
   (2) In original implementation, ESP8266::send() is synchronous, which implies only one actively sending socket.
3. Register 'SEND OK'/'SEND FAIL' oobs, like others in ESP8266::ESP8266 constructor. Don't get involved in oob management with send status because ESP8266 modem possibly doesn't reply these packets on error case.
4. Now that ESP8266::send() changes to asynchronous, drop the code with _parser.recv("SEND OK")/_parser.recv("SEND FAIL"). _parser.recv("SEND OK")/_parser.recv("SEND FAIL") and 'SEND OK'/'SEND FAIL' oobs both consume 'SEND OK'/'SEND FAIL' packets and complicate flow control.
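The "only one actively sending socket" rule from the commit message can be modelled roughly as below. This is a sketch under assumptions: the class `SendTracker`, its method names, and the socket count of 5 are hypothetical, and clearing the in-flight ID on `SEND FAIL` is a choice made for this sketch rather than something the commit message states.

```cpp
#include <cassert>

// Hypothetical model; names loosely echo the PR but this is not driver code.
class SendTracker {
public:
    enum { SOCKET_COUNT = 5 };  // assumed number of modem sockets

    SendTracker() : _sending_id(-1)
    {
        for (int i = 0; i < SOCKET_COUNT; i++) {
            _send_fail[i] = false;
        }
    }

    // Returns true if socket `id` may start a send now.
    bool try_begin_send(int id)
    {
        assert(id >= 0 && id < SOCKET_COUNT);
        if (_sending_id != -1) {
            return false;  // an earlier send is still unacknowledged
        }
        _sending_id = id;
        return true;
    }

    // 'SEND OK' carries no socket ID, so it is attributed to the single
    // in-flight socket.
    void on_send_ok() { _sending_id = -1; }

    // 'SEND FAIL' is likewise attributed to the in-flight socket.
    void on_send_fail()
    {
        if (_sending_id != -1) {
            _send_fail[_sending_id] = true;
        }
        _sending_id = -1;
    }

    // Reset per-socket state, for example when the socket is closed.
    void clear(int id)
    {
        assert(id >= 0 && id < SOCKET_COUNT);
        if (id == _sending_id) {
            _sending_id = -1;
        }
        _send_fail[id] = false;
    }

    bool failed(int id) const { return _send_fail[id]; }
    int sending_id() const { return _sending_id; }

private:
    int _sending_id;                // -1 when nothing is in flight
    bool _send_fail[SOCKET_COUNT];  // per-socket failure flag
};
```

Because the acknowledgements are anonymous, a second in-flight send would make it impossible to tell which socket a `SEND OK` belongs to, which is why `try_begin_send` refuses it.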
CI started
Test run: SUCCESS
Summary: 11 of 11 test jobs passed
Aside from the official CI I ran netsocket-* and network-* tests locally and they all passed (some DNS tests failed if I had logs enabled, probably because the oobs logs got quite heavy).
@AnttiKauppila , @SeppoTakalo , @VeijoPesonen , could at least one of you look through the changes and give your approval?
if (id == _sock_sending_id) {
    _sock_sending_id = -1;
}
_sock_i[id].send_fail = false;
This same snippet has been added at least to 6 different places so maybe it should be turned into a function.
Added a function and replaced the recurring code.
if (_sock_sending_id >= 0 && _sock_sending_id < SOCKET_COUNT) {
    if (!_sock_i[id].send_fail) {
        tr_debug("send(): Previous packet (socket %d) was not yet ACK-ed with SEND OK.", _sock_sending_id);
        return NSAPI_ERROR_WOULD_BLOCK;
What if modem doesn't reply with anything? Might SEND OK or SEND FAIL get lost? I'm just thinking out loud here so this isn't a request to change anything.
This was discussed in the main thread. Please see this comment and the previous one.
Long story short: ERROR is another possibility aside from SEND OK and SEND FAIL, but as experiments showed it is a recoverable error, so we basically ignore it.
If neither SEND OK nor SEND FAIL are coming we just keep returning WOULD_BLOCK to any new send() attempts. It's up to the application to decide how long this can be tolerated. I assume some socket timeout will take care of this in a typical mbed application?
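On the application side, "deciding how long this can be tolerated" could look roughly like the bounded retry below. This is a generic sketch, not an Mbed API: the constants, the function name `send_with_retry`, and the callback shapes are all hypothetical.

```cpp
#include <functional>

// Stand-in codes for illustration only (assumed values).
static const int RETRY_WOULD_BLOCK = -1;
static const int RETRY_GAVE_UP = -2;

// Hypothetical helper: call `try_send` until it stops reporting
// "would block" or until `max_attempts` is exhausted. `wait` is invoked
// between attempts (for example a short sleep in a real application).
inline int send_with_retry(const std::function<int()> &try_send,
                           int max_attempts,
                           const std::function<void()> &wait)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int ret = try_send();
        if (ret != RETRY_WOULD_BLOCK) {
            return ret;  // success (byte count) or a hard error
        }
        wait();
    }
    return RETRY_GAVE_UP;  // the acknowledgement never arrived in time
}
```

An attempt budget keeps a lost `SEND OK`/`SEND FAIL` from stalling the application indefinitely.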
    goto END;
} else if (bytes_confirmed != amount) {
    tr_debug("send(): Error: confirmed %d bytes, but expected %d.", bytes_confirmed, amount);
    ret = NSAPI_ERROR_DEVICE_ERROR;
Would it be ok in case of TCP that ESP8266 accepts less data than we are trying to send?
I thought about this as well, but wasn't brave enough to implement it ;-). I have never seen this situation happen, so I assume it's either all or nothing, but you are right, if ESP ever decides to accept partial write, we should be able to handle it.
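If a partial write were ever accepted on TCP, handling it could follow the usual "send the remainder" loop. The sketch below is hypothetical and independent of the driver: `write_some` stands for any function that returns how many bytes were accepted (or a negative error), and `send_all` is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper: keep offering the unsent tail of the buffer until
// everything is accepted, an error occurs, or no progress is made.
// Returns the number of bytes accepted, or the error if nothing was sent.
inline long send_all(const char *data, size_t len,
                     const std::function<long(const char *, size_t)> &write_some)
{
    size_t sent = 0;
    while (sent < len) {
        long n = write_some(data + sent, len - sent);
        if (n < 0) {
            // Report progress made so far if there is any, else the error.
            return sent > 0 ? static_cast<long>(sent) : n;
        }
        if (n == 0) {
            break;  // nothing accepted; let the caller retry later
        }
        assert(static_cast<size_t>(n) <= len - sent);
        sent += static_cast<size_t>(n);
    }
    return static_cast<long>(sent);
}
```

Reporting the partial count instead of an error matches stream semantics, where the caller is expected to resubmit what was not accepted.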
Test run: SUCCESS
Summary: 11 of 11 test jobs passed
Summary of changes
Fixes #11544.
EDIT 30.12.2019: Changed the description quite heavily, therefore did not leave the old part struck through, see history if you want to see the original description.
EDIT 9.01.2020: Now I updated the description below.
The original issue report points out that in case the ESP8266 fails to respond within serial timeout period (2 seconds), there is no way of telling if the packet was sent or not. Therefore if the application decides to retry sending it might be that packet gets sent out twice.
Following the remarks from @ccli8 and from @kjbracey-arm I propose the following:
Recv x bytes message arrives. This basically acknowledges successful UART transfer. If it doesn't come or if any other serial-related error pops up, we can assume the ESP is in an unknown state and better be reset - we return DEVICE_ERROR. If only the SEND OK is missing, we return WOULD_BLOCK and can handle the acknowledgement asynchronously via an OOB.
2) If the serial acknowledge arrives, we can wait a while for SEND OK message (although, according to @kjbracey-arm 's explanation, we don't really have to). There must be some timeout to this wait. I added a counter that allows 3 blocking checks for SEND OK in case of busy s... message.
@AnttiKauppila , I moved your recently added retry mechanism waiting for SEND OK . Now that we check for serial acknowledgement it makes sense to do this wait a bit later. I also added a limit to it (3 retries), please bear this in mind during review.
SEND FAIL in mind, introduces per-socket failure flag and removed the large chunk of code which was holding send() back waiting for busy or SEND OK/FAIL - @AnttiKauppila , please pay attention to this part when reviewing as this removes your recent changes.
I checked that all greentea tests which are passing on master are also passing with this PR (CI is not running ESP8266) and the mbed-os-example-pelion connects fine. I noticed however that netsocket-udp test suite takes much longer to pass in RAAS (more than 1100 seconds), so the timeout was increased accordingly.
Impact of changes
The send() calls might take longer, but success rate will increase.
Migration actions required
None
Documentation
None
Pull request type
Test results
Reviewers
@ccli8
@AnttiKauppila
@VeijoPesonen
@SeppoTakalo