PANIC: FS::FileWrap::dtor: file not closed in endpoint logs which caused endpoint to restart #6624
@nehasharma5 can you please paste the panic output from the previous endpoint pod? You can use `oc logs -p` to check it. I have raised 6266 for the endpoint panic. Please see if it looks similar.
Hi @akmithal, ep.log is already attached to the bug; those are the previous endpoint logs. I did not find any issue associated with the id 6266 you mentioned.
The logs show: Jun-30 9:11:28.185 [Endpoint/8] [L1] core.server.system_services.system_store:: SystemStore: rebuild took 0.8ms
@nehasharma5 I think it was a typo in the number and it's this one - #6566
But in this case the file causing the panic is a multipart part file:
I see in the endpoint log (which has a high debug level!) that we were aborting the multipart upload a bit before that panic.
Which is followed by a successful open:
That open part file is being written to in nsfs `_upload_stream`. But then, when you aborted the upload, we got an abort request from the client:
And we deleted the entire temp directory of the multipart upload, and part-275 with it:
But I don't know why the deletion caused the FileWrap to fail on dtor, because we expected the finally clause in `_upload_stream` to close that FileWrap before the memory is destructed. @jeniawhite I think it would be easy to try to reproduce this locally just by Ctrl-C on the aws cli during a multipart upload, and then we can debug more easily. WDYT?
Referencing what I think is the root cause - nodejs/node#36674. It is a real hazard that async functions can get "cancelled" without ever reaching their finally blocks. See noobaa-core/src/endpoint/s3/s3_utils.js, lines 106 to 114 in b700dae.
Here is a direct reproduction of the dtor panic:

```js
'use strict';

const stream = require('stream');
const nb_native = require('./src/util/nb_native');

async function main() {
    nb_native();
    await new Promise(r => setTimeout(r, 100));
    const source = new stream.Readable({
        read() {
            this.push(Buffer.from(`hello world ${process.hrtime.bigint()}`));
        }
    });
    source.on('error', err => console.log('*** SOURCE CLOSED', err));
    await new Promise(r => setTimeout(r, 100));
    source.destroy(new Error('DIE'));
    await new Promise(r => setTimeout(r, 100));
    // See https://github.com/nodejs/node/issues/36674
    console.log('*** SOURCE', source.destroyed ? 'destroyed !!!' : '');
    const strm = stream.pipeline(
        source,
        new stream.PassThrough(),
        err => console.log('*** PIPELINE:', err),
    );
    await write_stream_to_file(strm, '/dev/null');
}

/**
 * @param {stream.Readable} strm
 * @param {string} fname
 * @returns {Promise<void>}
 */
async function write_stream_to_file(strm, fname) {
    let file;
    try {
        console.log('*** OPEN', fname);
        file = await nb_native().fs.open({}, fname, 'w');
        console.log('*** STREAM', strm.destroyed ? 'destroyed !!!' : '');
        for await (const buf of strm) {
            console.log(buf.toString());
            // await file.write({}, buf);
        }
    } catch (err) {
        console.log('*** CATCH', err);
    } finally {
        // we expect this finally block to be executed in any case
        // but the async function can get "cancelled" if it awaits
        // a promise that is pending and gets garbage collected.
        console.log('*** FINALLY', file);
        if (file) await file.close({});
    }
}

main();
```

This is the output:

```
$ node a.js
OpenSSL 1.1.1k 25 Mar 2021 setting up
init_rand_seed: starting ...
...
init_rand_seed: done
*** SOURCE CLOSED Error: DIE
    at main (/Users/gu/code/noobaa-core/a.js:19:20)
*** SOURCE destroyed !!!
*** OPEN /dev/null
*** STREAM
PANIC: FS::FileWrap::dtor: file not closed _path=/dev/null _fd=23 Socket operation on non-socket (38) ~FileWrap() at ../src/native/fs/fs_napi.cpp:725
Abort trap: 6
```

So now that we know why our finally blocks are not called, I think we should change the FileWrap dtor to just warn and close the fd. @jeniawhite I would write it like this:

```cpp
~FileWrap()
{
    if (_fd) {
        LOG("FS::FileWrap::dtor: file not closed " << DVAL(_path) << DVAL(_fd));
        int r = ::close(_fd);
        if (r) LOG("FS::FileWrap::dtor: file close failed " << DVAL(_path) << DVAL(_fd) << DVAL(r));
        _fd = 0;
    }
}
```

And similarly for DirWrap:

```cpp
~DirWrap()
{
    if (_dir) {
        LOG("FS::DirWrap::dtor: dir not closed " << DVAL(_path) << DVAL(_dir));
        int r = closedir(_dir);
        if (r) LOG("FS::DirWrap::dtor: dir close failed " << DVAL(_path) << DVAL(_dir) << DVAL(r));
        _dir = 0;
    }
}
```
This is also reproduced by destroying the source_stream (the req stream) after opening the FD.
Running an upload of a 30MB file (which triggers a multipart upload) and destroying it mid-air:
Running this upload several times will cause the same dtor panic (without the chunked upload).
More context on this GC-vs-finally issue here -
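The hazard itself can be shown in a few lines with no noobaa code involved (this sketch uses process exit rather than GC, but the observable effect on finally is the same): an async function parked on a promise that never settles is simply abandoned, and its finally clause never runs.

```javascript
let ran_finally = false;

async function doomed() {
    try {
        // Pending forever; nothing keeps the event loop alive for it.
        await new Promise(() => {});
    } finally {
        ran_finally = true; // never reached
    }
}

doomed();

process.on('exit', () => {
    // The process exits without ever running the finally block.
    console.log('ran_finally =', ran_finally); // ran_finally = false
});
```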
With #6670 opened, moving this one to verification.
Thanks @jeniawhite for resolving. The IO is aborted with a proper error, and the endpoint logs show the correct error:
Environment info

```
[[email protected] ~]# noobaa version
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: noobaa/noobaa-core:master-20210627
INFO[0000] operator-image: noobaa/noobaa-operator:5.9.0

[[email protected] ~]# oc version
Client Version: 4.7.13
Server Version: 4.7.13
Kubernetes Version: v1.20.0+df9c838

[[email protected] ~]# oc get nodes
NAME                         STATUS   ROLES           AGE    VERSION
master0.ns.cp.fyre.ibm.com   Ready    master,worker   4d1h   v1.20.0+df9c838
master1.ns.cp.fyre.ibm.com   Ready    master,worker   4d1h   v1.20.0+df9c838
master2.ns.cp.fyre.ibm.com   Ready    master,worker   4d1h   v1.20.0+df9c838
```
Actual behavior
Expected behavior
Steps to reproduce
```
NAME                               READY   STATUS    RESTARTS   AGE
noobaa-core-0                      1/1     Running   0          67m
noobaa-db-pg-0                     1/1     Running   0          47m
noobaa-endpoint-84858cb98d-sxfdr   1/1     Running   2          67m
noobaa-operator-57d449689c-hxdsm   1/1     Running   0          45h
```
More information - Screenshots / Logs / Other output
```
[root@fyreauto-x-app1 ~]# du -sh testfile2
98G     testfile2
```
ep.log
noobaa_diagnostics_1625044851.tar.gz