PANIC: FS::FileWrap::dtor: file not closed in endpoint logs which caused endpoint to restart #6624

Closed
nehasharma5 opened this issue Jun 30, 2021 · 10 comments

nehasharma5 commented Jun 30, 2021

Environment info

[[email protected] ~]# noobaa version
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: noobaa/noobaa-core:master-20210627
INFO[0000] operator-image: noobaa/noobaa-operator:5.9.0
[[email protected] ~]#

[[email protected] ~]# oc version
Client Version: 4.7.13
Server Version: 4.7.13
Kubernetes Version: v1.20.0+df9c838
[[email protected] ~]#
[[email protected] ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
master0.ns.cp.fyre.ibm.com Ready master,worker 4d1h v1.20.0+df9c838
master1.ns.cp.fyre.ibm.com Ready master,worker 4d1h v1.20.0+df9c838
master2.ns.cp.fyre.ibm.com Ready master,worker 4d1h v1.20.0+df9c838
[[email protected] ~]#

Actual behavior

  1. The endpoint restarted and logged a PANIC error.

Expected behavior

  1. The endpoint should not restart.

Steps to reproduce

  1. Start an upload (s3 cp testfile2 s3://buck-2) and interrupt it with Ctrl+C mid-transfer.
  2. The endpoint panicked and restarted, and a concurrent IO failed with: upload failed: ./testfile2 to s3://buck-2/testfile2 An error occurred (503) when calling the UploadPart operation (reached max retries: 2): Service Unavailable
  3. [[email protected] logs]# oc get pod
    NAME READY STATUS RESTARTS AGE
    noobaa-core-0 1/1 Running 0 67m
    noobaa-db-pg-0 1/1 Running 0 47m
    noobaa-endpoint-84858cb98d-sxfdr 1/1 Running 2 67m
    noobaa-operator-57d449689c-hxdsm 1/1 Running 0 45h
    [[email protected] logs]#

More information - Screenshots / Logs / Other output

[root@fyreauto-x-app1 ~]# du -sh testfile2
98G testfile2
ep.log
noobaa_diagnostics_1625044851.tar.gz

@akmithal

@nehasharma5 can you please paste the panic output from the previous endpoint pod? You can use the command "oc logs -p" to check it.

I have raised 6266 for an endpoint panic. Please see if it looks similar.

@nehasharma5

Hi @akmithal, ep.log is already attached to the bug; those are the previous endpoint logs. I could not find any issue associated with the id 6266 you mentioned.

@nehasharma5

Logs show:

Jun-30 9:11:28.185 [Endpoint/8] [L1] core.server.system_services.system_store:: SystemStore: rebuild took 0.8ms
Jun-30 9:11:28.185 [Endpoint/8] [L3] core.rpc.rpc:: RPC._on_request: COMPLETED srv server_inter_process_api.load_system_store reqid 13@ws://[::ffff:10.254.2.42]:45760/(xi8u32ov) connid wss://noobaa-mgmt.noobaa.svc:443(120pr9a)
Jun-30 9:11:33.429 [Endpoint/8] [L3] core.rpc.rpc:: RPC._on_request: ENTER srv server_inter_process_api.update_master_change reqid 14@ws://[::ffff:10.254.2.42]:45760/(xi8u32ov) connid wss://noobaa-mgmt.noobaa.svc:443(120pr9a)
Jun-30 9:11:33.430 [Endpoint/8] [L1] core.server.common_services.auth_server:: load_auth: { system: false } { iat: 1625041606 }
Jun-30 9:11:33.430 [Endpoint/8] [L3] core.server.common_services.auth_server:: load auth system: undefined
Jun-30 9:11:33.430 [Endpoint/8] [L3] core.rpc.rpc:: RPC._on_request: COMPLETED srv server_inter_process_api.update_master_change reqid 14@ws://[::ffff:10.254.2.42]:45760/(xi8u32ov) connid wss://noobaa-mgmt.noobaa.svc:443(120pr9a)
Jun-30 9:11:38.552 [Endpoint/8] [L4] core.rpc.rpc:: RPC PING fcall://fcall(omztin1)
Jun-30 9:11:38.553 [Endpoint/8] [L4] core.rpc.rpc:: RPC PONG fcall://fcall(omztin1)
Jun-30 9:11:38.553 [Endpoint/8] [L4] core.rpc.rpc:: RPC PINGPONG fcall://fcall(omztin1)
PANIC: FS::FileWrap::dtor: file not closed _path=/nsfs/fs1/buck-2/.noobaa-nsfs_60dab1e5f90f240027de0ef2/multipart-uploads/4b3a4c39-707d-47d2-a410-ffdafc4d2da4/part-275 _fd=27 Transport endpoint is not connected (107) ~FileWrap() at ../src/native/fs/fs_napi.cpp:616
/noobaa_init_files/noobaa_init.sh: line 74: 8 Aborted $*

guymguym commented Jul 2, 2021

@nehasharma5 I think it was a typo in the number and it's this one - #6566
Although both panics are similar (FileWrap::dtor: file not closed), there could be a different cause.
In #6566 the panic is during object read on a file:

PANIC: FS::FileWrap::dtor: file not closed 
_path=/nsfs/nsfs-nsr-1/bucket-11/medium_file_4 
_fd=38  Success (0) ~FileWrap() at ../src/native/fs/fs_napi.cpp:603

But in this case the file causing the panic is a multipart file:

PANIC: FS::FileWrap::dtor: file not closed 
_path=/nsfs/fs1/buck-2/.noobaa-nsfs_60dab1e5f90f240027de0ef2/multipart-uploads/4b3a4c39-707d-47d2-a410-ffdafc4d2da4/part-275 
_fd=27 Transport endpoint is not connected (107) ~FileWrap() at ../src/native/fs/fs_napi.cpp:616

I see in the endpoint log (which has a high debug level!) that we were aborting the multipart upload a bit before that panic.
If I track this part in the logs I see this flow:

Jun-30 9:11:17.240 [Endpoint/8]    [L1] core.endpoint.s3.s3_rest:: 
S3 REQUEST PUT /buck-2/testfile2?uploadId=4b3a4c39-707d-47d2-a410-ffdafc4d2da4&partNumber=275  
    op put_object_uploadId
    request_id kqj9el87-2i8phs-om6 
    { 'accept-encoding': 'identity', 'user-agent': 'aws-cli/2.1.33 Python/3.8.8 Linux/4.18.0-305.3.1.el8_4.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.cp', 'content-md5': 'LHq4Wokyg+mMkx6VEa3Rgg==', expect: '100-continue', 'x-amz-date': '20210630T091117Z', 'x-amz-content-sha256': 'UNSIGNED-PAYLOAD', authorization: 'AWS4-HMAC-SHA256 Credential=hTCjHTjraxvvwSTCRgcr/20210630/us-east-1/s3/aws4_request, SignedHeaders=content-md5;host;x-amz-content-sha256;x-amz-date, Signature=299de6a1038dcca758eb56fb6bccb78f9bfbf87bc545475716616ee6578f829e', 'content-length': '16777216', host: 's3-noobaa.apps.ns.cp.fyre.ibm.com', 'x-forwarded-host': 's3-noobaa.apps.ns.cp.fyre.ibm.com', 'x-forwarded-port': '443', 'x-forwarded-proto': 'https', forwarded: 'for=10.17.33.26;host=s3-noobaa.apps.ns.cp.fyre.ibm.com;proto=https', 'x-forwarded-for': '10.17.33.26' }

Jun-30 9:11:17.240 [Endpoint/8]    [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: 
PUT OBJECT PART buck-2 testfile2 275

Which is followed by a successful open:

2021-06-30 09:11:17.240660 [PID-8/TID-8] [L1] FS::FSWorker::Begin: FileOpen _path=/nsfs/fs1/buck-2/.noobaa-nsfs_60dab1e5f90f240027de0ef2/multipart-uploads/4b3a4c39-707d-47d2-a410-ffdafc4d2da4/part-275 _flags=577 _mode=432
2021-06-30 09:11:17.240713 [PID-8/TID-16] [L1] FS::FSWorker::Execute: FileOpen _path=/nsfs/fs1/buck-2/.noobaa-nsfs_60dab1e5f90f240027de0ef2/multipart-uploads/4b3a4c39-707d-47d2-a410-ffdafc4d2da4/part-275 _flags=577 _mode=432 _uid=2001 _gid=2002 _backend=GPFS
2021-06-30 09:11:17.240979 [PID-8/TID-8] [L1] FS::FileOpen::OnOK: _path=/nsfs/fs1/buck-2/.noobaa-nsfs_60dab1e5f90f240027de0ef2/multipart-uploads/4b3a4c39-707d-47d2-a410-ffdafc4d2da4/part-275

That open part file is then written to in nsfs _upload_stream.

But then, when you aborted the upload, we got an abort request from the client:

Jun-30 9:11:19.022 [Endpoint/8]    [L1] core.endpoint.s3.s3_rest:: 
  S3 REQUEST DELETE /buck-2/testfile2?uploadId=4b3a4c39-707d-47d2-a410-ffdafc4d2da4  
  op delete_object_uploadId 
  request_id kqj9emlp-fg676m-hnz 
  { 'accept-encoding': 'identity', 'user-agent': 'aws-cli/2.1.33 Python/3.8.8 Linux/4.18.0-305.3.1.el8_4.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.cp', 'x-amz-date': '20210630T091119Z', 'x-amz-content-sha256': 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', authorization: 'AWS4-HMAC-SHA256 Credential=hTCjHTjraxvvwSTCRgcr/20210630/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=db6ae676fadddeb079b0e489e0370f2f599f540fdadd5c126c3b8a15d199417e', 'content-length': '0', host: 's3-noobaa.apps.ns.cp.fyre.ibm.com', 'x-forwarded-host': 's3-noobaa.apps.ns.cp.fyre.ibm.com', 'x-forwarded-port': '443', 'x-forwarded-proto': 'https', forwarded: 'for=10.17.33.26;host=s3-noobaa.apps.ns.cp.fyre.ibm.com;proto=https', 'x-forwarded-for': '10.17.33.26' }

And then we deleted the entire temp directory of the multipart upload, and part-275 with it:

2021-06-30 09:11:19.027977 [PID-8/TID-8] [L1] FS::FSWorker::Begin: Unlink _path=/nsfs/fs1/buck-2/.noobaa-nsfs_60dab1e5f90f240027de0ef2/multipart-uploads/4b3a4c39-707d-47d2-a410-ffdafc4d2da4/part-275
...
2021-06-30 09:11:19.111761 [PID-8/TID-16] [L1] FS::FSWorker::Execute: Unlink _path=/nsfs/fs1/buck-2/.noobaa-nsfs_60dab1e5f90f240027de0ef2/multipart-uploads/4b3a4c39-707d-47d2-a410-ffdafc4d2da4/part-275 _uid=2001 _gid=2002
...
2021-06-30 09:11:19.113580 [PID-8/TID-8] [L1] FS::FSWorker::OnOK: undefined Unlink _path=/nsfs/fs1/buck-2/.noobaa-nsfs_60dab1e5f90f240027de0ef2/multipart-uploads/4b3a4c39-707d-47d2-a410-ffdafc4d2da4/part-275

But I don't know why the deletion caused the FileWrap to fail in its dtor, because we expected the finally clause in _upload_stream to close that FileWrap before the memory is destructed.
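
For reference, the general shape we rely on there is a try/finally around the native file handle - a simplified sketch with illustrative names (not the exact namespace_fs code), using the same nb_native fs binding shown in the reproduction further down this thread:

const nb_native = require('./src/util/nb_native');

// Simplified, illustrative sketch of the upload-part flow: open the part file,
// write the source stream into it, and close the fd in finally.
// The panic means the FileWrap dtor ran without this finally ever executing.
async function upload_stream_sketch(fs_context, source_stream, upload_path) {
    let target_file;
    try {
        target_file = await nb_native().fs.open(fs_context, upload_path, 'w');
        for await (const buf of source_stream) {
            await target_file.write(fs_context, buf);
        }
    } finally {
        // expected to always run - but see the GC hazard discussed below
        if (target_file) await target_file.close(fs_context);
    }
}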

@jeniawhite I think it would be easy to try and reproduce this locally just by pressing Ctrl-C on the aws cli during a multipart upload, and then we can debug more easily, WDYT?


akmithal commented Jul 6, 2021

Hello @guymguym, sorry for the typo. My defect is #6626, so please refer to that one as well.

@guymguym

Referencing what I think is the root cause - nodejs/node#36674

It is a real hazard that async functions can get "cancelled" without going to their finally block just by having a pending promise or async iterator/stream that gets garbage collected. In the upload part case this is most likely caused by the pipeline used in s3_utils.decode_chunked_upload():

function decode_chunked_upload(source_stream) {
    const decoder = new ChunkedContentDecoder();
    // pipeline will back-propagate errors from the decoder to stop streaming from the source,
    // so the error callback here is only needed for logging.
    stream.pipeline(source_stream, decoder, err => {
        if (err) console.log('decode_chunked_upload: pipeline error', err.stack || err);
    });
    return decoder;
}

Here is a direct reproduction of the dtor panic:

'use strict';

const stream = require('stream');
const nb_native = require('./src/util/nb_native');

async function main() {
    nb_native();
    await new Promise(r => setTimeout(r, 100));

    const source = new stream.Readable({
        read() {
            this.push(Buffer.from(`hello world ${process.hrtime.bigint()}}`));
        }
    });
    source.on('error', err => console.log('*** SOURCE CLOSED', err));

    await new Promise(r => setTimeout(r, 100));
    source.destroy(new Error('DIE'));
    await new Promise(r => setTimeout(r, 100));

    // See https://github.com/nodejs/node/issues/36674
    console.log('*** SOURCE', source.destroyed ? 'destroyed !!!' : '');
    const strm = stream.pipeline(
        source,
        new stream.PassThrough(),
        err => console.log('*** PIPELINE:', err),
    );

    await write_stream_to_file(strm, '/dev/null');
}

/**
 * @param {stream.Readable} strm
 * @param {string} fname
 * @returns {Promise<void>}
 */
async function write_stream_to_file(strm, fname) {
    let file;
    try {
        console.log('*** OPEN', fname);
        file = await nb_native().fs.open({}, fname, 'w');
        console.log('*** STREAM', strm.destroyed ? 'destroyed !!!' : '');
        for await (const buf of strm) {
            console.log(buf.toString());
            // await file.write({}, buf);
        }
    } catch (err) {
        console.log('*** CATCH', err);
    } finally {
        // we expect this finally block to be executed in any case
        // but the async function can get "cancelled" if it awaits
        // a promise that is pending and gets garbage collected.
        console.log('*** FINALLY', file);
        if (file) await file.close({});
    }
}

main();

This is the output:

$ node a.js
OpenSSL 1.1.1k  25 Mar 2021 setting up
init_rand_seed: starting ...
...
init_rand_seed: done
*** SOURCE CLOSED Error: DIE
    at main (/Users/gu/code/noobaa-core/a.js:19:20)
*** SOURCE destroyed !!!
*** OPEN /dev/null
*** STREAM 
PANIC: FS::FileWrap::dtor: file not closed _path=/dev/null _fd=23  Socket operation on non-socket (38) ~FileWrap() at ../src/native/fs/fs_napi.cpp:725
Abort trap: 6

So now that we know why our finally blocks are not called, I think we should change the FileWrap dtor to just warn and close the fd. @jeniawhite I would write it like this:

    ~FileWrap()
    {
        if (_fd) {
            LOG("FS::FileWrap::dtor: file not closed " << DVAL(_path) << DVAL(_fd));
            int r = ::close(_fd);
            if (r) LOG("FS::DirWrap::dtor: file close failed " << DVAL(_path) << DVAL(_fd) << DVAL(r));
            _fd = 0;
        }
    }

And similarly for DirWrap:

    ~DirWrap()
    {
        if (_dir) {
            LOG("FS::DirWrap::dtor: dir not closed " << DVAL(_path) << DVAL(_dir));
            int r = closedir(_dir);
            if (r) LOG("FS::DirWrap::dtor: dir close failed " << DVAL(_path) << DVAL(_dir) << DVAL(r));
            _dir = 0;
        }
    }

@jeniawhite

This is also getting reproduced by destroying the source_stream (req stream) after opening the FD.
Let's take the example of a multipart upload (non-chunked upload) that uploads parts and destroys the stream mid-way.
I've managed to reproduce this with our standalone NSFS code as well.
Running the NSFS standalone with constrained resources and garbage collection visibility:

node --trace_gc --expose-gc --max-old-space-size=50 src/core/nsfs.js /tmp

Running an upload of a 30MB file (which will trigger a multipart upload) and destroying the stream mid-way:

/* Copyright (C) 2020 NooBaa */
var AWS = require('aws-sdk');
// Set the region 
AWS.config.update({region: 'REGION'});
// Create S3 service object
var s3 = new AWS.S3({apiVersion: '2006-03-01', accessKeyId: 'ABC', secretAccessKey: 'ABC', endpoint: 'http://localhost:6001', s3ForcePathStyle: true });
// Set the upload parameters
var uploadParams = {Bucket: 'jenia', Key: 'uploadterm' };
var file = '/Users/jenia/Downloads/Docker.dmg';
// Configure the file stream and obtain the upload parameters
var fs = require('fs');
var fileStream = fs.createReadStream(file);
fileStream.on('error', function(err) {
  console.log('File Error', err);
});
let counter = 0;
uploadParams.Body = fileStream;
// call S3 to upload the file to the specified bucket
s3.upload(uploadParams, function(err, data) {
    if (err) {
        console.log("Error", err);
    } if (data) {
        console.log("Upload Success", data.Location);
    }
});
fileStream.on('data', function(data) {
    counter++;
    if (counter > 100) {
        fileStream.destroy(new Error('DESTROY'));
    }
});

Running this upload several times will cause the same dtor panic to happen (without the chunked_upload).

@guymguym

More context on this GC-vs-finally issue here -
tc39/proposal-async-await#89
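
To see the hazard in isolation, here is a minimal standalone sketch (illustrative only, no noobaa code involved): an async function suspended on a promise that never settles never resumes, so its finally block never runs, and once nothing references the chain it is simply garbage collected:

'use strict';

async function with_cleanup() {
    try {
        // a promise that never settles - like iterating a stream that was
        // destroyed before piping and will never emit 'end' or 'error'
        await new Promise(() => { /* intentionally never resolves */ });
    } finally {
        // never reached: the function stays suspended forever and is
        // eventually collected together with its pending promise
        console.log('finally ran');
    }
}

with_cleanup();
console.log('main returned'); // prints, while "finally ran" never does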

@nimrod-becker

With #6670 being opened, moving this one to verification.


nehasharma5 commented Jul 20, 2021

Thanks @jeniawhite for resolving.
Verified on:
[[email protected] ~]# noobaa version
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: noobaa/noobaa-core:master-20210719
INFO[0000] operator-image: noobaa/noobaa-operator:5.9.0
[[email protected] ~]#

The IO is aborted with a proper error:
^Curllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 's3-noobaa.apps.ns.cp.fyre.ibm.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
cancelled: ctrl-c received
[root@fyreauto-x-app1 ~]#

Endpoint logs show the correct error:
Jul-20 6:48:32.646 [Endpoint/8] [L0] core.sdk.namespace_fs:: NamespaceFS: abort_object_upload /nsfs/fs1/buck-1/.noobaa-nsfs_60f66f8ea5f4e300224afadb/multipart-uploads/9c5a5e23-2a4e-4243-8558-2bf38aff4dcf
[[email protected] ~]#
