Slow running process dying at the last hurdle #1836
I'm seeing similar behavior on 1.21.3 for a job that was succeeding before (I think on 1.21.1, but I'm not sure). Memory on the workers was around 1 out of 2 GB, the scheduler was at 12 out of 24 GB. Lots of the above "Unexpected worker" messages throughout (but I added some workers as I went along, so that seemed fine), then suddenly lots of

distributed.scheduler - INFO - Register tcp://10.63.236.4:40392
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.63.236.4:40392

and

distributed.scheduler - INFO - Worker 'tcp://10.63.233.4:41486' failed from closed comm: in <closed TCP>: Stream is closed
distributed.scheduler - INFO - Remove worker tcp://10.63.233.4:41486

and

Traceback (most recent call last):
  File "/opt/conda/envs/dask/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/envs/dask/lib/python3.6/site-packages/distributed/stealing.py", line 309, in balance
    stealable = self.stealable[sat.address][level]
KeyError: 'tcp://10.63.74.4:33090'
Do you happen to have logs from the failed workers?
We should downgrade that warning to INFO level. I am a bit surprised that it occurs, though; I would expect the newer stealing handshakes to make things pretty transactional.
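For anyone skimming the traceback above: the failure mode is a work-stealing pass looking up a worker that has already been removed from the scheduler's bookkeeping. Below is a minimal sketch of the defensive pattern, not the actual `distributed.stealing` code; `stealable`, `saturated`, and `level` echo the traceback, everything else is hypothetical.

```python
# Minimal sketch, NOT the real distributed.stealing code: `stealable` maps
# worker address -> list of task lists per stealing level, and `saturated`
# lists workers considered busy.  A worker can be removed between those two
# views, so the lookup must tolerate a missing key.

def balance(stealable, saturated, level=0):
    moves = []
    for addr in saturated:
        per_level = stealable.get(addr)   # .get() instead of [...] avoids the KeyError
        if per_level is None:             # worker already gone from the bookkeeping
            continue
        moves.extend(per_level[level])
    return moves

# Toy usage: 'tcp://b' was removed after being marked saturated.
stealable = {"tcp://a": [["inc-1", "inc-2"]]}
print(balance(stealable, ["tcp://a", "tcp://b"]))   # ['inc-1', 'inc-2']
```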
Previously we would fail in the following situation:
1. A and B both have data
2. C tries to get data from A
3. A fails during this transfer
4. The scheduler goes to clean things up and gets confused because someone (B) still has the data

The fix for this wasn't hard, but the test is a bit odd. It seems that the subsequent communication from C to B fails, which I don't yet understand. Related to dask#1836
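A rough sketch of the bookkeeping problem that commit message describes; the `who_has` mapping here (key -> set of worker addresses holding it) is illustrative only, not the scheduler's actual data structures. The point is that when A dies mid-transfer, only A should be dropped from the set of holders; the data is still available from B.

```python
# Illustrative toy structures, not the scheduler's real ones.
who_has = {"x": {"tcp://A", "tcp://B"}}

def remove_worker(who_has, address):
    """Drop a failed worker from every holders set; return keys truly lost."""
    lost = []
    for key, holders in who_has.items():
        holders.discard(address)      # remove only the failed worker
        if not holders:               # no replica left -> must be recomputed
            lost.append(key)
    return lost

print(remove_worker(who_has, "tcp://A"))   # []  -- 'x' is still held by B
print(who_has)                             # {'x': {'tcp://B'}}
```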
Partially resolved by #1853
The KeyError above is likely resolved now. Thank you for reporting. Leaving this issue open, as there appear to be a couple of other things here.
Twice now I've seen a slow-running job that's taking up a good amount of memory (but is still okay) slow down exponentially towards the end and then die, with just a few tasks left, with the following error:

In both cases I was able to get the job to run successfully without distributed, using the vanilla dask scheduler, in less time and with far less memory. I believe the memory requirements were being affected by the pandas issue pandas-dev/pandas#19941.
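For reference, a sketch of how one can run the same graph on the local ("vanilla") scheduler versus the distributed one to compare runtime and memory; the file pattern and column names below are made up for illustration.

```python
import dask.dataframe as dd
from dask.distributed import Client

# Hypothetical workload -- the file pattern and columns are placeholders.
df = dd.read_csv("data-*.csv")
result = df.groupby("key")["value"].mean()

# Local threaded scheduler (the "vanilla" dask scheduler).
local_answer = result.compute(scheduler="threads")

# Distributed scheduler: once a Client exists it becomes the default.
client = Client()                      # or Client("tcp://scheduler:8786")
distributed_answer = result.compute()
```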
System info: using the spawn multiprocessing start method, not forkserver.
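Assuming the spawn/forkserver note above refers to the multiprocessing start method, it can be forced with the standard library:

```python
import multiprocessing as mp

# Assumption: the system info refers to the multiprocessing start method.
# Forcing 'spawn' (rather than 'fork' or 'forkserver') via the stdlib:
if __name__ == "__main__":
    mp.set_start_method("spawn")
```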