[SYCL] Fix memory leak (queue_impl) due to #5901 #6707


Merged
merged 5 commits into intel:sycl on Sep 13, 2022

Conversation

KseniyaTikhomirova
Contributor

The cross dependency between event_impl and queue_impl prevents the objects from being released. event_impl now holds only a weak pointer to the queue.

Signed-off-by: Tikhomirova, Kseniya [email protected]
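
A minimal sketch of the ownership cycle this change breaks; the class and member names below are illustrative only, not the actual queue_impl/event_impl declarations:

#include <memory>
#include <vector>

struct event_impl;

struct queue_impl {
  // The queue keeps the events it has produced alive (e.g. so they can be waited on).
  std::vector<std::shared_ptr<event_impl>> MEvents;
};

struct event_impl {
  // Before: std::shared_ptr<queue_impl> MQueue; -> reference cycle, neither
  // object is ever released once an event is recorded on its queue.
  // After: a weak pointer breaks the cycle.
  std::weak_ptr<queue_impl> MQueue;

  // lock() returns an empty shared_ptr once the queue_impl is gone.
  std::shared_ptr<queue_impl> getWorkerQueue() { return MQueue.lock(); }
};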

@KseniyaTikhomirova KseniyaTikhomirova marked this pull request as ready for review September 6, 2022 11:08
@KseniyaTikhomirova KseniyaTikhomirova requested a review from a team as a code owner September 6, 2022 11:08
@al42and
Contributor

al42and commented Sep 6, 2022

Not that anyone asked, but I can confirm that it fixes the problem I was having 👍

@KseniyaTikhomirova
Contributor Author

Not that anyone asked, but I can confirm that it fixes the problem I was having 👍

Thank you, good to know.

@steffenlarsen
Contributor

Is there any case where the queue may die before the event? In that case, are there any operations on the event that would try to lock the queue and fail?

@KseniyaTikhomirova
Contributor Author

Is there any case where the queue may die before the event? In that case, are there any operations on the event that would try to lock the queue and fail?

I was also thinking about this. There is already a very similar approach in event_impl::flushIfNeeded.
I think it is possible when, on the application side, we keep the event but destroy the queue (a device queue, that is; the host queue is also stored in the scheduler and is destroyed together with it). In this case operations on the event such as wait should still be fine, because the PiEvent is still alive and keeps a reference to the PiQueue. So the earlier destruction only applies to the queue_impl wrapper, not to the backend objects.
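
Roughly, the lifetime argument above, in illustrative code (BackendEvent/BackendQueue stand in for the PI handles; this is not the real runtime code):

#include <memory>

struct BackendQueue {};                  // stands in for the PiQueue handle
struct BackendEvent {
  std::shared_ptr<BackendQueue> Queue;   // the backend event retains its queue
  void wait() { /* blocks until the backend work completes */ }
};

struct queue_impl {};                    // the SYCL runtime wrapper

struct event_impl {
  std::weak_ptr<queue_impl> MQueue;      // the wrapper may die first
  std::shared_ptr<BackendEvent> MEvent;

  void wait() {
    // Waiting does not need the queue_impl wrapper: the backend event (and,
    // through it, the backend queue) is still alive, so this stays valid.
    if (MEvent)
      MEvent->wait();
  }
};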

@steffenlarsen
Contributor

I might worry that this could break if the scenario you mentioned happens and the event is then used as a dependency event in some command. Maybe the worker queue is only really used if the event is the result event of a command? If so, the command constructor should ensure it gets initialized. Is that the case?

@KseniyaTikhomirova
Contributor Author

I might worry that this could break if the scenario you mentioned happens and the event is then used as a dependency event in some command. Maybe the worker queue is only really used if the event is the result event of a command? If so, the command constructor should ensure it gets initialized. Is that the case?

Could you please describe a bit more what you meant by "Maybe the worker queue is only really used if the event is the result event of a command?" I am not sure I understand the scenario correctly.

@steffenlarsen
Contributor

steffenlarsen commented Sep 8, 2022

Could you please describe a bit more what you meant by "Maybe the worker queue is only really used if the event is the result event of a command?" I am not sure I understand the scenario correctly.

From my reading of these changes, the worker queue is only set on the result event of a command, not on any of the dependency events. I guess my main concern is that if someone does something like

sycl::event Ev;
{
  sycl::queue Q1;
  Ev = Q1.submit(...); // Ev will have Q1 as its worker queue.
}
{
  sycl::queue Q2;
  Q2.submit([&](sycl::handler &CGH) {
    CGH.depends_on(Ev); // Ev.getWorkerQueue() will fail to lock its weak_ptr?
    ...
  });
}

it will fail. Is that a possible case? If so, maybe getWorkerQueue could be changed to return nullptr if the worker queue is dead? Would that break something?

@KseniyaTikhomirova
Contributor Author

it will fail. Is that a possible case? If so, maybe getWorkerQueue could be changed to return nullptr if the worker queue is dead? Would that break something?

What I expect in the scenario above:

  1. std::weak_ptr::lock: "Creates a new std::shared_ptr that shares ownership of the managed object. If there is no managed object, i.e. *this is empty, then the returned shared_ptr also is empty." So getWorkerQueue should already be able to return nullptr if the weak_ptr is expired.
  2. Yes, in the scenario above I see the possibility for the queue to be deleted after the work related to Ev has completed, but the PiEvent inside Ev is still alive, and so is the related PiQueue. On the second kernel submission, getPIEvent will take the PiEvent of Ev, which is still perfectly valid (and, by the way, already completed if the queue has been deleted), and pass it to the backend.
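
A quick standalone illustration of that std::weak_ptr::lock() behavior (plain standard C++, nothing runtime-specific):

#include <cassert>
#include <memory>

int main() {
  std::weak_ptr<int> Weak;
  {
    auto Owner = std::make_shared<int>(42);
    Weak = Owner;
    assert(Weak.lock() != nullptr); // owner alive: lock() yields a non-empty shared_ptr
  }                                 // Owner destroyed here
  assert(Weak.expired());
  assert(Weak.lock() == nullptr);   // expired: lock() yields an empty shared_ptr, no crash
  return 0;
}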

@KseniyaTikhomirova
Contributor Author

it will fail. Is that a possible case? If so, maybe getWorkerQueue could be changed to return nullptr if the worker queue is dead? Would that break something?

I will try to add the suggested code as a unit test tomorrow.

@steffenlarsen
Contributor

std::weak_ptr::lock: "Creates a new std::shared_ptr that shares ownership of the managed object. If there is no managed object, i.e. *this is empty, then the returned shared_ptr also is empty." So getWorkerQueue should already be able to return nullptr if the weak_ptr is expired.

Admittedly, I completely misremembered the semantics of std::weak_ptr::lock. A unit test would indeed put my mind at ease, thank you! 😄

@KseniyaTikhomirova
Contributor Author

Unfortunately, the scenario above involves host task execution, because we connect two queues via a connection task, which is always a host task.
That makes a proper unit test impossible: in this scenario, waiting for the PiEvent related to the event whose queue may be deleted happens via host task execution, which means I would need to involve DispatchHostTask execution working with a real scheduler and real plugins, and we cannot afford that in a unit test.
I can confirm that your scenario passes successfully locally; I can only suggest an llvm-test-suite test, but I am not sure that is reasonable enough.
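
For reference, a hedged sketch of what such an llvm-test-suite (end-to-end) test could look like, based on the scenario above; this exact test is not part of the PR and the kernel bodies are placeholders:

#include <sycl/sycl.hpp>

int main() {
  sycl::event Ev;
  {
    sycl::queue Q1;
    Ev = Q1.submit([&](sycl::handler &CGH) { CGH.single_task([] {}); });
  } // Q1 (and its queue_impl wrapper) is destroyed here; Ev outlives it.
  {
    sycl::queue Q2;
    Q2.submit([&](sycl::handler &CGH) {
       CGH.depends_on(Ev); // must not crash even though Ev's worker queue is gone
       CGH.single_task([] {});
     }).wait();
  }
  return 0;
}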

@KseniyaTikhomirova
Contributor Author

The failures are not related; they were already reported in #6760.

@KseniyaTikhomirova
Contributor Author

@intel/llvm-gatekeepers hello, could you please merge it?

@steffenlarsen steffenlarsen merged commit e112609 into intel:sycl Sep 13, 2022