Changes to support input sequence ID tracking #70264
Conversation
This change adds input sequence ID tracking to the NVTX markers. It adds additional information to the NVTX marker string, e.g. seq_ids=[101, 102, 103]. Each entry gives the sequence ID of the op that produced the corresponding input tensor, based on its position index in the array; this is the same way the sizes array is organized. If you know the sequence ID of a node and the sequence IDs of its input edges, you have enough information to construct the network graph.
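As an illustration of how such an annotation could be consumed, here is a small sketch of my own (the exact marker layout below is an assumption for illustration, not the verbatim string emitted by this PR): it parses hypothetical marker strings containing seq and seq_ids fields and reconstructs producer-to-consumer edges.

```python
import re

# Hypothetical NVTX marker strings in the spirit of the format described above.
# The exact layout ("op, seq = N, sizes = [...], seq_ids=[...]") is an assumption.
markers = [
    "aten::mm, seq = 104, sizes = [[32, 64], [64, 10]], seq_ids=[101, 102]",
    "aten::relu, seq = 105, sizes = [[32, 10]], seq_ids=[104]",
]

edges = []
for m in markers:
    seq = int(re.search(r"seq = (\d+)", m).group(1))
    ids = re.search(r"seq_ids=\[([^\]]*)\]", m).group(1)
    producers = [int(s) for s in ids.split(",") if s.strip()]
    # Each input tensor's producing node gets an edge into the current node.
    edges.extend((p, seq) for p in producers)

print(edges)  # [(101, 104), (102, 104), (104, 105)]
```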
Hi - this is my first PR for PyTorch. What is the next step for getting this through the CI process?
Hey, thanks for the PR. I'm on vacation until the 7th, and then I'll review and help guide it in.
Thanks!
Merge remote-tracking branch 'origin/master' into alex_input_seq_ids
I just resolved a merge conflict that was introduced after I started this PR. It looks like the Kineto software has been changing recently, and it is causing conflicts with my changes.
We just wrapped up performance reviews, so I'll finally have time to review this. Thanks for your patience.
Thanks!
Any idea how to restart the test? I saw that one of the CI jobs was cancelled.
-Alex
I went ahead and just manually restarted it.
Thanks for restarting the pipeline; it looks like it worked this time.
At a high level, this seems awesome! So much so that I also want this for the Kineto profiler. (Though I'm perfectly happy to have this PR just add it for NVTX and then I can steal all the utility functions later.) I do have some concerns about using the Autograd Node as a proxy for Tensor identity. I understand why it's appealing, since Tensors don't have a good way of assigning a UUID, but I worry about losing information. For one, this won't work in inference mode (which, ironically, is when a lot of the more intrusive optimizations like buffer reuse become available). It also (I think) will alias Tensors that are produced by multi-output ops.
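To make those two concerns concrete, here is a small sketch of my own (not from this PR; torch.chunk is used purely as an illustrative multi-output op): under inference mode no autograd Node is created at all, and the outputs of a multi-output op share a single Node, so a producer lookup keyed on the Node cannot tell them apart.

```python
import torch

x = torch.randn(4, 4, requires_grad=True)

# 1) Inference mode: no autograd graph is recorded, so there is no Node
#    that could identify the producer of y.
with torch.inference_mode():
    y = x.matmul(x)
print(y.grad_fn)  # None -> no producer information available

# 2) Multi-output op: both outputs are recorded by one autograd Node,
#    so identifying a tensor via its grad_fn aliases the two outputs.
a, b = torch.chunk(x.matmul(x), 2, dim=0)
print(a.grad_fn is b.grad_fn)  # True -> same Node for both tensors
```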
Thanks for the feedback @robieta! I think your concerns about using the autograd Node are valid. When I added this feature I was specifically working on performance analysis of training, so not working for inference wasn't really a concern at the time. Also, the autograd Node was the only data structure I could find that maintains information about the producer of a given tensor. Leveraging the JIT graph is another option I looked into, but it was difficult to convert the entire network to TorchScript for all the nets I was investigating. I would like for this to work for inference - I'm actually starting to analyze inference performance now, so it is an important issue for me. Do you have any suggestions for alternative data structures for maintaining the tensors' producer information? For training, the autograd Node approach works reasonably well. I have run into some issues that would be good to resolve. Would it be possible to move forward with this PR, then work on a better solution for recording the tensors' producer? Most of the code supporting the profiling would stay intact. The LOC for the autograd Node approach is small, so replacing it with a more robust solution shouldn't be too much effort once there's a good solution in place.
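For context on why the autograd Node works as a producer record during training, here is a minimal sketch of my own (not code from this PR): a tensor's grad_fn is the Node created by the op that produced it, and that Node's next_functions point back at the Nodes that produced its inputs, which is exactly the edge information the seq_ids annotation encodes.

```python
import torch

x = torch.randn(3, 3, requires_grad=True)
w = torch.randn(3, 3, requires_grad=True)
y = x.matmul(w)   # producer Node: MmBackward0
z = y.relu()      # producer Node: ReluBackward0

# grad_fn is the Node recorded by the op that produced the tensor.
print(z.grad_fn)                 # <ReluBackward0 ...>
# next_functions are the Nodes that produced the op's inputs.
print(z.grad_fn.next_functions)  # ((<MmBackward0 ...>, 0),)
# The same Node object appears as y's producer.
print(y.grad_fn is z.grad_fn.next_functions[0][0])  # True
```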
@robieta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Handle RECORD_FUNCTION at the early exits in THPFunction_apply(). Feedback from code review.
Merge branch 'alex_input_seq_ids': the remote alex_input_seq_ids was already merged with master; this merge makes the local branch match the remote.
Hi @robieta, I fixed the RECORD_FUNCTION issue in python_function.cpp. CI passed, so it should be good to go now.
@robieta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
torch/csrc/profiler/util.cpp (outdated)
std::string str("[");
int idx = 0;
for (const auto op_id_info_pair : input_op_ids) {
Can you change this to const auto& op_id_info_pair? One of the internal builds is super pedantic and refuses to build if there is a copy.
I just pushed the change
@robieta Any idea what this error is? "Using default tag: latest"
@robieta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Any update on the internal testing?
I was out the second half of last week. It's failing some tests, but it's unclear whether those are unrelated flakes. I fired off a new batch.
Ok, thanks for the update!
Any updates on the testing status?
ping @robieta
I'm trying to get it through CI. Because so many projects have PyTorch as a dependency, it is an involved process to separate flakes from legitimate breakages. Never fear, I am actively landing this PR.
Much appreciated!
FYI @robieta here's a clip from a network graph that I generated based on this PR. This is from the EfficientNet network. I combined the op_ids with the module names from the nn.Module object associated with each aten:: op. It makes it really easy to understand the network architecture; you don't even really need to look at the network source code.
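In case it's useful, here is a rough sketch of how that module-name association can be produced (this is my own approach, not part of the PR): push an NVTX range named after each nn.Module around its forward call, so the aten:: markers emitted inside can be attributed to that module. It assumes a CUDA device and an Nsight Systems capture; torchvision's efficientnet_b0 is used only as a stand-in model.

```python
import torch
import torchvision  # assumed available; any nn.Module would do

def annotate_modules(model: torch.nn.Module) -> None:
    """Wrap each submodule's forward pass in an NVTX range named after it."""
    for name, module in model.named_modules():
        if name == "":  # skip the root module
            continue
        module.register_forward_pre_hook(
            lambda mod, inp, name=name: torch.cuda.nvtx.range_push(name)
        )
        module.register_forward_hook(
            lambda mod, inp, out: torch.cuda.nvtx.range_pop()
        )

model = torchvision.models.efficientnet_b0().cuda()
annotate_modules(model)
x = torch.randn(1, 3, 224, 224, device="cuda")
with torch.autograd.profiler.emit_nvtx(record_shapes=True):
    model(x)  # capture with Nsight Systems to see module + aten:: markers
```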
@alexmsettle Testing is in good enough shape (I'm quite confident that all remaining failures are unrelated) that I just started to land.
Summary: Adds input sequence ID tracking to the NVTX markers. This feature adds additional information to the NVTX marker string, e.g. seq_ids=[101, 102, 103], indicating the sequence ID of the op that produced each input tensor, based on its position index in the array. In the above example, input tensor 0 was produced by the node with sequence ID 101, input tensor 1 by node 102, and input tensor 2 by node 103. This is the same way the sizes array is organized. If you know the sequence ID of a node and the sequence IDs of its input edges, then you have enough information to construct the network graph.
Fixes #66105
Pull Request resolved: #70264
Reviewed By: chaekit
Differential Revision: D34792707
Pulled By: robieta
fbshipit-source-id: 4407b853c929a737505803b0db77a8ecd966cce2
Hey @alexmsettle. |
Awesome! Thanks for driving this to completion @robieta |
Thanks! Likewise, thanks for adding this awesome feature. |