Added optimizer implementation #3699
Summary: This adds the optimizer logic, reusing much of the logic from [LiteInterpreter](https://fburl.com/code/t5dqeyje). The main differences are:

1. `SGDParamGroup` takes a `Span<char*>` and a `Span<Tensor>` that together represent named parameters. Unlike LiteInterpreter or core PyTorch, portable tensors don't use the autograd framework, and we won't be supporting it either. Instead, we are likely to use the backwards graph to calculate the gradients of the parameters, so we need a way to map each gradient to its corresponding parameter. We expect the two spans to have equal sizes, with a given parameter sitting at the same index in both (see the first sketch below the list).
2. `SGD`'s `step` takes a `Span<char*>` and a `Span<Tensor>` that represent named gradients, which we use to match each gradient to the appropriate parameter. As above, we expect the spans to have equal sizes, with a gradient at the same index as its parameter name.
3. It uses the out-variant operations rather than the in-place or functional variants, since those are already implemented. I *believe* that because we only use clone, add (on same-sized tensors), and mul_scalar, there is no harm in overwriting the data (see the second sketch below).
4. For the momentum buffer, memory is allocated for the underlying data and the TensorImpl; this gets cleaned up when the SGD destructor is called (see the third sketch below).
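To make the parallel-span convention in (1) and (2) concrete, here is a minimal, self-contained sketch in plain C++. It is illustrative only: `FakeTensor` and `NamedParams` are stand-ins, not the actual ExecuTorch `Tensor`, `Span`, or `SGDParamGroup` APIs.

```cpp
// Two parallel sequences hold parameter names and parameter tensors; a
// gradient is matched to its parameter by looking up the shared name,
// which lives at the same index in both sequences.
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for a tensor; the real code uses ExecuTorch's portable Tensor.
struct FakeTensor {
  std::vector<float> data;
};

// Parallel arrays: names[i] is the name of params[i].
struct NamedParams {
  std::vector<const char*> names;
  std::vector<FakeTensor*> params;

  // Find the parameter that a gradient named `grad_name` belongs to.
  FakeTensor* find(const char* grad_name) {
    assert(names.size() == params.size());
    for (std::size_t i = 0; i < names.size(); ++i) {
      if (std::strcmp(names[i], grad_name) == 0) {
        return params[i];
      }
    }
    return nullptr;  // no parameter with that name
  }
};

int main() {
  FakeTensor weight{{1.0f, 2.0f}};
  FakeTensor bias{{0.5f}};
  NamedParams group{{"linear.weight", "linear.bias"}, {&weight, &bias}};

  // A gradient arriving under the name "linear.bias" maps back to `bias`.
  FakeTensor* p = group.find("linear.bias");
  assert(p == &bias);
  return 0;
}
```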
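The next sketch shows the shape of the out-variant usage described in (3), again as a hypothetical stand-alone example on raw float buffers rather than the real ExecuTorch kernels. Because each output element depends only on the input elements at the same index, passing the same buffer as both input and output is safe, which is why overwriting the data is harmless for the clone, add, and mul_scalar steps.

```cpp
// Conceptual sketch of the SGD-with-momentum math written "out-variant"
// style: each op takes explicit input and output buffers instead of an
// in-place or functional tensor method. Names and signatures are
// illustrative, not ExecuTorch operators.
#include <cstddef>

// out[i] = a[i] + alpha * b[i]   (add.out-style, same-sized buffers)
void add_out(const float* a, const float* b, float alpha, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + alpha * b[i];
}

// out[i] = a[i] * s              (mul.Scalar_out-style)
void mul_scalar_out(const float* a, float s, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * s;
}

// One SGD step with momentum (zero dampening) for a single parameter:
//   buf   = momentum * buf + grad
//   param = param - lr * buf
// Aliasing the output with an input is safe here because every output
// element depends only on the same element of the inputs.
void sgd_momentum_step(float* param, const float* grad, float* momentum_buf,
                       float lr, float momentum, std::size_t n) {
  mul_scalar_out(momentum_buf, momentum, momentum_buf, n);  // buf *= momentum
  add_out(momentum_buf, grad, 1.0f, momentum_buf, n);       // buf += grad
  add_out(param, momentum_buf, -lr, param, n);              // param -= lr * buf
}
```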
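Finally, a rough sketch of the momentum-buffer lifetime described in (4): the buffer is created on the first step for a parameter and released when the optimizer is destroyed. The `ToySGD` class and its members are hypothetical names used for illustration; the real implementation allocates the buffer's underlying data and TensorImpl and frees them in the SGD destructor.

```cpp
// Sketch of the momentum-buffer ownership pattern: allocate per-parameter
// state lazily, free it when the optimizer goes away.
#include <cstddef>
#include <cstring>
#include <map>
#include <string>

class ToySGD {
 public:
  ToySGD(float lr, float momentum) : lr_(lr), momentum_(momentum) {}

  ~ToySGD() {
    // Matches the manual allocation below; in the real implementation the
    // buffer's data and TensorImpl are released when the SGD destructor runs.
    for (auto& kv : bufs_) {
      delete[] kv.second;
    }
  }

  void step(const std::string& name, float* param, const float* grad, std::size_t n) {
    float*& buf = bufs_[name];
    if (buf == nullptr) {
      // First step for this parameter: create the momentum buffer as a copy
      // of the gradient (the "clone" mentioned in the summary).
      buf = new float[n];
      std::memcpy(buf, grad, n * sizeof(float));
    } else {
      for (std::size_t i = 0; i < n; ++i) {
        buf[i] = momentum_ * buf[i] + grad[i];
      }
    }
    for (std::size_t i = 0; i < n; ++i) {
      param[i] -= lr_ * buf[i];
    }
  }

 private:
  float lr_;
  float momentum_;
  std::map<std::string, float*> bufs_;  // per-parameter momentum buffers
};
```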
Reviewed By: iseeyuan

Differential Revision: D57216865
This pull request has been merged in d44877b.