Added optimizer implementation #3699
Summary: This adds the optimizer logic, reusing much of the logic from [LiteInterpreter](https://fburl.com/code/t5dqeyje). The main differences are:

1. `SGDParamGroup` takes a `Span<char*>` and a `Span<Tensor>` that together represent named parameters. Unlike LiteInterpreter or core PyTorch, portable tensors don't use the autograd framework, and we won't be supporting it either. Instead, we are likely to use the backwards graph to calculate the gradients of the parameters, so we need a way to map each gradient to its corresponding parameter. We expect the two spans to have equal sizes, with a given parameter sitting at the same index in both (see the first sketch below the list).
2. `SGD`'s `step` takes a `Span<char*>` and a `Span<Tensor>` that represent named gradients, which we use to match each gradient to the appropriate parameter. As above, we expect the spans to have equal sizes, with a gradient at the same index as its parameter name.
3. It uses the out-variant operations rather than the in-place or functional variants, since those are already implemented. I *believe* that because we only use clone, add (on same-sized tensors), and mul_scalar, there is no harm in overwriting the data (see the second sketch below).
4. For the momentum buffer, memory is allocated for the underlying data and the TensorImpl; this gets cleaned up when the SGD destructor is called (see the third sketch below).
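To make the parallel-span convention in (1) and (2) concrete, here is a minimal, self-contained sketch in plain C++. It is illustrative only: `FakeTensor` and `NamedParams` are stand-ins, not the actual ExecuTorch `Tensor`, `Span`, or `SGDParamGroup` APIs.

```cpp
// Two parallel sequences hold parameter names and parameter tensors; a
// gradient is matched to its parameter by looking up the shared name,
// which lives at the same index in both sequences.
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for a tensor; the real code uses ExecuTorch's portable Tensor.
struct FakeTensor {
  std::vector<float> data;
};

// Parallel arrays: names[i] is the name of params[i].
struct NamedParams {
  std::vector<const char*> names;
  std::vector<FakeTensor*> params;

  // Find the parameter that a gradient named `grad_name` belongs to.
  FakeTensor* find(const char* grad_name) {
    assert(names.size() == params.size());
    for (std::size_t i = 0; i < names.size(); ++i) {
      if (std::strcmp(names[i], grad_name) == 0) {
        return params[i];
      }
    }
    return nullptr;  // no parameter with that name
  }
};

int main() {
  FakeTensor weight{{1.0f, 2.0f}};
  FakeTensor bias{{0.5f}};
  NamedParams group{{"linear.weight", "linear.bias"}, {&weight, &bias}};

  // A gradient arriving under the name "linear.bias" maps back to `bias`.
  FakeTensor* p = group.find("linear.bias");
  assert(p == &bias);
  return 0;
}
```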
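The next sketch shows the shape of the out-variant usage described in (3), again as a hypothetical stand-alone example on raw float buffers rather than the real ExecuTorch kernels. Because each output element depends only on the input elements at the same index, passing the same buffer as both input and output is safe, which is why overwriting the data is harmless for the clone, add, and mul_scalar steps.

```cpp
// Conceptual sketch of the SGD-with-momentum math written "out-variant"
// style: each op takes explicit input and output buffers instead of an
// in-place or functional tensor method. Names and signatures are
// illustrative, not ExecuTorch operators.
#include <cstddef>

// out[i] = a[i] + alpha * b[i]   (add.out-style, same-sized buffers)
void add_out(const float* a, const float* b, float alpha, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + alpha * b[i];
}

// out[i] = a[i] * s              (mul.Scalar_out-style)
void mul_scalar_out(const float* a, float s, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * s;
}

// One SGD step with momentum (zero dampening) for a single parameter:
//   buf   = momentum * buf + grad
//   param = param - lr * buf
// Aliasing the output with an input is safe here because every output
// element depends only on the same element of the inputs.
void sgd_momentum_step(float* param, const float* grad, float* momentum_buf,
                       float lr, float momentum, std::size_t n) {
  mul_scalar_out(momentum_buf, momentum, momentum_buf, n);  // buf *= momentum
  add_out(momentum_buf, grad, 1.0f, momentum_buf, n);       // buf += grad
  add_out(param, momentum_buf, -lr, param, n);              // param -= lr * buf
}
```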
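Finally, a rough sketch of the momentum-buffer lifetime described in (4): the buffer is created on the first step for a parameter and released when the optimizer is destroyed. The `ToySGD` class and its members are hypothetical names used for illustration; the real implementation allocates the buffer's underlying data and TensorImpl and frees them in the SGD destructor.

```cpp
// Sketch of the momentum-buffer ownership pattern: allocate per-parameter
// state lazily, free it when the optimizer goes away.
#include <cstddef>
#include <cstring>
#include <map>
#include <string>

class ToySGD {
 public:
  ToySGD(float lr, float momentum) : lr_(lr), momentum_(momentum) {}

  ~ToySGD() {
    // Matches the manual allocation below; in the real implementation the
    // buffer's data and TensorImpl are released when the SGD destructor runs.
    for (auto& kv : bufs_) {
      delete[] kv.second;
    }
  }

  void step(const std::string& name, float* param, const float* grad, std::size_t n) {
    float*& buf = bufs_[name];
    if (buf == nullptr) {
      // First step for this parameter: create the momentum buffer as a copy
      // of the gradient (the "clone" mentioned in the summary).
      buf = new float[n];
      std::memcpy(buf, grad, n * sizeof(float));
    } else {
      for (std::size_t i = 0; i < n; ++i) {
        buf[i] = momentum_ * buf[i] + grad[i];
      }
    }
    for (std::size_t i = 0; i < n; ++i) {
      param[i] -= lr_ * buf[i];
    }
  }

 private:
  float lr_;
  float momentum_;
  std::map<std::string, float*> bufs_;  // per-parameter momentum buffers
};
```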
Reviewed By: iseeyuan

Differential Revision: D57216865
This pull request has been merged in d44877b.