
Conversation

davidlin54
Contributor

Summary:
This adds the optimizer logic, reusing much of the logic from [LiteInterpreter](https://fburl.com/code/t5dqeyje). The main differences are:

  1. SGDParamGroup takes in a Span<char*> and a Span<Tensor> that together represent the named parameters. Unlike the Lite Interpreter or core PyTorch, portable tensors don't use the autograd framework, and we won't be supporting it either. Instead, we're likely to use the backwards graph to compute the gradients of the parameters, so we need a way to map each gradient to its corresponding parameter. We expect the two spans to be equal in size and a given parameter to have the same index in both spans.
  2. SGD step takes in a Span<char*> and a Span<Tensor> that together represent the named gradients. We use these to match each gradient to its appropriate parameter. As above, we expect the spans to be equal in size and each gradient to share its index with its parameter name (see the sketch after this list).
  3. Uses the out-variant operations rather than the in-place or functional variants, since those are already implemented. I *believe* that since we're only using clone, add (same-sized tensor), and mul_scalar, there is no harm in overwriting the data.
  4. For the momentum buffer, I allocate memory for the underlying data and the TensorImpl. This gets cleaned up when the SGD destructor is called.
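To make the name-to-index mapping in points 1 and 2 concrete, here is a minimal, self-contained C++ sketch. It is not the ExecuTorch API: `FakeTensor`, `sgd_step`, the plain `std::span` types, and the parameter names in `main` are made up for illustration, and the update is written as a raw in-place loop rather than the out-variant kernels (clone, add, mul_scalar) this PR actually uses.

```cpp
// Hypothetical sketch (not the ExecuTorch API) of the index-based mapping
// described above: parameter names and parameter tensors arrive as two
// parallel spans, and each gradient is matched to its parameter by name.
#include <cstddef>
#include <cstring>
#include <span>
#include <vector>

// Stand-in for a tensor: a flat float buffer. Real code uses ExecuTorch
// Tensor/TensorImpl and out-variant kernels instead of this raw loop.
struct FakeTensor {
  std::vector<float> data;
};

// One SGD step over named gradients: for each gradient, find the parameter
// that shares its name and apply p -= lr * g. Assumes the name span and the
// tensor span on each side are equal-sized and index-aligned, and that a
// gradient has the same number of elements as its parameter.
void sgd_step(
    std::span<const char* const> param_names,
    std::span<FakeTensor> params,
    std::span<const char* const> grad_names,
    std::span<const FakeTensor> grads,
    float lr) {
  for (std::size_t g = 0; g < grad_names.size(); ++g) {
    for (std::size_t p = 0; p < param_names.size(); ++p) {
      if (std::strcmp(grad_names[g], param_names[p]) != 0) {
        continue;
      }
      for (std::size_t i = 0; i < params[p].data.size(); ++i) {
        params[p].data[i] -= lr * grads[g].data[i];
      }
      break;
    }
  }
}

int main() {
  // Made-up parameter names and values, purely for demonstration.
  const char* names[] = {"linear.weight", "linear.bias"};
  FakeTensor params[] = {{{1.0f, 2.0f}}, {{0.5f}}};
  FakeTensor grads[] = {{{0.1f, 0.1f}}, {{0.2f}}};
  sgd_step(names, params, names, grads, /*lr=*/0.01f);
  return 0;
}
```

In this picture, point 4 would amount to allocating one extra buffer (and its TensorImpl) per parameter the first time momentum is applied and freeing it in the SGD destructor; that bookkeeping is omitted from the sketch.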

Differential Revision: D57216865


pytorch-bot bot commented May 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3699

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 087072a with merge base 1343224:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 21, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D57216865

davidlin54 pushed a commit to davidlin54/executorch that referenced this pull request May 22, 2024
davidlin54 pushed a commit to davidlin54/executorch that referenced this pull request May 23, 2024
davidlin54 pushed a commit to davidlin54/executorch that referenced this pull request May 23, 2024
davidlin54 pushed a commit to davidlin54/executorch that referenced this pull request May 23, 2024
davidlin54 pushed a commit to davidlin54/executorch that referenced this pull request May 24, 2024
Reviewed By: iseeyuan

Differential Revision: D57216865
@facebook-github-bot
Contributor

This pull request has been merged in d44877b.
