[Importer] Add C2 importer support for RWQ SLWS/SLS #2292


Merged -- 1 commit merged into pytorch:master on Feb 8, 2019

Conversation

jfix71 (Contributor) commented Jan 23, 2019

Description: Add support in the Caffe2 model loader for SparseLengthsWeightedSum8BitsRowwise and SparseLengthsSum8BitsRowwise.

Testing: Added unit tests

Documentation: N/A

Related to #1698

jfix71 (Contributor, Author) commented Jan 23, 2019

Note that the way this and the SLWS Node are implemented requires duplicating the Data input to the SLWS node -- when loading the model, the importer strips out the fused scale/offset and creates three separate Constant inputs (unfused Data, Scales, and Offsets) in order to create the SLWS Node.

This could be problematic, as Data could be very large and duplicating it could use too much host memory. We may want to use something like "offline Constants" here (deferring loading/transforming the fused Data Constant until copying the Constants to an accelerator) if it becomes an issue.
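To make the unfusing concrete, here is a minimal sketch of the transformation (names and types are hypothetical -- the real loader operates on Glow Tensor/Constant objects rather than raw vectors), assuming each fused row is laid out as [numCols uint8 values | float scale | float offset]:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of unfusing a rowwise-quantized tensor whose rows
// carry a trailing float scale and float offset.
struct UnfusedRWQ {
  std::vector<uint8_t> data;  // numRows x numCols quantized values.
  std::vector<float> scales;  // One scale per row.
  std::vector<float> offsets; // One offset per row.
};

UnfusedRWQ unfuse(const uint8_t *fused, size_t numRows, size_t numCols) {
  const size_t fusedRowSize = numCols + 2 * sizeof(float);
  UnfusedRWQ out;
  out.data.resize(numRows * numCols);
  out.scales.resize(numRows);
  out.offsets.resize(numRows);
  for (size_t r = 0; r < numRows; ++r) {
    const uint8_t *row = fused + r * fusedRowSize;
    // Copy the quantized payload; this is the duplication of Data mentioned
    // above, since the fused original is still alive at this point.
    std::memcpy(out.data.data() + r * numCols, row, numCols);
    std::memcpy(&out.scales[r], row + numCols, sizeof(float));
    std::memcpy(&out.offsets[r], row + numCols + sizeof(float), sizeof(float));
  }
  return out;
}
```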

rdzhabarov (Contributor) left a comment

Cool. It would be great to compare C2 and Glow outputs for these ops (to ensure correctness).

jfix71 force-pushed the importer_rw_q_sls branch from 0cd5e54 to 70300cf on January 23, 2019 23:39
artemrakhov-glow (Contributor) left a comment

@stoklund once made a comment that it is more efficient to store each row's scale and offset together with the data, so that we don't have to do two extra size-1 fetches per row.

Caffe2 actually has two operators implementing quantized SLS, and I suspect it's for exactly this reason.

  1. This one stores scales and offsets separately. It would be very easy to support now, because you've added RowwiseQuantizedSparseLengthsWeightedSumNode.
    https://caffe2.ai/docs/operators-catalogue.html#sparselengthssum8bitsrowwise
  2. This one fuses scales and offsets; I think we have no choice but to support it as a separate new node (a rough sketch of the two layouts follows this list).
    https://caffe2.ai/docs/operators-catalogue.html#sparselengthssumfused8bitrowwise
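For reference, a rough sketch of the two storage formats (layouts summarized from the operator docs linked above, not from this PR's code):

```cpp
// 1. SparseLengthsSum8BitsRowwise -- scales/offsets in a separate tensor:
//      DATA:       [numRows x numCols] uint8
//      SCALE_BIAS: [numRows x 2]       float (scale and offset per row)
//
// 2. SparseLengthsSumFused8BitRowwise -- scale/offset appended to each row:
//      DATA: [numRows x (numCols + 8)] uint8, where
//            row r = [numCols quantized bytes | float32 scale | float32 offset]
```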

jfix71 (Contributor, Author) commented Jan 24, 2019

@artemrakhov My understanding from @stoklund was that fusing was for efficiency at execution time on an accelerator, not at compile time. Backends that need it fused could always re-fuse it later on.

I was thinking this could be made more efficient (and feasible for huge Constants) using the "offline Constants" approach we previously discussed, which seemed like something we may need for other uses in the future too. If so, this unfusing would no longer happen in Caffe2ModelLoader -- instead, you would add an "unfuse" expression, and later on a backend could add a "fuse" expression, and the two would cancel each other out, meaning the fused data stays fused the whole time.
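As a toy illustration of the cancelling-expressions idea (entirely hypothetical -- Glow has no such API today), pending lazy transforms on a Constant could be kept as a list in which adjacent unfuse/fuse pairs are dropped before any data is materialized:

```cpp
#include <vector>

// Hypothetical pending transforms attached to a Constant.
enum class Transform { Unfuse, Fuse };

// Drop adjacent unfuse/fuse (or fuse/unfuse) pairs: the pair is a no-op, so
// the underlying fused data never needs to be rewritten.
std::vector<Transform> simplify(const std::vector<Transform> &pending) {
  std::vector<Transform> out;
  for (Transform t : pending) {
    if (!out.empty() && out.back() != t) {
      out.pop_back(); // The pair cancels; fused data stays fused.
    } else {
      out.push_back(t);
    }
  }
  return out;
}
```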

As for supporting fused Tensors in Glow, like Caffe2 uses -- it feels pretty hacky to me to just modify the shape to include the per-row scales/offsets and assume all users of fused Tensors know what's going on. If we want to go down that path, I'd prefer creating a fused Tensor class which has the correct shape and implements less efficient at(), raw(), etc. that take the fused-in scales/offsets data into account.
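A minimal sketch of what such a fused Tensor class could look like (hypothetical names, independent of Glow's actual Tensor API), assuming the same [data | scale | offset] row layout as above:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// The class reports the logical data shape [numRows x numCols] while the
// backing storage keeps the per-row scale/offset fused at the end of each row.
class FusedRWQTensor {
  std::vector<uint8_t> storage_; // numRows x (numCols + 8) bytes.
  size_t numRows_, numCols_;

  size_t fusedRowSize() const { return numCols_ + 2 * sizeof(float); }

public:
  FusedRWQTensor(std::vector<uint8_t> storage, size_t numRows, size_t numCols)
      : storage_(std::move(storage)), numRows_(numRows), numCols_(numCols) {
    assert(storage_.size() == numRows_ * fusedRowSize());
  }

  // Logical shape excludes the fused scale/offset bytes.
  size_t dims(size_t i) const { return i == 0 ? numRows_ : numCols_; }

  // at() skips over each row's trailing scale/offset, so indexing is a bit
  // more expensive than for a plain dense tensor.
  uint8_t at(size_t row, size_t col) const {
    return storage_[row * fusedRowSize() + col];
  }

  float scale(size_t row) const {
    float s;
    std::memcpy(&s, &storage_[row * fusedRowSize() + numCols_], sizeof(s));
    return s;
  }

  float offset(size_t row) const {
    float o;
    std::memcpy(&o, &storage_[row * fusedRowSize() + numCols_ + sizeof(float)],
                sizeof(o));
    return o;
  }

  // Dequantize a single element using its row's fused scale/offset.
  float dequantize(size_t row, size_t col) const {
    return static_cast<float>(at(row, col)) * scale(row) + offset(row);
  }
};
```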

artemrakhov-glow (Contributor) commented

I agree, fusing is done for efficiency at execution, not at compile time. At execution time we have to fetch a lot of rows from Data, and we are limited by memory bandwidth. Having to fetch two more single numbers for each row increases memory pressure. This is true for every inference platform.
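To put rough numbers on this (illustrative values, not from this thread): with 64-dimensional uint8 rows, each row fetch is 64 bytes. In the unfused form, the two extra 4-byte loads hit a separate scales/offsets tensor, typically costing at least one additional cache-line fetch per row; in the fused form, all 72 bytes are contiguous and usually land in the same one or two cache lines.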

I see your point about fusing later per Backend request. But in reality this may take a long time and consume extra memory, due to the size of these Tensors. We'll definitely need a concept of deferred (lazy) optimizations, which will cancel each other out in this case.

I don't think the fused op is hacky. Caffe2 actually implements both. I think it's fine for all backends to be aware of this fused format.

stale bot commented Jan 30, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

jfix71 (Contributor, Author) commented Feb 2, 2019

Update: after an offline discussion, we decided we should implement both fused and unfused nodes. Ideally, higher up in the stack we will ensure that Glow receives the weights in either fused or unfused format, depending on which the host's backend prefers.

I will update this PR to use protos that are not fused and to update the Caffe2Importer case accordingly, and then will put up future PRs to support the fused version of RWQ-SLWS and its protos.

stale bot commented Feb 7, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

jfix71 (Contributor, Author) commented Feb 8, 2019

I've updated the PR to not use fused scale/offsets. In a future PR I will add new nodes for the fused version, as well as Caffe2 importer support for it.

jfix71 force-pushed the importer_rw_q_sls branch from 679c8c5 to c30905a on February 8, 2019 04:22
jfix71 changed the title from [Importer] Add C2 importer support for fused RWQ SLWS/SLS to [Importer] Add C2 importer support for RWQ SLWS/SLS on Feb 8, 2019
jfix71 merged commit b1c130c into pytorch:master on Feb 8, 2019
jfix71 deleted the importer_rw_q_sls branch on February 8, 2019 18:11
jfix71 added a commit that referenced this pull request on Feb 13, 2019:
*Description*: As noted in #2292, we decided to implement both fused and unfused versions of rowwise-quantized SLWS.

*Testing*: Added OperatorTests and Caffe2ImporterTests.

*Documentation*: Added.

Closes #1698