Need a way to pass trainable Variables to Optimizer #307

Open
JimClarke5 opened this issue Apr 29, 2021 · 12 comments

@JimClarke5
Contributor

While working with Model training, an issue with Optimizer has reared its head.

Currently, when calling minimize(loss) on the Optimizer instance, the Optimizer code walks the entire Graph and pulls out all of the Variables defined in the graph. The idea is that when you call minimize(loss), the Optimizer builds gradients based on all of those variables. However, when working with Model, this "all variables" approach breaks down, because some variables are not referenced in the loss operand's execution path. This produces the following error:

org.tensorflow.exceptions.TFInvalidArgumentException: Cannot compute the partial derivative for node 'model/mse_total' as it's unreachable from the output node(s).

This specific error occurs because the MSE metric's internal variables are not within the loss execution path. This pattern of non-trainable variables (weights) appears in most Metric classes, and in the Model itself, so it is widespread. What we need is a way to distinguish between trainable and non-trainable variables; only the trainable variables would then be used to calculate the gradient values in the Optimizer.
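
For context, the current collection step amounts to roughly the following simplified sketch (the VariableV2 filter and names are illustrative, not the exact Optimizer code), which is why metric accumulators get swept in:

```java
import java.util.ArrayList;
import java.util.List;
import org.tensorflow.Graph;
import org.tensorflow.Output;

// Simplified sketch of the "all variables" collection: every variable node in
// the Graph is gathered, including metric accumulators such as the MSE
// metric's total/count, which are unreachable from the loss output.
static List<Output<?>> collectAllVariables(Graph graph) {
  List<Output<?>> allVars = new ArrayList<>();
  graph.operations().forEachRemaining(op -> {
    if ("VariableV2".equals(op.type())) { // ref variables created via tf.variable(...)
      allVars.add(op.output(0));
    }
  });
  return allVars;
}
// Handing all of these to graph.addGradients(loss, ...) is what triggers the
// "unreachable from the output node(s)" error when a metric variable is not
// on the loss path.
```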

In Python TensorFlow, the Keras Layers track the trainable variables as an attribute list, and the Model then passes the collected lists to the Optimizer's minimize method.

There are a couple of options here:

  1. Mimic TF Keras and have each Layer identify its trainable variables. Then pass the trainable variables as a List<Variable<?>> using a call like Optimizer.minimize(loss, trainableVariables), and have the Optimizer minimize routine call addGradients with this variable list, rather than walking the whole Graph, to compute the gradients (see the sketch after this list).
  2. Within Optimizer.minimize(loss), walk the loss operand execution path to locate any variables contributing to the loss calculation, then pass these to addGradients. A solution based on this option may be facilitated by #232, "Add graph walking functions to Graph and GraphOperation".
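
For option 1, here is a minimal sketch of what the overload could do under the hood, assuming a plain SGD update and the existing Graph.addGradients API; the method name, signature, and update loop are illustrative, not the final Optimizer code:

```java
import java.util.List;
import org.tensorflow.Graph;
import org.tensorflow.Output;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Variable;
import org.tensorflow.types.TFloat32;

// Option 1 sketch: differentiate the loss only against the caller-supplied
// trainable variables, instead of every Variable found in the Graph.
static void minimizeSketch(Graph graph, Ops tf, Output<TFloat32> loss,
                           List<Variable<TFloat32>> trainableVars, float learningRate) {
  Output<?>[] varOutputs = new Output<?>[trainableVars.size()];
  for (int i = 0; i < trainableVars.size(); i++) {
    varOutputs[i] = trainableVars.get(i).asOutput();
  }

  // Gradients are requested only for the trainable variables.
  Output<?>[] grads = graph.addGradients(loss, varOutputs);

  // Apply a plain SGD update per (variable, gradient) pair; a real Optimizer
  // would route these pairs through its own applyGradients logic instead.
  for (int i = 0; i < trainableVars.size(); i++) {
    @SuppressWarnings("unchecked")
    Output<TFloat32> grad = (Output<TFloat32>) grads[i];
    tf.train.applyGradientDescent(trainableVars.get(i), tf.constant(learningRate), grad);
  }
}
```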
@Craigacp
Collaborator

I think we'll need both. Keras will want the former as then you can freeze a layer so it's not trainable (which is very useful for transfer learning), and non-Keras users (which is everyone until all the bits land) will want the latter to make it easier to use the optimizers without hitting the issue you're seeing.

I'm happy to implement the latter option tomorrow to get you unblocked.

@rnett
Contributor

rnett commented Apr 29, 2021

I would definitely like option 1 to be implemented at some point. Taking a list is necessary for eager mode (presumably it will be gotten from a gradient tape) and helps support 3rd party model implementations.

Also note that tf.Variable has its own notion of trainable, and I planned on including that when I port it, so option 2 will eventually be able to take that into account.

@JimClarke5
Contributor Author

I added a parameter to Optimizer.minimize to pass in the trainable variables; now I get the error:

org.tensorflow.exceptions.TensorFlowException: No gradient defined for op: Cast. Please see https://www.tensorflow.org/code/tensorflow/cc/gradients/README.md for instructions on how to add C++ gradients.
	at org.tensorflow.framework.model.ModelTest.testModel(ModelTest.java:79)

I assume that this is because the loss operation tree contains a Cast operation.
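
For illustration, here is one way a Cast can end up on the gradient path: a prediction computed in one dtype and cast to another before the loss. This is a hypothetical example (assuming an Ops instance tf on a Graph), not the actual ModelTest code:

```java
import org.tensorflow.Operand;
import org.tensorflow.ndarray.Shape;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Variable;
import org.tensorflow.types.TFloat32;
import org.tensorflow.types.TFloat64;

// A float64 weight feeds a prediction that is cast to float32 before the loss,
// so d(loss)/d(w) must flow back through the Cast op when gradients are built.
static Operand<TFloat32> lossWithCastOnPath(Ops tf) {
  Variable<TFloat64> w = tf.variable(Shape.of(2), TFloat64.class);
  tf.assign(w, tf.constant(new double[] {0.5, -0.3}));

  Operand<TFloat64> pred64 = tf.math.mul(w, tf.constant(new double[] {1.0, 2.0}));
  Operand<TFloat32> pred = tf.dtypes.cast(pred64, TFloat32.class); // Cast on the loss path
  Operand<TFloat32> labels = tf.constant(new float[] {1.0f, 0.0f});
  return tf.math.mean(tf.math.squaredDifference(pred, labels), tf.constant(0));
}
```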

@rnett
Contributor

rnett commented May 11, 2021

Ugh, can't believe that's missing. It's easy enough that I'll PR it to tensorflow core, but it will be a while before it makes it into a release.

@JimClarke5
Contributor Author

I wonder if there are more of these?

@rnett
Contributor

rnett commented May 11, 2021

Yeah, I know of Concat, function calls, and resource variable reads. I've added resource variable reads and function calls to tensorflow core, but it's a pain to do and the release cycle lags, thus #292.

@rnett
Contributor

rnett commented Sep 20, 2021

FYI @JimClarke5 Cast has a gradient in 2.6 (which is used in snapshots).

@JimClarke5
Contributor Author

JimClarke5 commented Sep 20, 2021

@rnett

Yes, I saw that last week. However, it turns out I was passing the wrong arguments to my loss operation; I should have included the trainable variables in it. When I did that, the Cast error went away. So even though CastGrad was present, the error message still complained about a missing Cast gradient, which was not the true error.

Right now I am adding a testMinimize method to all the Optimizer unit tests, based on similar tests in TF Python, with trainable weights. However, I do not always get the same results as TF Python; sometimes I do and sometimes I don't. TF Python uses GradientTape, which in turn uses Eager.Tape, while we call the C API directly.
I am still trying to work out what's different between the two implementations besides one being a recordable tape.
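
For context, the shape of such a test is roughly the following sketch, with a hand-computed expectation; minimize(loss, List.of(w)) stands in for the proposed trainable-variables overload and the names are illustrative:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.api.Test;
import org.tensorflow.Graph;
import org.tensorflow.Operand;
import org.tensorflow.Session;
import org.tensorflow.framework.optimizers.GradientDescent;
import org.tensorflow.ndarray.Shape;
import org.tensorflow.op.Op;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Variable;
import org.tensorflow.types.TFloat32;

// loss = mean(w^2) and lr = 3.0, so one SGD step gives w - 3*w = -2*w per element.
@Test
public void testMinimizeSketch() {
  try (Graph graph = new Graph()) {
    Ops tf = Ops.create(graph);
    Variable<TFloat32> w = tf.variable(Shape.of(2), TFloat32.class);
    Op init = tf.assign(w, tf.constant(new float[] {1.0f, 2.0f}));

    Operand<TFloat32> loss = tf.math.mean(tf.math.square(w), tf.constant(0));

    GradientDescent sgd = new GradientDescent(graph, 3.0f);
    Op minimize = sgd.minimize(loss, List.of(w)); // proposed trainable-vars overload

    try (Session session = new Session(graph)) {
      session.run(init);
      session.run(minimize);
      try (TFloat32 updated = (TFloat32) session.runner().fetch(w).run().get(0)) {
        assertEquals(-2.0f, updated.getFloat(0), 1e-6f);
        assertEquals(-4.0f, updated.getFloat(1), 1e-6f);
      }
    }
  }
}
```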

@Craigacp
Collaborator

Craigacp commented Sep 20, 2021

You should compare the C API gradient code (https://github.com/tensorflow/tensorflow/blob/5dcfc51118817f27fad5246812d83e5dccdc5f72/tensorflow/cc/framework/gradients.cc#L438) to the Python TF v1 gradient code (https://github.com/tensorflow/tensorflow/blob/1bb4e4fce0c992e2d995e1663138ad98e957e050/tensorflow/python/ops/gradients_impl.py#L44). The latter is full of locks and control dependencies, and the former has none of those, so stochastically you get the wrong gradients out. I've got C API and Python gradient graphs for the simple model in the failing test in Tensorboard, but I've not had a chance to run it down properly.

@JimClarke5
Contributor Author

The Python (v2) and C/C++ versions are both totally different from the Python v1 implementation; I don't see any similarity other than the fact that they calculate gradients.

I am wondering if we should implement a Java version of GradientTape?

@Craigacp
Collaborator

Craigacp commented Sep 20, 2021

I agree the C API version is completely different, but the fact that it's lacking the locking and dependencies is the reason it doesn't produce the same output as the Python v1 stuff.

I think we would need to wrap the C++ API to get the gradient tape to work, as there isn't a C API for it (tensorflow/tensorflow#47748). I know @saudet started it (https://github.com/tensorflow/java/pull/283/files), but that only gets it off the ground into Java; we'd need to do a lot of work to mirror the Python code. It's not clear that that is less effort than fixing the C API, though it has a much better long-term payoff.

@karllessard do you know how the C API gradient code came to be? As far as I can tell only TF-Java has ever used it, so maybe they haven't figured out it's broken? No-one at Google responded to the issue I opened - tensorflow/tensorflow#48855.

@karllessard
Collaborator

@karllessard do you know how the C API gradient code came to be?

I added the various methods for adding gradients to the C API myself a long time ago, like this one, especially for Java support, so I don't think any clients other than TF Java are using it. We can see that the code has been updated since and that the C++ layer supporting it has changed.
