
Init Scope #338


Merged
merged 48 commits, Aug 23, 2021

Conversation

@rnett (Contributor) commented Jun 16, 2021

Reworks how initialization is handled (somewhat; mostly the API). I was originally planning on doing name-based initialization for interop with TensorFlow, but they don't do name-based either, so I kept the current method.

The API has changed: instead of adding init ops yourself, they are automatically added when created within an init scope, which you get via Ops.initScope() or Scope.initScope(). This allows for better error checking (e.g. init ops can't depend on non-init ops) and elides control dependencies on them (since they run at init time), but it required making Scopes part of OpBuilders (which was mostly a good thing, and prevents future "forgot to call apply" bugs). I also added currently unused methods to do initialization in a different execution environment, which will be used by functions.

So, for example, creating a variable with an initial value becomes tf.initScope().variable(tf.initScope().constant(4f)), which automatically registers it for initialization by sessions.
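For illustration, a minimal sketch of graph construction under the new API, assuming only what is described in this thread (the init-scope calls come from the example above; the Session-side init helpers are mentioned below but not named here, so they stay a comment):

import org.tensorflow.Graph;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Variable;
import org.tensorflow.types.TFloat32;

public class InitScopeExample {
  public static void main(String[] args) {
    try (Graph g = new Graph()) {
      Ops tf = Ops.create(g);
      // Ops built through the init scope are registered as init ops
      // automatically, and can't depend on non-init ops.
      Variable<TFloat32> v = tf.initScope().variable(tf.initScope().constant(4f));
      // Any Session over this graph must run initialization before
      // anything else (the PR adds Session helpers for this).
    }
  }
}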

I also added helpers to Session to create and initialize, plus a requirement that if the graph has init ops, initialization must be run before running anything else. I don't really like having separate initialized factory methods; thoughts on having it be initialized on construction by default, with Session(Graph, boolean) constructors where you can control it?

I have yet to update framework to use this; does it work for what you were working on with variables, @JimClarke5?

The newly generated op changes also aren't committed yet, for size reasons.

@Craigacp (Collaborator) left a comment

This change is going to break Tribuo in ways that are hard to fix, because it removes the init op we use to find initializers in serialized GraphDefs, and there doesn't seem to be a way to get the initialization op name back out. I think you should consider how this interacts with serialized GraphDefs, which are the only way we currently have to persist a graph structure. The current init mechanism mirrors what TF Python does in terms of what gets persisted into a GraphDef.

You can see how Tribuo currently uses the init ops here - https://github.com/oracle/tribuo/blob/main/Interop/Tensorflow/src/main/java/org/tribuo/interop/tensorflow/TensorFlowTrainer.java#L488 through line 514.

@rnett (Contributor, Author) commented Jun 17, 2021

Exporting and importing are a problem; can you link to where Tribuo does the exporting? The intent was that when importing Python models you would find the Restore op and use Graph.registerRestoreOp. I just need to add something similar for Java import and export.

@rnett (Contributor, Author) commented Jun 17, 2021

Looking more into SaverDef, I'm not going to be able to use that. I'm planning to create essentially a tf.init() op with all the init ops as control dependencies and a set name pattern on export, and on import to find that op and use Graph.registerRestoreOp. This would mean that importGraphDef would load all of the graph's initializers if the graph was exported from Java, which I think would solve your issue, @Craigacp?

I also made Session.restore count as initialization for the checks.
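As a rough sketch of the export/import pattern described above (not the PR's actual implementation: the "init" op name, the helper names, and the surrounding class are assumptions for illustration):

import java.util.List;
import org.tensorflow.Graph;
import org.tensorflow.GraphOperation;
import org.tensorflow.op.Op;
import org.tensorflow.op.Ops;

class InitExportSketch {
  // Export side: bundle every init op under a single no-op with a
  // well-known name so they survive the round trip through a GraphDef.
  static void addInitMarker(Graph graph, List<Op> initOps) {
    Ops tf = Ops.create(graph);
    tf.withControlDependencies(initOps).withName("init").noOp();
  }

  // Import side: after importGraphDef, recover the marker op by its
  // well-known name; it can then be registered (the thread mentions
  // Graph.registerRestoreOp) so the init-before-run checks see it.
  static GraphOperation findInitMarker(Graph importedGraph) {
    return importedGraph.operation("init");
  }
}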

@Craigacp (Collaborator) commented

Tribuo uses GraphDef as its representation of a graph before it has been trained, as its trainers are designed to be thread-safe, so there can be multiple concurrent calls to train in flight. That means we need to be able to recreate the graph independently for each call to train. The Graph is only valid inside a call to train; then it's destroyed and recreated inside the Model object. All the logic is in the train call here - https://github.com/oracle/tribuo/blob/main/Interop/Tensorflow/src/main/java/org/tribuo/interop/tensorflow/TensorFlowTrainer.java#L466. It starts with a GraphDef, instantiates a Graph containing that GraphDef, adds the appropriate output and gradient update operations to that graph, runs init for the graph, runs init for the Tribuo-added operations (which basically just inits the gradient slots), then starts training the model and stepping the gradient optimiser. The two-phase init is because I can't easily aggregate all the necessary init variables from the supplied GraphDef with the ones Tribuo adds, because I can't find them.

@rnett (Contributor, Author) commented Jun 18, 2021

@JimClarke5 I've only done enough updates to framework to get the tests to pass; you might want to do a full pass once this is merged.

@karllessard (Collaborator) commented

@rnett , can you rebase this PR or fix the conflicts please?

rnett added 21 commits July 7, 2021 (all signed off by Ryan Nett <[email protected]>)
@rnett (Contributor, Author) commented Jul 7, 2021

Ok, done.

rnett added 2 commits July 25, 2021 (all signed off by Ryan Nett <[email protected]>)
@rnett (Contributor, Author) commented Jul 26, 2021

Snapshot builds appear to be broken: the mac artifact is missing from https://oss.sonatype.org/content/repositories/snapshots/org/tensorflow/tensorflow-core-api/0.4.0-SNAPSHOT/, and the CI is failing because of that.

cc @karllessard

@JimClarke5 (Contributor) commented

@rnett I am trying to build on macOS, pulling rnett:rn_init_scope, and I keep getting these errors:

[ERROR]   SavedModelBundleTest.exportFunctionWithVariables:208 expected: <17.781294> but was: <18.12133>
[ERROR]   SavedModelBundleTest.exportMultipleFunctions:127 expected: <4.093658> but was: <5.931101>
[ERROR]   SessionTest.saveAndRestore:234 expected: <org.tensorflow.internal.types.TFloat32Mapper$DenseTFloat32@2024886b> but was: <org.tensorflow.internal.types.TFloat32Mapper$DenseTFloat32@8a1de29>

Any suggestions?

@JimClarke5 (Contributor) commented Aug 10, 2021

@rnett do the variable and the init operand both have to be under the same initScope?

Also, can you re-init the variable later? I am looking at Metrics' resetStates, which resets the variables to their initial values. It would be nice not to have to invoke run just to make sure this happens.

For example, upon original init:

Variable<T> truePositives = tf.withName(truePositivesName).withInitScope().variable(zero);

I assume this will cause the variable truePositives to be set to zero on the next session run.

Later on I want to reset truePositives back to zero, on its own and not as part of any other operand group:

tf.withName(truePositivesName).withInitScope().assign(truePositives, zero)

I'd like the assign operand to be pushed onto the init stack, then executed at the beginning of the next session run. Will this work?

@rnett (Contributor, Author) commented Aug 11, 2021

@JimClarke5 It should, although I'm not sure we want to purposefully introduce that kind of eager-like graph modification to session semantics. IMO it would be odd for a Java call that isn't a session run to intentionally affect session state; the "run all new initializers" behavior was intended more as a failsafe. You could also run into thread-safety issues with modifying a graph while the session is open. Ideally, I think the solution would be to have things like Metrics use eager variables to store their state, but that makes managing them more complicated.

rnett added 2 commits August 11, 2021 16:59
@JimClarke5 (Contributor) commented

I have totally refactored Metrics to take advantage of this PR, and it's a resounding success. Good job @rnett!

I have removed the Ops parameter from the constructors and now pass it in the new Metric interface calls. There is no longer a need for code to create control-op dependencies for initialization, nor a need to initialize the metrics objects once they are created. This will help with the Model class, where the Ops is created after the metrics are created. Also, the resetStates method returns an Op that will need to be run explicitly, either alone or maybe as a control op to another Op graph structure. (I decided to take @rnett's advice not to treat it within an initScope.)

Here is the main interface for Metric now:

interface Metric {

  // Returns the list of ops needed to update the metric's state.
  List<Op> updateStateList(
      Ops tf, Operand<? extends TNumber> values, Operand<? extends TNumber> sampleWeights);

  // Variant for label/prediction-based metrics.
  List<Op> updateStateList(
      Ops tf,
      Operand<? extends TNumber> labels,
      Operand<? extends TNumber> predictions,
      Operand<? extends TNumber> sampleWeights);

  // Computes the current value of the metric.
  <T extends TNumber> Operand<T> result(Ops tf, Class<T> type);

  // Returns an Op that resets the metric's variables to their initial
  // values; it must be run explicitly (not handled via an init scope).
  Op resetStates(Ops tf);

  // Single-Op equivalents of the updateStateList methods.
  Op updateState(
      Ops tf, Operand<? extends TNumber> values, Operand<? extends TNumber> sampleWeights);

  Op updateState(
      Ops tf,
      Operand<? extends TNumber> labels,
      Operand<? extends TNumber> predictions,
      Operand<? extends TNumber> sampleWeights);

  // Updates the state with the given values and returns the result.
  <T extends TNumber> Operand<T> callOnce(
      Ops tf,
      Operand<? extends TNumber> values,
      Operand<? extends TNumber> sampleWeights,
      Class<T> type);
}
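A hypothetical usage sketch of this interface (Mean, tf, session, values, and sampleWeights are illustrative stand-ins, not names from this PR), showing the explicit resetStates run described above:

// Mean is an assumed Metric implementation, for illustration only.
Metric metric = new Mean();

// Ops is now passed per call instead of being fixed at construction.
session.run(metric.updateState(tf, values, sampleWeights));

// resetStates returns an Op that must be run explicitly.
session.run(metric.resetStates(tf));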

Once this PR is merged I will create another PR for these metrics' changes.

My next goal is to rework Layers.

rnett requested a review from karllessard on August 15, 2021
@JimClarke5 (Contributor) commented Aug 17, 2021

I have finished converting layers and optimizers, and everything works. I have refactored both to not include Ops/Graph in the constructors. Once this PR is merged, I will create separate PRs to check in my changes.

Except for these failures in the overall build, I am good to go:

[ERROR] Failures: 
[ERROR]   SavedModelBundleTest.exportFunctionWithVariables:208 expected: <17.614902> but was: <17.720098>
[ERROR]   SavedModelBundleTest.exportMultipleFunctions:127 expected: <4.6954327> but was: <5.659465>
[ERROR]   SessionTest.saveAndRestore:234 expected: <org.tensorflow.internal.types.TFloat32Mapper$DenseTFloat32@ab941f5f> but was: <org.tensorflow.internal.types.TFloat32Mapper$DenseTFloat32@91507d4d>

I do notice that the exact values are different from when this error first appeared for me.

@rnett (Contributor, Author) commented Aug 17, 2021

Can you rebase on this branch? The latest commit here fixed those for me.

@JimClarke5 (Contributor) commented

@rnett I have pulled your latest code and all is good now. I am ok with this PR.

@karllessard (Collaborator) left a comment

LGTM, @Craigacp any additional comment before I merge this?

@Craigacp (Collaborator) commented

> LGTM, @Craigacp any additional comment before I merge this?

Apart from the NoOp issue that I just mentioned, it all looks fine.

I am worried about the level of change this makes to the codebase, and I can't quite see the reason for it, but that doesn't mean there isn't a good one.

@rnett (Contributor, Author) commented Aug 21, 2021

@Craigacp It's mostly to support functions, which are Graphs with eager init scopes; that wasn't possible to implement using the fake graph ops we were using for init previously.

@karllessard (Collaborator) commented

> I am worried about the level of change this makes to the codebase, and I can't quite see the reason for it, but that doesn't mean there isn't a good one.

According to @JimClarke5's comment, the improvements in the framework are substantial. Personally, I think that if we can lift the burden of initialization from users it's a plus, but saved models were already handling this for inference, so the changes only impact users who were doing their training in Java (@rnett, am I right here?)

@rnett (Contributor, Author) commented Aug 21, 2021

> According to @JimClarke5's comment, the improvements in the framework are substantial. Personally, I think that if we can lift the burden of initialization from users it's a plus, but saved models were already handling this for inference, so the changes only impact users who were doing their training in Java (@rnett, am I right here?)

Yeah, the SavedModel handles it for TF2 models. This also makes it a bit easier to handle TF1 models, and we can export our initializers more easily (I'm going to make a new PR for those).

@JimClarke5 (Contributor) commented

This PR will be beneficial whenever a model is created by the developer, which includes all the major model actions: fit, evaluate, and predict. Some actions, such as predict, may not necessarily require that the model be saved first.

@Craigacp (Collaborator) left a comment

Ok, I think I'm fine with this being merged.

I do wonder if we could get away with just tracking the init ops better, since we own all of them, without creating a privileged init scope that people could use incorrectly (and which isn't type safe), but I don't understand the eager/graph function use case well enough to know if that would cause problems.

karllessard merged commit 242931c into tensorflow:master on Aug 23, 2021
@karllessard (Collaborator) commented

Thanks @rnett!
