
[refactor] Move checkpoint saving into trainer #4034

Merged: 12 commits, Jun 4, 2020
1 change: 1 addition & 0 deletions com.unity.ml-agents/CHANGELOG.md
@@ -33,6 +33,7 @@ vector observations to be used simultaneously. (#3981) Thank you @shakenes !
#### ml-agents / ml-agents-envs / gym-unity (Python)
- Unity Player logs are now written out to the results directory. (#3877)
- Run configuration YAML files are written out to the results directory at the end of the run. (#3815)
- The `--save-freq` CLI option has been removed, and replaced by a `checkpoint_interval` option in the trainer configuration YAML. (#4034)
- When trying to load/resume from a checkpoint created with an earlier version of ML-Agents,
a warning will be thrown. (#4035)
### Bug Fixes
4 changes: 4 additions & 0 deletions docs/Migrating.md
@@ -28,6 +28,8 @@ double-check that the versions are the same. The versions can be found in
- `use_visual` and `allow_multiple_visual_obs` in the `UnityToGymWrapper` constructor
were replaced by `allow_multiple_obs` which allows one or more visual observations and
vector observations to be used simultaneously.
- `--save-freq` has been removed from the CLI and is now configurable in the trainer configuration
file.
- `--lesson` has been removed from the CLI. Lessons will resume when using `--resume`.
To start at a different lesson, modify your Curriculum configuration.

@@ -49,6 +51,8 @@ vector observations to be used simultaneously.
- If you use the `UnityToGymWrapper`, remove `use_visual` and `allow_multiple_visual_obs`
from the constructor and add `allow_multiple_obs = True` if the environment contains either
both visual and vector observations or multiple visual observations.
- If you were setting `--save-freq` in the CLI, add a `checkpoint_interval` value in your
  trainer configuration, and set it equal to `save-freq * n_agents_in_scene` (see the sketch below).
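
For illustration, a minimal sketch of the new configuration, assuming a behavior named `MyBehavior`, a previous `--save-freq` of 50000, and a single agent in the scene (the behavior name and the values are hypothetical):

```yaml
behaviors:
  MyBehavior:
    # ...other trainer settings for this behavior...
    # Previously set on the command line as --save-freq=50000 (flag now removed).
    # With one agent in the scene: checkpoint_interval = 50000 * 1
    checkpoint_interval: 50000
    keep_checkpoints: 5
```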

## Migrating from 0.15 to Release 1

3 changes: 2 additions & 1 deletion docs/Training-Configuration-File.md
@@ -30,7 +30,8 @@ choice of the trainer (which we review in subsequent sections).
| `summary_freq` | (default = `50000`) Number of experiences that needs to be collected before generating and displaying training statistics. This determines the granularity of the graphs in Tensorboard. |
| `time_horizon` | (default = `64`) How many steps of experience to collect per-agent before adding it to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon) and more biased, but less varied estimate (short time horizon). In cases where there are frequent rewards within an episode, or episodes are prohibitively large, a smaller number can be more ideal. This number should be large enough to capture all the important behavior within a sequence of an agent's actions. <br><br> Typical range: `32` - `2048` |
| `max_steps` | (default = `500000`) Total number of steps (i.e., observation collected and action taken) that must be taken in the environment (or across all environments if using multiple in parallel) before ending the training process. If you have multiple agents with the same behavior name within your environment, all steps taken by those agents will contribute to the same `max_steps` count. <br><br>Typical range: `5e5` - `1e7` |
| `keep_checkpoints` | (default = `5`) The maximum number of model checkpoints to keep. Checkpoints are saved after the number of steps specified by the save-freq option. Once the maximum number of checkpoints has been reached, the oldest checkpoint is deleted when saving a new checkpoint. |
| `keep_checkpoints` | (default = `5`) The maximum number of model checkpoints to keep. Checkpoints are saved after the number of steps specified by the checkpoint_interval option. Once the maximum number of checkpoints has been reached, the oldest checkpoint is deleted when saving a new checkpoint. |
| `checkpoint_interval` | (default = `500000`) The number of experiences collected between each checkpoint by the trainer. A maximum of `keep_checkpoints` checkpoints are saved before old ones are deleted. |
| `init_path` | (default = None) Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. <br><br>You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |
| `threaded` | (default = `true`) By default, model updates can happen while the environment is being stepped. This violates the [on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms) assumption of PPO slightly in exchange for a training speedup. To maintain the strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`. There is usually no reason to turn `threaded` off for SAC. |
| `hyperparameters -> learning_rate` | (default = `3e-4`) Initial learning rate for gradient descent. Corresponds to the strength of each gradient descent update step. This should typically be decreased if training is unstable, and the reward does not consistently increase. <br><br>Typical range: `1e-5` - `1e-3` |
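
To make the interaction of `keep_checkpoints` and `checkpoint_interval` concrete, here is a hypothetical fragment of one behavior's trainer settings (only the relevant keys are shown; the values are illustrative). With these values a checkpoint is written every 100,000 experiences, and once five checkpoints exist the oldest is deleted each time a new one is saved.

```yaml
# Hypothetical fragment of one behavior's trainer settings
max_steps: 500000
keep_checkpoints: 5
checkpoint_interval: 100000
```
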
1 change: 1 addition & 0 deletions docs/Training-ML-Agents.md
@@ -231,6 +231,7 @@ behaviors:
time_horizon: 64
summary_freq: 10000
keep_checkpoints: 5
checkpoint_interval: 50000
threaded: true
init_path: null

3 changes: 0 additions & 3 deletions docs/Using-Tensorboard.md
@@ -29,9 +29,6 @@ runs you want to display. You can select multiple run-ids to compare statistics.
The TensorBoard window also provides options for how to display and smooth
graphs.

When you run the training program, `mlagents-learn`, you can use the
`--save-freq` option to specify how frequently to save the statistics.

## The ML-Agents Toolkit training statistics

The ML-Agents training program saves the following statistics:
7 changes: 0 additions & 7 deletions ml-agents/mlagents/trainers/cli_utils.py
@@ -105,13 +105,6 @@ def _create_parser() -> argparse.ArgumentParser:
"current environment.",
action=DetectDefault,
)
argparser.add_argument(
"--save-freq",
default=50000,
type=int,
help="How often (in steps) to save the model during training",
action=DetectDefault,
)
argparser.add_argument(
"--seed",
default=-1,
2 changes: 1 addition & 1 deletion ml-agents/mlagents/trainers/ghost/trainer.py
@@ -240,7 +240,7 @@ def advance(self) -> None:
except AgentManagerQueue.Empty:
pass

self.next_summary_step = self.trainer.next_summary_step
self._next_summary_step = self.trainer._next_summary_step
self.trainer.advance()
if self.get_step - self.last_team_change > self.steps_to_train_team:
self.controller.change_training_team(self.get_step)
1 change: 0 additions & 1 deletion ml-agents/mlagents/trainers/learn.py
@@ -152,7 +152,6 @@ def run_training(run_seed: int, options: RunOptions) -> None:
trainer_factory,
write_path,
checkpoint_settings.run_id,
checkpoint_settings.save_freq,
maybe_meta_curriculum,
not checkpoint_settings.inference,
run_seed,
1 change: 0 additions & 1 deletion ml-agents/mlagents/trainers/ppo/trainer.py
@@ -253,7 +253,6 @@ def add_policy(
self.collected_rewards[_reward_signal] = defaultdict(lambda: 0)
# Needed to resume loads properly
self.step = policy.get_current_step()
self.next_summary_step = self._get_next_summary_step()

def get_policy(self, name_behavior_id: str) -> TFPolicy:
"""
1 change: 0 additions & 1 deletion ml-agents/mlagents/trainers/sac/trainer.py
@@ -333,7 +333,6 @@ def add_policy(
self.reward_signal_update_steps = int(
max(1, self.step / self.reward_signal_steps_per_update)
)
self.next_summary_step = self._get_next_summary_step()

def get_policy(self, name_behavior_id: str) -> TFPolicy:
"""
2 changes: 1 addition & 1 deletion ml-agents/mlagents/trainers/settings.py
@@ -192,6 +192,7 @@ def _set_default_hyperparameters(self):
init_path: Optional[str] = None
output_path: str = "default"
keep_checkpoints: int = 5
checkpoint_interval: int = 500000
max_steps: int = 500000
time_horizon: int = 64
summary_freq: int = 50000
@@ -267,7 +268,6 @@ class MeasureType:

@attr.s(auto_attribs=True)
class CheckpointSettings:
save_freq: int = parser.get_default("save_freq")
run_id: str = parser.get_default("run_id")
initialize_from: str = parser.get_default("initialize_from")
load_model: bool = parser.get_default("load_model")
8 changes: 0 additions & 8 deletions ml-agents/mlagents/trainers/tests/test_learn.py
@@ -32,7 +32,6 @@ def basic_options(extra_args=None):
seed: 9870
checkpoint_settings:
run_id: uselessrun
save_freq: 654321
debug: false
"""

@@ -83,7 +82,6 @@ def test_run_training(
trainer_factory_mock.return_value,
"results/ppo",
"ppo",
50000,
None,
True,
0,
@@ -122,7 +120,6 @@ def test_commandline_args(mock_file):
assert opt.checkpoint_settings.resume is False
assert opt.checkpoint_settings.inference is False
assert opt.checkpoint_settings.run_id == "ppo"
assert opt.checkpoint_settings.save_freq == 50000
assert opt.env_settings.seed == -1
assert opt.env_settings.base_port == 5005
assert opt.env_settings.num_envs == 1
@@ -136,7 +133,6 @@ def test_commandline_args(mock_file):
"--resume",
"--inference",
"--run-id=myawesomerun",
"--save-freq=123456",
"--seed=7890",
"--train",
"--base-port=4004",
@@ -150,7 +146,6 @@
assert opt.env_settings.env_path == "./myenvfile"
assert opt.parameter_randomization is None
assert opt.checkpoint_settings.run_id == "myawesomerun"
assert opt.checkpoint_settings.save_freq == 123456
assert opt.env_settings.seed == 7890
assert opt.env_settings.base_port == 4004
assert opt.env_settings.num_envs == 2
@@ -169,7 +164,6 @@ def test_yaml_args(mock_file):
assert opt.env_settings.env_path == "./oldenvfile"
assert opt.parameter_randomization is None
assert opt.checkpoint_settings.run_id == "uselessrun"
assert opt.checkpoint_settings.save_freq == 654321
assert opt.env_settings.seed == 9870
assert opt.env_settings.base_port == 4001
assert opt.env_settings.num_envs == 4
@@ -183,7 +177,6 @@
"--resume",
"--inference",
"--run-id=myawesomerun",
"--save-freq=123456",
"--seed=7890",
"--train",
"--base-port=4004",
@@ -197,7 +190,6 @@
assert opt.env_settings.env_path == "./myenvfile"
assert opt.parameter_randomization is None
assert opt.checkpoint_settings.run_id == "myawesomerun"
assert opt.checkpoint_settings.save_freq == 123456
assert opt.env_settings.seed == 7890
assert opt.env_settings.base_port == 4004
assert opt.env_settings.num_envs == 2
1 change: 0 additions & 1 deletion ml-agents/mlagents/trainers/tests/test_ppo.py
@@ -351,7 +351,6 @@ def test_add_get_policy(ppo_optimizer, dummy_config):

# Make sure the summary steps were loaded properly
assert trainer.get_step == 2000
assert trainer.next_summary_step > 2000

# Test incorrect class of policy
policy = mock.Mock()
49 changes: 48 additions & 1 deletion ml-agents/mlagents/trainers/tests/test_rl_trainer.py
@@ -43,7 +43,12 @@ def _process_trajectory(self, trajectory):

def create_rl_trainer():
mock_brainparams = create_mock_brain()
trainer = FakeTrainer(mock_brainparams, TrainerSettings(max_steps=100), True, 0)
trainer = FakeTrainer(
mock_brainparams,
TrainerSettings(max_steps=100, checkpoint_interval=10, summary_freq=20),
True,
0,
)
trainer.set_is_policy_updating(True)
return trainer

@@ -107,3 +112,45 @@ def test_advance(mocked_clear_update_buffer):
# Check that the buffer has been cleared
assert not trainer.should_still_train
assert mocked_clear_update_buffer.call_count > 0


@mock.patch("mlagents.trainers.trainer.trainer.Trainer.save_model")
@mock.patch("mlagents.trainers.trainer.trainer.StatsReporter.write_stats")
def test_summary_checkpoint(mock_write_summary, mock_save_model):
trainer = create_rl_trainer()
trajectory_queue = AgentManagerQueue("testbrain")
policy_queue = AgentManagerQueue("testbrain")
trainer.subscribe_trajectory_queue(trajectory_queue)
trainer.publish_policy_queue(policy_queue)
time_horizon = 10
summary_freq = trainer.trainer_settings.summary_freq
checkpoint_interval = trainer.trainer_settings.checkpoint_interval
trajectory = mb.make_fake_trajectory(
length=time_horizon,
max_step_complete=True,
vec_obs_size=1,
num_vis_obs=0,
action_space=[2],
)
    # Feed in 5 trajectories of length 10 (50 steps in total) and advance the trainer
num_trajectories = 5
for _ in range(0, num_trajectories):
trajectory_queue.put(trajectory)
trainer.advance()
# Check that there is stuff in the policy queue
policy_queue.get_nowait()

# Check that we have called write_summary the appropriate number of times
calls = [
mock.call(step)
for step in range(summary_freq, num_trajectories * time_horizon, summary_freq)
]
mock_write_summary.assert_has_calls(calls, any_order=True)

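    # With checkpoint_interval=10 and 50 steps in total, save_model is expected at steps 10, 20, 30 and 40.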
calls = [
mock.call(trainer.brain_name)
for step in range(
checkpoint_interval, num_trajectories * time_horizon, checkpoint_interval
)
]
mock_save_model.assert_has_calls(calls, any_order=True)
1 change: 0 additions & 1 deletion ml-agents/mlagents/trainers/tests/test_sac.py
@@ -138,7 +138,6 @@ def test_add_get_policy(sac_optimizer, dummy_config):

# Make sure the summary steps were loaded properly
assert trainer.get_step == 2000
assert trainer.next_summary_step > 2000

# Test incorrect class of policy
policy = mock.Mock()
2 changes: 0 additions & 2 deletions ml-agents/mlagents/trainers/tests/test_simple_rl.py
@@ -115,7 +115,6 @@ def _check_environment_trains(
# Create controller and begin training.
with tempfile.TemporaryDirectory() as dir:
run_id = "id"
save_freq = 99999
seed = 1337
StatsReporter.writers.clear() # Clear StatsReporters so we don't write to file
debug_writer = DebugWriter()
@@ -142,7 +141,6 @@ def _check_environment_trains(
training_seed=seed,
sampler_manager=SamplerManager(None),
resampling_interval=None,
save_freq=save_freq,
)

# Begin training
2 changes: 0 additions & 2 deletions ml-agents/mlagents/trainers/tests/test_trainer_controller.py
@@ -15,7 +15,6 @@ def basic_trainer_controller():
trainer_factory=trainer_factory_mock,
output_path="test_model_path",
run_id="test_run_id",
save_freq=100,
meta_curriculum=None,
train=True,
training_seed=99,
@@ -34,7 +33,6 @@ def test_initialization_seed(numpy_random_seed, tensorflow_set_seed):
trainer_factory=trainer_factory_mock,
output_path="",
run_id="1",
save_freq=1,
meta_curriculum=None,
train=True,
training_seed=seed,