@@ -40,19 +40,18 @@ ML-Agents provides two reward signals by default, the Extrinsic (environment) re
Curiosity reward, which can be used to encourage exploration in sparse extrinsic reward
environments.
- #### Number of Updates for Reward Signal (Optional)
+ #### Steps Per Update for Reward Signal (Optional)
- `reward_signal_num_update` for the reward signals corresponds to the number of mini batches sampled
- and used for updating the reward signals during each
- update. By default, we update the reward signals once every time the main policy is updated.
+ `reward_signal_steps_per_update` for the reward signals corresponds to the number of agent steps per mini batch sampled
+ and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated.

However, to imitate the training procedure in certain imitation learning papers (e.g.
[Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)),
- we may want to update the policy N times, then update the reward signal (GAIL) M times.
- We can change `train_interval` and `num_update` of SAC to N, as well as `reward_signal_num_update`
- under `reward_signals` to M to accomplish this. By default, `reward_signal_num_update` is set to
- `num_update`.
+ we may want to update the reward signal (GAIL) M times for every update of the policy.
+ We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update`
+ under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to
+ `steps_per_update`.

- Typical Range: `num_update`
+ Typical Range: `steps_per_update`
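
For illustration, a SAC trainer configuration that follows the renamed parameters described above might look like the sketch below. The behavior name, reward-signal strengths, and demo path are hypothetical placeholders; only `steps_per_update`, `reward_signal_steps_per_update`, and the `reward_signals` section are taken from the text of this change.

```yaml
# Illustrative sketch only: behavior name, strengths, and demo path are made up.
# The policy is updated once every N = 10 agent steps; GAIL is updated M = 2 times
# per policy update, i.e. once every N / M = 5 agent steps.
MyBehavior:
    trainer: sac
    steps_per_update: 10
    reward_signals:
        reward_signal_steps_per_update: 5
        extrinsic:
            strength: 1.0
            gamma: 0.99
        gail:
            strength: 0.01
            gamma: 0.99
            demo_path: Demos/Expert.demo
```
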
### Buffer Size
@@ -106,17 +105,22 @@ there may not be any new interesting information between steps, and `train_inter
Typical Range: `1` - `5`
- ### Number of Updates
+ ### Steps Per Update
- `num_update` corresponds to the number of mini batches sampled and used for training during each
- training event. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience
- replay buffer, and using this mini batch to update the models. Typically, this can be left at 1.
- However, to imitate the training procedure in certain papers (e.g.
- [Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)),
- we may want to update N times with different mini batches before grabbing additional samples.
- We can change `train_interval` and `num_update` to N to accomplish this.
+ `steps_per_update` corresponds to the average ratio of agent steps (actions) taken to updates made to the agent's
+ policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience
+ replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after
+ exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps.
+
+ Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will
+ improve sample efficiency (reduce the number of steps required to train)
+ but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example
+ environments), `steps_per_update` equal to the number of agents in the scene is a good balance.
+ For slow environments (steps take 0.1 seconds or more), reducing `steps_per_update` may improve training speed.
+ We can also change `steps_per_update` to a value lower than 1 to update more often than once per step, though this will
+ usually result in a slowdown unless the environment is very slow.

- Typical Range: `1`
+ Typical Range: `1` - `20`
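
As a concrete reading of the guidance above, suppose a scene contains 16 agents and environment steps are fast. A reasonable starting point would be to match `steps_per_update` to the agent count; the behavior name and the other values below are hypothetical placeholders, not part of this change.

```yaml
# Hypothetical starting point: 16 agents in the scene, so begin with
# steps_per_update equal to the agent count and tune from there.
MyBehavior:
    trainer: sac
    batch_size: 128
    buffer_size: 50000
    steps_per_update: 16
```

Lowering `steps_per_update` from there trades extra CPU time spent on updates for better sample efficiency, as noted above.
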
### Tau