Merged
48 commits
e19b038
ghost controller
andrewcoh Feb 29, 2020
3335cc8
Merge branch 'master' into self-play-mutex
andrewcoh Mar 16, 2020
49f5cf4
Merge branch 'master' into self-play-mutex
andrewcoh Mar 18, 2020
33ff2ff
team id centric ghost trainer
andrewcoh Mar 18, 2020
3f69db7
ELO calculation done in ghost controller
andrewcoh Mar 18, 2020
e19f9e5
removed opponent elo from stat collection
andrewcoh Mar 18, 2020
4e1e139
passing all tests locally
andrewcoh Mar 19, 2020
1741c54
fixed controller behavior when first team discovered isnt 0
andrewcoh Mar 19, 2020
cc17ea1
no negative team id in docs
andrewcoh Mar 19, 2020
43417e1
save step on trainer step count/swap on ghost
andrewcoh Mar 19, 2020
124f886
urllib parse
andrewcoh Mar 19, 2020
8778cec
Update docs/Training-Self-Play.md
andrewcoh Mar 19, 2020
33c5ea9
remove whitespace
andrewcoh Mar 19, 2020
c2eea64
Merge branch 'master' into self-play-mutex
andrewcoh Mar 19, 2020
bd86108
docstrings/ghost_swap -> team_change
andrewcoh Mar 20, 2020
82bdfc4
replaced ghost_swap with team_change in tests
andrewcoh Mar 20, 2020
cb855db
docstrings for all ghost trainer functions
andrewcoh Mar 20, 2020
fb5ccd0
SELF-PLAY NOW SUPPORTS MULTIAGENT TRAINERS
andrewcoh Mar 21, 2020
c3890f5
next learning team from get step
andrewcoh Mar 21, 2020
cad0a2d
comment for self.ghost_step
andrewcoh Mar 21, 2020
f68f7aa
fixed export so both teams have current model
andrewcoh Mar 22, 2020
4c9ba86
updated self-play doc for asymmetric games/changed current_self->curr…
andrewcoh Mar 23, 2020
ffe2cfd
count trainer steps in controller by team id
andrewcoh Mar 23, 2020
c2ae207
added team_change as a yaml config
andrewcoh Mar 23, 2020
7e0ff7b
removed team-change CLI
andrewcoh Mar 23, 2020
d2dd975
fixed tests that expected old hyperparam team-change
andrewcoh Mar 23, 2020
6aae133
doc update for team_change
andrewcoh Mar 23, 2020
d560b5f
removed not max step reached as condition for ELO
andrewcoh Mar 24, 2020
2bf9271
Merge branch 'master' into self-play-mutex
andrewcoh Mar 24, 2020
29435bb
warning for team change hyperparam
andrewcoh Mar 25, 2020
97f1b7d
simple rl asymm ghost tests
andrewcoh Mar 25, 2020
d123fe7
Merge branch 'master' into self-play-mutex
andrewcoh Mar 25, 2020
2cb5a2d
renamed controller methods/doc fixes
andrewcoh Mar 25, 2020
27e924e
current_best_ratio -> latest_model_ratio
andrewcoh Mar 25, 2020
f3332c3
added Foerster paper title to doc
andrewcoh Mar 26, 2020
aca54be
doc fix
andrewcoh Mar 26, 2020
0e52b20
Merge branch 'master' into self-play-mutex
andrewcoh Mar 26, 2020
95469d2
doc fix
andrewcoh Mar 27, 2020
10bd9dd
Merge branch 'master' into self-play-mutex
andrewcoh Mar 27, 2020
01f9de3
Merge branch 'master' into self-play-mutex
andrewcoh Mar 30, 2020
61649ea
using mlagents_env.logging instead of logging
andrewcoh Mar 30, 2020
972ed63
doc fix
andrewcoh Mar 31, 2020
7e0a3ba
modified doc to not include strikers vs goalie
andrewcoh Apr 1, 2020
6c5342d
removed "unpredictable behavior"
andrewcoh Apr 1, 2020
9149413
Merge branch 'master' into self-play-mutex
andrewcoh Apr 1, 2020
02455a4
added to mig doc/address comments
andrewcoh Apr 1, 2020
df8b87f
raise warning when latest_model_ratio not btwn 0, 1
andrewcoh Apr 1, 2020
1333fb9
removed Goalie from learning environment examples
andrewcoh Apr 1, 2020
6 changes: 4 additions & 2 deletions config/trainer_config.yaml
@@ -137,9 +137,10 @@ Tennis:
    time_horizon: 1000
    self_play:
        window: 10
        play_against_current_self_ratio: 0.5
        play_against_latest_model_ratio: 0.5
        save_steps: 50000
        swap_steps: 50000
        team_change: 100000

Soccer:
    normalize: false
@@ -152,9 +153,10 @@ Soccer:
    num_layers: 2
    self_play:
        window: 10
        play_against_current_self_ratio: 0.5
        play_against_latest_model_ratio: 0.5
        save_steps: 50000
        swap_steps: 50000
        team_change: 100000

CrawlerStatic:
    normalize: true
1 change: 0 additions & 1 deletion docs/Learning-Environment-Examples.md
@@ -349,7 +349,6 @@ return.
* Goal:
* Get the ball into the opponent's goal while preventing
the ball from entering own goal.
* Goalie:
* Agents: The environment contains four agents, with the same
Behavior Parameters : Soccer.
* Agent Reward Function (dependent):
1 change: 1 addition & 0 deletions docs/Migrating.md
@@ -12,6 +12,7 @@ The versions can be found in
### Important changes
* The `--load` and `--train` command-line flags have been deprecated and replaced with `--resume` and `--inference`.
* Running with the same `--run-id` twice will now throw an error.
* The `play_against_current_self_ratio` self-play trainer hyperparameter has been renamed to `play_against_latest_model_ratio`

### Steps to Migrate
* Replace the `--load` flag with `--resume` when calling `mlagents-learn`, and don't use the `--train` flag as training
100 changes: 81 additions & 19 deletions docs/Training-Self-Play.md
@@ -1,21 +1,40 @@
# Training with Self-Play

ML-Agents provides the functionality to train symmetric, adversarial games with [Self-Play](https://openai.com/blog/competitive-self-play/).
A symmetric game is one in which opposing agents are *equal* in form and function. In reinforcement learning,
this means both agents have the same observation and action spaces.
With self-play, an agent learns in adversarial games by competing against fixed, past versions of itself
to provide a more stable, stationary learning environment. This is compared
to competing against its current self in every episode, which is a constantly changing opponent.
ML-Agents provides the functionality to train both symmetric and asymmetric adversarial games with
[Self-Play](https://openai.com/blog/competitive-self-play/).
A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games
are our Tennis and Soccer example environments. In reinforcement learning, this means both agents have the same observation and
action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games,
this is not the case. An example of an asymmetric game is Hide and Seek. Agents in these
types of games do not always have the same observation or action spaces and so sharing policy networks is not
necessarily ideal.

With self-play, an agent learns in adversarial games by competing against fixed, past versions of its opponent
(which could be itself as in symmetric games) to provide a more stable, stationary learning environment. This is compared
to competing against the current, best opponent in every episode, which is constantly changing (because it's learning).

Self-play can be used with our implementations of both [Proximal Policy Optimization (PPO)](Training-PPO.md) and [Soft Actor-Critic (SAC)](Training-SAC.md).
However, from the perspective of an individual agent, these scenarios appear to have non-stationary dynamics because the opponent is often changing.
This can cause significant issues in the experience replay mechanism used by SAC. Thus, we recommend that users use PPO. For further reading on
this issue in particular, see the paper [Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/1702.08887.pdf).
For more general information on training with ML-Agents, see [Training ML-Agents](Training-ML-Agents.md).
For more algorithm specific instruction, please see the documentation for [PPO](Training-PPO.md) or [SAC](Training-SAC.md).

Self-play is triggered by including the self-play hyperparameter hierarchy in the trainer configuration file. A detailed description of the self-play hyperparameters is given below. Furthermore, to distinguish opposing agents, set the team ID to different integer values in the Behavior Parameters script on the agent prefab.

![Team ID](images/team_id.png)

See the trainer configuration and agent prefabs for our Tennis environment for an example.
***Team ID must be 0 or an integer greater than 0.***
Contributor: Sorry, I know you said this was due to mypy and I never followed up with you on it. Similar to my other comment, if this was just done to make mypy happy, we can always get around that. Let's follow up on it afterwards.

Contributor Author: Basically, mypy wouldn't let me initialize the learning team ID int to None in the GhostController so I used -1.

Contributor: You'd probably have to change the type to Optional[int] and handle the None case.


In symmetric games, since all agents (even on opposing teams) will share the same policy, they should have the same 'Behavior Name' in their
Behavior Parameters Script. In asymmetric games, they should have a different Behavior Name in their Behavior Parameters script.
Note, in asymmetric games, the agents must have both different Behavior Names *and* different team IDs! Then, specify the trainer configuration
for each Behavior Name in your scene as you would normally, and remember to include the self-play hyperparameter hierarchy!

Contributor: So this means you can't have zerg?teamId=0 and protoss?teamId=0? Is this a fundamental limitation? Sounds like something people are likely to get tripped up on. If it's a removable restriction, don't let it block this PR, but can you log a jira for followup?

Contributor Author: This will be something we support when we introduce a true multiagent trainer i.e. multiple behavior names that are on the same team.

For examples of how to use this feature, you can see the trainer configurations and agent prefabs for our Tennis and Soccer environments.
Tennis and Soccer provide examples of symmetric games. To train an asymmetric game, specify trainer configurations for each of your behavior names
and include the self-play hyperparameter hierarchy in both.


## Best Practices Training with Self-Play

@@ -24,7 +43,8 @@ issues faced by reinforcement learning. In general, the tradeoff is between
the skill level and generality of the final policy and the stability of learning.
Training against a set of slowly or unchanging adversaries with low diversity
results in a more stable learning process than training against a set of quickly
changing adversaries with high diversity. With this context, this guide discusses the exposed self-play hyperparameters and intuitions for tuning them.
changing adversaries with high diversity. With this context, this guide discusses
the exposed self-play hyperparameters and intuitions for tuning them.


## Hyperparameters
@@ -39,31 +59,70 @@ The reward signal should still be used as described in the documentation for the

### Save Steps

The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps`=10000 then a snapshot of the current policy will be saved every 10000 trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13.
The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13.

A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but may also lead to a more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent.

Recommended Range : 10000-100000

### Swap Steps
### Team Change

The `team_change` parameter corresponds to the number of *trainer steps* between switching the learning team.
This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team
becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar
performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents
per team switch.

A larger value of `team_change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents,
the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies,
and so the agent may fail against the next batch of opponents.

The `swap_steps` parameter corresponds to the number of *trainer steps* between swapping the opponents policy with a different snapshot. As in the `save_steps` discussion, note that trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13.
The value of `team_change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we
recommend setting this value as a function of the `save_steps` parameter discussed previously.

Recommended Range : 4x-10x where x=`save_steps`
Contributor: Instead of specifying the value here, would it be easier to specify it as a multiple of save_steps?

Contributor Author: I don't have strong feelings either way. If you think that's more intuitive then that works for me.

Contributor: Not strong feelings, I just think it makes it easier to twist one knob at a time, instead of having to twist 2 in unison.

### Swap Steps

The `swap_steps` parameter corresponds to the number of *ghost steps* (not trainer steps) between swapping the opponent's policy with a different snapshot.
A 'ghost step' refers to a step taken by an agent *that is following a fixed policy and not learning*. The reason for this distinction is that in asymmetric games,
we may have teams with an unequal number of agents e.g. a 2v1 scenario. The team with two agents collects
twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number
of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if
a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents`
agents during `team_change` total steps is:

```
swap_steps = (num_agents / num_opponent_agents) * (team_change / x)
```

Contributor: Should we be doing the math in the code? I think math is hard...

Contributor Author: We don't know how many agents are on each team

Contributor: Don't we know the number of steps coming in for each team though? Also, would this need to be different if the agents and opponents were running at different decision intervals?

As an example, in a 2v1 scenario, if we want the swap to occur `x=4` times during `team_change=200000` steps,
the `swap_steps` for the team of one agent is:

```
swap_steps = (1 / 2) * (200000 / 4) = 25000
```
The `swap_steps` for the team of two agents is:
```
swap_steps = (2 / 1) * (200000 / 4) = 100000
```
Note, with equal team sizes, the first term is equal to 1 and `swap_steps` can be calculated by just dividing the total steps by the desired number of swaps.
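
If it helps, the same arithmetic can be scripted. A minimal sketch (a hypothetical helper, not part of ML-Agents) that reproduces the numbers above:

```
def compute_swap_steps(
    num_agents: int, num_opponent_agents: int, team_change: int, desired_swaps: int
) -> int:
    """Return a swap_steps value giving `desired_swaps` opponent swaps per team_change."""
    return int((num_agents / num_opponent_agents) * (team_change / desired_swaps))


# The 2v1 example above: the one-agent team swaps every 25000 ghost steps,
# the two-agent team every 100000.
print(compute_swap_steps(1, 2, 200000, 4))  # 25000
print(compute_swap_steps(2, 1, 200000, 4))  # 100000
```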

A larger value of `swap_steps` means that an agent will play against the same fixed opponent for a larger number of training iterations. This results in a more stable training scenario, but leaves the agent open to the risk of overfitting its behavior to this particular opponent. Thus, when a new opponent is swapped in, the agent may lose more often than expected.

Recommended Range : 10000-100000

### Play against current self ratio
### Play against latest model ratio

The `play_against_current_self_ratio` parameter corresponds to the probability
an agent will play against its ***current*** self. With probability
1 - `play_against_current_self_ratio`, the agent will play against a snapshot of itself
from a past iteration.
The `play_against_latest_model_ratio` parameter corresponds to the probability
an agent will play against the latest opponent policy. With probability
1 - `play_against_latest_model_ratio`, the agent will play against a snapshot of its
opponent from a past iteration.

A larger value of `play_against_current_self_ratio` indicates that an agent will be playing against itself more often. Since the agent is updating it's policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curricula](https://openai.com/blog/emergent-tool-use/) of more increasingly challenging situations which may lead to a stronger final policy.
A larger value of `play_against_latest_model_ratio` indicates that an agent will be playing against the current opponent more often. Since the agent is updating its policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curricula](https://openai.com/blog/emergent-tool-use/) of increasingly challenging situations which may lead to a stronger final policy.

Recommended Range : 0.0 - 1.0
Range : 0.0 - 1.0
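
For intuition only, a minimal sketch of how such a ratio is typically applied when choosing the opponent for an episode (an illustration, not the actual trainer code):

```
import random


def pick_opponent(latest_policy, snapshots, play_against_latest_model_ratio=0.5):
    # With probability play_against_latest_model_ratio, face the latest opponent policy;
    # otherwise sample a past snapshot from the saved window.
    if snapshots and random.random() >= play_against_latest_model_ratio:
        return random.choice(snapshots)
    return latest_policy
```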

### Window

@@ -83,5 +142,8 @@ using TensorBoard, see
In adversarial games, the cumulative environment reward may not be a meaningful metric by which to track learning progress. This is because cumulative reward is entirely dependent on the skill of the opponent. An agent at a particular skill level will get more or less reward against a worse or better agent, respectively.

We provide an implementation of the ELO rating system, a method for calculating the relative skill level between two players from a given population in a zero-sum game. For more information on ELO, please see [the ELO wiki](https://en.wikipedia.org/wiki/Elo_rating_system).

In a proper training run, the ELO of the agent should steadily increase. The absolute value of the ELO is less important than the change in ELO over training iterations.

Note, this implementation will support any number of teams but ELO is only applicable to games with two teams. It is ongoing work to implement
a reliable metric for measuring progress in scenarios with three or more teams. These scenarios can still train, though as of now, reward and qualitative observations
are the only metrics by which we can judge performance.
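
For reference, a minimal sketch of the standard two-player ELO update (the expected-score formula mirrors `compute_elo_rating_changes` in `ghost/controller.py`; the K-factor of 16 here is an assumption for illustration):

```
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    # Probability that player A beats player B under the ELO model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, result: float, k: float = 16.0) -> float:
    # result is 1.0 for a win, 0.5 for a draw, 0.0 for a loss, from A's perspective.
    return rating_a + k * (result - elo_expected_score(rating_a, rating_b))


# Example: a 1200-rated agent that beats a 1300-rated opponent gains about 10 points.
print(round(elo_update(1200, 1300, 1.0), 1))  # 1210.2
```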
52 changes: 31 additions & 21 deletions ml-agents/mlagents/trainers/behavior_id_utils.py
@@ -1,36 +1,46 @@
from typing import Dict, NamedTuple
from typing import NamedTuple
from urllib.parse import urlparse, parse_qs


class BehaviorIdentifiers(NamedTuple):
name_behavior_id: str
"""
BehaviorIdentifiers is a named tuple of the identifiers that uniquely distinguish
an agent encountered in the trainer_controller. The named tuple consists of the
fully qualified behavior name, the name of the brain name (corresponds to trainer
in the trainer controller) and the team id. In the future, this can be extended
to support further identifiers.
"""

behavior_id: str
brain_name: str
behavior_ids: Dict[str, int]
team_id: int

@staticmethod
def from_name_behavior_id(name_behavior_id: str) -> "BehaviorIdentifiers":
"""
Parses a name_behavior_id of the form name?team=0&param1=i&...
Parses a name_behavior_id of the form name?team=0
into a BehaviorIdentifiers NamedTuple.
This allows you to access the brain name and distinguishing identifiers
without parsing more than once.
This allows you to access the brain name and team id of an agent
:param name_behavior_id: String of behavior params in HTTP format.
:returns: A BehaviorIdentifiers object.
"""

ids: Dict[str, int] = {}
if "?" in name_behavior_id:
name, identifiers = name_behavior_id.rsplit("?", 1)
if "&" in identifiers:
list_of_identifiers = identifiers.split("&")
else:
list_of_identifiers = [identifiers]

for identifier in list_of_identifiers:
key, value = identifier.split("=")
ids[key] = int(value)
else:
name = name_behavior_id

parsed = urlparse(name_behavior_id)
name = parsed.path
ids = parse_qs(parsed.query)
team_id: int = 0
if "team" in ids:
team_id = int(ids["team"][0])
return BehaviorIdentifiers(
name_behavior_id=name_behavior_id, brain_name=name, behavior_ids=ids
behavior_id=name_behavior_id, brain_name=name, team_id=team_id
)


def create_name_behavior_id(name: str, team_id: int) -> str:
"""
Reconstructs fully qualified behavior name from name and team_id
:param name: brain name
:param team_id: team ID
:return: name_behavior_id
"""
return name + "?team=" + str(team_id)

Contributor: Is this used anywhere? Would it be better as a method (or property) of BehaviorIdentifiers?

Contributor Author: It's used here and here. In both instances, it's used so that the correct policies are pushed onto the correct queues if the learning team changes right before a swap. I'm not sure it's appropriate to be a method/property of BehaviorIdentifiers because it's not really operating on data contained in a BehaviorIdentifier tuple.
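
For illustration, a quick round trip through these helpers (assuming the `mlagents` package from this branch is installed; output shown as comments):

```
from mlagents.trainers.behavior_id_utils import (
    BehaviorIdentifiers,
    create_name_behavior_id,
)

ids = BehaviorIdentifiers.from_name_behavior_id("Soccer?team=1")
print(ids.brain_name, ids.team_id)           # Soccer 1
print(create_name_behavior_id("Soccer", 1))  # Soccer?team=1
```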
92 changes: 92 additions & 0 deletions ml-agents/mlagents/trainers/ghost/controller.py
@@ -0,0 +1,92 @@
from mlagents_envs.logging_util import get_logger
from typing import Deque, Dict
from collections import deque
from mlagents.trainers.ghost.trainer import GhostTrainer

logger = get_logger(__name__)


class GhostController:
    """
    GhostController contains a queue of team ids. GhostTrainers subscribe to the GhostController and query
    it to get the current learning team. The GhostController cycles through team ids every 'swap_interval'
    which corresponds to the number of trainer steps between changing learning teams.
    The GhostController is a unique object and there can only be one per training run.
    """

    def __init__(self, maxlen: int = 10):
        """
        Create a GhostController.
        :param maxlen: Maximum number of GhostTrainers allowed in this GhostController
        """

        # Tracks last swap step for each learning team because trainer
        # steps of all GhostTrainers do not increment together
        self._queue: Deque[int] = deque(maxlen=maxlen)
        self._learning_team: int = -1
        # Dict from team id to GhostTrainer for ELO calculation
        self._ghost_trainers: Dict[int, GhostTrainer] = {}

    @property
    def get_learning_team(self) -> int:
        """
        Returns the current learning team.
        :return: The learning team id
        """
        return self._learning_team

    def subscribe_team_id(self, team_id: int, trainer: GhostTrainer) -> None:
        """
        Given a team_id and trainer, add to queue and trainers if not already.
        The GhostTrainer is used later by the controller to get ELO ratings of agents.
        :param team_id: The team_id of an agent managed by this GhostTrainer
        :param trainer: A GhostTrainer that manages this team_id.
        """
        if team_id not in self._ghost_trainers:
            self._ghost_trainers[team_id] = trainer
            if self._learning_team < 0:
                self._learning_team = team_id
            else:
                self._queue.append(team_id)

    def change_training_team(self, step: int) -> None:
        """
        The current learning team is added to the end of the queue and then updated with the
        next in line.
        :param step: The step of the trainer for debugging
        """
        self._queue.append(self._learning_team)
        self._learning_team = self._queue.popleft()
        logger.debug(
            "Learning team {} swapped on step {}".format(self._learning_team, step)
        )

    # Adapted from https://github.com/Unity-Technologies/ml-agents/pull/1975 and
    # https://metinmediamath.wordpress.com/2013/11/27/how-to-calculate-the-elo-rating-including-example/
    # ELO calculation
    # TODO : Generalize this to more than two teams
    def compute_elo_rating_changes(self, rating: float, result: float) -> float:
        """
        Calculates ELO. Given the rating of the learning team and result. The GhostController
        queries the other GhostTrainers for the ELO of their agent that is currently being deployed.
        Note, this could be the current agent or a past snapshot.
        :param rating: Rating of the learning team.
        :param result: Win, loss, or draw from the perspective of the learning team.
        :return: The change in ELO.
        """
        opponent_rating: float = 0.0
        for team_id, trainer in self._ghost_trainers.items():
            if team_id != self._learning_team:
                opponent_rating = trainer.get_opponent_elo()
        r1 = pow(10, rating / 400)
        r2 = pow(10, opponent_rating / 400)

        summed = r1 + r2
        e1 = r1 / summed

        change = result - e1
        for team_id, trainer in self._ghost_trainers.items():
            if team_id != self._learning_team:
                trainer.change_opponent_elo(change)

        return change
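
To see the team rotation in isolation, here is a small, hypothetical driver that subscribes two team ids and cycles the learning team; the stub stands in for real GhostTrainers and is not part of ML-Agents:

```
from mlagents.trainers.ghost.controller import GhostController


class StubTrainer:
    # Minimal stand-in exposing the two methods the controller calls for ELO.
    def get_opponent_elo(self) -> float:
        return 1200.0

    def change_opponent_elo(self, change: float) -> None:
        pass


controller = GhostController()
controller.subscribe_team_id(0, StubTrainer())
controller.subscribe_team_id(1, StubTrainer())
print(controller.get_learning_team)   # 0 -- the first subscribed team learns first
controller.change_training_team(step=100000)
print(controller.get_learning_team)   # 1 -- teams rotate through the queue
```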