diff --git a/config/trainer_config.yaml b/config/trainer_config.yaml index 313230fab2..b5fecb4ca1 100644 --- a/config/trainer_config.yaml +++ b/config/trainer_config.yaml @@ -137,9 +137,10 @@ Tennis: time_horizon: 1000 self_play: window: 10 - play_against_current_self_ratio: 0.5 + play_against_latest_model_ratio: 0.5 save_steps: 50000 swap_steps: 50000 + team_change: 100000 Soccer: normalize: false @@ -152,9 +153,10 @@ Soccer: num_layers: 2 self_play: window: 10 - play_against_current_self_ratio: 0.5 + play_against_latest_model_ratio: 0.5 save_steps: 50000 swap_steps: 50000 + team_change: 100000 CrawlerStatic: normalize: true diff --git a/docs/Learning-Environment-Examples.md b/docs/Learning-Environment-Examples.md index 6d911aba0d..a2f090e58c 100644 --- a/docs/Learning-Environment-Examples.md +++ b/docs/Learning-Environment-Examples.md @@ -349,7 +349,6 @@ return. * Goal: * Get the ball into the opponent's goal while preventing the ball from entering own goal. - * Goalie: * Agents: The environment contains four agents, with the same Behavior Parameters : Soccer. * Agent Reward Function (dependent): diff --git a/docs/Migrating.md b/docs/Migrating.md index 18f8fb2efe..fd9c9d84a1 100644 --- a/docs/Migrating.md +++ b/docs/Migrating.md @@ -12,6 +12,7 @@ The versions can be found in ### Important changes * The `--load` and `--train` command-line flags have been deprecated and replaced with `--resume` and `--inference`. * Running with the same `--run-id` twice will now throw an error. +* The `play_against_current_self_ratio` self-play trainer hyperparameter has been renamed to `play_against_latest_model_ratio` ### Steps to Migrate * Replace the `--load` flag with `--resume` when calling `mlagents-learn`, and don't use the `--train` flag as training diff --git a/docs/Training-Self-Play.md b/docs/Training-Self-Play.md index 0f40f42062..4b2076efde 100644 --- a/docs/Training-Self-Play.md +++ b/docs/Training-Self-Play.md @@ -1,13 +1,22 @@ # Training with Self-Play -ML-Agents provides the functionality to train symmetric, adversarial games with [Self-Play](https://openai.com/blog/competitive-self-play/). -A symmetric game is one in which opposing agents are *equal* in form and function. In reinforcement learning, -this means both agents have the same observation and action spaces. -With self-play, an agent learns in adversarial games by competing against fixed, past versions of itself -to provide a more stable, stationary learning environment. This is compared -to competing against its current self in every episode, which is a constantly changing opponent. +ML-Agents provides the functionality to train both symmetric and asymmetric adversarial games with +[Self-Play](https://openai.com/blog/competitive-self-play/). +A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games +are our Tennis and Soccer example environments. In reinforcement learning, this means both agents have the same observation and +action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games, +this is not the case. An example of an asymmetric games are Hide and Seek. Agents in these +types of games do not always have the same observation or action spaces and so sharing policy networks is not +necessarily ideal. 
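For reference, the `self_play` hierarchy added to the Tennis and Soccer entries in `config/trainer_config.yaml` earlier in this diff is consumed by the trainer as a plain dictionary under the `self_play` key. A minimal sketch of that structure (values copied from the Tennis entry above; the variable name `self_play_config` is illustrative only):

```python
# Sketch only: the self_play hierarchy as the GhostTrainer reads it from
# trainer_parameters["self_play"]. Values mirror the Tennis entry above.
self_play_config = {
    "window": 10,                            # number of past snapshots kept as opponents
    "play_against_latest_model_ratio": 0.5,  # renamed from play_against_current_self_ratio
    "save_steps": 50000,                     # trainer steps between policy snapshots
    "swap_steps": 50000,                     # steps between opponent snapshot swaps
    "team_change": 100000,                   # trainer steps before the learning team switches
}
```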
+ +With self-play, an agent learns in adversarial games by competing against fixed, past versions of its opponent +(which could be itself as in symmetric games) to provide a more stable, stationary learning environment. This is compared +to competing against the current, best opponent in every episode, which is constantly changing (because it's learning). Self-play can be used with our implementations of both [Proximal Policy Optimization (PPO)](Training-PPO.md) and [Soft Actor-Critc (SAC)](Training-SAC.md). +However, from the perspective of an individual agent, these scenarios appear to have non-stationary dynamics because the opponent is often changing. +This can cause significant issues in the experience replay mechanism used by SAC. Thus, we recommend that users use PPO. For further reading on +this issue in particular, see the paper [Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/1702.08887.pdf). For more general information on training with ML-Agents, see [Training ML-Agents](Training-ML-Agents.md). For more algorithm specific instruction, please see the documentation for [PPO](Training-PPO.md) or [SAC](Training-SAC.md). @@ -15,7 +24,17 @@ Self-play is triggered by including the self-play hyperparameter hierarchy in th ![Team ID](images/team_id.png) -See the trainer configuration and agent prefabs for our Tennis environment for an example. +***Team ID must be 0 or an integer greater than 0.*** + +In symmetric games, since all agents (even on opposing teams) will share the same policy, they should have the same 'Behavior Name' in their +Behavior Parameters Script. In asymmetric games, they should have a different Behavior Name in their Behavior Parameters script. +Note, in asymmetric games, the agents must have both different Behavior Names *and* different team IDs! Then, specify the trainer configuration +for each Behavior Name in your scene as you would normally, and remember to include the self-play hyperparameter hierarchy! + +For examples of how to use this feature, you can see the trainer configurations and agent prefabs for our Tennis and Soccer environments. +Tennis and Soccer provide examples of symmetric games. To train an asymmetric game, specify trainer configurations for each of your behavior names +and include the self-play hyperparameter hierarchy in both. + ## Best Practices Training with Self-Play @@ -24,7 +43,8 @@ issues faced by reinforcement learning. In general, the tradeoff is between the skill level and generality of the final policy and the stability of learning. Training against a set of slowly or unchanging adversaries with low diversity results in a more stable learning process than training against a set of quickly -changing adversaries with high diversity. With this context, this guide discusses the exposed self-play hyperparameters and intuitions for tuning them. +changing adversaries with high diversity. With this context, this guide discusses +the exposed self-play hyperparameters and intuitions for tuning them. ## Hyperparameters @@ -39,31 +59,70 @@ The reward signal should still be used as described in the documentation for the ### Save Steps -The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps`=10000 then a snapshot of the current policy will be saved every 10000 trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. 
+The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent. Recommended Range : 10000-100000 -### Swap Steps +### Team Change + +The `team_change` parameter corresponds to the number of *trainer_steps* between switching the learning team. +This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team +becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar +performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents +per team switch. + +A larger value of `team-change` will allow the agent to train longer against it's opponents. The longer an agent trains against the same set of opponents +the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies +and so the agent may fail against the next batch of opponents. -The `swap_steps` parameter corresponds to the number of *trainer steps* between swapping the opponents policy with a different snapshot. As in the `save_steps` discussion, note that trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. +The value of `team-change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we +recommend setting this value as a function of the `save_steps` parameter discussed previously. +Recommended Range : 4x-10x where x=`save_steps` + + +### Swap Steps + +The `swap_steps` parameter corresponds to the number of *ghost steps* (not trainer steps) between swapping the opponents policy with a different snapshot. +A 'ghost step' refers to a step taken by an agent *that is following a fixed policy and not learning*. The reason for this distinction is that in asymmetric games, +we may have teams with an unequal number of agents e.g. a 2v1 scenario. The team with two agents collects +twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number +of trainer steps corresponds to the same number of opponent swaps for each team. 
The formula for `swap_steps` if +a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents` +agents during `team-change` total steps is: + +``` +swap_steps = (num_agents / num_opponent_agents) * (team_change / x) +``` + +As an example, in a 2v1 scenario, if we want the swap to occur `x=4` times during `team-change=200000` steps, +the `swap_steps` for the team of one agent is: + +``` +swap_steps = (1 / 2) * (200000 / 4) = 25000 +``` +The `swap_steps` for the team of two agents is: +``` +swap_steps = (2 / 1) * (200000 / 4) = 100000 +``` +Note, with equal team sizes, the first term is equal to 1 and `swap_steps` can be calculated by just dividing the total steps by the desired number of swaps. A larger value of `swap_steps` means that an agent will play against the same fixed opponent for a longer number of training iterations. This results in a more stable training scenario, but leaves the agent open to the risk of overfitting it's behavior for this particular opponent. Thus, when a new opponent is swapped, the agent may lose more often than expected. Recommended Range : 10000-100000 -### Play against current self ratio +### Play against latest model ratio -The `play_against_current_self_ratio` parameter corresponds to the probability -an agent will play against its ***current*** self. With probability -1 - `play_against_current_self_ratio`, the agent will play against a snapshot of itself -from a past iteration. +The `play_against_latest_model_ratio` parameter corresponds to the probability +an agent will play against the latest opponent policy. With probability +1 - `play_against_latest_model_ratio`, the agent will play against a snapshot of its +opponent from a past iteration. -A larger value of `play_against_current_self_ratio` indicates that an agent will be playing against itself more often. Since the agent is updating it's policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curricula](https://openai.com/blog/emergent-tool-use/) of more increasingly challenging situations which may lead to a stronger final policy. +A larger value of `play_against_latest_model_ratio` indicates that an agent will be playing against the current opponent more often. Since the agent is updating it's policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curricula](https://openai.com/blog/emergent-tool-use/) of more increasingly challenging situations which may lead to a stronger final policy. -Recommended Range : 0.0 - 1.0 +Range : 0.0 - 1.0 ### Window @@ -83,5 +142,8 @@ using TensorBoard, see In adversarial games, the cumulative environment reward may not be a meaningful metric by which to track learning progress. This is because cumulative reward is entirely dependent on the skill of the opponent. An agent at a particular skill level will get more or less reward against a worse or better agent, respectively. We provide an implementation of the ELO rating system, a method for calculating the relative skill level between two players from a given population in a zero-sum game. For more information on ELO, please see [the ELO wiki](https://en.wikipedia.org/wiki/Elo_rating_system). - In a proper training run, the ELO of the agent should steadily increase. The absolute value of the ELO is less important than the change in ELO over training iterations. 
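For those curious how the rating is updated, the calculation added in `GhostController.compute_elo_rating_changes` later in this diff reduces to the standard expected-score form with no K-factor scaling. A minimal sketch (the function name here is illustrative, not part of the API):

```python
def elo_change(rating: float, opponent_rating: float, result: float) -> float:
    """Expected-score ELO update, mirroring the GhostController calculation.

    result is 1.0 for a win, 0.5 for a draw, 0.0 for a loss, inferred from the
    sign of the final reward of the episode.
    """
    r1 = 10 ** (rating / 400)
    r2 = 10 ** (opponent_rating / 400)
    expected = r1 / (r1 + r2)
    return result - expected


# Example: a 1200-rated learner beating a 1200-rated snapshot gains 0.5 ELO,
# and the opponent's snapshot rating is decreased by the same amount.
print(elo_change(1200.0, 1200.0, 1.0))  # 0.5
```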
+ +Note, this implementation will support any number of teams but ELO is only applicable to games with two teams. It is ongoing work to implement +a reliable metric for measuring progress in scenarios with three or more teams. These scenarios can still train, though as of now, reward and qualitative observations +are the only metric by which we can judge performance. diff --git a/ml-agents/mlagents/trainers/behavior_id_utils.py b/ml-agents/mlagents/trainers/behavior_id_utils.py index f10030d234..76022bdf86 100644 --- a/ml-agents/mlagents/trainers/behavior_id_utils.py +++ b/ml-agents/mlagents/trainers/behavior_id_utils.py @@ -1,36 +1,46 @@ -from typing import Dict, NamedTuple +from typing import NamedTuple +from urllib.parse import urlparse, parse_qs class BehaviorIdentifiers(NamedTuple): - name_behavior_id: str + """ + BehaviorIdentifiers is a named tuple of the identifiers that uniquely distinguish + an agent encountered in the trainer_controller. The named tuple consists of the + fully qualified behavior name, the name of the brain name (corresponds to trainer + in the trainer controller) and the team id. In the future, this can be extended + to support further identifiers. + """ + + behavior_id: str brain_name: str - behavior_ids: Dict[str, int] + team_id: int @staticmethod def from_name_behavior_id(name_behavior_id: str) -> "BehaviorIdentifiers": """ - Parses a name_behavior_id of the form name?team=0¶m1=i&... + Parses a name_behavior_id of the form name?team=0 into a BehaviorIdentifiers NamedTuple. - This allows you to access the brain name and distinguishing identifiers - without parsing more than once. + This allows you to access the brain name and team id of an agent :param name_behavior_id: String of behavior params in HTTP format. :returns: A BehaviorIdentifiers object. """ - ids: Dict[str, int] = {} - if "?" in name_behavior_id: - name, identifiers = name_behavior_id.rsplit("?", 1) - if "&" in identifiers: - list_of_identifiers = identifiers.split("&") - else: - list_of_identifiers = [identifiers] - - for identifier in list_of_identifiers: - key, value = identifier.split("=") - ids[key] = int(value) - else: - name = name_behavior_id - + parsed = urlparse(name_behavior_id) + name = parsed.path + ids = parse_qs(parsed.query) + team_id: int = 0 + if "team" in ids: + team_id = int(ids["team"][0]) return BehaviorIdentifiers( - name_behavior_id=name_behavior_id, brain_name=name, behavior_ids=ids + behavior_id=name_behavior_id, brain_name=name, team_id=team_id ) + + +def create_name_behavior_id(name: str, team_id: int) -> str: + """ + Reconstructs fully qualified behavior name from name and team_id + :param name: brain name + :param team_id: team ID + :return: name_behavior_id + """ + return name + "?team=" + str(team_id) diff --git a/ml-agents/mlagents/trainers/ghost/controller.py b/ml-agents/mlagents/trainers/ghost/controller.py new file mode 100644 index 0000000000..d152b9f0c2 --- /dev/null +++ b/ml-agents/mlagents/trainers/ghost/controller.py @@ -0,0 +1,92 @@ +from mlagents_envs.logging_util import get_logger +from typing import Deque, Dict +from collections import deque +from mlagents.trainers.ghost.trainer import GhostTrainer + +logger = get_logger(__name__) + + +class GhostController: + """ + GhostController contains a queue of team ids. GhostTrainers subscribe to the GhostController and query + it to get the current learning team. The GhostController cycles through team ids every 'swap_interval' + which corresponds to the number of trainer steps between changing learning teams. 
+ The GhostController is a unique object and there can only be one per training run. + """ + + def __init__(self, maxlen: int = 10): + """ + Create a GhostController. + :param maxlen: Maximum number of GhostTrainers allowed in this GhostController + """ + + # Tracks last swap step for each learning team because trainer + # steps of all GhostTrainers do not increment together + self._queue: Deque[int] = deque(maxlen=maxlen) + self._learning_team: int = -1 + # Dict from team id to GhostTrainer for ELO calculation + self._ghost_trainers: Dict[int, GhostTrainer] = {} + + @property + def get_learning_team(self) -> int: + """ + Returns the current learning team. + :return: The learning team id + """ + return self._learning_team + + def subscribe_team_id(self, team_id: int, trainer: GhostTrainer) -> None: + """ + Given a team_id and trainer, add to queue and trainers if not already. + The GhostTrainer is used later by the controller to get ELO ratings of agents. + :param team_id: The team_id of an agent managed by this GhostTrainer + :param trainer: A GhostTrainer that manages this team_id. + """ + if team_id not in self._ghost_trainers: + self._ghost_trainers[team_id] = trainer + if self._learning_team < 0: + self._learning_team = team_id + else: + self._queue.append(team_id) + + def change_training_team(self, step: int) -> None: + """ + The current learning team is added to the end of the queue and then updated with the + next in line. + :param step: The step of the trainer for debugging + """ + self._queue.append(self._learning_team) + self._learning_team = self._queue.popleft() + logger.debug( + "Learning team {} swapped on step {}".format(self._learning_team, step) + ) + + # Adapted from https://github.com/Unity-Technologies/ml-agents/pull/1975 and + # https://metinmediamath.wordpress.com/2013/11/27/how-to-calculate-the-elo-rating-including-example/ + # ELO calculation + # TODO : Generalize this to more than two teams + def compute_elo_rating_changes(self, rating: float, result: float) -> float: + """ + Calculates ELO. Given the rating of the learning team and result. The GhostController + queries the other GhostTrainers for the ELO of their agent that is currently being deployed. + Note, this could be the current agent or a past snapshot. + :param rating: Rating of the learning team. + :param result: Win, loss, or draw from the perspective of the learning team. + :return: The change in ELO. 
+ """ + opponent_rating: float = 0.0 + for team_id, trainer in self._ghost_trainers.items(): + if team_id != self._learning_team: + opponent_rating = trainer.get_opponent_elo() + r1 = pow(10, rating / 400) + r2 = pow(10, opponent_rating / 400) + + summed = r1 + r2 + e1 = r1 / summed + + change = result - e1 + for team_id, trainer in self._ghost_trainers.items(): + if team_id != self._learning_team: + trainer.change_opponent_elo(change) + + return change diff --git a/ml-agents/mlagents/trainers/ghost/trainer.py b/ml-agents/mlagents/trainers/ghost/trainer.py index 9042f693c1..0dfd287bf3 100644 --- a/ml-agents/mlagents/trainers/ghost/trainer.py +++ b/ml-agents/mlagents/trainers/ghost/trainer.py @@ -1,7 +1,7 @@ # # Unity ML-Agents Toolkit # ## ML-Agent Learning (Ghost Trainer) -from typing import Deque, Dict, List, Any, cast +from typing import Deque, Dict, List, cast import numpy as np @@ -14,20 +14,44 @@ from mlagents.trainers.trajectory import Trajectory from mlagents.trainers.agent_processor import AgentManagerQueue from mlagents.trainers.stats import StatsPropertyType -from mlagents.trainers.behavior_id_utils import BehaviorIdentifiers +from mlagents.trainers.behavior_id_utils import ( + BehaviorIdentifiers, + create_name_behavior_id, +) logger = get_logger(__name__) class GhostTrainer(Trainer): + """ + The GhostTrainer trains agents in adversarial games (there are teams in opposition) using a self-play mechanism. + In adversarial settings with self-play, at any time, there is only a single learning team. The other team(s) is + "ghosted" which means that its agents are executing fixed policies and not learning. The GhostTrainer wraps + a standard RL trainer which trains the learning team and ensures that only the trajectories collected + by the learning team are used for training. The GhostTrainer also maintains past policy snapshots to be used + as the fixed policies when the team is not learning. The GhostTrainer is 1:1 with brain_names as the other + trainers, and is responsible for one or more teams. Note, a GhostTrainer can have only one team in + asymmetric games where there is only one team with a particular behavior i.e. Hide and Seek. + The GhostController manages high level coordination between multiple ghost trainers. The learning team id + is cycled throughout a training run. + """ + def __init__( - self, trainer, brain_name, reward_buff_cap, trainer_parameters, training, run_id + self, + trainer, + brain_name, + controller, + reward_buff_cap, + trainer_parameters, + training, + run_id, ): """ - Responsible for collecting experiences and training trainer model via self_play. + Creates a GhostTrainer. :param trainer: The trainer of the policy/policies being trained with self_play :param brain_name: The name of the brain associated with trainer config + :param controller: GhostController that coordinates all ghost trainers and calculates ELO :param reward_buff_cap: Max reward history to track in the reward buffer :param trainer_parameters: The parameters for the trainer (dictionary). :param training: Whether the trainer is set for training. 
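To make the new constructor signature concrete, here is a condensed sketch of how a wrapped trainer is assembled, mirroring the construction in `trainer_util.py` and `test_ghost.py` later in this diff. It assumes `trainer_config` is an already-loaded trainer configuration dict (with `summary_path`, `model_path`, and a `self_play` section set):

```python
from mlagents.trainers.ghost.controller import GhostController
from mlagents.trainers.ghost.trainer import GhostTrainer
from mlagents.trainers.ppo.trainer import PPOTrainer

# One controller per training run, shared by every GhostTrainer it coordinates.
controller = GhostController()

# trainer_config is assumed to be a loaded PPO trainer-config dict with a "self_play" section.
wrapped = PPOTrainer("Tennis", 0, trainer_config, True, False, 0, "0")
trainer = GhostTrainer(
    wrapped,         # trainer that actually optimizes the learning team's policy
    "Tennis",        # brain_name
    controller,      # coordinates the learning team and the ELO bookkeeping
    0,               # reward_buff_cap
    trainer_config,  # trainer_parameters
    True,            # training
    "0",             # run_id
)
```

Policies for each team are then registered through `add_policy(parsed_behavior_id, policy)`, which subscribes the policy's team id to the shared controller.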
@@ -39,11 +63,16 @@ def __init__( ) self.trainer = trainer + self.controller = controller + + self._internal_trajectory_queues: Dict[str, AgentManagerQueue[Trajectory]] = {} + self._internal_policy_queues: Dict[str, AgentManagerQueue[Policy]] = {} + + self._team_to_name_to_policy_queue: Dict[ + int, Dict[str, AgentManagerQueue[Policy]] + ] = {} - self.internal_policy_queues: List[AgentManagerQueue[Policy]] = [] - self.internal_trajectory_queues: List[AgentManagerQueue[Trajectory]] = [] - self.ignored_trajectory_queues: List[AgentManagerQueue[Trajectory]] = [] - self.learning_policy_queues: Dict[str, AgentManagerQueue[Policy]] = {} + self._name_to_parsed_behavior_id: Dict[str, BehaviorIdentifiers] = {} # assign ghost's stats collection to wrapped trainer's self._stats_reporter = self.trainer.stats_reporter @@ -52,23 +81,56 @@ def __init__( self_play_parameters = trainer_parameters["self_play"] self.window = self_play_parameters.get("window", 10) - self.play_against_current_self_ratio = self_play_parameters.get( - "play_against_current_self_ratio", 0.5 + self.play_against_latest_model_ratio = self_play_parameters.get( + "play_against_latest_model_ratio", 0.5 ) + if ( + self.play_against_latest_model_ratio > 1.0 + or self.play_against_latest_model_ratio < 0.0 + ): + logger.warning( + "The play_against_latest_model_ratio is not between 0 and 1." + ) + self.steps_between_save = self_play_parameters.get("save_steps", 20000) self.steps_between_swap = self_play_parameters.get("swap_steps", 20000) + self.steps_to_train_team = self_play_parameters.get("team_change", 100000) + if self.steps_to_train_team > self.get_max_steps: + logger.warning( + "The max steps of the GhostTrainer for behavior name {} is less than team change. This team will not face \ + opposition that has been trained if the opposition is managed by a different GhostTrainer as in an \ + asymmetric game.".format( + self.brain_name + ) + ) + + # Counts the The number of steps of the ghost policies. Snapshot swapping + # depends on this counter whereas snapshot saving and team switching depends + # on the wrapped. This ensures that all teams train for the same number of trainer + # steps. + self.ghost_step: int = 0 + + # A list of dicts from brain name to a single snapshot for this trainer's policies + self.policy_snapshots: List[Dict[str, List[float]]] = [] + + # A dict from brain name to the current snapshot of this trainer's policies + self.current_policy_snapshot: Dict[str, List[float]] = {} - self.policies: Dict[str, TFPolicy] = {} - self.policy_snapshots: List[Any] = [] self.snapshot_counter: int = 0 - self.learning_behavior_name: str = None - self.current_policy_snapshot = None - self.last_save = 0 - self.last_swap = 0 + self.policies: Dict[str, TFPolicy] = {} + + # wrapped_training_team and learning team need to be separate + # in the situation where new agents are created destroyed + # after learning team switches. These agents need to be added + # to trainers properly. 
+ self._learning_team: int = None + self.wrapped_trainer_team: int = None + self.last_save: int = 0 + self.last_swap: int = 0 + self.last_team_change: int = 0 # Chosen because it is the initial ELO in Chess self.initial_elo: float = self_play_parameters.get("initial_elo", 1200.0) - self.current_elo: float = self.initial_elo self.policy_elos: List[float] = [self.initial_elo] * ( self.window + 1 ) # for learning policy @@ -77,8 +139,8 @@ def __init__( @property def get_step(self) -> int: """ - Returns the number of steps the trainer has performed - :return: the step count of the trainer + Returns the number of steps the wrapped trainer has performed + :return: the step count of the wrapped trainer """ return self.trainer.get_step @@ -92,9 +154,45 @@ def reward_buffer(self) -> Deque[float]: """ return self.trainer.reward_buffer + @property + def current_elo(self) -> float: + """ + Gets ELO of current policy which is always last in the list + :return: ELO of current policy + """ + return self.policy_elos[-1] + + def change_current_elo(self, change: float) -> None: + """ + Changes elo of current policy which is always last in the list + :param change: Amount to change current elo by + """ + self.policy_elos[-1] += change + + def get_opponent_elo(self) -> float: + """ + Get elo of current opponent policy + :return: ELO of current opponent policy + """ + return self.policy_elos[self.current_opponent] + + def change_opponent_elo(self, change: float) -> None: + """ + Changes elo of current opponent policy + :param change: Amount to change current opponent elo by + """ + self.policy_elos[self.current_opponent] -= change + def _process_trajectory(self, trajectory: Trajectory) -> None: - if trajectory.done_reached and not trajectory.max_step_reached: - # Assumption is that final reward is 1/.5/0 for win/draw/loss + """ + Determines the final result of an episode and asks the GhostController + to calculate the ELO change. The GhostController changes the ELO + of the opponent policy since this may be in a different GhostTrainer + i.e. in asymmetric games. We assume the last reward determines the winner. + :param trajectory: Trajectory. + """ + if trajectory.done_reached: + # Assumption is that final reward is >0/0/<0 for win/draw/loss final_reward = trajectory.steps[-1].reward result = 0.5 if final_reward > 0: @@ -102,188 +200,262 @@ def _process_trajectory(self, trajectory: Trajectory) -> None: elif final_reward < 0: result = 0.0 - change = compute_elo_rating_changes( - self.current_elo, self.policy_elos[self.current_opponent], result + change = self.controller.compute_elo_rating_changes( + self.current_elo, result ) - self.current_elo += change - self.policy_elos[self.current_opponent] -= change - opponents = np.array(self.policy_elos, dtype=np.float32) + self.change_current_elo(change) self._stats_reporter.add_stat("Self-play/ELO", self.current_elo) - self._stats_reporter.add_stat( - "Self-play/Mean Opponent ELO", opponents.mean() - ) - self._stats_reporter.add_stat("Self-play/Std Opponent ELO", opponents.std()) def advance(self) -> None: """ Steps the trainer, passing trajectories to wrapped trainer and calling trainer advance """ - for traj_queue, internal_traj_queue in zip( - self.trajectory_queues, self.internal_trajectory_queues - ): - try: - # We grab at most the maximum length of the queue. - # This ensures that even if the queue is being filled faster than it is - # being emptied, the trajectories in the queue are on-policy. 
- for _ in range(traj_queue.maxlen): - t = traj_queue.get_nowait() - # adds to wrapped trainers queue - internal_traj_queue.put(t) - self._process_trajectory(t) - except AgentManagerQueue.Empty: - pass + for trajectory_queue in self.trajectory_queues: + parsed_behavior_id = self._name_to_parsed_behavior_id[ + trajectory_queue.behavior_id + ] + if parsed_behavior_id.team_id == self._learning_team: + # With a future multiagent trainer, this will be indexed by 'role' + internal_trajectory_queue = self._internal_trajectory_queues[ + parsed_behavior_id.brain_name + ] + try: + # We grab at most the maximum length of the queue. + # This ensures that even if the queue is being filled faster than it is + # being emptied, the trajectories in the queue are on-policy. + for _ in range(trajectory_queue.maxlen): + t = trajectory_queue.get_nowait() + # adds to wrapped trainers queue + internal_trajectory_queue.put(t) + self._process_trajectory(t) + except AgentManagerQueue.Empty: + pass + else: + # Dump trajectories from non-learning policy + try: + for _ in range(trajectory_queue.maxlen): + t = trajectory_queue.get_nowait() + # count ghost steps + self.ghost_step += len(t.steps) + except AgentManagerQueue.Empty: + pass self.next_summary_step = self.trainer.next_summary_step - self.trainer.advance() - - for internal_q in self.internal_policy_queues: - # Get policies that correspond to the policy queue in question + if self.get_step - self.last_team_change > self.steps_to_train_team: + self.controller.change_training_team(self.get_step) + self.last_team_change = self.get_step + + next_learning_team = self.controller.get_learning_team + + # CASE 1: Current learning team is managed by this GhostTrainer. + # If the learning team changes, the following loop over queues will push the + # new policy into the policy queue for the new learning agent if + # that policy is managed by this GhostTrainer. Otherwise, it will save the current snapshot. + # CASE 2: Current learning team is managed by a different GhostTrainer. + # If the learning team changes to a team managed by this GhostTrainer, this loop + # will push the current_snapshot into the correct queue. Otherwise, + # it will continue skipping and swap_snapshot will continue to handle + # pushing fixed snapshots + # Case 3: No team change. The if statement just continues to push the policy + # into the correct queue (or not if not learning team). + for brain_name in self._internal_policy_queues: + internal_policy_queue = self._internal_policy_queues[brain_name] try: - policy = cast(TFPolicy, internal_q.get_nowait()) - self.current_policy_snapshot = policy.get_weights() - self.learning_policy_queues[internal_q.behavior_id].put(policy) + policy = cast(TFPolicy, internal_policy_queue.get_nowait()) + self.current_policy_snapshot[brain_name] = policy.get_weights() except AgentManagerQueue.Empty: pass - + if next_learning_team in self._team_to_name_to_policy_queue: + name_to_policy_queue = self._team_to_name_to_policy_queue[ + next_learning_team + ] + if brain_name in name_to_policy_queue: + behavior_id = create_name_behavior_id( + brain_name, next_learning_team + ) + policy = self.get_policy(behavior_id) + policy.load_weights(self.current_policy_snapshot[brain_name]) + name_to_policy_queue[brain_name].put(policy) + + # Note save and swap should be on different step counters. + # We don't want to save unless the policy is learning. 
if self.get_step - self.last_save > self.steps_between_save: - self._save_snapshot(self.trainer.policy) + self._save_snapshot() self.last_save = self.get_step - if self.get_step - self.last_swap > self.steps_between_swap: + if ( + self._learning_team != next_learning_team + or self.ghost_step - self.last_swap > self.steps_between_swap + ): + self._learning_team = next_learning_team self._swap_snapshots() - self.last_swap = self.get_step - - # Dump trajectories from non-learning policy - for traj_queue in self.ignored_trajectory_queues: - try: - for _ in range(traj_queue.maxlen): - traj_queue.get_nowait() - except AgentManagerQueue.Empty: - pass + self.last_swap = self.ghost_step def end_episode(self): + """ + Forwarding call to wrapped trainers end_episode + """ self.trainer.end_episode() def save_model(self, name_behavior_id: str) -> None: + """ + Forwarding call to wrapped trainers save_model + """ self.trainer.save_model(name_behavior_id) def export_model(self, name_behavior_id: str) -> None: - self.trainer.export_model(name_behavior_id) + """ + Forwarding call to wrapped trainers export_model. + First loads the current snapshot. + """ + parsed_behavior_id = self._name_to_parsed_behavior_id[name_behavior_id] + brain_name = parsed_behavior_id.brain_name + policy = self.trainer.get_policy(brain_name) + policy.load_weights(self.current_policy_snapshot[brain_name]) + self.trainer.export_model(brain_name) def create_policy(self, brain_parameters: BrainParameters) -> TFPolicy: + """ + Creates policy with the wrapped trainer's create_policy function + """ return self.trainer.create_policy(brain_parameters) - def add_policy(self, name_behavior_id: str, policy: TFPolicy) -> None: + def add_policy( + self, parsed_behavior_id: BehaviorIdentifiers, policy: TFPolicy + ) -> None: """ - Adds policy to trainer. For the first policy added, add a trainer - to the policy and set the learning behavior name to name_behavior_id. + Adds policy to trainer. The first policy encountered sets the wrapped + trainer team. This is to ensure that all agents from the same multi-agent + team are grouped. All policies associated with this team are added to the + wrapped trainer to be trained. :param name_behavior_id: Behavior ID that the policy should belong to. :param policy: Policy to associate with name_behavior_id. 
""" + name_behavior_id = parsed_behavior_id.behavior_id + team_id = parsed_behavior_id.team_id + self.controller.subscribe_team_id(team_id, self) self.policies[name_behavior_id] = policy policy.create_tf_graph() - # First policy encountered - if not self.learning_behavior_name: - weights = policy.get_weights() - self.current_policy_snapshot = weights - self.trainer.add_policy(name_behavior_id, policy) - self._save_snapshot(policy) # Need to save after trainer initializes policy - self.learning_behavior_name = name_behavior_id - behavior_id_parsed = BehaviorIdentifiers.from_name_behavior_id( - self.learning_behavior_name - ) - team_id = behavior_id_parsed.behavior_ids["team"] - self._stats_reporter.add_property(StatsPropertyType.SELF_PLAY_TEAM, team_id) - else: - # for saving/swapping snapshots - policy.init_load_weights() + self._name_to_parsed_behavior_id[name_behavior_id] = parsed_behavior_id + # for saving/swapping snapshots + policy.init_load_weights() + + # First policy or a new agent on the same team encountered + if self.wrapped_trainer_team is None or team_id == self.wrapped_trainer_team: + self.current_policy_snapshot[ + parsed_behavior_id.brain_name + ] = policy.get_weights() + + self._save_snapshot() # Need to save after trainer initializes policy + self.trainer.add_policy(parsed_behavior_id, policy) + self._learning_team = self.controller.get_learning_team + self.wrapped_trainer_team = team_id def get_policy(self, name_behavior_id: str) -> TFPolicy: + """ + Gets policy associated with name_behavior_id + :param name_behavior_id: Fully qualified behavior name + :return: Policy associated with name_behavior_id + """ return self.policies[name_behavior_id] - def _save_snapshot(self, policy: TFPolicy) -> None: - weights = policy.get_weights() - try: - self.policy_snapshots[self.snapshot_counter] = weights - except IndexError: - self.policy_snapshots.append(weights) + def _save_snapshot(self) -> None: + """ + Saves a snapshot of the current weights of the policy and maintains the policy_snapshots + according to the window size + """ + for brain_name in self.current_policy_snapshot: + current_snapshot_for_brain_name = self.current_policy_snapshot[brain_name] + + try: + self.policy_snapshots[self.snapshot_counter][ + brain_name + ] = current_snapshot_for_brain_name + except IndexError: + self.policy_snapshots.append( + {brain_name: current_snapshot_for_brain_name} + ) self.policy_elos[self.snapshot_counter] = self.current_elo self.snapshot_counter = (self.snapshot_counter + 1) % self.window def _swap_snapshots(self) -> None: - for q in self.policy_queues: - name_behavior_id = q.behavior_id - # here is the place for a sampling protocol - if name_behavior_id == self.learning_behavior_name: + """ + Swaps the appropriate weight to the policy and pushes it to respective policy queues + """ + + for team_id in self._team_to_name_to_policy_queue: + if team_id == self._learning_team: continue - elif np.random.uniform() < (1 - self.play_against_current_self_ratio): + elif np.random.uniform() < (1 - self.play_against_latest_model_ratio): x = np.random.randint(len(self.policy_snapshots)) snapshot = self.policy_snapshots[x] else: snapshot = self.current_policy_snapshot x = "current" - self.policy_elos[-1] = self.current_elo + self.current_opponent = -1 if x == "current" else x - logger.debug( - "Step {}: Swapping snapshot {} to id {} with {} learning".format( - self.get_step, x, name_behavior_id, self.learning_behavior_name + name_to_policy_queue = self._team_to_name_to_policy_queue[team_id] + for 
brain_name in self._team_to_name_to_policy_queue[team_id]: + behavior_id = create_name_behavior_id(brain_name, team_id) + policy = self.get_policy(behavior_id) + policy.load_weights(snapshot[brain_name]) + name_to_policy_queue[brain_name].put(policy) + logger.debug( + "Step {}: Swapping snapshot {} to id {} with team {} learning".format( + self.ghost_step, x, behavior_id, self._learning_team + ) ) - ) - policy = self.get_policy(name_behavior_id) - policy.load_weights(snapshot) - q.put(policy) def publish_policy_queue(self, policy_queue: AgentManagerQueue[Policy]) -> None: """ - Adds a policy queue to the list of queues to publish to when this Trainer - makes a policy update + Adds a policy queue for every member of the team to the list of queues to publish to when this Trainer + makes a policy update. Creates an internal policy queue for the wrapped + trainer to push to. The GhostTrainer pushes all policies to the env. :param queue: Policy queue to publish to. """ super().publish_policy_queue(policy_queue) - if policy_queue.behavior_id == self.learning_behavior_name: - + parsed_behavior_id = self._name_to_parsed_behavior_id[policy_queue.behavior_id] + try: + self._team_to_name_to_policy_queue[parsed_behavior_id.team_id][ + parsed_behavior_id.brain_name + ] = policy_queue + except KeyError: + self._team_to_name_to_policy_queue[parsed_behavior_id.team_id] = { + parsed_behavior_id.brain_name: policy_queue + } + if parsed_behavior_id.team_id == self.wrapped_trainer_team: + # With a future multiagent trainer, this will be indexed by 'role' internal_policy_queue: AgentManagerQueue[Policy] = AgentManagerQueue( - policy_queue.behavior_id + parsed_behavior_id.brain_name ) - self.internal_policy_queues.append(internal_policy_queue) - self.learning_policy_queues[policy_queue.behavior_id] = policy_queue + self._internal_policy_queues[ + parsed_behavior_id.brain_name + ] = internal_policy_queue self.trainer.publish_policy_queue(internal_policy_queue) def subscribe_trajectory_queue( self, trajectory_queue: AgentManagerQueue[Trajectory] ) -> None: """ - Adds a trajectory queue to the list of queues for the trainer to ingest Trajectories from. + Adds a trajectory queue for every member of the team to the list of queues for the trainer + to ingest Trajectories from. Creates an internal trajectory queue to push trajectories from + the learning team. The wrapped trainer subscribes to this queue. :param queue: Trajectory queue to publish to. 
""" - - if trajectory_queue.behavior_id == self.learning_behavior_name: - super().subscribe_trajectory_queue(trajectory_queue) - + super().subscribe_trajectory_queue(trajectory_queue) + parsed_behavior_id = self._name_to_parsed_behavior_id[ + trajectory_queue.behavior_id + ] + if parsed_behavior_id.team_id == self.wrapped_trainer_team: + # With a future multiagent trainer, this will be indexed by 'role' internal_trajectory_queue: AgentManagerQueue[ Trajectory - ] = AgentManagerQueue(trajectory_queue.behavior_id) + ] = AgentManagerQueue(parsed_behavior_id.brain_name) - self.internal_trajectory_queues.append(internal_trajectory_queue) + self._internal_trajectory_queues[ + parsed_behavior_id.brain_name + ] = internal_trajectory_queue self.trainer.subscribe_trajectory_queue(internal_trajectory_queue) - else: - self.ignored_trajectory_queues.append(trajectory_queue) - - -# Taken from https://github.com/Unity-Technologies/ml-agents/pull/1975 and -# https://metinmediamath.wordpress.com/2013/11/27/how-to-calculate-the-elo-rating-including-example/ -# ELO calculation - - -def compute_elo_rating_changes(rating1: float, rating2: float, result: float) -> float: - r1 = pow(10, rating1 / 400) - r2 = pow(10, rating2 / 400) - - summed = r1 + r2 - e1 = r1 / summed - - change = result - e1 - return change diff --git a/ml-agents/mlagents/trainers/policy/tf_policy.py b/ml-agents/mlagents/trainers/policy/tf_policy.py index 828b52ff83..16c2549292 100644 --- a/ml-agents/mlagents/trainers/policy/tf_policy.py +++ b/ml-agents/mlagents/trainers/policy/tf_policy.py @@ -145,6 +145,10 @@ def init_load_weights(self): self.assign_ops.append(tf.assign(var, assign_ph)) def load_weights(self, values): + if len(self.assign_ops) == 0: + logger.warning( + "Calling load_weights in tf_policy but assign_ops is empty. Did you forget to call init_load_weights?" + ) with self.graph.as_default(): feed_dict = {} for assign_ph, value in zip(self.assign_phs, values): diff --git a/ml-agents/mlagents/trainers/ppo/trainer.py b/ml-agents/mlagents/trainers/ppo/trainer.py index b1e270186b..6464c9173f 100644 --- a/ml-agents/mlagents/trainers/ppo/trainer.py +++ b/ml-agents/mlagents/trainers/ppo/trainer.py @@ -14,6 +14,7 @@ from mlagents.trainers.ppo.optimizer import PPOOptimizer from mlagents.trainers.trajectory import Trajectory from mlagents.trainers.exception import UnityTrainerException +from mlagents.trainers.behavior_id_utils import BehaviorIdentifiers logger = get_logger(__name__) @@ -237,10 +238,12 @@ def create_policy(self, brain_parameters: BrainParameters) -> TFPolicy: return policy - def add_policy(self, name_behavior_id: str, policy: TFPolicy) -> None: + def add_policy( + self, parsed_behavior_id: BehaviorIdentifiers, policy: TFPolicy + ) -> None: """ Adds policy to trainer. - :param name_behavior_id: Behavior ID that the policy should belong to. + :param parsed_behavior_id: Behavior identifiers that the policy should belong to. :param policy: Policy to associate with name_behavior_id. 
""" if self.policy: diff --git a/ml-agents/mlagents/trainers/sac/trainer.py b/ml-agents/mlagents/trainers/sac/trainer.py index d9121f2db5..6550ac5049 100644 --- a/ml-agents/mlagents/trainers/sac/trainer.py +++ b/ml-agents/mlagents/trainers/sac/trainer.py @@ -18,6 +18,7 @@ from mlagents.trainers.trajectory import Trajectory, SplitObservations from mlagents.trainers.brain import BrainParameters from mlagents.trainers.exception import UnityTrainerException +from mlagents.trainers.behavior_id_utils import BehaviorIdentifiers logger = get_logger(__name__) @@ -337,7 +338,9 @@ def update_reward_signals(self) -> None: for stat, stat_list in batch_update_stats.items(): self._stats_reporter.add_stat(stat, np.mean(stat_list)) - def add_policy(self, name_behavior_id: str, policy: TFPolicy) -> None: + def add_policy( + self, parsed_behavior_id: BehaviorIdentifiers, policy: TFPolicy + ) -> None: """ Adds policy to trainer. :param brain_parameters: specifications for policy construction diff --git a/ml-agents/mlagents/trainers/stats.py b/ml-agents/mlagents/trainers/stats.py index ad5bf4c3e1..09d66e3861 100644 --- a/ml-agents/mlagents/trainers/stats.py +++ b/ml-agents/mlagents/trainers/stats.py @@ -28,7 +28,6 @@ def empty() -> "StatsSummary": class StatsPropertyType(Enum): HYPERPARAMETERS = "hyperparameters" SELF_PLAY = "selfplay" - SELF_PLAY_TEAM = "selfplayteam" class StatsWriter(abc.ABC): @@ -114,19 +113,7 @@ def write_stats( ) if self.self_play and "Self-play/ELO" in values: elo_stats = values["Self-play/ELO"] - mean_opponent_elo = values["Self-play/Mean Opponent ELO"] - std_opponent_elo = values["Self-play/Std Opponent ELO"] - logger.info( - "{} Team {}: ELO: {:0.3f}. " - "Mean Opponent ELO: {:0.3f}. " - "Std Opponent ELO: {:0.3f}. ".format( - category, - self.self_play_team, - elo_stats.mean, - mean_opponent_elo.mean, - std_opponent_elo.mean, - ) - ) + logger.info("{} ELO: {:0.3f}. ".format(category, elo_stats.mean)) else: logger.info( "{}: Step: {}. No episode was completed since last summary. 
{}".format( @@ -146,9 +133,6 @@ def add_property( elif property_type == StatsPropertyType.SELF_PLAY: assert isinstance(value, bool) self.self_play = value - elif property_type == StatsPropertyType.SELF_PLAY_TEAM: - assert isinstance(value, int) - self.self_play_team = value def _dict_to_str(self, param_dict: Dict[str, Any], num_tabs: int) -> str: """ diff --git a/ml-agents/mlagents/trainers/tests/test_ghost.py b/ml-agents/mlagents/trainers/tests/test_ghost.py index 53ba30769d..d0a7bdd397 100644 --- a/ml-agents/mlagents/trainers/tests/test_ghost.py +++ b/ml-agents/mlagents/trainers/tests/test_ghost.py @@ -5,10 +5,11 @@ import yaml from mlagents.trainers.ghost.trainer import GhostTrainer +from mlagents.trainers.ghost.controller import GhostController +from mlagents.trainers.behavior_id_utils import BehaviorIdentifiers from mlagents.trainers.ppo.trainer import PPOTrainer from mlagents.trainers.brain import BrainParameters from mlagents.trainers.agent_processor import AgentManagerQueue -from mlagents.trainers.behavior_id_utils import BehaviorIdentifiers from mlagents.trainers.tests import mock_brain as mb from mlagents.trainers.tests.test_trajectory import make_fake_trajectory @@ -119,17 +120,26 @@ def test_process_trajectory(dummy_config): dummy_config["summary_path"] = "./summaries/test_trainer_summary" dummy_config["model_path"] = "./models/test_trainer_models/TestModel" ppo_trainer = PPOTrainer(brain_name, 0, dummy_config, True, False, 0, "0") - trainer = GhostTrainer(ppo_trainer, brain_name, 0, dummy_config, True, "0") + controller = GhostController(100) + trainer = GhostTrainer( + ppo_trainer, brain_name, controller, 0, dummy_config, True, "0" + ) # first policy encountered becomes policy trained by wrapped PPO policy = trainer.create_policy(brain_params_team0) - trainer.add_policy(brain_params_team0.brain_name, policy) + parsed_behavior_id0 = BehaviorIdentifiers.from_name_behavior_id( + brain_params_team0.brain_name + ) + trainer.add_policy(parsed_behavior_id0, policy) trajectory_queue0 = AgentManagerQueue(brain_params_team0.brain_name) trainer.subscribe_trajectory_queue(trajectory_queue0) # Ghost trainer should ignore this queue because off policy policy = trainer.create_policy(brain_params_team1) - trainer.add_policy(brain_params_team1.brain_name, policy) + parsed_behavior_id1 = BehaviorIdentifiers.from_name_behavior_id( + brain_params_team1.brain_name + ) + trainer.add_policy(parsed_behavior_id1, policy) trajectory_queue1 = AgentManagerQueue(brain_params_team1.brain_name) trainer.subscribe_trajectory_queue(trajectory_queue1) @@ -166,9 +176,11 @@ def test_publish_queue(dummy_config): vector_action_space_type=0, ) - brain_name = BehaviorIdentifiers.from_name_behavior_id( + parsed_behavior_id0 = BehaviorIdentifiers.from_name_behavior_id( brain_params_team0.brain_name - ).brain_name + ) + + brain_name = parsed_behavior_id0.brain_name brain_params_team1 = BrainParameters( brain_name="test_brain?team=1", @@ -181,18 +193,24 @@ def test_publish_queue(dummy_config): dummy_config["summary_path"] = "./summaries/test_trainer_summary" dummy_config["model_path"] = "./models/test_trainer_models/TestModel" ppo_trainer = PPOTrainer(brain_name, 0, dummy_config, True, False, 0, "0") - trainer = GhostTrainer(ppo_trainer, brain_name, 0, dummy_config, True, "0") + controller = GhostController(100) + trainer = GhostTrainer( + ppo_trainer, brain_name, controller, 0, dummy_config, True, "0" + ) # First policy encountered becomes policy trained by wrapped PPO # This queue should remain empty after swap 
snapshot policy = trainer.create_policy(brain_params_team0) - trainer.add_policy(brain_params_team0.brain_name, policy) + trainer.add_policy(parsed_behavior_id0, policy) policy_queue0 = AgentManagerQueue(brain_params_team0.brain_name) trainer.publish_policy_queue(policy_queue0) # Ghost trainer should use this queue for ghost policy swap policy = trainer.create_policy(brain_params_team1) - trainer.add_policy(brain_params_team1.brain_name, policy) + parsed_behavior_id1 = BehaviorIdentifiers.from_name_behavior_id( + brain_params_team1.brain_name + ) + trainer.add_policy(parsed_behavior_id1, policy) policy_queue1 = AgentManagerQueue(brain_params_team1.brain_name) trainer.publish_policy_queue(policy_queue1) diff --git a/ml-agents/mlagents/trainers/tests/test_simple_rl.py b/ml-agents/mlagents/trainers/tests/test_simple_rl.py index 1873dce123..082a210503 100644 --- a/ml-agents/mlagents/trainers/tests/test_simple_rl.py +++ b/ml-agents/mlagents/trainers/tests/test_simple_rl.py @@ -322,7 +322,7 @@ def test_simple_ghost(use_discrete): override_vals = { "max_steps": 2500, "self_play": { - "play_against_current_self_ratio": 1.0, + "play_against_latest_model_ratio": 1.0, "save_steps": 2000, "swap_steps": 2000, }, @@ -341,7 +341,7 @@ def test_simple_ghost_fails(use_discrete): override_vals = { "max_steps": 2500, "self_play": { - "play_against_current_self_ratio": 1.0, + "play_against_latest_model_ratio": 1.0, "save_steps": 2000, "swap_steps": 4000, }, @@ -357,6 +357,57 @@ def test_simple_ghost_fails(use_discrete): ) +@pytest.mark.parametrize("use_discrete", [True, False]) +def test_simple_asymm_ghost(use_discrete): + # Make opponent for asymmetric case + brain_name_opp = BRAIN_NAME + "Opp" + env = SimpleEnvironment( + [BRAIN_NAME + "?team=0", brain_name_opp + "?team=1"], use_discrete=use_discrete + ) + override_vals = { + "max_steps": 2000, + "self_play": { + "play_against_latest_model_ratio": 1.0, + "save_steps": 5000, + "swap_steps": 5000, + "team_change": 2000, + }, + } + config = generate_config(PPO_CONFIG, override_vals) + config[brain_name_opp] = config[BRAIN_NAME] + _check_environment_trains(env, config) + + +@pytest.mark.parametrize("use_discrete", [True, False]) +def test_simple_asymm_ghost_fails(use_discrete): + # Make opponent for asymmetric case + brain_name_opp = BRAIN_NAME + "Opp" + env = SimpleEnvironment( + [BRAIN_NAME + "?team=0", brain_name_opp + "?team=1"], use_discrete=use_discrete + ) + # This config should fail because the team that us not learning when both have reached + # max step should be executing the initial, untrained poliy. 
+ override_vals = { + "max_steps": 2000, + "self_play": { + "play_against_latest_model_ratio": 0.0, + "save_steps": 5000, + "swap_steps": 5000, + "team_change": 2000, + }, + } + config = generate_config(PPO_CONFIG, override_vals) + config[brain_name_opp] = config[BRAIN_NAME] + _check_environment_trains(env, config, success_threshold=None) + processed_rewards = [ + default_reward_processor(rewards) for rewards in env.final_rewards.values() + ] + success_threshold = 0.99 + assert any(reward > success_threshold for reward in processed_rewards) and any( + reward < success_threshold for reward in processed_rewards + ) + + @pytest.fixture(scope="session") def simple_record(tmpdir_factory): def record_demo(use_discrete, num_visual=0, num_vector=1): diff --git a/ml-agents/mlagents/trainers/tests/test_stats.py b/ml-agents/mlagents/trainers/tests/test_stats.py index 632c0abb9c..a99c6aede4 100644 --- a/ml-agents/mlagents/trainers/tests/test_stats.py +++ b/ml-agents/mlagents/trainers/tests/test_stats.py @@ -213,7 +213,6 @@ def test_selfplay_console_writer(self): category = "category1" console_writer = ConsoleWriter() console_writer.add_property(category, StatsPropertyType.SELF_PLAY, True) - console_writer.add_property(category, StatsPropertyType.SELF_PLAY_TEAM, 1) statssummary1 = StatsSummary(mean=1.0, std=1.0, num=1) console_writer.write_stats( category, @@ -221,8 +220,6 @@ def test_selfplay_console_writer(self): "Environment/Cumulative Reward": statssummary1, "Is Training": statssummary1, "Self-play/ELO": statssummary1, - "Self-play/Mean Opponent ELO": statssummary1, - "Self-play/Std Opponent ELO": statssummary1, }, 10, ) @@ -230,7 +227,3 @@ def test_selfplay_console_writer(self): self.assertIn( "Mean Reward: 1.000. Std of Reward: 1.000. Training.", cm.output[0] ) - self.assertIn( - "category1 Team 1: ELO: 1.000. Mean Opponent ELO: 1.000. Std Opponent ELO: 1.000.", - cm.output[1], - ) diff --git a/ml-agents/mlagents/trainers/trainer/trainer.py b/ml-agents/mlagents/trainers/trainer/trainer.py index ee84803961..4fdc8bea08 100644 --- a/ml-agents/mlagents/trainers/trainer/trainer.py +++ b/ml-agents/mlagents/trainers/trainer/trainer.py @@ -13,6 +13,7 @@ from mlagents.trainers.brain import BrainParameters from mlagents.trainers.policy import Policy from mlagents.trainers.exception import UnityTrainerException +from mlagents.trainers.behavior_id_utils import BehaviorIdentifiers logger = get_logger(__name__) @@ -138,7 +139,9 @@ def create_policy(self, brain_parameters: BrainParameters) -> TFPolicy: pass @abc.abstractmethod - def add_policy(self, name_behavior_id: str, policy: TFPolicy) -> None: + def add_policy( + self, parsed_behavior_id: BehaviorIdentifiers, policy: TFPolicy + ) -> None: """ Adds policy to trainer. 
""" diff --git a/ml-agents/mlagents/trainers/trainer_controller.py b/ml-agents/mlagents/trainers/trainer_controller.py index fdba2a4312..cd6f434284 100644 --- a/ml-agents/mlagents/trainers/trainer_controller.py +++ b/ml-agents/mlagents/trainers/trainer_controller.py @@ -156,9 +156,8 @@ def _create_trainer_and_manager( self, env_manager: EnvManager, name_behavior_id: str ) -> None: - brain_name = BehaviorIdentifiers.from_name_behavior_id( - name_behavior_id - ).brain_name + parsed_behavior_id = BehaviorIdentifiers.from_name_behavior_id(name_behavior_id) + brain_name = parsed_behavior_id.brain_name try: trainer = self.trainers[brain_name] except KeyError: @@ -166,7 +165,7 @@ def _create_trainer_and_manager( self.trainers[brain_name] = trainer policy = trainer.create_policy(env_manager.external_brains[name_behavior_id]) - trainer.add_policy(name_behavior_id, policy) + trainer.add_policy(parsed_behavior_id, policy) agent_manager = AgentManager( policy, diff --git a/ml-agents/mlagents/trainers/trainer_util.py b/ml-agents/mlagents/trainers/trainer_util.py index b0eb397f70..cf70fdb818 100644 --- a/ml-agents/mlagents/trainers/trainer_util.py +++ b/ml-agents/mlagents/trainers/trainer_util.py @@ -10,6 +10,7 @@ from mlagents.trainers.ppo.trainer import PPOTrainer from mlagents.trainers.sac.trainer import SACTrainer from mlagents.trainers.ghost.trainer import GhostTrainer +from mlagents.trainers.ghost.controller import GhostController logger = get_logger(__name__) @@ -39,6 +40,7 @@ def __init__( self.seed = seed self.meta_curriculum = meta_curriculum self.multi_gpu = multi_gpu + self.ghost_controller = GhostController() def generate(self, brain_name: str) -> Trainer: return initialize_trainer( @@ -50,6 +52,7 @@ def generate(self, brain_name: str) -> Trainer: self.keep_checkpoints, self.train_model, self.load_model, + self.ghost_controller, self.seed, self.meta_curriculum, self.multi_gpu, @@ -65,6 +68,7 @@ def initialize_trainer( keep_checkpoints: int, train_model: bool, load_model: bool, + ghost_controller: GhostController, seed: int, meta_curriculum: MetaCurriculum = None, multi_gpu: bool = False, @@ -81,6 +85,7 @@ def initialize_trainer( :param keep_checkpoints: How many model checkpoints to keep :param train_model: Whether to train the model (vs. run inference) :param load_model: Whether to load the model or randomly initialize + :param ghost_controller: The object that coordinates ghost trainers :param seed: The random seed to use :param meta_curriculum: Optional meta_curriculum, used to determine a reward buffer length for PPOTrainer :return: @@ -158,6 +163,7 @@ def initialize_trainer( trainer = GhostTrainer( trainer, brain_name, + ghost_controller, min_lesson_length, trainer_parameters, train_model,