RL Algorithms #14
Hello, I just discovered PyTorch yesterday, so I still have to go more in depth, but my first intention was to evaluate whether I can recode the rltorch package with PyTorch. Basically, rltorch is very simple but more general than OpenAI Gym, since it allows one to decompose any problem into environment, sensor, feedback and policy, and can thus also be used for other problems (like supervised classification with RL, etc.). Making an OpenAI Gym wrapper is then very easy. In that case, one way to test the platform is to reimplement one of the environments (e.g. cartpole) and to compare the implemented environment with the OpenAI Gym one. I think that this could be done in a few days.

The second point concerns the implementation of policy gradient algorithms. I think that the 'dpnn' mechanism is very nice. I suppose it can be implemented in PyTorch by extending the 'Container' and/or 'Module' classes to incorporate a 'reinforce' method (but I have to check to be sure); a rough sketch of this idea is given below.

In terms of algorithms, I would like to start with: policy gradient, recurrent policy gradient, predictive policy (see rltorch), UCB policy (for online learning), and imitation policy (supervised).

What do you think?
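A minimal, purely hypothetical sketch of that extension, assuming made-up names (`StochasticModule` and its `reinforce` method are not an existing PyTorch API):

```python
# Hypothetical sketch only: a dpnn-style 'reinforce' hook grafted onto nn.Module.
# 'StochasticModule' and 'reinforce' are assumed names, not an existing PyTorch API.
import torch.nn as nn


class StochasticModule(nn.Module):
    """Base class for modules that sample in forward() and therefore need a
    reward-like signal to form their gradient estimate during backward()."""

    def __init__(self):
        super(StochasticModule, self).__init__()
        self.reward = None

    def reinforce(self, reward):
        # Store the reward-like feedback; the backward pass would use it in a
        # score-function (REINFORCE) gradient estimate.
        self.reward = reward
```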
Hey, so I've been also playing with some RL lately and I started decomposing the code into different chunks and thinking about how they would fit into PyTorch. I think right now I ended up with something quite closely following the design of rltorch. However, if this framework works well in all/most RL cases, we could go a step further and also define an RL-specific trainer on top of it (see the RLTrainer below). Right now, we have only a basic sketch:

class Sense(object):
    def __init__(self, env):
        self.env = env

    def observe(self):
        raise NotImplementedError
class Environment(object):
    def __init__(self):
        self.actions = set()

    def take_action(self, action):
        # update environment state and compute the feedback for this action
        return feedback  # pseudocode: 'feedback' stands for whatever signal is returned

    # Optionally a number of senses you can use to observe the world
class Agent(object):
    def __init__(self, env):
        # initialize internal models
        # gather senses from env
        pass

    def forward(self):
        # use the senses to observe the environment state
        # create input for the internal models
        # predict an action / choose a random one
        pass

    def backward(self, feedback):
        # generate gradients for internal models
        # can use experience replay instead of feedback
        pass

    def new_session(self):
        # clear any saved state, e.g. screen history
        pass

# I'm omitting all the plugin-calling code for clarity
class RLTrainer(Trainer):
    def __init__(self, optimizer, env, agent, session_length=1000):
        self.optimizer = optimizer
        self.env = env
        self.agent = agent
        self.session_length = session_length

    def train(self):
        self.agent.new_session()
        for i in range(self.session_length):
            action = self.agent.forward()
            feedback = self.env.take_action(action)
            self.optimizer.zero_grad()
            self.agent.backward(feedback)
            self.optimizer.step()

An example implementation:

class GameEnvironment(Environment):
    def __init__(self):
        self.actions = set([MOVE_FORWARD, MOVE_BACKWARD, SHOOT])
        self.game = ...  # initialize the game engine

    def take_action(self, action):
        self.game.update(action)
        # (would also return a feedback value, e.g. the change in score)

    def screen_buffer(self):
        return _ScreenBuffer(self)
class _ScreenBuffer(Sense):
    def __init__(self, env):
        self.env = env

    def observe(self):
        return torch.FloatTensor(self.env.game.screen_buffer)

    def size(self):
        return torch.Size([1, 3, 100, 100])
class DQNAgent(Agent):
    def __init__(self, env):
        self.actions = list(env.actions)  # as a list so actions can be indexed
        self.screen_buffer = env.screen_buffer()
        self.dqn = DQN(self.screen_buffer.size(), len(self.actions))
        self.last_frames = ...
        self.replay_memory = ...

    def forward(self):
        self.remember_frame(self.screen_buffer.observe())  # updates self.last_frames
        if self._should_pick_random():
            action_idx = int(random() * len(self.actions))
        else:
            action_idx = self._predict()
        self.last_action = self.actions[action_idx]
        return self.last_action

    def _should_pick_random(self):
        return random() > threshold

    def _predict(self):
        output = self.dqn.forward(self.last_frames)
        return output.max(1)[1]

    def backward(self, feedback):
        self.replay_memory.store(
            self.last_frames,
            self.last_action,
            feedback,
            self.screen_buffer.observe()
        )
        # sample from replay_memory and do a backward pass on the dqn

Also, could you please point me to the code of the dpnn 'reinforce' mechanism you mentioned?
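As an afterthought, here is a rough wiring sketch for the classes above; DQN, the action constants and the game engine are placeholders, and the optimizer choice is only indicative:

```python
# Hedged usage sketch for the classes above; everything here is illustrative.
import torch.optim as optim

env = GameEnvironment()
agent = DQNAgent(env)
# assuming the DQN placeholder is an nn.Module exposing parameters()
optimizer = optim.SGD(agent.dqn.parameters(), lr=1e-3)

trainer = RLTrainer(optimizer, env, agent, session_length=1000)
trainer.train()
```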
OK, some differences between what you propose and rltorch:
I think it is interesting to have a separate class for defining the task to solve. Here is the kind of code I imagine:
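A minimal sketch of what such a task class might look like, assuming hypothetical names (Task, authorized_actions, feedback, finished) rather than rltorch's actual API:

```python
# Hypothetical sketch of a task/problem definition kept separate from the raw
# environment. Names are assumptions, not rltorch's actual API.
class Task(object):
    def __init__(self, env):
        self.env = env

    def authorized_actions(self):
        # Actions allowed in the current state; an empty set means
        # the episode is finished.
        raise NotImplementedError

    def feedback(self):
        # Task-specific feedback (reward, supervised loss, ...) for the
        # last transition.
        raise NotImplementedError

    def finished(self):
        return len(self.authorized_actions()) == 0
```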
Concerning the finished method, I think that one clever thing would be to define a method that lists all authorized actions (or authorized domains for continuous RL). If this method returns an empty set, then the episode is finished. Other types of feedback can also be defined (see https://github.com/ludc/rltorch/tree/master/torch/environments/classiclearning/classification).

The second point concerns the definition of the agent (or policy). Since the agent can receive different types of feedback (at different time steps in the process), here is what I imagine:
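A minimal sketch of such a general agent interface, again with assumed method names rather than rltorch's real ones:

```python
# Hypothetical sketch: a generic agent/policy that can receive several kinds
# of feedback at arbitrary time steps. Method names are assumptions.
class Policy(object):
    def new_episode(self):
        # reset any per-episode state
        pass

    def observe(self, observation):
        # receive an observation coming from one or several senses
        raise NotImplementedError

    def sample(self, authorized_actions):
        # choose one of the currently authorized actions
        raise NotImplementedError

    def feedback(self, value, kind='reward'):
        # receive a feedback signal; 'kind' allows different feedback types
        # (reward, supervised loss, ...) at different time steps
        raise NotImplementedError

    def end_episode(self):
        # e.g. trigger the learning update for episodic algorithms
        pass
```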
This definition is very general, and can then be instantiated for a particular 'gradient-based' agent (what you propose in your previous post). Concerning the RLTrainer, I have no preference.
I can write all these classes in 'real' Python during the weekend, but it will be almost the same thing as the core classes of rltorch. How do you want to proceed?
Concerning the dpnn approach, the idea is the following: you build a model that includes stochastic layers (for example, the MultinomialLayer takes a discrete probability distribution as input and outputs a one-hot vector sampled from that distribution). The gradient can then be estimated by providing a 'reward-like' feedback to this layer before the backward pass. So, imagine you have a loss function, an input x and an output y to predict; you can do:
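A self-contained sketch of the idea (this is not dpnn code; it only spells out the score-function estimate that such a stochastic multinomial layer would use, with a toy 0/1 loss):

```python
# Illustrative sketch only (not dpnn code): a stochastic multinomial 'layer'
# whose gradient w.r.t. its input probabilities is the REINFORCE /
# score-function estimate, driven by a reward-like feedback.
import numpy as np

rng = np.random.RandomState(0)


def multinomial_forward(probs):
    # sample a one-hot vector from the given discrete distribution
    idx = rng.choice(len(probs), p=probs)
    onehot = np.zeros_like(probs)
    onehot[idx] = 1.0
    return onehot, idx


def multinomial_reinforce_backward(probs, idx, reward):
    # d/dp E[reward] estimated as reward * d/dp log p(idx)
    grad = np.zeros_like(probs)
    grad[idx] = reward / probs[idx]
    return grad


probs = np.array([0.2, 0.5, 0.3])   # would come from the deterministic layers
y = 1                               # target class
onehot, idx = multinomial_forward(probs)
loss = 0.0 if idx == y else 1.0     # toy 0/1 loss on the sampled output
reward = -loss                      # reward-like feedback given to the layer
grad_probs = multinomial_reinforce_backward(probs, idx, reward)
# grad_probs would then be backpropagated through the deterministic layers
```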
When used for reinforcement learning, you directly have a reward (no loss), so you start your backpropagation with an empty delta, but the idea remains the same (see https://github.com/Element-Research/dpnn#nn.Reinforce). Basically, the goal is to include stochastic modules in the computation graph, and this can be done by adding a reinforce method to these modules (but other ways can be imagined, I suppose).
Looks good, nice discussion guys. Thanks for making the connection @soumith. The way we did this in https://github.com/twitter/torch-twrl is a little bit different: there we have an agent defined by a learning update, a model and a policy. The agent is completely separate from the environment and from the monitoring code, and this separation allows for building in modular chunks. You can see a nice visualization here: https://blog.twitter.com/2016/reinforcement-learning-for-torch-introducing-torch-twrl

I have been working on DDPG in PyTorch, and will try to model it after the breakdown in the twrl package. twrl was modelled after RLLab and torch-rl. We should aim to enable the capabilities expected by OpenAI Gym, as it is the common test bed these days. I have been working on a simple continuous-action-space example with DDPG. Not sure if this is helpful, but I figured that an easy implementation of a common, popular RL algorithm on OpenAI Gym would be the most effective example for RL on PyTorch.
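For illustration only, the decomposition described above might look roughly like this; it is a hedged sketch, not torch-twrl's actual code:

```python
# Hedged sketch of the agent = (model, policy, learning update) decomposition;
# not torch-twrl code, just an illustration of the modularity.
class ModularAgent(object):
    def __init__(self, model, policy, learning_update):
        self.model = model                      # function approximator, e.g. a network
        self.policy = policy                    # maps model outputs to an action
        self.learning_update = learning_update  # e.g. a DDPG or Q-learning update

    def act(self, observation):
        return self.policy(self.model(observation))

    def learn(self, transition):
        # the learning update owns how the model is improved
        self.learning_update(self.model, transition)
```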
Concerning the separation between agent, environment and monitoring code, I totally agree. Concerning enabling a strong (and easy) connection with OpenAI Gym, I agree as well. So I think that we just have to agree on simple core classes (the way PG or other algorithms will be implemented is actually a totally separate problem), right?
Concerning the core classes (not the agent/policy here), this is what I have in mind, building on the structure proposed by @apaszke:
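A guessed sketch only (assumed names), merging @apaszke's Sense/Environment split with the authorized-actions idea above:

```python
# Guessed sketch of core classes; names and signatures are assumptions.
class Sense(object):
    def __init__(self, env):
        self.env = env

    def observe(self):
        raise NotImplementedError


class Environment(object):
    def authorized_actions(self):
        # set of actions allowed in the current state;
        # an empty set means the episode is finished
        raise NotImplementedError

    def take_action(self, action):
        # update the world state and return the feedback for this action
        raise NotImplementedError

    def senses(self):
        # the senses an agent can use to observe this environment
        return []
```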
@ludc About the stochastic modules, it's actually not as simple: it could work, however you'd need to perform the work of … Some comments on the API you proposed:
If you feel like reimplementing rltorch or something similar so soon, then go on. I will have to work on some other stuff as well, so I can wait and review the changes if you wish. @korymath I'll try to take a look at torch-twrl soon and see how it feels.
@ludc and @korymath are interested in building out some RL algorithms and doing OpenAI Gym integration.
Kory, from his repo https://github.com/korymath/examples/tree/master/rl, hasn't yet started on anything concrete.
If each of you declares here what you are doing before you start developing it, then I think the other person can avoid overlap.