Reinforcement learning with sparse rewards
This is a continuation of a series of posts on reinforcement learning. This will continue the discussion from the last post and use the OpenAI environment discussed in an earlier post.
Reinforcement learning shouldn’t be hard. The idea is simple enough:
Try some things randomly and save down the states and the rewards
Train a network to predict the reward
Use the network to choose the action with the highest predicted reward, allowing for some randomness
Continue to train based on those experiences
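In code, that loop looks roughly like the sketch below. The predict_rewards stub and the plain-list buffer are stand-ins for a real network and replay buffer, and it assumes the Snake environment used later in this post with a discrete action space:

import numpy as np
import gym
import snake  # registers the custom Snake-v0 environment

env = gym.make("Snake-v0")
epsilon = 0.1  # chance of acting randomly
buffer = []    # stand-in for a proper experience replay buffer

def predict_rewards(state):
    # stand-in for a trained network: one predicted reward per action
    return np.random.random(env.action_space.n)

state = env.reset()
done = False
while not done:
    if np.random.random() < epsilon:
        action = env.action_space.sample()               # try something random
    else:
        action = int(np.argmax(predict_rewards(state)))  # pick the highest predicted reward
    next_state, reward, done, _ = env.step(action)
    buffer.append((state, action, reward, next_state, done))  # save the experience
    state = next_state
# a real agent would now train the network on samples from buffer and keep playing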
You’ll read tutorials that get everything to work swimmingly. You’ll glance over the hyper-parameters and believe they’re intuitive (why wouldn’t you use 4 stacked convolutional layers with 4x4 kernels and a stride of 2x2?). Besides, machine learning models are robust, and you’re sure other parameters would work similarly well. You take solace in the fact that through emulation you have a virtually unlimited number of training samples. Surely all these great results are just the product of faster GPUs and great machine learning libraries.
But if you try to solve a problem from scratch, or even apply the exact same process to a similar environment, you’ll find this is not true.
Reinforcement learning is hard. There was a great article about this just last week. Below are the main problems the article mentions and my commentary.
Inefficiency: Not a major concern in my opinion. The long-run trend in hardware will make this a smaller problem over time. Besides, we should not anthropomorphize what is actually going on. You can teach a human how to move a box from point A to point B with one training example, but not so with a computer. But why should we expect to be able to? Reinforcement learning is doing something else entirely, and the fact that it’s data hungry should not dissuade anyone from its usefulness.
Difficulty determining the reward function and unintended consequences: It is certainly true that defining rewards is difficult, but this is present in human activities as well. One of the hardest things managers, organizations and parents struggle with is setting incentives for others. Managers want to get the most out of their best people, but not over-work them. Organizations balance short-term and long-term goals. Parents try to be both supportive and disciplinary. It would be naive to think there is some kind of intuitive objective function we can just hand over to a machine with no unintended consequences or trade-offs.
Local optima: This is true across all machine learning, but the problem is well defined and well studied. One solution is to train multiple models in tandem and use ensemble methods, a kind of model dropout, or simply more randomness. Faster GPUs make experimentation easier as well, which should help.
Over-fitting to weird patterns and unstable results: This problem is the most frustrating for someone just trying to get reinforcement learning to work. This is what I’ll try and discuss in this post.
My experience has been that your bot will fairly effectively learn based on what it’s been shown. I have gone down the rabbit-hole of optimizing neural networks, adding and removing layers, adding normalization or dropout, only to realize later that the network is not my problem. Now, before I invest too much time into training my model, I check my training examples, I establish a benchmark and I check my predictions. And most importantly, I cut off training as soon as I realize the bot is not learning anything meaningful.
Understand your environment and check your training examples
Ideally we should be able to develop a completely general algorithm to handle any environment, any reward structure and any action space. But in practice this is just not feasible and we need to understand our environment to effectively create useful bots.
In the last post, I spoke about Experience Replay, which is basically a buffer that holds your experiences (state, action taken, reward, next state and whether the game is over). Sampling randomly from the buffer is an effective way to get a variety of experiences that get refreshed over time. But many games have sparse rewards, so 99% of your experiences just result in a 0 reward. Your model will do great, as it’ll just predict 0 and, for the most part, it’ll be right. So you want to focus on the examples in which something actually happened. Here is an example of snake being played randomly on a 20x20 board:
Snake is a custom environment I built for OpenAI Gym and can be found here. I’m sure many are familiar with snake: you move a blue snake trying to reach the green squares while avoiding the walls and your own tail, and you grow with every piece of fruit you eat. Here is an example of a longer game:
It’s important to know the nature of your environment before you begin trying to learn. In a 20x20 snake, the odds of you getting a reward are very low and most of the time nothing will happen. Rather than try to determine a mathematical solution as to what my expected reward should be, I can just sample a number of games:
from collections import Counter

import numpy as np
import gym
import snake

env = gym.make("Snake-v0")

num_games = 1000
num_steps = []
rewards = []

for num_game in range(num_games):
    num_step = 0
    done = False
    env.reset()
    while not done:
        num_step += 1
        action = env.action_space.sample()
        state, reward, done, _ = env.step(action)
        rewards.append(reward)
    num_steps.append(num_step)

print("Number of games played: {}".format(num_games))
print("Average number of steps: {:0.2f}".format(np.mean(num_steps)))
print("Number of steps distribution \n\t10%: {:0.2f} \t25%: {:0.2f} \t50%: {:0.2f} \t75%: {:0.2f} \t99%: {:0.2f}".format(
    *np.percentile(num_steps, [10, 25, 50, 75, 99])))
print("Distribution of rewards: {}".format(Counter(rewards)))
These are the results:
Number of games played: 1000
Average number of steps: 34.18
Number of steps distribution
10%: 5.00 25%: 10.00 50%: 22.00 75%: 48.00 99%: 143.12
Distribution of rewards: Counter({0: 33123, -1: 1000, 1: 60})
So on ~33k steps there is no reward. There is a -1 reward on 1k of them (as expected, since I played 1k games), and a +1 reward on just 60. If we sample these experiences uniformly, we’re obviously going to be heavily biased towards nothing happening. And that’s not wrong: most of the time, nothing happens. But we need to learn exactly what’s going on in those 60 cases where we received a reward and in the steps leading up to each of those 1k deaths.
You could separate out the examples by score and sample a number from each bucket, but this is a bit arbitrary since you’d have to know what your score distribution is, and it may change as your bot learns. A better way is to sort your experiences by absolute reward and then draw samples biased towards the high rewards. Sorting every time we want to sample is not very efficient, but for our purposes the performance gain from a more complex approach would be negligible. Below is a modification of the ExperienceReplay class from the last post in which unusual experiences are over-sampled.
When we sample, we sort the buffer by the absolute value of the reward so that very high or very low rewards sit at the front. Then we weight each experience by unusual_sample_factor raised to the power of its index and normalize those weights into sampling probabilities.
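Here is a minimal sketch of what that modified class could look like. The add/sample interface, the buffer_size default and the argument names are my assumptions, so the exact code from the last post may differ:

import numpy as np

class ExperienceReplay:
    def __init__(self, buffer_size=10000, unusual_sample_factor=0.9):
        self.buffer = []
        self.buffer_size = buffer_size
        # 1.0 = uniform sampling; closer to 0 = favor unusual experiences more
        self.unusual_sample_factor = unusual_sample_factor

    def add(self, experience):
        # experience = (state, action, reward, next_state, done)
        self.buffer.append(experience)
        # indiscriminately discard the oldest experiences once the buffer is full
        self.buffer = self.buffer[-self.buffer_size:]

    def sample(self, size):
        # sort so the largest absolute rewards sit at the front of the buffer
        sorted_buffer = sorted(self.buffer, key=lambda x: abs(x[2]), reverse=True)
        # weight each experience by its position: p_i proportional to factor ** i
        probs = np.array([self.unusual_sample_factor ** i for i in range(len(sorted_buffer))])
        probs = probs / probs.sum()
        idx = np.random.choice(len(sorted_buffer), size=size, p=probs)
        return [sorted_buffer[i] for i in idx]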
So for instance, this is what unusual_sample_factor of 0.9 would look like on 8 samples:
rewards:        [1, 0, 0, 0, 0, 1, -1, 0]
sorted rewards: [1, 1, -1, 0, 0, 0, 0, 0]
probs:          [0.18 0.16 0.14 0.13 0.12 0.1 0.09 0.08]
Depending on how big your experience buffer is and how much you want to favor unusual experiences, you can change the unusual_sample_factor. At 1.0, every item is equally likely; as it approaches 0, the experiences at the front become far more likely. There are other methods you could use, such as weighting the probability by the absolute value of the reward itself, but this is likely good enough. Note that unusual experiences are still just as likely to get discarded, since the add function indiscriminately discards older experiences. You can change this as well.
Reducing the interval
Another useful technique is to reduce the interval at which you sample: don’t focus on every frame. This works well with games that are not turn based, and it’s a practical necessity for many of them. If you’re playing Mario at 60 frames a second, 10 sequential frames will be a lot less useful than 10 frames taken at 0.5-second intervals.
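For frame-based environments, one way to implement this is a wrapper that repeats each chosen action for several frames and only hands back the last one. This is a generic sketch using gym.Wrapper, not something built into the Snake environment:

import gym

class SkipFrames(gym.Wrapper):
    def __init__(self, env, skip=4):
        super().__init__(env)
        self.skip = skip  # how many frames each action is held for

    def step(self, action):
        total_reward = 0.0
        done = False
        for _ in range(self.skip):
            state, reward, done, info = self.env.step(action)
            total_reward += reward  # accumulate reward over the skipped frames
            if done:
                break
        return state, total_reward, done, info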
Bench-marking performance
Run a lot of trials before you begin and record the average performance as well as the distribution. If your average reward is 10 +/- 20, then you shouldn’t be too excited seeing scores move around 20 to 30. Also build some comparable bots: one that always outputs the same action, and one that’s random but biased (e.g. 40% UP, 20% DOWN, 20% LEFT, 20% RIGHT). I would create a separate function that scores a bot so you can just feed it various models. Below is an implementation of a simple function and class that can be used for benchmarking.
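A minimal sketch of that function and class (saved as benchmark.py, so the import below works) might look like this. The num_games default and the choice to average total reward per game are my assumptions:

import numpy as np

class CustomModel:
    # wraps a simple function so it can be scored like a trained model
    def __init__(self, predict_fn, name):
        self.predict_fn = predict_fn  # maps a state to one score per action
        self.name = name

    def predict(self, state):
        return self.predict_fn(state)

def test_model(env, model, num_games=100):
    # play num_games with the model and return the average total reward
    scores = []
    for _ in range(num_games):
        state = env.reset()
        done = False
        total_reward = 0
        while not done:
            action = int(np.argmax(model.predict(state)))  # highest-scoring action
            state, reward, done, _ = env.step(action)
            total_reward += reward
        scores.append(total_reward)
    return np.mean(scores)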
You would use this by feeding a function to CustomModel and scoring it, as below:
from benchmark import CustomModel, test_model

model_random = CustomModel(lambda x: np.random.random(4), "random")
model_0 = CustomModel(lambda x: [1,0,0,0], name="always_0")
model_1 = CustomModel(lambda x: [0,1,0,0], name="always_1")
model_2 = CustomModel(lambda x: [0,0,1,0], name="always_2")
model_3 = CustomModel(lambda x: [0,0,0,1], name="always_3")
model_bias_0 = CustomModel(lambda x: np.random.random(4) * [2,1,1,1], name="bias_0")
model_bias_1 = CustomModel(lambda x: np.random.random(4) * [1,2,1,1], name="bias_1")
model_bias_2 = CustomModel(lambda x: np.random.random(4) * [1,1,2,1], name="bias_2")
model_bias_3 = CustomModel(lambda x: np.random.random(4) * [1,1,1,2], name="bias_3")

models = [
    model_random,
    model_0, model_1, model_2, model_3,
    model_bias_0, model_bias_1, model_bias_2, model_bias_3,
]

for model in models:
    print("model: {} score: {:0.2f}".format(model.name, test_model(env, model)))
That gives 9 benchmarks: one chooses randomly, four always choose the same direction, and four are biased towards one direction. You can create more and use the best score as your benchmark.
Checking predictions
What is your bot predicting? You can look at loss numbers, and it’s pretty easy to pinpoint when those go out of control. But look at your prediction values as well. If your game returns 0 most of the time with a reward of 1 every 10 steps or so (assuming perfect play), then you should expect your predictions to come in between 0 and 1 for the most part. However, when you’re bootstrapping as we did in the last post, predicted values can spiral out of control and turn incredibly positive or negative. You can handle this by decreasing the learning rate or decreasing the frequency with which you update the target network. But most importantly, identify that this is happening early and don’t waste time feeding the model more experiences.
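A lightweight way to catch this early is to periodically print summary statistics for a batch of predictions and flag values outside the range the reward structure can justify. The check_predictions helper below is hypothetical (the expected_range bounds are just an illustration), but the idea applies to any model with a predict method:

import numpy as np

def check_predictions(model, states, expected_range=(-1.0, 2.0)):
    # print summary stats for a batch of predictions and warn on runaway values
    preds = np.array([model.predict(s) for s in states])
    print("predictions  min: {:0.2f}  mean: {:0.2f}  max: {:0.2f}".format(
        preds.min(), preds.mean(), preds.max()))
    if preds.min() < expected_range[0] or preds.max() > expected_range[1]:
        print("WARNING: predictions outside the expected range; consider a lower "
              "learning rate or less frequent target network updates")

Call it on a sample of states from your replay buffer every few hundred training steps.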
Finally, try different neural network architectures
If you’re getting good training examples, rewards are coming out as expected and your bot is actually getting better, but the loss numbers aren’t decreasing, then you may want to try different architectures. You can try all the normal techniques (dropout, batch normalization, ensemble methods, increasing the number of layers or neurons). You may also want to consider using an RNN such as a GRU or LSTM, which I’ll discuss in a future post. But if your bot is not improving, chances are something else is going on and tinkering with the model is a waste of time.