Deep Reinforcement Learning

Let's talk about one of the more difficult areas in ML - Deep Reinforcement Learning. Two of the most popular approaches in this space are policy gradients and deep Q-networks. An agent interacts with an environment and receives rewards. Policy search is the problem of finding a good set of parameters in the policy space. One way to explore the policy space is the policy gradient approach, which evaluates the gradients of the rewards w.r.t. the parameters and then moves in the direction that maximizes reward.

The policy itself can be defined by, say, a neural network. In supervised ML, we already know the best action from the set of actions, and the NN can be trained by minimizing the cross-entropy loss between the estimated and target distributions. In RL, however, because we care about long-term reward, the reward itself can be delayed or sparse. This is the classic credit assignment problem. It is generally addressed by evaluating an action via the sum of all the rewards that come after it, usually with a discount factor γ. The game is played multiple times so that, on average, good actions look better than bad actions; this is called computing the action advantage. Once the game has been played multiple times, the action returns are normalized (say, by z-scoring).
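
As a rough sketch of this bookkeeping (a minimal NumPy example with made-up episode rewards), the discounted returns and their z-score normalization could look like this:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.95):
    """Sum of future rewards for each step, discounted by gamma."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def normalize_returns(all_returns):
    """Z-score the returns across all episodes so that, on average,
    good actions end up positive and bad actions negative."""
    flat = np.concatenate(all_returns)
    mean, std = flat.mean(), flat.std()
    return [(r - mean) / (std + 1e-8) for r in all_returns]

# Example: two short episodes with illustrative rewards
episodes = [discounted_returns([0.0, 0.0, 1.0]), discounted_returns([0.0, -1.0])]
advantages = normalize_returns(episodes)
```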

REINFORCE: Let us now look at one of the most popular policy gradient algorithms. REINFORCE is a Monte Carlo variant of policy gradient. The objective is to learn a policy that maximizes the cumulative future reward starting from any given time t.
J(θ) = E[Σ_{t=0}^{T} r_{t+1}]
The maximization problem can be solved by taking the gradient of the objective w.r.t. the policy parameter θ. The algorithm proceeds as follows:
  1. Let the NN play the game multiple times and, at each step, compute the gradients that would make the chosen action even more likely.
  2. Once you have several episodes, compute each action's advantage.
  3. If the advantage is positive, apply the gradients to make the action more likely in the future; if it is negative, apply the opposite gradients to penalize the action.
  4. Finally, compute the mean of the resulting gradient vectors and use it to perform the gradient descent step (see the sketch below).
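
Here is a minimal sketch of the gradient step in PyTorch, assuming a tiny policy network for a task with 4-dimensional observations and 2 discrete actions (the network, sizes, and function names are illustrative assumptions, not from a specific implementation). `observations` is a float tensor, `actions` is an integer tensor of chosen actions, and `advantages` holds the normalized returns from above:

```python
import torch
import torch.nn as nn

# Hypothetical 2-action policy network over 4-dimensional observations.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_step(observations, actions, advantages):
    """One update: weight each chosen action's log-probability by its advantage
    and descend on the negative, i.e. ascend on expected reward."""
    logits = policy(observations)                                   # (N, 2)
    log_probs = torch.log_softmax(logits, dim=-1)                   # (N, 2)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # (N,)
    loss = -(advantages * chosen).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Weighting each log-probability by its advantage is what makes positive-advantage actions more likely and negative-advantage actions less likely after the step.
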
One of the main drawbacks of policy gradient algorithms is that they do not scale well to large, complex tasks. They are highly sample inefficient and need to explore the policy space for a long time before they can make any progress.

Deep Q-Networks: The other popular approach for RL is Deep Q-Learning. Before delving into that, let's talk about MDPs. An MDP consists of states and actions, and the transition probabilities depend on the action taken in each state. To determine the best strategy from a state, the Bellman Optimality Equation can be applied to estimate the optimal value of each state, V*(s), which is the sum of all discounted future rewards that can be obtained from that state when acting optimally. The equation is as follows:
V*(s) = max_a Σ_{s'} T(s,a,s')·[R(s,a,s') + γ·V*(s')]   ∀ s
where T(s,a,s') is the probability of transitioning from state s to s' when taking action a, R(s,a,s') is the reward obtained for that transition, and γ is the discount factor. The Value Iteration algorithm uses this equation to compute the optimal value of every state by applying it recursively. The optimal state value is useful, but it does not by itself tell the agent which action to take, i.e., the optimal policy. To solve this problem, Bellman introduced a related quantity known as the Q-value: the optimal value of a state-action pair, computed very similarly as follows:
Q_{k+1}(s,a) ← Σ_{s'} T(s,a,s')·[R(s,a,s') + γ·max_{a'} Q_k(s',a')]   ∀ (s,a)
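
As an illustration, here is a sketch of Q-value iteration on a tiny made-up MDP with known T and R (all numbers are purely illustrative):

```python
import numpy as np

# Toy MDP with 3 states and 2 actions.
# T[s, a, s'] = transition probability, R[s, a, s'] = reward.
n_states, n_actions, gamma = 3, 2, 0.9
T = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))
T[0, 0, 0], T[0, 1, 1] = 1.0, 1.0
T[1, 0, 0], T[1, 1, 2] = 1.0, 1.0
T[2, 0, 2], T[2, 1, 2] = 1.0, 1.0
R[1, 1, 2] = 10.0                     # only reaching state 2 from state 1 pays off

Q = np.zeros((n_states, n_actions))
for _ in range(50):                    # iterate until the values stop changing
    Q_next = Q.max(axis=1)                          # max_{a'} Q_k(s', a') for every s'
    Q = (T * (R + gamma * Q_next)).sum(axis=2)      # Σ_{s'} T·[R + γ·max Q]

optimal_policy = Q.argmax(axis=1)      # best action in each state
```
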
Once we have the Q-values, acting optimally simply means picking the action with the largest Q-value in each state. A key problem with this approach is that the agent must know the transition probabilities T for each state. The Temporal Difference (TD) learning algorithm comes to the rescue here. It is very similar to the Value Iteration algorithm, except that it is adapted to account for the agent's only partial knowledge of the MDP: the agent uses an exploration policy, for example a uniformly random one, to gather information about the state values. The update proceeds as follows:
V_{k+1}(s) ← (1 − α)·V_k(s) + α·(r + γ·V_k(s'))
TD learning is similar to SGD in that it works on a single sample at a time and needs the learning rate α to decrease over time for good convergence.
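
A compact sketch of TD(0) value estimation under a uniform-random exploration policy, assuming a hypothetical environment object exposing reset(), step(s, a), and actions(s) (this interface is an assumption, not a standard API):

```python
import random

def td0(env, n_states, n_episodes=1000, gamma=0.9):
    """TD(0) value estimation with a uniform-random exploration policy.
    `env` is a hypothetical object: reset() -> s, step(s, a) -> (s', r, done),
    actions(s) -> list of legal actions. Adapt to your environment."""
    V = [0.0] * n_states
    for episode in range(n_episodes):
        alpha = 1.0 / (1 + episode)            # decrease the learning rate over time
        s, done = env.reset(), False
        while not done:
            a = random.choice(env.actions(s))  # exploration policy: uniform random
            s_next, r, done = env.step(s, a)
            V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])
            s = s_next
    return V
```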

Q-learning is a similar adaptation of the Q-value iteration algorithm; its update is as follows:
Q(s,a) ← (1 − α)·Q(s,a) + α·(r + γ·max_{a'} Q(s',a'))
Unlike the policy gradient approach above, Q-learning is an off-policy algorithm: the policy being learned is not necessarily the one being executed. Note that the exploration policy is generally an ε-greedy one, with ε decreasing over time.
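
A tabular Q-learning sketch with ε-greedy exploration, reusing the same hypothetical environment interface as the TD(0) sketch above:

```python
import random
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=2000, gamma=0.9):
    """Tabular Q-learning with epsilon-greedy exploration
    (hypothetical env interface: reset() -> s, step(s, a) -> (s', r, done))."""
    Q = np.zeros((n_states, n_actions))
    for episode in range(n_episodes):
        alpha = 1.0 / (1 + episode)                        # decaying learning rate
        epsilon = max(0.05, 1.0 - episode / n_episodes)    # decaying exploration
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.randrange(n_actions)            # explore
            else:
                a = int(Q[s].argmax())                     # exploit current estimates
            s_next, r, done = env.step(s, a)
            target = r + gamma * Q[s_next].max()
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q
```
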
The main problem with Q-learning is that it does not scale well: if there are many states, the agent has to track a Q-value for every possible state-action combination. Deep Q-Learning is a method for approximate Q-learning, where we learn a parametric function Q_θ(s,a) that approximates the Q-value directly. An important question is how to train the DQN. From the Bellman equation, we know that the approximate Q-value of (s,a) should be as close as possible to the reward r observed after taking action a in state s, plus the discounted value of playing optimally from then on. To estimate that second term, we execute the DQN on the next state s' for all possible actions, take the highest predicted Q-value, and discount it. Thus we can write the target as follows, and the network can be trained with gradient descent to move its prediction toward it:
Q_target(s,a) = r + γ·max_{a'} Q_θ(s',a')
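
A minimal sketch of one DQN training step in PyTorch on a batch of transitions (s, a, r, s', done); the network shape and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical Q-network: 4-dimensional states, 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_step(states, actions, rewards, next_states, dones):
    """One gradient-descent step; actions is an int64 tensor, dones a 0/1 float tensor."""
    with torch.no_grad():                                  # targets are treated as constants
        next_q = q_net(next_states).max(dim=1).values      # max_{a'} Q_θ(s', a')
        targets = rewards + gamma * next_q * (1 - dones)   # no bootstrap past a terminal state
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
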
To make Deep Q-Learning work in practice, many variants have been proposed. One modification, introduced in DeepMind's famous Atari work (the 2013 paper and its 2015 Nature follow-up), is to use two DQNs: an online network that drives the agent and is trained at every step, and a second network that is used only to define the targets. Double DQN (2015) and later DeepMind papers such as Rainbow discuss further tricks that make DQN practical.
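
A sketch of the two-network idea, reusing the q_net and gamma from the previous snippet: targets come from a separate copy of the network whose weights are only refreshed periodically.

```python
import copy
import torch

# Target network: a frozen copy used only to compute targets;
# the online q_net keeps driving the agent and being trained.
target_net = copy.deepcopy(q_net)

def dqn_targets(rewards, next_states, dones):
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * next_q * (1 - dones)

def sync_target():
    """Periodically copy the online weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```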
