Multi-Armed Bandit (MAB)
Hola, I'm back! I was on a hiatus post-baby, and now I will try to write more regularly. So, let's get started. The multi-armed bandit problem was formulated in the 1950s and goes as follows: suppose there are N slot machines (arms), and the player has to decide which machines to play and in what order so as to maximize the reward. Each machine dispenses a reward with an unknown probability. Note that the reward probability of each machine is unknown; otherwise the reward would be maximized by simply always playing the machine with the highest probability. This problem has many applications: ad recommendation, news recommendation, clinical trials (finding the best treatment), and stock portfolio management.
The problem of choosing the best arm is complicated by the stochastic nature of the slot machines. A sub-optimal machine could generate a reward by random chance, and an optimal machine could just as easily underperform by chance. This gives rise to the principle of Exploration vs. Exploitation. The exploration-exploitation tradeoff is the dilemma the player faces between exploiting the current best strategy and exploring the space by trying a seemingly sub-optimal strategy in order to learn more about its rewards. Many mathematical models try to come up with an optimal solution, and some of the popular approximate solutions are as follows:
ε-greedy The approach is as follows: with probability (1 - ε), play the machine with the highest estimated mean reward (exploit), and with probability ε, try a random machine (explore).
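Here is a minimal sketch of ε-greedy in Python, assuming numeric rewards and tracking a running mean per arm; the class and method names (EpsilonGreedy, select_arm, update) are my own, not from any particular library.

```python
import random

class EpsilonGreedy:
    """epsilon-greedy arm selection with incremental mean estimates."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # times each arm was played
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self):
        # With probability epsilon explore a random arm,
        # otherwise exploit the arm with the highest estimated mean.
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        # Incremental update of the running mean for the played arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

A typical loop would call select_arm(), pull that machine, and feed the observed reward back through update().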
Upper Confidence Bound (UCB) In the beginning, play each machine at least once. After that, at each time t, proceed by picking the machine i that maximizes:
x̄ᵢ + √(2 ln t / Tᵢ), where x̄ᵢ is the mean reward of machine i, t is the total number of plays so far, and Tᵢ is the number of times machine i has been played. This is one of the most popular algorithms for the MAB problem, and there are strong theoretical guarantees on its regret bound.
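A sketch of this rule (the UCB1 index) in Python, using the same incremental-mean bookkeeping as the ε-greedy example above; the names are again illustrative.

```python
import math

class UCB1:
    """UCB1: play each arm once, then play the arm maximizing
    mean + sqrt(2 * ln(t) / T_i)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # T_i: plays of each arm
        self.values = [0.0] * n_arms  # x-bar_i: mean reward of each arm
        self.t = 0                    # total plays so far

    def select_arm(self):
        # Initialization phase: play any arm not yet tried.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        # Pick the arm with the largest UCB index:
        # empirical mean plus exploration bonus.
        return max(
            range(len(self.counts)),
            key=lambda i: self.values[i]
            + math.sqrt(2 * math.log(self.t) / self.counts[i]),
        )

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

The bonus term shrinks as an arm is played more often, so rarely played arms keep getting revisited until their estimates tighten.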
Thompson Sampling This approach has gained popularity recently, with empirical results showing better performance than the other methods. The idea is to maintain a posterior distribution over the mean reward of each arm, draw one sample from each posterior, and play the arm whose sample is largest.
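A sketch of Thompson Sampling for Bernoulli (0/1) rewards, assuming a Beta(1, 1) prior on each arm's success probability; the Beta-Bernoulli pairing is my choice here because the conjugate update keeps the posterior in closed form.

```python
import random

class BernoulliThompson:
    """Thompson Sampling with Beta posteriors for Bernoulli arms."""

    def __init__(self, n_arms):
        # Beta(1, 1) prior: alpha counts successes, beta counts failures.
        self.alpha = [1] * n_arms
        self.beta = [1] * n_arms

    def select_arm(self):
        # Draw one sample from each arm's posterior; play the best draw.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, arm, reward):
        # Conjugate update, assuming reward is 1 (success) or 0 (failure).
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```

Arms with few observations have wide posteriors and produce high-variance samples, so they still get explored; as evidence accumulates, the posteriors concentrate and play shifts to the best arm.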