Motivation. Suppose you have to choose which route to take every day driving to work. You devise a complicated strategy, but one day your neighbor and coworker, who takes the same route to work every day, says “I don’t have a strategy, I just take this one route every day.” Wouldn’t you regret it if, in total, your routes took more time than your coworker’s single route?
def. Online Decision Making Game.
- Player has $N$ actions to choose from,
- At time $t = 1, 2, \dots, T$:
- Player constructs a distribution $p^t$ over the set of actions $\{1, \dots, N\}$
- Adversary chooses a loss for each action, $\ell^t_i \in \{0, 1\}$, for every action $i$, where $\ell^t_i = 0$ represents no loss and $\ell^t_i = 1$ represents a loss. (In our example, think of it as traffic conditions causing a delay.)
- Player’s distribution is realized into an action $i^t \sim p^t$. The player incurs a loss of $\ell^t_{i^t}$.
- The player’s goal is to minimize total loss, which we will define shortly.
- We play for time $T$. The number of iterations is predetermined. (A sketch of this protocol in code follows below.)
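To make the protocol concrete, here is a minimal sketch of the game loop in Python. The `play_game` and `UniformPlayer` names and the uniformly random adversary are illustrative assumptions, not part of the definition above.

```python
import random

def play_game(player, adversary, N, T):
    """Run the online decision-making game for T rounds over N actions."""
    total_loss = 0
    for t in range(T):
        p = player.distribution()                        # player commits to a distribution p^t
        losses = adversary(t)                            # adversary fixes a 0/1 loss for every action
        action = random.choices(range(N), weights=p)[0]  # p^t is realized into an action
        total_loss += losses[action]                     # player incurs that action's loss
        player.observe(losses)                           # player then sees the full loss vector
    return total_loss

class UniformPlayer:
    """Toy player that ignores feedback and plays uniformly at random."""
    def __init__(self, N):
        self.N = N
    def distribution(self):
        return [1.0 / self.N] * self.N
    def observe(self, losses):
        pass

if __name__ == "__main__":
    N, T = 4, 100
    adversary = lambda t: [random.randint(0, 1) for _ in range(N)]  # illustrative random adversary
    print(play_game(UniformPlayer(N), adversary, N, T))
```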
We characterize the loss if we always chose action $i$ (like how our neighbor always takes the same route) for time $T$ as:
$$L^T_i = \sum_{t=1}^{T} \ell^t_i$$
Thus we can also define:
- $L^T_{\min} = \min_i L^T_i$ is the minimum total loss if we could only choose one action every time
- $L^T_A = \sum_{t=1}^{T} \sum_{i=1}^{N} p^t_i \ell^t_i$ is the total (expected) loss of the player playing the strategy of algorithm $A$, i.e. the distributions $p^1, \dots, p^T$.
def. External Regret. For a player playing strategy $A$, the regret for this strategy is:
$$R_A = L^T_A - L^T_{\min}$$
i.e. the difference between the total loss of the algorithm and the best total loss of a one-action-every-time strategy.
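As a concrete illustration (a sketch with hypothetical helper names), $L^T_i$, $L^T_{\min}$, $L^T_A$, and the external regret can be computed directly from the per-round loss vectors and the player’s distributions:

```python
def total_losses(loss_vectors):
    """L^T_i: cumulative loss of always playing action i, for every i."""
    N = len(loss_vectors[0])
    return [sum(l[i] for l in loss_vectors) for i in range(N)]

def external_regret(loss_vectors, distributions):
    """Expected total loss of the player minus the loss of the best fixed action."""
    L = total_losses(loss_vectors)
    L_min = min(L)                                        # best one-action-every-time loss
    L_A = sum(sum(p_i * l_i for p_i, l_i in zip(p, l))    # expected loss summed over rounds
              for p, l in zip(distributions, loss_vectors))
    return L_A - L_min

# Tiny example: two actions, three rounds, player always mixes 50/50.
losses = [[1, 0], [1, 0], [0, 1]]
dists = [[0.5, 0.5]] * 3
print(external_regret(losses, dists))   # L_A = 1.5, L_min = 1, regret = 0.5
```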
Algorithms for Online Games
alg. Greedy algorithm. This algorithm chooses the action whose one-action-every-time loss so far is smallest.
- Initially: $L^0_i = 0$ for every action $i$, so the first action played is $x^1 = 1$.
- At time $t$, choose $x^t$ such that we take the minimum possible $L^{t-1}_i$. In other words: $x^t = \arg\min_i L^{t-1}_i$.
- Break ties deterministically, choosing the action with the lowest index. (A sketch in code follows below.)
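A minimal sketch of the greedy rule, assuming the same player interface as the game-loop sketch above (the class name is hypothetical):

```python
class GreedyPlayer:
    """Deterministically plays the action whose cumulative loss so far is smallest."""
    def __init__(self, N):
        self.N = N
        self.cum_loss = [0] * N               # L^{t-1}_i for every action i

    def distribution(self):
        # min() returns the first minimizer, so ties break toward the lowest index
        best = min(range(self.N), key=lambda i: self.cum_loss[i])
        p = [0.0] * self.N
        p[best] = 1.0                         # deterministic choice encoded as a point mass
        return p

    def observe(self, losses):
        for i, l in enumerate(losses):
            self.cum_loss[i] += l
```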
Motivation. This algorithm is really bad: because it is deterministic, the adversary can force it to incur a loss at every single step, while the best fixed action loses at most a $1/N$ fraction of the rounds. Instead, we can try to confuse our adversary by mixing our strategy, i.e. using a randomized algorithm.
alg. Randomized Weighted Majority.
- Initially, set $w^1_i = 1$ and play action $i$ with probability $p^1_i = 1/N$ for all $i$
- At time $t$:
- Update $w^t_i = w^{t-1}_i \cdot (1-\epsilon)^{\ell^{t-1}_i}$, where $\epsilon$ is the discount factor
- Calculate this new weight for every action $i$, and then play action $i$ with probability $p^t_i = \frac{w^t_i}{W^t}$,
- where $W^t = \sum_{i=1}^{N} w^t_i$. (A sketch in code follows below.)
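A minimal sketch of RWM with the same assumed player interface as above:

```python
class RWMPlayer:
    """Randomized Weighted Majority: multiplicatively discounts the weight of losing actions."""
    def __init__(self, N, eps):
        self.eps = eps                        # discount factor epsilon
        self.weights = [1.0] * N              # w^1_i = 1 for every action i

    def distribution(self):
        W = sum(self.weights)                 # W^t = sum_i w^t_i
        return [w / W for w in self.weights]  # p^t_i = w^t_i / W^t

    def observe(self, losses):
        # w^{t+1}_i = w^t_i * (1 - eps)^{l^t_i}; with 0/1 losses only losers get discounted
        self.weights = [w * (1.0 - self.eps) ** l
                        for w, l in zip(self.weights, losses)]
```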
This is a much better algorithm; in fact we can show how small its regret is.
thm. (Regret Bound of Randomized Weighted Majority) For $\epsilon \le \frac{1}{2}$, the loss of the RWM algorithm satisfies:
$$L^T_{RWM} \le (1+\epsilon)\, L^T_{\min} + \frac{\ln N}{\epsilon}$$
Proof. Let $F^t$ denote the fraction of the total weight at time $t$ that is discounted because the corresponding actions incurred a loss. We show this is equal to the expected loss of the algorithm at timestep $t$:
$$F^t = \frac{\sum_{i:\,\ell^t_i = 1} w^t_i}{W^t} = \sum_{i:\,\ell^t_i = 1} p^t_i = \sum_{i=1}^{N} p^t_i \ell^t_i$$
Now, we can express $W^{t+1}$ in terms of $W^t$ and $F^t$, by splitting the summation into actions that incurred a loss and those that didn’t:
$$W^{t+1} = \sum_{i:\,\ell^t_i = 0} w^t_i + (1-\epsilon) \sum_{i:\,\ell^t_i = 1} w^t_i = (1 - F^t)\, W^t + (1-\epsilon)\, F^t W^t = W^t (1 - \epsilon F^t)$$
We can now construct the inequality: the final total weight is at least the final weight of the best fixed action $i^*$ (the one achieving $L^T_{\min}$), so
$$(1-\epsilon)^{L^T_{\min}} = w^{T+1}_{i^*} \le W^{T+1} = N \prod_{t=1}^{T} (1 - \epsilon F^t)$$
And with some algebra and the inequality $\ln(1-x) \le -x$: taking logarithms of both sides gives
$$L^T_{\min} \ln(1-\epsilon) \le \ln N + \sum_{t=1}^{T} \ln(1 - \epsilon F^t) \le \ln N - \epsilon \sum_{t=1}^{T} F^t = \ln N - \epsilon\, L^T_{RWM}$$
Rearranging, and using $-\ln(1-\epsilon) \le \epsilon + \epsilon^2$ for $\epsilon \le \frac{1}{2}$, we obtain
$$L^T_{RWM} \le \frac{\ln N}{\epsilon} - \frac{\ln(1-\epsilon)}{\epsilon}\, L^T_{\min} \le (1+\epsilon)\, L^T_{\min} + \frac{\ln N}{\epsilon}$$
■
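As a rough sanity check of the bound, here is a sketch reusing the hypothetical `RWMPlayer` and `total_losses` helpers from above, against a random (not worst-case) illustrative adversary; the theorem bounds the expected loss, which is what we accumulate here.

```python
import math
import random

N, T, eps = 10, 2000, 0.1
player = RWMPlayer(N, eps)
loss_history, L_rwm = [], 0.0

for t in range(T):
    p = player.distribution()
    losses = [random.randint(0, 1) for _ in range(N)]        # illustrative adversary
    L_rwm += sum(p_i * l_i for p_i, l_i in zip(p, losses))   # expected loss this round
    loss_history.append(losses)
    player.observe(losses)

L_min = min(total_losses(loss_history))
bound = (1 + eps) * L_min + math.log(N) / eps
print(f"L_RWM = {L_rwm:.1f}, L_min = {L_min}, bound = {bound:.1f}")  # L_RWM should not exceed bound
```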