Rewards Instead of Goals
In many problems of practical interest, the task may be of an ongoing nature, such as:
- A boundary-following robot
- Operations in factories
- Board games
In such cases, we need to design learning algorithms for autonomous agents that sense and act in their environment, choosing actions that achieve specific goals.
When an agent performs an action in the environment, a trainer may occasionally provide a reward or penalty that indicates the desirability of the resulting state.
For example, when training an agent to play a game, a trainer might provide:
- a positive reward when the game is won
- a negative reward when it is lost
- zero reward in all other states
The task of the agent is:
- to learn from this indirect, delayed reward
- to choose sequences of actions that produce the greatest cumulative reward
In other words, the agent needs an action policy that maximizes cumulative reward.
One problem with an ongoing, non-terminating task is that the total future reward might be infinite.
A solution: discount future rewards by some factor, so that the agent prefers rewards in the immediate future to those in the distant future.
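To make this concrete (a standard formulation, not spelled out in the notes above): if the agent receives reward \(r_t\) at step \(t\), the discounted total it tries to maximize is
\[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t=0}^{\infty} \gamma^t r_t, \qquad 0 \lt \gamma \lt 1,\]
which stays finite as long as the individual rewards are bounded.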
\(r(n_i, a) = -c(n_i, n_j) + p(n_j)\)
- \(r(n_i, a) \rightarrow\) the reward received by the agent when it takes action \(a\) at node \(n_i\), moving it to node \(n_j\)
- \(c(n_i, n_j) \rightarrow\) the cost incurred in moving from node \(n_i\) to node \(n_j\) (hence its negative sign)
- \(p(n_j) \rightarrow\) the value of any special reward given for reaching node \(n_j\)
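As an illustrative sketch (the graph, costs, and payoffs below are hypothetical, not taken from the notes), this reward function might look as follows in Python:

```python
# Sketch of r(n_i, a) = -c(n_i, n_j) + p(n_j) on a tiny hypothetical graph.

# successor[node][action] -> node reached by taking that action
successor = {
    "n1": {"right": "n2", "down": "n3"},
    "n2": {"down": "goal"},
    "n3": {"right": "goal"},
}

cost = {("n1", "n2"): 1.0, ("n1", "n3"): 2.0,
        ("n2", "goal"): 1.0, ("n3", "goal"): 1.0}   # c(n_i, n_j)
payoff = {"goal": 10.0}                             # p(n_j); 0 elsewhere

def reward(n_i, action):
    """Immediate reward for taking `action` at node `n_i`."""
    n_j = successor[n_i][action]
    return -cost[(n_i, n_j)] + payoff.get(n_j, 0.0)

print(reward("n1", "right"))  # -1.0: step cost only
print(reward("n2", "down"))   #  9.0: step cost plus the goal payoff
```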
A Policy Function (\(\Pi\))
A policy function, \(\Pi: N \rightarrow A\), maps each node in the set of nodes \(N\) to an action in the set of possible actions \(A\).
Given a policy \(\Pi\), we can impute a value \(V^\Pi(n)\) for each node \(n\) in the state space:
- \(V^\Pi(n) \rightarrow\) the total discounted reward the agent will receive if it begins at \(n\) and follows policy \(\Pi\).
\(V^\Pi(n_i) = r(n_i, \Pi(n_i)) + \gamma V^\Pi(n_j)\)
- \(\gamma \rightarrow\) discount factor \((0 \lt \gamma \lt 1)\)
- \(n_j \rightarrow\) the successor node reached from \(n_i\) by taking action \(\Pi(n_i)\)
- \(V^\Pi(n_j) = r(n_j, \Pi(n_j)) + \gamma V^\Pi(n_k)\), and so on (a recursive definition)
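A minimal, self-contained Python sketch of this recursion, assuming a small hypothetical graph where `next_node[n]` is the successor chosen by \(\Pi\) at \(n\) and `r_pi[n]` is the immediate reward \(r(n, \Pi(n))\):

```python
# Evaluate V^Pi by repeatedly applying V(n) <- r(n, Pi(n)) + gamma * V(next node).
GAMMA = 0.9

# Hypothetical deterministic transitions under a fixed policy Pi.
next_node = {"n1": "n2", "n2": "goal", "n3": "goal", "goal": "goal"}
r_pi = {"n1": -1.0, "n2": 9.0, "n3": 9.0, "goal": 0.0}

def evaluate_policy(sweeps=100):
    """Iterate the recursive definition until the values settle."""
    V = {n: 0.0 for n in next_node}
    for _ in range(sweeps):
        V = {n: r_pi[n] + GAMMA * V[next_node[n]] for n in next_node}
    return V

print(evaluate_policy())  # V(n2) = 9.0, V(n1) = -1.0 + 0.9 * 9.0 = 7.1
```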
An Optimal Policy (\(\Pi^*\))
An optimal policy maximizes future discounted reward at every node.
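Equivalently (a standard characterization, not stated explicitly above), the optimal value function satisfies
\[V^{\Pi^*}(n_i) = \max_{a}\left[r(n_i, a) + \gamma V^{\Pi^*}(n_j)\right],\]
where \(n_j\) is the node reached from \(n_i\) by taking action \(a\); the value iteration procedure below estimates exactly this quantity.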
Value Iteration
Value iteration is a method for learning an optimal policy (\(\Pi^*\)).
Steps:
- Assign randomly an estimated value \(\hat{V}(n)\) to every node \(n\).
- At some node \(n_i\), select an action \(a\) that maximizes the sum of the immediate reward and the discounted estimated value of the successor node \(n_j\):
\[r(n_i, a) + \gamma\hat{V}(n_j)\]
- Update the estimated value \(\hat{V}(n_i)\) as:
\[\hat{V}(n_i) \leftarrow \hat{V}(n_i) + \beta\{[r(n_i, a) + \gamma\hat{V}(n_j)] - \hat{V}(n_i)\}\]
where \(0 \lt \beta \lt 1\) is a learning rate.
- If no value changes by more than \(\delta\), halt; otherwise repeat from the action-selection step.
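A minimal Python sketch of these steps, assuming a small hypothetical graph where `moves[n]` maps each node to its available actions and their `(next node, immediate reward)` pairs; it sweeps over all nodes each pass rather than visiting them one at a time:

```python
import random

# Value iteration with a learning rate beta, following the steps above.
# moves[n][a] -> (successor node n_j, immediate reward r(n, a)); hypothetical values.
moves = {
    "n1": {"right": ("n2", -1.0), "down": ("n3", -2.0)},
    "n2": {"down": ("goal", 9.0)},
    "n3": {"right": ("goal", 9.0)},
}

GAMMA, BETA, DELTA = 0.9, 0.5, 1e-6

# Step 1: assign a random initial estimate V_hat(n) to every node.
V = {n: random.random() for n in moves}
V["goal"] = 0.0  # terminal node: no further reward

while True:
    biggest_change = 0.0
    for n_i in moves:
        # Step 2: choose the action maximizing r(n_i, a) + gamma * V_hat(n_j).
        best = max(r + GAMMA * V[n_j] for n_j, r in moves[n_i].values())
        # Step 3: move V_hat(n_i) a fraction beta toward that target.
        change = BETA * (best - V[n_i])
        V[n_i] += change
        biggest_change = max(biggest_change, abs(change))
    # Step 4: halt once no value changes by more than delta.
    if biggest_change <= DELTA:
        break

print(V)  # approaches V*(n2) = V*(n3) = 9.0 and V*(n1) = -1.0 + 0.9 * 9.0 = 7.1
```

Taking, at each node, the action that achieves the maximum in step 2 then yields the optimal policy \(\Pi^*\).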
Delayed Reinforcement Learning
In delayed reinforcement learning, the reward depends on a whole sequence of actions rather than on any single action [Kaelbling, Littman & Moore, 1996].