Rewards Instead of Goals
In many problems of practical interest, the task may be of an ongoing nature, such as:
- A boundary-following robot
- Operations in factories
- Board games
In such cases, we need to design learning algorithms for autonomous agents that sense and act in their environment, choosing actions that achieve specific goals.
When an agent performs an action in the environment, a trainer may occasionally provide a reward or penalty that indicates the desirability of the resulting state.
For example, when training an agent to play a game, a trainer might provide:
- a positive reward when the game is won
- a negative reward when it is lost
- zero reward in all other states
The task of the agent is:
- to learn from this indirect, delayed reward
- to choose sequences of actions that produce the greatest cumulative reward
In other words, the agent needs an action policy that maximizes cumulative reward.
One problem with an ongoing, non-terminating task is that the total future reward might be infinite.
A solution: discount future rewards by some factor, so that the agent prefers rewards in the immediate future to those in the distant future.
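To make this concrete (a standard formulation, not spelled out in the notes above): if the agent receives reward \(r_t\) at step \(t\), the discounted total it tries to maximize is
\[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t=0}^{\infty} \gamma^t r_t, \qquad 0 \lt \gamma \lt 1,\]
which stays finite as long as the individual rewards are bounded.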
\(r(n_i, a) = -c(n_i, n_j) + p(n_j)\)
- \(r(n_i, a) \rightarrow\) the reward received by the agent when it takes action \(a\) at node \(n_i\), moving it to node \(n_j\)
- \(c(n_i, n_j) \rightarrow\) the cost incurred in moving from node \(n_i\) to node \(n_j\) (hence its negative sign)
- \(p(n_j) \rightarrow\) the value of any special reward given for reaching node \(n_j\)
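As an illustrative sketch (the graph, costs, and payoffs below are hypothetical, not taken from the notes), this reward function might look as follows in Python:

```python
# Sketch of r(n_i, a) = -c(n_i, n_j) + p(n_j) on a tiny hypothetical graph.

# successor[node][action] -> node reached by taking that action
successor = {
    "n1": {"right": "n2", "down": "n3"},
    "n2": {"down": "goal"},
    "n3": {"right": "goal"},
}

cost = {("n1", "n2"): 1.0, ("n1", "n3"): 2.0,
        ("n2", "goal"): 1.0, ("n3", "goal"): 1.0}   # c(n_i, n_j)
payoff = {"goal": 10.0}                             # p(n_j); 0 elsewhere

def reward(n_i, action):
    """Immediate reward for taking `action` at node `n_i`."""
    n_j = successor[n_i][action]
    return -cost[(n_i, n_j)] + payoff.get(n_j, 0.0)

print(reward("n1", "right"))  # -1.0: step cost only
print(reward("n2", "down"))   #  9.0: step cost plus the goal payoff
```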
A Policy Function (\(\Pi\))
A policy function, \(\Pi: N \rightarrow A\), maps each node in the set of nodes \(N\) to an action in the set of possible actions \(A\).
Given a policy \(\Pi\), we can impute a value \(V^\Pi(n)\) for each node \(n\) in the state space:
- \(V^\Pi(n) \rightarrow\) the total discounted reward the agent will receive if it begins at \(n\) and follows policy \(\Pi\).
\(V^\Pi(n_i) = r(n_i, \Pi(n_i)) + \gamma V^\Pi(n_j)\)
- \(\gamma \rightarrow\) discount factor \((0 \lt \gamma \lt 1)\)
- \(n_j \rightarrow\) the successor node reached from \(n_i\) by taking action \(\Pi(n_i)\)
- \(V^\Pi(n_j) = r(n_j, \Pi(n_j)) + \gamma V^\Pi(n_k)\), and so on (a recursive definition)
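A minimal, self-contained Python sketch of this recursion, assuming a small hypothetical graph where `next_node[n]` is the successor chosen by \(\Pi\) at \(n\) and `r_pi[n]` is the immediate reward \(r(n, \Pi(n))\):

```python
# Evaluate V^Pi by repeatedly applying V(n) <- r(n, Pi(n)) + gamma * V(next node).
GAMMA = 0.9

# Hypothetical deterministic transitions under a fixed policy Pi.
next_node = {"n1": "n2", "n2": "goal", "n3": "goal", "goal": "goal"}
r_pi = {"n1": -1.0, "n2": 9.0, "n3": 9.0, "goal": 0.0}

def evaluate_policy(sweeps=100):
    """Iterate the recursive definition until the values settle."""
    V = {n: 0.0 for n in next_node}
    for _ in range(sweeps):
        V = {n: r_pi[n] + GAMMA * V[next_node[n]] for n in next_node}
    return V

print(evaluate_policy())  # V(n2) = 9.0, V(n1) = -1.0 + 0.9 * 9.0 = 7.1
```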
An Optimal Policy (\(\Pi^*\))
An optimal policy maximizes future discounted reward at every node.
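Equivalently (a standard characterization, not stated explicitly above), the optimal value function satisfies
\[V^{\Pi^*}(n_i) = \max_{a}\left[r(n_i, a) + \gamma V^{\Pi^*}(n_j)\right],\]
where \(n_j\) is the node reached from \(n_i\) by taking action \(a\); the value iteration procedure below estimates exactly this quantity.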
Value Iteration
Value iteration is a method for learning an optimal policy (\(\Pi^*\)).
Steps:
- Assign randomly an estimated value \(\hat{V}(n)\) to every node \(n\).
- At some node \(n_i\), select an action \(a\) that maximizes the sum of the immediate reward and the discounted estimated value of the successor node \(n_j\):
\[r(n_i, a) + \gamma\hat{V}(n_j)\]
- Update the estimated value \(\hat{V}(n_i)\) as:
\[\hat{V}(n_i) \leftarrow \hat{V}(n_i) + \beta\{[r(n_i, a) + \gamma\hat{V}(n_j)] - \hat{V}(n_i)\}\]
where \(0 \lt \beta \lt 1\) is a learning rate.
- If no value changes by more than \(\delta\), halt; otherwise repeat from the action-selection step.
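A minimal Python sketch of these steps, assuming a small hypothetical graph where `moves[n]` maps each node to its available actions and their `(next node, immediate reward)` pairs; it sweeps over all nodes each pass rather than visiting them one at a time:

```python
import random

# Value iteration with a learning rate beta, following the steps above.
# moves[n][a] -> (successor node n_j, immediate reward r(n, a)); hypothetical values.
moves = {
    "n1": {"right": ("n2", -1.0), "down": ("n3", -2.0)},
    "n2": {"down": ("goal", 9.0)},
    "n3": {"right": ("goal", 9.0)},
}

GAMMA, BETA, DELTA = 0.9, 0.5, 1e-6

# Step 1: assign a random initial estimate V_hat(n) to every node.
V = {n: random.random() for n in moves}
V["goal"] = 0.0  # terminal node: no further reward

while True:
    biggest_change = 0.0
    for n_i in moves:
        # Step 2: choose the action maximizing r(n_i, a) + gamma * V_hat(n_j).
        best = max(r + GAMMA * V[n_j] for n_j, r in moves[n_i].values())
        # Step 3: move V_hat(n_i) a fraction beta toward that target.
        change = BETA * (best - V[n_i])
        V[n_i] += change
        biggest_change = max(biggest_change, abs(change))
    # Step 4: halt once no value changes by more than delta.
    if biggest_change <= DELTA:
        break

print(V)  # approaches V*(n2) = V*(n3) = 9.0 and V*(n1) = -1.0 + 0.9 * 9.0 = 7.1
```

Taking, at each node, the action that achieves the maximum in step 2 then yields the optimal policy \(\Pi^*\).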
Delayed Reinforcement Learning
In delayed reinforcement learning, the reward depends on a whole sequence of actions rather than on any single action [Kaelbling, Littman & Moore, 1996].