Learning Heuristic Functions

Algorithms for learning heuristic functions target two kinds of state space representation:

  • explicit representations of the state space graph (category 1 algorithms)

  • implicit representations of the state space graph (category 2 algorithms)

Explicit Graphs (Category 1 Algorithms)

If the agent does not have a good heuristic function to estimate the cost of reaching the goal, it is sometimes possible to learn such a function

In an explicit representation of the state space, it is possible to store an explicit list of all the nodes (or all possible nodes)

It is assumed that, in the explicit representation, the agent has a good model of the effects of its actions and knows the cost of moving from any node to its successor nodes.

Steps

  1. Initialize the heuristic function \(h(n)\) to 0 for all nodes \(n\)

  2. Repeat steps 3 to 5 for various start nodes and one goal node

  3. Conduct A* search

  4. Expand node \(n_i\) to produce a set of successors \(\delta(n_i)\)

  5. Modify \(h(n_i)\) as follows:

    \[h(n_i) = \min_{n_j \in \delta(n_i)} \{h(n_j) + c(n_i, n_j)\}\]

    Assume \(h(n_g) = 0\)

  6. Stop

(Figure: expansion of node \(n_i\) into its successors \(\delta(n_i)\))

The \(h\) values can be stored in a table of nodes (because there is a manageably small number of nodes)
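The following Python sketch illustrates this category 1 procedure, assuming the explicit graph is given as an adjacency dictionary `graph[n] = {successor: cost, ...}`; the function name `learn_h` and the dictionary-based table are illustrative choices, not part of the original formulation.

```python
import heapq, itertools

def learn_h(graph, starts, goal, h=None):
    """Learn h over an explicit graph by repeated A*-style searches to one goal.

    graph  : dict mapping node -> {successor: edge_cost, ...}
    starts : iterable of start nodes (the training instances (s_i, g))
    goal   : the single goal node; h(goal) is fixed at 0
    h      : optionally, a table of previously learned estimates
    """
    h = dict(h) if h else {}                      # step 1: h defaults to 0
    get_h = lambda n: 0.0 if n == goal else h.get(n, 0.0)
    tie = itertools.count()                       # tie-breaker for the heap

    for start in starts:                          # step 2: repeat for each start node
        frontier = [(get_h(start), next(tie), 0.0, start)]
        closed = set()
        while frontier:                           # step 3: A* search with f = g + h
            _, _, g, n = heapq.heappop(frontier)
            if n == goal:
                break
            if n in closed:
                continue
            closed.add(n)
            succs = graph.get(n, {})              # step 4: expand n
            if succs:                             # step 5: h(n) <- min_j [h(n_j) + c(n, n_j)]
                h[n] = min(get_h(m) + c for m, c in succs.items())
            for m, c in succs.items():
                if m not in closed:
                    heapq.heappush(frontier, (g + c + get_h(m), next(tie), g + c, m))
    return h
```

Because the table `h` is carried over from one start node to the next, later searches toward the same goal expand fewer nodes.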

Training instances can be represented as \(\{ (s_1, g), (s_2, g), \dots \}\), where:

  • \(s_1\) start state 1, \(s_2\) start state 2, …​

  • \(g\) goal state

The first search for the goal node may be no faster than uniform cost search \( ( f(n) = g(n) ) \), since \(h\) starts at 0 everywhere

As the learned \(h\) values get updated during searches from different start nodes to the same goal node, subsequent searches are accelerated.

As deeper and deeper nodes in the search tree are expanded, better and better estimates of \( h^* \) at various nodes are propagated backward from the goal node.

Note
Learning Real-Time A* (LRTA*), developed by Korf, 1990, uses this same update rule for \( h \).

Implicit Graphs (Category 2 Algorithms)

An implicit graph is one whose vertices or edges are not represented as explicit objects in the computer's memory, but are instead determined algorithmically from some other input, such as a computable function.

Category 2 algorithms are used for implicitly defined state space graphs.

Assumptions

Category 2 algorithms assume the following:

  • A model of the effects of the agent’s actions is available

  • General graph search with \(f(n) = g(n) + h(n)\) can be performed

  • \(h(n)\) can be decomposed as a weighted sum of sub-functions:

    \[h(n) = w_1 h_1(n) + w_2 h_2(n) + \dots + w_k h_k(n)\]

The algorithms are designed to learn the appropriate values of the \(w_i\)'s

Example: Sub-Functions for 8-Puzzle


Let \(h(n) = w(n) + p(n)\), where:

  • \(w(n) \rightarrow\) the number of tiles out of place

  • \(p(n) \rightarrow\) sum of the distances of each tile from its goal position
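A small sketch of these two sub-functions, assuming a puzzle state is a tuple of 9 entries in row-major order with 0 for the blank, and reading \(p(n)\) as the usual Manhattan distance; the function names are illustrative.

```python
def tiles_out_of_place(state, goal):
    """w(n): number of tiles (excluding the blank) not in their goal position."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def manhattan_distance(state, goal):
    """p(n): sum of the distances of each tile from its goal position."""
    goal_pos = {tile: divmod(i, 3) for i, tile in enumerate(goal)}
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        r, c = divmod(i, 3)
        gr, gc = goal_pos[tile]
        total += abs(r - gr) + abs(c - gc)
    return total

def h(state, goal, w=(1.0, 1.0)):
    """Weighted combination of the two sub-functions."""
    return w[0] * tiles_out_of_place(state, goal) + w[1] * manhattan_distance(state, goal)
```

For the plain \(h(n) = w(n) + p(n)\) of the example, both weights are simply 1.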

Method 1 for Learning Weights

  • In this method, we first initialize the weights randomly and conduct a search using \(h\) with these weight values. When a goal node is reached, we use the final known value of \(h(n_g) = 0\) to back up \(h\) values of all nodes, \(n_i\), on the path from the start node to the goal node.

  • Using these values as training instances, we can adjust the weights to minimize the sum of squared errors between the training samples and the \(h\) function given by the weighted combination.

\[\frac{\partial E}{\partial w_i} = 0\]

where \(E \rightarrow\) sum of squared errors

The weights of the linear combination can be adjusted incrementally:
\[w_i(t+1) = w_i(t) + \Delta w_i(t)\\[0.8em] \Delta w_i(t) = \eta (h_\text{target}(n) - h(n)) h_i(n)\]

where \( \eta \rightarrow \) learning rate
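A one-function sketch of this incremental update, assuming the sub-function values \(h_i(n)\) are supplied as a feature vector; the name `delta_rule_update` and the default \(\eta\) are illustrative.

```python
def delta_rule_update(weights, features, h_target, eta=0.05):
    """One step of: w_i(t+1) = w_i(t) + eta * (h_target(n) - h(n)) * h_i(n)."""
    h_n = sum(w * f for w, f in zip(weights, features))   # current h(n)
    error = h_target - h_n
    return [w + eta * error * f for w, f in zip(weights, features)]
```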

Steps:

  1. Assign arbitrary values to all weights

  2. Do for all training examples

  3. Find a goal node \(n_g\) using general graph search with \(f(n) = g(n) + h(n)\)

  4. \(h(n_g) = 0\)

  5. Do for all nodes \(n_i\) on a path from \(s\) to \(n_g\)

  6. \(h(n_{i-1}) = h(n_i) + c(n_{i-1}, n_i)\) (back up the value through the edge cost)

  7. Compute \(E\), the sum of squared errors

    \(E = \sum_i \left[ h(n_i) - \left( w_1 h_1(n_i) + w_2 h_2(n_i) + \cdots \right) \right]^2\), where \(h(n_i)\) is the backed-up value from step 6

  8. Update weights \(w_i\) using least square error minimization \(\forall i\)

  9. Print the learned weight values

  10. Stop
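One training pass might look as follows in Python, assuming a hypothetical search routine has already returned the edge costs along the solution path and the feature vectors \([h_1(n_i), \dots, h_k(n_i)]\) of the nodes on that path; the helper names and the use of NumPy's least-squares solver are assumptions, one way of carrying out steps 4 to 8.

```python
import numpy as np

def backed_up_targets(path_costs):
    """Given edge costs [c(n_0,n_1), ..., c(n_{k-1},n_g)] along the found path,
    return target h values for n_0 .. n_g by backing up from h(n_g) = 0."""
    targets = [0.0]
    for c in reversed(path_costs):
        targets.append(targets[-1] + c)
    return list(reversed(targets))            # targets[i] is the backed-up h(n_i)

def fit_weights(feature_rows, targets):
    """Least-squares fit of w minimizing E = sum_i (h(n_i) - w . [h_1(n_i),...])^2."""
    H = np.asarray(feature_rows, dtype=float)  # row i holds [h_1(n_i), ..., h_k(n_i)]
    t = np.asarray(targets, dtype=float)
    w, *_ = np.linalg.lstsq(H, t, rcond=None)
    return w
```

With the backed-up targets in hand, step 8 reduces to an ordinary least-squares problem in the weights.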

Method 2 for Learning Weights (Temporal Difference Learning)

This learning method is an instance of temporal difference learning, first formalized by Sutton, 1988.

  • Weight adjustment depends on only two temporally adjacent values of a function

  • In this case, it is the difference between the current value of \(h\) and the past value of \(h\)

Sutton stated and proved several properties of the method relating to its convergence in various situations

Temporal difference can be applied during search before a goal is reached. However, values of \(h\) that are being learned do not become relevant to the goal until the goal is reached.

If a model of the effects of the agent's actions is unavailable, this method can still be applied, with learning taking place in the actual world.

Steps:

  1. Assign arbitrary values to all weights

  2. Do for all training problem instances

  3. Conduct general graph search with \(f(n) = g(n) + h(n)\)

  4. Expand node \(n_i\) to produce a set of successors \(\delta(n_i)\)

  5. Adjust weights so that \( h(n_i) = h(n_i) + \beta \left\{ \min_{n_j \in \delta(n_i)} [h(n_j) + c(n_i, n_j)] - h(n_i) \right\} \) or

    \[h(n_i) = (1 - \beta) h(n_i) + \beta \left\{ \min_{n_j \in \delta(n_i)} [h(n_j) + c(n_i, n_j)] \right\}\]

    \(\beta\) is the learning rate, \( 0 \leq \beta \leq 1\). It controls how closely \(h(n_i)\) is moved toward \( \min_{n_j \in \delta(n_i)} [h(n_j) + c(n_i, n_j)] \): \(\beta = 0\) makes no change at all, while \(\beta = 1\) sets \(h(n_i)\) equal to \( \min_{n_j \in \delta(n_i)} [h(n_j) + c(n_i, n_j)] \).

  6. Print the learned weight values

  7. Stop

Note
Small values of \(\beta\) lead to very slow learning, while large values of \(\beta\) may lead to erratic learning and non-convergence.
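A sketch of the adjustment in step 5, assuming the same feature representation of \(h\) as in Method 1; combining \(\beta\) with a separate learning rate \(\eta\) for the weight change is an assumption, not taken from the text.

```python
def td_adjust(weights, features_of, n_i, succs, beta=0.2, eta=0.05):
    """One temporal-difference weight adjustment at an expanded node n_i.

    features_of : maps a node to its feature vector [h_1(n), ..., h_k(n)]
    succs       : list of (n_j, c(n_i, n_j)) pairs produced by expanding n_i
    """
    h = lambda n: sum(w * f for w, f in zip(weights, features_of(n)))
    backed_up = min(h(n_j) + c for n_j, c in succs)    # min_j [h(n_j) + c(n_i, n_j)]
    target = (1 - beta) * h(n_i) + beta * backed_up    # desired new value of h(n_i)
    error = target - h(n_i)                            # = beta * (backed_up - h(n_i))
    return [w + eta * error * f
            for w, f in zip(weights, features_of(n_i))]
```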

Learning When Model of Action Effects is Absent

If the agent does not have a model of the effects of its actions, it can learn them and also learn an \( h \) function at the same time by a similar process, although the learning has to take place in the actual world rather than in a state space search model.

Also, the agent does not know the cost of its actions. It learns this cost upon taking actions.

The agent incrementally builds a table of the states it has visited and what it has learned about each action applied in them:

| State \((n)\) | Action 1 \((a_1)\) | Action 2 \((a_2)\) |
| \(n_0\) | \(\cdots\) | \(\cdots\) |
| \(n_1\) | \(\cdots\) | \(\cdots\) |

Steps:

  1. Sense the current state \(n_i\)

    also add an entry for \(n_i\) in the table if it is not already there

  2. Randomly select an action \(a\) from the set of possible actions to apply on \(n_i\) to generate \(n_j\) (as it visits states, it names them)

    add an entry for \(n_j\) in the table if it is not already there

    or

    choose an action \(a\) according to the policy

    \[a = \arg \min_{a \in A} \{h(\sigma(n_i, a)) + c(n_i, \sigma(n_i, a))\}\]

    \( \sigma(n_i, a) \rightarrow \) description of the state reached from node \(n_i\) by applying action \(a\)

  3. \(h(n_i) = h(n_j) + c(n_i, n_j)\)

    \(n_i\) is the node to which action \(a\) is applied; \(n_j\) is the resulting state; \(c(n_i, n_j)\) is the revealed cost of the action; \(h(n_j)\) is the estimated value of \(n_j\), which is 0 if \(n_j\) has never been visited before and is otherwise taken from the table

  4. Go to step 1

  5. End
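A sketch of this loop for an agent acting directly in the world, assuming a hypothetical environment object with methods `sense()`, `actions(state)`, and `apply(action)` (the last returns the resulting state and the revealed cost); the probability `epsilon` of taking a random action anticipates the exploration discussed just below.

```python
import random

def learn_in_the_world(env, epsilon=0.1, steps=1000):
    """Learn h values and action costs by acting, with no prior model of action effects.

    env is assumed to provide (hypothetical API):
      env.sense()          -> current state
      env.actions(state)   -> list of applicable actions
      env.apply(action)    -> (resulting state, revealed cost) after acting
    """
    h = {}      # h-value table, one entry per visited state
    model = {}  # learned effects/costs: (state, action) -> (next_state, cost)

    for _ in range(steps):
        n_i = env.sense()                         # step 1: sense the current state
        h.setdefault(n_i, 0.0)                    # add a table entry if it is new

        acts = env.actions(n_i)
        known = [a for a in acts if (n_i, a) in model]
        if known and random.random() > epsilon:
            # policy: a = argmin_a [h(sigma(n_i, a)) + c(n_i, sigma(n_i, a))]
            a = min(known, key=lambda act: h.get(model[(n_i, act)][0], 0.0)
                                           + model[(n_i, act)][1])
        else:
            a = random.choice(acts)               # step 2: occasional random action

        n_j, c = env.apply(a)                     # acting reveals sigma(n_i, a) and its cost
        h.setdefault(n_j, 0.0)                    # h(n_j) = 0 if n_j was never visited
        model[(n_i, a)] = (n_j, c)                # record the learned effect and cost

        h[n_i] = h[n_j] + c                       # step 3: h(n_i) = h(n_j) + c(n_i, n_j)

    return h, model
```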

Because the action selected at node \(n\) is the one leading to a node estimated to be on a minimal cost path from \(n\) to a goal, it is not necessary to evaluate all the successors of node \(n\) when updating \(h(n)\).

The algorithm may result in learning non-optimal paths!

  • The nodes on an optimal path may never be visited

Allowing occasional random actions (instead of those selected by the policy being learned) helps the agent learn new (perhaps better) paths to the goal.

A trade-off needs to be maintained between:

  • exploration of newer paths

  • exploitation of already learnt knowledge

This trade-off is necessary to ensure that all nodes are visited.

Note
Taking random actions is one way for an agent to deal with exploration of new paths.

As a model is gradually built up, it is possible to combine "learning in the world" with "learning and planning in the model"

Note
Such combinations were explored by Sutton, 1990 in his Dyna model.

Dyna is an AI architecture that integrates:

  • Learning

  • Planning

  • Reactive execution