Nonlinear associative memory models
These notes have not yet been finalized for Winter, 1999.
Readings
Required reading: Chapters 15, 16 and 17
The Brain-State-in-a-Box (BSB) model
Iterative learning algorithms. Algorithms like
back-propagation, competitive learning, the SOFM, and the
Boltzmann machine are very powerful. Multilayer nonlinear networks are capable
of representing virtually any mapping (in theory, though in practice,
actually learning the mapping may take a very long time if it is even
possible). SOFMs and competitive networks may be thought of as finding
"clusters" in data; in an approximate sense, they are discovering a
representation of the probability density of the data. That is, if there are
enough units in the network, then in each region of the input space where the
density is high (where there are many patterns), the network is likely to
allocate a unit whose weight vector represents the centre of that region.
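As a toy illustration of this idea, here is a minimal sketch of hard competitive learning in Python/NumPy. The two-cluster data set, the number of units, the learning rate and the number of epochs are illustrative assumptions, not details from these notes.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: two dense regions ("clusters") in a 2-D input space.
    data = np.vstack([
        rng.normal(loc=[ 1.0,  1.0], scale=0.1, size=(200, 2)),
        rng.normal(loc=[-1.0, -1.0], scale=0.1, size=(200, 2)),
    ])

    # Two competitive units; initialize each weight vector to a randomly
    # chosen input pattern so that neither unit starts out "dead".
    W = data[rng.choice(len(data), size=2, replace=False)].copy()
    lr = 0.05

    for epoch in range(20):
        for x in rng.permutation(data):
            winner = np.argmin(np.linalg.norm(W - x, axis=1))  # closest weight vector wins
            W[winner] += lr * (x - W[winner])                   # move the winner toward the pattern

    print(W)  # each row should end up near the centre of one dense region

Each unit's weight vector ends up representing the centre of one high-density region, which is the sense in which such networks approximate the probability density of the data.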
However, all of the learning
procedures mentioned above are inherently
iterative. Thus, they require many passes through a training set.
In contrast, humans, at least to some degree, can learn new material after a
single exposure.
Associative memory models are different; they can perform
one-shot learning. So far we have seen two examples of
one-shot learning devices:
- the linear associative memory models (both hetero- and
auto-associative)
- nonlinear associative memory models, of which we have so far seen one
example: the Hopfield network. Below, we will study another nonlinear
associative memory model: Anderson's BSB model.
Although associative memory models are considered "weak learning devices"
(i.e. they have low capacity), they have a number of advantages:
- They may be good models of human episodic memory because they exhibit
one-shot learning.
- They can model dynamical aspects of memory retrieval because they have
recurrent connections, and so their activations change over time. This permits
associative computation and pattern completion.
- We can analyze their formal properties using tools such as
eigenvalue analysis: if the linear associator updates its
state vector f by incrementing the state in proportion to the
weighted summed inputs to each unit, f(t+1) = f(t) + A
f(t), where A is the weight matrix, it can be shown that
f will converge to the direction of the eigenvector of
A having the largest eigenvalue; let's call this the dominant
eigenvector. This is a well-known technique for finding
eigenvectors, known as the "power method", and it is guaranteed to work,
provided the initial state is not
orthogonal to the dominant eigenvector.
Note that we said the final
state converges to the direction of the dominant eigenvector;
however, the state will continue to grow in magnitude at an exponential rate,
if the eigenvalue is greater than one. Thus, a linear recurrent network is a
dangerous thing: if the states are repeatedly updated, they may grow without
bound. During training, the matrix
A is created by the outer product rule; it is a sum of outer products
of the training pattern with itself (in the case of auto-association) or with
the target (in the case of hetero-association). Thus, we would hope that after
learning, we have a weight matrix for which the patterns are very close to the
eigenvectors of the matrix. Further, the eigenvectors should have roughly
equal eigenvalues. If one eigenvector has a much larger eigenvalue than the
others, the state update rule will almost always retrieve that same vector
regardless of the initial state (except when the initial state is orthogonal
to the dominant eigenvector). A small numerical sketch of these ideas appears
just below.
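The sketch below assumes an auto-associative weight matrix built by the outer-product rule from two made-up bipolar patterns; the patterns, the noise level and the iteration count are illustrative choices, not from the notes.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two made-up bipolar patterns, stored auto-associatively.
    f1 = np.array([ 1., -1.,  1., -1.,  1., -1.])
    f2 = np.array([ 1.,  1., -1., -1.,  1.,  1.])

    # Outer-product (Hebbian) rule: A is a sum of outer products of each
    # training pattern with itself.
    A = np.outer(f1, f1) + np.outer(f2, f2)
    print(np.linalg.eigvalsh(A))   # the two stored patterns have equal eigenvalues here

    # Linear recurrent update f(t+1) = f(t) + A f(t): the power method in disguise.
    f = f1 + 0.3 * rng.normal(size=6)   # start near the stored pattern f1
    for t in range(50):
        f = f + A @ f

    print(np.linalg.norm(f))           # enormous: the magnitude grows exponentially
    direction = f / np.linalg.norm(f)
    print(direction @ (f1 / np.linalg.norm(f1)))   # close to 1: the direction converged near f1

Because the two patterns have equal eigenvalues, what is retrieved depends on the initial state (here the direction ends up near f1 because we started near f1); if one eigenvalue dominated instead, almost any starting state would be driven toward that single eigenvector.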
The BSB model is very similar to a Hopfield network,
in that it is a nonlinear associative memory model with recurrent connections.
Like the Hopfield model, it can also be trained with a simple Hebbian
learning rule. (It can also be trained using a Widrow-Hoff or delta rule, but
we didn't discuss that in class.) However, Anderson generalizes the simple
binary Hopfield network in an important way: the units have
continuous real-valued states. The units' states are computed
according to a "LIMIT function" (see text, p 504-506) of the total
weighted summed input. This function is roughly sigmoidal in shape. States
are updated gradually, so that for each unit, the old state is progressively
replaced by the influence of the incoming activations. As in the case of the
linear associator, the state dynamics tend to cause the network to converge to
eigenvectors of the weight matrix. So if the initial state is close to an
eigenvector with a large eigenvalue, it will converge rapidly to something
close to that eigenvector. However, because of the LIMIT function, the final
state will actually consist of all 1's and -1's. If there were only 3 units,
this final state would be a corner of a cube in state-space. For higher
dimensions, the network state is confined to a high-dimensional hypercube (the
"box"), and the final states lie at its corners. That's why this model is
called brain-state-in-a-box.
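A minimal sketch of BSB-style retrieval, assuming the common update form x(t+1) = LIMIT(x(t) + alpha * A x(t)), with LIMIT clipping each unit's state into [-1, 1]; the stored patterns, the value of alpha and the corrupted probe are illustrative assumptions rather than details from the text.

    import numpy as np

    def limit(x):
        # LIMIT function: clip each unit's state into the box [-1, 1].
        return np.clip(x, -1.0, 1.0)

    # Hebbian (outer-product) training on two made-up bipolar patterns.
    p1 = np.array([ 1., -1.,  1., -1.,  1., -1.])
    p2 = np.array([ 1.,  1., -1., -1.,  1.,  1.])
    A = np.outer(p1, p1) + np.outer(p2, p2)

    # Start from a corrupted version of p1 (pattern completion).
    x = p1.copy()
    x[1] = 0.0      # this unit's state is unknown
    x[4] = -0.2     # this one is noisy

    alpha = 0.1     # how strongly the weighted summed input moves the state
    for t in range(30):
        x = limit(x + alpha * (A @ x))

    print(x)        # driven into the corner of the box at p1: all 1's and -1's

Starting the state nearer to p2 would instead drive it into the p2 corner; this competition between stable corners is the kind of dynamics the applications below exploit.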
Applications of the BSB model:
- Modelling multistable perception, e.g. the Necker cube.
- Modelling rule-based inferences over semantic knowledge, such as
knowledge of drugs, organisms and diseases.
- Modelling arithmetic associations, e.g. multiplication. We
discussed evidence for localist ("sensory") versus distributed/abstract
representations of numbers in children, and cross-cultural evidence of
different numerical representations. The BSB simulations used a combination of
a localist and an abstract code.