Nonlinear associative memory models
These notes have not yet been finalized for Winter, 1999.
Readings
Required reading: Chapters 15, 16 and 17
The Brain-State-in-a-Box (BSB) model
Iterative learning algorithms. Algorithms like
back-propagation, competitive learning, the SOFM, and the
Boltzmann machine are very powerful. Multilayer nonlinear networks are capable
of representing virtually any mapping (in theory, though in practice,
actually learning the mapping may take a very long time if it is even
possible). SOFMs and competitive networks may be thought of as finding
"clusters" in data; in an approximate sense, they are discovering a
representation of the probability density of the data. That is, if there are
enough units in the network, then in each region of the input space where the
density is high (where there are many patterns), the network is likely to
allocate a unit whose weight vector represents the centre of that region.
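As a toy illustration of this idea, here is a minimal sketch of hard competitive learning in Python/NumPy. The two-cluster data set, the number of units, the learning rate and the number of epochs are illustrative assumptions, not details from these notes.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: two dense regions ("clusters") in a 2-D input space.
    data = np.vstack([
        rng.normal(loc=[ 1.0,  1.0], scale=0.1, size=(200, 2)),
        rng.normal(loc=[-1.0, -1.0], scale=0.1, size=(200, 2)),
    ])

    # Two competitive units; initialize each weight vector to a randomly
    # chosen input pattern so that neither unit starts out "dead".
    W = data[rng.choice(len(data), size=2, replace=False)].copy()
    lr = 0.05

    for epoch in range(20):
        for x in rng.permutation(data):
            winner = np.argmin(np.linalg.norm(W - x, axis=1))  # closest weight vector wins
            W[winner] += lr * (x - W[winner])                   # move the winner toward the pattern

    print(W)  # each row should end up near the centre of one dense region

Each unit's weight vector ends up representing the centre of one high-density region, which is the sense in which such networks approximate the probability density of the data.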
However, all of the learning
procedures mentioned above are inherently
iterative. Thus, they require many passes through a training set.
In contrast, humans, at least to some degree, can learn new material after a
single exposure.
Associative memory models are different; they can perform
one-shot learning. So far we have seen two examples of
one-shot learning devices:
- the linear associative memory models (both hetero- and
auto-associative)
- nonlinear associative memory models, of which we have so far seen one
example: the Hopfield network. Below, we will study another nonlinear
associative memory model: Anderson's BSB model.
Although associative memory models are considered "weak learning devices"
(i.e. they have low capacity), they have a number of advantages:
- They may be good models of human episodic memory because they exhibit
one-shot learning.
- They can model dynamical aspects of memory retrieval because they have
recurrent connections, and so their activations change over time. This permits
associative computation and pattern completion.
- We can analyze their formal properties using tools such as
eigenvalue analysis: if the linear associator updates its
state vector f by incrementing the state in proportion to the
weighted summed inputs to each unit, f(t+1) = f(t) + A
f(t), where A is the weight matrix, it can be shown that
f will converge to the direction of the eigenvector of
A having the largest eigenvalue; let's call this the dominant
eigenvector. This is a well-known technique for finding
eigenvectors, known as the "power method", and it is guaranteed to work,
provided the initial state is not
orthogonal to the dominant eigenvector.
Note that we said the final
state converges to the direction of the dominant eigenvector;
however, the state will continue to grow in magnitude at an exponential rate,
if the eigenvalue is greater than one. Thus, a linear recurrent network is a
dangerous thing: if the states are repeatedly updated, they may grow without
bound. During training, the matrix
A is created by the outer product rule; it is a sum of outer products
of the training pattern with itself (in the case of auto-association) or with
the target (in the case of hetero-association). Thus, we would hope that after
learning, we have a weight matrix for which the patterns are very close to the
eigenvectors of the matrix. Further, the eigenvectors should have roughly
equal eigenvalues. If one eigenvector has a much larger eigenvalue than the
others, the state update rule will almost always retrieve that same vector
regardless of the initial state (except when the initial state is orthogonal
to the dominant eigenvector). A small numerical sketch of these ideas appears
just below.
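The sketch below assumes an auto-associative weight matrix built by the outer-product rule from two made-up bipolar patterns; the patterns, the noise level and the iteration count are illustrative choices, not from the notes.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two made-up bipolar patterns, stored auto-associatively.
    f1 = np.array([ 1., -1.,  1., -1.,  1., -1.])
    f2 = np.array([ 1.,  1., -1., -1.,  1.,  1.])

    # Outer-product (Hebbian) rule: A is a sum of outer products of each
    # training pattern with itself.
    A = np.outer(f1, f1) + np.outer(f2, f2)
    print(np.linalg.eigvalsh(A))   # the two stored patterns have equal eigenvalues here

    # Linear recurrent update f(t+1) = f(t) + A f(t): the power method in disguise.
    f = f1 + 0.3 * rng.normal(size=6)   # start near the stored pattern f1
    for t in range(50):
        f = f + A @ f

    print(np.linalg.norm(f))           # enormous: the magnitude grows exponentially
    direction = f / np.linalg.norm(f)
    print(direction @ (f1 / np.linalg.norm(f1)))   # close to 1: the direction converged near f1

Because the two patterns have equal eigenvalues, what is retrieved depends on the initial state (here the direction ends up near f1 because we started near f1); if one eigenvalue dominated instead, almost any starting state would be driven toward that single eigenvector.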
The BSB model is very similar to a Hopfield network,
in that it is a nonlinear associative memory model with recurrent connections.
Like the Hopfield model, it can also be trained with a simple Hebbian
learning rule. (It can also be trained using a Widrow-Hoff or delta rule, but
we didn't discuss that in class.) However, Anderson generalizes the simple
binary Hopfield network in an important way: the units have
continuous real-valued states. The units' states are computed
according to a "LIMIT function" (see text, p 504-506) of the total
weighted summed input. This function is roughly sigmoidal in shape. States
are updated gradually, so that for each unit, the old state is progressively
replaced by the influence of the incoming activations. As in the case of the
linear associator, the state dynamics tend to cause the network to converge to
eigenvectors of the weight matrix. So if the initial state is close to an
eigenvector with a large eigenvalue, it will converge rapidly to something
close to that eigenvector. However, because of the LIMIT function, the final
state will actually consist of all 1's and -1's. If there were only 3 units,
this final state would be a corner of a cube in state-space. For higher
dimensions, the network state is confined to a high-dimensional hypercube (the
"box"), and the final states lie at its corners. That's why this model is
called brain-state-in-a-box.
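A minimal sketch of BSB-style retrieval, assuming the common update form x(t+1) = LIMIT(x(t) + alpha * A x(t)), with LIMIT clipping each unit's state into [-1, 1]; the stored patterns, the value of alpha and the corrupted probe are illustrative assumptions rather than details from the text.

    import numpy as np

    def limit(x):
        # LIMIT function: clip each unit's state into the box [-1, 1].
        return np.clip(x, -1.0, 1.0)

    # Hebbian (outer-product) training on two made-up bipolar patterns.
    p1 = np.array([ 1., -1.,  1., -1.,  1., -1.])
    p2 = np.array([ 1.,  1., -1., -1.,  1.,  1.])
    A = np.outer(p1, p1) + np.outer(p2, p2)

    # Start from a corrupted version of p1 (pattern completion).
    x = p1.copy()
    x[1] = 0.0      # this unit's state is unknown
    x[4] = -0.2     # this one is noisy

    alpha = 0.1     # how strongly the weighted summed input moves the state
    for t in range(30):
        x = limit(x + alpha * (A @ x))

    print(x)        # driven into the corner of the box at p1: all 1's and -1's

Starting the state nearer to p2 would instead drive it into the p2 corner; this competition between stable corners is the kind of dynamics the applications below exploit.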
Applications of the BSB model:
- Modelling multistable perception, e.g. the Necker cube.
- Modelling rule-based inferences over semantic knowledge, such as
knowledge of drugs, organisms and diseases.
- Modelling arithmetic associations, e.g. multiplication. We
discussed evidence for localist ("sensory") versus distributed/abstract
representations of numbers in children, and cross-cultural evidence of
different numerical representations. The BSB simulations used a combination of
a localist and an abstract code.