GB2409088A - Visual object recognition system - Google Patents

Info

Publication number: GB2409088A (other versions: GB0328830D0)
Application number: GB0328830A
Authority: GB (United Kingdom)
Prior art keywords: layer, neurons, stimuli, neural network, learning
Legal status: Withdrawn
Inventors: Simon Maitland Stringer, Edmund Rolls, Gavin Perry
Assignee (current and original): Oxford University Innovation Ltd
Application filed by Oxford University Innovation Ltd
Priority to GB0328830A (GB2409088A)
Priority to PCT/GB2004/005187 (WO2005057474A1)
Publication of GB0328830D0
Publication of GB2409088A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]


Abstract

An unsupervised method of training a hierarchical feedforward neural network to perform transform-invariant visual object recognition, referred to as "continuous transformation (CT) learning". The method involves presenting a stimulus, preferably visual but possibly aural, to the neural network, and then performing continuous synaptic enhancement of the feedforward inter-layer connection weights using an associative learning rule, such as a Hebbian learning rule, during continuous, gradual transformation (for example translation or rotation) of the stimulus.

Description

Visual Object Recognition System

This invention relates to a system for visual object recognition and, more particularly, to a system for visual object recognition which simulates the activity of the primate visual system with regard to recognition of visual stimuli.
There is evidence that, over a series of cortical processing stages, the visual system of primates produces a representation of objects that shows invariance with respect to, for example, translation, size and view; this has been shown using recordings from single neurons in the temporal lobe. A theory as to how these neurons could acquire their transform-independent selectivity, based on the known physiology of the visual cortex and self-organising principles, has the following fundamental elements: A series of competitive networks, organised in hierarchical layers, exhibiting mutual inhibition over a short range within each layer. These networks allow combinations of features or inputs that occur in a given spatial arrangement to be learned by neurons, ensuring that higher-order spatial properties of the input stimuli are represented in the network.
A convergent series of connections from a localised population of cells in preceding layers to each cell of the following layer, thus allowing the receptive-field size of cells to increase through the visual processing areas or layers.
A modified Hebb-like learning rule incorporating a temporal trace of each cell's previous activity, which, it is suggested, will enable the neurons to learn transform invariances.
One of the major problems to be solved with respect to a visual system for use in object recognition is the building of a representation of visual information which allows recognition to occur relatively independently of size, contrast, spatial frequency, position on the retina, angle of view, etc. Another important issue to be considered is the operation of the visual system to recognise those stimuli that are either partially occluded or presented against natural cluttered backgrounds.
Using the above-mentioned theory, a simulation system has been developed which is consistent with the fundamental elements referred to above. This simulation system, implemented by a neural network, has been shown to be capable of object, including face, recognition in a biologically plausible way, and after training has been shown to exhibit, for example, translation and view invariance (see Guy Wallis & Edmund T. Rolls, Invariant Face and Object Recognition in the Visual System, Progress in Neurobiology, Vol. 51, pp. 167-194, 1997).
The architecture and operation of the neural network referred to above will now be described in more detail.
Referring to Figure 1 of the drawings, there is illustrated a schematic four-layer neural network intended to simulate the operation of the visual system of a primate, such that the successive layers correspond approximately to V2, V4, the posterior temporal cortex and the anterior temporal cortex respectively. The network is designed as a series of hierarchical, convergent, competitive networks, and the overall network is constructed such that convergence of information from the most disparate parts of the network's input layer can potentially influence firing in a single neuron of the final layer, as shown in Figure 1. This corresponds to a known proposed scheme as present in the primate visual system.
The forward connections to a cell in one layer are derived from a topologically related and confined region of the preceding layer. The choice of whether a connection between neurons in adjacent layers exists or not is based upon a Gaussian distribution of connection probabilities which roll off radially from the focal point of connections for each neuron. In practice, a minor extra constraint precludes the repeated connection of any pair of cells. Each cell receives 100 connections from a 32 x 32 grid of cells in the preceding layer, initially with a 67% probability that a connection comes from within four cells of the distribution centre - although the effective convergence increases slightly through the layers. Figure 1 shows the general convergent network architecture used. Localisation and limitation of connectivity in the network is intended to mimic cortical connectivity in the visual system of a primate, partially because of the clear retention of retinal topology through regions of the visual cortex. This architecture also encourages the gradual combination of features from layer to layer.
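The connectivity scheme described above can be sketched as follows. This is a minimal illustrative implementation, not the patent's code: the function name, the rejection-sampling scheme, and the edge-clamping behaviour are assumptions; only the Gaussian fall-off of connection probability around a focal point and the rule precluding repeated connection of any pair of cells come from the text.

```python
import random

def sample_connections(n_conn, grid, sigma, centre):
    """Draw afferent connections for one cell from a topologically
    corresponding region of the preceding layer, with a Gaussian fall-off
    of connection probability around the focal point `centre`."""
    cx, cy = centre
    chosen = set()
    while len(chosen) < n_conn:
        # Sample a source cell; clamp to the grid so every draw is valid.
        x = min(grid - 1, max(0, round(random.gauss(cx, sigma))))
        y = min(grid - 1, max(0, round(random.gauss(cy, sigma))))
        chosen.add((x, y))   # the set precludes duplicate connections
    return chosen

random.seed(1)
# e.g. 100 connections into one cell from a 32 x 32 preceding layer
conns = sample_connections(100, 32, 4.0, (16, 16))
```

Most draws land within a few cells of the focal point, mimicking the localised, topologically related convergence described for the network.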
Calculation of neuronal firing in the arrangement of Figure 1 is described in detail by Guy Wallis & Edmund T. Rolls, in Invariant Face and Object Recognition in the Visual System, Progress in Neurobiology, Vol. 51, pp. 167-194.
The learning rule implemented in the simulations of the arrangement of Figure 1 uses the spatio-temporal constraints placed on the behaviour of real-world objects to learn about natural object transformations. By presenting consistent sequences of transforming objects, the cells in the network can learn to respond to the same object through all of its naturally transformed states. The learning rule incorporates a decaying trace of previous cell activity and is henceforth referred to simply as the trace learning rule. The learning paradigm is intended in principle to enable learning of any transforms tolerated by inferior cortex neurons.
A known trace update rule used in prior art simulations can be summarised as follows:

Δw_j = α ȳ^τ x_j    (1)

where

ȳ^τ = (1 − η) y^τ + η ȳ^(τ−1)

x_j: jth input to the neuron.
ȳ^τ: trace value of the output of the neuron at time step τ.
w_j: synaptic weight between the jth input and the neuron.
y: output from the neuron.
α: learning rate; annealed between unity and zero.
η: trace value; the optimal value varies with presentation sequence length.
In order to bound the growth of each cell's dendritic weight vector, its length is explicitly normalised, as is common in known networks.
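A minimal sketch of one trace-rule update followed by explicit weight normalisation (pure Python; the function and variable names are illustrative, not from the patent):

```python
import math

def trace_update(w, x, y_trace_prev, y, alpha=0.1, eta=0.8):
    """One update of equation 1: dw_j = alpha * y_trace * x_j, where the
    trace y_trace mixes the current output y with the previous trace."""
    y_trace = (1 - eta) * y + eta * y_trace_prev
    w = [wj + alpha * y_trace * xj for wj, xj in zip(w, x)]
    # Explicitly normalise the weight vector length to bound its growth.
    norm = math.sqrt(sum(wj * wj for wj in w))
    w = [wj / norm for wj in w]
    return w, y_trace
```

Because the trace carries activity from earlier presentations, inputs that are active while the trace is high (i.e. shortly after the neuron responded to another transform of the stimulus) gain weight onto the same neuron.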
The above-described network is provided with a set of input filters that can be applied to an image to produce inputs to the network that correspond with those provided by simple cells in visual cortical area 1 (V1) of a primate visual system. The purpose is to enable the more complicated response properties of cells between V1 and the inferior temporal cortex (IT) to be investigated within the arrangement, using as inputs natural stimuli such as those that could be applied to the retina of the real visual system.
Thus, in the trace rule referred to above, where τ indexes the current trial and τ−1 indexes the previous trial, the postsynaptic term in the synaptic modification is based on the trace from previous trials available at time τ−1 and on the current activity (at time τ), with η determining the relative proportion of these two.
In a more recent study, described by Edmund T. Rolls & T. Milward in A Model of Invariant Object Recognition in the Visual System: Learning Rules, Activation Functions, Lateral Inhibition, and Information-Based Performance Measures, Neural Computation 12, 2547-2572 (2000), it is illustrated that the use of a presynaptic trace, a postsynaptic trace (as used previously), and both presynaptic and postsynaptic traces produced similar performance. However, significantly better translation invariance can be achieved if the rule is used with a trace calculated for τ−1, for example:

Δw_j = α ȳ^(τ−1) x_j^τ    (2)

rather than with the trace at time τ, as in equation 1 above. One way to understand this is to note that the trace rule is trying to set up the synaptic weight on trial τ based on whether the neuron, based on its previous history, is responding to that stimulus (in other positions). Use of the trace rule at τ−1 does this; that is, it takes into account the firing of the neuron on previous trials, with no contribution from the firing being produced by the stimulus on the current trial. On the other hand, use of the trace at time τ in the update takes into account the current firing of the neuron to the stimulus in that particular position, which is not a good estimate of whether that neuron should be allocated to represent that stimulus invariantly. Effectively, using the trace at time τ introduces a Hebbian element into the update, which tends to build position-encoded analysers rather than stimulus-encoded analysers.
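The τ−1 variant of the update can be sketched as follows (illustrative names; contrast it with an update that also multiplies in the current firing):

```python
def trace_update_prev(w, x, y_trace_prev, alpha=0.1):
    """Equation 2: dw_j = alpha * y_trace(tau - 1) * x_j(tau).
    The current firing y(tau) makes no contribution to the update, so the
    Hebbian (position-encoding) element of equation 1 is avoided: weights
    change only insofar as the neuron was already responding to the
    stimulus on previous trials."""
    return [wj + alpha * y_trace_prev * xj for wj, xj in zip(w, x)]
```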
However, the trace learning mechanisms described above have been found to allow the object recognition arrangement to learn to recognise visual stimuli at only a small number of locations on the retina. In addition, there is a relatively long training time required. It is therefore an object of the present invention to enable such an arrangement to achieve a higher capacity relative to prior art arrangements by being able to train the arrangement to recognise visual stimuli over many different locations, if not every location on the retina. It is a further object of the invention to provide an arrangement with a significantly reduced training time.
Thus, in accordance with the present invention, there is provided apparatus for automatic entity recognition, the apparatus comprising a neural network having a plurality of layers, each layer comprising a plurality of neurons, one or more neurons in one layer becoming activated as a result of stimulation of one or more neurons in another layer in response to an input originating from an entity, the apparatus comprising means for training said apparatus to recognise entities from stimuli applied thereto by performing a training process comprising the steps of: a) applying a series of stimuli at respective trial times to a first layer of said network so as to cause activation of one or more neurons in another layer; b) applying an associative learning rule to said one or more neurons so as to calculate a synaptic weight associated with said one or more neurons based on stimulation at said first layer at the current trial time; and c) identifying spatial continuity between features of a first set of stimuli and features of a subsequent set of stimuli and causing the same one or more neurons to be activated in response to both said first and said subsequent sets of stimuli.
The present invention also extends to a method of automatic entity recognition in a neural network having a plurality of layers, each layer comprising a plurality of neurons, one or more neurons in one layer becoming activated as a result of stimulation of one or more neurons in another layer in response to an input originating from an entity, the method comprising training said network to recognise entities from stimuli applied thereto by performing a training process comprising the steps of: a) applying a series of stimuli at respective trial times to a first layer of said network so as to cause activation of one or more neurons in another layer; b) applying an associative learning rule to said one or more neurons so as to calculate a synaptic weight associated with said one or more neurons based on stimulation at said first layer at the current trial time; and c) identifying spatial continuity between features of a first set of stimuli and features of a subsequent set of stimuli and causing the same one or more neurons to be activated in response to both said first and said subsequent sets of stimuli.
The stimuli may be visual, for example in apparatus for object recognition, or aural, for example in apparatus for speech recognition.
The neural network beneficially comprises a hierarchical series of layers of competitive networks, preferably with associatively-modifiable forward connections between the layers. Each layer comprises a plurality of neurons, wherein forward connections from neurons in a first layer to neurons in a second layer are derived from a topologically corresponding region of the first layer, beneficially using a Gaussian distribution of connection probabilities.
Means are preferably provided to pre-process an input layer of the neural network with a set of input filters prior to the application of a series of visual stimuli to the neural network, the input filters beneficially being computed by weighting the difference of two Gaussians by a third Gaussian, and means are preferably provided for the subsequent application of contrast enhancement by means of a sigmoid activation function.
In a preferred embodiment, elements of an input stimulus applied at two successive trial times overlap in an input layer of the neural network. Beneficially, at each trial time, activity caused by the application of a stimulus on an input layer of the neural network is propagated in a feedforward fashion through the network, thereby stimulating patterns of activity in later layers. These activity patterns are beneficially computed, and then the synaptic weights of the forward connections between the layers are preferably updated by an associative learning rule. This learning rule may comprise the Hebb learning rule.
These and other aspects of the present invention will be apparent from, and elucidated with reference to the embodiment described herein.
An embodiment of the present invention will now be described by way of example only and with reference to the accompanying drawings: Figure 1 is a schematic illustration of a four-layer neural network for use in an arrangement according to an exemplary embodiment of the present invention, whereby convergence through the network is designed to provide fourth-layer neurons with information from across the entire input retina, together with a corresponding diagram illustrating convergence in the visual system (V1, visual cortex area V1; TEO, posterior inferior temporal cortex; TE, inferior temporal cortex (IT)); Figure 2 is an illustration of how CT learning would function in a network with a single layer of forward synaptic connections between an input layer and an output layer: Figure 2a shows the initial presentation of the stimulus to the network, and Figure 2b shows what happens after the stimulus is shifted by a small amount; and Figure 3 illustrates graphically numerical results obtained in applying a method according to an exemplary embodiment of the present invention to train a neural network to recognise two different face stimuli as they are translated across the retina.
Thus, as explained above, there is now much evidence demonstrating that over successive stages the primate visual system develops neurons that respond with view, size and position (translation) invariance to objects or faces. A fundamental challenge for modellers is to postulate unsupervised learning mechanisms that might explain how such neurons develop their transform-invariant response properties. Unsupervised learning means that the network is not given additional information about the identity of the object or face which is currently being viewed during training.
In the following, a new algorithm is described for the unsupervised training of feedforward hierarchical neural networks to perform transform-invariant visual object recognition, which is hereinafter referred to as continuous transformation (CT) learning.
The neural network consists of a hierarchical series of layers of competitive networks, with associatively-modifiable forward connections between the layers. CT learning is a general learning mechanism, which relies on continuous synaptic enhancement of the feedforward inter-layer connection weights using a Hebbian learning rule during continuous transformation (e.g. translation, rotation, etc) of the visual stimulus. The CT learning mechanism can be instantiated in various forms of feedforward neural network, with different forms of associative learning rule or different ways of implementing competition between neurons within each layer. For the simulations demonstrated herein, CT learning is demonstrated in a hierarchical network model of the visual system known in the art as VisNet.
In numerical simulations, CT learning according to the present invention exhibits the following significant advantages: (i) it requires only relatively few (e.g. 10) training epochs, and (ii) it is able to teach the network to recognise a stimulus at all locations on the retina. In addition, the fact that CT learning is an unsupervised learning mechanism offers the advantage (over supervised algorithms such as backpropagation) that the neural network can continue to learn (without human intervention) about visual stimuli such as faces, even as they mildly alter their appearance due to factors such as changes in hairstyle or aging. This is clearly a capability of the human visual system that would be useful in artificial vision systems.
CT learning according to the invention is a general technique for finding and partitioning data clusters which cover highly elongated continuous regions in the data space. (For example, the continuous transforms of a visual object will form a continuous region of the retinal input space). As such, CT learning may be used for all kinds of data clustering applications where the data clusters have this elongated form.
Data clustering algorithms have a very broad range of applications. One such area is voice recognition, where a neural network must be trained to recognise different parts of speech.
In the following, the above-mentioned VisNet model will be described, in respect of which the CT learning mechanism of the present invention will be demonstrated.
However, in principle, the CT learning mechanism could operate in various forms of feedforward neural network, with different forms of associative learning rule or different ways of implementing competition between neurons within each layer.
The VisNet model consists of a hierarchical series of four layers of competitive networks, corresponding to V2, V4, the posterior inferior temporal cortex, and the anterior inferior temporal cortex, as shown in Figure 1. The forward connections to individual cells are derived from a topologically corresponding region of the preceding layer, using a Gaussian distribution of connection probabilities. These distributions are defined by a radius which will contain approximately 67% of the connections from the preceding layer. Typical values are given in Table 1:

           Dimensions       # Connections   Radius
Layer 4    32 x 32          100             12
Layer 3    32 x 32          100              9
Layer 2    32 x 32          100
Layer 1    32 x 32           21              6
Retina     128 x 128 x 32

Before stimuli are presented to VisNet's input layer they are pre-processed by a set of input filters which accord with the general tuning profiles of simple cells in V1. The input filters used are computed by weighting the difference of two Gaussians by a third orthogonal Gaussian, according to:

Γ_xy(ρ, θ, f) = ρ [ exp(−((x cos θ + y sin θ) √2 f)²) − (1/1.6) exp(−((x cos θ + y sin θ) √2 f / 1.6)²) ] exp(−((x sin θ − y cos θ) √2 f / 3)²)    (3)

where f is the filter spatial frequency, θ is the filter orientation, and ρ is the sign of the filter, i.e. ±1. Individual filters are tuned to spatial frequency (0.0625 to 0.125 cycles/pixel); orientation (0 to 135 degrees in steps of 45 degrees); and sign (±1). The number of layer 1 connections to each spatial frequency filter group is given in Table 2:

Frequency        0.125   0.0625
# Connections       13        8

The activation h_i of each neuron i in the network is set equal to a linear sum of the inputs y_j from afferent neurons j weighted by the synaptic weights w_ij. That is,

h_i = Σ_j w_ij y_j    (4)

where y_j is the firing rate of neuron j, and w_ij is the strength of the synapse from neuron j to neuron i.
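The activation sum of equation 4 is a plain weighted linear sum; a one-line sketch (illustrative names, pure Python):

```python
def activations(weights, rates):
    """Equation 4: h_i = sum_j w_ij * y_j. `weights[i][j]` is the synaptic
    strength from afferent neuron j to neuron i, and `rates[j]` is the
    firing rate of afferent neuron j."""
    return [sum(w_ij * y_j for w_ij, y_j in zip(row, rates))
            for row in weights]
```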
Within each layer, competition is graded rather than winner-take-all, and is implemented in two stages. First, to implement lateral inhibition, the activations h of neurons within a layer are convolved with a spatial filter, I, where δ controls the contrast and σ controls the width, and a and b index the distance away from the centre of the filter:

I_ab = −δ exp(−(a² + b²)/σ²)          if a ≠ 0 or b ≠ 0,
I_ab = 1 − Σ_(a,b)≠(0,0) I_ab          if a = 0 and b = 0.    (5)
Typical lateral inhibition parameters are given in Table 3:

Layer           1      2      3      4
Radius, σ     1.38    2.7    4.0    6.0
Contrast, δ    1.5    1.5    1.6    1.4

Next, contrast enhancement is applied by means of a sigmoid activation function:

y = f_sigmoid(r) = 1 / (1 + exp(−2β(r − α)))    (6)

where r is the activation (or firing rate) after lateral inhibition, y is the firing rate after contrast enhancement, and α and β are the sigmoid threshold and slope respectively.
The parameters α and β are constant within each layer, although α is adjusted to control the sparseness of the firing rates. For example, to set the sparseness to, say, 5%, the threshold is set to the value of the 95th percentile point of the activations within the layer. Typical parameters for the sigmoid activation function are shown in Table 4:

Layer           1      2      3      4
Percentile    99.2     98     88     91
Slope β        190     40     75     26

Continuous transformation (CT) learning relies on the spatio-temporal continuity of how objects transform in the real world, combined with continuous enhancement of the feedforward connection weights according to an associative learning rule.
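The contrast-enhancement stage described above, with the sigmoid threshold set at a percentile of the layer's activations, can be sketched as follows (illustrative names; a simplified stand-in for the model's per-layer parameterisation):

```python
import math

def contrast_enhance(rates, percentile=95, beta=40.0):
    """Sigmoid contrast enhancement (equation 6). The threshold alpha is
    set at the given percentile of the activations within the layer, so
    that roughly (100 - percentile)% of the neurons fire strongly."""
    ranked = sorted(rates)
    idx = min(len(ranked) - 1, (percentile * len(ranked)) // 100)
    alpha = ranked[idx]
    return [1.0 / (1.0 + math.exp(-2.0 * beta * (r - alpha)))
            for r in rates]
```

With a steep slope, activations below the threshold are pushed towards zero and those at or above it towards high firing rates, enforcing the desired sparseness.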
At each timestep during training, a transform of an object is presented to the retina. In the early stages of training, the visual stimuli presented to the retina must transform continuously. For translation invariance, this means that the retinal images of a stimulus at successive timesteps must overlap, so that each two successive images have a number of units in the input layer in common. At each timestep, the activity due to the stimulus on the retina is propagated in a feedforward fashion through the network, stimulating patterns of activity in the later layers. Once the activity patterns have been computed in the various layers, the synaptic weights of the forward connections between the layers are updated by an associative learning rule which enhances the synaptic weight between two neurons when they are co-firing. A variety of associative rules are possible. In the simulations with CT learning described in this patent application we use the Hebb learning rule:

Δw_ij = α y_i y_j    (7)

where Δw_ij is the increment in the synaptic weight w_ij, y_i is the firing rate of the post-synaptic neuron i, y_j is the firing rate of the pre-synaptic neuron j, and α is the learning rate. To bound the growth of each neuron's synaptic weight vector, w_i for the ith neuron, its length is normalised at the end of each timestep during training.
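The Hebb update of equation 7, together with the per-neuron weight-vector normalisation, can be sketched as follows (illustrative names, pure Python):

```python
import math

def hebb_update(W, pre_rates, post_rates, alpha=0.1):
    """CT-learning weight update (equation 7): dw_ij = alpha * y_i * y_j,
    followed by explicit normalisation of each post-synaptic neuron's
    weight vector to bound its growth."""
    new_W = []
    for y_i, row in zip(post_rates, W):
        row = [w_ij + alpha * y_i * y_j for w_ij, y_j in zip(row, pre_rates)]
        norm = math.sqrt(sum(w * w for w in row))
        if norm > 0:
            row = [w / norm for w in row]
        new_W.append(row)
    return new_W
```

Note that, unlike the trace rule of the prior art, no memory of previous trials is needed: the association is purely between neurons that are co-firing at the current timestep.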
Given that the hierarchy of neuronal layers utilises competitive learning, during the presentation of a visual image at one position on the retina a small winning set of neurons will modify their connections from the retina so as to respond well to that image in that location. When the same image appears later at nearby locations, so that there is spatial continuity, the same neurons in the next layer up will be activated, because some of the afferents from the retina are the same as when the image was in the first position. However, new afferents responding to the parts of the image which now occupy a new region of the retina will also be active, and will become modified onto the same neurons in the next layer up. In this way, a small set of winning neurons in the next layer up modify their afferent synapses to respond to the same image over a range of positions. This whole process is repeated throughout the network, both horizontally as the image moves on the retina, and hierarchically up through the network. Over a series of stages, transform invariant (e.g. location invariant) representations of images are successfully learned, allowing the network to perform invariant object recognition.
This learning process is illustrated in Figure 2, which illustrates schematically how CT learning would function with a single layer of forward synaptic connections between an input layer and an output layer. Figure 2a shows the initial presentation of the stimulus to the network. The shaded cell in the output layer wins the "competition", and thus the weights between it and the active input cells (also shaded) are strengthened. Figure 2b shows what happens after the stimulus is shifted by a small amount. As many of the active input cells are the same as those at the first timestep, the same output cell wins the competitive process. The two shaded cells to the right, which are inactive at the previous timestep, now have their connections to the active output cell strengthened (as denoted by the broken lines). As can be seen, the process can be continued for subsequent shifts, provided that a large proportion of input cells stays active between individual shifts.
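The single-layer process of Figure 2 can be illustrated with a toy simulation. This is a deliberately simplified sketch, not the patent's implementation: it uses winner-take-all competition instead of graded competition, and small, deterministic near-uniform initial weights so the first competition has a well-defined winner.

```python
import math

def winner(W, x):
    # Competition: the output cell with the highest activation
    # h_i = sum_j w_ij * x_j wins.
    h = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    return max(range(len(h)), key=h.__getitem__)

def ct_step(W, x, alpha=1.0):
    # Strengthen the winner's connections from the currently active
    # inputs (Hebb rule), then normalise its weight vector.
    i = winner(W, x)
    W[i] = [w + alpha * xj for w, xj in zip(W[i], x)]
    n = math.sqrt(sum(w * w for w in W[i]))
    W[i] = [w / n for w in W[i]]
    return i

n_in, n_out = 10, 4
# Near-uniform initial weights (one cell starts marginally strongest).
W = [[0.1 + 0.001 * i] * n_in for i in range(n_out)]
# A 4-cell-wide stimulus translated one cell at a time; successive
# retinal images share three of their four active input cells.
winners = [ct_step(W, [1.0 if s <= j < s + 4 else 0.0 for j in range(n_in)])
           for s in range(n_in - 4)]
```

Because each shifted image shares most of its active inputs with the previous one, the same output cell keeps winning, and its connections from the newly active inputs are strengthened in turn, exactly the chain of events depicted in Figures 2a and 2b.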
A similar CT learning process may operate for other kinds of transform, such as rotation or change in size.
Once limited invariant responses have been learned by the early layers of the network, CT learning in the higher layers can operate with larger (less continuous) transforms of the stimuli between learning updates. This is because, with invariant responses already learned in the lower layers, a relatively large transformation (e.g. translation) will still activate many of the same neurons in the lower layers due to their transform invariant responses. This means the higher layers will still receive similar inputs before and after the stimulus transform.
To train the network a stimulus is presented in a sequence of locations in a square grid across the 128 x 128 input retina. The central location of the square grid is in the centre of the "retina", and the other locations are offset a set number of pixels horizontally and/or vertically from this. At each presentation the activation of individual neurons is calculated, then their firing rates are calculated, and then the synaptic weights are updated. After a stimulus has been presented in all the training locations, a new stimulus is chosen at random and the process repeated. The presentation of the stimuli across all locations constitutes 1 epoch of training. In this manner the network is trained one layer at a time starting with layer 1 and finishing with layer 4.
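The training schedule just described can be sketched as a loop (the `update` callback is a hypothetical stand-in for the feedforward pass plus associative weight update; names are illustrative):

```python
import random

def train(n_layers, stimuli, locations, n_epochs, update):
    """Train one layer at a time, layer 1 first and layer 4 last. Within
    each epoch, every stimulus is presented at every grid location, with
    stimuli chosen in random order, and the weight update applied for
    the layer currently being trained."""
    for layer in range(1, n_layers + 1):
        for _ in range(n_epochs):
            order = list(stimuli)
            random.shuffle(order)          # new stimulus chosen at random
            for stim in order:
                for loc in locations:      # translate across the grid
                    update(layer, stim, loc)

# Example: presentations for 2 face stimuli on a 15 x 15 grid of offsets.
presentations = []
grid = [(dx, dy) for dx in range(-7, 8) for dy in range(-7, 8)]
train(4, ["face_A", "face_B"], grid, n_epochs=10,
      update=lambda layer, stim, loc: presentations.append((layer, stim, loc)))
```

One epoch here is one presentation of each stimulus across all locations; the example performs 4 layers x 10 epochs x 2 stimuli x 225 locations = 18,000 presentations in total.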
Two measures of performance were used to assess the ability of the output layer of the network to develop neurons that are able to respond with view invariance to individual stimuli or objects. A single cell information measure was applied to individual cells in layer 4 and measures how much information is available from the response of a single cell about which stimulus was shown, independently of view. A multiple cell information measure, the average amount of information obtained about which stimulus was shown from the responses of all the cells to a single presentation of a stimulus, enabled measurement of whether, across a population of cells, information about every object in the set was provided.
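A sketch in the spirit of the single cell measure follows: for each stimulus s, the stimulus-specific information I(s, R) = Σ_r P(r|s) log2(P(r|s)/P(r)) is computed and the maximum over stimuli taken. The discretisation of responses and the exact conventions of the original measure are not reproduced here; this is an illustration of the quantity only.

```python
import math

def single_cell_information(p_resp):
    """`p_resp[s][r]` is the probability of (discretised) response r to
    stimulus s, with stimuli assumed equiprobable. Returns the maximum
    stimulus-specific information, in bits, over all stimuli."""
    n_stim = len(p_resp)
    n_resp = len(p_resp[0])
    # Marginal response distribution P(r).
    p_r = [sum(p_resp[s][r] for s in range(n_stim)) / n_stim
           for r in range(n_resp)]
    best = 0.0
    for s in range(n_stim):
        i_s = sum(p * math.log2(p / p_r[r])
                  for r, p in enumerate(p_resp[s]) if p > 0)
        best = max(best, i_s)
    return best
```

A cell that responds perfectly selectively and invariantly to one of two equiprobable stimuli yields 1 bit, the maximum for a two-stimulus set; a cell whose responses are independent of the stimulus yields 0 bits.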
Next, the ability of the new CT learning algorithm to train VisNet to recognise two different face stimuli as they are translated across the retina will be demonstrated. For this test case, the maximum single cell information measure is:

Maximum single cell information = log2(Number of face stimuli),    (8)

where the number of face stimuli is 2. This gives a maximum single cell information measure of 1 bit for this test case.
During training, the face stimuli are translated continuously (i.e. in 1-pixel steps) over a 15 x 15 square grid covering the central region of the retina. However, an important aspect of the CT learning algorithm is that the training can also involve the occasional larger (discontinuous) jump to a quite different location.
Numerical results are presented in Figure 3. On the top are single cell information measures for all top (4th) layer neurons ranked in order of their invariance to the faces, while on the bottom are multiple cell information measures. Results are presented with no training and after 10 training epochs. The simulation results with no training provide a baseline performance with random connection weights, against which to compare network performance after 10 training epochs. It can be seen that with no training, there are no cells which reach the maximum single cell information possible (1 bit).
Also, the multiple cell information measure asymptotes to a sub-optimal value (i.e. less than 1 bit) as the number of cells increases.
From the top plot of Figure 3, it can be seen that after 10 epochs of training, there are a significant number of cells which reach the maximum level of performance (1 bit).
From the bottom plot of Figure 3, it can be seen that after 10 epochs of training, the multiple cell information asymptotes to the maximal value of 1 bit.
Thus, the present invention as described above provides a new algorithm for the unsupervised training of feedforward hierarchical neural networks to perform transform-invariant visual object recognition, which we call "continuous transformation (CT) learning". Unsupervised learning means that the network is not given additional information about the identity of the object or face which is currently being viewed during training. An important advantage of an unsupervised learning mechanism (over supervised algorithms such as backpropagation) is that the network can continue to learn (without human intervention) about visual stimuli such as faces as their appearance changes due to, for example, alterations in hairstyle or aging, etc. This is clearly a capability of the human visual system that would be useful in artificial vision systems.
CT learning is a general learning mechanism, which relies on continuous synaptic enhancement of the feedforward inter-layer connection weights using a Hebbian learning rule during continuous transformation (e.g. translation, rotation, etc.) of the visual stimulus.
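A minimal sketch of such a Hebbian update is given below, assuming a simple rate-based formulation. The post-update weight normalisation is a common convention in competitive networks rather than something specified at this point in the text.

```python
import numpy as np

def hebbian_update(w, x, y, alpha=0.1):
    """Purely associative Hebbian update as used by CT learning.

    w : (n_post, n_pre) feedforward connection weights
    x : (n_pre,)  presynaptic firing rates at the current transform
    y : (n_post,) postsynaptic firing rates (after competition)

    The rule dw_ij = alpha * y_i * x_j uses only instantaneous firing
    rates; there is no memory trace of earlier neural activity.
    Each neuron's weight vector is renormalised to unit length to
    keep the competition stable.
    """
    w = w + alpha * np.outer(y, x)
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, 1e-12)
```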
However, once limited invariant responses have been learned by the early layers of the network, CT learning in the higher layers can operate with larger (less continuous) transforms of the stimuli between learning updates. This is because, with invariant responses already learned in the lower layers, a relatively large transformation (e.g. translation) will still activate many of the same neurons in the lower layers due to their transform invariant responses. This means the higher layers will still receive similar inputs before and after the stimulus transform.
In numerical simulations, CT learning exhibits the following desirable properties: (i) CT learning requires only a few (e.g. 10) training epochs, and (ii) CT learning is able to teach the network to recognise a stimulus at all locations on the retina.
CT learning is a general technique for finding and partitioning data clusters which cover highly elongated continuous regions in the data space. (For example, the continuous transforms of a visual object will form a continuous region of the retinal input space).
As such, CT learning may be used for all kinds of data clustering applications where the data clusters have this elongated form. Data clustering algorithms have a very broad range of applications. One such area is voice recognition, where a neural network must be trained to recognise different parts of speech.
Thus, in summary, a prior art model architecture for transform invariant object recognition in the visual system is based on a series of hierarchical competitive networks with local graded inhibition, convergent connections to each neuron from a topologically corresponding region of the preceding layer, and synaptic plasticity based on a modified Hebb-like learning rule with a temporal trace of each neuron's previous activity (trace learning). The known network is able to learn to respond invariantly to the different transforms of individual stimuli through the use of the Trace Learning (TL) rule. The trace learning rule encourages the neurons to develop invariant responses to input patterns that tend to occur close together in time, because they are likely to be of the same object. This network architecture has been shown to be able to recognise visual stimuli with position, size or rotation invariance. Furthermore, once the network has been trained on a set of stimuli, the network is able to recognise those stimuli when they are either partially occluded or presented against natural cluttered backgrounds.
This is an extremely important capability. However, the above-described trace learning mechanism has allowed this type of model architecture to learn to recognise visual stimuli at only a small number of locations on the retina.
The present invention, on the other hand, enables high capacity to be achieved in such a hierarchical system using an algorithm for learning invariant object recognition, referred to herein as Continuous Transformation (CT) learning. With CT learning, the above-mentioned architecture can remain the same, but there are two important modifications: in the exemplary embodiment described above, CT learning uses a standard Hebbian associative learning rule with instantaneous firing rates, i.e. with no memory trace of neural activity, and there is spatial continuity between one image and the next in the set over which an invariant representation is to be learned.
Given that the hierarchy of neuronal layers utilises competitive learning, during the presentation of a feature combination (which forms part of an object) at one position, winning neurons in the next layer up will learn to modify their connections to respond well to that feature combination. When the same image appears later at nearby locations, so that there is spatial continuity, then the same neurons will be activated because some of the afferents are the same as when the feature combination was in the first position. However, new afferents responding to the features in the slightly translated part of the image space will be active, and will become modified onto the same neurons. In this way, neurons modify their synapses to respond to the same feature combination over a range of positions. This whole process is repeated throughout the network, both horizontally as the image moves on the retina, and hierarchically up through the network. Over a series of stages, transform invariant (e.g. location invariant) representations of high order feature combinations with the correct spatial arrangement of features are successfully learned, allowing the network to perform invariant object recognition.
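The mechanism described above can be illustrated with a toy one-dimensional example: because successive positions of a feature share most of their active afferents, the same winning neuron tends to be reactivated at the next position, and Hebbian learning then spreads its weights onto the newly active afferents. All sizes, the learning rate and the single-winner competition below are illustrative assumptions, not the patent's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def winner(w, x):
    """Index of the most strongly activated neuron (winner-take-all)."""
    return int(np.argmax(w @ x))

# Toy 1-D "retina": a 5-pixel feature translated one pixel at a time.
n_pre, feat, n_post = 20, 5, 10
w = rng.random((n_post, n_pre))                # random initial weights
w /= np.linalg.norm(w, axis=1, keepdims=True)

winners = []
for pos in range(n_pre - feat + 1):            # continuous translation
    x = np.zeros(n_pre)
    x[pos:pos + feat] = 1.0                    # feature at current position
    i = winner(w, x)                           # overlapping afferents keep the
    winners.append(i)                          # same neuron winning next time...
    w[i] += 0.5 * x                            # ...and Hebbian learning recruits
    w[i] /= np.linalg.norm(w[i])               # the new afferents onto it
```

After a few steps the reinforced neuron's response dominates at each successive position, so a single neuron comes to respond to the feature over the whole range of translations, which is the essence of CT learning.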
The method of the present invention provides several significant advantages over the prior art, including: CT learning allows the network to learn to recognise visual stimuli at every location on the retina; the training time required for CT learning is orders of magnitude less than that required for trace learning (in fact, only one or two epochs of training are required for CT learning); and CT learning does not require the different transforms of each visual stimulus to be presented to the network's retina in temporal proximity.
It will be appreciated that the method and apparatus of the present invention are expected to have a wide range of industrial and military applications which involve transform invariant object recognition.
An embodiment of the present invention has been described above by way of example only, and it will be apparent to a person skilled in the art that modifications and variations can be made to the described embodiments without departing from the scope of the invention as defined by the appended claims. Further, in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim.
The term "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The terms "a" or "an" do not exclude a plurality. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that measures are recited in mutually different independent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (14)

  1. Apparatus for automatic entity recognition, the apparatus
    comprising a neural network having a plurality of layers, each layer comprising a plurality of neurons, one or more neurons in one layer becoming activated as a result of stimulation of one or more neurons in another layer in response to an input originating from an entity, the apparatus comprising means for training said apparatus to recognise entities from stimuli applied thereto by performing a training process comprising the steps of: a) applying a series of stimuli at respective trial times T to a first layer of said network so as to cause activation of one or more neurons in another layer; b) applying an associated learning rule to said one or more neurons so as to calculate a synaptic weight associated with said one or more neurons based on stimulation at said first layer at the current trial time; and c) identifying spatial continuity between features of a first set of stimuli and features of a subsequent set of stimuli and causing the same one or more neurons to be activated in response to both said first and said second set of stimuli.
  2. Apparatus according to claim 1, comprising object recognition apparatus or speech recognition apparatus, and wherein the stimuli are visual or oral respectively.
  3. Apparatus according to claim 1 or claim 2, wherein the neural network comprises a hierarchical series of layers of competitive networks.
  4. Apparatus according to claim 3, wherein associatively-modifiable forward connections are provided between the layers.
  5. Apparatus according to claim 3 or claim 4, wherein each layer comprises a plurality of neurons, and wherein forward connections from neurons in a first layer to neurons in a second layer are derived from a topologically corresponding region of the first layer.
  6. Apparatus according to claim 5, wherein said forward connections are derived using a Gaussian distribution of connection probabilities.
  7. Apparatus according to any one of claims 3 to 6, wherein means are provided to pre-process an input layer of the neural network with a set of input filters prior to the application of a series of stimuli to the neural network.
  8. Apparatus according to claim 7, wherein the input filters are computed by weighting the difference of two Gaussians by a third Gaussian.
  9. Apparatus according to claim 7 or claim 8, wherein means are provided for the subsequent application of contrast enhancement by means of a sigmoid activation function.
  10. Apparatus according to any one of claims 1 to 9, wherein elements of an input stimulus applied at two successive trial times overlap in an input layer of the neural network.
  11. Apparatus according to any one of claims 1 to 10, wherein, at each trial time, activity caused by the application of a stimulus on an input layer of the neural network is propagated in a feedforward fashion through the network, thereby stimulating patterns of activity in later layers.
  12. Apparatus according to claim 11, wherein said activity patterns are computed, and then the synaptic weights of the forward connections between layers of said neural network are updated by an associative learning rule.
  13. Apparatus according to claim 12, wherein said associative learning rule comprises the Hebb learning rule.
  14. A method of automatic entity recognition in a neural network having a plurality of layers, each layer comprising a plurality of neurons, one or more neurons in one layer becoming activated as a result of stimulation of one or more neurons in another layer in response to an input originating from an entity, the method comprising training said neural network to recognise entities from stimuli applied thereto by performing a training process comprising the steps of: a) applying a series of stimuli at respective trial times T to a first layer of said network so as to cause activation of one or more neurons in another layer; b) applying an associated learning rule to said one or more neurons so as to calculate a synaptic weight associated with said one or more neurons based on stimulation at said first layer at the current trial time; and c) identifying spatial continuity between features of a first set of stimuli and features of a subsequent set of stimuli and causing the same one or more neurons to be activated in response to both said first and said second set of stimuli.
GB0328830A 2003-12-12 2003-12-12 Visual object recognition system Withdrawn GB2409088A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB0328830A GB2409088A (en) 2003-12-12 2003-12-12 Visual object recognition system
PCT/GB2004/005187 WO2005057474A1 (en) 2003-12-12 2004-12-13 Visual object recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0328830A GB2409088A (en) 2003-12-12 2003-12-12 Visual object recognition system

Publications (2)

Publication Number Publication Date
GB0328830D0 GB0328830D0 (en) 2004-01-14
GB2409088A true GB2409088A (en) 2005-06-15

Family

ID=30130111

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0328830A Withdrawn GB2409088A (en) 2003-12-12 2003-12-12 Visual object recognition system

Country Status (2)

Country Link
GB (1) GB2409088A (en)
WO (1) WO2005057474A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7680748B2 (en) 2006-02-02 2010-03-16 Honda Motor Co., Ltd. Creating a model tree using group tokens for identifying objects in an image

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
US9076195B2 (en) 2013-08-29 2015-07-07 The Boeing Company Methods and apparatus to identify components from images of the components

Non-Patent Citations (2)

Title
Stringer, Simon M. & Rolls, Edmund T., "Invariant Object Recognition in the Visual System with Novel Views of 3D Objects", Neural Computation 14 (2002), pages 2585-2596 *
Stringer, S. M. & Rolls, E. T., "Position invariant recognition in the visual system with cluttered environments", Neural Networks 13 (2000), pages 305-315 *

Cited By (2)

Publication number Priority date Publication date Assignee Title
US7680748B2 (en) 2006-02-02 2010-03-16 Honda Motor Co., Ltd. Creating a model tree using group tokens for identifying objects in an image
US8676733B2 (en) 2006-02-02 2014-03-18 Honda Motor Co., Ltd. Using a model tree of group tokens to identify an object in an image

Also Published As

Publication number Publication date
GB0328830D0 (en) 2004-01-14
WO2005057474A1 (en) 2005-06-23

Similar Documents

Publication Publication Date Title
Delorme et al. Face identification using one spike per neuron: resistance to image degradations
Sirosh et al. Cooperative self-organization of afferent and lateral connections in cortical maps
Deco et al. A neurodynamical cortical model of visual attention and invariant object recognition
Cao et al. How does the brain rapidly learn and reorganize view-invariant and position-invariant object representations in the inferotemporal cortex?
Chang et al. Where’s Waldo? How perceptual, cognitive, and emotional brain processes cooperate during learning to categorize and find desired objects in a cluttered scene
Stringer et al. Self-organising continuous attractor networks with multiple activity packets, and the representation of space
Ji et al. Where-what network 1:“Where” and “What” assist each other through top-down connections
Stringer et al. Position invariant recognition in the visual system with cluttered environments
Rolls et al. Invariant object recognition in the visual system with error correction and temporal difference learning
Soares et al. Pyramidal neural networks with evolved variable receptive fields
Cios et al. Image recognition neural network: IRNN
Perry et al. Spatial vs temporal continuity in view invariant visual object recognition learning
Stringer et al. Learning transform invariant object recognition in the visual system with multiple stimuli present during training
Guo et al. Cross-domain and within-domain synaptic maintenance for autonomous development of visual areas
Eguchi et al. Computational modeling of the neural representation of object shape in the primate ventral visual system
GB2409088A (en) Visual object recognition system
Webb et al. Deformation-specific and deformation-invariant visual object recognition: pose vs. identity recognition of people and deforming objects
Eguchi et al. Neural network model develops border ownership representation through visually guided learning
Draye et al. Emergence of clusters in the hidden layer of a dynamic recurrent neural network
Spratling et al. Learning synaptic clusters for nonlinear dendritic processing
Spoerer et al. A computational exploration of complementary learning mechanisms in the primate ventral visual pathway
Földiák Learning constancies for object perception
Pan A neural network approach for Chinese character recognition
Nadh et al. A pong playing agent modelled with massively overlapping cell assemblies
Caudell et al. Retrospective learning of spatial invariants during object classification by embodied autonomous neural agents

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)