GB2236608A - Digital neural networks - Google Patents

Digital neural networks

Info

Publication number
GB2236608A
GB2236608A GB8922528A GB8922528A GB2236608A GB 2236608 A GB2236608 A GB 2236608A GB 8922528 A GB8922528 A GB 8922528A GB 8922528 A GB8922528 A GB 8922528A GB 2236608 A GB2236608 A GB 2236608A
Authority
GB
United Kingdom
Prior art keywords
output signal
function
neural
neural processor
processor according
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB8922528A
Other versions
GB8922528D0 (en)
GB2236608B (en)
Inventor
David John Myers
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
British Telecommunications PLC
Original Assignee
British Telecommunications PLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by British Telecommunications PLC filed Critical British Telecommunications PLC
Priority to GB8922528A priority Critical patent/GB2236608B/en
Publication of GB8922528D0 publication Critical patent/GB8922528D0/en
Publication of GB2236608A publication Critical patent/GB2236608A/en
Application granted granted Critical
Publication of GB2236608B publication Critical patent/GB2236608B/en
Priority to HK132796A priority patent/HK132796A/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

A digital neuron receiving inputs X1-Xn includes weighting elements W1-Wn, a processor P, and a non-linear compressing element C. Element C uses a piecewise linear approximation to the sigmoidal neuron activation function, which maps compactly into a digital integrated circuit realisation, and involves slopes of powers of two so that it may be implemented by shifting the lower order bits of the neuron output in dependence upon the higher order bits.

Description

DIGITAL NEURAL NETWORKS

This invention relates to digital neurons, and to neural networks comprising a plurality of such neurons, particularly but not exclusively for pattern recognition.
A neuron in this context is a circuit (realised with electrical or optical components) which receives a plurality of inputs and produces an output corresponding to a function (e.g. the sum) of the inputs, each weighted by a respective weighting factor derived during a training phase.
Attempts were made to inter-connect such neurons in layers, the output of one forming an input to another, so as to form a net. However, it was found that the effect of so doing was exactly equivalent to that of a single layer of neurons, unless a non-linear compression stage was included between the layers.
In the implementation of neural networks, there is thus generally a requirement for a non-linear activation function at the output of each neuron. This may take a number of forms, including the simple threshold function, but the most popular activation function is the sigmoid function, given by:

    y = 1 / (1 + e^-x)    (1)

In the reported analogue integrated circuit implementations of neural nets, the non-linearity is usually implicit in one of the other neuron operations, for example in the analogue multiplier [2], or by allowing the analogue summing amplifier to go into saturation and 'hit the rails' [3]. Thus the nature of the non-linear function is rather uncontrolled.
In digital Very Large Scale Integration (VLSI) implementations, by contrast, the activation function can be specified with arbitrary precision. A number of possibilities exist for evaluating the function of eqn. 1. It could be evaluated by summing a truncated series expansion, which is likely to be slow in terms of computation time, or by using a table look-up scheme in which the y value (output value) associated with each x value (input value) is stored in memory (e.g. Random Access Memory (RAM) or Read-Only Memory (ROM)), and the x value is used to address the memory. If x is a 16 bit number and y is an 8 bit number, a simple look-up scheme would require 64 KBytes of memory: this is clearly impractical, in terms of silicon area required, for each neuron of a VLSI neural network, since such a network needs a very large number of neurons. One alternative would be to use a piecewise linear approximation scheme, in which the breakpoints are stored in memory, and table look-up is combined with linear interpolation.
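By way of illustration only (this listing is not part of the patent text), the following Python sketch models such a breakpoint scheme: a handful of stored (x, y) pairs replaces the 64 KByte table, at the cost of an interpolation step. The breakpoint values used here are those of the modified curve introduced later (Table 1).

    import bisect

    # Stored breakpoints (x, y) approximating the sigmoid; values taken from
    # the 'modified curve' column of Table 1 below.
    BREAKPOINTS = [(-8.0, 0.0), (-4.0, 0.0625), (-2.0, 0.125), (-1.0, 0.25),
                   (1.0, 0.75), (2.0, 0.875), (4.0, 0.9375), (8.0, 1.0)]

    def piecewise_lookup(x):
        """Table look-up combined with linear interpolation between breakpoints."""
        xs = [p[0] for p in BREAKPOINTS]
        i = bisect.bisect_right(xs, x)
        if i == 0:
            return BREAKPOINTS[0][1]      # clamp below x = -8
        if i == len(BREAKPOINTS):
            return BREAKPOINTS[-1][1]     # clamp above x = +8
        (x0, y0), (x1, y1) = BREAKPOINTS[i - 1], BREAKPOINTS[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    assert piecewise_lookup(0.0) == 0.5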
According to the invention there is provided a neural processor comprising means for receiving an input signal and means for producing a digital output signal corresponding to a function of the input signal and a weight vector associated with the processor, further comprising means for applying a non-linear function to said output signal to produce a digital compressed output signal, wherein the function consists of a plurality of inclined linear segments with different slopes, the slope of each said segment being equal to a^n, where n is an integer and a is the base of the digital output signal.
This method of executing the activation function can be realised as a combination of a small number of simple logic gates (and shift stages), requiring no large storage table or repetitive calculation, and hence using less silicon area. Scaling to the different slopes is easy to achieve by left or right logical shifts, since the slopes are powers of 2 (in binary arithmetic). It is inspired by the piecewise linear approximation used to implement A-law companding for Pulse Code Modulation (PCM) systems [4]. Other aspects of the invention are as claimed or described herein, with advantages which will be apparent from the following.
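As a concrete illustration of the shift trick (a minimal sketch, not the patented circuit itself): multiplying an integer by a slope of 2^n reduces to a single shift, so no multiplier is needed.

    def scale_by_power_of_two(x, n):
        """Multiply integer x by 2**n using shifts only (n may be negative)."""
        return x << n if n >= 0 else x >> -n

    assert scale_by_power_of_two(40, -3) == 5    # 40 * 1/8
    assert scale_by_power_of_two(5, 2) == 20     # 5 * 4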
One embodiment of the invention will now be illustrated, by way of example, with reference to the accompanying drawings, in which:
Figure 1 shows schematically a generalised digital neuron known in the art;
Figure 2a shows the general form of the sigmoid excitation function, and of a standard piecewise-linear A-law curve;
Figure 2b shows an exemplary function curve according to a preferred embodiment of the invention, together with the sigmoid excitation function;
Figure 3 shows schematically part of one example of a circuit for generating a non-linear function in a neuron according to the invention;
Figure 4 shows schematically a further part of that exemplary circuit;
Figure 5 shows part of the circuit of Figure 3 in greater detail;
Figure 6 shows one example of a circuit comprising an element of the circuit of Figure 5; and
Figure 7 shows a circuit, for use with that shown in Figures 3-6, for generating the gradient of the function shown in Figure 2b.
Referring to Figure 1, a digital neuron well known in the art comprises at least one input (X1, X2, X3 ... Xn), means for scaling each input according to the value of a weight (W1, W2, W3 ... Wn), and a processor P which produces a binary digital output Y which is a function (e.g. the sum) of these scaled or weighted inputs.
As discussed above, it is also in general necessary to provide means C for applying a non-linear function to compress the range of the output Y, to give a compressed output. This (together with other neuron outputs) is connected to form an input to a further neuron.
During a training phase, training data signal vectors are presented as input to the neuron, and the weights are altered (e.g. incremented) by training means (not shown) in dependence upon, amongst other things, the derivative of the (compressed) output.
In our initial experiments a 13 segment piecewise-linear A-law curve was used, scaled such that values of neuron output x in the range -8 to +8 mapped to values of compressed output y in the range 0 to +1, to approximate the sigmoid function. Fig. 2a shows a plot of the scaled A-law curve, and for comparison the sigmoid function of eqn. 1. The nature of the piecewise linear A-law curve is explained in Reference 4, for example. However, simulations using this approximation indicated that it did not perform well when used to train Multi-layer Perceptron (MLP) networks using the Backpropagation Algorithm. This is because in the region around x = 0, the first derivative (slope) of the A-law linear curve is much higher than that of the sigmoid function. As a result, when the sigmoid was replaced by the A-law curve, the training was unstable, and failed to converge. It could be made to converge by reducing the learning rate [1], but this resulted in very long learning times compared to simulations of the same network utilising the sigmoid function. It was found that the A-law curve could however be used in the recognition mode for nets trained (i.e. weight values derived) using the sigmoid function without serious degradation in the performance of the net.
A modified curve was therefore developed, similar to the A-law curve in that the gradient of each section can be expressed as a power of 2. This curve, shown in Fig. 2b, has only 7 segments and is a better approximation to the sigmoid function, which is also shown in Fig. 2b for comparison. Note that at x = 0 the modified curve and the sigmoid have the same gradient (= 0.25). Table 1 compares the breakpoints of the A-law and the modified curves.
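(For completeness, the following is a standard identity rather than text from the patent: differentiating eqn. 1 gives dy/dx = y(1 - y), so at x = 0, where y = 1/2, the sigmoid gradient is (1/2)(1 - 1/2) = 1/4 = 2^-2, exactly the slope of the modified curve's central segment.)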
The breakpoints are conveniently at input values which are powers of 2; this enables the particular segment of the curve to which a given input value corresponds to be determined from the higher order bits alone.
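The following Python sketch (a behavioural reconstruction from the breakpoints of Table 1, not circuitry from the patent) shows the resulting 7-segment function; each segment start and each slope is a power of 2:

    def modified_curve(x):
        """7-segment piecewise-linear sigmoid approximation (Fig. 2b)."""
        if x < 0:
            return 1.0 - modified_curve(-x)   # curve is symmetric about (0, 0.5)
        # (segment start, y at start, slope); slopes are all powers of two
        for x0, y0, slope in [(4.0, 0.9375, 1 / 64), (2.0, 0.875, 1 / 32),
                              (1.0, 0.75, 1 / 8), (0.0, 0.5, 1 / 4)]:
            if x >= x0:
                return min(1.0, y0 + slope * (x - x0))

    assert modified_curve(0.0) == 0.5
    assert modified_curve(1.0) == 0.75
    assert modified_curve(-2.0) == 0.125
    assert modified_curve(8.0) == 1.0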
Simulation of the MLP with backpropagation using the modified piecewise linear approximation indicates that it gives comparable performance to simulations based on the true sigmoid, and it is thus useful in trainable neurons and nets thereof.
Table 2 shows a possible truth table for the modified piecewise linear function corresponding to positive values of x, aimed at mapping a 16 bit 2's complement input value (I0-I15) in the range -8 to +8 to an 8 bit 2's complement output value (R0-R7) in the range 0 to 1.
The negative half of the function corresponds in an obvious fashion. This could be used in a digital neural net system with 8 bit data word representation, which allows an additional 8 bits internal to each neuron for bit growth.
One possible hardware implementation of the truth table of Table 2 is shown in Fig. 3. This consists of a 'most significant 1' selector circuit 1, which takes the most significant bits I11-I14 of the neuron output as input, and outputs a 4 bit word Y11-Y14 containing a 1 at the position of the most significant 1 in the input, or else outputs 0 where there is no 1 in I11-I14.
Signals Y11-Y14 of the neuron output are decoded to produce compressed output bits R4-R5. Bit R6 = 1 because the output is always greater than 0.5 in the positive quadrant of the function (shown in Fig. 2b), and bit R7 = 0 because the output is always positive. R7 is therefore ignored (at this stage). Signals Y12-Y14 are also used to control a shifting circuit 2 comprising a bank of 2:1 multiplexer circuits 2a, 2b, 2c, which controllably shift bits of the input word to provide the lower order compressed output bits R0-R3. These multiplexers connect the upper right input to the output if the control signal is 0, and the lower right input to the output otherwise.
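To make the decode-and-shift behaviour concrete, here is a Python model of the positive-half data path (a reading of Table 2's bit groupings, not the gate-level circuit). The input word i represents x = i/4096 and the output word represents y = R/128:

    def compress_positive(i):
        """Map a positive 16 bit word (x = i/4096) to 8 bits (y = r/128)."""
        assert 0 <= i < 2 ** 15          # positive half only (I15 = 0)
        top = (i >> 11) & 0xF            # I14..I11 select the segment
        if top == 0:                     # x < 0.5      : R = 0100 abcd
            return 0x40 | (i >> 7) & 0xF
        if top == 1:                     # 0.5 <= x < 1 : R = 0101 abcd
            return 0x50 | (i >> 7) & 0xF
        if top < 4:                      # 1 <= x < 2   : R = 0110 abcd
            return 0x60 | (i >> 8) & 0xF
        if top < 8:                      # 2 <= x < 4   : R = 0111 0abc
            return 0x70 | (i >> 10) & 0x7
        return 0x78 | (i >> 11) & 0x7    # 4 <= x < 8   : R = 0111 1abc

    assert compress_positive(0) == 0x40            # x = 0 -> y = 0.5
    assert compress_positive(4096) == 0x60         # x = 1 -> y = 0.75
    assert compress_positive(2 ** 15 - 1) == 0x7F  # x -> 8, y -> 1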
Fig. 4 shows one way in which the circuit of Fig. 3 can be generalised to cover both positive and negative values of x. If I15 = 1 the neuron output is negative, and so the input is 2's complemented by a first complementing means 3a before being applied to the circuit of Fig. 3.
The compressed output of the circuit of Fig. 3 (i.e. bits R0-R6) is also passed through second 2's complementing means 3b, which has the effect of mirroring the activation curve about the line y = 0.5, and results in the correct value being output for negative inputs. For positive inputs, both 2's complement circuits 3a, 3b are bypassed.
I15 is used as a control bit to switch 3a and 3b in or out. One's complementing could also be used, at the expense of a slight loss of accuracy, allowing a simpler logic implementation of circuits 3a and 3b.
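A sketch of this arrangement in the same Python model, building on compress_positive above (again a reconstruction; the clamping of the extreme value -8.0, which has no 16 bit positive counterpart, is an added assumption):

    def compress(i16):
        """Full activation: i16 is a raw 16 bit two's complement word."""
        if i16 & 0x8000:                          # I15 = 1: negative input
            mag = min((-i16) & 0xFFFF, 0x7FFF)    # first complementer (3a)
            return (128 - compress_positive(mag)) & 0xFF  # second (3b)
        return compress_positive(i16)             # 3a and 3b bypassed

    assert compress(4096) == 96                   # x = +1 -> y = 0.75
    assert compress((-4096) & 0xFFFF) == 32       # x = -1 -> y = 0.25

Here 128 - r is exactly the 7 bit two's complement of r, which is the mirroring about y = 0.5 described above.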
Referring to Figure 5, the 'select most significant 1' circuit takes I11-I14 as input, and outputs a 4 bit word Y11-Y14 containing a 1 at the position of the most significant 1 in the input, or else outputs 0. This can be described by the following truth table (Table 3):
Table 3

    I14 I13 I12 I11 | Y14 Y13 Y12 Y11
     0   0   0   0  |  0   0   0   0
     0   0   0   1  |  0   0   0   1
     0   0   1   X  |  0   0   1   0
     0   1   X   X  |  0   1   0   0
     1   X   X   X  |  1   0   0   0

    (X = don't care)

This may be implemented in a number of ways. A modular implementation is shown in Figure 5. Each of the modules 1a, 1b, 1c, 1d shown in this figure has two inputs: an input Ii and an input Pi+1. Pi+1 indicates whether there have been any more significant (previous) 1's input, i.e. Pi+1 = 1 if any Ij = 1 where j > i. Each module has two outputs, Yi and Pi. The truth table for each module is as follows (Table 4):

    Pi+1 Ii | Yi Pi
     0    0 |  0  0
     0    1 |  1  1
     1    0 |  0  1
     1    1 |  0  1

The required functions are thus:

    Pi = Ii + Pi+1     (= (Ii'.Pi+1')', as shown in Figure 6)
    Yi = Ii . Pi+1'    (= (Ii' + Pi+1)', as shown in Figure 6)

This suggests one simple implementation of each module 1a, 1b, 1c, 1d using the NOR, NAND and NOT gates shown in Figure 6. Other logically equivalent circuits are easily derived. As can be seen in Figure 5, P15 is set to 0, and P11 is a signal that is 0 if none of the inputs I11-I14 are equal to 1.
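In software terms (an illustrative Python rendering of the cascade, equivalent to the Boolean forms above rather than a gate-level description):

    def select_most_significant_1(bits):
        """bits = [I14, I13, I12, I11]; returns ([Y14..Y11], P11)."""
        p = 0                         # P15 is tied to 0
        y = []
        for i_bit in bits:            # most significant module first
            y.append(i_bit & ~p & 1)  # Yi = Ii . Pi+1'
            p |= i_bit                # Pi = Ii + Pi+1
        return y, p                   # final carry-out is P11

    assert select_most_significant_1([0, 1, 1, 0]) == ([0, 1, 0, 0], 1)
    assert select_most_significant_1([0, 0, 0, 0]) == ([0, 0, 0, 0], 0)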
A further feature of the invention is that from outputs Y11-Y14 and P11 it is possible to derive a value for the gradient of the activation function, in other words the first derivative of the compressed output, with very little additional circuit overhead. This is of course particularly useful during the training phase, since it is required by the back-propagation algorithm.
The table below relates values of Y11-Y14 and P11 to the gradient of the curve, expressed as a 2's complement fractional binary number G0-G6.
Table 5
    Y14 Y13 Y12 Y11 P11 | G6 G5 G4 G3 G2 G1 G0
     1   0   0   0   1  |  0  0  0  0  0  0  1    (1/64)
     0   1   0   0   1  |  0  0  0  0  0  1  0    (1/32)
     0   0   1   0   1  |  0  0  0  1  0  0  0    (1/8)
     0   0   0   1   1  |  0  0  1  0  0  0  0    (1/4)
     0   0   0   0   0  |  0  0  1  0  0  0  0    (1/4)

This gives the simple implementation shown in Figure 7.
The gradient value output from this circuit (G0-G6) is valid regardless of the sign of the original input.
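A one-line-per-row Python equivalent of Table 5 (illustrative only; it returns the fraction that the G word encodes rather than the bit pattern):

    def gradient(y14, y13, y12, y11):
        """Segment gradient per Table 5; the last two rows share 1/4."""
        if y14:
            return 1 / 64
        if y13:
            return 1 / 32
        if y12:
            return 1 / 8
        return 1 / 4                  # Y11 = 1, or no 1 found (P11 = 0)

    assert gradient(0, 1, 0, 0) == 1 / 32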
The circuit of Figs. 3 to 6 can be compactly and elegantly implemented in VLSI, using only simple logic functions which operate at extremely high rates compared to arithmetic processes, and which occupy little chip area. Given the high operating speed, one potential application is in video image recognition, e.g. for a hybrid video coder (for example for use in a video phone). Other applications include speech conversion, industrial robot vision and natural language computing, and the principles behind it can easily be extended to other combinations of input and output wordlength.
It will be understood that the concept of the invention is applicable also to digital optical neural networks, and that it functions analogously with tri-state or other multi-level logic-state arithmetic; for an arithmetic system to base 'a', the slopes of the linear segments are powers of 'a' and the segment breakpoints are powers of 'a'.

References

1. RUMELHART, D.E., and McCLELLAND, J.L. (Eds.): 'Parallel Distributed Processing', Vol. 1 (The MIT Press, Cambridge, Mass., 1986).
2. SCHWARTZ, D.B., and HOWARD, R.E.: 'A Programmable Analog Neural Network Chip', Proc. IEEE Custom Integrated Circuits Conference, 1988.
3. MULLER, P., et al.: 'A General Purpose Analog Neural Computer', Proc. IEEE/INNS Int. Joint Conf. on Neural Networks, 1989.
4. SMITH, D.R.: 'Digital Transmission Systems' (Van Nostrand Reinhold, 1985), pp. 78-88.
Tables

Table 1. Breakpoints of the A-law and modified piecewise linear curves.
Table 2. Piecewise linear activation function truth table (corresponding to Figure 2b) for positive values of input.
Table 3. Truth table for circuit of Figure 5.
Table 4. Truth table for circuit of Figure 6.
Table 5. Truth table for circuit of Figure 7.
BREAKPOINTS

    SCALED A-LAW        MODIFIED CURVE
      x       y           x       y
    -8.0    0.0         -8.0    0.0
    -4.0    0.0625      -4.0    0.0625
    -2.0    0.125       -2.0    0.125
    -1.0    0.1875      -1.0    0.25
    -0.5    0.25          -       -
    -0.25   0.3125        -       -
    -0.125  0.375         -       -
     0.125  0.625         -       -
     0.25   0.6875        -       -
     0.5    0.75          -       -
     1.0    0.8125       1.0    0.75
     2.0    0.875        2.0    0.875
     4.0    0.9375       4.0    0.9375
     8.0    1.0          8.0    1.0
    I15 I14 I13 I12 I11 I10 I9 I8 I7 I6 I5 I4 I3 I2 I1 I0 | R7 R6 R5 R4 R3 R2 R1 R0
     0   0   0   0   0   a  b  c  d  x  x  x  x  x  x  x  |  0  1  0  0  a  b  c  d
     0   0   0   0   1   a  b  c  d  x  x  x  x  x  x  x  |  0  1  0  1  a  b  c  d
     0   0   0   1   a   b  c  d  x  x  x  x  x  x  x  x  |  0  1  1  0  a  b  c  d
     0   0   1   a   b   c  x  x  x  x  x  x  x  x  x  x  |  0  1  1  1  0  a  b  c
     0   1   a   b   c   x  x  x  x  x  x  x  x  x  x  x  |  0  1  1  1  1  a  b  c

Claims (14)

  1. A neural processor comprising means for receiving an input signal and means for producing a digital output signal corresponding to a function of the input signal and a weight value associated with the processor, further comprising means for applying a non-linear function to said output signal to produce a digital compressed output signal, wherein the function consists of a plurality of inclined linear segments with different slopes, the slope of each said segment being equal to a^n, where n is an integer and a is the base of the digital output signal.
  2. A neural processor according to claim 1, in which the intersections of the said segments occur at digital output signal values of a^m, where m is an integer and a is the base of the digital output signal.
  3. A neural processor according to claim 1 or claim 2, in which the said digital output is a binary number, the base a being 2.
  4. A neural processor according to any preceding claim, in which the slope of the function around its centre approximates that of a sigmoid function.
  5. A neural processor according to claim 4, in which the said slope of the function around its centre is 2^-2.
  6. A neural processor according to claim 4 or claim 5, including means for altering the said weight value in dependence upon the compressed output.
  7. A neural processor according to any preceding claim, in which the non-linear function means comprises shifting means for logically shifting at least some lower order digits of said output signal, in dependence upon the value of the higher order digits of said output signal, the said digits thus shifted forming digits of the compressed output signal.
  8. A neural processor according to claim 7, in which the non-linear function means further comprises a logic gate circuit connected to receive said high order digits and to provide an output for controlling said shifting means.
  9. A neural processor according to claim 8, in which the logic gate circuit is also for producing the high order digits of said compressed output signal.
  10. A neural processor according to claim 8 or claim 9, in which the logic gate circuit further includes means for generating a slope signal indicating the slope of the function corresponding to the said output signal.
  11. A neural network comprising a plurality of neural processors according to any preceding claim connected so that the compressed output signals of some processors may form the input signals to others.
  12. A pattern recognition device comprising a neural network according to claim 11.
  13. A visual pattern recognition device comprising a network according to claim 11 arranged to receive signals derived from a video signal as inputs.
  14. A neural processor substantially as described herein.
GB8922528A 1989-10-06 1989-10-06 Digital neural networks Expired - Fee Related GB2236608B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB8922528A GB2236608B (en) 1989-10-06 1989-10-06 Digital neural networks
HK132796A HK132796A (en) 1989-10-06 1996-07-25 Digital neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB8922528A GB2236608B (en) 1989-10-06 1989-10-06 Digital neural networks

Publications (3)

Publication Number Publication Date
GB8922528D0 GB8922528D0 (en) 1989-11-22
GB2236608A true GB2236608A (en) 1991-04-10
GB2236608B GB2236608B (en) 1993-08-18

Family

ID=10664161

Family Applications (1)

Application Number Title Priority Date Filing Date
GB8922528A Expired - Fee Related GB2236608B (en) 1989-10-06 1989-10-06 Digital neural networks

Country Status (2)

Country Link
GB (1) GB2236608B (en)
HK (1) HK132796A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2254177A (en) * 1991-03-08 1992-09-30 Haneef Akhter Fatmi Neural machine
EP0546624A1 (en) * 1991-12-11 1993-06-16 Laboratoires D'electronique Philips S.A.S. Data processing system operating with piecewise non-linear function
FR2685109A1 (en) * 1991-12-11 1993-06-18 Philips Electronique Lab NEURAL DIGITAL PROCESSOR OPERATING WITH APPROXIMATION OF NONLINEAR ACTIVATION FUNCTION.
US5796925A (en) * 1991-12-11 1998-08-18 U.S. Philips Corporation Neural digital processor utilizing an approximation of a non-linear activation function
US5625753A (en) * 1992-04-29 1997-04-29 U.S. Philips Corporation Neural processor comprising means for normalizing data
US5537513A (en) * 1992-04-29 1996-07-16 U.S. Philips Corporation Neural processor which can calculate a norm or a distance
US5548686A (en) * 1992-04-29 1996-08-20 U.S. Philips Corporation Neural comprising means for calculating a norm or a distance
EP0652525A2 (en) * 1993-11-09 1995-05-10 AT&T Corp. High efficiency learning network
EP0652525A3 (en) * 1993-11-09 1995-12-27 At & T Corp High efficiency learning network.
EP1006437A1 (en) * 1998-11-30 2000-06-07 TELEFONAKTIEBOLAGET L M ERICSSON (publ) Digital value processor for estimating the square of a digital value
WO2000033174A1 (en) * 1998-11-30 2000-06-08 Telefonaktiebolaget Lm Ericsson (Publ) Digital value processor
US6463452B1 (en) 1998-11-30 2002-10-08 Telefonaktiebolaget Lm Ericsson Digital value processor
US9292790B2 2012-11-20 2016-03-22 Qualcomm Incorporated Piecewise linear neuron modeling
US9477926B2 (en) 2012-11-20 2016-10-25 Qualcomm Incorporated Piecewise linear neuron modeling
CN108154224A (en) * 2018-01-17 2018-06-12 北京中星微电子有限公司 For the method, apparatus and non-transitory computer-readable medium of data processing

Also Published As

Publication number Publication date
HK132796A (en) 1996-08-02
GB8922528D0 (en) 1989-11-22
GB2236608B (en) 1993-08-18

Similar Documents

Publication Publication Date Title
US6151594A (en) Artificial neuron and method of using same
US4972363A (en) Neural network using stochastic processing
US5506797A (en) Nonlinear function generator having efficient nonlinear conversion table and format converter
US4967388A (en) Truncated product partial canonical signed digit multiplier
Ramamoorthy et al. Bit-serial VLSI implementation of vector quantizer for real-time image coding
GB2236608A (en) Digital neural networks
KR920006793B1 (en) Learning machine
US5857178A (en) Neural network apparatus and learning method thereof
Hopfield The effectiveness of analogue 'neural network' hardware
US5956264A (en) Circuit arrangement for digital multiplication of integers
Abut et al. Vector quantizer architectures for speech and image coding
JPH06314185A (en) Variable logic and arithmetic unit
US5778153A (en) Neural network utilizing logarithmic function and method of using same
KR100326746B1 (en) System and method for approximating nonlinear functions
JP3172278B2 (en) Neural network circuit
US5781128A (en) Data compression system and method
Michel et al. Enhanced artificial neural networks using complex numbers
US20050033785A1 (en) Random number string output apparatus, random number string output method, program, and information recording medium
Vincent Finite Wordlength, Integer Arithmetic Multilayer Perceptron Modelling for Hardware Realization
Bochev Distributed arithmetic implementation of artificial neural networks
Bermak et al. VLSI implementation of a neural network classifier based on the saturating linear activation function
CA2135858A1 (en) Artificial neuron using adder circuit and method of using same
McGinnity et al. Novel architecture and synapse design for hardware implementations of neural networks
CN115952847A (en) Processing method and processing device of neural network model
Kwan Multiplierless designs for artificial neural networks

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20021006