CN112540946A - Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor - Google Patents


Info

Publication number: CN112540946A
Authority: CN (China)
Prior art keywords: processing unit, reconfigurable, neural network, processing units, data
Legal status: Pending
Application number: CN202011511272.XA
Other languages: Chinese (zh)
Inventors: 尹首一 (Shouyi Yin), 邓大峥 (Dazheng Deng), 谷江源 (Jiangyuan Gu), 韩慧明 (Huiming Han), 刘雷波 (Leibo Liu), 魏少军 (Shaojun Wei)
Current assignee: Tsinghua University
Original assignee: Tsinghua University
Application filed by Tsinghua University
Priority: CN202011511272.XA

Classifications

    • G06F 15/167: Interprocessor communication using a common memory, e.g. mailbox
    • G06F 15/177: Initialisation or configuration control
    • G06F 9/30007: Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/3867: Concurrent instruction execution, e.g. pipeline, look ahead, using instruction pipelines
    • G06N 3/048: Activation functions (neural networks)
    • G06N 3/10: Interfaces, programming languages or software development kits, e.g. for simulating neural networks

Abstract

An embodiment of the invention provides a reconfigurable processor and a method for calculating a plurality of neural network activation functions on the reconfigurable processor. The method comprises the following steps: splitting a neural network activation function into basic operations; and, according to the calculation order of the basic operations in the activation function, reading input data from a shared memory through the reconfigurable processing array of the reconfigurable processor to realize each basic operation in sequence. The processing units on the peripheral edge of the reconfigurable processing array execute memory-access operations and are called memory-access processing units; the processing units other than those on the peripheral edge execute arithmetic operations and are called arithmetic processing units. Each peripheral processing unit exchanges data with the arithmetic processing units in the row or column where it is located, and each processing unit in the array exchanges data with whichever adjacent processing units exist above, below, to the left of, and to the right of it.

Description

Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor
Technical Field
The invention relates to the technical field of reconfigurable processors, and in particular to a reconfigurable processor and a method for calculating various neural network activation functions on a reconfigurable processor.
Background
In recent years, with the development of artificial intelligence, cloud computing, big data, and related technologies, the demand for computation keeps growing, and so does the demand on chip performance. However, as feature sizes shrink and Moore's law approaches its physical limits, raw integrated-circuit performance is difficult to keep increasing, so chip design must shift from pursuing raw performance to pursuing energy efficiency and flexibility. Domain-specific chip architectures, which can be optimized for a particular field, have therefore become the mainstream of chip design, and high performance, a high energy-efficiency ratio, and high flexibility have become its key indices.
Meanwhile, as neural networks continue to develop, network structures and activation functions keep changing. For a dedicated ASIC neural network accelerator, once the network structure or the activation function changes, the acceleration effect degrades to some extent, and the accelerator may no longer suit the new network at all.
Disclosure of Invention
The embodiment of the invention provides a method for calculating various neural network activation functions on a reconfigurable processor, which aims to solve the technical problem in the prior art that an ASIC neural network accelerator's acceleration effect degrades once the network structure or the activation function changes. The method comprises the following steps:
splitting a neural network activation function into basic operations;
according to the calculation order of the basic operations in the neural network activation function, reading input data from a shared memory through a reconfigurable processing array of the reconfigurable processor to realize each basic operation in sequence, wherein the processing units on the peripheral edge of the reconfigurable processing array execute memory-access operations and are called memory-access processing units; the processing units other than those on the peripheral edge execute arithmetic operations and are called arithmetic processing units; each peripheral processing unit exchanges data with the arithmetic processing units in its row or column; and each processing unit in the array exchanges data with whichever adjacent processing units exist above, below, to the left of, and to the right of it.
The embodiment of the invention also provides a reconfigurable processor for realizing the calculation of the various neural network activation functions, likewise to solve the technical problem that an ASIC neural network accelerator's acceleration effect degrades after the network structure or the activation function changes. The reconfigurable processor includes:
a shared memory for storing input data;
a reconfigurable processing array for reading input data from the shared memory to realize, in sequence and according to their calculation order, each basic operation into which a neural network activation function has been split, wherein the processing units on the peripheral edge of the reconfigurable processing array execute memory-access operations and are called memory-access processing units; the processing units other than those on the peripheral edge execute arithmetic operations and are called arithmetic processing units; each peripheral processing unit exchanges data with the arithmetic processing units in its row or column; and each processing unit in the array exchanges data with whichever adjacent processing units exist above, below, to the left of, and to the right of it.
In the embodiment of the invention, the neural network activation function is split into basic operations, and the reconfigurable processing array then reads input data from the shared memory to realize each basic operation in sequence according to its calculation order in the function. The operation of the activation function is thus realized on the existing reconfigurable processing array structure, without changing that structure and without adding any circuit to it: different processing units in the array are simply configured to perform the corresponding operations according to the algorithmic requirements of different activation functions, so that complex activation-function operations are built out of basic operations such as addition, subtraction, multiplication, and shifts. This helps simplify the circuit design for activation-function operations and helps raise the circuit's operation speed and throughput. Since the operation algorithm of the processing units can be configured flexibly and a pipelined input/output mode is used, the scheme also accommodates the operations of different, changing activation functions, is extensible, and helps improve the utilization of the processing units.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart of a method for calculating a plurality of neural network activation functions on a reconfigurable processor according to an embodiment of the present invention;
FIG. 2 is a graph of the relu function according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the calculation flow of the relu function according to an embodiment of the present invention;
FIG. 4 is a schematic layout diagram of processing units in the reconfigurable processing array when operating the relu function according to an embodiment of the present invention;
FIG. 5 is a graph of the sigmoid function according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the calculation flow of the sigmoid function according to an embodiment of the present invention;
FIG. 7 is a schematic layout diagram of processing units in the reconfigurable processing array when computing the sigmoid function according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the piecewise function images when computing the sigmoid function according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the accumulated piecewise function images when computing the sigmoid function according to an embodiment of the present invention;
FIG. 10 is a graph of the tanh function according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of the calculation flow of the tanh function according to an embodiment of the present invention;
FIG. 12 is a schematic layout diagram of processing units in the reconfigurable processing array when computing the tanh function according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of the calculation flow of the anti-overflow processing according to an embodiment of the present invention;
FIG. 14 is a schematic layout diagram of processing units in the reconfigurable processing array during the anti-overflow processing according to an embodiment of the present invention;
FIG. 15 is a schematic diagram of the calculation flow for computing e^x according to an embodiment of the present invention;
FIG. 16 is a schematic layout diagram of processing units in the reconfigurable processing array when computing e^x according to an embodiment of the present invention;
FIG. 17 is a schematic diagram of the calculation flow for computing ln(Σ e^x) according to an embodiment of the present invention;
FIG. 18 is a schematic layout diagram of processing units in the reconfigurable processing array when computing ln(Σ e^x) according to an embodiment of the present invention;
FIG. 19 is a block diagram of a reconfigurable processor for implementing multiple neural network activation function computations according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
The inventors of the application have observed that coarse-grained reconfigurable processor architectures are attracting more and more attention for their low energy consumption, high performance, high energy efficiency, and flexible dynamic reconfigurability. The flexibility of a reconfigurable computing architecture lies between that of a general-purpose processor and that of an ASIC, while with optimization its efficiency can approach an ASIC's, so it combines the advantages of both. These characteristics make it well suited to data-intensive operations, which matches the computational requirements of neural networks exactly. In neural network computation, the activation function, as the part that provides nonlinearity, is particularly important to implement; unlike a dedicated ASIC, however, a coarse-grained reconfigurable processor has no circuit dedicated to activation functions, and adding an activation-function circuit to the reconfigurable computing architecture would introduce redundancy, while the complicated circuit design would also reduce performance and increase power consumption. The inventors therefore propose methods for calculating a plurality of neural network activation functions on the reconfigurable processor, realizing the relatively complex activation-function operations on top of the existing, simpler circuit design of the reconfigurable processing array.
In an embodiment of the present invention, a method for calculating multiple neural network activation functions on a reconfigurable processor is provided, as shown in fig. 1, the method includes:
step 102: splitting a neural network activation function into basic operations;
step 104: according to the calculation order of the basic operations in the neural network activation function, reading input data from a shared memory through a reconfigurable processing array of the reconfigurable processor to realize each basic operation in sequence, wherein the processing units on the peripheral edge of the reconfigurable processing array execute memory-access operations and are called memory-access processing units; the processing units other than those on the peripheral edge execute arithmetic operations and are called arithmetic processing units; each peripheral processing unit exchanges data with the arithmetic processing units in its row or column; and each processing unit in the array exchanges data with whichever adjacent processing units exist above, below, to the left of, and to the right of it.
As the flow of fig. 1 shows, the embodiment of the invention proposes to split the neural network activation function into basic operations and then, following the calculation order of those operations, to read input data from the shared memory through the reconfigurable processing array and realize each basic operation in sequence. The operation of the activation function is thereby realized on the existing reconfigurable processing array structure, without changing that structure and without adding any circuit to it: different processing units in the array are configured to perform the corresponding operations according to the algorithmic requirements of different activation functions, so that complex activation-function operations are built on the array out of basic operations such as addition, subtraction, multiplication, and shifts. This helps simplify the circuit design for activation-function operations and helps raise the circuit's operation speed and throughput. Since the operation algorithm of the processing units can be configured flexibly and a pipelined input/output mode is used, the scheme also accommodates the operations of different, changing activation functions, is extensible, and helps improve the utilization of the processing units.
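To make the array organization concrete, the short Python sketch below (all names hypothetical; the 8 × 8 size is taken from the array described later in this text) classifies PEs into the two roles and enumerates a PE's legal transfer partners:

```python
# Illustrative sketch of the array organization: peripheral PEs do memory
# access, interior PEs do arithmetic, and data moves only between existing
# up/down/left/right neighbors. The 8x8 size comes from the description
# below; the function and role names are hypothetical.

N = 8

def pe_role(row: int, col: int) -> str:
    on_edge = row in (0, N - 1) or col in (0, N - 1)
    return "memory-access" if on_edge else "arithmetic"

def neighbors(row: int, col: int):
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < N and 0 <= c < N:
            yield (r, c)

assert pe_role(0, 3) == "memory-access"   # edge PE loads/stores
assert pe_role(2, 3) == "arithmetic"      # interior PE computes
assert list(neighbors(0, 0)) == [(1, 0), (0, 1)]  # corner PE: 2 neighbors
```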
In a specific implementation, the operation of each different neural network activation function can be split into basic operations, after which the reconfigurable processing array reads input data from the shared memory and realizes each basic operation in sequence. For one and the same activation function, adjusting the granularity of the split and choosing different splitting schemes makes the operation extensible and able to meet different precision and throughput requirements. For example, under a low-precision requirement the activation function can be split coarsely into fewer basic operations, trading precision for throughput; under a high-precision requirement it can be split finely into more basic operations, raising the precision.
In a specific implementation, the basic operations may include simple operations such as addition, subtraction, multiplication, multiply-accumulate, shift, and selection, so that complex neural network activation functions are computed by executing simple basic operations on the reconfigurable processing array.
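As one way to picture these basic operations in software, the following sketch collects them into a dispatch table; the opcode names follow the operation tables given later in this description, while everything else is an illustration rather than any interface of the patent:

```python
# A dispatch table modeling the basic operations; the opcode names match the
# operation tables later in this description (Load/Save, being memory moves,
# are omitted), while the table itself is an illustration, not a patent API.

BASIC_OPS = {
    "+":   lambda a, b: a + b,             # addition
    "-":   lambda a, b: a - b,             # subtraction
    "*":   lambda a, b: a * b,             # multiplication
    "MAC": lambda a, b, c: a * b + c,      # multiply-accumulate
    ">>":  lambda a, n: a >> n,            # shift (on fixed-point integers)
    "Sel": lambda a, b, c: b if a else c,  # selection: a chooses b or c
}

# Example: a cubic Taylor polynomial ((c3*x + c2)*x + c1)*x + c0 is exactly
# three chained MAC operations.
x, c3, c2, c1, c0 = 2.0, 1.0, 0.5, 0.25, 0.125
acc = BASIC_OPS["MAC"](c3, x, c2)
acc = BASIC_OPS["MAC"](acc, x, c1)
acc = BASIC_OPS["MAC"](acc, x, c0)
assert acc == ((c3 * x + c2) * x + c1) * x + c0
```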
In particular implementations, a piecewise-linear neural network activation function can be operated on the reconfigurable processing array as follows.
splitting a neural network activation function into basic operations, comprising:
for a piecewise-linear neural network activation function, splitting the function into selection operations;
according to the calculation sequence of each basic operation in the neural network activation function, the basic operations are sequentially realized through a reconfigurable processing array, and the method comprises the following steps:
the method comprises the steps that input data are read from a shared memory through a plurality of access processing units in the reconfigurable processing array, the input data are transmitted to an operation processing unit of a row or a column where the access processing units are located through each access processing unit to carry out selection operation, calculation results of the selection operation are transmitted to the access processing unit of the row or the column where the access processing units are located through the operation processing units, and then calculation results are stored in the shared memory, wherein the access processing unit for reading the input data and the access processing unit for storing the calculation results are different access processing units, and the calculation results output by the different operation processing units are transmitted to the different access processing units.
In a specific implementation, the piecewise-linear neural network activation function is exemplified by the linear rectification function (the relu function), f(x) = max(0, x). As shown in fig. 2, its curve is monotonically increasing and easy to differentiate.
Specifically, implementing the relu function on the reconfigurable computing architecture requires considering how to map the hardware ASIC circuit algorithm of the relu function onto that architecture. Following the ASIC circuit implementation principle of relu, the input data x is taken out of the shared memory of the reconfigurable processing array, a sel operation judges whether x is positive or negative, and either 0 or x is selected as the final output.
In the following, the implementation of the relu function is described on a 4 × 4 reconfigurable processing array PEA (one quarter of the whole array; the general reconfigurable processing array is 8 × 8). First, as shown in Table 1 below, processing units PE on the edge of the array (the memory-access processing units) execute Load operations to take input data from the shared memory; processing units inside the array (the arithmetic processing units) then realize the sel operation, selecting 0 or x for output; finally, edge processing units execute Save operations to store the results into the shared memory. The arrangement of the different operations executed by each processing unit in the array is shown in fig. 4. The memory-access units that read the input data differ from those that store the results, which enables pipelined execution, and the results output by different arithmetic units are transmitted to different memory-access units, so that different memory-access units store the results of different arithmetic units into the shared memory without overwriting data.
TABLE 1

Operation   Meaning
Load        Fetch data from memory
Sel         Select: inputs a, b, c; output b or c according to the value of a
Save        Store data to memory
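A minimal sketch of this Load → Sel → Save pipeline, under the assumption that two Python lists stand in for the shared memory, is:

```python
# Sketch of the Table 1 / fig. 4 relu pipeline: an edge PE Loads x, an
# interior PE applies Sel (choosing 0 or x by the sign of x), and a
# different edge PE Saves the result so loads and stores can overlap.
# The two lists standing in for the shared memory are illustrative.

def relu_pass(shared_in, shared_out):
    for i, x in enumerate(shared_in):  # Load: edge PE fetches x
        negative = x < 0               # interior PE computes the condition
        y = 0.0 if negative else x     # Sel: select 0 or x
        shared_out[i] = y              # Save: a different edge PE stores y

data = [-3.0, 1.5, 0.0, 2.25]
out = [0.0] * len(data)
relu_pass(data, out)
assert out == [0.0, 1.5, 0.0, 2.25]
```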
In particular implementations, a neural network activation function that is symmetric and admits a piecewise Taylor-expansion fit can be computed on the reconfigurable processing array as follows.
splitting a neural network activation function into basic operations, comprising:
for a symmetric neural network activation function that allows a piecewise Taylor-expansion fit, splitting the function by its symmetry into a first symmetric part and a second symmetric part; dividing the input data of the first symmetric part into a plurality of data segments; splitting the operation on each data segment, in order, into subtraction, selection operations, and multiply-accumulate operations; adding the multiply-accumulate results of the data segments; performing a subtraction between the accumulated result and the maximum output value of the first symmetric part followed by a selection operation to obtain the output data of the first symmetric part; and performing a subtraction between the output data of the first symmetric part and the maximum output value of the first symmetric part followed by a selection operation to obtain the output data of the second symmetric part;
according to the calculation sequence of each basic operation in the neural network activation function, the basic operations are sequentially realized through a reconfigurable processing array, and the method comprises the following steps:
reading, through one memory-access processing unit in the reconfigurable processing array, one value of each data segment in turn from the shared memory; subtracting, in a plurality of arithmetic processing units, the read value from the endpoint values of the divided data segments; forming a first-level selector from a plurality of arithmetic processing units, each corresponding to one data segment and outputting, based on the subtraction result, the minimum of the read value and the maximum of its corresponding data segment; forming a second-level selector from a plurality of arithmetic processing units, each corresponding to a preceding data segment, in which the first unit outputs the output of the first unit of the first-level selector while each other unit outputs the maximum of the corresponding first-level output and the maximum of the preceding data segment; performing multiply-accumulate operations, in separate arithmetic processing units, on the outputs of the units of the second-level selector; adding the multiply-accumulate results in an arithmetic processing unit; performing, in arithmetic processing units, a subtraction with 1 followed by a selection operation on the sum to obtain the output data of the first symmetric part; and performing a subtraction between 1 and the output data of the first symmetric part followed by a selection operation to obtain the output data of the second symmetric part.
In a specific implementation, the symmetric neural network activation functions that admit a piecewise Taylor-expansion fit are exemplified by the S-shaped growth curve function (the Sigmoid function) and the hyperbolic tangent function (the Tanh function). The sigmoid function,

sigmoid(x) = 1 / (1 + e^(-x)),

is a common S-shaped function in biology. It maps input variables into (0, 1) and, as shown in fig. 5, is monotonically increasing and easy to differentiate. In a neural network, if the output unit deals with a binary classification problem, the sigmoid function can be obtained from the generalized linear model and the output follows a Bernoulli distribution.
In a specific implementation, a lookup table can hardly be realized on a pipeline-based reconfigurable array, because the fetch address changes as the input data changes. On a general reconfigurable array, a processing unit's fetch address is usually formed from a base address and an offset address; if a lookup table were implemented on the array, the fetch address would change with the input data and stall the pipeline. This embodiment therefore performs piecewise accumulation of the function and realizes its calculation in a pipelined manner. Specifically, the basic operations into which the Sigmoid function is split are shown in Table 2 below.
TABLE 2

Operation   Meaning
Load        Fetch data from memory
Sel         Select: inputs a, b, c; output b or c according to the value of a
-           Subtraction: inputs a, b; output a - b
+           Addition: inputs a, b; output a + b
MAC         Multiply-accumulate: inputs a, b, c; compute a*b + c
Save        Store data to memory
In a specific implementation, first, by the symmetry of the sigmoid function, only the part of the function greater than 0 (the first symmetric part) needs to be computed; the other half (the second symmetric part) is finally obtained from it by the point symmetry sigmoid(-x) = 1 - sigmoid(x). All input data are therefore mapped into the interval [0, +∞).
Secondly, Taylor expansions of the sigmoid function on different parts yield its approximate function. Based on the reconfigurable processing array, the input range [0, +∞) of the sigmoid function is divided into 4 data segments (in a specific implementation the number of segments can be chosen according to the precision requirement; more segments give higher precision), namely [0, 4), [4, 8), [8, 15), and [15, ∞).
First, the data range of the input data is judged with the sel operation function: its inputs are a, b, and c, and it selects either b or c for output according to the value of a. Subtraction is performed on the input data to judge the input range.
We construct a two-level selection function out of processing units. The first-level selection function is realized by three processing units and outputs the smaller of its two input numbers; the second-level selection function is realized by three processing units and outputs the larger of its two input numbers.
As shown in fig. 6 and fig. 7, the range of the input data can be determined by subtracting 4, 8, and 15 (the endpoint values of the divided data segments) from the input data and mapping the subtracted results. We analyse the three data segments with the example inputs 1, 6, and 18.
When the input data is 1, it passes through the first-level selectors: the first selector (inputs 1 and 4) outputs 1, the second selector (inputs 1 and 8) outputs 1, and the third selector (inputs 1 and 15) outputs 1. The outputs of the first-level selectors then pass through the second-level selectors: the first selector routes the output of the first first-level selector straight through, giving 1; the second selector (inputs 1 and 4) outputs 4; and the third selector (inputs 1 and 8) outputs 8.

Similarly, when the input data is 6, the first-level selectors output 4, 6, and 6. Through the second level, the first selector outputs 4, the second selector (inputs 6 and 4) outputs 6, and the third selector (inputs 6 and 8) outputs 8.

Similarly, when the input data is 18, the first-level selectors output 4, 8, and 15. Through the second level, the first selector outputs 4, the second selector (inputs 8 and 4) outputs 8, and the third selector (inputs 15 and 8) outputs 15.
In summary, the sel operation function formed by the two-level selector can be expressed as formula (1):

sel(x, y, z) = max(min(x, y), z),   y = 4, 8, 15;   z = 4, 8   (1)
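A minimal executable sketch of formula (1), reproducing the three worked examples above, is:

```python
# Sketch of formula (1): stage one takes min(x, segment endpoint), stage two
# takes max(stage-one output, previous segment maximum), with the first path
# routed through unchanged. The assertions reproduce the worked examples
# above; the function name is illustrative.

def two_stage_select(x):
    s1 = [min(x, 4), min(x, 8), min(x, 15)]       # first-level selectors
    return [s1[0], max(s1[1], 4), max(s1[2], 8)]  # second-level selectors

assert two_stage_select(1) == [1, 4, 8]
assert two_stage_select(6) == [4, 6, 8]
assert two_stage_select(18) == [4, 8, 15]
```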
The outputs of the three second-level selectors are processed by MAC operations in three different processing-unit paths, i.e., by the Taylor expansion functions developed at three different points, and the results are accumulated to give the final output. As shown in fig. 8, the solid curve is half of the sigmoid function; the other marked curves are the Taylor function expanded on [0, 4), the Taylor function expanded on [4, 8), the Taylor function expanded on [8, 15), and the constant 1. Piecing these together, i.e., accumulating them, yields a new function: as fig. 9 shows, the Taylor-expansion fit matches the sigmoid function image well.
In the specific implementation, the segment intervals of the Sigmoid function are taken as [0, 4), [4, 8), [8, 15), [15, ∞) in view of the precision loss. Beyond 15 the result is taken to be 1, with an accuracy loss of about 10^-7, which can be ignored. On the interval [0, 15], a piecewise Taylor expansion to the third order gives the approximate function; the specific precision losses and Taylor expansion functions are shown in Table 3 below. Only the non-negative interval is shown; the negative interval is obtained by the central symmetry about x = 0.
TABLE 3

Interval   Taylor expansion function                                      Maximum precision loss
[0, 4)     3.56*10^-3*x^3 - 5.71*10^-2*x^2 + 2.93*10^-1*x + 4.92*10^-1    7.53*10^-3
[4, 8)     4.96*10^-4*x^3 - 1.05*10^-2*x^2 + 7.51*10^-2*x + 8.19*10^-1    5.23*10^-4
[8, 15)    3.21*10^-6*x^3 - 1.22*10^-4*x^2 + 1.54*10^-3*x + 9.94*10^-1    3.71*10^-5
[15, ∞)    1                                                              3.06*10^-7
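In effect, the selector-plus-MAC pipeline evaluates one Table 3 polynomial per input; the following sketch performs that piecewise evaluation directly with ordinary control flow (the array's selector dataflow is not modeled), with a spot-check tolerance chosen loosely around the table's stated losses:

```python
import math

# Sketch: direct piecewise evaluation of the Table 3 polynomials. On the
# array the segment choice is made by the two-level selectors and the three
# MAC paths; ordinary control flow stands in for that here. Function and
# variable names are illustrative.

SIGMOID_SEGMENTS = [
    # (upper endpoint, (c3, c2, c1, c0)) for |x| below that endpoint
    (4.0,  (3.56e-3, -5.71e-2, 2.93e-1, 4.92e-1)),
    (8.0,  (4.96e-4, -1.05e-2, 7.51e-2, 8.19e-1)),
    (15.0, (3.21e-6, -1.22e-4, 1.54e-3, 9.94e-1)),
]

def approx_sigmoid(x: float) -> float:
    t = abs(x)
    for upper, (c3, c2, c1, c0) in SIGMOID_SEGMENTS:
        if t < upper:
            y = ((c3 * t + c2) * t + c1) * t + c0  # Horner form: three MACs
            break
    else:
        y = 1.0                                    # the [15, inf) segment
    return y if x >= 0 else 1.0 - y                # second symmetric part

# Spot check against the exact sigmoid; 1e-2 allows for the rounded
# coefficients printed in Table 3.
for v in (-20.0, -3.0, 0.0, 2.0, 5.0, 12.0, 40.0):
    assert abs(approx_sigmoid(v) - 1.0 / (1.0 + math.exp(-v))) < 1e-2
```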
In a specific implementation, when a PE in the reconfigurable processing array performs an ordinary operation with inputs a and b, its output is f(a, b), the function that PE has been configured to execute; alternatively, one of the two values a and b can be selected as the output, which one being determined by the positions of the inputs in the compiled configuration instruction of the PE. The specific operation and output of every PE in the reconfigurable processing array can therefore be realized through configuration.
Specifically, the sigmoid function is calculated with this piecewise-accumulation implementation, and, based on the symmetry of the function, its pipelined calculation is finally realized with 3 global PEs and 28 processing-unit PEs.
Specifically, the Tanh function,

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)),

as shown in fig. 10, is monotonically increasing and easy to differentiate like the sigmoid function, and it maps input variables into (-1, 1).
Specifically, the Tanh function is operated similarly to the sigmoid function, only with different segment intervals, here taken as [0, 1), [1, 2), [2, 4), [4, ∞). The calculation flow for tanh is shown in fig. 11, and the layout of processing units in the reconfigurable processing array when the tanh function is calculated is shown in fig. 12. The specific precision losses and Taylor expansion functions are given in Table 4 below; only the non-negative interval is shown, and the negative interval is obtained by the central symmetry about x = 0.
TABLE 4

Interval   Taylor expansion function                                      Maximum precision loss
[0, 1)     5.70*10^-2*x^3 - 4.57*10^-1*x^2 + 1.17*10^0*x - 1.50*10^-2     1.50*10^-2
[1, 2)     8.69*10^-2*x^3 - 5.59*10^-1*x^2 + 1.27*10^0*x - 3.83*10^-2     3.27*10^-4
[2, 4)     7.93*10^-3*x^3 - 8.42*10^-2*x^2 + 3.01*10^-1*x + 6.37*10^-1    1.04*10^-3
[4, ∞)     1                                                              6.71*10^-4
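The corresponding sketch for tanh swaps in the Table 4 segments and uses the odd symmetry; note the assumption, flagged in the comments, about the one coefficient that is illegible in the source:

```python
import math

# The same scheme for tanh: swap in the Table 4 segments and use the odd
# symmetry tanh(-x) = -tanh(x). The x^2 coefficient of the [1, 2) row is
# reconstructed as 5.59e-1 (the source prints it illegibly); this sketch's
# names are illustrative.

TANH_SEGMENTS = [
    (1.0, (5.70e-2, -4.57e-1, 1.17e0, -1.50e-2)),
    (2.0, (8.69e-2, -5.59e-1, 1.27e0, -3.83e-2)),
    (4.0, (7.93e-3, -8.42e-2, 3.01e-1, 6.37e-1)),
]

def approx_tanh(x: float) -> float:
    t = abs(x)
    for upper, (c3, c2, c1, c0) in TANH_SEGMENTS:
        if t < upper:
            y = ((c3 * t + c2) * t + c1) * t + c0
            break
    else:
        y = 1.0                 # the [4, inf) segment
    return y if x >= 0 else -y  # odd symmetry

for v in (-5.0, -0.3, 0.0, 1.2, 2.5, 8.0):
    assert abs(approx_tanh(v) - math.tanh(v)) < 2e-2
```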
In particular implementations, a neural network activation function that contains division is operated on the reconfigurable processing array as follows.
splitting a neural network activation function into basic operations, comprising:
for a neural network activation function containing division, subtracting the maximum value of the input data from the input data of the function to avoid overflow, converting the division in the function into subtraction, and dividing the parameters participating in the operation into different operation items according to the subtractions in the function;
according to the calculation sequence of each basic operation in the neural network activation function, the basic operations are sequentially realized through a reconfigurable processing array, and the method comprises the following steps:
and sequentially realizing the operation of each operation item through the reconfigurable processing array.
In a specific implementation, the neural network activation function containing division is exemplified by Softmax, whose expression is

softmax(x_i) = e^(x_i) / Σ_j e^(x_j).
By means of the anti-overflow processing (i.e., replacing the input data x by x - x_max), the softmax function can be converted into

softmax(x_i) = e^(x_i - x_max) / Σ_j e^(x_j - x_max),

i.e., the input becomes x - x_max, which prevents the result of e^x from becoming so large that it overflows. Because division is relatively complex to realize in a circuit, the invention replaces the division with subtraction, reducing the power consumed and the resources used and thereby raising the speed and efficiency of the operation. Using a logarithmic transformation, the softmax function can be converted into

softmax(x_i) = e^((x_i - x_max) - ln(Σ_j e^(x_j - x_max))).
The operation of the softmax function is therefore divided into four parts (the operation items referred to above): the first is the anti-overflow part, solving x - x_max; the second computes e^x; the third accumulates the obtained values of e^x and solves ln(Σ e^x); and the fourth solves e^((x_i - x_max) - ln(Σ e^(x_j - x_max))).
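Before the array mapping of each part is described, the identity behind this four-part decomposition can be checked in exact arithmetic; the following is a plain reference sketch of the transformed formula, with no array semantics:

```python
import math

# Reference form of the identity the four operation items implement: subtract
# the maximum, then replace the division by a subtraction in the exponent.
# Plain Python, no array mapping; the function name is illustrative.

def softmax_reference(xs):
    x_max = max(xs)                                     # anti-overflow
    log_sum = math.log(sum(math.exp(x - x_max) for x in xs))
    return [math.exp((x - x_max) - log_sum) for x in xs]

probs = softmax_reference([1.0, 2.0, 3.0])
assert abs(sum(probs) - 1.0) < 1e-12
```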
In a specific implementation, to realize the anti-overflow processing, the maximum of the input data must be subtracted from the input data. The maximum is found, for example, as follows: the input data is divided into several data groups; for each group, memory-access processing units read the inputs, arithmetic processing units receive them and perform selection operations, and the group's maximum is output; the groups are processed in parallel to obtain every group's maximum; the group maxima are then read through memory-access processing units and received by arithmetic processing units that perform selection operations on them, outputting the largest of the group maxima as the maximum of the input data.
Specifically, taking the first-step operation item of softmax as an example, the input data may be divided into 16 data groups, and the operations used to determine the maximum of the input data are those of Table 5 below. The 16 groups can be compared in parallel by the comparison operations of the RPU's processing array: as shown in fig. 13 and fig. 14, memory-access processing units execute Load operations to read each group's input data from the shared memory, arithmetic processing units perform subtraction and selection operations to pick out the maximum within each of the 16 groups, and memory-access processing units execute Save operations to store each group's maximum into the shared memory. Finally, the maxima of the 16 groups are compared with one another to obtain the maximum of the input data. Exploiting the RPU's ability to process data in parallel accelerates the data processing and improves efficiency.
TABLE 5

Operation   Meaning
Load        Fetch data from memory
Sel         Select: inputs a, b, c; output b or c according to the value of a
-           Subtraction: inputs a, b; output a - b
Save        Store data to memory
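A sketch of this two-level maximum search, with a loop standing in for the comparisons the array performs in parallel, is:

```python
# Sketch of the two-level maximum search: split the inputs into groups, find
# each group's maximum (the array does the 16 groups in parallel; a loop
# models that here), then take the maximum of the group maxima. Names and
# the group-count parameter are illustrative.

def grouped_max(xs, groups=16):
    chunk = max(1, (len(xs) + groups - 1) // groups)   # group size
    group_maxima = [max(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return max(group_maxima)

data = [3.1, -7.0, 42.5, 0.0, 8.8, 41.9, -2.2, 5.5]
assert grouped_max(data, groups=4) == max(data)
```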
In a specific implementation, for an exponential function with base e in an operation item, this embodiment proposes the following: the input data is read through a memory-access processing unit; an arithmetic processing unit subtracts the maximum of the input data from it; an arithmetic processing unit multiplies the difference by log2(e), converting the exponential function into one with base 2, the product being the input of the re-based exponential function and consisting of an integer part and a fractional part; the base-2 exponential whose exponent is the fractional part is Taylor-expanded into a polynomial, which arithmetic processing units evaluate to obtain that exponential's output; an arithmetic processing unit applies a shift by the integer part to this output to obtain the output of the exponential function; and arithmetic processing units accumulate the outputs of the exponential function.
Specifically, taking the second-step operation item of softmax as an example, e^x is computed; note that the input here is x - x_max, which prevents overflow. First, by the base-change formula, e^(y_i) becomes

e^(y_i) = 2^(y_i * log2(e)) = 2^(u_i + v_i),

where u_i is the integer part and v_i the fractional part of the input after the base change, and y_i = x - x_max. Based on the characteristics of binary numbers, the formula is transformed again into

2^(u_i + v_i) = 2^(u_i) * 2^(v_i).

At this point the range of the data still to be computed is reduced to [-1, 0], so a Taylor expansion can be used to solve 2^(v_i).
2^(v_i) is then expanded as a Taylor polynomial, formula (7). Finally, the polynomial result is shifted by u_i to obtain the result of e^(y_i) = e^(x - x_max).
Specifically, the flow for computing e^(x - x_max) is shown in fig. 15 and fig. 16. A memory-access processing unit executes Load to fetch the input data from memory; the x_max obtained by the anti-overflow processing of the previous stage is subtracted from it, completing the anti-overflow update of the data. The updated data is then multiplied by log2(e) to obtain u_i + v_i; an AND operation separates u_i from v_i; u_i is stored, and arithmetic processing units perform multiply-accumulate operations on v_i to evaluate the polynomial, the specific calculation being formula (7). Finally, u_i is fetched from memory with a memory-access operation, the polynomial result is shifted by it to obtain the final output, and the output is stored to memory.
All the values of e^x are accumulated by addition to obtain Σ e^x, which is stored in memory for the next part of the calculation. The operations used in this part are listed in Table 6.
TABLE 6

Operation   Meaning
Load        Fetch data from memory
Sel         Select: inputs a, b, c; output b or c according to the value of a
And         AND operation: inputs a, b; output a & b
>>          Shift operation: input a; output a shifted
+           Addition: inputs a, b; output a + b
-           Subtraction: inputs a, b; output a - b
*           Multiplication: inputs a, b; output a*b
MAC         Multiply-accumulate: inputs a, b, c; compute a*b + c
Save        Store data to memory
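The whole e^x flow can be sketched in floating point as follows; the third-order coefficients used for 2^v are the standard expansion of e^(v*ln 2), standing in for the patent's formula (7), which is not legible in the source:

```python
import math

# Sketch of the e^x pipeline in floating point: base change (multiply by
# log2(e)), split into integer part u and fractional part v, a third-order
# polynomial for 2**v, and a shift by u (math.ldexp here). The polynomial
# below is the standard Taylor expansion of e**(v*ln 2) about 0; the
# patent's formula (7) is not legible, so this stands in as an assumption.

LOG2E = 1.4426950408889634
LN2 = 0.6931471805599453

def approx_exp(y: float) -> float:
    """Approximate e**y for y <= 0 (inputs are already shifted by x_max)."""
    t = y * LOG2E              # base change: e**y = 2**t
    u = math.ceil(t)           # integer part; on the array, a shift amount
    v = t - u                  # fractional part in (-1, 0]
    w = v * LN2
    p = 1.0 + w + w * w / 2.0 + w * w * w / 6.0  # 2**v = e**w, 3rd order
    return math.ldexp(p, u)    # the shift by u: multiply by 2**u

for y in (0.0, -0.5, -3.0, -10.0):
    assert abs(approx_exp(y) - math.exp(y)) < 5e-3
```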
In a specific implementation, for a logarithmic function with base e in an operation item, whose input is the accumulation of base-e exponential functions, this embodiment converts the accumulated sum into the product of 2^w and k; an arithmetic processing unit performs a leading-zero count to obtain the value of w; shifting the accumulated sum of the exponentials yields the value of k; based on the values of w and k, the logarithmic function is Taylor-expanded into a polynomial; and arithmetic processing units evaluate the polynomial to produce the output of the logarithmic function.
Specifically, taking the third-step operation item of softmax as an example, the obtained values of e^x are accumulated and ln(Σ e^x) is solved. The accumulation part can be realized concurrently with the second-step operation item: each time a result is computed, it is accumulated into the global register. The central idea for computing ln(Σ e^x) is Taylor expansion. ln(Σ e^x) is first transformed as

ln(Σ e^x) = ln(2^w * k)   (8)

Since e^x is always positive, Σ e^x must be positive, so in binary it is stored in true form. The value of w is computed by the leading-zero count; once w is obtained, shifting Σ e^x yields the value of k and reduces the data to be computed to the [0, 1] interval, enabling the Taylor expansion. Transforming formula (8) and Taylor-expanding then gives the final calculation expression,

ln(Σ e^x) = w * ln 2 + ln k,   (9)

with ln k evaluated by its Taylor expansion. Specifically, the operations used to calculate ln(Σ e^x) are shown in Table 7 below, the calculation flow in fig. 17, and the layout of processing units in the reconfigurable processing array in fig. 18.
TABLE 7

Operation   Meaning
Load        Fetch data from memory
Clz         Count leading zeros: output the number of leading 0s in the input
+           Addition: inputs a, b; output a + b
-           Subtraction: inputs a, b; output a - b
*           Multiplication: inputs a, b; output a*b
MAC         Multiply-accumulate: inputs a, b, c; compute a*b + c
Save        Store data to memory
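A floating-point sketch of this step, with math.frexp standing in for the leading-zero count plus shift and an eight-term log series chosen for the sketch rather than taken from formula (9), is:

```python
import math

# Sketch of the ln(sum e^x) step. On the array a leading-zero count yields w
# and a shift yields k with sum = k * 2**w; math.frexp plays that role here,
# returning k in [0.5, 1) and the exponent w. Then ln(sum) = w*ln2 + ln(k),
# with ln(k) from the series ln(1+z) = z - z^2/2 + z^3/3 - ...; the 8-term
# truncation is a choice of this sketch, not the patent's formula (9).

LN2 = 0.6931471805599453

def approx_ln(s: float) -> float:
    """Approximate ln(s) for s > 0 by normalization plus a Taylor series."""
    k, w = math.frexp(s)       # s = k * 2**w with k in [0.5, 1)
    z = k - 1.0                # z in [-0.5, 0)
    ln_k = sum((-1.0) ** (n + 1) * z ** n / n for n in range(1, 9))
    return w * LN2 + ln_k

for s in (0.7, 1.0, 5.0, 123.4):
    assert abs(approx_ln(s) - math.log(s)) < 2e-3
```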
In a specific implementation, the fourth-step operation item of softmax solves

e^((x_i - x_max) - ln(Σ e^(x_j - x_max))).

Since x_max was already solved in the first step and ln(Σ e^(x - x_max)) in the third, the subtrahend is updated to x_max + ln(Σ e^(x - x_max)) and brought into the e^x calculation flow of the second step; the calculation flow chart is exactly the same as that of the second step.
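Chaining the sketches above (grouped_max, approx_exp, and approx_ln from the preceding code blocks) gives an end-to-end approximate softmax that follows the four operation items in order:

```python
# End-to-end sketch chaining the pieces above: grouped_max, approx_exp, and
# approx_ln are the functions defined in the previous sketches. The four
# steps follow the four operation items in order.

def approx_softmax(xs):
    x_max = grouped_max(xs)                     # step 1: anti-overflow
    exps = [approx_exp(x - x_max) for x in xs]  # step 2: e**(x - x_max)
    log_sum = approx_ln(sum(exps))              # step 3: ln(sum e**x)
    # step 4: reuse the e**x flow with the subtrahend x_max + ln(sum e**x)
    return [approx_exp((x - x_max) - log_sum) for x in xs]

probs = approx_softmax([0.5, -1.2, 3.3, 2.0])
assert abs(sum(probs) - 1.0) < 1e-2
```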
In the specific implementation process, while the basic operations are realized in sequence on the reconfigurable processing array, whenever an arithmetic processing unit needs to exchange data with a processing unit outside its own row and column, either a processing unit interconnected with it for data transmission executes a routing operation so that the data reaches that unit, or the arithmetic processing unit's data is output to a global register for storage and read from there by the processing unit outside its row and column.
In a specific implementation, the methods for calculating the various neural network activation functions on the reconfigurable processor can be simulation-tested in the Python language, with the input data being random numbers in (-101, 101), the number of inputs a random number in (1, 100), and 100 rounds run. According to the final simulation results, the maximum error is about 0.01, i.e., a precision of two decimal places (6-7 binary bits). The precision could be raised further by raising the order of the Taylor expansion, but to reduce power consumption and keep the operation fast, the expansion order was not raised.
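A harness in the spirit of that simulation, reusing approx_softmax from the sketch above (the input range, count range, and round count mirror the text; the seed and error metric are choices of this harness), is:

```python
import math
import random

# Test harness in the spirit of the simulation described above: the input
# range (-101, 101), count range (1, 100), and 100 rounds mirror the text;
# the seed, the error metric, and the reuse of approx_softmax from the
# previous sketch are choices of this harness.

random.seed(0)
max_err = 0.0
for _ in range(100):
    n = random.randint(1, 100)
    xs = [random.uniform(-101.0, 101.0) for _ in range(n)]
    got = approx_softmax(xs)
    m = max(xs)
    s = sum(math.exp(x - m) for x in xs)
    exact = [math.exp(x - m) / s for x in xs]
    max_err = max(max_err, max(abs(g - e) for g, e in zip(got, exact)))
print(f"max error over 100 rounds: {max_err:.4f}")
```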
The above methods for calculating neural network activation functions on a reconfigurable processor realize the functions on the reconfigurable architecture mainly by means of Taylor expansion. In the calculation of the softmax function, subtraction replaces division, and the base-change formula combined with shifting replaces the direct computation of e^x, reducing the number of coefficients that must be stored and the operation time, and thereby further reducing the hardware resource overhead, the area, and the power consumption of the device.
In addition, the methods for calculating the neural network activation functions on the reconfigurable processor have a degree of flexibility: the expansion order can be customized for the application, meeting the requirements of data of various precisions and striking a good balance among power consumption, computational efficiency, and precision.
Based on the same inventive concept, an embodiment of the invention further provides a reconfigurable processor for implementing the calculation of multiple neural network activation functions, as described in the following embodiments. Since the principle by which this reconfigurable processor solves the problem is similar to that of the method for calculating multiple neural network activation functions on a reconfigurable processor, its implementation can refer to the implementation of that method; repeated descriptions are omitted. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 19 is a block diagram of a reconfigurable processor for implementing multiple neural network activation function computations according to an embodiment of the present invention, as shown in fig. 19, including:
a shared memory 1902 for storing input data;
the reconfigurable processing array 1904 is configured to read input data from the shared memory to sequentially implement each basic operation according to a calculation sequence of each basic operation after splitting of a neural network activation function, where processing units on the peripheral edge in the reconfigurable processing array are used to perform access and storage operations and are referred to as access and storage processing units, other processing units in the reconfigurable processing array except the processing units on the peripheral edge are used to perform operation operations and are referred to as operation processing units, the processing units on the peripheral edge perform data transmission with the processing units on the row or the column where the processing units are located and used to perform operation operations, and each processing unit in the reconfigurable processing array performs data transmission with the processing units that are located and adjacent to each other in the vertical and horizontal directions of the processing unit.
In another embodiment, software is also provided for executing the technical solutions described in the above embodiments and preferred embodiments.
In another embodiment, a storage medium is provided, in which the software is stored, and the storage medium includes but is not limited to: optical disks, floppy disks, hard disks, erasable memory, etc.
The embodiment of the invention achieves the following technical effects: the neural network activation function is split into basic operations, and the reconfigurable processing array then reads input data from the shared memory to realize each basic operation in sequence according to its calculation order in the function, so that the operation of the activation function is realized on the existing reconfigurable processing array structure without changing that structure and without adding any circuit to it. Different processing units in the array are configured to perform the corresponding operations according to the algorithmic requirements of different activation functions, so that complex activation-function operations are built on the array out of basic operations such as addition, subtraction, multiplication, and shifts. This helps simplify the circuit design for activation-function operations and helps raise the circuit's operation speed and throughput; and since the operation algorithm of the processing units can be configured flexibly and a pipelined input/output mode is used, the scheme accommodates the operations of different, changing activation functions, is extensible, and helps improve the utilization of the processing units.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for computing multiple neural network activation functions on a reconfigurable processor, characterized by comprising the following steps:
splitting a neural network activation function into basic operations;
reading input data from a shared memory through a reconfigurable processing array of the reconfigurable processor and carrying out each basic operation in turn, according to the computation order of the basic operations in the neural network activation function, wherein processing units on the peripheral edge of the reconfigurable processing array execute access operations and are called access processing units, the remaining processing units execute arithmetic operations and are called operation processing units, each peripheral processing unit exchanges data with the operation processing units in its own row or column, and each processing unit in the array exchanges data with whichever adjacent processing units exist above, below, to the left of, and to the right of it.
2. The method for computing multiple neural network activation functions on a reconfigurable processor according to claim 1, wherein the basic operations comprise: addition, subtraction, multiplication, multiply-accumulate operations, and selection operations.
3. The method for computing multiple neural network activation functions on a reconfigurable processor according to claim 1, wherein
splitting the neural network activation function into basic operations comprises:
for a piecewise-linear neural network activation function, splitting the activation function into selection operations;
and carrying out each basic operation in turn through the reconfigurable processing array, according to the computation order of the basic operations in the activation function, comprises:
reading input data from the shared memory through a plurality of access processing units in the reconfigurable processing array; transmitting the input data from each access processing unit to an operation processing unit in its row or column for a selection operation; and transmitting the result of the selection operation from the operation processing unit to an access processing unit in its row or column for storage into the shared memory, wherein the access processing unit that reads the input data and the access processing unit that stores the result are different access processing units, and results output by different operation processing units are transmitted to different access processing units.
4. The method for computing multiple neural network activation functions on a reconfigurable processor according to claim 1, wherein
splitting the neural network activation function into basic operations comprises:
for a symmetric neural network activation function that can be fitted by piecewise Taylor expansion, splitting the activation function by symmetry into a first symmetric part and a second symmetric part, dividing the input data of the first symmetric part into a plurality of data segments, splitting the operation on each data segment in turn into subtraction, selection, and multiply-accumulate operations, adding the multiply-accumulate results of the data segments, comparing the accumulated result with the maximum output value of the first symmetric part and performing a selection operation to obtain the output data of the first symmetric part, and subtracting the output data of the first symmetric part from the maximum output value of the first symmetric part and performing a selection operation to obtain the output data of the second symmetric part;
and carrying out each basic operation in turn through the reconfigurable processing array, according to the computation order of the basic operations in the activation function, comprises:
reading one value of each data segment in turn from the shared memory through one access processing unit of the reconfigurable processing array; subtracting, through a plurality of operation processing units, the read value from the end value of each divided data segment; forming a first-stage selector from a plurality of operation processing units, wherein each operation processing unit of the first-stage selector corresponds to one data segment and, based on the subtraction result, outputs the minimum of the read value and the maximum value of its data segment; forming a second-stage selector from a plurality of operation processing units, wherein each operation processing unit of the second-stage selector corresponds to the preceding data segment, the first operation processing unit of the second-stage selector outputs the output of the first operation processing unit of the first-stage selector, and every other operation processing unit of the second-stage selector outputs the maximum of the output of its corresponding operation processing unit in the first-stage selector and the maximum value of the preceding data segment; multiplying and accumulating the outputs of the second-stage selector through operation processing units, and adding the multiply-accumulate results through an operation processing unit; subtracting the addition result from the maximum output value of the first symmetric part and performing a selection operation through operation processing units to obtain the output data of the first symmetric part; and subtracting the output data of the first symmetric part from the maximum output value of the first symmetric part and performing a selection operation through operation processing units to obtain the output data of the second symmetric part.
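Informative note (not part of the claims): one way to read the selector-and-MAC pipeline of claim 4 is as a sum of clamped, per-segment linear terms, folded by symmetry. The sketch below uses sigmoid as the example symmetric function; the segment boundaries and per-segment slopes are illustrative assumptions computed offline.

```python
import math

# Piecewise fit of the first symmetric half of sigmoid as a sum of
# clamped linear terms: the two selector stages clamp the input into
# each segment, a multiply-accumulate applies the segment slope, the
# accumulated value is capped at the maximum output, and symmetry
# supplies the other half as 1 - f(|x|).

SEGMENTS = [0.0, 1.0, 2.0, 4.0, 8.0]   # assumed segment boundaries

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# per-segment slopes, computed offline and held as array configuration
SLOPES = [(sigmoid(hi) - sigmoid(lo)) / (hi - lo)
          for lo, hi in zip(SEGMENTS, SEGMENTS[1:])]

def sigmoid_approx(x):
    t = abs(x)                              # fold onto the first symmetric half
    acc = 0.5                               # sigmoid(0), value at the fold point
    for (lo, hi), slope in zip(zip(SEGMENTS, SEGMENTS[1:]), SLOPES):
        clamped = min(max(t, lo), hi)       # the two selector stages
        acc += slope * (clamped - lo)       # multiply-accumulate per segment
    acc = min(acc, 1.0)                     # select against the maximum output value
    return acc if x >= 0.0 else 1.0 - acc   # second symmetric part by symmetry

print(sigmoid_approx(1.5), sigmoid(1.5))    # ~0.806 vs ~0.8176
```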
5. The method for computing multiple neural network activation functions on a reconfigurable processor according to any one of claims 1 to 4, wherein
splitting the neural network activation function into basic operations comprises:
for a neural network activation function that includes exponential accumulation and exponential division, subtracting the maximum value of the input data from the input data of the activation function to prevent overflow, converting the division in the activation function into subtraction, and dividing the parameters participating in the operation into different operation terms according to the subtraction in the activation function;
and carrying out each basic operation in turn through the reconfigurable processing array, according to the computation order of the basic operations in the activation function, comprises:
carrying out the operation of each operation term in turn through the reconfigurable processing array.
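Informative note (not part of the claims): softmax is the canonical function with exponential accumulation and exponential division, and the decomposition of claim 5 corresponds to the familiar log-sum-exp identity; a minimal sketch follows.

```python
import math

# Stable softmax under the decomposition of claim 5: subtract the input
# maximum to prevent exponent overflow, then turn the division by the
# exponential sum into a subtraction inside the exponent, so only
# subtract / exp / log / accumulate operations remain.

def softmax(xs):
    m = max(xs)                                            # max search (cf. claim 6)
    shifted = [x - m for x in xs]                          # subtraction to prevent overflow
    log_sum = math.log(sum(math.exp(s) for s in shifted))  # exp (cf. claim 7) and log (cf. claim 8)
    return [math.exp(s - log_sum) for s in shifted]        # division realized as subtraction

print(softmax([1.0, 2.0, 3.0]))   # [~0.0900, ~0.2447, ~0.6652]
```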
6. The method for computing multiple neural network activation functions on a reconfigurable processor according to claim 5, wherein carrying out the operation of each operation term in turn through the reconfigurable processing array comprises:
dividing the input data into a plurality of data groups; for each data group, reading the input data through an access processing unit, receiving the input data through an operation processing unit, performing selection operations on the input data, and outputting the maximum value of the data group; processing the data groups in parallel to obtain the maximum value of each data group; and reading the maximum value of each data group through an access processing unit, receiving those maxima through an operation processing unit, performing selection operations on the received data, and outputting the largest of the group maxima as the maximum value of the input data.
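Informative note (not part of the claims): the two-level maximum search of claim 6 in sketch form; the group size of 4 is an illustrative assumption.

```python
# Two-level maximum search: group maxima are computed in parallel across
# operation processing units, then a second selection pass reduces the
# group maxima to the global maximum.

def global_max(xs, group_size=4):
    groups = [xs[i:i + group_size] for i in range(0, len(xs), group_size)]
    group_maxima = [max(g) for g in groups]   # first level, parallel per group
    return max(group_maxima)                  # second level, selection over maxima

print(global_max([3, 7, 1, 9, 4, 2, 8, 5]))   # 9
```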
7. The method for computing multiple neural network activation functions on a reconfigurable processor according to claim 5, wherein carrying out the operation of each operation term in turn through the reconfigurable processing array comprises:
for an exponential function with base e in an operation term: reading the input data through an access processing unit; subtracting the maximum value of the input data from the input data through an operation processing unit; multiplying the subtraction result by $\log_2 e$ through an operation processing unit, thereby converting the exponential function into an exponential function with base 2, wherein the product is the input of the rebased exponential function and comprises an integer part and a fractional part; performing a Taylor expansion of the base-2 exponential function whose exponent is the fractional part to obtain a polynomial; evaluating the polynomial through operation processing units to obtain the output of the base-2 exponential of the fractional part; performing a shift operation on that output by the integer part through an operation processing unit to obtain the output of the exponential function; and accumulating the outputs of the exponential function through an operation processing unit.
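Informative note (not part of the claims): a sketch of the rebasing scheme of claim 7; the third-order Taylor polynomial is an illustrative assumption, and in fixed-point hardware the final multiplication by 2^n is a shift.

```python
import math

# e^x = 2^(x * log2(e)); split the rebased exponent into integer part n
# and fractional part f, evaluate 2^f with a short Taylor polynomial,
# and fold the 2^n factor back in (a shift in fixed-point hardware).

LOG2_E = 1.4426950408889634   # the constant rendered as a figure in the claim text
LN_2 = 0.6931471805599453

def exp_base2(x):
    t = x * LOG2_E
    n = math.floor(t)                 # integer part -> shift amount
    f = t - n                         # fractional part in [0, 1)
    u = f * LN_2                      # 2^f = e^(f * ln 2)
    pow2_f = 1.0 + u + u * u / 2.0 + u ** 3 / 6.0   # 3rd-order Taylor about 0
    return pow2_f * (2.0 ** n)

print(exp_base2(1.0), math.exp(1.0))  # ~2.7176 vs 2.71828...
```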
8. The method for computing multiple neural network activation functions on a reconfigurable processor according to claim 7, wherein carrying out the operation of each operation term in turn through the reconfigurable processing array comprises:
for a logarithmic function with base e in an operation term, whose input is the accumulation of base-e exponential functions: converting the accumulated value into the product of k and a base-2 exponential function with exponent w; performing a leading-zero count through an operation processing unit to obtain the value of w; shifting the accumulated value to obtain the value of k; performing a Taylor expansion of the logarithmic function based on the values of w and k to obtain a polynomial; and evaluating the polynomial through operation processing units to obtain the output of the logarithmic function.
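Informative note (not part of the claims): a sketch of the normalization scheme of claim 8 for a fixed-point accumulated sum S >= 1 (which holds for a max-shifted exponential sum, since one of its terms equals 1); the Q16.16 format and the four-term series are illustrative assumptions.

```python
import math

# ln(S) via leading-zero normalization: write S = 2^w * k with k in
# [1, 2), recover w from the position of the leading one bit (a
# leading-zero count in hardware) and k by a shift, then
# ln(S) = w * ln(2) + ln(k) with ln(k) from a short Taylor series.

LN_2 = 0.6931471805599453
FRAC_BITS = 16                                 # assumed Q16.16 fixed point

def ln_fixed(s_fixed):
    w = s_fixed.bit_length() - 1 - FRAC_BITS   # exponent from leading-zero count
    k = s_fixed / (1 << (w + FRAC_BITS))       # shift so that k lies in [1, 2)
    z = k - 1.0                                # ln(1 + z), four-term series about 0
    ln_k = z - z * z / 2.0 + z ** 3 / 3.0 - z ** 4 / 4.0
    return w * LN_2 + ln_k

s = int(7.5 * (1 << FRAC_BITS))                # encode 7.5 in Q16.16
print(ln_fixed(s), math.log(7.5))              # ~1.955 vs ~2.0149; the short series is coarse near k -> 2
```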
9. The method for computing multiple neural network activation functions on a reconfigurable processor according to claim 5, wherein carrying out each basic operation in turn through the reconfigurable processing array comprises:
in the course of carrying out the basic operations through the reconfigurable processing array, when an operation processing unit needs to exchange data with a processing unit that is in neither its row nor its column, either having a processing unit that is data-transmission-interconnected with the operation processing unit execute a routing operation, so that the operation processing unit exchanges data with the off-row, off-column processing unit; or outputting the data of the operation processing unit to a global register for storage, from which the off-row, off-column processing unit reads it.
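Informative note (not part of the claims): the two data-movement fallbacks of claim 9 in sketch form; the dictionary-based array model and all names are illustrative assumptions.

```python
# Fallback 1: hop through an intermediate unit configured as a route
# (here, the unit sharing src's row and dst's column). Fallback 2:
# spill through a global register that any unit may read.

array = {}              # (row, col) -> value currently held by that unit
global_register = {}

def route_via_intermediate(src, dst, value):
    hop = (src[0], dst[1])     # intermediate unit: shares a row with src, a column with dst
    array[hop] = value         # routing operation: forward only, no arithmetic
    array[dst] = array[hop]

def spill_via_global_register(tag, value):
    global_register[tag] = value      # producer writes
    return global_register[tag]       # off-row/off-column consumer reads

route_via_intermediate((1, 1), (3, 4), 0.25)
print(array[(3, 4)])       # 0.25
```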
10. A reconfigurable processor for computing multiple neural network activation functions, characterized by comprising:
a shared memory for storing input data;
a reconfigurable processing array, configured to read the input data from the shared memory and carry out each basic operation in turn, following the computation order of the basic operations into which a neural network activation function has been split, wherein processing units on the peripheral edge of the reconfigurable processing array execute access operations and are called access processing units, the remaining processing units execute arithmetic operations and are called operation processing units, each peripheral processing unit exchanges data with the operation processing units in its own row or column, and each processing unit in the array exchanges data with whichever adjacent processing units exist above, below, to the left of, and to the right of it.
CN202011511272.XA 2020-12-18 2020-12-18 Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor Pending CN112540946A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011511272.XA CN112540946A (en) 2020-12-18 2020-12-18 Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor

Publications (1)

Publication Number Publication Date
CN112540946A true CN112540946A (en) 2021-03-23

Family

ID=75019265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011511272.XA Pending CN112540946A (en) 2020-12-18 2020-12-18 Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor

Country Status (1)

Country Link
CN (1) CN112540946A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115001955A (en) * 2022-06-08 2022-09-02 苏州花园集信息科技有限公司 Operation and maintenance data acquisition system and method thereof
WO2023116400A1 (en) * 2021-12-20 2023-06-29 深圳市中兴微电子技术有限公司 Vector operation method, vector operator, electronic device and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106775599A (en) * 2017-01-09 2017-05-31 南京工业大学 Many computing unit coarseness reconfigurable systems and method of recurrent neural network
US20190042922A1 (en) * 2018-06-29 2019-02-07 Kamlesh Pillai Deep neural network architecture using piecewise linear approximation
CN109409511A (en) * 2018-09-25 2019-03-01 西安交通大学 A kind of convolution algorithm data stream scheduling method for dynamic reconfigurable array
CN109472356A (en) * 2018-12-29 2019-03-15 南京宁麒智能计算芯片研究院有限公司 A kind of accelerator and method of restructural neural network algorithm
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN110688088A (en) * 2019-09-30 2020-01-14 南京大学 General nonlinear activation function computing device and method for neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LYU QING; JIANG LIN; DENG JUNYONG; LI XUETING: "A reconfigurable array architecture for logarithmic and exponential functions", Microelectronics & Computer, no. 10, 5 October 2016 (2016-10-05) *
LI ANG; WANG QIN; LI ZHANCAI; WAN YONG: "FPGA-based hardware implementation of neural networks", Journal of University of Science and Technology Beijing, no. 01, 25 January 2007 (2007-01-25) *

Similar Documents

Publication Publication Date Title
US10379816B2 (en) Data accumulation apparatus and method, and digital signal processing device
CN110659015A (en) Deep neural network architecture using piecewise linear approximation
CN110163353B (en) Computing device and method
CN107305484B (en) Nonlinear function operation device and method
CN111915003A (en) Neural network hardware accelerator
CN112540946A (en) Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor
EP3769208B1 (en) Stochastic rounding logic
CN113590195B (en) Memory calculation integrated DRAM computing unit supporting floating point format multiply-add
US20230221924A1 (en) Apparatus and Method for Processing Floating-Point Numbers
Ramachandran et al. Performance analysis of mantissa multiplier and dadda tree multiplier and implementing with DSP architecture
CN113902089A (en) Device, method and storage medium for accelerating operation of activation function
US20240126507A1 (en) Apparatus and method for processing floating-point numbers
US11551087B2 (en) Information processor, information processing method, and storage medium
Bruguera et al. Design of a pipelined radix 4 CORDIC processor
JP7354736B2 (en) Information processing device, information processing method, information processing program
US20200192633A1 (en) Arithmetic processing device and method of controlling arithmetic processing device
CN111178492B (en) Computing device, related product and computing method for executing artificial neural network model
WO2022126630A1 (en) Reconfigurable processor and method for computing multiple neural network activation functions thereon
JP7238376B2 (en) Information processing system and information processing system control method
Hsiao et al. Design of a low-cost floating-point programmable vertex processor for mobile graphics applications based on hybrid number system
CN109298848A (en) The subduplicate circuit of double mode floating-point division
CN116028011B (en) Calculation method for random precision decimal data of GPU database
Ueki et al. Aqss: Accelerator of quantization neural networks with stochastic approach
US20240111525A1 (en) Multiplication hardware block with adaptive fidelity control system
CN109416757A (en) For handling the method, equipment and computer readable storage medium of numeric data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination