CN113435570A

CN113435570A - Programmable convolutional neural network processor, method, device, medium, and terminal

Info

Publication number: CN113435570A
Application number: CN202110496788.XA
Authority: CN
Inventors: 张犁; 刘夏; 杨伯杨; 胡海虹; 闫战伟; 陈治宇
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-05-07
Filing date: 2021-05-07
Publication date: 2021-09-24
Anticipated expiration: 2041-05-07

Abstract

The invention belongs to the technical field of signal processing, and discloses a programmable convolutional neural network processor, a method, equipment, a medium and a terminal, wherein the programmable convolutional neural network processor comprises a Move instruction for indicating an execution unit address and a data address, and the programmable convolutional neural network processor is used for realizing acceleration of convolutional neural network reasoning operation in an embedded environment and finishing identification and classification of target data; the programmable convolutional neural network processor comprises a data storage control module, a data cache module, a processing unit array, a program storage module, a data transmission switching network and an instruction decoding-state machine control module. The control instruction of the processor uses the transmission Move instruction, each instruction comprises the address of the target execution unit transmitted by the instruction and the data address required by the operation of the execution unit, the instruction structure is simple, the structure of the processor is easy to parametrize, the compiler is simple in design and high in compiling efficiency, and the universality and the usability of the processor structure are improved.

Description

Programmable convolutional neural network processor, method, device, medium, and terminal

Technical Field

The invention belongs to the technical field of signal processing, and particularly relates to a programmable convolutional neural network processor, a method, equipment, a medium and a terminal.

Background

At present, aiming at the task of identifying and classifying target data in an embedded scene environment, firstly, strict limitations are imposed on the function, reliability, cost, volume, speed, power consumption and the like of a system in the embedded scene. Secondly, in the field of signal identification and image classification, the convolutional neural network is one of the most widely applied algorithms with the highest precision. A typical convolutional neural network mainly includes multiply-accumulate operations. The large number of multiply-accumulate operations requires a large amount of operation time. The power consumption of a Central Processing Unit (CPU) and a high-speed Graphics Processing Unit (GPU) of a traditional Personal Computer (PC) are in the order of hundreds of watts, so that the power consumption requirements of signal identification or target image classification tasks in an embedded scene are difficult to meet. Therefore, there is a need to develop a programmable convolutional neural network processor that is more real-time efficient, low-power, and more advantageous in terms of ease of use and price for signal processing tasks in embedded scenarios.

The university of sienna electronics technology discloses a programmable neural network processor in the patent document "programmable neural network processor" (application No. 201710805918.7, publication No. 107590535) of its application. The processor comprises a storage control module, a data cache module, a processing unit array, an instruction storage module, a data transmission network and a global control module. The processor selects a proper number of layers and types of neural networks according to requirements, generates instructions and configuration parameters according to the neural network structure programming, loads the instructions and the configuration parameters into the processor, sets a configuration register in the global control module according to the configuration parameters, reads target data to be processed, weights and offsets according to the instructions and the configuration register, calculates layer by layer, and finally outputs a target detection or identification result. The processor has the following defects: firstly, the convolution structure of the processor only supports a basic convolution neural network structure, and the achievable network structure is single; secondly, in the processor, in the operation unit B in the processing unit not multiplexed in the pooling part, a comparator in the control unit C is additionally designed for pooling operation, so that the resource consumption is high.

The patent document "programmable deep neural network processor" (application No. 201810281984.3, publication No. 108520297) filed by the general discloses a programmable deep neural network processor. The processor comprises a program control unit, a filter buffer area and a characteristic diagram buffer area. The processor realizes a programmable deep neural network with low power consumption and low cost through multiplexing control of a multiply-accumulate unit, characteristic diagram data reading control, characteristic diagram accumulation control and redundant data elimination control. The processor has the following defects: firstly, the multiply-accumulate unit of the processor is only a multiply-add array of 5x5, and the parallelism of the operation is low; secondly, the feature map data which can be processed by the processor at one time is limited, and the data transmission efficiency is low, so that the control pooling operation step is complicated, and the system reconfigurability is low.

Through the above analysis, the problems and defects of the prior art are as follows:

(1) a large amount of multiply-accumulate operation needs to consume a large amount of operation time, and the power consumption of a Central Processing Unit (CPU) and a high-speed Graphics Processing Unit (GPU) of a traditional Personal Computer (PC) are in the order of hundreds of watts, so that the power consumption requirements of signal identification or target image classification tasks in an embedded scene are difficult to meet.

(2) The convolution structure of the existing processor only supports a basic convolution neural network structure, and the achievable network structure is single; in the prior operation unit B in the non-multiplexing processing unit of the pooling part, a comparator in the other control unit C is designed for pooling operation, and the resource consumption is large.

(3) The multiply-accumulate unit of the existing processor is only a multiply-accumulate array of 5x5, and the parallelism degree of the operation is low; the existing processor has limited feature map data which can be processed at one time and low data transmission efficiency, so that the control pooling operation step is complicated, and the system has low reconfigurability.

The difficulty in solving the above problems and defects is:

(1) the convolutional neural network has a large number of operations, requires software to have higher flexibility, and has extremely high requirements on the design of a processor architecture to realize all functions in embedded application.

(2) The convolutional neural network has a complex structure and a plurality of variable structure parameters, and is difficult to be compatible with various convolutional neural network structures for hardware circuits. The convolution neural network has more operation forms, the existing design is usually realized respectively aiming at each single function and then combined, so that the resource of a processor is greatly wasted, and only one part of circuits are in a working state in each operation step. How to multiplex the core operation components to complete all operations of the convolutional neural network is difficult.

(3) The larger the operation array, the higher the parallelism of the processor, and the more difficult it is to control. The convolutional neural network contains a large number of similar operations, and the parallelism of the convolutional neural network is improved to the maximum extent if the operation speed of the processor is increased.

The significance of solving the problems and the defects is as follows:

(1) the embedded field is more valuable in practical application, and can meet the application requirements in application scenes more easily in the aspects of reliability, cost, volume, speed, power consumption and the like.

(2) All operations of the convolutional neural network are completed through the multiplexing core operation part, and the consumption of circuit resources can be saved.

(3) The user of the N-M operation array designed in the design can flexibly set according to specific application index requirements so as to meet different functions and index requirements and adapt to different application scenes. Meanwhile, if the power consumption and the circuit scale are not strictly required, the N x M array can be enlarged as much as possible, so that the calculation speed of the processor is increased.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a programmable convolutional neural network processor, a method, equipment, a medium and a terminal.

The invention is realized in this way, a programmable convolution neural network processor, the programmable convolution neural network processor comprises Move instructions for indicating the address of an execution unit and the address of data, realizes the acceleration of the inference operation of the convolution neural network under the embedded environment, and completes the identification and classification of target data;

the programmable convolutional neural network processor comprises a data storage control module, a data cache module, a processing unit array, a program storage module, a data transmission switching network and an instruction decoding-state machine control module;

the data storage control module is connected with the data caching module through a data transmission switching network; the program storage module is connected with the instruction decoding-state machine control module through a data transmission switching network; the processing unit array consists of N processing unit clusters, each processing unit cluster consists of M processing units, each processing unit cluster comprises a local data cache unit, and the processing unit array can calculate data of N characteristic graph convolution windows and data of M convolution kernels in parallel at a time; the data cache module is connected with the processing unit array through a data transmission switching network; the instruction decoding-state machine control module is connected with all other modules through a control bus.

Further, the data storage control module is used for controlling the memory to perform data read-write operation. Sending the data to be processed or the network weight parameter data read from the memory to a data caching module; writing back the operation result generated in the data caching module to a memory;

the data cache module is used for temporarily storing the data to be processed, the network weight parameter data or the feature map data connected by the residual block sent by the storage control module and forwarding the data to the processing unit array; the memory control module is used for temporarily storing the operation result generated by the processing unit array and writing back the operation result to the memory control module;

the processing unit array is used for receiving data to be processed or network weight parameter data sent by the data cache module, receiving a control instruction of the instruction decoding-state machine control module to complete corresponding operation, and sending an operation result to the data cache module;

the program storage module comprises 1 block of program memory and an addressing circuit thereof, and is used for storing all instructions of a program for guiding the processor to finish operation;

the data transmission switching network is used for connecting the data storage control module, the data cache module, the processing unit array and the program storage module and providing a broadband data channel for the data storage control module, the data cache module, the processing unit array and the program storage module; the device is used for efficiently transmitting data to be processed, network weight parameter data, an operation result and an instruction code;

the instruction decoding-state machine control module is used for completing a Move instruction of data according to the address of the execution mechanism and the data address indicated in the instruction, controlling the data storage control module, the data cache module, the processing unit array, the program storage module and the data transmission switching network, performing all neural network related operations on the read-in data to be processed or network weight parameter data, and outputting an operation result.

Further, the data cache module comprises a 4-block simple dual-port Block Random Access Memory (BRAM); the BRAM1 is used for temporarily storing the data to be processed sent by the storage control module and forwarding the data to be processed to the processing unit array; the BRAM2 is used for temporarily storing the weight parameter data sent by the storage control module and forwarding the weight parameter data to the processing unit array; the BRAM3 is used for temporarily storing the characteristic diagram data of the residual block connection sent by the storage control module and forwarding the characteristic diagram data to the processing unit array; the BRAM4 is used for temporarily storing the operation result generated by the processing unit array and writing back the operation result to the storage control module;

wherein the content of the first and second substances,

BRAM1 memory data bit width = (maximum signature width/n) × data bit width bit;

BRAM1 memory depth = (maximum convolution kernel size + maximum step size) × maximum number of channels of the signature = n;

BRAM2 memory data bit width = array depth data bit width bit;

BRAM2 memory depth = maximum convolution kernel size maximum number of convolution kernel channels + 1;

BRAM3 memory data bit width = (maximum residual block signature width/n) × data bit width bit;

BRAM3 memory depth = array depth n;

BRAM4 memory data bit width = (maximum signature width/n) × data bit width bit;

BRAM4 memory depth = array depth n.

n is an integer whose size is selected according to memory resource conditions, data transfer switching network bandwidth, and data transfer speed and efficiency requirements.

Furthermore, the processing unit array comprises N processing unit clusters, wherein one processing unit cluster comprises M processing units, and each processing unit can independently complete all operations contained in the convolutional neural network; the processing unit array is used for receiving data to be processed, network weight parameter data or feature map data connected with the residual block, which are sent by the data cache module, completing the operation of the operation target type indicated by the instruction, and then sending the operation result to the data cache module according to the instruction; the operation purpose type supported by the processing unit array refers to that any one of the following operation purpose types can be selected, wherein the operation purpose types comprise multiply-accumulate operation of convolution layers, addition operation among residual blocks, multiply-accumulate operation of full connection layers, pooling operation and nonlinear function operation; the arithmetic number supported by the processing unit array is made into a floating point number system or a fixed point number system; the nonlinear function operation means that the slope and the intercept are read by a piecewise linear table and then sent to a core operation component to complete the nonlinear function operation; the processing unit is divided into N clusters which are respectively numbered as: the cluster 0, the cluster 1 and the cluster … are N-1, each cluster comprises M processing units which are respectively numbered as: the 0 th, the 1 st and the … M-1 th correspond to the 1 st, the 2 nd and the 3 rd correspond to each other, and so on, in order to complete the pooling operation, a plurality of special pooling channels for pooling operation are designed in the processing unit array, and M/2 pooling special channels are arranged between the even processing unit and the corresponding odd processing unit in each cluster; when the convolution operation is completed, the even number operation unit and the odd number operation unit both generate a convolution operation result, and the odd number operation unit sends the operation result to the even number operation unit by utilizing the special pooling channel to perform the pooling operation.

Further, the processing unit comprises a piecewise linear table, a core operation component and an operation result register Acc _ Reg; the core operation part comprises an input data selection module MUX, a multiplication operation module Multi and an addition and subtraction operation module ADD/SUB; the required operation operand is selected by inputting the data selection signals Sel _ A and Sel _ B, and the operation type of the addition and subtraction module is selected by the operation destination address selection signal Sel _ C.

The core operation part can complete operations including multiply-accumulate operation of convolution layers, addition operation among residual blocks, multiply-accumulate operation of full connection layers, pooling operation and nonlinear function operation, and the generated result is stored in an operation result register Acc _ Reg; adopting a piecewise linear approximate table look-up method to carry out nonlinear function operation, setting an internal nonvolatile Flash memory with the depth of D and the data bit width of W, writing table look-up data into the nonvolatile Flash memory according to a nonlinear function to be realized, wherein the high W/2 bit of each data in a table look-up file represents the slope k of a linear approximate subsection, the low W/2 bit is the intercept b of the subsection, then taking the high D bit of input data x as the address of the nonvolatile Flash memory to carry out table look-up, the nonvolatile Flash memory outputs the slope k and the intercept b of the linear subsection where x is positioned at the moment, then using a core operation part to complete the operation of f (x) = kx + b, and finally obtaining a result which is the output value of the nonlinear function and outputting the result to an operation result register; wherein, the size of D represents the number of linear approximate segments, and the more the number is, the more accurate the function value of the calculation output is, and the setting can be carried out according to the application requirement.

The input data selection signal Sel _ a is used for selecting input operands of the MUX, including output data and input data In _ C of the multiplication module Multi;

the input data selection signal Sel _ B is used for simultaneously selecting input operands of three MUXs, including input data In _ a, input data In _ B and an operation result register Acc _ Reg, or the operation result register Acc _ Reg, the slope of a piecewise linear table and the intercept of the piecewise linear table;

the operation destination address selection signal Sel _ C is used to select the operation type of the ADD-subtract device ADD/SUB, including addition and subtraction.

Furthermore, the programmable convolutional neural network processor and the related instruction set thereof can not only develop the parallelism of data in network reasoning operation, but also realize the parallelism of instructions. The parallelism of the data comprises parallel calculation of M convolution kernels and parallel calculation of N characteristic graph convolution windows; the parallelism of the instructions comprises that reading of the data memory to be processed and reading of the weight parameter data memory are controlled in parallel according to the corresponding position address in each instruction, and data paths of the data memory are indicated. The instructions adopt a Move instruction format, and each Move instruction comprises an execution unit address op _ addr, an auxiliary marker bit marker _ bit, a to-be-processed data storage address fimg _ addr, a weight parameter data storage address paramt _ addr and an auxiliary extension bit extension _ bit.

Another object of the present invention is to provide a method for controlling a programmable convolutional neural network processor using the programmable convolutional neural network processor, including the steps of:

analyzing application requirements, acquiring data samples by using various data acquisition equipment, and labeling labels to make a data set;

and (3) acquiring corresponding data samples according to the application scene required by the user or the target to be detected, labeling each data sample, and arranging into a data set for training and testing the neural network model in the step two.

Designing a neural network model, training the model by using a data set on an upper computer software platform, testing the designed convolutional neural network model, and repeatedly modifying the model until the model reaches design application indexes;

and (3) training the model by using the data set manufactured in the step one on an upper computer software platform by referring to the existing neural network model, testing the designed convolutional neural network model, and repeatedly modifying the parameters of the network model until the testing accuracy reaches the design application index.

Performing hardware engineering optimization on the neural network model and evaluating the optimized performance; if the performance degradation meets the expected index, accepting the modification, otherwise, re-executing the second step and the third step;

fine-tuning the neural network model by referring to the hardware structure parameters of the processor, testing the accuracy of the network model again, if the performance is reduced a little, namely the performance meets the expected project index, accepting the modification, and continuing the step four; otherwise, re-executing the step two and the step three.

Step four, programming the neural network model generated in the step three by using a special Move instruction set, and generating configuration information and a program which are executable by corresponding hardware and comprise a piecewise linear table used for calculating a nonlinear function according to an instruction table;

programming the finally determined neural network model according to a special Move instruction set structure, and compiling a corresponding Move instruction compiler and a to-be-tested data initialization program, wherein the program can automatically generate binary machine code instructions of all hardware executable programs required by the neural network according to an instruction list.

And step five, completely downloading the generated configuration information and the instruction machine code into a hardware circuit through an upper computer, testing the performance of the actual circuit, and evaluating the operation index of the actual system.

And downloading the binary instruction machine code generated in the fourth step into a hardware circuit, testing the performance of the actual circuit, including function, power consumption, speed, area and the like, and evaluating the operation index of the actual circuit system.

It is a further object of the present invention to provide a host computer apparatus comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the host computer to perform the steps of:

(1) analyzing application requirements, acquiring data samples by using various data acquisition equipment, and labeling labels to make a data set;

(2) designing a neural network model, training the model by using a data set on an upper computer software platform, testing the designed convolutional neural network model, and repeatedly modifying the model until a design application index is reached;

(3) performing hardware engineering optimization on the neural network model and evaluating the optimized performance; accepting the modification if the performance degradation meets the expected criteria, otherwise re-executing steps (2) and (3);

(4) programming the neural network model generated in the step three by using a special Move instruction set, and generating corresponding hardware executable configuration information and instruction codes according to an instruction table;

(5) and downloading all the generated configuration information and the instruction machine code into a hardware circuit, testing the performance of the actual circuit, and evaluating the operation index of the actual system.

It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the host computer to perform the steps of:

Another object of the present invention is to provide an information data processing terminal for implementing the programmable convolutional neural network processor.

By combining all the technical schemes, the invention has the advantages and positive effects that: the programmable convolutional neural network processor provided by the invention can improve the speed in a target data identification and classification task under an embedded environment and reduce the resource consumption. The invention can realize two convolution structures, including a basic convolution neural network and a residual error network ResNet structure, and adopts a design structure based on a broadband data transmission switching network, all operation multiplexing operation processing units and an N x M two-dimensional large-scale processing unit array, thereby not only improving the universality of a network processor, improving the processing speed and efficiency, reducing the resource consumption and the system power consumption, but also greatly simplifying the design of a system compiler and accelerating the engineering application speed of a deep learning network.

The invention adopts a Move instruction execution structure, all the functional modules are mutually connected through the data transmission switching network optimized by speed and transmission bandwidth, the speed and the efficiency of data transmission are greatly improved, and further, the data communication overhead in the parallel operation process can be greatly reduced, thereby improving the operation speed and the efficiency of the system. The regular system circuit architecture and the simple programming instruction with complete functions provided by the invention can easily design the compiler system of the regular system circuit architecture, greatly simplify the complexity of compiling and deploying deep learning application programs and further improve the system design efficiency. The structure is not only suitable for the design of a convolutional neural network processor, but also can be used for the design of a pulse neural network and a pulse convolutional neural network processor.

Selecting a proper type of neural network according to requirements, generating corresponding instructions and configuration parameters according to the neural network structure programming, loading the instructions and the configuration parameters into a processor, setting a configuration register in a processing unit array by the processor according to the configuration parameters, reading in data to be processed, weight and offset according to an address in the instructions, sending the data to be processed, the weight and the offset to a processing unit address indicated by the instructions, calculating layer by layer, and finally outputting a classification result of a target; the control instruction of the processor uses a transmission Move instruction, each instruction comprises an address of an execution unit transmitted by the instruction and a data address required by the operation of the execution unit, the instruction structure is simple, the processor structure is easy to parametrize, and the method is suitable for various convolutional neural network structures, so that the universality of the processor structure is improved, and an application program compiler matched with the processor can be very easily designed.

Compared with the prior art, the invention has the following advantages:

firstly, two convolution structures are adopted in the invention, so that a basic convolution neural network and a residual error network ResNet structure are realized, the defect of single convolution network structure in the prior art is overcome, the structure of the optional convolution neural network is more, and the universality in practical application is improved.

Secondly, the invention adopts the mode that all operations multiplex the core operation parts in the processing unit, thereby overcoming the defect of more circuit resource consumption in the neural network processor in the prior art, ensuring that the invention can multiplex the operation unit array to the maximum extent, reducing the circuit resource consumption of the processor and simultaneously reducing the system power consumption.

Thirdly, because the invention adopts the large-scale processing unit array with the scale of N x M, the defects of low operation parallelism and low operation unit utilization rate in the whole operation process in the prior art are overcome, so that the invention can calculate the data of N characteristic diagrams and M convolution kernels in parallel each time, thereby not only improving the calculation speed of the processor, but also further realizing the parametrization design of the processor structure according to the application requirements and meeting the real-time requirements and the individual requirements under more embedded environments.

Fourthly, because the invention adopts the Move instruction execution structure, namely all the functional modules are mutually connected through the data transmission switching network which is specially optimized by speed and transmission bandwidth, the defect of low data transmission efficiency in the prior art is overcome, and further, the data communication overhead in the parallel operation process can be greatly reduced, thereby improving the operation speed and efficiency of the system.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a block diagram of a programmable convolutional neural network processor according to an embodiment of the present invention;

in the figure: 1. a data storage control module; 2. a data caching module; 3. an array of processing units; 4. a program storage module; 5. a data transmission switching network; 6. and the instruction decoding-state machine control module.

Fig. 2 is a schematic diagram of the overall structure of the programmable convolutional neural network processor according to the embodiment of the present invention.

Fig. 3 is a flowchart of a control method of a programmable convolutional neural network processor according to an embodiment of the present invention.

Fig. 4 is a diagram of an internal structure of a single processing unit according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In view of the problems in the prior art, the present invention provides a programmable convolutional neural network processor, method, device, medium, and terminal, and the present invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the programmable convolutional neural network processor provided in the embodiment of the present invention includes a data storage control module 1, a data buffer module 2, a processing unit array 3, a program storage module 4, a data transmission switching network 5, and an instruction decoding-state machine control module 6.

The data storage control module 1 is connected with the data cache module 2 through a data transmission switching network 5; the program storage module 4 is connected with the instruction decoding-state machine control module 6 through a data transmission switching network 5; the processing unit array 3 is composed of a plurality of processing unit clusters, each processing unit cluster is composed of a plurality of processing units, and each processing unit cluster comprises a local data cache unit; the data cache module 2 is connected with the processing unit array 3 through a data network; the instruction decoding-state machine control module 6 is connected with all other modules through a control bus.

As shown in fig. 3, a control method of a programmable convolutional neural network processor provided in an embodiment of the present invention includes the following steps:

s101, analyzing application requirements, acquiring data samples by using various data acquisition equipment, and labeling a label to manufacture a data set;

s102, designing a neural network model, training the model by using a data set on an upper computer software platform, testing the designed convolutional neural network model, and repeatedly modifying the model until the model reaches design application indexes;

s103, carrying out hardware engineering optimization on the neural network model and evaluating the optimized performance; accepting the modification if the performance degradation meets the expected criteria, otherwise re-executing S102 and S103;

s104, programming the neural network model generated in the step S103 by using a special Move instruction set, and generating corresponding hardware executable configuration information and instruction codes according to an instruction table;

and S105, completely downloading the generated configuration information and the instruction machine code into a hardware circuit through the upper computer, testing the performance of the actual circuit, and evaluating the operation index of the actual system.

The technical solution of the present invention will be further described with reference to the following examples.

Example 1

The programmable convolutional neural network processor provided by the embodiment of the invention selects a proper type of neural network according to requirements, generates a corresponding program instruction and configuration parameters according to the neural network structure programming, loads the program instruction and the configuration parameters into the processor, sets a configuration register in a processing unit array according to the configuration parameters, reads data to be processed, weight and offset according to an address in the instruction, sends the data to be processed, the weight and the offset to a processing unit address indicated by the instruction, calculates layer by layer, and finally outputs a classification result of a target. The control instruction of the processor uses a transmission Move instruction, each instruction comprises an execution unit address transmitted by the instruction and a data address required by executing the unit operation, the instruction structure is simple, the processor structure is easy to participate, and the method is suitable for various convolutional neural network structures, so that the universality of the processor structure is improved, and an application program compiler matched with the processor can be very easily designed.

The convolutional neural network processor comprises a Move instruction for indicating an address of an execution unit and a data address, and is used for realizing acceleration of convolutional neural network reasoning operation in an embedded environment and finishing identification and classification of target data; the programmable convolutional neural network processor comprises a data storage control module, a data cache module, a processing unit array, a program storage module, a data transmission switching network and an instruction decoding-state machine control module, wherein the data storage control module is connected with the data cache module through the data transmission switching network; the program storage module is connected with the instruction decoding-state machine control module through a data transmission switching network; the processing unit array consists of N processing unit clusters, each processing unit cluster consists of M processing units, each processing unit cluster comprises a local data cache unit, and the processing unit array can calculate data of N characteristic graph convolution windows and data of M convolution kernels in parallel at a time; the data cache module is connected with the processing unit array through a data transmission switching network; the instruction decoding-state machine control module is connected with all other modules through a control bus; wherein:

and the data storage control module is used for controlling the memory to carry out data reading and writing operations. Sending the data to be processed or the network weight parameter data read from the memory to a data caching module; writing back the operation result generated in the data caching module to a memory;

the program storage module is used for storing all instructions in a program for guiding the processor to finish operation;

the instruction decoding-state machine control module completes the Move instruction of data according to the address of the execution mechanism and the data address indicated in the instruction, controls the data storage control module, the data cache module, the processing unit array, the program storage module and the data transmission switching network, performs all neural network related operations on the read-in data to be processed or network weight parameter data, and outputs an operation result.

Because the invention adopts two convolution structures, the basic convolution neural network and the residual error network ResNet structure are realized, the defect of single convolution network structure in the prior art is overcome, the structure of the optional convolution neural network is more, and the universality in practical application is improved.

The invention adopts the mode that all operations multiplex the core operation parts in the processing units, overcomes the defect of more circuit resource consumption in the neural network processor in the prior art, can multiplex the operation unit array to the maximum extent, reduces the circuit resource consumption of the processor and simultaneously reduces the system power consumption.

Because the invention adopts the large-scale processing unit array with the scale of N x M, the defects of low operation parallelism and low operation unit utilization rate in the whole operation process in the prior art are overcome, so that the invention can calculate the data of N characteristic graphs and M convolution kernels in parallel each time, thereby not only improving the calculation speed of the processor, but also further realizing the parametrization design of the processor structure according to the application requirements and meeting the real-time requirements and the individual requirements under more embedded environments.

The invention adopts Move instruction execution structure, that is, all the function modules are connected with each other through the data transmission switching network which is specially optimized by speed and transmission bandwidth, thus overcoming the defect of low data transmission efficiency in the prior art, and further greatly reducing the data communication overhead in the parallel operation process, thereby improving the operation speed and efficiency of the system.

Example 2

Referring to fig. 2, the convolutional neural network processor of the present invention includes a data storage control module, a data cache module, a processing unit array, a program storage module, a data transmission switching network, and an instruction decoding-state machine control module, where the data storage control module is connected to the data cache module through the data transmission switching network; the program storage module is connected with the instruction decoding-state machine control module through a data transmission switching network; the processing unit array is composed of a plurality of processing unit clusters, each processing unit cluster is composed of a plurality of processing units, and each processing unit cluster comprises a local data cache unit; the data cache module is connected with the processing unit array through a data network; the instruction decoding-state machine control module is connected with all other modules through a control bus.

And the data storage control module is used for controlling the memory to carry out data reading and writing operations. Sending the data to be processed or the network weight parameter data read from the memory to a data caching module; writing back the operation result generated in the data caching module to the data memory;

the data cache module comprises a 4-block simple dual-port block random access memory BRAM. The BRAM1 is used for temporarily storing the data to be processed sent by the storage control module and forwarding the data to the processing unit array; the BRAM2 is used for temporarily storing the weight parameter data sent by the storage control module and forwarding the weight parameter data to the processing unit array; the BRAM3 is used for temporarily storing the characteristic diagram data of the residual block connection sent by the storage control module and forwarding the characteristic diagram data to the processing unit array; the BRAM4 is used for temporarily storing the operation result generated by the processing unit array and writing back the operation result to the storage control module; wherein:

BRAM2 memory data bit width = array depth data bit width bit;

BRAM3 memory depth = array depth n;

BRAM4 memory depth = array depth n;

The processing unit array comprises N processing unit clusters, wherein one processing unit cluster comprises M processing units, and each processing unit can independently complete all operations contained in the convolutional neural network; the processing unit array is used for receiving data to be processed, network weight parameter data or feature map data connected with the residual block, which are sent by the data cache module, completing the operation of the operation target type indicated by the instruction, and then sending the operation result to the data cache module according to the instruction; the operation purpose type supported by the processing unit array refers to that any one of the following operation purpose types can be selected, wherein the operation purpose types comprise multiply-accumulate operation of convolution layers, addition operation among residual blocks, multiply-accumulate operation of full connection layers, pooling operation and nonlinear function operation; the arithmetic number supported by the processing unit array is made into a floating point number system or a fixed point number system; the nonlinear function operation means that the slope and the intercept are read by a piecewise linear table and then sent to a core operation component to complete the nonlinear function operation; the processing unit is divided into N clusters which are respectively numbered as: the cluster 0, the cluster 1 and the cluster … are N-1, each cluster comprises M processing units which are respectively numbered as: the 0 th, the 1 st and the … M-1 th correspond to the 1 st, the 2 nd and the 3 rd correspond to each other, and so on, in order to complete the pooling operation, a plurality of special pooling channels for pooling operation are designed in the processing unit array, and M/2 pooling special channels are arranged between the even processing unit and the corresponding odd processing unit in each cluster; when the convolution operation is completed, the even number operation unit and the odd number operation unit both generate a convolution operation result, and the odd number operation unit sends the operation result to the even number operation unit by using the special pooling channel to perform pooling operation;

the program storage module comprises 1 block of program memory and an addressing circuit thereof, and is used for storing all instructions for guiding the processor to complete operation;

Referring to fig. 4, the processing unit includes a piecewise linear table, a core operation unit, and an operation result register Acc _ Reg. The core operation part comprises an input data selection module MUX, a multiplication operation module Multi and an addition and subtraction operation module ADD/SUB. The required operation operand is selected by the input data selection signals Sel _ a and Sel _ B, and the operation type of the addition and subtraction module is selected by the operation destination address selection signal Sel _ C.

The core operation part can complete operations including multiply-accumulate operation of convolution layers, addition operation among residual blocks, multiply-accumulate operation of full connection layers, pooling operation and nonlinear function operation, and the generated result is stored in an operation result register Acc _ Reg; a piecewise linear approximate table look-up method is adopted to carry out nonlinear function operation, specifically, an internal nonvolatile Flash memory with the depth of D and the data bit width of W is set, table look-up data is written into the nonvolatile Flash memory according to a nonlinear function to be realized, the high W/2 bit of each data in a table look-up file represents the slope k of a linear approximate segment, the low W/2 bit represents the intercept b of the segment, then the high D bit of input data x is used as the address of the nonvolatile Flash memory to carry out table look-up, the nonvolatile Flash memory outputs the slope k and the intercept b of the linear segment where x is located at the moment, a core operation part is used to complete the operation of f (x) = kx + b, and the obtained result is the output value of the nonlinear function and is output to an operation result register. The size of D represents the number of linear approximation segments, and the larger the number, the more accurate the value of the function of the calculation output, which can be set according to the application requirements.

The input data selection signal Sel _ a is used to select input operands of the MUX, including output data of the multiplication module Multi and input data In _ C.

The input data selection signal Sel _ B is used for simultaneously selecting input operands of three MUXs, including input data In _ a, input data In _ B, and operation result register Acc _ Reg, or operation result register Acc _ Reg, slope of piecewise linear table, and intercept of piecewise linear table.

In the Move instruction format adopted by the invention, each Move instruction comprises an execution unit address op _ addr, an auxiliary marker bit marker _ bit, a to-be-processed data storage address fimg _ addr, a weight parameter data storage address paramt _ addr and an auxiliary extension bit extension _ bit.

TABLE 1 instruction Format

The present invention is further described below with reference to examples of the present invention. In an actual embedded scene, when a target data identification and classification task is carried out, the specific process of designing the processor according to the method provided by the invention is as follows:

step 1, analyzing requirements, and collecting data samples and label making data sets.

Analyzing application requirements, using various data acquisition equipment in an actual application scene, acquiring samples and labeling a label making data set.

And 2, designing a neural network model, training the model by using a data set on an upper computer software platform, and testing the performance.

And designing and optimizing a neural network model according to the application requirements by referring to the classical convolutional neural network, simulating by using software by using the collected data set, training and testing the designed convolutional neural network model, and repeatedly modifying the model until the designed application indexes are reached.

And 3, further carrying out hardware engineering optimization on the neural network model and evaluating the optimized performance.

And (3) performing hardware engineering optimization on the model which is designed by software, performing performance evaluation on the optimized and adjusted model, accepting the modification if the performance reduction meets the expected index, and otherwise, re-executing the step (2) and the step (3).

And 4, programming by using a special Move instruction set and generating a related instruction machine code.

And programming the neural network model generated in the step 3 by using a special Move instruction set, and generating corresponding hardware executable configuration information and instruction codes according to the instruction table.

And 5, downloading the configuration information and the instruction machine code into a hardware circuit through the upper computer, and carrying out actual circuit performance test.

And downloading all the generated configuration information and instruction machine codes into a hardware circuit, and then carrying out operation index test on the actual system.

There are computer software copyright registration documents "neural network processor instruction program generation system software V1.0" (registration number 2019SR 0368741) and "neural network processor application program generation software V1.0" (registration number 2019SR 1410805). The application program file executable by the processor, including the hardware executable binary instruction machine code and the configuration information thereof, can be obtained only by inputting the model and the corresponding parameters on the interface according to the prompt, then saving the model and clicking the generation button. Meanwhile, the software has the function of initializing different types of data according to the model, a user does not need to study the underlying principle of the neural network processor or manually write an application instruction program, time and labor are saved, and the software has a visual operation interface and is visual, simple and convenient to use.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When used, may be implemented in whole or in part in the form of a processor program product including one or more processor instructions. When loaded or executed on a processor, cause the flow or functions according to embodiments of the invention to occur, in whole or in part. The host computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The processor instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the processor instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL), or wireless (e.g., infrared, wireless, microwave, etc.)). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any modification, equivalent replacement, and improvement made by those skilled in the art within the technical scope of the present invention disclosed in the present invention should be covered within the scope of the present invention.

Claims

1. A programmable convolutional neural network processor is characterized by comprising a Move instruction for indicating an execution unit address and a data address, wherein the programmable convolutional neural network processor is used for realizing acceleration of convolutional neural network reasoning operation in an embedded environment and finishing identification and classification of target data;

2. The programmable convolutional neural network processor of claim 1, wherein the data storage control module is configured to control the memory to perform data read/write operations, and includes sending data to be processed read from the memory or network weight parameter data to the data caching module; writing back the operation result generated in the data caching module to a memory;

the instruction decoding-state machine control module is used for completing a Move instruction of data according to the address of the execution mechanism and the data address indicated in the instruction, controlling the data storage control module, the data cache module, the processing unit array, the program storage module and the data transmission switching network, completing all related operations of the neural network on the read-in data to be processed or network weight parameter data, and outputting an operation result.

3. The programmable convolutional neural network processor of claim 1, wherein the data cache module comprises 4 simple dual-port Block Random Access Memories (BRAMs); the BRAM1 is used for temporarily storing the data to be processed sent by the storage control module and forwarding the data to be processed to the processing unit array; the BRAM2 is used for temporarily storing the weight parameter data sent by the storage control module and forwarding the weight parameter data to the processing unit array; the BRAM3 is used for temporarily storing the characteristic diagram data of the residual block connection sent by the storage control module and forwarding the characteristic diagram data to the processing unit array; the BRAM4 is used for temporarily storing the operation result generated by the processing unit array and writing back the operation result to the storage control module;

wherein the content of the first and second substances,

BRAM2 memory data bit width = array depth data bit width bit;

BRAM3 memory depth = array depth n;

BRAM4 memory depth = array depth n;

4. The programmable convolutional neural network processor of claim 1, wherein the processing unit array comprises N processing unit clusters, wherein one processing unit cluster comprises M processing units, and each processing unit can independently complete all operations contained in the convolutional neural network; the processing unit array is used for receiving data to be processed, network weight parameter data or feature map data connected with the residual block, which are sent by the data cache module, completing the operation of the operation target type indicated by the instruction, and then sending the operation result to the data cache module according to the instruction; the operation purpose type supported by the processing unit array refers to that any one of the following operation purpose types can be selected, wherein the operation purpose types comprise multiply-accumulate operation of convolution layers, addition operation among residual blocks, multiply-accumulate operation of full connection layers, pooling operation and nonlinear function operation; the arithmetic number supported by the processing unit array is made into a floating point number system or a fixed point number system; the nonlinear function operation means that the slope and the intercept are read by a piecewise linear table and then sent to a core operation component to complete the nonlinear function operation; the processing unit is divided into N clusters which are respectively numbered as: the cluster 0, the cluster 1 and the cluster … are N-1, each cluster comprises M processing units which are respectively numbered as: the 0 th, the 1 st and the … M-1 th correspond to the 1 st, the 2 nd and the 3 rd correspond to each other, and so on, in order to complete the pooling operation, a plurality of special pooling channels for pooling operation are designed in the processing unit array, and M/2 pooling special channels are arranged between the even processing unit and the corresponding odd processing unit in each cluster; when the convolution operation is completed, the even number operation unit and the odd number operation unit both generate a convolution operation result, and the odd number operation unit sends the operation result to the even number operation unit by utilizing the special pooling channel to perform the pooling operation.

5. The programmable convolutional neural network processor of claim 1, wherein the processing unit comprises a piecewise linear table, a core arithmetic unit, and an arithmetic result register Acc Reg; the core operation part comprises an input data selection module MUX, a multiplication operation module Multi and an addition and subtraction operation module ADD/SUB; selecting the operand required for operation through input data selection signals Sel _ A and Sel _ B, and selecting the operation type of an addition and subtraction module through an operation destination address selection signal Sel _ C;

the core operation part can complete operations including multiply-accumulate operation of convolution layers, addition operation among residual blocks, multiply-accumulate operation of full connection layers, pooling operation and nonlinear function operation, and the generated result is stored in an operation result register Acc _ Reg; performing nonlinear function operation by adopting a piecewise linear approximate table look-up method, setting an internal nonvolatile Flash memory with the depth of D and the data bit width of W to store look-up table data, writing the look-up table data into the Flash memory in a firmware form through an external configuration interface according to a nonlinear function to be realized, wherein the high W/2 bit of each data in a look-up table file represents the slope k of a linear approximate subsection, the low W/2 bit is the intercept b of the subsection, then, the high D bit of input data x is taken as the address of the Flash memory to perform table look-up, the Flash memory outputs the slope k and the intercept b of the linear subsection where x is positioned at the moment, then, a core operation part is used for completing the operation of f (x) = kx + b, and finally, the obtained result is the output value of the nonlinear function and is output to an operation result register; the size of D represents the number of linear approximate segments, and the more the number is, the more accurate the calculation output function value is, and the setting can be carried out according to the application requirement;

6. The programmable convolutional neural network processor of claim 1, wherein the programmable convolutional neural network processor and its associated instruction set can exploit data parallelism and implement instruction parallelism in network inference operations; the parallelism of the data comprises parallel calculation of M convolution kernel data and parallel calculation of N characteristic graph convolution window data; the parallelism of the instructions comprises that reading of a data memory to be processed and reading of a weight parameter data memory are controlled in parallel according to the corresponding position address in each instruction, and a data path of the data memory is indicated; the instructions adopt a Move instruction format, and each Move instruction comprises an execution unit address op _ addr, an auxiliary marker bit marker _ bit, a to-be-processed data storage address fimg _ addr, a weight parameter data storage address paramt _ addr and an auxiliary extension bit extension _ bit.

7. A control method of a programmable convolutional neural network processor operating the programmable convolutional neural network processor of any one of claims 1 to 6, comprising the steps of:

designing a neural network model, training the model on a software platform by using a data set, testing the designed convolutional neural network model, and repeatedly modifying the model until the model reaches a design application index;

step four, programming the model generated in the step three by using a special Move instruction set, and generating configuration information and instruction codes which are executable by corresponding hardware and comprise a piecewise linear table used for calculating a nonlinear function according to an instruction table;

and step five, downloading all the generated configuration information and instruction machine codes into a hardware circuit, testing the performance of the actual circuit, and evaluating the operation index of the actual system.

8. An upper computer apparatus comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the upper computer to perform the steps of:

(2) designing a neural network model, training the model on a software platform by using a data set, testing the designed convolutional neural network model, and repeatedly modifying the model until a design application index is reached;

(4) programming the model generated in the step three by using a special Move instruction set, and generating configuration information and instruction codes which are executable by corresponding hardware and comprise a piecewise linear table used for calculating a nonlinear function according to an instruction table;

9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:

10. An information data processing terminal, characterized in that the information data processing terminal is used for implementing a programmable convolutional neural network processor as claimed in any one of claims 1 to 6.