CN107862378A - Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal - Google Patents


Info

Publication number
CN107862378A
CN107862378A (application CN201711273248.5A / CN201711273248A; granted as CN107862378B)
Authority
CN
China
Prior art keywords
convolution kernel
convolutional neural network
predetermined number
multi-core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711273248.5A
Other languages
Chinese (zh)
Other versions
CN107862378B (en)
Inventor
张慧明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Core Chip Technology (shanghai) Co Ltd
VeriSilicon Microelectronics Shanghai Co Ltd
Vivante Corp
Original Assignee
Core Chip Technology (shanghai) Co Ltd
VeriSilicon Microelectronics Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Core Chip Technology (Shanghai) Co Ltd and VeriSilicon Microelectronics Shanghai Co Ltd
Priority to CN201711273248.5A priority Critical patent/CN107862378B/en
Publication of CN107862378A publication Critical patent/CN107862378A/en
Application granted granted Critical
Publication of CN107862378B publication Critical patent/CN107862378B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means
    • G06N 3/065: Analogue means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides a multi-core-based convolutional neural network acceleration method and system, a storage medium, and a terminal. The method includes: splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; connecting the convolution kernels in series; performing, in parallel on each convolution kernel, a first predetermined number of dot-product operations, each dot-product operation comprising a second predetermined number of multiplications, where the product of the first predetermined number and the second predetermined number equals the number of multiply-accumulate units in a convolution kernel; and outputting the dot-product results of each convolution kernel in order of output priority. By running multiple convolution kernels in parallel, the multi-core-based convolutional neural network acceleration method and system, storage medium, and terminal of the present invention save the data bandwidth of the convolutional neural network; under the same hardware data bandwidth, the parallel dot-product operations within each convolution kernel improve the processing speed of the convolutional neural network.

Description

Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal
Technical field
The present invention relates to the technical field of data processing, and in particular to a multi-core-based convolutional neural network acceleration method and system, a storage medium, and a terminal.
Background art
At present, deep learning and machine learning are widely applied in visual processing, speech recognition, and image analysis. Convolutional neural networks are an important component of deep learning and machine learning; improving the processing speed of convolutional neural networks improves the processing speed of deep learning and machine learning in equal proportion.
In the prior art, visual processing, speech recognition, and image analysis applications are all based on multi-layer convolutional neural networks. Each layer of a convolutional neural network involves a large amount of data processing and convolution computation, placing very high demands on hardware processing speed and resources. With the continuous development of wearable devices, Internet of Things applications, and autonomous driving technology, implementing convolutional neural networks in embedded products at smooth processing speeds has become a major challenge for current hardware architecture design. Taking the typical convolutional neural networks ResNet and VGG16 as examples: at 16-bit floating-point precision, running ResNet at 60 frames per second requires a bandwidth of 15 GB/s, and running VGG16 at 60 frames per second requires a bandwidth of 6.0 GB/s.
At present, convolutional neural network acceleration is achieved by arranging multiple convolution units in parallel. Ideally, the more convolution units there are, the faster the processing. In practice, however, data bandwidth severely limits the processing speed of the convolution units; hardware bandwidth resources are scarce, and increasing the data bandwidth of the hardware is costly. Improving the processing speed of convolutional neural networks under limited data bandwidth and hardware overhead has therefore become an urgent problem for current hardware architecture design.
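The bandwidth figures above follow directly from frame rate and numeric precision. Below is a back-of-envelope sketch of that relationship; the per-frame data volume used is a hypothetical illustrative figure, since the patent does not give the breakdown behind its 15 GB/s and 6.0 GB/s numbers.

```python
# Rough model: bandwidth = values moved per frame * bytes per value * fps.
# The per-frame volume below is an assumed figure for illustration only.

BYTES_PER_FP16 = 2   # 16-bit floating-point precision, as in the examples
FPS = 60             # target frame rate from the examples

def required_bandwidth_gb_per_s(values_moved_per_frame: int) -> float:
    """Estimated bandwidth in GB/s for one network at the given frame rate."""
    return values_moved_per_frame * BYTES_PER_FP16 * FPS / 1e9

# A hypothetical network moving 125 million fp16 values per frame:
print(round(required_bandwidth_gb_per_s(125_000_000), 1))  # 15.0
```

This makes clear why reusing input data across convolution kernels, rather than adding raw bandwidth, is the lever the patent pursues.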
Summary of the invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide a multi-core-based convolutional neural network acceleration method and system, a storage medium, and a terminal that save the data bandwidth of a convolutional neural network through multiple parallel convolution kernels, and that, under the same hardware data bandwidth, improve the processing speed of the convolutional neural network through parallel dot-product operations within each convolution kernel.
To achieve the above and other related objects, the present invention provides a multi-core-based convolutional neural network acceleration method comprising the following steps: splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; connecting the convolution kernels in series so that input data is transmitted serially between them; performing, in parallel on each convolution kernel, a first predetermined number of dot-product operations, each dot-product operation comprising a second predetermined number of multiplications, the product of the first predetermined number and the second predetermined number being the number of multiply-accumulate units in a convolution kernel; and outputting the dot-product results of each convolution kernel in order of output priority.
In one embodiment of the invention, the second predetermined number is 3, so as to support 3D vector dot products.
In one embodiment of the invention, the output priority is determined by the order in which the input data reaches the convolution kernels: a convolution kernel that receives the input data earlier has a higher output priority than a convolution kernel that receives it later.
In one embodiment of the invention, performing the first predetermined number of dot-product operations in parallel on each convolution kernel comprises the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
feeding input data 1 to N into multiply-accumulate units 1 to N for the first multiplication pass;
feeding input data 2 to N+1 into multiply-accumulate units N+1 to 2N for the second multiplication pass;
and so on, until input data M to N+M-1 are fed into multiply-accumulate units (N*M-N+1) to N*M for the M-th multiplication pass;
accumulating, at each of the N positions, the products of the M multiplication passes at that position, yielding N dot-product results.
Correspondingly, the present invention provides a multi-core-based convolutional neural network acceleration system comprising a convolution kernel setup module, a dot-product module, and an output module.
The convolution kernel setup module splits one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel, and connects the convolution kernels in series so that input data is transmitted serially between them.
The dot-product module performs, in parallel on each convolution kernel, a first predetermined number of dot-product operations, each dot-product operation comprising a second predetermined number of multiplications; the product of the first predetermined number and the second predetermined number is the number of multiply-accumulate units in a convolution kernel.
The output module outputs the dot-product results of each convolution kernel in order of output priority.
In one embodiment of the invention, the second predetermined number is 3, so as to support 3D vector dot products.
In one embodiment of the invention, the output priority is determined by the order in which the input data reaches the convolution kernels: a convolution kernel that receives the input data earlier has a higher output priority than a convolution kernel that receives it later.
In one embodiment of the invention, when performing the first predetermined number of dot-product operations in parallel on each convolution kernel, the dot-product module performs the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
feeding input data 1 to N into multiply-accumulate units 1 to N for the first multiplication pass;
feeding input data 2 to N+1 into multiply-accumulate units N+1 to 2N for the second multiplication pass;
and so on, until input data M to N+M-1 are fed into multiply-accumulate units (N*M-N+1) to N*M for the M-th multiplication pass;
accumulating, at each of the N positions, the products of the M multiplication passes at that position, yielding N dot-product results.
The present invention further provides a storage medium on which a computer program is stored; when executed by a processor, the program implements the above multi-core-based convolutional neural network acceleration method.
Finally, the present invention provides a terminal comprising a processor and a memory;
the memory stores a computer program;
the processor executes the computer program stored in the memory, so that the terminal performs the above multi-core-based convolutional neural network acceleration method.
As described above, the multi-core-based convolutional neural network acceleration method and system, storage medium, and terminal of the present invention have the following beneficial effects:
(1) They save the dynamic-memory bandwidth consumed by convolutional neural network computation; taking a 4-convolution-kernel configuration as an example, 75% of the input image data bandwidth can be saved at the same data processing speed.
(2) They improve the processing speed of the convolutional neural network: under the same hardware data bandwidth, taking 3D vector dot products as an example, the processing speed can be raised to 300% of the original.
(3) They reduce the dynamic power consumption of the convolutional neural network: with 4 convolution kernels and 3D vector dot products, computation time is reduced to 33% of the original and input image bandwidth to 25% of the original, reducing dynamic power consumption by 85%.
(4) They optimize the processing speed of convolutional neural networks in embedded products, offer a clear architecture, a clear division of labor, easy implementation, and a simple flow, and can be widely applied in the Internet of Things, wearable devices, and vehicle-mounted equipment.
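The gains claimed in (1) to (3) follow from simple ratios, which the following sketch makes explicit. The 85% power figure in (3) depends on power-model assumptions the patent does not state, so it is not reproduced here.

```python
# Ratio arithmetic behind the stated gains (a sketch, not a measurement):
# with K serially-chained kernels, only one of K input reads remains, and
# an M-way parallel dot product cuts compute time to 1/M.

K = 4   # convolution kernels sharing one serial input channel
M = 3   # multiplications fused into one 3D vector dot product

bandwidth_saved = 1 - 1 / K     # 0.75 -> "saves 75% of input bandwidth"
speedup = M                     # 3x   -> "processing speed to 300%"
runtime_fraction = 1 / M        # ~0.33 -> "computation time to 33%"
bandwidth_fraction = 1 / K      # 0.25 -> "input bandwidth to 25%"

print(bandwidth_saved, speedup, round(runtime_fraction, 2), bandwidth_fraction)
```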
Brief description of the drawings
Fig. 1 is a flowchart of the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 2 is a coordinate schematic diagram of the input image, the coefficients, and the output image;
Fig. 3 is an architecture diagram of the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 4 is a first state diagram of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 5 is a second state diagram of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 6 is a third state diagram of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 7 is a state diagram of the summation step of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 8 is a structural diagram of the multi-core-based convolutional neural network acceleration system of the present invention in an embodiment;
Fig. 9 is a structural diagram of the terminal of the present invention in an embodiment.
Description of reference numerals
21 Convolution kernel setup module
22 Dot-product module
23 Output module
31 Processor
32 Memory
Detailed description of the embodiments
The embodiments of the present invention are described below by way of specific examples; those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, where there is no conflict, the following embodiments and the features in them may be combined with one another.
It should be noted that the drawings provided in the following embodiments merely illustrate the basic concept of the present invention in a schematic way; they show only the components related to the present invention rather than the component counts, shapes, and sizes of an actual implementation. In an actual implementation, the form, quantity, and proportion of each component may vary arbitrarily, and the component layout may be more complex.
When data bandwidth is limited, the multi-core-based convolutional neural network acceleration method and system, storage medium, and terminal of the present invention save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels; under the same hardware data bandwidth, they improve its processing speed through parallel dot-product operations within each convolution kernel. They optimize the processing speed of convolutional neural networks in embedded products, offer a clear architecture, a clear division of labor, easy implementation, and a simple flow, and can be widely applied in the Internet of Things, wearable devices, and vehicle-mounted equipment.
As shown in Fig. 1, in an embodiment, the multi-core-based convolutional neural network acceleration method of the present invention comprises the following steps.
Step S1: split one layer of the convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; connect the convolution kernels in series so that input data is transmitted serially between them.
Taking image processing as an example, as shown in Fig. 2, the input and output images are both three-dimensional. The input image has an abscissa inx, an ordinate iny, and a coefficient depth coordinate kz; the output image has an abscissa outx, an ordinate outy, and a depth coordinate z. The coefficients are four-dimensional, with an abscissa kx, an ordinate ky, a coefficient depth coordinate kz, and an output depth coordinate z.
When one layer of the convolutional neural network is split into four subtasks, the coefficients are split into four groups along the z direction, as shown in Fig. 2, and each coefficient group is assigned to the convolution kernel of a different subtask. Data is serially connected between the different convolution kernels, so that bandwidth is saved through data sharing.
According to the processing characteristics of convolutional neural networks, each coefficient group must be convolved with the entire input image to obtain the output image of one z-plane, so the input image is highly reusable. As shown in Fig. 3, the convolution kernels are connected in series through a serial channel, so that input image data passes serially from one kernel to the next. Specifically, after the input image data is read from memory into convolution kernel 0, kernel 0 convolves it with the first coefficient group while simultaneously forwarding it to convolution kernel 1 through the serial data channel; kernel 1 thus avoids the bandwidth cost of reading the input image data itself. Convolution kernels 2 and 3 perform the same data operation and likewise avoid that bandwidth cost. With 4 convolution kernels, the serial data channel eliminates three of the four readings of the input image data, saving 75% of the input image data bandwidth while also reducing memory power consumption; it also resolves the trade-off between routing quantity and achievable frequency at the physical implementation level.
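The serially-chained kernels of Fig. 3 can be sketched as follows. Only kernel 0 reads each input datum from memory; every kernel forwards the datum to its successor over the serial channel, so K kernels cost one read instead of K. Function and variable names are illustrative, not from the patent.

```python
# Minimal software model of the serial kernel chain: each kernel k
# multiplies the shared input stream by its own coefficient group, while
# memory reads are counted once per datum (kernel 0's read is the only one).

def run_chain(inputs, coeff_groups):
    """Return per-kernel accumulated sums and the total memory reads."""
    memory_reads = 0
    partial_sums = [0] * len(coeff_groups)
    for x in inputs:
        memory_reads += 1                     # kernel 0 reads from memory ...
        for k, coeff in enumerate(coeff_groups):
            partial_sums[k] += coeff * x      # ... then x is forwarded down the chain
    return partial_sums, memory_reads

sums, reads = run_chain([1, 2, 3], coeff_groups=[1, 10, 100, 1000])
print(sums, reads)   # [6, 60, 600, 6000] 3  (3 reads instead of 4*3 = 12)
```

Without the chain, each of the 4 kernels would read all 3 inputs itself, hence the 75% saving cited above.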
Step S2: based on each convolution kernel, perform a first predetermined number of dot-product operations in parallel, each dot-product operation comprising a second predetermined number of multiplications; the product of the first predetermined number and the second predetermined number is the number of multiply-accumulate units in a convolution kernel.
In one embodiment of the invention, performing the first predetermined number of dot-product operations in parallel on each convolution kernel comprises the following steps:
21) obtain (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
22) feed input data 1 to N into multiply-accumulate units 1 to N for the first multiplication pass;
23) feed input data 2 to N+1 into multiply-accumulate units N+1 to 2N for the second multiplication pass;
24) and so on, until input data M to N+M-1 are fed into multiply-accumulate units (N*M-N+1) to N*M for the M-th multiplication pass;
25) accumulate, at each of the N positions, the products of the M multiplication passes at that position, obtaining N dot-product results.
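Steps 21) to 25) can be transcribed directly into a short software model: N parallel dot products of length M over a sliding window of N+M-1 inputs. In hardware the M passes run on separate multiply-accumulate units; here they are sequential loops, which is sufficient to check the scheduling.

```python
# N parallel dot products of length M over a window of N+M-1 inputs,
# mirroring steps 21)-25). Pass j multiplies inputs[j .. j+N-1] by the
# single coefficient k_j and accumulates into the N result positions.

def parallel_dot_products(inputs, coeffs):
    M = len(coeffs)              # multiplications per dot product (step 21)
    N = len(inputs) - M + 1      # number of dot products; N*M MAC units
    acc = [0] * N
    for j in range(M):           # passes of steps 22)-24)
        for i in range(N):
            acc[i] += inputs[i + j] * coeffs[j]   # accumulation of step 25)
    return acc

# 5 inputs, 3 coefficients -> N = 3 results: out_i = sum_j in_{i+j} * k_j
print(parallel_dot_products([1, 2, 3, 4, 5], [1, 1, 1]))  # [6, 9, 12]
```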
The case of a dot product comprising three multiplications (M=3) is further illustrated below. In this embodiment, each convolution kernel contains 24 multiply-accumulate units. To maximize their utilization under limited bandwidth, a 3D vector dot product comprising three multiplications is used to compute 8 dot-product results simultaneously (N=8). The 3D vector dot-product formulas are as follows:
out0 = in0*k0 + in1*k1 + in2*k2
out1 = in1*k0 + in2*k1 + in3*k2
out2 = in2*k0 + in3*k1 + in4*k2
......
out7 = in7*k0 + in8*k1 + in9*k2
First, as shown in Fig. 4, 10 input image data are read from memory, namely in0, in1, in2, ..., in9. The 0th to 7th data (in0, in1, in2, in3, in4, in5, in6, in7) are multiplied by coefficient 0 (k0), and the results are written to the accumulator inputs, namely out00, out01, out02, ..., out07.
Next, as shown in Fig. 5, the 10 input image data of the previous step are reused. The 1st to 8th data (in1, in2, in3, in4, in5, in6, in7, in8) are multiplied by coefficient 1 (k1), and the results are written to the accumulator inputs, namely out10, out11, out12, ..., out17.
Then, as shown in Fig. 6, the 10 input image data of the previous step are reused again. The 2nd to 9th data (in2, in3, in4, in5, in6, in7, in8, in9) are multiplied by coefficient 2 (k2), and the results are written to the accumulator inputs, namely out20, out21, out22, ..., out27.
Finally, as shown in Fig. 7, the results of the three multiplication passes are accumulated position by position to obtain the 8 3D vector dot-product results. The results at the first position of each pass, out00, out10, and out20, are added to give the first 3D vector dot-product result; the results at the second position, out01, out11, and out21, are added to give the second; and so on, until out07, out17, and out27 at the eighth position are added to give the eighth 3D vector dot-product result.
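The Fig. 4 to Fig. 7 walkthrough can be cross-checked numerically: three accumulation passes over 10 inputs must reproduce the closed-form out0 to out7 formulas given above. The input values and coefficients below are arbitrary test data, not values from the patent.

```python
# Three passes (Figs. 4, 5, 6) followed by positionwise accumulation
# (Fig. 7), checked against out_i = in_i*k0 + in_{i+1}*k1 + in_{i+2}*k2.

ins = list(range(10))   # in0..in9 = 0..9 (illustrative values)
k = [2, 3, 5]           # k0, k1, k2   (illustrative values)

acc = [0] * 8
for j in range(3):      # pass j multiplies ins[j..j+7] by k[j]
    for i in range(8):
        acc[i] += ins[i + j] * k[j]

expected = [ins[i]*k[0] + ins[i+1]*k[1] + ins[i+2]*k[2] for i in range(8)]
print(acc == expected, acc[0])  # True 13
```

All 24 products (8 positions times 3 passes) use the same 10 inputs, which is how the scheme keeps 24 multiply-accumulate units busy from a 10-value read.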
Step S3: output the dot-product results of each convolution kernel in order of output priority.
Specifically, the output priority is determined by the order in which the input data reaches the convolution kernels: the earlier a kernel receives the input data, the higher its output priority. The convolution kernel that receives the input data first therefore has a higher output priority than a convolution kernel that receives it later.
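A minimal sketch of the output-priority rule: results are emitted in the order in which each kernel first received the shared input stream (kernel 0 before kernel 1, and so on down the serial chain). The arrival times below are illustrative only.

```python
# Kernels paired with the (illustrative) time each first received input;
# earlier arrival means higher output priority, so we emit in that order.

kernels = [
    ("kernel2", 3),
    ("kernel0", 1),
    ("kernel3", 4),
    ("kernel1", 2),
]

output_order = [name for name, _ in sorted(kernels, key=lambda kv: kv[1])]
print(output_order)  # ['kernel0', 'kernel1', 'kernel2', 'kernel3']
```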
As shown in Fig. 8, in an embodiment, the multi-core-based convolutional neural network acceleration system of the present invention comprises a convolution kernel setup module 21, a dot-product module 22, and an output module 23.
The convolution kernel setup module 21 splits one layer of the convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel, and connects the convolution kernels in series so that input data is transmitted serially between them.
Taking image processing as an example, as shown in Fig. 2, the input and output images are both three-dimensional. The input image has an abscissa inx, an ordinate iny, and a coefficient depth coordinate kz; the output image has an abscissa outx, an ordinate outy, and a depth coordinate z. The coefficients are four-dimensional, with an abscissa kx, an ordinate ky, a coefficient depth coordinate kz, and an output depth coordinate z.
When one layer of the convolutional neural network is split into four subtasks, the coefficients are split into four groups along the z direction, as shown in Fig. 2, and each coefficient group is assigned to the convolution kernel of a different subtask. Data is serially connected between the different convolution kernels, so that bandwidth is saved through data sharing.
According to the processing characteristics of convolutional neural networks, each coefficient group must be convolved with the entire input image to obtain the output image of one z-plane, so the input image is highly reusable. As shown in Fig. 3, the convolution kernels are connected in series through a serial channel, so that input image data passes serially from one kernel to the next. Specifically, after the input image data is read from memory into convolution kernel 0, kernel 0 convolves it with the first coefficient group while simultaneously forwarding it to convolution kernel 1 through the serial data channel; kernel 1 thus avoids the bandwidth cost of reading the input image data itself. Convolution kernels 2 and 3 perform the same data operation and likewise avoid that bandwidth cost. With 4 convolution kernels, the serial data channel eliminates three of the four readings of the input image data, saving 75% of the input image data bandwidth while also reducing memory power consumption; it also resolves the trade-off between routing quantity and achievable frequency at the physical implementation level.
The dot-product module 22 is connected to the convolution kernel setup module 21 and performs, in parallel on each convolution kernel, a first predetermined number of dot-product operations, each dot-product operation comprising a second predetermined number of multiplications; the product of the first predetermined number and the second predetermined number is the number of multiply-accumulate units in a convolution kernel.
In one embodiment of the invention, when performing the first predetermined number of dot-product operations in parallel on each convolution kernel, the dot-product module 22 performs the following steps:
21) obtain (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
22) feed input data 1 to N into multiply-accumulate units 1 to N for the first multiplication pass;
23) feed input data 2 to N+1 into multiply-accumulate units N+1 to 2N for the second multiplication pass;
24) and so on, until input data M to N+M-1 are fed into multiply-accumulate units (N*M-N+1) to N*M for the M-th multiplication pass;
25) accumulate, at each of the N positions, the products of the M multiplication passes at that position, obtaining N dot-product results.
The case of a 3D vector dot product comprising three multiplications (M=3) is further illustrated below. In this embodiment, each convolution kernel contains 24 multiply-accumulate units. To maximize their utilization under limited bandwidth, a 3D vector dot product comprising three multiplications is used to compute 8 dot-product results simultaneously (N=8). The 3D vector dot-product formulas are as follows:
out0 = in0*k0 + in1*k1 + in2*k2
out1 = in1*k0 + in2*k1 + in3*k2
out2 = in2*k0 + in3*k1 + in4*k2
......
out7 = in7*k0 + in8*k1 + in9*k2
First, as shown in Fig. 4, 10 input image data are read from memory, namely in0, in1, in2, ..., in9. The 0th to 7th data (in0, in1, in2, in3, in4, in5, in6, in7) are multiplied by coefficient 0 (k0), and the results are written to the accumulator inputs, namely out00, out01, out02, ..., out07.
Next, as shown in Fig. 5, the 10 input image data of the previous step are reused. The 1st to 8th data (in1, in2, in3, in4, in5, in6, in7, in8) are multiplied by coefficient 1 (k1), and the results are written to the accumulator inputs, namely out10, out11, out12, ..., out17.
Then, as shown in Fig. 6, the 10 input image data of the previous step are reused again. The 2nd to 9th data (in2, in3, in4, in5, in6, in7, in8, in9) are multiplied by coefficient 2 (k2), and the results are written to the accumulator inputs, namely out20, out21, out22, ..., out27.
Finally, as shown in Fig. 7, the results of the three multiplication passes are accumulated position by position to obtain the 8 3D vector dot-product results. The results at the first position of each pass, out00, out10, and out20, are added to give the first 3D vector dot-product result; the results at the second position, out01, out11, and out21, are added to give the second; and so on, until out07, out17, and out27 at the eighth position are added to give the eighth 3D vector dot-product result.
The output module 23 is connected to the dot-product module 22 and outputs the dot-product results of each convolution kernel in order of output priority.
Specifically, the output priority is determined by the order in which the input data reaches the convolution kernels: the earlier a kernel receives the input data, the higher its output priority. The convolution kernel that receives the input data first therefore has a higher output priority than a convolution kernel that receives it later.
It should be noted that it should be understood that the modules of system above division be only a kind of division of logic function, Can completely or partially it be integrated on a physical entity when actually realizing, can also be physically separate.And these modules can be with All realized in the form of software is called by treatment element;All it can also realize in the form of hardware;Can also part mould Block calls the form of software to realize by treatment element, and part of module is realized by the form of hardware.For example, x modules can be The treatment element individually set up, it can also be integrated in some chip of said apparatus and realize, in addition it is also possible to program generation The form of code is stored in the memory of said apparatus, is called by some treatment element of said apparatus and is performed above x moulds The function of block.The realization of other modules is similar therewith.In addition these modules can completely or partially integrate, can also be only It is vertical to realize.Treatment element described here can be a kind of integrated circuit, have the disposal ability of signal.In implementation process, Each step of the above method or more modules can pass through the integrated logic circuit or soft of the hardware in processor elements The instruction of part form is completed.
For example, the above modules may be configured as one or more integrated circuits implementing the above method, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). As another example, when one of the above modules is implemented by a processing element scheduling program code, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of invoking program code. As yet another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
A computer program is stored on the storage medium of the present invention; when executed by a processor, the program implements the above multi-core-based convolutional neural network acceleration method. Preferably, the storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.
As shown in Fig. 9, in one embodiment, the terminal of the present invention includes a processor 31 and a memory 32.
The memory 32 is configured to store a computer program.
Preferably, the memory 32 includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.
The processor 31 is connected to the memory 32 and is configured to execute the computer program stored in the memory 32, so that the terminal performs the above multi-core-based convolutional neural network acceleration method.
Preferably, the processor 31 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In summary, the multi-core-based convolutional neural network acceleration method and system, storage medium, and terminal of the present invention save the dynamic memory bandwidth consumed by convolutional neural network computation: taking a four-convolution-kernel paradigm as an example, 75% of the input image data bandwidth can be saved at the same data processing speed. They improve the processing speed of convolutional neural networks: under the same hardware data bandwidth, taking 3D dot products as an example, the processing speed can be increased to 300%. They reduce the dynamic power consumption of convolutional neural networks: taking four convolution kernels with 3D dot products as an example, the operation time is reduced to 33% of the original, the input image bandwidth is reduced to 25% of the original, and the dynamic power consumption is reduced by 85%. They optimize the processing speed of convolutional neural networks in embedded products, with the advantages of a clear architecture, a clear division of labor, easy implementation, and a simple flow, and can be widely applied to the Internet of Things, wearable devices, and vehicle-mounted devices. The present invention therefore effectively overcomes various shortcomings of the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.

Claims (10)

1. A multi-core-based convolutional neural network acceleration method, characterized by comprising the following steps:
splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel, wherein the convolution kernels are serially connected so that the input data is serially transferred between the convolution kernels;
performing, in parallel on each convolution kernel, a first predetermined number of dot product operations, each dot product operation comprising a second predetermined number of multiplications, wherein the product of the first predetermined number and the second predetermined number is the number of multiply-adders in a convolution kernel; and
outputting the dot product results of each convolution kernel in order of output priority.
2. The multi-core-based convolutional neural network acceleration method according to claim 1, characterized in that the second predetermined number is 3 so as to support 3D dot products.
3. The multi-core-based convolutional neural network acceleration method according to claim 1, characterized in that the output priority is determined by the order in which the convolution kernels receive the input data, a convolution kernel that receives the input data first having a higher output priority than a convolution kernel that receives the input data later.
4. The multi-core-based convolutional neural network acceleration method according to claim 1, characterized in that performing, in parallel on each convolution kernel, a first predetermined number of dot product operations comprises the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
inputting the 1st to Nth input data into the 1st to Nth multiply-adders, respectively, to be multiplied by a first coefficient;
inputting the 2nd to (N+1)th input data into the (N+1)th to (2N)th multiply-adders, respectively, to be multiplied by a second coefficient;
and so on, until the Mth to (N+M-1)th input data are input into the (N*M-N+1)th to (N*M)th multiply-adders, respectively, to be multiplied by an Mth coefficient; and
accumulating the products at corresponding positions over the N positions of the M multiplications to obtain N dot product results.
5. A multi-core-based convolutional neural network acceleration system, characterized by comprising a convolution kernel setup module, a vector dot product module, and an output module;
the convolution kernel setup module is configured to split one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel, wherein the convolution kernels are serially connected so that the input data is serially transferred between the convolution kernels;
the vector dot product module is configured to perform, in parallel on each convolution kernel, a first predetermined number of dot product operations, each dot product operation comprising a second predetermined number of multiplications, wherein the product of the first predetermined number and the second predetermined number is the number of multiply-adders in a convolution kernel; and
the output module is configured to output the dot product results of each convolution kernel in order of output priority.
6. The multi-core-based convolutional neural network acceleration system according to claim 5, characterized in that the second predetermined number is 3 so as to support 3D dot products.
7. The multi-core-based convolutional neural network acceleration system according to claim 5, characterized in that the output priority is determined by the order in which the convolution kernels receive the input data, a convolution kernel that receives the input data first having a higher output priority than a convolution kernel that receives the input data later.
8. The multi-core-based convolutional neural network acceleration system according to claim 5, characterized in that the vector dot product module, when performing the first predetermined number of dot product operations in parallel on each convolution kernel, performs the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
inputting the 1st to Nth input data into the 1st to Nth multiply-adders, respectively, to be multiplied by a first coefficient;
inputting the 2nd to (N+1)th input data into the (N+1)th to (2N)th multiply-adders, respectively, to be multiplied by a second coefficient;
and so on, until the Mth to (N+M-1)th input data are input into the (N*M-N+1)th to (N*M)th multiply-adders, respectively, to be multiplied by an Mth coefficient; and
accumulating the products at corresponding positions over the N positions of the M multiplications to obtain N dot product results.
9. A storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the multi-core-based convolutional neural network acceleration method according to any one of claims 1 to 4 is implemented.
10. A terminal, characterized by comprising a processor and a memory;
the memory is configured to store a computer program; and
the processor is configured to execute the computer program stored in the memory, so that the terminal performs the multi-core-based convolutional neural network acceleration method according to any one of claims 1 to 4.
CN201711273248.5A 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal Active CN107862378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711273248.5A CN107862378B (en) 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711273248.5A CN107862378B (en) 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN107862378A true CN107862378A (en) 2018-03-30
CN107862378B CN107862378B (en) 2020-04-24

Family

ID=61705060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711273248.5A Active CN107862378B (en) 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN107862378B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681773A (en) * 2018-05-23 2018-10-19 腾讯科技(深圳)有限公司 Accelerated method, device, terminal and the readable storage medium storing program for executing of data operation
CN108920413A (en) * 2018-06-28 2018-11-30 中国人民解放军国防科技大学 Convolutional neural network multi-core parallel computing method facing GPDSP
CN109117940A (en) * 2018-06-19 2019-01-01 腾讯科技(深圳)有限公司 To accelerated method, apparatus and system before a kind of convolutional neural networks
CN109740733A (en) * 2018-12-27 2019-05-10 深圳云天励飞技术有限公司 Deep learning network model optimization method, device and relevant device
CN109740747A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel
CN110109646A (en) * 2019-03-28 2019-08-09 北京迈格威科技有限公司 Data processing method, device and adder and multiplier and storage medium
WO2020003043A1 (en) * 2018-06-27 2020-01-02 International Business Machines Corporation Low precision deep neural network enabled by compensation instructions
CN110689121A (en) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Method for realizing neural network model splitting by using multi-core processor and related product
CN110689115A (en) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Neural network model processing method and device, computer equipment and storage medium
CN110738317A (en) * 2019-10-17 2020-01-31 中国科学院上海高等研究院 FPGA-based deformable convolution network operation method, device and system
CN110796245A (en) * 2019-10-25 2020-02-14 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
CN110880032A (en) * 2018-09-06 2020-03-13 黑芝麻智能科技(上海)有限公司 Convolutional neural network using adaptive 3D array
CN110889497A (en) * 2018-12-29 2020-03-17 中科寒武纪科技股份有限公司 Learning task compiling method of artificial intelligence processor and related product
CN111563586A (en) * 2019-02-14 2020-08-21 上海寒武纪信息科技有限公司 Splitting method of neural network model and related product
CN111610963A (en) * 2020-06-24 2020-09-01 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
WO2021057746A1 (en) * 2019-09-24 2021-04-01 安徽寒武纪信息科技有限公司 Neural network processing method and apparatus, computer device and storage medium
WO2021081854A1 (en) * 2019-10-30 2021-05-06 华为技术有限公司 Convolution operation circuit and convolution operation method
CN114399828A (en) * 2022-03-25 2022-04-26 深圳比特微电子科技有限公司 Training method of convolution neural network model for image processing
CN116303108A (en) * 2022-09-07 2023-06-23 芯砺智能科技(上海)有限公司 Convolutional neural network weight address arrangement method suitable for parallel computing architecture

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN106203617A (en) * 2016-06-27 2016-12-07 哈尔滨工业大学深圳研究生院 A kind of acceleration processing unit based on convolutional neural networks and array structure
CN106599883A (en) * 2017-03-08 2017-04-26 王华锋 Face recognition method capable of extracting multi-level image semantics based on CNN (convolutional neural network)
CN106845635A (en) * 2017-01-24 2017-06-13 东南大学 CNN convolution kernel hardware design methods based on cascade form
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN106951395A (en) * 2017-02-13 2017-07-14 上海客鹭信息技术有限公司 Towards the parallel convolution operations method and device of compression convolutional neural networks
CN107301456A (en) * 2017-05-26 2017-10-27 中国人民解放军国防科学技术大学 Deep neural network multinuclear based on vector processor speeds up to method
WO2017186829A1 (en) * 2016-04-27 2017-11-02 Commissariat A L'energie Atomique Et Aux Energies Alternatives Device and method for calculating convolution in a convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Jinfeng, "A Concise and Efficient Method for Accelerating Convolutional Neural Networks," Science Technology and Engineering *
Li Daxia, "CUDA-CONVNET Deep Convolutional Neural Network Algorithm," China Master's Theses Full-text Database, Information Science and Technology *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681773A (en) * 2018-05-23 2018-10-19 腾讯科技(深圳)有限公司 Accelerated method, device, terminal and the readable storage medium storing program for executing of data operation
CN109117940B (en) * 2018-06-19 2020-12-15 腾讯科技(深圳)有限公司 Target detection method, device, terminal and storage medium based on convolutional neural network
CN109117940A (en) * 2018-06-19 2019-01-01 腾讯科技(深圳)有限公司 To accelerated method, apparatus and system before a kind of convolutional neural networks
JP2021530761A (en) * 2018-06-27 2021-11-11 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Low-precision deep neural network enabled by compensation instructions
GB2590000A (en) * 2018-06-27 2021-06-16 Ibm Low precision deep neural network enabled by compensation instructions
WO2020003043A1 (en) * 2018-06-27 2020-01-02 International Business Machines Corporation Low precision deep neural network enabled by compensation instructions
GB2590000B (en) * 2018-06-27 2022-12-07 Ibm Low precision deep neural network enabled by compensation instructions
JP7190799B2 (en) 2018-06-27 2022-12-16 インターナショナル・ビジネス・マシーンズ・コーポレーション Low Precision Deep Neural Networks Enabled by Compensation Instructions
CN108920413A (en) * 2018-06-28 2018-11-30 中国人民解放军国防科技大学 Convolutional neural network multi-core parallel computing method facing GPDSP
CN110880032B (en) * 2018-09-06 2022-07-19 黑芝麻智能科技(上海)有限公司 Convolutional neural network using adaptive 3D array
US11954573B2 (en) 2018-09-06 2024-04-09 Black Sesame Technologies Inc. Convolutional neural network using adaptive 3D array
CN110880032A (en) * 2018-09-06 2020-03-13 黑芝麻智能科技(上海)有限公司 Convolutional neural network using adaptive 3D array
CN109740733B (en) * 2018-12-27 2021-07-06 深圳云天励飞技术有限公司 Deep learning network model optimization method and device and related equipment
CN109740733A (en) * 2018-12-27 2019-05-10 深圳云天励飞技术有限公司 Deep learning network model optimization method, device and relevant device
CN109740747B (en) * 2018-12-29 2019-11-12 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN110889497A (en) * 2018-12-29 2020-03-17 中科寒武纪科技股份有限公司 Learning task compiling method of artificial intelligence processor and related product
US11893414B2 (en) 2018-12-29 2024-02-06 Cambricon Technologies Corporation Limited Operation method, device and related products
CN109740747A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN111563586A (en) * 2019-02-14 2020-08-21 上海寒武纪信息科技有限公司 Splitting method of neural network model and related product
CN111563586B (en) * 2019-02-14 2022-12-09 上海寒武纪信息科技有限公司 Splitting method of neural network model and related product
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel
CN110109646B (en) * 2019-03-28 2021-08-27 北京迈格威科技有限公司 Data processing method, data processing device, multiplier-adder and storage medium
CN110109646A (en) * 2019-03-28 2019-08-09 北京迈格威科技有限公司 Data processing method, device and adder and multiplier and storage medium
WO2021057720A1 (en) * 2019-09-24 2021-04-01 安徽寒武纪信息科技有限公司 Neural network model processing method and apparatus, computer device, and storage medium
CN110689121A (en) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Method for realizing neural network model splitting by using multi-core processor and related product
CN110689115A (en) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Neural network model processing method and device, computer equipment and storage medium
WO2021057746A1 (en) * 2019-09-24 2021-04-01 安徽寒武纪信息科技有限公司 Neural network processing method and apparatus, computer device and storage medium
CN110689115B (en) * 2019-09-24 2023-03-31 安徽寒武纪信息科技有限公司 Neural network model processing method and device, computer equipment and storage medium
CN110738317A (en) * 2019-10-17 2020-01-31 中国科学院上海高等研究院 FPGA-based deformable convolution network operation method, device and system
CN110796245B (en) * 2019-10-25 2022-03-22 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
CN110796245A (en) * 2019-10-25 2020-02-14 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
WO2021081854A1 (en) * 2019-10-30 2021-05-06 华为技术有限公司 Convolution operation circuit and convolution operation method
CN111610963A (en) * 2020-06-24 2020-09-01 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN111610963B (en) * 2020-06-24 2021-08-17 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN114399828B (en) * 2022-03-25 2022-07-08 深圳比特微电子科技有限公司 Training method of convolution neural network model for image processing
CN114399828A (en) * 2022-03-25 2022-04-26 深圳比特微电子科技有限公司 Training method of convolution neural network model for image processing
CN116303108A (en) * 2022-09-07 2023-06-23 芯砺智能科技(上海)有限公司 Convolutional neural network weight address arrangement method suitable for parallel computing architecture
CN116303108B (en) * 2022-09-07 2024-05-14 芯砺智能科技(上海)有限公司 Weight address arrangement method suitable for parallel computing architecture

Also Published As

Publication number Publication date
CN107862378B (en) 2020-04-24

Similar Documents

Publication Publication Date Title
CN107862378A (en) Convolutional neural networks accelerated method and system, storage medium and terminal based on multinuclear
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
CN107862650B (en) Method for accelerating calculation of CNN convolution of two-dimensional image
US10394929B2 (en) Adaptive execution engine for convolution computing systems
JP2022037022A (en) Execution of kernel stride in hardware
EP3761235A1 (en) Transposing neural network matrices in hardware
CN107862374A (en) Processing with Neural Network system and processing method based on streamline
US20160240000A1 (en) Rendering views of a scene in a graphics processing unit
TW202414280A (en) Method, system and non-transitory computer-readable storage medium for performing computations for a layer of neural network
GB2600031A (en) Batch processing in a neural network processor
CN107797962A (en) Computing array based on neutral net
CN108122030A (en) A kind of operation method of convolutional neural networks, device and server
CN110309906A (en) Image processing method, device, machine readable storage medium and processor
TW202123093A (en) Method and system for performing convolution operation
WO2019136750A1 (en) Artificial intelligence-based computer-aided processing device and method, storage medium, and terminal
KR20200081044A (en) Method and apparatus for processing convolution operation of neural network
CN108170640A (en) The method of its progress operation of neural network computing device and application
CN110147252A (en) A kind of parallel calculating method and device of convolutional neural networks
CN104967428B (en) Frequency domain implementation method for FPGA high-order and high-speed FIR filter
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN110109646A (en) Data processing method, device and adder and multiplier and storage medium
CN109447254B (en) Convolution neural network reasoning hardware acceleration method and device thereof
KR20180125843A (en) A hardware classifier applicable to various CNN models
JP2023541350A (en) Table convolution and acceleration

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 201203 China (Shanghai) Free Trade Pilot Zone 20A, Zhangjiang Building, 289 Chunxiao Road

Applicant after: Xinyuan Microelectronics (Shanghai) Co., Ltd.

Applicant after: Core chip technology (Shanghai) Co., Ltd.

Address before: 201203 Zhangjiang Building 20A, 560 Songtao Road, Zhangjiang High-tech Park, Pudong New Area, Shanghai

Applicant before: VeriSilicon Microelectronics (Shanghai) Co., Ltd.

Applicant before: Core chip technology (Shanghai) Co., Ltd.

GR01 Patent grant
GR01 Patent grant