CN107862378A - Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal - Google Patents
- Publication number
- CN107862378A CN107862378A CN201711273248.5A CN201711273248A CN107862378A CN 107862378 A CN107862378 A CN 107862378A CN 201711273248 A CN201711273248 A CN 201711273248A CN 107862378 A CN107862378 A CN 107862378A
- Authority
- CN
- China
- Prior art keywords
- convolution kernel
- convolutional neural network
- predetermined number
- multi-core
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/065—Analogue means
Abstract
The present invention provides a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal. The method includes: splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; connecting the convolution kernels in series; performing, in each convolution kernel, a first predetermined number of vector dot-product operations in parallel, each dot-product operation comprising a second predetermined number of multiplications, where the product of the first predetermined number and the second predetermined number equals the number of multiply-accumulate units in a convolution kernel; and outputting the dot-product results of each convolution kernel in order of output priority. The multi-core-based convolutional neural network acceleration method and system, storage medium and terminal of the present invention save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels and, under the same hardware data bandwidth, increase the processing speed of the convolutional neural network through parallel dot-product operations within each convolution kernel.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal.
Background art
At present, deep learning and machine learning are widely used in visual processing, speech recognition and image analysis. Convolutional neural networks are an important component of deep learning and machine learning; improving the processing speed of convolutional neural networks improves the processing speed of deep learning and machine learning proportionally.
In the prior art, visual processing, speech recognition and image analysis applications are all based on multi-layer convolutional neural networks. Each layer of a convolutional neural network requires a large amount of data movement and convolution computation, placing very high demands on hardware processing speed and resources. With the continuous development of wearable devices, Internet-of-Things applications and autonomous driving technology, achieving smooth processing speeds for convolutional neural networks in embedded products has become a major challenge for current hardware architecture design. Taking the typical convolutional neural networks ResNet and VGG16 as examples: at 16-bit floating-point precision, running ResNet at 60 frames per second requires 15 GB/s of bandwidth, and running VGG16 at 60 frames per second requires 6.0 GB/s of bandwidth.
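As a rough sanity check on these figures, the per-frame traffic implied by the quoted bandwidths can be computed directly (this assumes bandwidth scales linearly with frame rate, which the figures imply but do not state; the helper name is illustrative):

```python
# Back-of-envelope check of the bandwidth figures quoted above; the
# function name is illustrative and not part of the patent.
def per_frame_traffic(bandwidth_gb_per_s: float, fps: int) -> float:
    """Data moved per frame, in gigabytes, assuming linear scaling with fps."""
    return bandwidth_gb_per_s / fps

resnet = per_frame_traffic(15.0, 60)   # ResNet at 16-bit precision
vgg16 = per_frame_traffic(6.0, 60)     # VGG16 at 16-bit precision
print(f"ResNet: {resnet:.2f} GB/frame, VGG16: {vgg16:.2f} GB/frame")
```

At 60 frames per second this works out to 0.25 GB per frame for ResNet and 0.1 GB per frame for VGG16.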
At present, convolutional neural network acceleration is achieved by arranging multiple convolution units in parallel. Ideally, the more convolution units, the faster the processing. In practical applications, however, data bandwidth severely limits the processing speed of the convolution units: the bandwidth resources of hardware are extremely scarce, and increasing hardware data bandwidth is costly. Improving the processing speed of convolutional neural networks under limited data bandwidth and hardware overhead has therefore become an urgent problem for current hardware architecture design.
Summary of the invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide a multi-core-based convolutional neural network acceleration method and system, a storage medium and a terminal, which save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels and, under the same hardware data bandwidth, increase the processing speed of the convolutional neural network through parallel dot-product operations within each convolution kernel.
To achieve the above and other related objects, the present invention provides a multi-core-based convolutional neural network acceleration method, comprising the following steps: splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; connecting the convolution kernels in series so that input data is transmitted serially between them; performing, in each convolution kernel, a first predetermined number of vector dot-product operations in parallel, each dot-product operation comprising a second predetermined number of multiplications, where the product of the first predetermined number and the second predetermined number equals the number of multiply-accumulate units in a convolution kernel; and outputting the dot-product results of each convolution kernel in order of output priority.
In an embodiment of the invention, the second predetermined number is 3, so as to support 3D vector dot products.
In an embodiment of the invention, the output priority is determined by the order in which the input data enters the convolution kernels: a convolution kernel that receives the input data first has a higher output priority than one that receives it later.
In an embodiment of the invention, performing a first predetermined number of vector dot-product operations in parallel in each convolution kernel comprises the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
feeding the 1st to Nth input data to the 1st to Nth multiply-accumulate units for the first multiplication pass;
feeding the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiply-accumulate units for the second multiplication pass;
and so on, until the Mth to (N+M-1)th input data are fed to the (N*M-N+1)th to (N*M)th multiply-accumulate units for the Mth multiplication pass;
accumulating, across the M multiplication passes, the products at each of the N corresponding positions to obtain N dot-product results.
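The steps above amount to a sliding dot product computed by an N-by-M array of multiply-accumulate units. A minimal software sketch, with illustrative names and sequential loops standing in for the parallel hardware:

```python
# Minimal sketch of the parallel dot-product step: N results computed
# from N+M-1 inputs and M coefficients. The sequential loops stand in
# for hardware that runs the N MAC units of each pass in parallel.
def parallel_dot_products(inputs, coeffs):
    """Return N sliding dot products, N = len(inputs) - len(coeffs) + 1."""
    m = len(coeffs)          # second predetermined number (M)
    n = len(inputs) - m + 1  # first predetermined number (N)
    acc = [0] * n            # one accumulator per output position
    for j in range(m):       # the M multiplication passes
        for i in range(n):   # N MAC units per pass
            acc[i] += inputs[i + j] * coeffs[j]
    return acc

# 10 inputs and 3 coefficients give 8 results, as in the M=3, N=8 embodiment
print(parallel_dot_products(list(range(10)), [1, 1, 1]))
```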
Correspondingly, the present invention provides a multi-core-based convolutional neural network acceleration system, including a convolution kernel setup module, a vector dot-product module and an output module.
The convolution kernel setup module is used to split one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel, the convolution kernels being connected in series so that input data is transmitted serially between them.
The vector dot-product module is used to perform, in each convolution kernel, a first predetermined number of vector dot-product operations in parallel, each dot-product operation comprising a second predetermined number of multiplications; the product of the first predetermined number and the second predetermined number equals the number of multiply-accumulate units in a convolution kernel.
The output module is used to output the dot-product results of each convolution kernel in order of output priority.
In an embodiment of the invention, the second predetermined number is 3, so as to support 3D vector dot products.
In an embodiment of the invention, the output priority is determined by the order in which the input data enters the convolution kernels: a convolution kernel that receives the input data first has a higher output priority than one that receives it later.
In an embodiment of the invention, when performing a first predetermined number of vector dot-product operations in parallel in each convolution kernel, the vector dot-product module performs the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
feeding the 1st to Nth input data to the 1st to Nth multiply-accumulate units for the first multiplication pass;
feeding the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiply-accumulate units for the second multiplication pass;
and so on, until the Mth to (N+M-1)th input data are fed to the (N*M-N+1)th to (N*M)th multiply-accumulate units for the Mth multiplication pass;
accumulating, across the M multiplication passes, the products at each of the N corresponding positions to obtain N dot-product results.
The present invention further provides a storage medium on which a computer program is stored; when executed by a processor, the program implements the above multi-core-based convolutional neural network acceleration method.
Finally, the present invention provides a terminal, including a processor and a memory. The memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the terminal performs the above multi-core-based convolutional neural network acceleration method.
As described above, the multi-core-based convolutional neural network acceleration method and system, storage medium and terminal of the present invention have the following beneficial effects:
(1) They save the bandwidth that convolutional neural network computation consumes in dynamic memory. Taking a 4-convolution-kernel configuration as an example, 75% of the input image data bandwidth can be saved at the same data processing speed.
(2) They improve the processing speed of the convolutional neural network. Under the same hardware data bandwidth, taking 3D vector dot products as an example, the processing speed can be increased to 300%.
(3) They reduce the dynamic power consumption of the convolutional neural network. Taking 4 convolution kernels with 3D vector dot products as an example, the computation time is reduced to 33% of the original, the input image bandwidth is reduced to 25% of the original, and the dynamic power consumption is reduced by 85%.
(4) They optimize the processing speed of convolutional neural networks in embedded products, with a clear architecture, a well-defined division of labor, easy implementation and a simple flow, and can be widely applied to the Internet of Things, wearable devices and vehicle-mounted equipment.
Brief description of the drawings
Fig. 1 is a flow chart of the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment.
Fig. 2 is a coordinate schematic diagram of the input image, the coefficients and the output image.
Fig. 3 is an architecture schematic diagram of the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment.
Fig. 4 shows the first state of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment.
Fig. 5 shows the second state of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment.
Fig. 6 shows the third state of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment.
Fig. 7 shows the summation state of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment.
Fig. 8 is a structural schematic diagram of the multi-core-based convolutional neural network acceleration system of the present invention in an embodiment.
Fig. 9 is a structural schematic diagram of the terminal of the present invention in an embodiment.
Description of reference numerals
21 Convolution kernel setup module
22 Vector dot-product module
23 Output module
31 Processor
32 Memory
Detailed description of the embodiments
The embodiments of the present invention are described below by way of specific examples; those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other, different specific embodiments, and the details in this specification can be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, provided they do not conflict, the following embodiments and the features in the embodiments can be combined with each other.
It should be noted that the drawings provided in the following embodiments only illustrate the basic concept of the present invention schematically; they show only the components related to the present invention rather than being drawn according to the number, shape and size of components in an actual implementation. In an actual implementation, the form, quantity and proportion of each component may vary arbitrarily, and the component layout may be more complex.
When data bandwidth is limited, the multi-core-based convolutional neural network acceleration method and system, storage medium and terminal of the present invention save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels; under the same hardware data bandwidth, they increase the processing speed of the convolutional neural network through parallel dot-product operations within each convolution kernel. They optimize the processing speed of convolutional neural networks in embedded products, with a clear architecture, a well-defined division of labor, easy implementation and a simple flow, and can be widely applied to the Internet of Things, wearable devices and vehicle-mounted equipment.
As shown in Fig. 1, in an embodiment, the multi-core-based convolutional neural network acceleration method of the present invention includes the following steps.
Step S1: split one layer of the convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; the convolution kernels are connected in series so that input data is transmitted serially between them.
Taking image processing as an example, as shown in Fig. 2, the input image and the output image are both three-dimensional. The input image has abscissa inx, ordinate iny and coefficient depth coordinate kz; the output image has abscissa outx, ordinate outy and depth coordinate z. The coefficients are four-dimensional data, with abscissa kx, ordinate ky, coefficient depth coordinate kz and output depth coordinate z.
When one layer of the convolutional neural network is split into four subtasks, the coefficients are split into four groups along the z direction, as shown in Fig. 2, and each group of coefficients is assigned to a different convolution kernel. Data is transmitted serially between the different convolution kernels, so that bandwidth is saved through data sharing.
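The z-direction split can be sketched with a tensor library; the shapes below are assumptions for illustration, not taken from the patent:

```python
# Illustrative sketch of splitting one layer's 4-D coefficient tensor
# (kx, ky, kz, z) into four groups along the output-depth axis z,
# one group per convolution kernel. All shapes are assumptions.
import numpy as np

coeffs = np.zeros((3, 3, 16, 32))      # 3x3 taps, 16 input planes, 32 output planes
groups = np.split(coeffs, 4, axis=3)   # four subtasks along the z direction
print([g.shape for g in groups])       # each kernel handles 8 output planes
```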
According to the processing characteristics of convolutional neural networks, each group of coefficients must be convolved with the entire input image to obtain the output image of one z-plane; the input image is therefore highly reusable. As shown in Fig. 3, the convolution kernels are connected in series through serial channels, so that the input image data passes serially from one convolution kernel to the next. Specifically, after the input image data is read from memory into convolution kernel 0, kernel 0 convolves it with the first group of coefficients while forwarding it to convolution kernel 1 through the serial data channel. Convolution kernel 1 thus avoids the bandwidth cost of reading the input image data itself, and convolution kernels 2 and 3 operate in the same way. In the 4-convolution-kernel configuration, the serial data channel eliminates three reads of the input image data, saving 75% of the input image data bandwidth and reducing memory power consumption; it also resolves the trade-off between the number of routing wires at the physical implementation layer and the achievable frequency.
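The bandwidth saving follows from simple counting: with the serial channel, the input image leaves memory once instead of once per kernel. A counting sketch, with illustrative names:

```python
# Counting sketch for the serial data channel of Fig. 3: one memory
# fetch shared by all kernels versus one fetch per kernel. The function
# name is illustrative, not from the patent.
def input_image_reads(num_kernels: int, serial_chain: bool) -> int:
    """How many times the input image is fetched from memory."""
    return 1 if serial_chain else num_kernels

independent = input_image_reads(4, serial_chain=False)  # each kernel reads
chained = input_image_reads(4, serial_chain=True)       # one shared read
saving = 1 - chained / independent
print(f"input image bandwidth saved: {saving:.0%}")
```

With four kernels this reproduces the 75% saving stated above.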
Step S2: in each convolution kernel, perform a first predetermined number of vector dot-product operations in parallel, each dot-product operation comprising a second predetermined number of multiplications; the product of the first predetermined number and the second predetermined number equals the number of multiply-accumulate units in a convolution kernel.
In an embodiment of the invention, performing a first predetermined number of vector dot-product operations in parallel in each convolution kernel comprises the following steps:
21) obtain (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
22) feed the 1st to Nth input data to the 1st to Nth multiply-accumulate units for the first multiplication pass;
23) feed the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiply-accumulate units for the second multiplication pass;
24) and so on, until the Mth to (N+M-1)th input data are fed to the (N*M-N+1)th to (N*M)th multiply-accumulate units for the Mth multiplication pass;
25) accumulate, across the M multiplication passes, the products at each of the N corresponding positions to obtain N dot-product results.
A vector dot product comprising three multiplications (M=3) is described further below. In this embodiment, each convolution kernel contains 24 multiply-accumulate units. To maximize the utilization of the multiply-accumulate units under limited bandwidth, the 3D dot-product operation comprising three multiplications computes 8 3D dot-product results simultaneously (N=8). Specifically, the 3D dot-product formulas are as follows:
out0 = in0*k0 + in1*k1 + in2*k2
out1 = in1*k0 + in2*k1 + in3*k2
out2 = in2*k0 + in3*k1 + in4*k2
......
out7 = in7*k0 + in8*k1 + in9*k2
First, as shown in Fig. 4, 10 input image data, i.e. in0, in1, in2, ..., in9, are read from memory. The 0th to 7th data (in0 through in7) are multiplied by coefficient 0 (k0), and the results, i.e. out00, out01, out02, ..., out07, are written to the accumulator inputs.
Second, as shown in Fig. 5, the 10 input image data of the previous step are reused. The 1st to 8th data (in1 through in8) are multiplied by coefficient 1 (k1), and the results, i.e. out10, out11, out12, ..., out17, are written to the accumulator inputs.
Third, as shown in Fig. 6, the 10 input image data are reused again. The 2nd to 9th data (in2 through in9) are multiplied by coefficient 2 (k2), and the results, i.e. out20, out21, out22, ..., out27, are written to the accumulator inputs.
Finally, as shown in Fig. 7, the results at corresponding positions of the three multiplication passes are accumulated in turn to obtain the 8 3D dot-product results. The results out00 and out10 at the first position of each pass are added to out20 to obtain the first 3D dot-product result; out01 and out11 at the second position are added to out21 to obtain the second 3D dot-product result; and so on, until out07 and out17 at the eighth position are added to out27 to obtain the eighth 3D dot-product result.
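The three passes and the final accumulation can be written out directly; the input values and coefficients below are illustrative stand-ins:

```python
# Pass-by-pass sketch of Figs. 4-7 (M=3, N=8). Variable names follow
# the figures; the concrete values are illustrative.
inputs = list(range(10))   # stands in for in0 .. in9 read from memory
k0, k1, k2 = 2, 3, 5       # stands in for coefficients 0, 1, 2

pass0 = [inputs[i] * k0 for i in range(8)]       # out00 .. out07 (Fig. 4)
pass1 = [inputs[i + 1] * k1 for i in range(8)]   # out10 .. out17 (Fig. 5)
pass2 = [inputs[i + 2] * k2 for i in range(8)]   # out20 .. out27 (Fig. 6)

# Fig. 7: accumulate corresponding positions of the three passes
outs = [a + b + c for a, b, c in zip(pass0, pass1, pass2)]

# cross-check against the formula out_i = in_i*k0 + in_(i+1)*k1 + in_(i+2)*k2
assert outs[0] == inputs[0]*k0 + inputs[1]*k1 + inputs[2]*k2
print(outs)
```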
Step S3: output the dot-product results of each convolution kernel in order of output priority.
Specifically, the output priority is determined by the order in which the input data enters the convolution kernels: the earlier a convolution kernel receives the input data, the higher its output priority. Therefore, a convolution kernel that receives the input data first has a higher output priority than one that receives it later.
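The priority rule simply preserves the order in which data entered the kernel chain. A small sketch, with illustrative result labels:

```python
# Sketch of the output-priority rule: results are emitted in the order
# in which each kernel received the input data. Labels are illustrative.
pending = [(2, "kernel2"), (0, "kernel0"), (3, "kernel3"), (1, "kernel1")]
ordered = [label for _, label in sorted(pending)]  # earlier arrival first
print(ordered)   # kernel0 outputs first, kernel3 last
```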
As shown in Fig. 8, in an embodiment, the multi-core-based convolutional neural network acceleration system of the present invention includes a convolution kernel setup module 21, a vector dot-product module 22 and an output module 23.
The convolution kernel setup module 21 is used to split one layer of the convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; the convolution kernels are connected in series so that input data is transmitted serially between them.
Taking image processing as an example, as shown in Fig. 2, the input image and the output image are both three-dimensional. The input image has abscissa inx, ordinate iny and coefficient depth coordinate kz; the output image has abscissa outx, ordinate outy and depth coordinate z. The coefficients are four-dimensional data, with abscissa kx, ordinate ky, coefficient depth coordinate kz and output depth coordinate z.
When one layer of the convolutional neural network is split into four subtasks, the coefficients are split into four groups along the z direction, as shown in Fig. 2, and each group of coefficients is assigned to a different convolution kernel. Data is transmitted serially between the different convolution kernels, so that bandwidth is saved through data sharing.
According to the processing characteristics of convolutional neural networks, each group of coefficients must be convolved with the entire input image to obtain the output image of one z-plane; the input image is therefore highly reusable. As shown in Fig. 3, the convolution kernels are connected in series through serial channels, so that the input image data passes serially from one convolution kernel to the next. Specifically, after the input image data is read from memory into convolution kernel 0, kernel 0 convolves it with the first group of coefficients while forwarding it to convolution kernel 1 through the serial data channel. Convolution kernel 1 thus avoids the bandwidth cost of reading the input image data itself, and convolution kernels 2 and 3 operate in the same way. In the 4-convolution-kernel configuration, the serial data channel eliminates three reads of the input image data, saving 75% of the input image data bandwidth and reducing memory power consumption; it also resolves the trade-off between the number of routing wires at the physical implementation layer and the achievable frequency.
The vector dot-product module 22 is connected to the convolution kernel setup module 21 and is used to perform, in each convolution kernel, a first predetermined number of vector dot-product operations in parallel, each dot-product operation comprising a second predetermined number of multiplications; the product of the first predetermined number and the second predetermined number equals the number of multiply-accumulate units in a convolution kernel.
In an embodiment of the invention, when performing a first predetermined number of vector dot-product operations in parallel in each convolution kernel, the vector dot-product module 22 performs the following steps:
21) obtain (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
22) feed the 1st to Nth input data to the 1st to Nth multiply-accumulate units for the first multiplication pass;
23) feed the 2nd to (N+1)th input data to the (N+1)th to 2Nth multiply-accumulate units for the second multiplication pass;
24) and so on, until the Mth to (N+M-1)th input data are fed to the (N*M-N+1)th to (N*M)th multiply-accumulate units for the Mth multiplication pass;
25) accumulate, across the M multiplication passes, the products at each of the N corresponding positions to obtain N dot-product results.
A 3D vector dot product comprising three multiplications (M=3) is described further below. In this embodiment, each convolution kernel contains 24 multiply-accumulate units. To maximize the utilization of the multiply-accumulate units under limited bandwidth, the 3D dot-product operation comprising three multiplications computes 8 3D dot-product results simultaneously (N=8). Specifically, the 3D dot-product formulas are as follows:
out0 = in0*k0 + in1*k1 + in2*k2
out1 = in1*k0 + in2*k1 + in3*k2
out2 = in2*k0 + in3*k1 + in4*k2
......
out7 = in7*k0 + in8*k1 + in9*k2
First, as shown in Fig. 4, 10 input image data, i.e. in0, in1, in2, ..., in9, are read from memory. The 0th to 7th data (in0 through in7) are multiplied by coefficient 0 (k0), and the results, i.e. out00, out01, out02, ..., out07, are written to the accumulator inputs.
Second, as shown in Fig. 5, the 10 input image data of the previous step are reused. The 1st to 8th data (in1 through in8) are multiplied by coefficient 1 (k1), and the results, i.e. out10, out11, out12, ..., out17, are written to the accumulator inputs.
Third, as shown in Fig. 6, the 10 input image data are reused again. The 2nd to 9th data (in2 through in9) are multiplied by coefficient 2 (k2), and the results, i.e. out20, out21, out22, ..., out27, are written to the accumulator inputs.
Finally, as shown in Fig. 7, the results at corresponding positions of the three multiplication passes are accumulated in turn to obtain the 8 3D dot-product results. The results out00 and out10 at the first position of each pass are added to out20 to obtain the first 3D dot-product result; out01 and out11 at the second position are added to out21 to obtain the second 3D dot-product result; and so on, until out07 and out17 at the eighth position are added to out27 to obtain the eighth 3D dot-product result.
The output module 23 is connected to the vector dot-product module 22 and is used to output the dot-product results of each convolution kernel in order of output priority.
Specifically, the output priority is determined by the order in which the input data enters the convolution kernels: the earlier a convolution kernel receives the input data, the higher its output priority. Therefore, a convolution kernel that receives the input data first has a higher output priority than one that receives it later.
It should be noted that the division of the above system into modules is only a division of logic functions; in an actual implementation, the modules may be fully or partially integrated on one physical entity, or may be physically separate. These modules may all be implemented in the form of software called by a processing element, or all in hardware; alternatively, some modules may be implemented as software called by a processing element and others as hardware. For example, the x module above may be a separately established processing element, or it may be integrated into a chip of the above apparatus; it may also be stored in the memory of the above apparatus in the form of program code, with a processing element of the apparatus calling and executing the function of the x module. The implementation of the other modules is similar. In addition, these modules may be fully or partially integrated, or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. During implementation, each step of the above method, or each of the above modules, may be completed by the integrated logic circuits of the hardware in the processor element or by instructions in the form of software.
For example, the above modules may be configured as one or more integrated circuits implementing the above method, such as one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA). For another example, when one of the above modules is implemented in the form of program code dispatched by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
A computer program is stored on the storage medium of the present invention; when executed by a processor, the program implements the above multi-core-based convolutional neural network acceleration method. Preferably, the storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks or optical disks.
As shown in FIG. 9, in one embodiment the terminal of the present invention includes a processor 31 and a memory 32.
The memory 32 is configured to store a computer program.
Preferably, the memory 32 includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
The processor 31 is connected to the memory 32 and is configured to execute the computer program stored in the memory 32, so that the terminal performs the above multi-core-based convolutional neural network acceleration method.
Preferably, the processor 31 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In summary, the multi-core-based convolutional neural network acceleration method and system, storage medium, and terminal of the present invention save dynamic-memory bandwidth during convolutional neural network computation: taking the 4-convolution-kernel paradigm as an example, 75% of the input image data bandwidth can be saved at the same data-processing speed when the convolutional neural network runs. They increase the processing speed of the convolutional neural network: under the same hardware data bandwidth, taking the 3D dot product as an example, the processing speed can be raised to 300% of the original. They reduce the dynamic power consumption of the convolutional neural network: taking 4 convolution kernels with 3D dot products as an example, the operation time is reduced to 33% of the original, the input image bandwidth is reduced to 25% of the original, and the dynamic power consumption is reduced by 85%. They optimize the processing speed of convolutional neural networks in embedded products, offer a clear architecture, a clear division of labor, easy implementation, and a simple flow, and can be widely applied to the Internet of Things, wearable devices, and vehicle-mounted equipment. The present invention therefore effectively overcomes various shortcomings of the prior art and has high industrial value.
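The headline figures in this summary follow from simple arithmetic on the architecture. A back-of-envelope sketch, under the stated assumptions that K = 4 kernels share one serially forwarded input stream and that one 3D dot product replaces M = 3 serial multiplications:

```python
# Back-of-envelope check of the summary figures (K and M are the
# illustrative values taken from the examples in the text).
K = 4   # convolution kernels sharing one serially forwarded input stream
M = 3   # multiplications folded into one 3D dot product

# K kernels read the input stream once instead of K times:
input_bandwidth = 1 / K   # 0.25 -> the claimed 75% bandwidth saving
# M multiplications proceed in parallel instead of one after another:
speedup = M               # processing speed to 300% of the original
run_time = 1 / M          # operation time to roughly 33% of the original

print(f"input bandwidth: {input_bandwidth:.0%} of original")
print(f"speed: {speedup * 100}% of original, time: {run_time:.0%}")
```

The reported 85% dynamic-power reduction combines these two effects with implementation-specific switching-activity factors that are not derivable from this arithmetic alone.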
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.
Claims (10)
1. A multi-core-based convolutional neural network acceleration method, characterized by comprising the following steps:
splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel, the convolution kernels being serially connected so that input data is serially transmitted between the convolution kernels;
executing in parallel, based on each convolution kernel, a first predetermined number of dot-product operations, each dot-product operation comprising a second predetermined number of multiplication operations, the product of the first predetermined number and the second predetermined number being the number of multiply-accumulators in a convolution kernel; and
outputting the dot-product operation results of each convolution kernel in order of output priority.
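The architecture recited above can be illustrated with a minimal Python sketch. This is hypothetical and for illustration only: each kernel's multiply-accumulate work is stood in for by a single weight, and the serial forwarding chain is modeled as an inner loop over the kernels.

```python
def run_layer(stream, kernels):
    """Hypothetical sketch of claim 1: one CNN layer split into
    len(kernels) subtasks, one per convolution kernel.

    The kernels are chained serially: each input datum is read from
    external memory once, enters the first kernel, and is forwarded
    down the chain, so the external read count does not grow with the
    number of kernels. Results are emitted in output-priority order
    (the kernel that received the data first outputs first).
    """
    results = [[] for _ in kernels]
    for x in stream:                          # one external read per datum
        for k, weight in enumerate(kernels):  # serial forwarding chain
            # stand-in for the kernel's real multiply-accumulate work
            results[k].append(weight * x)
    # interleave per-kernel results: kernel 0 has the highest priority
    ordered = []
    for step in range(len(stream)):
        for k in range(len(kernels)):
            ordered.append(results[k][step])
    return ordered

print(run_layer([1, 2, 3], [10, 100]))  # → [10, 100, 20, 200, 30, 300]
```

Because every datum is read from external memory once and then forwarded kernel-to-kernel, the number of external reads is independent of the number of kernels, which is the source of the bandwidth saving described above.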
2. The multi-core-based convolutional neural network acceleration method according to claim 1, characterized in that the second predetermined number is 3, so as to support 3D dot products.
3. The multi-core-based convolutional neural network acceleration method according to claim 1, characterized in that the output priority is determined by the order in which the input data enters the convolution kernels: a convolution kernel into which the input data is input earlier has a higher output priority than a convolution kernel into which the input data is input later.
4. The multi-core-based convolutional neural network acceleration method according to claim 1, characterized in that executing in parallel, based on each convolution kernel, the first predetermined number of dot-product operations comprises the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
inputting the 1st to Nth input data into the 1st to Nth multiply-accumulators, respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data into the (N+1)th to 2Nth multiply-accumulators, respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M-1)th input data are input into the (N*M-N+1)th to (N*M)th multiply-accumulators, respectively, to be multiplied by the Mth coefficient; and
accumulating, across the M multiplications, the products at corresponding positions among the N positions to obtain N dot-product operation results.
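The steps above describe a 1D sliding-window arrangement of the N*M multiply-accumulators. A minimal Python sketch (function and variable names are hypothetical), shown here with N = 4 lanes and M = 3 coefficients:

```python
def parallel_dot_products(inputs, coeffs, n):
    """Hypothetical sketch of the claim-4 arrangement: n dot-product
    lanes computed by n * m multiply-accumulators.

    inputs: the (n + m - 1) input data values
    coeffs: the m coefficients (kernel weights)
    Multiply-accumulator (j * n + i) forms the product
    inputs[i + j] * coeffs[j]; products sharing the same output
    position i are accumulated across the m coefficient rows.
    """
    m = len(coeffs)
    assert len(inputs) == n + m - 1
    results = []
    for i in range(n):          # the n parallel dot-product lanes
        acc = 0
        for j in range(m):      # the m multiplications per lane
            acc += inputs[i + j] * coeffs[j]
        results.append(acc)
    return results

# N = 4 lanes, M = 3 coefficients -> 12 multiply-accumulators fed by
# N + M - 1 = 6 input data values.
print(parallel_dot_products([1, 2, 3, 4, 5, 6], [1, 0, -1], 4))  # → [-2, -2, -2, -2]
```

Note that the N windows overlap: producing N results needs only N+M-1 inputs rather than N*M, which is why the serial input stream can feed all the multiply-accumulators at once.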
5. A multi-core-based convolutional neural network acceleration system, characterized by comprising a convolution kernel setup module, a vector dot-product module, and an output module; wherein
the convolution kernel setup module is configured to split one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel, the convolution kernels being serially connected so that input data is serially transmitted between the convolution kernels;
the vector dot-product module is configured to execute in parallel, based on each convolution kernel, a first predetermined number of dot-product operations, each dot-product operation comprising a second predetermined number of multiplication operations, the product of the first predetermined number and the second predetermined number being the number of multiply-accumulators in a convolution kernel; and
the output module is configured to output the dot-product operation results of each convolution kernel in order of output priority.
6. The multi-core-based convolutional neural network acceleration system according to claim 5, characterized in that the second predetermined number is 3, so as to support 3D dot products.
7. The multi-core-based convolutional neural network acceleration system according to claim 5, characterized in that the output priority is determined by the order in which the input data enters the convolution kernels: a convolution kernel into which the input data is input earlier has a higher output priority than a convolution kernel into which the input data is input later.
8. The multi-core-based convolutional neural network acceleration system according to claim 5, characterized in that the vector dot-product module, when executing in parallel the first predetermined number of dot-product operations based on each convolution kernel, performs the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
inputting the 1st to Nth input data into the 1st to Nth multiply-accumulators, respectively, to be multiplied by the first coefficient;
inputting the 2nd to (N+1)th input data into the (N+1)th to 2Nth multiply-accumulators, respectively, to be multiplied by the second coefficient;
and so on, until the Mth to (N+M-1)th input data are input into the (N*M-N+1)th to (N*M)th multiply-accumulators, respectively, to be multiplied by the Mth coefficient; and
accumulating, across the M multiplications, the products at corresponding positions among the N positions to obtain N dot-product operation results.
9. A storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the multi-core-based convolutional neural network acceleration method according to any one of claims 1 to 4.
10. A terminal, characterized by comprising a processor and a memory; wherein
the memory is configured to store a computer program; and
the processor is configured to execute the computer program stored in the memory, so that the terminal performs the multi-core-based convolutional neural network acceleration method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711273248.5A CN107862378B (en) | 2017-12-06 | 2017-12-06 | Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107862378A true CN107862378A (en) | 2018-03-30 |
CN107862378B CN107862378B (en) | 2020-04-24 |
Family
ID=61705060
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711273248.5A Active CN107862378B (en) | 2017-12-06 | 2017-12-06 | Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107862378B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolution neutral network hardware and AXI bus IP core thereof |
CN106203617A (en) * | 2016-06-27 | 2016-12-07 | 哈尔滨工业大学深圳研究生院 | A kind of acceleration processing unit based on convolutional neural networks and array structure |
CN106599883A (en) * | 2017-03-08 | 2017-04-26 | 王华锋 | Face recognition method capable of extracting multi-level image semantics based on CNN (convolutional neural network) |
CN106845635A (en) * | 2017-01-24 | 2017-06-13 | 东南大学 | CNN convolution kernel hardware design methods based on cascade form |
CN106875011A (en) * | 2017-01-12 | 2017-06-20 | 南京大学 | The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator |
CN106951395A (en) * | 2017-02-13 | 2017-07-14 | 上海客鹭信息技术有限公司 | Towards the parallel convolution operations method and device of compression convolutional neural networks |
CN107301456A (en) * | 2017-05-26 | 2017-10-27 | 中国人民解放军国防科学技术大学 | Deep neural network multinuclear based on vector processor speeds up to method |
WO2017186829A1 (en) * | 2016-04-27 | 2017-11-02 | Commissariat A L'energie Atomique Et Aux Energies Alternatives | Device and method for calculating convolution in a convolutional neural network |
Non-Patent Citations (2)
Title |
---|
刘进锋: "A Concise and Efficient Method for Accelerating Convolutional Neural Networks", Science Technology and Engineering (《科学技术与工程》) *
李大霞: "CUDA-CONVNET Deep Convolutional Neural Network Algorithm…", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108681773A (en) * | 2018-05-23 | 2018-10-19 | 腾讯科技(深圳)有限公司 | Accelerated method, device, terminal and the readable storage medium storing program for executing of data operation |
CN109117940B (en) * | 2018-06-19 | 2020-12-15 | 腾讯科技(深圳)有限公司 | Target detection method, device, terminal and storage medium based on convolutional neural network |
CN109117940A (en) * | 2018-06-19 | 2019-01-01 | 腾讯科技(深圳)有限公司 | To accelerated method, apparatus and system before a kind of convolutional neural networks |
JP2021530761A (en) * | 2018-06-27 | 2021-11-11 | インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation | Low-precision deep neural network enabled by compensation instructions |
GB2590000A (en) * | 2018-06-27 | 2021-06-16 | Ibm | Low precision deep neural network enabled by compensation instructions |
WO2020003043A1 (en) * | 2018-06-27 | 2020-01-02 | International Business Machines Corporation | Low precision deep neural network enabled by compensation instructions |
GB2590000B (en) * | 2018-06-27 | 2022-12-07 | Ibm | Low precision deep neural network enabled by compensation instructions |
JP7190799B2 (en) | 2018-06-27 | 2022-12-16 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Low Precision Deep Neural Networks Enabled by Compensation Instructions |
CN108920413A (en) * | 2018-06-28 | 2018-11-30 | 中国人民解放军国防科技大学 | Convolutional neural network multi-core parallel computing method facing GPDSP |
CN110880032B (en) * | 2018-09-06 | 2022-07-19 | 黑芝麻智能科技(上海)有限公司 | Convolutional neural network using adaptive 3D array |
US11954573B2 (en) | 2018-09-06 | 2024-04-09 | Black Sesame Technologies Inc. | Convolutional neural network using adaptive 3D array |
CN110880032A (en) * | 2018-09-06 | 2020-03-13 | 黑芝麻智能科技(上海)有限公司 | Convolutional neural network using adaptive 3D array |
CN109740733B (en) * | 2018-12-27 | 2021-07-06 | 深圳云天励飞技术有限公司 | Deep learning network model optimization method and device and related equipment |
CN109740733A (en) * | 2018-12-27 | 2019-05-10 | 深圳云天励飞技术有限公司 | Deep learning network model optimization method, device and relevant device |
CN109740747B (en) * | 2018-12-29 | 2019-11-12 | 北京中科寒武纪科技有限公司 | Operation method, device and Related product |
CN110889497A (en) * | 2018-12-29 | 2020-03-17 | 中科寒武纪科技股份有限公司 | Learning task compiling method of artificial intelligence processor and related product |
US11893414B2 (en) | 2018-12-29 | 2024-02-06 | Cambricon Technologies Corporation Limited | Operation method, device and related products |
CN109740747A (en) * | 2018-12-29 | 2019-05-10 | 北京中科寒武纪科技有限公司 | Operation method, device and Related product |
CN111563586A (en) * | 2019-02-14 | 2020-08-21 | 上海寒武纪信息科技有限公司 | Splitting method of neural network model and related product |
CN111563586B (en) * | 2019-02-14 | 2022-12-09 | 上海寒武纪信息科技有限公司 | Splitting method of neural network model and related product |
CN109886400A (en) * | 2019-02-19 | 2019-06-14 | 合肥工业大学 | The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel |
CN110109646B (en) * | 2019-03-28 | 2021-08-27 | 北京迈格威科技有限公司 | Data processing method, data processing device, multiplier-adder and storage medium |
CN110109646A (en) * | 2019-03-28 | 2019-08-09 | 北京迈格威科技有限公司 | Data processing method, device and adder and multiplier and storage medium |
WO2021057720A1 (en) * | 2019-09-24 | 2021-04-01 | 安徽寒武纪信息科技有限公司 | Neural network model processing method and apparatus, computer device, and storage medium |
CN110689121A (en) * | 2019-09-24 | 2020-01-14 | 上海寒武纪信息科技有限公司 | Method for realizing neural network model splitting by using multi-core processor and related product |
CN110689115A (en) * | 2019-09-24 | 2020-01-14 | 上海寒武纪信息科技有限公司 | Neural network model processing method and device, computer equipment and storage medium |
WO2021057746A1 (en) * | 2019-09-24 | 2021-04-01 | 安徽寒武纪信息科技有限公司 | Neural network processing method and apparatus, computer device and storage medium |
CN110689115B (en) * | 2019-09-24 | 2023-03-31 | 安徽寒武纪信息科技有限公司 | Neural network model processing method and device, computer equipment and storage medium |
CN110738317A (en) * | 2019-10-17 | 2020-01-31 | 中国科学院上海高等研究院 | FPGA-based deformable convolution network operation method, device and system |
CN110796245B (en) * | 2019-10-25 | 2022-03-22 | 浪潮电子信息产业股份有限公司 | Method and device for calculating convolutional neural network model |
CN110796245A (en) * | 2019-10-25 | 2020-02-14 | 浪潮电子信息产业股份有限公司 | Method and device for calculating convolutional neural network model |
WO2021081854A1 (en) * | 2019-10-30 | 2021-05-06 | 华为技术有限公司 | Convolution operation circuit and convolution operation method |
CN111610963A (en) * | 2020-06-24 | 2020-09-01 | 上海西井信息科技有限公司 | Chip structure and multiply-add calculation engine thereof |
CN111610963B (en) * | 2020-06-24 | 2021-08-17 | 上海西井信息科技有限公司 | Chip structure and multiply-add calculation engine thereof |
CN114399828B (en) * | 2022-03-25 | 2022-07-08 | 深圳比特微电子科技有限公司 | Training method of convolution neural network model for image processing |
CN114399828A (en) * | 2022-03-25 | 2022-04-26 | 深圳比特微电子科技有限公司 | Training method of convolution neural network model for image processing |
CN116303108A (en) * | 2022-09-07 | 2023-06-23 | 芯砺智能科技(上海)有限公司 | Convolutional neural network weight address arrangement method suitable for parallel computing architecture |
CN116303108B (en) * | 2022-09-07 | 2024-05-14 | 芯砺智能科技(上海)有限公司 | Weight address arrangement method suitable for parallel computing architecture |
Also Published As
Publication number | Publication date |
---|---|
CN107862378B (en) | 2020-04-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107862378A (en) | Convolutional neural networks accelerated method and system, storage medium and terminal based on multinuclear | |
CN111667051B (en) | Neural network accelerator applicable to edge equipment and neural network acceleration calculation method | |
CN107862650B (en) | Method for accelerating calculation of CNN convolution of two-dimensional image | |
US10394929B2 (en) | Adaptive execution engine for convolution computing systems | |
JP2022037022A (en) | Execution of kernel stride in hardware | |
EP3761235A1 (en) | Transposing neural network matrices in hardware | |
CN107862374A (en) | Processing with Neural Network system and processing method based on streamline | |
US20160240000A1 (en) | Rendering views of a scene in a graphics processing unit | |
TW202414280A (en) | Method, system and non-transitory computer-readable storage medium for performing computations for a layer of neural network | |
GB2600031A (en) | Batch processing in a neural network processor | |
CN107797962A (en) | Computing array based on neutral net | |
CN108122030A (en) | A kind of operation method of convolutional neural networks, device and server | |
CN110309906A (en) | Image processing method, device, machine readable storage medium and processor | |
TW202123093A (en) | Method and system for performing convolution operation | |
WO2019136750A1 (en) | Artificial intelligence-based computer-aided processing device and method, storage medium, and terminal | |
KR20200081044A (en) | Method and apparatus for processing convolution operation of neural network | |
CN108170640A (en) | The method of its progress operation of neural network computing device and application | |
CN110147252A (en) | A kind of parallel calculating method and device of convolutional neural networks | |
CN104967428B (en) | Frequency domain implementation method for FPGA high-order and high-speed FIR filter | |
CN112686379B (en) | Integrated circuit device, electronic apparatus, board and computing method | |
WO2023065983A1 (en) | Computing apparatus, neural network processing device, chip, and data processing method | |
CN110109646A (en) | Data processing method, device and adder and multiplier and storage medium | |
CN109447254B (en) | Convolution neural network reasoning hardware acceleration method and device thereof | |
KR20180125843A (en) | A hardware classifier applicable to various CNN models | |
JP2023541350A (en) | Table convolution and acceleration |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| CB02 | Change of applicant information | Address after: 20A, Zhangjiang Building, 289 Chunxiao Road, China (Shanghai) Pilot Free Trade Zone, 201203. Applicant after: Xinyuan Microelectronics (Shanghai) Co., Ltd.; Core chip technology (Shanghai) Co., Ltd. Address before: Zhangjiang Building 20A, 560 Songtao Road, Zhangjiang High-tech Park, Pudong New Area, Shanghai, 201203. Applicant before: VeriSilicon Microelectronics (Shanghai) Co., Ltd.; Core chip technology (Shanghai) Co., Ltd.
| GR01 | Patent grant |