CN107862378A - Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal - Google Patents


Info

Publication number
CN107862378A
CN107862378A (application CN201711273248.5A / CN201711273248A; granted as CN107862378B)
Authority
CN
China
Prior art keywords
convolution kernel
convolutional neural network
predetermined number
multi-core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711273248.5A
Other languages
Chinese (zh)
Other versions
CN107862378B (en)
Inventor
张慧明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Core Chip Technology (shanghai) Co Ltd
VeriSilicon Microelectronics Shanghai Co Ltd
Vivante Corp
Original Assignee
Core Chip Technology (shanghai) Co Ltd
VeriSilicon Microelectronics Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Core Chip Technology (Shanghai) Co Ltd and VeriSilicon Microelectronics Shanghai Co Ltd
Priority to CN201711273248.5A priority Critical patent/CN107862378B/en
Publication of CN107862378A publication Critical patent/CN107862378A/en
Application granted granted Critical
Publication of CN107862378B publication Critical patent/CN107862378B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means
    • G06N 3/065: Analogue means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides a multi-core-based convolutional neural network acceleration method and system, a storage medium, and a terminal. The method includes: splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; connecting the convolution kernels in series; performing, in parallel on each convolution kernel, a first predetermined number of dot-product operations, each dot-product operation comprising a second predetermined number of multiplications, where the product of the first predetermined number and the second predetermined number equals the number of multiply-accumulate units in a convolution kernel; and outputting the dot-product results of each convolution kernel in order of output priority. By running multiple convolution kernels in parallel, the multi-core-based convolutional neural network acceleration method and system, storage medium, and terminal of the present invention save the data bandwidth of the convolutional neural network; under the same hardware data bandwidth, the parallel dot-product operations within each convolution kernel improve the processing speed of the convolutional neural network.

Description

Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal
Technical field
The present invention relates to the technical field of data processing, and in particular to a multi-core-based convolutional neural network acceleration method and system, a storage medium, and a terminal.
Background art
At present, deep learning and machine learning are widely applied in visual processing, speech recognition, and image analysis. Convolutional neural networks are an important component of deep learning and machine learning; improving the processing speed of convolutional neural networks improves the processing speed of deep learning and machine learning in equal proportion.
In the prior art, visual processing, speech recognition, and image analysis applications are all based on multi-layer convolutional neural networks. Each layer of a convolutional neural network involves a large amount of data processing and convolution computation, placing very high demands on hardware processing speed and resources. With the continuous development of wearable devices, Internet of Things applications, and autonomous driving technology, implementing convolutional neural networks in embedded products at smooth processing speeds has become a major challenge for current hardware architecture design. Taking the typical convolutional neural networks ResNet and VGG16 as examples: at 16-bit floating-point precision, running ResNet at 60 frames per second requires a bandwidth of 15 GB/s, and running VGG16 at 60 frames per second requires a bandwidth of 6.0 GB/s.
At present, convolutional neural network acceleration is achieved by arranging multiple convolution units in parallel. Ideally, the more convolution units there are, the faster the processing. In practice, however, data bandwidth severely limits the processing speed of the convolution units; hardware bandwidth resources are scarce, and increasing the data bandwidth of the hardware is costly. Improving the processing speed of convolutional neural networks under limited data bandwidth and hardware overhead has therefore become an urgent problem for current hardware architecture design.
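The bandwidth figures above follow directly from frame rate and numeric precision. Below is a back-of-envelope sketch of that relationship; the per-frame data volume used is a hypothetical illustrative figure, since the patent does not give the breakdown behind its 15 GB/s and 6.0 GB/s numbers.

```python
# Rough model: bandwidth = values moved per frame * bytes per value * fps.
# The per-frame volume below is an assumed figure for illustration only.

BYTES_PER_FP16 = 2   # 16-bit floating-point precision, as in the examples
FPS = 60             # target frame rate from the examples

def required_bandwidth_gb_per_s(values_moved_per_frame: int) -> float:
    """Estimated bandwidth in GB/s for one network at the given frame rate."""
    return values_moved_per_frame * BYTES_PER_FP16 * FPS / 1e9

# A hypothetical network moving 125 million fp16 values per frame:
print(round(required_bandwidth_gb_per_s(125_000_000), 1))  # 15.0
```

This makes clear why reusing input data across convolution kernels, rather than adding raw bandwidth, is the lever the patent pursues.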
Summary of the invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide a multi-core-based convolutional neural network acceleration method and system, a storage medium, and a terminal that save the data bandwidth of a convolutional neural network through multiple parallel convolution kernels, and that, under the same hardware data bandwidth, improve the processing speed of the convolutional neural network through parallel dot-product operations within each convolution kernel.
To achieve the above and other related objects, the present invention provides a multi-core-based convolutional neural network acceleration method comprising the following steps: splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; connecting the convolution kernels in series so that input data is transmitted serially between them; performing, in parallel on each convolution kernel, a first predetermined number of dot-product operations, each dot-product operation comprising a second predetermined number of multiplications, the product of the first predetermined number and the second predetermined number being the number of multiply-accumulate units in a convolution kernel; and outputting the dot-product results of each convolution kernel in order of output priority.
In one embodiment of the invention, the second predetermined number is 3, so as to support 3D vector dot products.
In one embodiment of the invention, the output priority is determined by the order in which the input data reaches the convolution kernels: a convolution kernel that receives the input data earlier has a higher output priority than a convolution kernel that receives it later.
In one embodiment of the invention, performing the first predetermined number of dot-product operations in parallel on each convolution kernel comprises the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
feeding input data 1 to N into multiply-accumulate units 1 to N for the first multiplication pass;
feeding input data 2 to N+1 into multiply-accumulate units N+1 to 2N for the second multiplication pass;
and so on, until input data M to N+M-1 are fed into multiply-accumulate units (N*M-N+1) to N*M for the M-th multiplication pass;
accumulating, at each of the N positions, the products of the M multiplication passes at that position, yielding N dot-product results.
Correspondingly, the present invention provides a multi-core-based convolutional neural network acceleration system comprising a convolution kernel setup module, a dot-product module, and an output module.
The convolution kernel setup module splits one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel, and connects the convolution kernels in series so that input data is transmitted serially between them.
The dot-product module performs, in parallel on each convolution kernel, a first predetermined number of dot-product operations, each dot-product operation comprising a second predetermined number of multiplications; the product of the first predetermined number and the second predetermined number is the number of multiply-accumulate units in a convolution kernel.
The output module outputs the dot-product results of each convolution kernel in order of output priority.
In one embodiment of the invention, the second predetermined number is 3, so as to support 3D vector dot products.
In one embodiment of the invention, the output priority is determined by the order in which the input data reaches the convolution kernels: a convolution kernel that receives the input data earlier has a higher output priority than a convolution kernel that receives it later.
In one embodiment of the invention, when performing the first predetermined number of dot-product operations in parallel on each convolution kernel, the dot-product module performs the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
feeding input data 1 to N into multiply-accumulate units 1 to N for the first multiplication pass;
feeding input data 2 to N+1 into multiply-accumulate units N+1 to 2N for the second multiplication pass;
and so on, until input data M to N+M-1 are fed into multiply-accumulate units (N*M-N+1) to N*M for the M-th multiplication pass;
accumulating, at each of the N positions, the products of the M multiplication passes at that position, yielding N dot-product results.
The present invention further provides a storage medium on which a computer program is stored; when executed by a processor, the program implements the above multi-core-based convolutional neural network acceleration method.
Finally, the present invention provides a terminal comprising a processor and a memory;
the memory stores a computer program;
the processor executes the computer program stored in the memory, so that the terminal performs the above multi-core-based convolutional neural network acceleration method.
As described above, the multi-core-based convolutional neural network acceleration method and system, storage medium, and terminal of the present invention have the following beneficial effects:
(1) They save the dynamic-memory bandwidth consumed by convolutional neural network computation; taking a 4-convolution-kernel configuration as an example, 75% of the input image data bandwidth can be saved at the same data processing speed.
(2) They improve the processing speed of the convolutional neural network: under the same hardware data bandwidth, taking 3D vector dot products as an example, the processing speed can be raised to 300% of the original.
(3) They reduce the dynamic power consumption of the convolutional neural network: with 4 convolution kernels and 3D vector dot products, computation time is reduced to 33% of the original and input image bandwidth to 25% of the original, reducing dynamic power consumption by 85%.
(4) They optimize the processing speed of convolutional neural networks in embedded products, offer a clear architecture, a clear division of labor, easy implementation, and a simple flow, and can be widely applied in the Internet of Things, wearable devices, and vehicle-mounted equipment.
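The gains claimed in (1) to (3) follow from simple ratios, which the following sketch makes explicit. The 85% power figure in (3) depends on power-model assumptions the patent does not state, so it is not reproduced here.

```python
# Ratio arithmetic behind the stated gains (a sketch, not a measurement):
# with K serially-chained kernels, only one of K input reads remains, and
# an M-way parallel dot product cuts compute time to 1/M.

K = 4   # convolution kernels sharing one serial input channel
M = 3   # multiplications fused into one 3D vector dot product

bandwidth_saved = 1 - 1 / K     # 0.75 -> "saves 75% of input bandwidth"
speedup = M                     # 3x   -> "processing speed to 300%"
runtime_fraction = 1 / M        # ~0.33 -> "computation time to 33%"
bandwidth_fraction = 1 / K      # 0.25 -> "input bandwidth to 25%"

print(bandwidth_saved, speedup, round(runtime_fraction, 2), bandwidth_fraction)
```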
Brief description of the drawings
Fig. 1 is a flowchart of the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 2 is a coordinate schematic diagram of the input image, the coefficients, and the output image;
Fig. 3 is an architecture diagram of the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 4 is a first state diagram of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 5 is a second state diagram of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 6 is a third state diagram of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 7 is a state diagram of the summation step of the parallel 3D vector dot product in the multi-core-based convolutional neural network acceleration method of the present invention in an embodiment;
Fig. 8 is a structural diagram of the multi-core-based convolutional neural network acceleration system of the present invention in an embodiment;
Fig. 9 is a structural diagram of the terminal of the present invention in an embodiment.
Description of reference numerals
21 Convolution kernel setup module
22 Dot-product module
23 Output module
31 Processor
32 Memory
Detailed description of the embodiments
The embodiments of the present invention are described below by way of specific examples; those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, where there is no conflict, the following embodiments and the features in them may be combined with one another.
It should be noted that the drawings provided in the following embodiments merely illustrate the basic concept of the present invention in a schematic way; they show only the components related to the present invention rather than the component counts, shapes, and sizes of an actual implementation. In an actual implementation, the form, quantity, and proportion of each component may vary arbitrarily, and the component layout may be more complex.
When data bandwidth is limited, the multi-core-based convolutional neural network acceleration method and system, storage medium, and terminal of the present invention save the data bandwidth of the convolutional neural network through multiple parallel convolution kernels; under the same hardware data bandwidth, they improve its processing speed through parallel dot-product operations within each convolution kernel. They optimize the processing speed of convolutional neural networks in embedded products, offer a clear architecture, a clear division of labor, easy implementation, and a simple flow, and can be widely applied in the Internet of Things, wearable devices, and vehicle-mounted equipment.
As shown in Fig. 1, in an embodiment, the multi-core-based convolutional neural network acceleration method of the present invention comprises the following steps.
Step S1: split one layer of the convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel; connect the convolution kernels in series so that input data is transmitted serially between them.
Taking image processing as an example, as shown in Fig. 2, the input and output images are both three-dimensional. The input image has an abscissa inx, an ordinate iny, and a coefficient depth coordinate kz; the output image has an abscissa outx, an ordinate outy, and a depth coordinate z. The coefficients are four-dimensional, with an abscissa kx, an ordinate ky, a coefficient depth coordinate kz, and an output depth coordinate z.
When one layer of the convolutional neural network is split into four subtasks, the coefficients are split into four groups along the z direction, as shown in Fig. 2, and each coefficient group is assigned to the convolution kernel of a different subtask. Data is serially connected between the different convolution kernels, so that bandwidth is saved through data sharing.
According to the processing characteristics of convolutional neural networks, each coefficient group must be convolved with the entire input image to obtain the output image of one z-plane, so the input image is highly reusable. As shown in Fig. 3, the convolution kernels are connected in series through a serial channel, so that input image data passes serially from one kernel to the next. Specifically, after the input image data is read from memory into convolution kernel 0, kernel 0 convolves it with the first coefficient group while simultaneously forwarding it to convolution kernel 1 through the serial data channel; kernel 1 thus avoids the bandwidth cost of reading the input image data itself. Convolution kernels 2 and 3 perform the same data operation and likewise avoid that bandwidth cost. With 4 convolution kernels, the serial data channel eliminates three of the four readings of the input image data, saving 75% of the input image data bandwidth while also reducing memory power consumption; it also resolves the trade-off between routing quantity and achievable frequency at the physical implementation level.
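The serially-chained kernels of Fig. 3 can be sketched as follows. Only kernel 0 reads each input datum from memory; every kernel forwards the datum to its successor over the serial channel, so K kernels cost one read instead of K. Function and variable names are illustrative, not from the patent.

```python
# Minimal software model of the serial kernel chain: each kernel k
# multiplies the shared input stream by its own coefficient group, while
# memory reads are counted once per datum (kernel 0's read is the only one).

def run_chain(inputs, coeff_groups):
    """Return per-kernel accumulated sums and the total memory reads."""
    memory_reads = 0
    partial_sums = [0] * len(coeff_groups)
    for x in inputs:
        memory_reads += 1                     # kernel 0 reads from memory ...
        for k, coeff in enumerate(coeff_groups):
            partial_sums[k] += coeff * x      # ... then x is forwarded down the chain
    return partial_sums, memory_reads

sums, reads = run_chain([1, 2, 3], coeff_groups=[1, 10, 100, 1000])
print(sums, reads)   # [6, 60, 600, 6000] 3  (3 reads instead of 4*3 = 12)
```

Without the chain, each of the 4 kernels would read all 3 inputs itself, hence the 75% saving cited above.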
Step S2: based on each convolution kernel, perform a first predetermined number of dot-product operations in parallel, each dot-product operation comprising a second predetermined number of multiplications; the product of the first predetermined number and the second predetermined number is the number of multiply-accumulate units in a convolution kernel.
In one embodiment of the invention, performing the first predetermined number of dot-product operations in parallel on each convolution kernel comprises the following steps:
21) obtain (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
22) feed input data 1 to N into multiply-accumulate units 1 to N for the first multiplication pass;
23) feed input data 2 to N+1 into multiply-accumulate units N+1 to 2N for the second multiplication pass;
24) and so on, until input data M to N+M-1 are fed into multiply-accumulate units (N*M-N+1) to N*M for the M-th multiplication pass;
25) accumulate, at each of the N positions, the products of the M multiplication passes at that position, obtaining N dot-product results.
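Steps 21) to 25) can be transcribed directly into a short software model: N parallel dot products of length M over a sliding window of N+M-1 inputs. In hardware the M passes run on separate multiply-accumulate units; here they are sequential loops, which is sufficient to check the scheduling.

```python
# N parallel dot products of length M over a window of N+M-1 inputs,
# mirroring steps 21)-25). Pass j multiplies inputs[j .. j+N-1] by the
# single coefficient k_j and accumulates into the N result positions.

def parallel_dot_products(inputs, coeffs):
    M = len(coeffs)              # multiplications per dot product (step 21)
    N = len(inputs) - M + 1      # number of dot products; N*M MAC units
    acc = [0] * N
    for j in range(M):           # passes of steps 22)-24)
        for i in range(N):
            acc[i] += inputs[i + j] * coeffs[j]   # accumulation of step 25)
    return acc

# 5 inputs, 3 coefficients -> N = 3 results: out_i = sum_j in_{i+j} * k_j
print(parallel_dot_products([1, 2, 3, 4, 5], [1, 1, 1]))  # [6, 9, 12]
```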
The case of a dot product comprising three multiplications (M=3) is further illustrated below. In this embodiment, each convolution kernel contains 24 multiply-accumulate units. To maximize their utilization under limited bandwidth, a 3D vector dot product comprising three multiplications is used to compute 8 dot-product results simultaneously (N=8). The 3D vector dot-product formulas are as follows:
out0 = in0*k0 + in1*k1 + in2*k2
out1 = in1*k0 + in2*k1 + in3*k2
out2 = in2*k0 + in3*k1 + in4*k2
......
out7 = in7*k0 + in8*k1 + in9*k2
First, as shown in Fig. 4, 10 input image data are read from memory, namely in0, in1, in2, ..., in9. The 0th to 7th data (in0, in1, in2, in3, in4, in5, in6, in7) are multiplied by coefficient 0 (k0), and the results are written to the accumulator inputs, namely out00, out01, out02, ..., out07.
Next, as shown in Fig. 5, the 10 input image data of the previous step are reused. The 1st to 8th data (in1, in2, in3, in4, in5, in6, in7, in8) are multiplied by coefficient 1 (k1), and the results are written to the accumulator inputs, namely out10, out11, out12, ..., out17.
Then, as shown in Fig. 6, the 10 input image data of the previous step are reused again. The 2nd to 9th data (in2, in3, in4, in5, in6, in7, in8, in9) are multiplied by coefficient 2 (k2), and the results are written to the accumulator inputs, namely out20, out21, out22, ..., out27.
Finally, as shown in Fig. 7, the results of the three multiplication passes are accumulated position by position to obtain the 8 3D vector dot-product results. The results at the first position of each pass, out00, out10, and out20, are added to give the first 3D vector dot-product result; the results at the second position, out01, out11, and out21, are added to give the second; and so on, until out07, out17, and out27 at the eighth position are added to give the eighth 3D vector dot-product result.
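The Fig. 4 to Fig. 7 walkthrough can be cross-checked numerically: three accumulation passes over 10 inputs must reproduce the closed-form out0 to out7 formulas given above. The input values and coefficients below are arbitrary test data, not values from the patent.

```python
# Three passes (Figs. 4, 5, 6) followed by positionwise accumulation
# (Fig. 7), checked against out_i = in_i*k0 + in_{i+1}*k1 + in_{i+2}*k2.

ins = list(range(10))   # in0..in9 = 0..9 (illustrative values)
k = [2, 3, 5]           # k0, k1, k2   (illustrative values)

acc = [0] * 8
for j in range(3):      # pass j multiplies ins[j..j+7] by k[j]
    for i in range(8):
        acc[i] += ins[i + j] * k[j]

expected = [ins[i]*k[0] + ins[i+1]*k[1] + ins[i+2]*k[2] for i in range(8)]
print(acc == expected, acc[0])  # True 13
```

All 24 products (8 positions times 3 passes) use the same 10 inputs, which is how the scheme keeps 24 multiply-accumulate units busy from a 10-value read.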
Step S3: output the dot-product results of each convolution kernel in order of output priority.
Specifically, the output priority is determined by the order in which the input data reaches the convolution kernels: the earlier a kernel receives the input data, the higher its output priority. The convolution kernel that receives the input data first therefore has a higher output priority than a convolution kernel that receives it later.
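A minimal sketch of the output-priority rule: results are emitted in the order in which each kernel first received the shared input stream (kernel 0 before kernel 1, and so on down the serial chain). The arrival times below are illustrative only.

```python
# Kernels paired with the (illustrative) time each first received input;
# earlier arrival means higher output priority, so we emit in that order.

kernels = [
    ("kernel2", 3),
    ("kernel0", 1),
    ("kernel3", 4),
    ("kernel1", 2),
]

output_order = [name for name, _ in sorted(kernels, key=lambda kv: kv[1])]
print(output_order)  # ['kernel0', 'kernel1', 'kernel2', 'kernel3']
```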
As shown in Fig. 8, in an embodiment, the multi-core-based convolutional neural network acceleration system of the present invention comprises a convolution kernel setup module 21, a dot-product module 22, and an output module 23.
The convolution kernel setup module 21 splits one layer of the convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel, and connects the convolution kernels in series so that input data is transmitted serially between them.
Taking image processing as an example, as shown in Fig. 2, the input and output images are both three-dimensional. The input image has an abscissa inx, an ordinate iny, and a coefficient depth coordinate kz; the output image has an abscissa outx, an ordinate outy, and a depth coordinate z. The coefficients are four-dimensional, with an abscissa kx, an ordinate ky, a coefficient depth coordinate kz, and an output depth coordinate z.
When one layer of the convolutional neural network is split into four subtasks, the coefficients are split into four groups along the z direction, as shown in Fig. 2, and each coefficient group is assigned to the convolution kernel of a different subtask. Data is serially connected between the different convolution kernels, so that bandwidth is saved through data sharing.
According to the processing characteristics of convolutional neural networks, each coefficient group must be convolved with the entire input image to obtain the output image of one z-plane, so the input image is highly reusable. As shown in Fig. 3, the convolution kernels are connected in series through a serial channel, so that input image data passes serially from one kernel to the next. Specifically, after the input image data is read from memory into convolution kernel 0, kernel 0 convolves it with the first coefficient group while simultaneously forwarding it to convolution kernel 1 through the serial data channel; kernel 1 thus avoids the bandwidth cost of reading the input image data itself. Convolution kernels 2 and 3 perform the same data operation and likewise avoid that bandwidth cost. With 4 convolution kernels, the serial data channel eliminates three of the four readings of the input image data, saving 75% of the input image data bandwidth while also reducing memory power consumption; it also resolves the trade-off between routing quantity and achievable frequency at the physical implementation level.
The dot-product module 22 is connected to the convolution kernel setup module 21 and performs, in parallel on each convolution kernel, a first predetermined number of dot-product operations, each dot-product operation comprising a second predetermined number of multiplications; the product of the first predetermined number and the second predetermined number is the number of multiply-accumulate units in a convolution kernel.
In one embodiment of the invention, when performing the first predetermined number of dot-product operations in parallel on each convolution kernel, the dot-product module 22 performs the following steps:
21) obtain (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
22) feed input data 1 to N into multiply-accumulate units 1 to N for the first multiplication pass;
23) feed input data 2 to N+1 into multiply-accumulate units N+1 to 2N for the second multiplication pass;
24) and so on, until input data M to N+M-1 are fed into multiply-accumulate units (N*M-N+1) to N*M for the M-th multiplication pass;
25) accumulate, at each of the N positions, the products of the M multiplication passes at that position, obtaining N dot-product results.
The case of a 3D vector dot product comprising three multiplications (M=3) is further illustrated below. In this embodiment, each convolution kernel contains 24 multiply-accumulate units. To maximize their utilization under limited bandwidth, a 3D vector dot product comprising three multiplications is used to compute 8 dot-product results simultaneously (N=8). The 3D vector dot-product formulas are as follows:
out0 = in0*k0 + in1*k1 + in2*k2
out1 = in1*k0 + in2*k1 + in3*k2
out2 = in2*k0 + in3*k1 + in4*k2
......
out7 = in7*k0 + in8*k1 + in9*k2
First, as shown in Fig. 4, 10 input image data are read from memory, namely in0, in1, in2, ..., in9. The 0th to 7th data (in0, in1, in2, in3, in4, in5, in6, in7) are multiplied by coefficient 0 (k0), and the results are written to the accumulator inputs, namely out00, out01, out02, ..., out07.
Next, as shown in Fig. 5, the 10 input image data of the previous step are reused. The 1st to 8th data (in1, in2, in3, in4, in5, in6, in7, in8) are multiplied by coefficient 1 (k1), and the results are written to the accumulator inputs, namely out10, out11, out12, ..., out17.
Then, as shown in Fig. 6, the 10 input image data of the previous step are reused again. The 2nd to 9th data (in2, in3, in4, in5, in6, in7, in8, in9) are multiplied by coefficient 2 (k2), and the results are written to the accumulator inputs, namely out20, out21, out22, ..., out27.
Finally, as shown in Fig. 7, the results of the three multiplication passes are accumulated position by position to obtain the 8 3D vector dot-product results. The results at the first position of each pass, out00, out10, and out20, are added to give the first 3D vector dot-product result; the results at the second position, out01, out11, and out21, are added to give the second; and so on, until out07, out17, and out27 at the eighth position are added to give the eighth 3D vector dot-product result.
The output module 23 is connected to the dot-product module 22 and outputs the dot-product results of each convolution kernel in order of output priority.
Specifically, the output priority is determined by the order in which the input data reaches the convolution kernels: the earlier a kernel receives the input data, the higher its output priority. The convolution kernel that receives the input data first therefore has a higher output priority than a convolution kernel that receives it later.
It should be noted that it should be understood that the modules of system above division be only a kind of division of logic function, Can completely or partially it be integrated on a physical entity when actually realizing, can also be physically separate.And these modules can be with All realized in the form of software is called by treatment element;All it can also realize in the form of hardware;Can also part mould Block calls the form of software to realize by treatment element, and part of module is realized by the form of hardware.For example, x modules can be The treatment element individually set up, it can also be integrated in some chip of said apparatus and realize, in addition it is also possible to program generation The form of code is stored in the memory of said apparatus, is called by some treatment element of said apparatus and is performed above x moulds The function of block.The realization of other modules is similar therewith.In addition these modules can completely or partially integrate, can also be only It is vertical to realize.Treatment element described here can be a kind of integrated circuit, have the disposal ability of signal.In implementation process, Each step of the above method or more modules can pass through the integrated logic circuit or soft of the hardware in processor elements The instruction of part form is completed.
For example, the above modules may be configured as one or more integrated circuits implementing the above method, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). As another example, when one of the above modules is implemented by a processing element scheduling program code, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of invoking program code. As yet another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
A computer program is stored on the storage medium of the present invention; when executed by a processor, the program implements the above multi-core-based convolutional neural network acceleration method. Preferably, the storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.
As shown in Fig. 9, in one embodiment, the terminal of the present invention includes a processor 31 and a memory 32.
The memory 32 is configured to store a computer program.
Preferably, the memory 32 includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.
The processor 31 is connected to the memory 32 and is configured to execute the computer program stored in the memory 32, so that the terminal performs the above multi-core-based convolutional neural network acceleration method.
Preferably, the processor 31 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In summary, the multi-core-based convolutional neural network acceleration method and system, storage medium, and terminal of the present invention save the dynamic memory bandwidth consumed by convolutional neural network computation: taking a four-convolution-kernel paradigm as an example, 75% of the input image data bandwidth can be saved at the same data processing speed. They improve the processing speed of convolutional neural networks: under the same hardware data bandwidth, taking 3D dot products as an example, the processing speed can be increased to 300%. They reduce the dynamic power consumption of convolutional neural networks: taking four convolution kernels with 3D dot products as an example, the operation time is reduced to 33% of the original, the input image bandwidth is reduced to 25% of the original, and the dynamic power consumption is reduced by 85%. They optimize the processing speed of convolutional neural networks in embedded products, with the advantages of a clear architecture, a clear division of labor, easy implementation, and a simple flow, and can be widely applied to the Internet of Things, wearable devices, and vehicle-mounted devices. The present invention therefore effectively overcomes various shortcomings of the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.

Claims (10)

1. A multi-core-based convolutional neural network acceleration method, characterized by comprising the following steps:
splitting one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel, wherein the convolution kernels are serially connected so that the input data is serially transferred between the convolution kernels;
performing, in parallel on each convolution kernel, a first predetermined number of dot product operations, each dot product operation comprising a second predetermined number of multiplications, wherein the product of the first predetermined number and the second predetermined number is the number of multiply-adders in a convolution kernel; and
outputting the dot product results of each convolution kernel in order of output priority.
2. The multi-core-based convolutional neural network acceleration method according to claim 1, characterized in that the second predetermined number is 3 so as to support 3D dot products.
3. The multi-core-based convolutional neural network acceleration method according to claim 1, characterized in that the output priority is determined by the order in which the convolution kernels receive the input data, a convolution kernel that receives the input data first having a higher output priority than a convolution kernel that receives the input data later.
4. The multi-core-based convolutional neural network acceleration method according to claim 1, characterized in that performing, in parallel on each convolution kernel, a first predetermined number of dot product operations comprises the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
inputting the 1st to Nth input data into the 1st to Nth multiply-adders, respectively, to be multiplied by a first coefficient;
inputting the 2nd to (N+1)th input data into the (N+1)th to (2N)th multiply-adders, respectively, to be multiplied by a second coefficient;
and so on, until the Mth to (N+M-1)th input data are input into the (N*M-N+1)th to (N*M)th multiply-adders, respectively, to be multiplied by an Mth coefficient; and
accumulating the products at corresponding positions over the N positions of the M multiplications to obtain N dot product results.
5. A multi-core-based convolutional neural network acceleration system, characterized by comprising a convolution kernel setup module, a vector dot product module, and an output module;
the convolution kernel setup module is configured to split one layer of a convolutional neural network into at least two subtasks, each subtask corresponding to one convolution kernel, wherein the convolution kernels are serially connected so that the input data is serially transferred between the convolution kernels;
the vector dot product module is configured to perform, in parallel on each convolution kernel, a first predetermined number of dot product operations, each dot product operation comprising a second predetermined number of multiplications, wherein the product of the first predetermined number and the second predetermined number is the number of multiply-adders in a convolution kernel; and
the output module is configured to output the dot product results of each convolution kernel in order of output priority.
6. The multi-core-based convolutional neural network acceleration system according to claim 5, characterized in that the second predetermined number is 3 so as to support 3D dot products.
7. The multi-core-based convolutional neural network acceleration system according to claim 5, characterized in that the output priority is determined by the order in which the convolution kernels receive the input data, a convolution kernel that receives the input data first having a higher output priority than a convolution kernel that receives the input data later.
8. The multi-core-based convolutional neural network acceleration system according to claim 5, characterized in that the vector dot product module, when performing the first predetermined number of dot product operations in parallel on each convolution kernel, performs the following steps:
obtaining (N+M-1) input data, where N is the first predetermined number and M is the second predetermined number;
inputting the 1st to Nth input data into the 1st to Nth multiply-adders, respectively, to be multiplied by a first coefficient;
inputting the 2nd to (N+1)th input data into the (N+1)th to (2N)th multiply-adders, respectively, to be multiplied by a second coefficient;
and so on, until the Mth to (N+M-1)th input data are input into the (N*M-N+1)th to (N*M)th multiply-adders, respectively, to be multiplied by an Mth coefficient; and
accumulating the products at corresponding positions over the N positions of the M multiplications to obtain N dot product results.
9. A storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the multi-core-based convolutional neural network acceleration method according to any one of claims 1 to 4 is implemented.
10. A terminal, characterized by comprising a processor and a memory;
the memory is configured to store a computer program; and
the processor is configured to execute the computer program stored in the memory, so that the terminal performs the multi-core-based convolutional neural network acceleration method according to any one of claims 1 to 4.
CN201711273248.5A 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal Active CN107862378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711273248.5A CN107862378B (en) 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711273248.5A CN107862378B (en) 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN107862378A true CN107862378A (en) 2018-03-30
CN107862378B CN107862378B (en) 2020-04-24

Family

ID=61705060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711273248.5A Active CN107862378B (en) 2017-12-06 2017-12-06 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN107862378B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681773A (en) * 2018-05-23 2018-10-19 腾讯科技(深圳)有限公司 Accelerated method, device, terminal and the readable storage medium storing program for executing of data operation
CN108920413A (en) * 2018-06-28 2018-11-30 中国人民解放军国防科技大学 Convolutional neural network multi-core parallel computing method facing GPDSP
CN109117940A (en) * 2018-06-19 2019-01-01 腾讯科技(深圳)有限公司 To accelerated method, apparatus and system before a kind of convolutional neural networks
CN109740733A (en) * 2018-12-27 2019-05-10 深圳云天励飞技术有限公司 Deep learning network model optimization method, device and relevant device
CN109740747A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel
CN110109646A (en) * 2019-03-28 2019-08-09 北京迈格威科技有限公司 Data processing method, device and adder and multiplier and storage medium
WO2020003043A1 (en) * 2018-06-27 2020-01-02 International Business Machines Corporation Low precision deep neural network enabled by compensation instructions
CN110689121A (en) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Method for realizing neural network model splitting by using multi-core processor and related product
CN110689115A (en) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Neural network model processing method and device, computer equipment and storage medium
CN110738317A (en) * 2019-10-17 2020-01-31 中国科学院上海高等研究院 FPGA-based deformable convolution network operation method, device and system
CN110796245A (en) * 2019-10-25 2020-02-14 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
CN110880032A (en) * 2018-09-06 2020-03-13 黑芝麻智能科技(上海)有限公司 Convolutional neural network using adaptive 3D array
CN110889497A (en) * 2018-12-29 2020-03-17 中科寒武纪科技股份有限公司 Learning task compiling method of artificial intelligence processor and related product
CN111563586A (en) * 2019-02-14 2020-08-21 上海寒武纪信息科技有限公司 Splitting method of neural network model and related product
CN111610963A (en) * 2020-06-24 2020-09-01 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
WO2021057746A1 (en) * 2019-09-24 2021-04-01 安徽寒武纪信息科技有限公司 Neural network processing method and apparatus, computer device and storage medium
WO2021081854A1 (en) * 2019-10-30 2021-05-06 华为技术有限公司 Convolution operation circuit and convolution operation method
CN114399828A (en) * 2022-03-25 2022-04-26 深圳比特微电子科技有限公司 Training method of convolution neural network model for image processing
CN116303108A (en) * 2022-09-07 2023-06-23 芯砺智能科技(上海)有限公司 Convolutional neural network weight address arrangement method suitable for parallel computing architecture

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN106203617A (en) * 2016-06-27 2016-12-07 哈尔滨工业大学深圳研究生院 A kind of acceleration processing unit based on convolutional neural networks and array structure
CN106599883A (en) * 2017-03-08 2017-04-26 王华锋 Face recognition method capable of extracting multi-level image semantics based on CNN (convolutional neural network)
CN106845635A (en) * 2017-01-24 2017-06-13 东南大学 CNN convolution kernel hardware design methods based on cascade form
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN106951395A (en) * 2017-02-13 2017-07-14 上海客鹭信息技术有限公司 Towards the parallel convolution operations method and device of compression convolutional neural networks
CN107301456A (en) * 2017-05-26 2017-10-27 中国人民解放军国防科学技术大学 Deep neural network multinuclear based on vector processor speeds up to method
WO2017186829A1 (en) * 2016-04-27 2017-11-02 Commissariat A L'energie Atomique Et Aux Energies Alternatives Device and method for calculating convolution in a convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Jinfeng, "A Concise and Efficient Method for Accelerating Convolutional Neural Networks," Science Technology and Engineering *
Li Daxia, "CUDA-CONVNET Deep Convolutional Neural Network Algorithm," China Master's Theses Full-text Database, Information Science and Technology *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681773A (en) * 2018-05-23 2018-10-19 腾讯科技(深圳)有限公司 Accelerated method, device, terminal and the readable storage medium storing program for executing of data operation
CN109117940B (en) * 2018-06-19 2020-12-15 腾讯科技(深圳)有限公司 Target detection method, device, terminal and storage medium based on convolutional neural network
CN109117940A (en) * 2018-06-19 2019-01-01 腾讯科技(深圳)有限公司 To accelerated method, apparatus and system before a kind of convolutional neural networks
JP2021530761A (en) * 2018-06-27 2021-11-11 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Low-precision deep neural network enabled by compensation instructions
GB2590000A (en) * 2018-06-27 2021-06-16 Ibm Low precision deep neural network enabled by compensation instructions
WO2020003043A1 (en) * 2018-06-27 2020-01-02 International Business Machines Corporation Low precision deep neural network enabled by compensation instructions
GB2590000B (en) * 2018-06-27 2022-12-07 Ibm Low precision deep neural network enabled by compensation instructions
JP7190799B2 (en) 2018-06-27 2022-12-16 インターナショナル・ビジネス・マシーンズ・コーポレーション Low Precision Deep Neural Networks Enabled by Compensation Instructions
CN108920413A (en) * 2018-06-28 2018-11-30 中国人民解放军国防科技大学 Convolutional neural network multi-core parallel computing method facing GPDSP
CN110880032B (en) * 2018-09-06 2022-07-19 黑芝麻智能科技(上海)有限公司 Convolutional neural network using adaptive 3D array
US11954573B2 (en) 2018-09-06 2024-04-09 Black Sesame Technologies Inc. Convolutional neural network using adaptive 3D array
CN110880032A (en) * 2018-09-06 2020-03-13 黑芝麻智能科技(上海)有限公司 Convolutional neural network using adaptive 3D array
CN109740733B (en) * 2018-12-27 2021-07-06 深圳云天励飞技术有限公司 Deep learning network model optimization method and device and related equipment
CN109740733A (en) * 2018-12-27 2019-05-10 深圳云天励飞技术有限公司 Deep learning network model optimization method, device and relevant device
CN109740747B (en) * 2018-12-29 2019-11-12 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN110889497A (en) * 2018-12-29 2020-03-17 中科寒武纪科技股份有限公司 Learning task compiling method of artificial intelligence processor and related product
US11893414B2 (en) 2018-12-29 2024-02-06 Cambricon Technologies Corporation Limited Operation method, device and related products
CN109740747A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN111563586A (en) * 2019-02-14 2020-08-21 上海寒武纪信息科技有限公司 Splitting method of neural network model and related product
CN111563586B (en) * 2019-02-14 2022-12-09 上海寒武纪信息科技有限公司 Splitting method of neural network model and related product
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel
CN110109646B (en) * 2019-03-28 2021-08-27 北京迈格威科技有限公司 Data processing method, data processing device, multiplier-adder and storage medium
CN110109646A (en) * 2019-03-28 2019-08-09 北京迈格威科技有限公司 Data processing method, device and adder and multiplier and storage medium
WO2021057720A1 (en) * 2019-09-24 2021-04-01 安徽寒武纪信息科技有限公司 Neural network model processing method and apparatus, computer device, and storage medium
CN110689121A (en) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Method for realizing neural network model splitting by using multi-core processor and related product
CN110689115A (en) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Neural network model processing method and device, computer equipment and storage medium
WO2021057746A1 (en) * 2019-09-24 2021-04-01 安徽寒武纪信息科技有限公司 Neural network processing method and apparatus, computer device and storage medium
CN110689115B (en) * 2019-09-24 2023-03-31 安徽寒武纪信息科技有限公司 Neural network model processing method and device, computer equipment and storage medium
CN110738317A (en) * 2019-10-17 2020-01-31 中国科学院上海高等研究院 FPGA-based deformable convolution network operation method, device and system
CN110796245B (en) * 2019-10-25 2022-03-22 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
CN110796245A (en) * 2019-10-25 2020-02-14 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
WO2021081854A1 (en) * 2019-10-30 2021-05-06 华为技术有限公司 Convolution operation circuit and convolution operation method
CN111610963A (en) * 2020-06-24 2020-09-01 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN111610963B (en) * 2020-06-24 2021-08-17 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN114399828B (en) * 2022-03-25 2022-07-08 深圳比特微电子科技有限公司 Training method of convolution neural network model for image processing
CN114399828A (en) * 2022-03-25 2022-04-26 深圳比特微电子科技有限公司 Training method of convolution neural network model for image processing
CN116303108A (en) * 2022-09-07 2023-06-23 芯砺智能科技(上海)有限公司 Convolutional neural network weight address arrangement method suitable for parallel computing architecture
CN116303108B (en) * 2022-09-07 2024-05-14 芯砺智能科技(上海)有限公司 Weight address arrangement method suitable for parallel computing architecture

Also Published As

Publication number Publication date
CN107862378B (en) 2020-04-24

Similar Documents

Publication Publication Date Title
CN107862378A (en) Convolutional neural networks accelerated method and system, storage medium and terminal based on multinuclear
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
CN107862650B (en) Method for accelerating calculation of CNN convolution of two-dimensional image
US10394929B2 (en) Adaptive execution engine for convolution computing systems
JP2022037022A (en) Execution of kernel stride in hardware
EP3761235A1 (en) Transposing neural network matrices in hardware
CN107862374A (en) Processing with Neural Network system and processing method based on streamline
US20160240000A1 (en) Rendering views of a scene in a graphics processing unit
TW202414280A (en) Method, system and non-transitory computer-readable storage medium for performing computations for a layer of neural network
GB2600031A (en) Batch processing in a neural network processor
CN107797962A (en) Computing array based on neutral net
CN108122030A (en) A kind of operation method of convolutional neural networks, device and server
CN110309906A (en) Image processing method, device, machine readable storage medium and processor
TW202123093A (en) Method and system for performing convolution operation
WO2019136750A1 (en) Artificial intelligence-based computer-aided processing device and method, storage medium, and terminal
KR20200081044A (en) Method and apparatus for processing convolution operation of neural network
CN108170640A (en) The method of its progress operation of neural network computing device and application
CN110147252A (en) A kind of parallel calculating method and device of convolutional neural networks
CN104967428B (en) Frequency domain implementation method for FPGA high-order and high-speed FIR filter
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN110109646A (en) Data processing method, device and adder and multiplier and storage medium
CN109447254B (en) Convolution neural network reasoning hardware acceleration method and device thereof
KR20180125843A (en) A hardware classifier applicable to various CNN models
JP2023541350A (en) Table convolution and acceleration

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 201203 China (Shanghai) Free Trade Pilot Zone 20A, Zhangjiang Building, 289 Chunxiao Road

Applicant after: Xinyuan Microelectronics (Shanghai) Co., Ltd.

Applicant after: Core chip technology (Shanghai) Co., Ltd.

Address before: 201203 Zhangjiang Building 20A, 560 Songtao Road, Zhangjiang High-tech Park, Pudong New Area, Shanghai

Applicant before: VeriSilicon Microelectronics (Shanghai) Co., Ltd.

Applicant before: Core chip technology (Shanghai) Co., Ltd.

GR01 Patent grant
GR01 Patent grant