CN109472361B - Neural network optimization method - Google Patents

Neural network optimization method

Info

Publication number
CN109472361B
Authority
CN
China
Prior art keywords
energy consumption
neural network
time
model
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn - After Issue
Application number
CN201811344189.0A
Other languages
Chinese (zh)
Other versions
CN109472361A (en)
Inventor
张跃进
胡勇
喻蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongxiang Boqian Information Technology Co ltd
Original Assignee
Zhongxiang Boqian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongxiang Boqian Information Technology Co ltd filed Critical Zhongxiang Boqian Information Technology Co ltd
Priority to CN201811344189.0A
Publication of CN109472361A
Application granted
Publication of CN109472361B
Legal status: Withdrawn - After Issue

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F 1/26 Power supply means, e.g. regulation thereof
    • G06F 1/32 Means for saving power
    • G06F 1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F 1/3234 Power saving characterised by the action undertaken

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a neural network optimization method, which comprises the following steps: presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters; constructing a neural network energy consumption model based on the modeling parameters; constructing a neural network time model based on the modeling parameters; and performing dual-objective optimization on the neural network energy consumption model and the neural network time model. The method models the time and energy consumption of the neural network from the perspective of the hardware computing process of the network, predicts time and energy consumption layer by layer, analyses the modeling parameters that dominate the time and energy overheads, and improves the neural network model by adjusting the modeling parameters, the array segmentation method and the cache segmentation method, so as to perform a dual-objective optimization of time and energy consumption on the neural network.

Description

Neural network optimization method
Technical Field
The application relates to the technical field of artificial neural networks, in particular to a neural network optimization method.
Background
With the rise of neural network technology, neural network hardware suited to different application scenarios has emerged. Neural networks have strong reasoning and prediction capability but a large computational load, so increasing the computation speed of a neural network while reducing its energy consumption has become a key issue.
In the related art, both the neural network training process and the inference process urgently need acceleration of network computation. Network training is mostly completed in the cloud using GPUs, and the parallelization and communication methods of different hardware greatly influence the training speed, so the computation speed of a neural network is mainly improved by increasing the parallel computing capability and reducing the communication overhead. The energy consumption of a neural network can be divided into computation energy consumption and memory-access energy consumption. Different data-reuse schemes, such as output-stationary and weight-stationary dataflows, can effectively reduce the memory-access energy consumption of the neural network.
However, the above research on neural network computation focuses on either low energy consumption or higher computation speed alone, and does not consider that, in a complex application environment, there may be a certain contradiction between accelerating a neural network and being energy-aware: reducing energy consumption may sacrifice speed, while reducing the computation time of the neural network may generate more energy consumption.
Disclosure of Invention
In order to overcome, at least to a certain extent, the problem that research on neural network computation in the related art focuses on either low energy consumption or higher computation speed alone, without considering that acceleration and energy awareness of a neural network may contradict each other in a complex application environment (reducing energy consumption may sacrifice speed, and reducing computation time may generate more energy consumption), the application provides a neural network optimization method, which comprises the following steps:
Presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters;
constructing a neural network energy consumption model based on the modeling parameters;
constructing a neural network time model based on the modeling parameters;
and performing double-target optimization on the neural network energy consumption model and the neural network time model.
Furthermore, the energy consumption calculation formula is E = V × T × e, where V is the data volume to be read/written/calculated, T is the number of times the data needs to be read/written/calculated repeatedly, and e is the unit energy consumption.
Further, the unit energy consumption includes read-write energy consumption and calculation energy consumption.
Further, the building of the neural network energy consumption model includes: and modeling the energy consumption of the neural network layer by layer to obtain a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model.
Further, the neural network time model includes a time overhead, the time overhead including a convolutional layer time overhead and a fully-connected layer time overhead.
Further, the time overhead is calculated as Tz = max(T_IO, T_operation), where T_IO is the read/write time and T_operation is the computation time.
Further, the performing of the dual-objective optimization on the neural network energy consumption model and the neural network time model includes:
performing cache segmentation on the neural network time model by using a TM calculation process;
carrying out array segmentation on the neural network energy consumption model by using an AP (array-partitioning) calculation process;
and selecting the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition.
Further, the selecting of the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition includes:
setting a task time Tmax;
calculating the time T_TM required when the neural network time model is partitioned using the TM calculation process cache segmentation;
calculating the energy consumption E_AP required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation;
selecting the minimum required energy consumption E_AP0 and calculating the time T_AP0 required to reach the minimum energy consumption E_AP0;
comparing Tmax and T_AP0, and judging whether the set task time meets the requirement according to the comparison result;
if the set task time meets the requirement, comparing Tmax with T_TM, and outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result.
Further, the judging whether the set task time meets the requirement according to the comparison result includes:
if Tmax > T_AP0, judging that the set task time meets the requirement;
otherwise, judging that the task cannot be completed within the specified time, and resetting the task time.
Further, the outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result includes:
if Tmax < T_TM, adjusting the modeling parameters and carrying out array segmentation on the neural network energy consumption model again;
calculating the energy consumption E_AP' required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation, and selecting the modeling parameters and array segmentation method corresponding to the minimum energy consumption E_AP0' as the dual-objective optimization result;
otherwise, selecting the modeling parameters and the cache segmentation method corresponding to the TM calculation process as the dual-objective optimization result.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the neural network optimization method comprises the following steps: presetting modeling parameters, and constructing a neural network energy consumption model and a neural network time model based on the modeling parameters; and performing double-target optimization on the neural network energy consumption model and the neural network time model. According to the method, time and energy consumption modeling is carried out on the neural network from the perspective of a hardware computing process of the network, time and energy consumption are predicted layer by layer, leading modeling parameters of time and energy consumption overhead are analyzed, and a neural network model is improved by improving the modeling parameters, changing an array segmentation method and a cache segmentation method to carry out time and energy consumption double-target optimization on the neural network.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart of a neural network optimization method according to an embodiment of the present application.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
Fig. 1 is a flowchart of a neural network optimization method according to an embodiment of the present application.
As shown in fig. 1, the neural network optimization method of the present embodiment includes:
s1: presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters;
s2: constructing a neural network energy consumption model based on the modeling parameters;
s3: constructing a neural network time model based on the modeling parameters;
s4: and performing double-target optimization on the neural network energy consumption model and the neural network time model.
A neural network structure contains three different kinds of layers: the convolutional layer (CONV), the fully-connected layer (FC) and the pooling layer (POOL), which correspond to the three tasks of feature extraction, feature combination and feature compression respectively. Different network structures differ in the number and type of layers, and different layers have different characteristics. For example, the convolutional layer is computation-heavy while the fully-connected layer is data-heavy, and this imbalance of resource requirements must be addressed by a dedicated hardware architecture design.
Some Neural network architectures also include an RNN (Recurrent Neural Networks) layer or a CNN (Convolutional Neural Networks) layer to accomplish specific tasks.
The hardware is, for example, a Thinker chip, and the Thinker chip is composed of a PE array, an on-chip storage system, a finite state controller, an IO (input/output) and a decoder.
The modeling parameters include network parameters and hardware parameters, which are shown in table 1.
TABLE 1 modeling parameter Table
[Table 1 is provided as an image in the original publication; it lists the network parameters and hardware parameters used for modeling.]
The neural network algorithm with the lowest energy consumption under a given task-time condition is obtained by constructing a neural network energy consumption model and a neural network time model, performing a dual-objective optimization of energy consumption and time, and adjusting the modeling parameters.
As an optional implementation manner of the present invention, the energy consumption calculation formula is E = V × T × e, where V is the data volume to be read/written/calculated, T is the number of times the data needs to be read/written/calculated repeatedly, and e is the unit energy consumption.
The data volume V required to be read/written/calculated and the times T required to be read/written/calculated repeatedly can be calculated by utilizing network parameters and hardware parameters; for the unit energy consumption e, the access energy consumption of different storage levels is different, so that the analysis needs to be performed layer by layer.
The memory access energy consumption of different memory hierarchy is shown in table 2.
TABLE 2 Normalized memory-access energy consumption of different storage levels (per byte)
DRAM ↔ on-chip cache (SRAM): 200
Cache ↔ register file: 6
Cache ↔ PE array: 6
Register file → PE array: 2
PE ↔ PE transfer: 2
If data is transferred between two storage levels, the higher of the two unit energies is taken as the energy consumption of the transfer. For example, when data is transferred from the DRAM to the on-chip cache (SRAM), the unit energy consumption is taken to be 200 energy consumption units.
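As a minimal illustration of how the formula E = V × T × e combines with these normalized unit energies, the following Python sketch computes the energy of a single transfer. All names and the example volume are illustrative and not taken from the patent.

```python
# Minimal sketch of the energy formula E = V * T * e, using the normalized
# per-byte energies quoted above (DRAM<->cache: 200, cache<->PE: 6,
# register->PE: 2, PE<->PE: 2). All identifiers here are illustrative.

UNIT_ENERGY = {
    ("dram", "buffer"): 200,    # DRAM <-> on-chip cache (SRAM)
    ("buffer", "pe"): 6,        # cache <-> PE array
    ("register", "pe"): 2,      # register file -> PE array
    ("pe", "pe"): 2,            # transfer between neighbouring PEs
}

def transfer_energy(volume_bytes: int, repeats: int, src: str, dst: str) -> int:
    """E = V * T * e for one transfer path; the path is looked up in either direction."""
    key = (src, dst) if (src, dst) in UNIT_ENERGY else (dst, src)
    return volume_bytes * repeats * UNIT_ENERGY[key]

# Example: importing 1 MiB of weights once from DRAM into the cache.
print(transfer_energy(1 << 20, 1, "dram", "buffer"))  # 1048576 * 1 * 200
```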
As an optional implementation manner of the present invention, the unit energy consumption includes read-write energy consumption and computational energy consumption.
As an optional implementation manner of the present invention, the constructing the neural network energy consumption model includes: and modeling the energy consumption of the neural network layer by layer to obtain a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model.
As an optional implementation manner of the present invention, the method for calculating the unit energy consumption E0 includes:
respectively constructing a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model;
and respectively calculating unit read-write/calculated energy consumption E0 corresponding to the convolutional layer energy consumption model, the full-link layer energy consumption model, the RNN layer energy consumption model and the pooling layer energy consumption model.
The energy consumption in the convolutional layer energy consumption model is calculated as follows:
for a complete sample, a convolution layer is calculated, and the energy consumption is:
E1 = E1_IO + E1_operation    (1)
where E1_IO is the convolutional layer read/write energy consumption and E1_operation is the computation energy consumption.
The read/write energy consumption is E1_IO = E1_weightIO + E1_inputIO + E1_outputIO    (2)
where E1_weightIO is the weight energy consumption, E1_inputIO is the input energy consumption, and E1_outputIO is the output energy consumption.
The weight energy consumption E1_weightIO, the input energy consumption E1_inputIO and the output energy consumption E1_outputIO are calculated respectively as follows:
(1) Weight energy consumption E1_weightIO
E1_weightIO = E1_DRAM_buffer + E1_buffer_PE    (3)
where E1_DRAM_buffer is the energy consumption of reading from the DRAM into the cache, and E1_buffer_PE is the energy consumption from the cache to the PE array.
E1_DRAM_buffer = N_w · RD_DRAM_CONV · e_DRAM_buffer    (4)
where N_w is the total weight data volume of the convolutional layer and RD_DRAM_CONV is the number of rounds in which the weights are written from the DRAM into the cache.
The total weight data volume of the convolutional layer is N_w = γ·Ch_out_i·K_i²·Ch_in_i    (5), and the energy consumption e_DRAM_buffer for reading 1 byte of data from the DRAM into the cache is, for example, 200 energy consumption units.
The number of times the weights are repeatedly imported is related to the size of the on-chip cache. If the on-chip cache can store all the weight data of the layer, each data item only needs to be imported once, and subsequent accesses only involve the interaction between the cache and the PE array; otherwise, when the PEs need to reuse weight data that is no longer held in the cache, the data has to be imported again. In one round the PE array produces a fixed number of output points of the two-dimensional output map, so the input data has to be imported into the PE array in several batches; when the cache capacity is insufficient, the same weights are therefore imported repeatedly, which determines RD_DRAM_CONV in equation (6). The number of repeated reads from the cache into the PE array, multiplied by the per-byte energy e_buffer_PE (for example 6 energy consumption units), gives E1_buffer_PE in equation (7). (The exact expressions of equations (6) and (7) appear as images in the original publication.)
(2) Input energy consumption E1_inputIO
Input data must pass from the DRAM through the buffer and the register to reach the PE array, and must also be transferred within the PE array. Thus,
E1_inputIO = E1_DRAM_buffer + E1_buffer_register + E1_register_PE + E1_PE_tran    (8)
where E1_DRAM_buffer is the energy consumption from the DRAM to the cache, E1_buffer_register the energy consumption from the cache to the register, E1_register_PE the energy consumption from the register to the PE array, and E1_PE_tran the energy consumption of passing data that enters the leftmost PE of a row rightward across the PE array.
It should be noted that, because of padding, the total amount of input data may differ from the number of points of the input feature map. With 'valid' padding, no padding occurs and the original input size is unchanged. With 'same' padding, the effective input size changes even though the spatial size of the output map is unchanged, and the changed height and width are used in the subsequent calculations.
The expression for E1_DRAM_buffer of the input data appears as an image in the original publication; in it, H_in_i·W_in_i·Ch_in_i·α is the total amount of input data of the convolutional layer and S_CONVdatabuffer is the data buffer size. Because Thinker preferentially reuses all input points, the input points do not need to be imported into the buffer repeatedly.
A register file is placed between the cache and the PE array to avoid repeated import of the input data, and the expression for E1_buffer_register likewise appears as an image.
In that expression, one factor (shown as an image in the original publication) is the number of input rows required for one parallel operation of the PE rows; S_i is the horizontal stride, and the vertical step length is H_u_i + S_i − K_i. During horizontal sliding each data item is read only once, so only the number of vertical sliding steps (also shown as an image) matters. The per-byte transfer energy from the buffer to the register, e_buffer_register, is 6 energy consumption units.
The expression for E1_register_PE also appears as an image in the original publication. Its factors are the number of repeated imports of the input data (given as an image), the total amount of input data Ch_in_i·K_i²·H_out_i·W_out_i (the number of output points multiplied by the number of input points corresponding to each output point), and the register-to-PE-array per-byte transfer energy e_register_PE, which is 2 energy consumption units.
The expression for E1_PE_tran likewise appears as an image in the original publication. Owing to the reuse of input data, each data item that enters the leftmost PE of the array is passed to the right, so each input point is accessed repeatedly a number of times given by an expression shown as an image. The energy consumption e_PE_tran for reading one byte of data is, for example, 2 energy consumption units.
(3) Output energy consumption E1_outputIO
The total amount of output data is β·Ch_out_i·H_out_i·W_out_i. The outputs must be transferred between PEs and then into the cache, and any data that cannot be held in the cache is then written out to the DRAM to wait for the next calculation.
Thus, E1_outputIO = E1_PEout_tran + E1_PE_buffer + E1_buffer_DRAM    (13)
The output points produced by each PE must first be passed to the leftmost PE of its row. Considering one row of PEs, a certain number of output points is generated in each calculation cycle, and all of them have to be transmitted to the leftmost PE, which requires a corresponding total number of accesses; on average, each output point therefore requires a certain number of memory accesses (these counts, and the resulting expression for E1_PEout_tran, appear as images in the original publication). The total amount of output data is β·Ch_out_i·H_out_i·W_out_i, and the per-byte transmission energy e_PEout_tran is, for example, 2 energy consumption units. The expressions for E1_PE_buffer and E1_buffer_DRAM likewise appear as images in the original publication.
The computation energy consumption E1_operation is given by an expression that also appears as an image. In one layer of CONV calculation, a total of Ch_out_i·H_out_i·W_out_i output points have to be produced, and each point requires K_i²·Ch_in_i multiply-add operations.
In some embodiments, for a given sample, the energy consumed to complete all convolutional layer operations is the sum of the energy consumed by all convolutional layers.
In some embodiments, each sample in the AP computation stream and the TM computation stream is computed serially, so that the energy consumption for multiple samples only needs to be added to the energy consumption of different samples.
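The layer-by-layer convolutional energy model can be summarized in a small sketch. The helper below implements only the two pieces stated explicitly in the text: the multiply-add count behind E1_operation, and the weight-import energy built from N_w (equation (5)) and a round count RD_DRAM_CONV. The re-import rule used when the weights do not fit in the cache is a deliberate simplification of the image-only expressions, and all names and numeric values are illustrative.

```python
from dataclasses import dataclass
import math

# Illustrative normalized unit energies (see Table 2); e_operation is assumed.
E_DRAM_BUFFER = 200
E_OPERATION = 1

@dataclass
class ConvLayer:
    ch_in: int      # Ch_in_i, input channels
    ch_out: int     # Ch_out_i, output channels
    h_out: int      # H_out_i, output height
    w_out: int      # W_out_i, output width
    k: int          # K_i, kernel size
    gamma: int = 1  # bytes per weight (gamma)

def conv_operation_energy(layer: ConvLayer) -> int:
    """E1_operation: each output point needs K_i^2 * Ch_in_i multiply-adds."""
    output_points = layer.ch_out * layer.h_out * layer.w_out
    macs_per_point = layer.k ** 2 * layer.ch_in
    return output_points * macs_per_point * E_OPERATION

def conv_weight_dram_energy(layer: ConvLayer, weight_buffer_bytes: int) -> int:
    """Weight import energy N_w * RD_DRAM_CONV * e_DRAM_buffer (equations (4)-(5)).
    The round count used when the weights do not fit in the cache is a
    simplification, not the image-only expression of the original text."""
    n_w = layer.gamma * layer.ch_out * layer.k ** 2 * layer.ch_in   # equation (5)
    rounds = 1 if n_w <= weight_buffer_bytes else math.ceil(n_w / weight_buffer_bytes)
    return n_w * rounds * E_DRAM_BUFFER

layer = ConvLayer(ch_in=64, ch_out=128, h_out=28, w_out=28, k=3)
print(conv_operation_energy(layer), conv_weight_dram_energy(layer, 64 * 1024))
```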
The energy consumption in the full connection layer energy consumption model is calculated as follows:
E2 = E2_IO + E2_operation    (18)
In the Thinker chip, the fully-connected layers are computed for multiple samples at once, layer by layer. The read/write energy consumption of a batch FC operation is calculated as:
E2_IO = E2_weightIO + E2_inputIO + E2_outputIO    (19)
(1) E2_weightIO is calculated as follows:
The weight read/write energy consumption of the FC layer is divided into two parts:
E2_weightIO = E2_DRAM_buffer + E2_buffer_PE    (20)
In the same way as for the convolutional layer, the expressions for these two parts can be obtained (equations (21) to (23), which appear as images in the original publication), where BS is the batch size.
(2) E2_inputIO is calculated as follows:
Unlike the CONV layer, the input of the FC layer does not need to pass through the registers designed for convolutional input reuse, so E2_inputIO consists of three parts:
E2_inputIO = E2_DRAM_buffer + E2_buffer_PE + E2_PE_tran    (24)
The expressions for these three parts are given by equations (25) to (28), which appear as images in the original publication, where e_PEinput_tran is the per-byte transmission energy consumption, for example 2 energy consumption units.
(3) E2_outputIO is calculated as follows:
E2_outputIO = E2_PEout_tran + E2_PE_buffer + E2_buffer_DRAM    (29)
where the expression for E2_PEout_tran appears as an image in the original publication, and
E2_PE_buffer = FC_{i+1}·BS·β·e_PE_buffer    (31)
E2_buffer_DRAM = ReLU(FC_{i+1}·BS − S_FCdatabuffer)·β·e_buffer_DRAM    (32)
where e_PEout_tran, e_PE_buffer and e_buffer_DRAM are the per-byte transmission energies at the respective storage levels, equal to 2, 6 and 200 energy consumption units respectively.
When a plurality of FC layers exist in the network, the total read/write energy consumption of the FC calculations in a batch is the sum of the read/write energy consumption of all the FC layers.
The computation energy consumption E2_operation is calculated as follows: there are FC_{i+1}·BS output points in total, and each output point requires FC_i multiply-add operations. With e_operation denoting the energy required for a unit-byte multiply-add operation, the energy consumption of this part is:
E2_operation = FC_{i+1}·BS·FC_i·α·γ·e_operation    (33)
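A compact sketch of the FC-layer terms that the text states explicitly, equations (31), (32) and (33), is given below. The ReLU in equation (32) means that data spills to the DRAM only when the batch output exceeds the FC data buffer. The byte widths α, β, γ, the buffer size and the unit energies used in the example call are illustrative assumptions.

```python
# Sketch of the fully-connected (FC) layer energy terms given explicitly in the
# text: equations (31), (32) and (33). Unit energies, byte widths and buffer
# sizes below are illustrative values, not taken from the patent.

E_PE_BUFFER = 6
E_BUFFER_DRAM = 200
E_OPERATION = 1

def relu(x: float) -> float:
    """ReLU as used in equation (32): spill to DRAM only if the buffer overflows."""
    return max(0.0, x)

def fc_output_io_energy(fc_out: int, bs: int, beta: int, s_fc_databuffer: int) -> float:
    e_pe_buffer = fc_out * bs * beta * E_PE_BUFFER                                # eq. (31)
    e_buffer_dram = relu(fc_out * bs - s_fc_databuffer) * beta * E_BUFFER_DRAM    # eq. (32)
    return e_pe_buffer + e_buffer_dram

def fc_operation_energy(fc_in: int, fc_out: int, bs: int, alpha: int, gamma: int) -> float:
    # eq. (33): FC_{i+1}*BS output points, each needing FC_i multiply-adds
    return fc_out * bs * fc_in * alpha * gamma * E_OPERATION

print(fc_output_io_energy(fc_out=1000, bs=32, beta=2, s_fc_databuffer=16384))
print(fc_operation_energy(fc_in=4096, fc_out=1000, bs=32, alpha=2, gamma=2))
```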
The energy consumption in the RNN layer energy consumption model is calculated as follows:
when a plurality of RNN layers exist in the network, the total energy consumption calculated by the RNN in a batch is the sum of the calculated energy consumption of each layer.
The most commonly applied RNN is a sequence model of an LSTM structure, and the following RNN energy consumption analysis mainly aims at the LSTM structure.
In an RNN, the calculation flow of an LSTM unit is:
f_t = σ(w_xf·x_t + w_hf·h_{t-1} + b_f)
i_t = σ(w_xi·x_t + w_hi·h_{t-1} + b_i)
o_t = σ(w_xo·x_t + w_ho·h_{t-1} + b_o)
g_t = tanh(w_xg·x_t + w_hg·h_{t-1} + b_g)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
The Thinker chip uses two kinds of PE, ordinary PEs and super PEs, to compute the RNN layer. The part W_x_gate·x_t_gate + W_h_gate·h_{t-1}_gate + b_gate follows the computation principle of the FC layer and is computed in the ordinary PEs. After that computation is finished, the data is imported into the super PEs to compute the RNN gating: the sigmoid or tanh functions are evaluated to obtain the various gate vectors, and c_t and h_t are obtained through multiplication and addition.
The energy consumption of a single RNN layer for one batch is calculated first. Considering that the RNN is divided into an FC part and an RNN-gating part, and denoting the number of RNN iterations in a batch by Iteration, the expression of the RNN layer energy consumption (equation (34)) appears as an image in the original publication; it sums the FC energy and the gating energy over the Iteration iterations.
(1) The calculation of the RNN-FC energy consumption is as follows:
The energy consumption of this part is calculated in basically the same way as for the FC layer; the difference is that the dimensions of the weight matrix, the input vector and the output vector need to be adjusted. For the RNN, the unified form of the FC-layer computation is:
FC_t = W_x_gate·x_t_gate + W_h_gate·h_{t-1}_gate + b_gate    (35)
where gate can be i/f/o/g, corresponding to the four gates of the LSTM. In a concrete implementation, W_x_gate and W_h_gate are concatenated horizontally into one W_gate, and x_t_gate and h_{t-1}_gate are concatenated vertically into one x_gate. Thus W_gate has dimension O_len_i × (I_len_i + O_len_i) and x_gate has dimension 1 × (I_len_i + O_len_i), corresponding to the FC-layer parameters FC_i and FC_{i+1}, namely:
FC_i = I_len_i + O_len_i
FC_{i+1} = O_len_i
therefore, the method can be classified into an FC layer energy consumption model for analysis. It is noted that the FC operation needs to be repeated four times for each LSTM cell, corresponding to the number of gates. And will not be described in detail herein.
(2) The gating memory-access energy consumption is calculated as follows.
As described above, the gating calculation mainly involves two kinds of operations: (a) simple tanh/sigmoid function evaluations and (b) element-wise multiplications. The gating memory-access energy can therefore be written as the sum of the access energy of these two kinds of operations (the corresponding expressions appear as images in the original publication).
The access energy of (a) is calculated as follows. tanh/sigmoid is mainly used to compute the 4 gate functions and c_t, i.e. element-wise tanh/sigmoid operations are applied to 5 groups of data of vector length O_len_i, so the total number of operations is BS·5·O_len_i. The calculation is performed in the PEs, and the data has to be imported from the DRAM into the buffer, imported from the buffer into the PEs, and finally exported. Let the per-byte transfer energy from the DRAM to the buffer be e_DRAM_buffer, from the buffer to the PEs be e_buffer_PE, and from the PEs to the buffer and from the buffer to the DRAM be e_PE_buffer and e_buffer_DRAM. Each data item undergoes this operation at most once, and interaction between the DRAM and the buffer occurs only when the buffer cannot hold all the input points, so the corresponding expression (shown as an image in the original publication) can be summarized.
The access energy of (b) mainly comprises the calculation of c_t and the calculation of h_t. For h_t, only one element-wise multiplication is required, so the data read involves a total of 2·O_len_i input values and the data written out involves O_len_i values; the access energy is obtained in the same way as for the tanh/sigmoid calculation above (the expression appears as an image in the original publication). For c_t, an accumulation is required in addition to the multiplications: each output value needs two multiplications whose products are then summed, so one product has to be read out into the buffer and then imported into the register in the PE to complete the summation. The access energy of this part is obtained similarly (the expression appears as an image in the original publication).
The computation energy consumption of the gating is calculated as follows. Three kinds of operations occur in the gating: tanh/sigmoid, multiplication and addition. Let the corresponding per-element operation energies be e_tanh/sigmoid, e_multiply and e_add, for example all equal to 1. The numbers of times the three operations are executed can be calculated as 5·O_len_i, 3·O_len_i and O_len_i respectively, from which the gating computation energy is obtained (the expression appears as an image in the original publication).
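The gating computation-energy count just described translates directly into a few lines of code. The sketch below assumes, as an illustration, that the quoted counts (5·O_len_i tanh/sigmoid evaluations, 3·O_len_i multiplications and O_len_i additions) are per sample and scales them by the batch size; the unit energies of 1 follow the example values in the text.

```python
# Sketch of the RNN gating computation energy: per sample, 5*O_len tanh/sigmoid
# evaluations, 3*O_len multiplications and O_len additions, each assumed to cost
# one energy unit. The per-sample interpretation and batch scaling are assumptions.

E_TANH_SIGMOID = 1
E_MULTIPLY = 1
E_ADD = 1

def gating_operation_energy(o_len: int, batch_size: int) -> int:
    per_sample = (5 * o_len * E_TANH_SIGMOID
                  + 3 * o_len * E_MULTIPLY
                  + o_len * E_ADD)
    return batch_size * per_sample

print(gating_operation_energy(o_len=256, batch_size=32))
```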
the energy consumption in the pooling layer energy consumption model is calculated as follows:
for pooling operation, the energy consumption of a single layer, single sample, can be considered first, since there is no multiplexing of data between different layers, different samples. Similarly, the energy consumption of the pooling operation is divided into a discussion of reading and writing energy consumption and calculating energy consumption.
An energy consumption model of the pooling operation is established as follows; if a multi-layer, multi-sample result is needed, the per-layer, per-sample results are simply summed.
E4 = E4_IO + E4_operation    (41)
The pooling layer read/write energy consumption E4_IO is calculated as follows.
Thinker supports max pooling, which reduces the height and width of the output map while keeping the number of channels. The height and width of the output map are related to those of the input map through the pooling size (the expressions appear as images in the original publication). The total number of pooled blocks is then:
X = H_out_i·W_out_i·Ch_in_i    (42)
the data read-write energy consumption of the pooling operation can be divided into two types of input data and output data, and each type needs to complete read-write interaction from a DRAM (dynamic random access memory), a cache and a PE (provider edge) array. Since data does not need to be repeatedly imported, the energy consumption model of the part is simple, namely:
Figure BDA0001863240100000153
wherein:
Figure BDA0001863240100000154
for the read-in energy consumption of the input data,
Figure BDA0001863240100000155
is the write-out power consumption of the output data.
The computation energy consumption involves the total number of data items that have to be processed. It should be noted that this total is again not equal to the size of the input feature map; it has to be inferred backwards from the size of the output map and the kernel size, namely:
E4_operation = p_i·p_i·H_out_i·W_out_i·Ch_out_i·e_operation    (44)
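Equations (42) and (44) of the pooling model translate directly into code. The sketch below is illustrative; the unit operation energy and the example sizes are assumptions.

```python
# Sketch of the pooling-layer counts of equations (42) and (44). The total number
# of elements feeding the pooling windows is inferred backwards from the output
# map size and the kernel size p_i. The unit energy is an assumed value.

E_OPERATION = 1

def pooling_block_count(h_out: int, w_out: int, ch_in: int) -> int:
    """Equation (42): total number of pooled blocks X."""
    return h_out * w_out * ch_in

def pooling_operation_energy(p: int, h_out: int, w_out: int, ch_out: int) -> int:
    """Equation (44): p_i * p_i elements are visited per output point."""
    return p * p * h_out * w_out * ch_out * E_OPERATION

print(pooling_block_count(14, 14, 128), pooling_operation_energy(2, 14, 14, 128))
```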
As an optional implementation manner of the present invention, the neural network time model includes a time overhead, and the time overhead includes a convolutional layer time overhead and a fully-connected layer time overhead.
As an optional implementation manner of the present invention, the time overhead is calculated as Tz = max(T_IO, T_operation), where T_IO is the read/write time and T_operation is the computation time.
As an optional implementation manner of the present invention, the neural network time model includes a convolutional layer time model and a fully-connected layer time model.
For the Thinker chip, the time overhead mainly comes from the computation-heavy convolution operations and the data-heavy fully-connected operations (including the fully-connected operations inside the FC layers and inside the RNN). Because of the 'hiding' effect of time (the smaller of the read/write time and the computation time is hidden behind the larger one), the less time-consuming RNN-gating and pooling operations do not require a time model. Time modeling is therefore carried out for the convolutional layer and the fully-connected layer.
As an optional implementation manner of the present invention, the convolutional layer time model and the fully-connected layer time model each include a time overhead, calculated as Tz = max(T_IO, T_operation), where T_IO is the read/write time and T_operation is the computation time.
The convolutional layer time model includes the following.
The read/write time is calculated as follows. In the Thinker calculation flow, the convolutional layers are computed one at a time, for a single layer of a single sample, so the time required for one layer of convolution in one sample is analysed first.
Data reading and writing passes through several levels such as the DRAM, the cache and the PEs. Once the data has been imported into the on-chip buffer, its further transfer is very fast and can be ignored, so T_IO reduces to the time spent on the interaction between the DRAM and the cache.
In the Thinker chip, the input/output data and the weights are imported from the DRAM into two different buffers; the buffers have different bandwidths, which can be adapted to the architecture of the neural network. The bandwidth shared by the input and output data is BW_dataconv, and the bandwidth of the weight data is BW_weightconv. For one layer of the network, the total amount of input and output data is H_in_i·W_in_i·Ch_in_i·α + H_out_i·W_out_i·Ch_out_i·β, and the total amount of weight data is given by an expression shown as an image in the original publication. The read/write time of the input/output data and that of the weight data can therefore be calculated separately, and the larger of the two is T_IO (equation (45), shown as an image in the original publication).
The computation time T_operation is calculated as follows. For one convolutional layer of one sample, the computation requires several rounds of operation of the PE array: in one round the PE array computes several output points in parallel, and after several rounds all output points of the layer have been computed. Let the number of rounds required to finish one layer be round_convlayer and the time required per round be t_round_conv; then the computation time is:
T_operation = round_convlayer · t_round_conv    (46)
The expression for round_convlayer appears as an image in the original publication. Since Thinker reuses the input data row by row, H_out_i·W_out_i is the total number of rows that must be input; each round computes a certain number of rows (given as an image in the original publication), so a corresponding number of rounds is needed to process the whole input, and for one group of input data the same weights have to be imported a certain number of times (also given as an image). Each output point involves a certain number of multiply-add operations, so the time of one round of computation is the time required to perform these multiply-add operations sequentially (the corresponding expressions appear as images in the original publication).
The time to compute one convolutional layer of one sample is thus determined. The operation time of all convolutional layers in one sample is obtained by summing the per-layer times; to obtain the convolutional-layer operation time of all samples in a batch, the total time of one sample is simply multiplied by the batch size BS.
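The structure of the convolutional time model, Tz = max(T_IO, T_operation) with T_operation = round_convlayer · t_round_conv, can be sketched as follows. Because equations (45), (47) and (48) are only available as images, the round count and per-round time below are simplified stand-ins, and all bandwidths and sizes in the example are illustrative.

```python
import math

# Sketch of the convolutional-layer time model: the layer time is the maximum of
# the read/write time and the computation time ("hiding effect"), and the
# computation time is the number of PE-array rounds times the per-round time.
# The round count and per-round time here are simplified stand-ins for the
# image-only expressions of the original text; all values are illustrative.

def layer_time(t_io: float, t_operation: float) -> float:
    """Tz = max(T_IO, T_operation)."""
    return max(t_io, t_operation)

def conv_io_time(data_bytes: int, weight_bytes: int,
                 bw_data: float, bw_weight: float) -> float:
    """T_IO: larger of the data transfer time and the weight transfer time."""
    return max(data_bytes / bw_data, weight_bytes / bw_weight)

def conv_operation_time(output_points: int, macs_per_point: int,
                        pe_parallelism: int, t_mac: float) -> float:
    """T_operation = round_convlayer * t_round_conv (simplified round count)."""
    rounds = math.ceil(output_points / pe_parallelism)
    return rounds * macs_per_point * t_mac

t_io = conv_io_time(data_bytes=2_000_000, weight_bytes=600_000,
                    bw_data=1e9, bw_weight=5e8)
t_op = conv_operation_time(output_points=128 * 28 * 28, macs_per_point=9 * 64,
                           pe_parallelism=256, t_mac=1e-9)
print(layer_time(t_io, t_op))
```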
The full connection layer time model includes:
since different samples of a batch in the FC layer are calculated in parallel, one layer of FC operation time of one batch of data is directly analyzed.
The read/write time calculation only takes into account the data interaction between the DRAM and the cache. Let the bandwidth allocated to the input/output data be BW_data_FC and the bandwidth allocated to the weights be BW_weight_FC. The total amount of input and output data is (FC_i·α·RD_DRAM_in_FC + FC_{i+1}·β)·BS, and the total amount of weight data imported is FC_i·FC_{i+1}·γ·RD_DRAM_weight_FC; the read/write time is the larger of the data transfer time and the weight transfer time under these bandwidths (equation (49), shown as an image in the original publication).
the calculation of the calculation time includes: roundFClayerAnd tround_FC
Figure BDA0001863240100000182
Wherein:
Figure BDA0001863240100000183
it will be appreciated that the computation time for all FC layers in a batch need only be summed layer-by-layer.
The hardware parameters of the models are also adjustable, so the running time and energy consumption of the same network under different hardware parameters can be evaluated by the algorithm. By adjusting the various hardware parameters (such as bandwidth, cache size, PE array size and number of channels), an optimal solution under a hardware area constraint can be obtained, minimizing time and energy consumption from the perspective of hardware design. Predicting time and energy consumption layer by layer while analysing the modeling parameters that dominate the time and energy overheads can therefore also help the design of the hardware to a certain extent.
As an optional implementation manner of the present invention, the performing dual-objective optimization on the neural network energy consumption model and the neural network time model includes:
performing cache segmentation on the neural network time model by using a TM calculation process;
carrying out array segmentation on the neural network energy consumption model by using an AP (array-partitioning) calculation process;
and selecting the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition.
The Thinker chip supports two calculation flows, AP and TM. AP stands for Array-Partitioning, and TM stands for Time-Multiplexing. The difference between the two is the order of computation for the same network inference process. In the TM calculation flow, the network inference proceeds layer by layer, with the whole PE array computing the same layer at the same time. In the AP calculation flow, the partitioning of the array is adjusted, so the PE array may compute several layers at the same time. Because the AP calculation flow can compute several types of layers simultaneously, it balances the characteristics of convolution (large computation, small data volume) and fully-connected operation (small computation, large data volume), and can therefore shorten the time. The TM calculation flow, on the other hand, reuses data more often and therefore re-imports data fewer times, so its energy consumption is smaller.
As an optional implementation manner of the present invention, the selecting of the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition includes:
setting a task time Tmax;
calculating the time T_TM required when the neural network time model is partitioned using the TM calculation process cache segmentation;
calculating the energy consumption E_AP required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation;
selecting the minimum required energy consumption E_AP0 and calculating the time T_AP0 required to reach the minimum energy consumption E_AP0;
comparing Tmax and T_AP0, and judging whether the set task time meets the requirement according to the comparison result;
if the set task time meets the requirement, comparing Tmax with T_TM, and outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result.
As an optional implementation manner of the present invention, the judging whether the set task time meets the requirement according to the comparison result includes:
if Tmax > T_AP0, judging that the set task time meets the requirement;
otherwise, judging that the task cannot be completed within the specified time, and resetting the task time.
As an optional implementation manner of the present invention, the outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result includes:
if Tmax < T_TM, adjusting the modeling parameters and carrying out array segmentation on the neural network energy consumption model again;
calculating the energy consumption E_AP' required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation, and selecting the modeling parameters and array segmentation method corresponding to the minimum energy consumption E_AP0' as the dual-objective optimization result;
otherwise, selecting the modeling parameters and the cache segmentation method corresponding to the TM calculation process as the dual-objective optimization result (a sketch of this selection logic is given below).
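The selection procedure just described can be summarized in a short sketch. It assumes that the candidate cache segmentation (TM flow) and the candidate array segmentations (AP flow) have already been evaluated into times and energies; the data structures and example values are illustrative and not prescribed by the patent.

```python
# Sketch of the dual-objective selection described above, assuming the TM cache
# segmentation and the AP array segmentations have already been evaluated into
# (time, energy) pairs. All data structures and values here are illustrative.

from typing import List, Optional, Tuple

def select_configuration(
    t_max: float,                                # task time budget Tmax
    t_tm: float,                                 # T_TM: time of the TM cache segmentation
    ap_candidates: List[Tuple[float, float]],    # (T_AP, E_AP) for each array segmentation
) -> Optional[str]:
    # Pick the array segmentation with minimum energy E_AP0 and note its time T_AP0.
    t_ap0, e_ap0 = min(ap_candidates, key=lambda te: te[1])

    if t_max <= t_ap0:
        # Even the most energy-efficient AP segmentation misses the deadline:
        # the task time has to be reset.
        return None

    if t_max < t_tm:
        # The TM flow is too slow: keep the AP flow; in the full method the
        # modeling parameters would be adjusted and the AP search repeated.
        return f"AP segmentation, energy {e_ap0}, time {t_ap0}"

    # Otherwise the TM cache segmentation meets the deadline and is preferred,
    # since the TM flow re-imports less data and therefore uses less energy.
    return f"TM segmentation, time {t_tm}"

print(select_configuration(t_max=1.0, t_tm=1.2,
                           ap_candidates=[(0.8, 500.0), (0.6, 650.0)]))
```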
In actual execution, in order to execute different layers at the same time, not only the number of PEs assigned to each task but also the allocation of the cache has to be adjusted. Since the total buffer size and the array size of the Thinker chip are given, and only vertical partitioning of the array is reasonable, the parameters that need to be searched include the number of array columns allocated to CONV (given as an image in the original publication), the data buffer size S_CONVdatabuffer and the weight buffer size S_convweightbuffer.
Modeling shows that the time overhead of network computation has a 'hiding effect': the time of a layer can be split into the computation time and the data transmission time, and the total time of the layer is the maximum of the two, while the data transmission time itself is the larger of the input/output transmission time and the weight transmission time. To study the time overhead further, the dominant factor of the time consumption of each layer has to be identified. Taking a typical LRCN neural network as an example, for FC operation the weight read/write time dominates in both the TM and the AP0 calculation flows, whereas for CONV operation the computation time basically dominates; the fully-connected layer has a small computation amount but a large data amount.
In the TM calculation flow, the data read/write time, the weight read/write time and the computation time differ greatly. The computation time of the convolutional layers far exceeds the data and weight read/write times, so while the PE array is busy computing for long stretches the data transfer is idle; during FC calculation, in contrast, the PE array is idle because of the long time spent on weight reading and writing. In the layer-by-layer time overhead of AP0, the data read/write time, the weight read/write time and the computation time are well balanced, so the total time can be reduced effectively even though the convolution-operation time, which was originally the smaller component, increases. The time overhead of the FC layer is dominated by weight reading and writing; to shorten the FC time, the proportion of the buffer allocated to the FC weights has to be increased, or the total time has to be reduced by increasing the total buffer size or the data transmission bandwidth.
Hardware parameters of the neural network energy consumption model and the neural network time model are also variable, so that the operation time and energy consumption of the same network under different hardware parameters can be tested by adjusting an algorithm, and certain help can be brought to the design of hardware. By adjusting various hardware parameters (such as bandwidth, cache size, PE array size, and number of channels), it is possible to obtain an optimal solution under the hardware area constraint, minimizing time and power consumption from a hardware design perspective.
Therefore, cache segmentation is carried out on the neural network time model using the TM calculation process, array segmentation is carried out on the neural network energy consumption model using the AP (array-partitioning) calculation process, and the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition is selected; this both helps future hardware design and allows different optimal solutions to be obtained under different task-time conditions.
In this embodiment, time and energy consumption modeling is performed on the neural network from the perspective of the hardware computing process of the network, time and energy consumption are predicted layer by layer, the modeling parameters that dominate the time and energy overheads are analysed, and a dual-objective optimization of time and energy consumption is performed on the neural network by adjusting the modeling parameters, the array segmentation method and the cache segmentation method, thereby improving the neural network model. Furthermore, with this layer-by-layer modeling, different optimal solutions can be obtained under different task-time conditions.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.
It should be noted that the present invention is not limited to the above-mentioned preferred embodiments, and those skilled in the art can obtain other products in various forms without departing from the spirit of the present invention, but any changes in shape or structure can be made within the scope of the present invention with the same or similar technical solutions as those of the present invention.

Claims (8)

1. A neural network optimization method, comprising:
presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters;
constructing a neural network energy consumption model based on the modeling parameters;
constructing a neural network time model based on the modeling parameters;
performing dual-objective optimization on the neural network energy consumption model and the neural network time model, including:
performing cache segmentation on the neural network time model by using a TM calculation process;
carrying out array segmentation on the neural network energy consumption model by using an AP (array-partitioning) calculation process;
the selecting of the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition comprising: setting a task time Tmax;
calculating the time T_TM required when the neural network time model is partitioned using the TM calculation process cache segmentation;
calculating the energy consumption E_AP required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation;
selecting the minimum required energy consumption E_AP0 and calculating the time T_AP0 required to reach the minimum energy consumption E_AP0;
comparing Tmax and T_AP0, and judging whether the set task time meets the requirement according to the comparison result;
if the set task time meets the requirement, comparing Tmax with T_TM, and outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result.
2. The neural network optimization method according to claim 1, wherein the energy consumption calculation formula is E = V × T × e, where V is the data volume to be read/written/calculated, T is the number of times the data needs to be read/written/calculated repeatedly, and e is the unit energy consumption.
3. The neural network optimization method of claim 2, wherein the unit energy consumption comprises read-write energy consumption and computational energy consumption.
4. The neural network optimization method of claim 1, wherein the constructing the neural network energy consumption model comprises: and modeling the energy consumption of the neural network layer by layer to obtain a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model.
5. The neural network optimization method of claim 1, wherein the neural network time model comprises a time overhead, the time overhead comprising a convolutional layer time overhead and a fully-connected layer time overhead.
6. The neural network optimization method of claim 5, wherein the time overhead is calculated as Tz = max(T_IO, T_operation), where T_IO is the read/write time and T_operation is the computation time.
7. The neural network optimization method of claim 1, wherein the judging whether the set task time meets the requirement according to the comparison result comprises:
if Tmax > T_AP0, judging that the set task time meets the requirement;
otherwise, judging that the task cannot be completed within the specified time, and resetting the task time.
8. The neural network optimization method of claim 1, wherein the outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result comprises:
if Tmax < T_TM, adjusting the modeling parameters and carrying out array segmentation on the neural network energy consumption model again;
calculating the energy consumption E_AP' required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation, and selecting the modeling parameters and array segmentation method corresponding to the minimum energy consumption E_AP0' as the dual-objective optimization result;
otherwise, selecting the modeling parameters and the cache segmentation method corresponding to the TM calculation process as the dual-objective optimization result.
CN201811344189.0A 2018-11-13 2018-11-13 Neural network optimization method Withdrawn - After Issue CN109472361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811344189.0A CN109472361B (en) 2018-11-13 2018-11-13 Neural network optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811344189.0A CN109472361B (en) 2018-11-13 2018-11-13 Neural network optimization method

Publications (2)

Publication Number Publication Date
CN109472361A CN109472361A (en) 2019-03-15
CN109472361B true CN109472361B (en) 2020-08-28

Family

ID=65671820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811344189.0A Withdrawn - After Issue CN109472361B (en) 2018-11-13 2018-11-13 Neural network optimization method

Country Status (1)

Country Link
CN (1) CN109472361B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738318B (en) * 2019-09-11 2023-05-26 北京百度网讯科技有限公司 Network structure operation time evaluation and evaluation model generation method, system and device
CN110929860B (en) * 2019-11-07 2020-10-23 深圳云天励飞技术有限公司 Convolution acceleration operation method and device, storage medium and terminal equipment
CN111753950B (en) * 2020-01-19 2024-02-27 杭州海康威视数字技术股份有限公司 Forward time consumption determination method, device and equipment
CN112085195B (en) * 2020-09-04 2022-09-23 西北工业大学 X-ADMM-based deep learning model environment self-adaption method
CN112468533B (en) * 2020-10-20 2023-01-10 安徽网萌科技发展股份有限公司 Agricultural product planting-oriented edge learning model online segmentation method and system
CN113377546B (en) * 2021-07-12 2022-02-01 中科弘云科技(北京)有限公司 Communication avoidance method, apparatus, electronic device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102773981A (en) * 2012-07-16 2012-11-14 南京航空航天大学 Implementation method of energy-saving and optimizing system of injection molding machine
CN105302973A (en) * 2015-11-06 2016-02-03 重庆科技学院 MOEA/D algorithm based aluminum electrolysis production optimization method
CN106427589A (en) * 2016-10-17 2017-02-22 江苏大学 Electric car driving range estimation method based on prediction of working condition and fuzzy energy consumption

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10621486B2 (en) * 2016-08-12 2020-04-14 Beijing Deephi Intelligent Technology Co., Ltd. Method for optimizing an artificial neural network (ANN)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102773981A (en) * 2012-07-16 2012-11-14 南京航空航天大学 Implementation method of energy-saving and optimizing system of injection molding machine
CN105302973A (en) * 2015-11-06 2016-02-03 重庆科技学院 MOEA/D algorithm based aluminum electrolysis production optimization method
CN106427589A (en) * 2016-10-17 2017-02-22 江苏大学 Electric car driving range estimation method based on prediction of working condition and fuzzy energy consumption

Also Published As

Publication number Publication date
CN109472361A (en) 2019-03-15

Similar Documents

Publication Publication Date Title
CN109472361B (en) Neural network optimization method
CN108564168B (en) Design method for neural network processor supporting multi-precision convolution
US20220051087A1 (en) Neural Network Architecture Using Convolution Engine Filter Weight Buffers
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
US11775430B1 (en) Memory access for multiple circuit components
US11500959B2 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
JP2020521195A (en) Scheduling neural network processing
Zhou et al. Transpim: A memory-based acceleration via software-hardware co-design for transformer
CN110738316B (en) Operation method and device based on neural network and electronic equipment
Mittal A survey of accelerator architectures for 3D convolution neural networks
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
KR20180123846A (en) Logical-3d array reconfigurable accelerator for convolutional neural networks
CN113361695B (en) Convolutional neural network accelerator
CN111105023A (en) Data stream reconstruction method and reconfigurable data stream processor
TWI775210B (en) Data dividing method and processor for convolution operation
CN112380793A (en) Turbulence combustion numerical simulation parallel acceleration implementation method based on GPU
Yan et al. FPGAN: an FPGA accelerator for graph attention networks with software and hardware co-optimization
WO2019182059A1 (en) Model generation device, model generation method, and program
CN115668222A (en) Data processing method and device of neural network
US20230025068A1 (en) Hybrid machine learning architecture with neural processing unit and compute-in-memory processing elements
CN113031853A (en) Interconnection device, operation method of interconnection device, and artificial intelligence accelerator system
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
KR20240036594A (en) Subsum management and reconfigurable systolic flow architectures for in-memory computation
US11954580B2 (en) Spatial tiling of compute arrays with shared control
CN113392959A (en) Method for reconstructing architecture in computing system and computing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
AV01 Patent right actively abandoned

Granted publication date: 20200828
Effective date of abandoning: 20210125