CN109472361B - Neural network optimization method - Google Patents
- Publication number: CN109472361B (application CN201811344189.0A)
- Authority: CN (China)
- Prior art keywords: energy consumption, neural network, time, model, layer
- Legal status: Withdrawn after issue (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/02 Neural networks; G06N3/08 Learning methods
- G06F1/3203 Power management; G06F1/3234 Power saving characterised by the action undertaken
Abstract
The application relates to a neural network optimization method, which comprises the following steps: presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters; constructing a neural network energy consumption model based on the modeling parameters; constructing a neural network time model based on the modeling parameters; and performing dual-objective optimization on the neural network energy consumption model and the neural network time model. The method models the time and energy consumption of the neural network from the perspective of the network's hardware computing process, predicts time and energy consumption layer by layer, analyzes which modeling parameters dominate the time and energy overheads, and improves the neural network model by adjusting the modeling parameters, the array segmentation method and the cache segmentation method, thereby performing dual-objective optimization of the neural network's time and energy consumption.
Description
Technical Field
The application relates to the technical field of artificial neural networks, in particular to a neural network optimization method.
Background
With the rise of neural network technology, neural network hardware suited to different application scenarios has emerged. Neural networks have strong reasoning and prediction capability but a large amount of computation, so increasing the computing speed of neural networks while reducing their energy consumption has become a key issue.
In the related art, both the training process and the inference process of a neural network urgently need accelerated computation. Network training is mostly completed in the cloud using GPUs, and the parallelization and communication methods of different hardware greatly influence the speed of neural network training, so computing speed is improved mainly by increasing parallel computing capability and reducing communication overhead. The energy consumption of a neural network can be divided into computation energy consumption and memory-access energy consumption; different data-reuse schemes, such as output-stationary and weight-stationary dataflows, can effectively reduce the memory-access energy consumption.
However, the above research on neural network computation focuses on either low energy consumption or accelerated computation alone, and does not consider that acceleration and energy awareness may conflict in a complex application environment: reducing energy consumption may sacrifice speed, and reducing the computation time of the neural network may incur more energy consumption.
Disclosure of Invention
To overcome, at least to a certain extent, the problem that research on neural network computation in the related art focuses on either low energy consumption or accelerated computation alone, ignoring that acceleration and energy awareness may conflict in a complex application environment (reducing energy consumption may sacrifice speed, and reducing the computation time may incur more energy consumption), the application provides a neural network optimization method comprising the following steps:
Presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters;
constructing a neural network energy consumption model based on the modeling parameters;
constructing a neural network time model based on the modeling parameters;
and performing dual-objective optimization on the neural network energy consumption model and the neural network time model.
Furthermore, the energy consumption calculation formula is E = V × T × e, where V is the data volume to be read/written/computed, T is the number of repeated read/write/compute operations, and e is the unit energy consumption.
Further, the unit energy consumption includes read-write energy consumption and calculation energy consumption.
Further, the building of the neural network energy consumption model includes: and modeling the energy consumption of the neural network layer by layer to obtain a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model.
Further, the neural network time model includes a time overhead, the time overhead including a convolutional-layer time overhead and a fully-connected-layer time overhead.
Further, the time overhead is calculated as Tz = max(T_IO, T_operation), where T_IO is the read/write time and T_operation is the computation time.
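The two overhead formulas above can be sketched as small helper functions; the function names and the numeric values below are illustrative, not taken from the patent:

```python
def energy(volume, repeats, unit_energy):
    """E = V * T * e: data volume, repeat count, and unit energy consumption."""
    return volume * repeats * unit_energy

def time_overhead(t_io, t_operation):
    """Tz = max(T_IO, T_operation): read/write and computation overlap,
    so the longer of the two dominates the layer's time overhead."""
    return max(t_io, t_operation)

# Illustrative: 1024 bytes read 3 times at 6 energy units per byte
total_e = energy(1024, 3, 6)   # 18432 energy units
tz = time_overhead(2.5, 4.0)   # 4.0: computation-bound
```
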
Further, the performing of the dual-objective optimization on the neural network energy consumption model and the neural network time model includes:
performing cache segmentation on the neural network time model by using a TM calculation process;
carrying out array segmentation on the neural network energy consumption model by using an AP calculation flow;
and selecting the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition.
Further, the method for selecting the array partition with the smallest corresponding energy consumption when the cache partition meets the requirement condition includes:
setting a task time Tmax;
calculating the time T_TM required when the neural network time model is cache-segmented using the TM calculation flow;
calculating the energy consumption E_AP required when the neural network energy consumption model is array-segmented using the AP calculation flow;
selecting the minimum required energy consumption E_AP0 and calculating the time T_AP0 required for the minimum required energy consumption E_AP0;
comparing Tmax with T_AP0, and judging whether the set task time meets the requirement according to the comparison result;
if the set task time meets the requirement, comparing Tmax with T_TM, and outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result.
Further, the determining whether the set task time meets the requirement according to the comparison result includes:
if Tmax > T_AP0, it is judged that the set task time meets the requirement;
otherwise, judging that the task can not be completed within the specified time, and resetting the task time.
Further, the outputting a result of performing dual-objective optimization on the neural network energy consumption model and the neural network time model according to the comparison result includes:
if Tmax < T_TM, adjusting the modeling parameters and carrying out array segmentation on the neural network energy consumption model again;
calculating the energy consumption E_AP' required when the neural network energy consumption model is array-segmented using the AP calculation flow, and taking the modeling parameters and array segmentation method corresponding to the minimum energy consumption E_AP0' as the dual-objective optimization result;
otherwise, selecting the modeling parameters and the cache segmentation method corresponding to the TM calculation process as a dual-target optimization result.
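The selection flow described above can be sketched as follows. The function name, its arguments, and the (energy, time) candidate list are hypothetical, and one step is simplified: where the patent re-adjusts the modeling parameters and re-segments the array when Tmax < T_TM, this sketch simply keeps the minimum-energy AP result:

```python
def select_partition(t_max, t_tm, ap_candidates):
    """ap_candidates: (energy, time) pairs, one per candidate AP array segmentation.
    Returns ('AP', E_AP0) or ('TM', None) following the comparison rules above."""
    # Pick the array segmentation with minimum energy and note its time T_AP0
    e_ap0, t_ap0 = min(ap_candidates, key=lambda et: et[0])
    if t_max <= t_ap0:
        # Even the minimum-energy segmentation misses the deadline
        raise ValueError("task cannot be completed in Tmax; reset the task time")
    if t_max < t_tm:
        # The TM cache segmentation misses the deadline: keep the AP result
        return ("AP", e_ap0)
    return ("TM", None)  # the TM cache segmentation meets the deadline
```
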
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the neural network optimization method comprises the following steps: presetting modeling parameters, and constructing a neural network energy consumption model and a neural network time model based on the modeling parameters; and performing dual-objective optimization on the two models. The method models the time and energy consumption of the neural network from the perspective of the network's hardware computing process, predicts time and energy consumption layer by layer, analyzes which modeling parameters dominate the time and energy overheads, and improves the neural network model by adjusting the modeling parameters, the array segmentation method and the cache segmentation method, thereby performing dual-objective optimization of the neural network's time and energy consumption.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart of a neural network optimization method according to an embodiment of the present application.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
Fig. 1 is a flowchart of a neural network optimization method according to an embodiment of the present application.
As shown in fig. 1, the neural network optimization method of the present embodiment includes:
S1: presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters;
S2: constructing a neural network energy consumption model based on the modeling parameters;
S3: constructing a neural network time model based on the modeling parameters;
S4: performing dual-objective optimization on the neural network energy consumption model and the neural network time model.
In the neural network structure there are three different kinds of layers, the convolutional layer (CONV), the fully-connected layer (FC) and the pooling layer (POOL), corresponding respectively to the three tasks of feature extraction, feature connection and feature compression. Different network structures differ in the number and type of layers, and different layers have different characteristics. For example, the convolutional layer is computation-heavy while the fully-connected layer is data-heavy, and this imbalance of resource requirements needs to be addressed by a dedicated hardware architecture design.
Some Neural network architectures also include an RNN (Recurrent Neural Networks) layer or a CNN (Convolutional Neural Networks) layer to accomplish specific tasks.
The hardware is, for example, a Thinker chip, and the Thinker chip is composed of a PE array, an on-chip storage system, a finite state controller, an IO (input/output) and a decoder.
The modeling parameters include network parameters and hardware parameters, which are shown in table 1.
TABLE 1 modeling parameter Table
By constructing a neural network energy consumption model and a neural network time model, performing dual-objective optimization on energy consumption and time simultaneously, and adjusting the modeling parameters, the neural network algorithm with the lowest energy consumption under a given task-time condition is realized.
As an optional implementation manner of the present invention, the energy consumption calculation formula is E = V × T × e, where V is the data volume to be read/written/computed, T is the number of repeated read/write/compute operations, and e is the unit energy consumption.
The data volume V to be read/written/computed and the number of repetitions T can be calculated from the network parameters and hardware parameters; for the unit energy consumption e, the access energy consumption differs between storage levels, so the analysis needs to be performed level by level.
The memory access energy consumption of different memory hierarchy is shown in table 2.
TABLE 2 memory access normalized energy consumption for different memory levels
If data is transmitted between two storage levels, the higher of the two access energies is taken as the energy consumption of the transfer. For example, when data is transferred from the DRAM to the on-chip SRAM cache, the energy consumption is taken to be 200 energy units.
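The rule of charging a transfer at the costlier of its two endpoints can be sketched as follows, using the normalized per-byte access energies quoted in the text (200 for DRAM, 6 for the on-chip buffer, 2 for registers and PE transfers); the dictionary and function names are illustrative:

```python
# Normalized per-byte access energies from the text (illustrative units)
ACCESS_ENERGY = {"DRAM": 200, "buffer": 6, "register": 2, "PE": 2}

def transfer_energy_per_byte(src, dst):
    """A transfer between two storage levels is charged at the higher cost."""
    return max(ACCESS_ENERGY[src], ACCESS_ENERGY[dst])
```

For example, a DRAM-to-buffer transfer costs 200 units per byte, matching the e_DRAM_buffer value used throughout the layer models, while a buffer-to-PE transfer costs 6 units, matching e_buffer_PE.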
As an optional implementation manner of the present invention, the unit energy consumption includes read-write energy consumption and computational energy consumption.
As an optional implementation manner of the present invention, the constructing the neural network energy consumption model includes: and modeling the energy consumption of the neural network layer by layer to obtain a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model.
As an optional implementation manner of the present invention, the method for calculating the unit energy consumption E0 includes:
respectively constructing a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model;
and respectively calculating the unit read/write/computation energy consumption E0 corresponding to the convolutional-layer, fully-connected-layer, RNN-layer and pooling-layer energy consumption models.
The energy consumption in the convolutional layer energy consumption model is calculated as follows:
For one complete sample, the energy consumption of computing one convolutional layer is:
E1 = E1_IO + E1_operation (1)
where E1_IO is the convolutional-layer read/write energy consumption and E1_operation is the computation energy consumption.
The read/write energy consumption is E1_IO = E1_weightIO + E1_inputIO + E1_outputIO (2)
where E1_weightIO is the weight energy consumption, E1_inputIO the input energy consumption and E1_outputIO the output energy consumption.
The weight energy consumption E1_weightIO, the input energy consumption E1_inputIO and the output energy consumption E1_outputIO are calculated respectively as follows:
(1) Weight energy consumption E1_weightIO:
E1_weightIO = E1_DRAM_buffer + E1_buffer_PE (3)
where E1_DRAM_buffer is the energy consumed reading from the DRAM into the cache and E1_buffer_PE is the energy consumed from the cache to the PE array.
N_w is the total amount of weight data of one convolutional layer, and RD_DRAM_CONV is the number of rounds in which the DRAM writes the weights into the cache.
The total weight data of one convolutional layer is N_w = γ·Chout_i·Ki²·Chin_i (5). The energy consumption e_DRAM_buffer for transferring 1 byte of data between the DRAM and the cache is, for example, 200 energy units.
The number of times the weights are repeatedly imported is related to the size of the on-chip cache. If the on-chip cache can hold all the weight data of the layer, each datum only needs to be imported once, and subsequent accesses only involve interaction between the cache and the PE array; otherwise, when a PE needs to reuse weight data that is no longer resident in the cache, the data must be imported repeatedly. Each round of the PE array yields a fixed number of points of the two-dimensional output map, so the input data must be imported into the PE array over several rounds; with insufficient cache capacity, the same weights are thus imported once per round.
The number of repeated read/write operations follows accordingly. The energy consumption e_buffer_PE for the cache to transfer 1 byte of data to the PE array is, for example, 6 energy units.
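Equation (5) and the per-round weight import just described can be sketched as follows. The second function assumes the insufficient-cache case in which the full weight set is re-imported every round; that reading of the text, and both function names, are assumptions for illustration:

```python
def conv_weight_bytes(gamma, ch_out, k, ch_in):
    """N_w = gamma * Chout_i * Ki^2 * Chin_i, eq. (5); gamma is bytes per weight."""
    return gamma * ch_out * (k ** 2) * ch_in

def weight_import_energy(n_w, rounds, e_dram_buffer=200):
    """DRAM-to-cache weight energy, assuming each of the RD_DRAM_CONV rounds
    re-imports the full weight set N_w at e_DRAM_buffer units per byte."""
    return n_w * rounds * e_dram_buffer

# Illustrative: 3x3 kernels, 32 input and 64 output channels, 2-byte weights
nw = conv_weight_bytes(gamma=2, ch_out=64, k=3, ch_in=32)  # 36864 bytes
```
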
(2) Input energy consumption E1_inputIO:
Input data must pass from the DRAM through the cache and the registers to reach the PE array, and must also be transferred within the PE array. Thus E1_inputIO = E1_DRAM_buffer + E1_buffer_register + E1_register_PE + E1_PE_tran (8).
Here E1_DRAM_buffer is the energy consumption from the DRAM to the cache, E1_buffer_register from the cache to the registers, E1_register_PE from the registers to the PE array, and E1_PE_tran the energy consumed passing rightward the data input at the leftmost PE of the array.
It should be noted that, due to padding, the total amount of input may differ from the number of points of the input feature map. With valid padding no filling occurs and the original input size is unchanged. With same padding the original input size changes even though the input and output map sizes match, and the changed height and width are used in the subsequent calculations.
Here Hin_i·Win_i·Chin_i·α is the total input data volume of one convolutional layer and S_CONVdatabuffer is the data-buffer size. Because the Thinker chip preferentially reuses all the input points, the input points do not need to be repeatedly imported into the buffer.
A register file is added between the cache and the PE array to avoid repeated import of the input data. The number of input rows required for the parallel operation of the PE rows is determined accordingly, S_i is the horizontal stride, and the vertical step length is Hui + S_i - Ki. During horizontal sliding the data is read only once, so the repeat count is determined by the number of vertical slides; the energy consumption e_buffer_register for a cache-to-register transfer is 6 energy units per byte.
For the number of repeated imports of the input data, Chin_i·Ki²·Hout_i·Wout_i is the total amount of input data, i.e. the product of the number of output points and the number of input points corresponding to each output point; the energy consumption e_register_PE for a register-to-PE-array transfer is 2 energy units per byte.
Owing to the multiplexing of the input data, each datum input at the leftmost PE of the array is passed to the right, so each input point is accessed repeatedly as it traverses the row; the energy consumption e_PE_tran for reading one byte of data is, for example, 2 energy units.
(3) Output energy consumption E1_outputIO:
The total output data volume is β·Chout_i·Hout_i·Wout_i. The output must be transferred between the PEs and then into the cache, and the data that cannot be held in the cache is then exported to the DRAM to await the next calculation.
Thus E1_outputIO = E1_PEout_tran + E1_PE_buffer + E1_buffer_DRAM (13).
The output points generated by each PE's calculation must first be passed to the leftmost PE. Considering one row of PEs, several output points are generated in one calculation cycle, and all of them must be transmitted to the leftmost PE; averaging the total number of resulting accesses over the output points gives the number of accesses required per output point.
The total output data volume is β·Chout_i·Hout_i·Wout_i, and the per-byte transmission energy e_PEout_tran is, for example, 2 energy units.
In one layer of CONV calculation, Chout_i·Hout_i·Wout_i output points must be produced in total, and each point requires Ki²·Chin_i multiply-add operations.
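The computation-energy count in the last sentence (output points times multiply-adds per point) can be sketched as follows; the function name and the default unit energy of 1 are illustrative:

```python
def conv_op_energy(h_out, w_out, ch_out, k, ch_in, e_op=1):
    """E1_operation: Hout*Wout*Chout output points, each requiring
    Ki^2 * Chin multiply-add operations at e_op energy per operation."""
    n_points = h_out * w_out * ch_out
    mads_per_point = (k ** 2) * ch_in
    return n_points * mads_per_point * e_op
```
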
In some embodiments, for a given sample, the energy consumed to complete all convolutional layer operations is the sum of the energy consumed by all convolutional layers.
In some embodiments, each sample in the AP calculation flow and the TM calculation flow is computed serially, so the energy consumption for multiple samples is simply the sum of the energy consumptions of the individual samples.
The energy consumption in the fully-connected layer energy consumption model is calculated as follows:
E2 = E2_IO + E2_operation (18)
The fully-connected layers are computed in the Thinker chip multi-sample, layer by layer. The read/write energy consumption of a batch FC operation is calculated by the following formula:
E2_IO = E2_weightIO + E2_inputIO + E2_outputIO (19)
(1) The calculation method of E2_weightIO:
The weight read/write energy consumption of the FC layer is divided into two parts:
E2_weightIO = E2_DRAM_buffer + E2_buffer_PE (20)
In the same way as for the convolutional layer, the following can be obtained:
where BS is the batch size.
(2) The calculation method of E2_inputIO:
Unlike the CONV layer, the input of the FC layer does not need to pass through the registers designed for convolutional input reuse, so E2_inputIO consists of three parts:
E2_inputIO = E2_DRAM_buffer + E2_buffer_PE + E2_PE_tran (24)
where e_PEinput_tran is the unit transmission energy consumption, for example 2 energy units.
(3) The calculation method of E2_outputIO:
E2_outputIO = E2_PEout_tran + E2_PE_buffer + E2_buffer_DRAM (29)
where:
E2_PE_buffer = FC_(i+1)·BS·β·e_PE_buffer (31)
E2_buffer_DRAM = ReLU(FC_(i+1)·BS - S_FCdatabuffer)·β·e_buffer_DRAM (32)
where e_PEout_tran, e_PE_buffer and e_buffer_DRAM are the unit transmission energies at each storage level, respectively 2, 6 and 200 energy units.
When a plurality of FC layers exist in the network, the total energy consumption of FC calculation in a batch is the sum of the read-write energy consumption of all the FC layers.
The calculation method of the computation energy consumption E2_operation:
In the calculation process there are FC_(i+1)·BS output points in total, and each output point requires FC_i multiply-add operations. If the energy consumption required for a unit-byte multiply-add operation is e_operation, then the energy consumption of this part is: E2_operation = FC_(i+1)·BS·FC_i·α·γ·e_operation (33).
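Equation (33) can be sketched directly; α and γ (the data-width factors) and e_operation default to 1 here purely for illustration:

```python
def fc_op_energy(fc_in, fc_out, batch, alpha=1, gamma=1, e_op=1):
    """E2_operation = FC_(i+1) * BS * FC_i * alpha * gamma * e_operation, eq. (33):
    fc_out output points per sample, each needing fc_in multiply-adds."""
    return fc_out * batch * fc_in * alpha * gamma * e_op
```
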
The energy consumption in the RNN layer energy consumption model is calculated as follows:
when a plurality of RNN layers exist in the network, the total energy consumption calculated by the RNN in a batch is the sum of the calculated energy consumption of each layer.
The most commonly used RNN is the sequence model with an LSTM structure, so the following RNN energy consumption analysis mainly targets the LSTM structure.
In RNN, the calculation flow of an LSTM unit comprises the following steps:
f_t = σ(w_xf·x_t + w_hf·h_(t-1) + b_f)
i_t = σ(w_xi·x_t + w_hi·h_(t-1) + b_i)
o_t = σ(w_xo·x_t + w_ho·h_(t-1) + b_o)
The Thinker chip uses two kinds of PE, ordinary PEs and super PEs, to compute the RNN layer. The W_x_gate·x_t_gate + W_h_gate·h_(t-1)_gate + b_gate part of the calculation follows the computation principle of the FC layer and is performed in the ordinary PEs. After that calculation finishes, the data is imported into the super PEs to compute the RNN gating: sigmoid or tanh functions are computed to obtain the gate function vectors, and c_t and h_t are obtained by multiplication and addition.
The energy consumption of a single RNN layer over one batch is calculated first. Considering that the RNN is divided into an FC part and an RNN-gating part, and denoting the number of RNN iterations in one batch by Iteration, the expression of the RNN-layer energy consumption is:
(1) the calculation of the RNN-FC energy consumption comprises the following steps:
The energy consumption calculation of this part is basically the same as that of the FC layer. It differs in that the dimensions of the weight matrix, the input vectors and the output vectors need to be adjusted. For the RNN, the unified form of the FC-layer computation is:
FC_t = W_x_gate·x_t_gate + W_h_gate·h_(t-1)_gate + b_gate (35)
where gate can be i/f/o/g, corresponding respectively to the four gates of the LSTM. In a specific implementation, W_x_gate and W_h_gate are concatenated horizontally into one W_gate, and x_t_gate and h_(t-1)_gate are concatenated vertically into one x_gate. Thus W_gate has dimension Olen_i × (Ilen_i + Olen_i) and x_gate has dimension 1 × (Ilen_i + Olen_i), corresponding to the FC-layer parameters FC_i and FC_(i+1), namely:
FC_i = Ilen_i + Olen_i
FC_(i+1) = Olen_i
Therefore, this part can be analyzed with the FC-layer energy consumption model. Note that the FC operation must be repeated four times for each LSTM cell, corresponding to the number of gates. This is not described in detail again here.
(2) The calculation of the gating memory access energy consumption comprises the following steps:
As described above, the gating calculation mainly involves two kinds of computation: (a) simple tanh/sigmoid function evaluation and (b) element-wise multiplication. Thus the gating memory-access energy consumption can be further expressed as:
tanh/sigmoid is mainly used to compute the 4 gate functions and c_t, i.e. 5 groups of vectors of length Olen_i undergo element-wise tanh/sigmoid operations, for a total of BS·5·Olen_i operations. The calculation is performed in the PEs, and the data must be imported from the DRAM into the buffer, imported from the buffer into the PEs, and finally exported. Suppose the energy consumption of a unit-byte DRAM-to-buffer transfer is e_DRAM_buffer, that of a unit-byte buffer-to-PE transfer is e_buffer_PE, and those of unit-byte PE-to-buffer and buffer-to-DRAM transfers are e_PE_buffer and e_buffer_DRAM. Each datum undergoes this operation at most once, and the interaction between the DRAM and the buffer occurs only when the buffer cannot hold all the input points, so the following equation can be summarized:
The element-wise multiplication mainly comprises the calculation of c_t and the calculation of h_t. For h_t, only one element-wise multiplication is needed, so the data read involves a total of 2·Olen_i input data and the data written out involves Olen_i data, in the same way as the tanh/sigmoid calculation above:
For c_t, an accumulation is required in addition to the multiplication. For one output datum, two multiplications are needed and the two products are then summed, so one product must be read out into the buffer and then imported into a register in the PE to complete the summation. Similarly, the energy consumption of this part can be obtained as:
computing the energy consumption of Gating
For gating, three operations that may occur are: tanh/sigmoid, multiplication, addition. The unit byte operation energy consumption corresponding to the three operations is assumed as follows: e.g. of the typetanh/sigmoid、emultiply、eaddFor example, all values are 1, and the times that three operations need to be executed can be calculated as: 5Olen_i、3Olen_i、Olen_i. For this purpose, it is possible to calculateComprises the following steps:
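The gating operation counts above (5·Olen_i activations, 3·Olen_i multiplications, Olen_i additions) can be sketched per sample as follows; the function name and the default unit energies of 1 are illustrative:

```python
def gating_op_energy(o_len, e_act=1, e_mul=1, e_add=1):
    """LSTM gating compute energy per sample: 5*Olen tanh/sigmoid evaluations
    (4 gates plus c_t), 3*Olen element-wise multiplies, Olen additions."""
    return 5 * o_len * e_act + 3 * o_len * e_mul + o_len * e_add
```
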
the energy consumption in the pooling layer energy consumption model is calculated as follows:
For the pooling operation, the energy consumption of a single layer and a single sample can be considered first, since no data is multiplexed between different layers or different samples. Likewise, the pooling energy consumption is divided into read/write energy consumption and computation energy consumption.
An energy consumption model of the pooling operation is established. If a multi-layer, multi-sample result needs to be computed, only a summation operation needs to be performed.
E4 = E4_IO + E4_operation (41)
Calculation of the pooling-layer read/write energy consumption E4_IO:
Thinker supports max pooling, which reduces the height and width of the output map while keeping the number of channels unchanged. Given the height and width relations between the input and output maps, the total number of pooled blocks is:
X = Hout_i·Wout_i·Chin_i (42)
The data read/write energy consumption of the pooling operation can be divided into input data and output data, and each must complete read/write interactions across the DRAM, the cache and the PE array. Since no data needs to be repeatedly imported, the energy consumption model of this part is simple, namely:
where the two terms are the read-in energy consumption of the input data and the write-out energy consumption of the output data, respectively.
The calculation of the computation energy consumption involves the total number of data items that must be input. It should be noted that the total data input is likewise not equal to the size of the input feature map; it must be inferred backward from the size of the output map and the kernel size, namely: E4_operation = p_i·p_i·Hout_i·Wout_i·Chout_i·e_operation (44).
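Equations (42) and (44) can be sketched as follows; the function names and the default unit energy of 1 are illustrative:

```python
def pool_blocks(h_out, w_out, ch_in):
    """X = Hout_i * Wout_i * Chin_i, eq. (42): total number of pooled blocks."""
    return h_out * w_out * ch_in

def pool_op_energy(p, h_out, w_out, ch_out, e_op=1):
    """E4_operation = p^2 * Hout_i * Wout_i * Chout_i * e_operation, eq. (44):
    each output point reads back a p-by-p pooling window."""
    return p * p * h_out * w_out * ch_out * e_op
```
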
As an optional implementation manner of the present invention, the neural network time model includes a time overhead, and the time overhead includes a convolutional-layer time overhead and a fully-connected-layer time overhead.
As an optional implementation manner of the present invention, the time overhead is calculated as Tz = max(T_IO, T_operation), where T_IO is the read/write time and T_operation is the computation time.
As an optional implementation manner of the present invention, the neural network time model includes a convolutional-layer time model and a fully-connected-layer time model.
In the case of the Thinker chip, the time overhead mainly depends on the computation-heavy convolution operations and the data-heavy fully-connected operations (including the fully-connected operations in the FC layer and in the RNN). Owing to the "blocking" effect of time, the less time-consuming RNN-gating and pooling operations do not require model building. Next, time modeling is analyzed for both the convolutional layer and the fully-connected layer.
As an optional implementation manner of the present invention, the convolutional layer time model and the fully-connected layer time model include time overheads, calculated as the time overhead Tz = max(TIO, Toperation), wherein TIO is the read-write time and Toperation is the calculation time.
The convolutional layer time model includes:
the calculation of the read-write time comprises the following steps:
in the Thinker calculation process, the convolutional layers are calculated one by one, a single layer of a single sample at a time, so the time required for the convolution operation of one layer in one sample is analyzed first.
Data reading and writing must pass through several levels such as the DRAM, the cache and the PE array. Once the data has been imported into the on-chip buffer, its transmission is very fast, so that part can be ignored. Therefore, TIO reduces to the time consumed by the data interaction between the DRAM and the cache.
In the Thinker chip, the input/output data and the weights are imported from the DRAM into two different buffers, whose bandwidths differ and can be adapted to the architecture of the neural network. Let the bandwidth shared by the input and output data be BWdata_conv and the bandwidth of the weight data be BWweight_conv. In one layer of the network, the total amount of input and output data is Hin_i*Win_i*Chin_i*α + Hout_i*Wout_i*Chout_i*β, and the total weight data is Ki*Ki*Chin_i*Chout_i*γ. The read-write time of the input/output data and the read-write time of the weight data can therefore be calculated separately, and the larger of the two is TIO, thus:
TIO = max((Hin_i*Win_i*Chin_i*α + Hout_i*Wout_i*Chout_i*β)/BWdata_conv, Ki*Ki*Chin_i*Chout_i*γ/BWweight_conv) (45)
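The read-write time above can be sketched as follows; the weight-volume expression k*k*ch_in*ch_out*gamma is a reconstruction of the elided formula, and alpha/beta/gamma stand for the per-element data widths of the input, output and weight data assumed by the model:

```python
def conv_layer_io_time(h_in, w_in, ch_in, h_out, w_out, ch_out,
                       k, alpha, beta, gamma,
                       bw_data_conv, bw_weight_conv):
    """Hedged sketch of the convolutional-layer read-write time TIO (45)."""
    # total input/output data transferred between DRAM and cache
    data_total = h_in * w_in * ch_in * alpha + h_out * w_out * ch_out * beta
    # total weight data (reconstructed: kernel^2 x in-channels x out-channels)
    weight_total = k * k * ch_in * ch_out * gamma
    # TIO is the larger of the data and weight transfer times
    return max(data_total / bw_data_conv, weight_total / bw_weight_conv)
```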
The calculation of the computation time Toperation includes:
For one layer of the convolutional network of one sample, the computation requires several rounds of operation by the PE array. In one round of computation, the PE array computes several output points in parallel; after several rounds, all the output points of the layer have been computed. Suppose the number of rounds required to complete one layer of the network computation is roundconvlayer and the time required for each round is troundconv; then the computation time can be expressed as:
Toperation = roundconvlayer*troundconv (46).
since Thinker multiplexes the input data line by line, Hout_i*Wout_i is the total number of output lines that need to be computed. Each round computes a fixed number of lines (determined by the PE array size), so roundconvlayer rounds are needed before the input has been completely computed; for one set of input data, the same weights must therefore be input roundconvlayer times.
Each output point comprises Ki*Ki*Chin_i multiply-add operations, so the time taken to perform one round of computation is the time required for these sequential multiply-add operations. Thus, the time to compute a convolutional layer in a sample is Tconvlayer = max(TIO, Toperation).
the operation time of all the convolution layers in one sample only needs to sum layer by layer time; if the convolution layer operation time is to be calculated for all samples in a batch, it is only necessary to multiply the batch size BS by the total time of one sample.
The full connection layer time model includes:
since different samples of a batch in the FC layer are calculated in parallel, one layer of FC operation time of one batch of data is directly analyzed.
The read-write time calculation only takes into account the data interaction between the DRAM and the cache. Suppose the bandwidth allocated to the input and output data is BWdata_FC and the bandwidth allocated to the weights is BWweight_FC. The total amount of input and output data is (FCi*α*RDDRAM_in_FC + FCi+1*β)*BS, and the total weight input is FCi*FCi+1*γ*RDDRAM_weight_FC. The read-write time is therefore:
TIO = max((FCi*α*RDDRAM_in_FC + FCi+1*β)*BS/BWdata_FC, FCi*FCi+1*γ*RDDRAM_weight_FC/BWweight_FC)
the calculation of the calculation time includes: roundFClayerAnd tround_FC:
Wherein:
it will be appreciated that the computation time for all FC layers in a batch need only be summed layer-by-layer.
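The fully-connected-layer time model above can be sketched as follows. The data and weight volumes follow the text; rounds_fc and tround_fc are passed in directly since their expressions are elided in the source, and all names are illustrative:

```python
def fc_layer_time(fc_in, fc_out, bs, alpha, beta, gamma,
                  rd_dram_in, rd_dram_weight,
                  bw_data_fc, bw_weight_fc,
                  rounds_fc, tround_fc):
    """Hedged sketch of the one-layer, one-batch FC time model."""
    # total input/output data volume: (FCi*a*RD_in + FCi+1*b) * batch size
    data_total = (fc_in * alpha * rd_dram_in + fc_out * beta) * bs
    # total weight volume: FCi * FCi+1 * g * RD_weight
    weight_total = fc_in * fc_out * gamma * rd_dram_weight
    # read-write time: larger of the data and weight transfer times
    t_io = max(data_total / bw_data_fc, weight_total / bw_weight_fc)
    # computation time: rounds times per-round time
    t_operation = rounds_fc * tround_fc
    # the "hiding" effect: layer time is the larger of transfer and compute
    return max(t_io, t_operation)
```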
The hardware parameters of the model are also variable, so the operation time and energy consumption of the same network under different hardware parameters can be tested by adjusting the algorithm. By adjusting the various hardware parameters (such as bandwidth, cache size, PE array size and number of channels), an optimal solution under the hardware area constraint can be obtained, minimizing time and energy consumption from the perspective of hardware design. Moreover, while time and energy consumption are predicted layer by layer, the dominant modeling parameters of the time and energy overheads are analyzed, which can to a certain extent help the design of the hardware.
As an optional implementation manner of the present invention, the performing dual-objective optimization on the neural network energy consumption model and the neural network time model includes:
performing cache segmentation on the neural network time model by using a TM calculation process;
carrying out array segmentation on the neural network energy consumption model by using an AP calculation process;
and selecting the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition.
The Thinker chip supports two calculation flows, AP and TM. AP stands for Array Partitioning; TM stands for Time Multiplexing. The two differ in the order of computation for the same network inference process. In the TM calculation flow, the network inference proceeds layer by layer, with the whole PE array computing the same layer at any one time. In the AP calculation flow, the partitioning of the array is adjusted, so the PE array may compute multiple layers at the same time. Because the AP calculation flow can compute multiple types of layers simultaneously, it balances the characteristics of convolution operations (large computation, small data) against those of full-connection operations (small computation, large data), and can thereby shorten the time. The TM calculation flow, however, multiplexes data many times, so the data are retransmitted few times and the energy consumption is small.
As an optional implementation manner of the present invention, the selecting the array partition method in which the cache partition satisfies the requirement condition that the corresponding energy consumption is minimum includes:
setting a task time Tmax;
calculating the time TTM required when the neural network time model is cache-segmented using the TM calculation process;
calculating the energy consumption EAP required when the neural network energy consumption model is array-segmented using the AP calculation flow;
selecting the minimum required energy consumption EAP0 and calculating the time TAP0 required by the minimum required energy consumption EAP0;
comparing Tmax and TAP0, and judging whether the set task time meets the requirement according to the comparison result;
if the set task time meets the requirement, comparing Tmax and TTM, and outputting a result of performing dual-objective optimization on the neural network energy consumption model and the neural network time model according to the comparison result.
As an optional implementation manner of the present invention, the determining whether the set task time meets the requirement according to the comparison result includes:
if Tmax > TAP0, judging that the set task time meets the requirement;
otherwise, judging that the task can not be completed within the specified time, and resetting the task time.
As an optional implementation manner of the present invention, the outputting a result of performing a dual-target optimization on the neural network energy consumption model and the neural network time model according to the comparison result includes:
if Tmax < TTM, adjusting the modeling parameters, and carrying out array segmentation on the neural network energy consumption model again;
calculating the energy consumption EAP' required when the neural network energy consumption model is array-segmented using the AP calculation flow, and selecting the modeling parameters and array segmentation method corresponding to the minimum energy consumption EAP0' as the dual-target optimization result;
otherwise, selecting the modeling parameters and the cache segmentation method corresponding to the TM calculation process as the dual-target optimization result.
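The selection procedure above can be sketched as follows; the function name, the candidate representation and the return convention are illustrative. Each AP candidate is an (energy, time) pair for one array segmentation, so taking the tuple minimum picks the minimum-energy segmentation first:

```python
def dual_objective_select(t_max, t_tm, ap_candidates):
    """Hedged sketch of the TM/AP dual-objective selection.

    t_max: the set task time Tmax; t_tm: time of the TM cache segmentation;
    ap_candidates: (energy, time) pairs, one per AP array segmentation.
    """
    # select the minimum required energy EAP0 and its time TAP0
    e_ap0, t_ap0 = min(ap_candidates)
    if t_max <= t_ap0:
        # the task cannot be completed in the specified time: reset Tmax
        return ("infeasible", None)
    if t_max < t_tm:
        # TM misses the deadline: keep the minimum-energy AP segmentation
        return ("AP", e_ap0)
    # TM meets the deadline: prefer it for its lower energy consumption
    return ("TM", t_tm)
```

For example, with Tmax=10, TTM=5 and AP candidates [(3, 4), (2, 6)], the TM flow is selected; tightening Tmax to 7 with TTM=9 switches the choice to the minimum-energy AP segmentation.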
In the actual execution process, in order to execute different layers simultaneously, not only the number of executable PEs but also the allocation of the cache must be adjusted. Since the total buffer size and the array size of the Thinker chip are given, and only vertical slicing is a reasonable approach, the parameters that need to be cycled through include the number of array columns allocated to CONV, the data buffer space size Sconvdatabuffer, and the weight buffer space size Sconvweightbuffer.
It can be seen during modeling that the time overhead of network computation has a "hiding effect". "Hiding effect" means that the time of a layer splits into the computation time and the data transmission time, and the total time of the layer is the maximum of the two; the data transmission time is in turn the greater of the data transfer time and the weight transfer time. To study the time overhead further, the dominant factor of the per-layer time consumption must be examined. Taking a typical LRCN neural network as an example: for the FC operation, the weight read-write time dominates the time overhead under both the TM and the AP0 calculation flows, because the full-connection layer has a small amount of computation and a large amount of data; for the CONV operation, the computation time basically dominates.
In the TM calculation process, the data read-write time, the weight read-write time and the computation time differ greatly. The computation time overhead of the convolutional layer is far greater than the data read-write time and the weight read-write time, so while the PE array is busy computing for a long period, the data transfer sits idle; during the FC calculation, conversely, the PE array sits idle because the weight reading and writing takes so long. In the layer-by-layer time overhead of AP0, the data read-write time, the weight read-write time and the computation time are well balanced; even though the originally small transfer times of the convolution operation grow, the total time can be effectively reduced. The time overhead of the FC layer is dominated by weight reading and writing; to shorten the FC time, the proportion of the buffer allocated to the FC weights must be increased, or the total time must be reduced by enlarging the total buffer or increasing the data transmission bandwidth.
Hardware parameters of the neural network energy consumption model and the neural network time model are also variable, so that the operation time and energy consumption of the same network under different hardware parameters can be tested by adjusting an algorithm, and certain help can be brought to the design of hardware. By adjusting various hardware parameters (such as bandwidth, cache size, PE array size, and number of channels), it is possible to obtain an optimal solution under the hardware area constraint, minimizing time and power consumption from a hardware design perspective.
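The hardware-parameter exploration described above can be sketched as a grid sweep; the models are assumed callables, and the time-energy product used to scalarise the two objectives is an illustrative choice, not the patent's own criterion:

```python
from itertools import product

def sweep_hardware(network, time_model, energy_model,
                   bandwidths, buffer_sizes, pe_sizes):
    """Hedged sketch: evaluate the time and energy models of one network
    over a grid of bandwidth, cache-size and PE-array-size settings and
    keep the setting with the best time-energy trade-off."""
    best = None
    for bw, buf, pe in product(bandwidths, buffer_sizes, pe_sizes):
        t = time_model(network, bw, buf, pe)
        e = energy_model(network, bw, buf, pe)
        score = t * e  # simple scalarisation of the dual objective
        if best is None or score < best[0]:
            best = (score, {"bandwidth": bw, "buffer": buf, "pe": pe,
                            "time": t, "energy": e})
    return best[1]
```

A usage sketch: with toy models time = 10/bw + 1/pe and energy = bw + pe, sweeping bw in {1, 2} and pe in {1, 2} selects bw = 2, pe = 1.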
Therefore, the TM calculation process is used to perform cache segmentation on the neural network time model; the AP calculation process is used to perform array segmentation on the neural network energy consumption model; and the array segmentation method with the minimum energy consumption among those whose cache segmentation meets the requirement is selected. This can help future hardware design and achieves different optimal solutions under different task-time conditions.
In this embodiment, time and energy consumption modeling is performed on the neural network from the perspective of a hardware calculation process of the network, time and energy consumption are predicted layer by layer, a leading modeling parameter of time and energy consumption overhead is analyzed, and a time and energy consumption dual-target optimization is performed on the neural network by improving the modeling parameter, an array segmentation method and a cache segmentation method, so that a neural network model is improved. Furthermore, different optimal solutions can be obtained under the conditions of different task times through layer-by-layer modeling.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.
It should be noted that the present invention is not limited to the above-mentioned preferred embodiments, and those skilled in the art can obtain other products in various forms without departing from the spirit of the present invention, but any changes in shape or structure can be made within the scope of the present invention with the same or similar technical solutions as those of the present invention.
Claims (8)
1. A neural network optimization method, comprising:
presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters;
constructing a neural network energy consumption model based on the modeling parameters;
constructing a neural network time model based on the modeling parameters;
performing dual-objective optimization on the neural network energy consumption model and the neural network time model, including:
performing cache segmentation on the neural network time model by using a TM calculation process;
carrying out array segmentation on the neural network energy consumption model by using an AP calculation process;
the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement is selected, and comprises the following steps: setting a task time Tmax;
calculating the time TTM required when the neural network time model is cache-segmented using the TM calculation process;
calculating the energy consumption EAP required when the neural network energy consumption model is array-segmented using the AP calculation flow;
selecting the minimum required energy consumption EAP0 and calculating the time TAP0 required by the minimum required energy consumption EAP0;
comparing Tmax and TAP0, and judging whether the set task time meets the requirement according to the comparison result;
if the set task time meets the requirement, comparing the sizes of Tmax and TTM, and outputting a dual-target optimization result of the neural network energy consumption model and the neural network time model according to the comparison result.
2. The neural network optimization method according to claim 1, wherein the energy consumption calculation formula is E = V × T × e, where V is the amount of data to be read/written/calculated, T is the number of times the data are repeatedly read/written/calculated, and e is the unit energy consumption.
3. The neural network optimization method of claim 2, wherein the unit energy consumption comprises read-write energy consumption and computational energy consumption.
4. The neural network optimization method of claim 1, wherein the constructing the neural network energy consumption model comprises: and modeling the energy consumption of the neural network layer by layer to obtain a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model.
5. The neural network optimization method of claim 1, wherein the neural network temporal model comprises a temporal cost comprising a convolutional layer temporal cost and a fully-connected layer temporal cost.
6. The neural network optimization method of claim 5, wherein the time overhead is calculated as Tz = max(TIO, Toperation), wherein TIO is the read-write time and Toperation is the calculation time.
7. The neural network optimization method of claim 1, wherein the determining whether the set task time meets the requirement according to the comparison result comprises:
if Tmax > TAP0, judging that the set task time meets the requirement;
otherwise, judging that the task can not be completed within the specified time, and resetting the task time.
8. The neural network optimization method of claim 1, wherein the outputting a dual-target optimization result of the neural network energy consumption model and the neural network time model according to the comparison result comprises:
if Tmax < TTM, adjusting the modeling parameters, and carrying out array segmentation on the neural network energy consumption model again;
calculating the energy consumption EAP' required when the neural network energy consumption model is array-segmented using the AP calculation flow, and selecting the modeling parameters and array segmentation method corresponding to the minimum energy consumption EAP0' as the dual-target optimization result;
otherwise, selecting the modeling parameters and the cache segmentation method corresponding to the TM calculation process as the dual-target optimization result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811344189.0A CN109472361B (en) | 2018-11-13 | 2018-11-13 | Neural network optimization method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811344189.0A CN109472361B (en) | 2018-11-13 | 2018-11-13 | Neural network optimization method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109472361A CN109472361A (en) | 2019-03-15 |
CN109472361B true CN109472361B (en) | 2020-08-28 |
Family
ID=65671820
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811344189.0A Withdrawn - After Issue CN109472361B (en) | 2018-11-13 | 2018-11-13 | Neural network optimization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109472361B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110738318B (en) * | 2019-09-11 | 2023-05-26 | 北京百度网讯科技有限公司 | Network structure operation time evaluation and evaluation model generation method, system and device |
CN110929860B (en) * | 2019-11-07 | 2020-10-23 | 深圳云天励飞技术有限公司 | Convolution acceleration operation method and device, storage medium and terminal equipment |
CN111753950B (en) * | 2020-01-19 | 2024-02-27 | 杭州海康威视数字技术股份有限公司 | Forward time consumption determination method, device and equipment |
CN112085195B (en) * | 2020-09-04 | 2022-09-23 | 西北工业大学 | X-ADMM-based deep learning model environment self-adaption method |
CN112468533B (en) * | 2020-10-20 | 2023-01-10 | 安徽网萌科技发展股份有限公司 | Agricultural product planting-oriented edge learning model online segmentation method and system |
CN113377546B (en) * | 2021-07-12 | 2022-02-01 | 中科弘云科技(北京)有限公司 | Communication avoidance method, apparatus, electronic device, and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102773981A (en) * | 2012-07-16 | 2012-11-14 | 南京航空航天大学 | Implementation method of energy-saving and optimizing system of injection molding machine |
CN105302973A (en) * | 2015-11-06 | 2016-02-03 | 重庆科技学院 | MOEA/D algorithm based aluminum electrolysis production optimization method |
CN106427589A (en) * | 2016-10-17 | 2017-02-22 | 江苏大学 | Electric car driving range estimation method based on prediction of working condition and fuzzy energy consumption |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10621486B2 (en) * | 2016-08-12 | 2020-04-14 | Beijing Deephi Intelligent Technology Co., Ltd. | Method for optimizing an artificial neural network (ANN) |
-
2018
- 2018-11-13 CN CN201811344189.0A patent/CN109472361B/en not_active Withdrawn - After Issue
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102773981A (en) * | 2012-07-16 | 2012-11-14 | 南京航空航天大学 | Implementation method of energy-saving and optimizing system of injection molding machine |
CN105302973A (en) * | 2015-11-06 | 2016-02-03 | 重庆科技学院 | MOEA/D algorithm based aluminum electrolysis production optimization method |
CN106427589A (en) * | 2016-10-17 | 2017-02-22 | 江苏大学 | Electric car driving range estimation method based on prediction of working condition and fuzzy energy consumption |
Also Published As
Publication number | Publication date |
---|---|
CN109472361A (en) | 2019-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109472361B (en) | Neural network optimization method | |
CN108564168B (en) | Design method for neural network processor supporting multi-precision convolution | |
US20220051087A1 (en) | Neural Network Architecture Using Convolution Engine Filter Weight Buffers | |
CN111667051A (en) | Neural network accelerator suitable for edge equipment and neural network acceleration calculation method | |
US11775430B1 (en) | Memory access for multiple circuit components | |
US11500959B2 (en) | Multiple output fusion for operations performed in a multi-dimensional array of processing units | |
JP2020521195A (en) | Scheduling neural network processing | |
Zhou et al. | Transpim: A memory-based acceleration via software-hardware co-design for transformer | |
CN110738316B (en) | Operation method and device based on neural network and electronic equipment | |
Mittal | A survey of accelerator architectures for 3D convolution neural networks | |
US11669443B2 (en) | Data layout optimization on processing in memory architecture for executing neural network model | |
KR20180123846A (en) | Logical-3d array reconfigurable accelerator for convolutional neural networks | |
CN113361695B (en) | Convolutional neural network accelerator | |
CN111105023A (en) | Data stream reconstruction method and reconfigurable data stream processor | |
TWI775210B (en) | Data dividing method and processor for convolution operation | |
CN112380793A (en) | Turbulence combustion numerical simulation parallel acceleration implementation method based on GPU | |
Yan et al. | FPGAN: an FPGA accelerator for graph attention networks with software and hardware co-optimization | |
WO2019182059A1 (en) | Model generation device, model generation method, and program | |
CN115668222A (en) | Data processing method and device of neural network | |
US20230025068A1 (en) | Hybrid machine learning architecture with neural processing unit and compute-in-memory processing elements | |
CN113031853A (en) | Interconnection device, operation method of interconnection device, and artificial intelligence accelerator system | |
CN116090518A (en) | Feature map processing method and device based on systolic operation array and storage medium | |
KR20240036594A (en) | Subsum management and reconfigurable systolic flow architectures for in-memory computation | |
US11954580B2 (en) | Spatial tiling of compute arrays with shared control | |
CN113392959A (en) | Method for reconstructing architecture in computing system and computing system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
AV01 | Patent right actively abandoned |

Granted publication date: 20200828 Effective date of abandoning: 20210125 |