CN109472361B - Neural network optimization method - Google Patents

Neural network optimization method

Info

Publication number
CN109472361B
Authority
CN
China
Prior art keywords
energy consumption
neural network
time
model
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn - After Issue
Application number
CN201811344189.0A
Other languages
Chinese (zh)
Other versions
CN109472361A (en)
Inventor
张跃进
胡勇
喻蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongxiang Boqian Information Technology Co ltd
Original Assignee
Zhongxiang Boqian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongxiang Boqian Information Technology Co ltd filed Critical Zhongxiang Boqian Information Technology Co ltd
Priority to CN201811344189.0A
Publication of CN109472361A
Application granted
Publication of CN109472361B
Legal status: Withdrawn - After Issue

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F 1/26 Power supply means, e.g. regulation thereof
    • G06F 1/32 Means for saving power
    • G06F 1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F 1/3234 Power saving characterised by the action undertaken

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a neural network optimization method, which comprises the following steps: presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters; constructing a neural network energy consumption model based on the modeling parameters; constructing a neural network time model based on the modeling parameters; and performing dual-objective optimization on the neural network energy consumption model and the neural network time model. The method models the time and energy consumption of the neural network from the perspective of the hardware computing process of the network, predicts time and energy consumption layer by layer, analyses the modeling parameters that dominate the time and energy overheads, and improves the neural network model by adjusting the modeling parameters, the array segmentation method and the cache segmentation method, so as to perform a dual-objective optimization of time and energy consumption on the neural network.

Description

Neural network optimization method
Technical Field
The application relates to the technical field of artificial neural networks, in particular to a neural network optimization method.
Background
With the rise of neural network technology, neural network hardware suited to different application scenarios has emerged. Neural networks have strong reasoning and prediction capability but a large computational load, so increasing the computation speed of a neural network while reducing its energy consumption has become a key issue.
In the related art, both the neural network training process and the inference process urgently need acceleration of network computation. Network training is mostly completed in the cloud using GPUs, and the parallelization and communication methods of different hardware greatly influence the training speed, so the computation speed of a neural network is mainly improved by increasing the parallel computing capability and reducing the communication overhead. The energy consumption of a neural network can be divided into computation energy consumption and memory-access energy consumption. Different data-reuse schemes, such as output-stationary and weight-stationary dataflows, can effectively reduce the memory-access energy consumption of the neural network.
However, the above research on neural network computation focuses on either low energy consumption or higher computation speed alone, and does not consider that, in a complex application environment, there may be a certain contradiction between accelerating a neural network and being energy-aware: reducing energy consumption may sacrifice speed, while reducing the computation time of the neural network may generate more energy consumption.
Disclosure of Invention
In order to overcome, at least to a certain extent, the problem that research on neural network computation in the related art focuses on either low energy consumption or higher computation speed alone, without considering that acceleration and energy awareness of a neural network may contradict each other in a complex application environment (reducing energy consumption may sacrifice speed, and reducing computation time may generate more energy consumption), the application provides a neural network optimization method, which comprises the following steps:
Presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters;
constructing a neural network energy consumption model based on the modeling parameters;
constructing a neural network time model based on the modeling parameters;
and performing double-target optimization on the neural network energy consumption model and the neural network time model.
Furthermore, the energy consumption calculation formula is E = V × T × e, where V is the data volume to be read/written/calculated, T is the number of times the data needs to be read/written/calculated repeatedly, and e is the unit energy consumption.
Further, the unit energy consumption includes read-write energy consumption and calculation energy consumption.
Further, the building of the neural network energy consumption model includes: and modeling the energy consumption of the neural network layer by layer to obtain a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model.
Further, the neural network time model includes a time overhead, the time overhead including a convolutional layer time overhead and a fully-connected layer time overhead.
Further, the time overhead is calculated as Tz = max(T_IO, T_operation), where T_IO is the read/write time and T_operation is the computation time.
Further, the performing of the dual-objective optimization on the neural network energy consumption model and the neural network time model includes:
performing cache segmentation on the neural network time model by using a TM calculation process;
carrying out array segmentation on the neural network energy consumption model by using an AP (array-partitioning) calculation process;
and selecting the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition.
Further, the selecting of the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition includes:
setting a task time Tmax;
calculating the time T_TM required when the neural network time model is partitioned using the TM calculation process cache segmentation;
calculating the energy consumption E_AP required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation;
selecting the minimum required energy consumption E_AP0 and calculating the time T_AP0 required to reach the minimum energy consumption E_AP0;
comparing Tmax and T_AP0, and judging whether the set task time meets the requirement according to the comparison result;
if the set task time meets the requirement, comparing Tmax with T_TM, and outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result.
Further, the judging whether the set task time meets the requirement according to the comparison result includes:
if Tmax > T_AP0, judging that the set task time meets the requirement;
otherwise, judging that the task cannot be completed within the specified time, and resetting the task time.
Further, the outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result includes:
if Tmax < T_TM, adjusting the modeling parameters and carrying out array segmentation on the neural network energy consumption model again;
calculating the energy consumption E_AP' required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation, and selecting the modeling parameters and array segmentation method corresponding to the minimum energy consumption E_AP0' as the dual-objective optimization result;
otherwise, selecting the modeling parameters and the cache segmentation method corresponding to the TM calculation process as the dual-objective optimization result.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the neural network optimization method comprises the following steps: presetting modeling parameters, and constructing a neural network energy consumption model and a neural network time model based on the modeling parameters; and performing double-target optimization on the neural network energy consumption model and the neural network time model. According to the method, time and energy consumption modeling is carried out on the neural network from the perspective of a hardware computing process of the network, time and energy consumption are predicted layer by layer, leading modeling parameters of time and energy consumption overhead are analyzed, and a neural network model is improved by improving the modeling parameters, changing an array segmentation method and a cache segmentation method to carry out time and energy consumption double-target optimization on the neural network.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart of a neural network optimization method according to an embodiment of the present application.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
Fig. 1 is a flowchart of a neural network optimization method according to an embodiment of the present application.
As shown in fig. 1, the neural network optimization method of the present embodiment includes:
s1: presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters;
s2: constructing a neural network energy consumption model based on the modeling parameters;
s3: constructing a neural network time model based on the modeling parameters;
s4: and performing double-target optimization on the neural network energy consumption model and the neural network time model.
A neural network structure contains three different kinds of layers: the convolutional layer (CONV), the fully-connected layer (FC) and the pooling layer (POOL), which correspond to the three tasks of feature extraction, feature combination and feature compression respectively. Different network structures differ in the number and type of layers, and different layers have different characteristics. For example, the convolutional layer is computation-heavy while the fully-connected layer is data-heavy, and this imbalance of resource requirements must be addressed by a dedicated hardware architecture design.
Some Neural network architectures also include an RNN (Recurrent Neural Networks) layer or a CNN (Convolutional Neural Networks) layer to accomplish specific tasks.
The hardware is, for example, a Thinker chip, and the Thinker chip is composed of a PE array, an on-chip storage system, a finite state controller, an IO (input/output) and a decoder.
The modeling parameters include network parameters and hardware parameters, which are shown in table 1.
TABLE 1 modeling parameter Table
[Table 1 is provided as an image in the original publication; it lists the network parameters and hardware parameters used for modeling.]
The neural network algorithm with the lowest energy consumption under a given task-time condition is obtained by constructing a neural network energy consumption model and a neural network time model, performing a dual-objective optimization of energy consumption and time, and adjusting the modeling parameters.
As an optional implementation manner of the present invention, the energy consumption calculation formula is E = V × T × e, where V is the data volume to be read/written/calculated, T is the number of times the data needs to be read/written/calculated repeatedly, and e is the unit energy consumption.
The data volume V required to be read/written/calculated and the times T required to be read/written/calculated repeatedly can be calculated by utilizing network parameters and hardware parameters; for the unit energy consumption e, the access energy consumption of different storage levels is different, so that the analysis needs to be performed layer by layer.
The memory access energy consumption of different memory hierarchy is shown in table 2.
TABLE 2 Normalized memory-access energy consumption of different storage levels (per byte)
DRAM ↔ on-chip cache (SRAM): 200
Cache ↔ register file: 6
Cache ↔ PE array: 6
Register file → PE array: 2
PE ↔ PE transfer: 2
If data is transferred between two storage levels, the higher of the two unit energies is taken as the energy consumption of the transfer. For example, when data is transferred from the DRAM to the on-chip cache (SRAM), the unit energy consumption is taken to be 200 energy consumption units.
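As a minimal illustration of how the formula E = V × T × e combines with these normalized unit energies, the following Python sketch computes the energy of a single transfer. All names and the example volume are illustrative and not taken from the patent.

```python
# Minimal sketch of the energy formula E = V * T * e, using the normalized
# per-byte energies quoted above (DRAM<->cache: 200, cache<->PE: 6,
# register->PE: 2, PE<->PE: 2). All identifiers here are illustrative.

UNIT_ENERGY = {
    ("dram", "buffer"): 200,    # DRAM <-> on-chip cache (SRAM)
    ("buffer", "pe"): 6,        # cache <-> PE array
    ("register", "pe"): 2,      # register file -> PE array
    ("pe", "pe"): 2,            # transfer between neighbouring PEs
}

def transfer_energy(volume_bytes: int, repeats: int, src: str, dst: str) -> int:
    """E = V * T * e for one transfer path; the path is looked up in either direction."""
    key = (src, dst) if (src, dst) in UNIT_ENERGY else (dst, src)
    return volume_bytes * repeats * UNIT_ENERGY[key]

# Example: importing 1 MiB of weights once from DRAM into the cache.
print(transfer_energy(1 << 20, 1, "dram", "buffer"))  # 1048576 * 1 * 200
```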
As an optional implementation manner of the present invention, the unit energy consumption includes read-write energy consumption and computational energy consumption.
As an optional implementation manner of the present invention, the constructing the neural network energy consumption model includes: and modeling the energy consumption of the neural network layer by layer to obtain a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model.
As an optional implementation manner of the present invention, the method for calculating the unit energy consumption E0 includes:
respectively constructing a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model;
and respectively calculating unit read-write/calculated energy consumption E0 corresponding to the convolutional layer energy consumption model, the full-link layer energy consumption model, the RNN layer energy consumption model and the pooling layer energy consumption model.
The energy consumption in the convolutional layer energy consumption model is calculated as follows:
for a complete sample, a convolution layer is calculated, and the energy consumption is:
E1 = E1_IO + E1_operation    (1)
where E1_IO is the convolutional layer read/write energy consumption and E1_operation is the computation energy consumption.
The read/write energy consumption is E1_IO = E1_weightIO + E1_inputIO + E1_outputIO    (2)
where E1_weightIO is the weight energy consumption, E1_inputIO is the input energy consumption, and E1_outputIO is the output energy consumption.
The weight energy consumption E1_weightIO, the input energy consumption E1_inputIO and the output energy consumption E1_outputIO are calculated respectively as follows:
(1) Weight energy consumption E1_weightIO
E1_weightIO = E1_DRAM_buffer + E1_buffer_PE    (3)
where E1_DRAM_buffer is the energy consumption of reading from the DRAM into the cache, and E1_buffer_PE is the energy consumption from the cache to the PE array.
E1_DRAM_buffer = N_w · RD_DRAM_CONV · e_DRAM_buffer    (4)
where N_w is the total weight data volume of the convolutional layer and RD_DRAM_CONV is the number of rounds in which the weights are written from the DRAM into the cache.
The total weight data volume of the convolutional layer is N_w = γ·Ch_out_i·K_i²·Ch_in_i    (5), and the energy consumption e_DRAM_buffer for reading 1 byte of data from the DRAM into the cache is, for example, 200 energy consumption units.
The number of times the weights are repeatedly imported is related to the size of the on-chip cache. If the on-chip cache can store all the weight data of the layer, each data item only needs to be imported once, and subsequent accesses only involve the interaction between the cache and the PE array; otherwise, when the PEs need to reuse weight data that is no longer held in the cache, the data has to be imported again. In one round the PE array produces a fixed number of output points of the two-dimensional output map, so the input data has to be imported into the PE array in several batches; when the cache capacity is insufficient, the same weights are therefore imported repeatedly, which determines RD_DRAM_CONV in equation (6). The number of repeated reads from the cache into the PE array, multiplied by the per-byte energy e_buffer_PE (for example 6 energy consumption units), gives E1_buffer_PE in equation (7). (The exact expressions of equations (6) and (7) appear as images in the original publication.)
(2) Input energy consumption E1_inputIO
Input data must pass from the DRAM through the buffer and the register to reach the PE array, and must also be transferred within the PE array. Thus,
E1_inputIO = E1_DRAM_buffer + E1_buffer_register + E1_register_PE + E1_PE_tran    (8)
where E1_DRAM_buffer is the energy consumption from the DRAM to the cache, E1_buffer_register the energy consumption from the cache to the register, E1_register_PE the energy consumption from the register to the PE array, and E1_PE_tran the energy consumption of passing data that enters the leftmost PE of a row rightward across the PE array.
It should be noted that, because of padding, the total amount of input data may differ from the number of points of the input feature map. With 'valid' padding, no padding occurs and the original input size is unchanged. With 'same' padding, the effective input size changes even though the spatial size of the output map is unchanged, and the changed height and width are used in the subsequent calculations.
The expression for E1_DRAM_buffer of the input data appears as an image in the original publication; in it, H_in_i·W_in_i·Ch_in_i·α is the total amount of input data of the convolutional layer and S_CONVdatabuffer is the data buffer size. Because Thinker preferentially reuses all input points, the input points do not need to be imported into the buffer repeatedly.
A register file is placed between the cache and the PE array to avoid repeated import of the input data, and the expression for E1_buffer_register likewise appears as an image.
In that expression, one factor (shown as an image in the original publication) is the number of input rows required for one parallel operation of the PE rows; S_i is the horizontal stride, and the vertical step length is H_u_i + S_i − K_i. During horizontal sliding each data item is read only once, so only the number of vertical sliding steps (also shown as an image) matters. The per-byte transfer energy from the buffer to the register, e_buffer_register, is 6 energy consumption units.
The expression for E1_register_PE also appears as an image in the original publication. Its factors are the number of repeated imports of the input data (given as an image), the total amount of input data Ch_in_i·K_i²·H_out_i·W_out_i (the number of output points multiplied by the number of input points corresponding to each output point), and the register-to-PE-array per-byte transfer energy e_register_PE, which is 2 energy consumption units.
The expression for E1_PE_tran likewise appears as an image in the original publication. Owing to the reuse of input data, each data item that enters the leftmost PE of the array is passed to the right, so each input point is accessed repeatedly a number of times given by an expression shown as an image. The energy consumption e_PE_tran for reading one byte of data is, for example, 2 energy consumption units.
(3) Output energy consumption E1_outputIO
The total amount of output data is β·Ch_out_i·H_out_i·W_out_i. The outputs must be transferred between PEs and then into the cache, and any data that cannot be held in the cache is then written out to the DRAM to wait for the next calculation.
Thus, E1_outputIO = E1_PEout_tran + E1_PE_buffer + E1_buffer_DRAM    (13)
The output points produced by each PE must first be passed to the leftmost PE of its row. Considering one row of PEs, a certain number of output points is generated in each calculation cycle, and all of them have to be transmitted to the leftmost PE, which requires a corresponding total number of accesses; on average, each output point therefore requires a certain number of memory accesses (these counts, and the resulting expression for E1_PEout_tran, appear as images in the original publication). The total amount of output data is β·Ch_out_i·H_out_i·W_out_i, and the per-byte transmission energy e_PEout_tran is, for example, 2 energy consumption units. The expressions for E1_PE_buffer and E1_buffer_DRAM likewise appear as images in the original publication.
The computation energy consumption E1_operation is given by an expression that also appears as an image. In one layer of CONV calculation, a total of Ch_out_i·H_out_i·W_out_i output points have to be produced, and each point requires K_i²·Ch_in_i multiply-add operations.
In some embodiments, for a given sample, the energy consumed to complete all convolutional layer operations is the sum of the energy consumed by all convolutional layers.
In some embodiments, each sample in the AP computation stream and the TM computation stream is computed serially, so that the energy consumption for multiple samples only needs to be added to the energy consumption of different samples.
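The layer-by-layer convolutional energy model can be summarized in a small sketch. The helper below implements only the two pieces stated explicitly in the text: the multiply-add count behind E1_operation, and the weight-import energy built from N_w (equation (5)) and a round count RD_DRAM_CONV. The re-import rule used when the weights do not fit in the cache is a deliberate simplification of the image-only expressions, and all names and numeric values are illustrative.

```python
from dataclasses import dataclass
import math

# Illustrative normalized unit energies (see Table 2); e_operation is assumed.
E_DRAM_BUFFER = 200
E_OPERATION = 1

@dataclass
class ConvLayer:
    ch_in: int      # Ch_in_i, input channels
    ch_out: int     # Ch_out_i, output channels
    h_out: int      # H_out_i, output height
    w_out: int      # W_out_i, output width
    k: int          # K_i, kernel size
    gamma: int = 1  # bytes per weight (gamma)

def conv_operation_energy(layer: ConvLayer) -> int:
    """E1_operation: each output point needs K_i^2 * Ch_in_i multiply-adds."""
    output_points = layer.ch_out * layer.h_out * layer.w_out
    macs_per_point = layer.k ** 2 * layer.ch_in
    return output_points * macs_per_point * E_OPERATION

def conv_weight_dram_energy(layer: ConvLayer, weight_buffer_bytes: int) -> int:
    """Weight import energy N_w * RD_DRAM_CONV * e_DRAM_buffer (equations (4)-(5)).
    The round count used when the weights do not fit in the cache is a
    simplification, not the image-only expression of the original text."""
    n_w = layer.gamma * layer.ch_out * layer.k ** 2 * layer.ch_in   # equation (5)
    rounds = 1 if n_w <= weight_buffer_bytes else math.ceil(n_w / weight_buffer_bytes)
    return n_w * rounds * E_DRAM_BUFFER

layer = ConvLayer(ch_in=64, ch_out=128, h_out=28, w_out=28, k=3)
print(conv_operation_energy(layer), conv_weight_dram_energy(layer, 64 * 1024))
```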
The energy consumption in the full connection layer energy consumption model is calculated as follows:
E2 = E2_IO + E2_operation    (18)
In the Thinker chip, the fully-connected layers are computed for multiple samples at once, layer by layer. The read/write energy consumption of a batch FC operation is calculated as:
E2_IO = E2_weightIO + E2_inputIO + E2_outputIO    (19)
(1) E2_weightIO is calculated as follows:
The weight read/write energy consumption of the FC layer is divided into two parts:
E2_weightIO = E2_DRAM_buffer + E2_buffer_PE    (20)
In the same way as for the convolutional layer, the expressions for these two parts can be obtained (equations (21) to (23), which appear as images in the original publication), where BS is the batch size.
(2) E2_inputIO is calculated as follows:
Unlike the CONV layer, the input of the FC layer does not need to pass through the registers designed for convolutional input reuse, so E2_inputIO consists of three parts:
E2_inputIO = E2_DRAM_buffer + E2_buffer_PE + E2_PE_tran    (24)
The expressions for these three parts are given by equations (25) to (28), which appear as images in the original publication, where e_PEinput_tran is the per-byte transmission energy consumption, for example 2 energy consumption units.
(3) E2_outputIO is calculated as follows:
E2_outputIO = E2_PEout_tran + E2_PE_buffer + E2_buffer_DRAM    (29)
where the expression for E2_PEout_tran appears as an image in the original publication, and
E2_PE_buffer = FC_{i+1}·BS·β·e_PE_buffer    (31)
E2_buffer_DRAM = ReLU(FC_{i+1}·BS − S_FCdatabuffer)·β·e_buffer_DRAM    (32)
where e_PEout_tran, e_PE_buffer and e_buffer_DRAM are the per-byte transmission energies at the respective storage levels, equal to 2, 6 and 200 energy consumption units respectively.
When a plurality of FC layers exist in the network, the total read/write energy consumption of the FC calculations in a batch is the sum of the read/write energy consumption of all the FC layers.
The computation energy consumption E2_operation is calculated as follows: there are FC_{i+1}·BS output points in total, and each output point requires FC_i multiply-add operations. With e_operation denoting the energy required for a unit-byte multiply-add operation, the energy consumption of this part is:
E2_operation = FC_{i+1}·BS·FC_i·α·γ·e_operation    (33)
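A compact sketch of the FC-layer terms that the text states explicitly, equations (31), (32) and (33), is given below. The ReLU in equation (32) means that data spills to the DRAM only when the batch output exceeds the FC data buffer. The byte widths α, β, γ, the buffer size and the unit energies used in the example call are illustrative assumptions.

```python
# Sketch of the fully-connected (FC) layer energy terms given explicitly in the
# text: equations (31), (32) and (33). Unit energies, byte widths and buffer
# sizes below are illustrative values, not taken from the patent.

E_PE_BUFFER = 6
E_BUFFER_DRAM = 200
E_OPERATION = 1

def relu(x: float) -> float:
    """ReLU as used in equation (32): spill to DRAM only if the buffer overflows."""
    return max(0.0, x)

def fc_output_io_energy(fc_out: int, bs: int, beta: int, s_fc_databuffer: int) -> float:
    e_pe_buffer = fc_out * bs * beta * E_PE_BUFFER                                # eq. (31)
    e_buffer_dram = relu(fc_out * bs - s_fc_databuffer) * beta * E_BUFFER_DRAM    # eq. (32)
    return e_pe_buffer + e_buffer_dram

def fc_operation_energy(fc_in: int, fc_out: int, bs: int, alpha: int, gamma: int) -> float:
    # eq. (33): FC_{i+1}*BS output points, each needing FC_i multiply-adds
    return fc_out * bs * fc_in * alpha * gamma * E_OPERATION

print(fc_output_io_energy(fc_out=1000, bs=32, beta=2, s_fc_databuffer=16384))
print(fc_operation_energy(fc_in=4096, fc_out=1000, bs=32, alpha=2, gamma=2))
```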
The energy consumption in the RNN layer energy consumption model is calculated as follows:
when a plurality of RNN layers exist in the network, the total energy consumption calculated by the RNN in a batch is the sum of the calculated energy consumption of each layer.
The most commonly applied RNN is a sequence model of an LSTM structure, and the following RNN energy consumption analysis mainly aims at the LSTM structure.
In an RNN, the calculation flow of an LSTM unit is:
f_t = σ(w_xf·x_t + w_hf·h_{t-1} + b_f)
i_t = σ(w_xi·x_t + w_hi·h_{t-1} + b_i)
o_t = σ(w_xo·x_t + w_ho·h_{t-1} + b_o)
g_t = tanh(w_xg·x_t + w_hg·h_{t-1} + b_g)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
The Thinker chip uses two kinds of PE, ordinary PEs and super PEs, to compute the RNN layer. The part W_x_gate·x_t_gate + W_h_gate·h_{t-1}_gate + b_gate follows the computation principle of the FC layer and is computed in the ordinary PEs. After that computation is finished, the data is imported into the super PEs to compute the RNN gating: the sigmoid or tanh functions are evaluated to obtain the various gate vectors, and c_t and h_t are obtained through multiplication and addition.
The energy consumption of a single RNN layer for one batch is calculated first. Considering that the RNN is divided into an FC part and an RNN-gating part, and denoting the number of RNN iterations in a batch by Iteration, the expression of the RNN layer energy consumption (equation (34)) appears as an image in the original publication; it sums the FC energy and the gating energy over the Iteration iterations.
(1) The calculation of the RNN-FC energy consumption is as follows:
The energy consumption of this part is calculated in basically the same way as for the FC layer; the difference is that the dimensions of the weight matrix, the input vector and the output vector need to be adjusted. For the RNN, the unified form of the FC-layer computation is:
FC_t = W_x_gate·x_t_gate + W_h_gate·h_{t-1}_gate + b_gate    (35)
where gate can be i/f/o/g, corresponding to the four gates of the LSTM. In a concrete implementation, W_x_gate and W_h_gate are concatenated horizontally into one W_gate, and x_t_gate and h_{t-1}_gate are concatenated vertically into one x_gate. Thus W_gate has dimension O_len_i × (I_len_i + O_len_i) and x_gate has dimension 1 × (I_len_i + O_len_i), corresponding to the FC-layer parameters FC_i and FC_{i+1}, namely:
FC_i = I_len_i + O_len_i
FC_{i+1} = O_len_i
therefore, the method can be classified into an FC layer energy consumption model for analysis. It is noted that the FC operation needs to be repeated four times for each LSTM cell, corresponding to the number of gates. And will not be described in detail herein.
(2) The gating memory-access energy consumption is calculated as follows.
As described above, the gating calculation mainly involves two kinds of operations: (a) simple tanh/sigmoid function evaluations and (b) element-wise multiplications. The gating memory-access energy can therefore be written as the sum of the access energy of these two kinds of operations (the corresponding expressions appear as images in the original publication).
The access energy of (a) is calculated as follows. tanh/sigmoid is mainly used to compute the 4 gate functions and c_t, i.e. element-wise tanh/sigmoid operations are applied to 5 groups of data of vector length O_len_i, so the total number of operations is BS·5·O_len_i. The calculation is performed in the PEs, and the data has to be imported from the DRAM into the buffer, imported from the buffer into the PEs, and finally exported. Let the per-byte transfer energy from the DRAM to the buffer be e_DRAM_buffer, from the buffer to the PEs be e_buffer_PE, and from the PEs to the buffer and from the buffer to the DRAM be e_PE_buffer and e_buffer_DRAM. Each data item undergoes this operation at most once, and interaction between the DRAM and the buffer occurs only when the buffer cannot hold all the input points, so the corresponding expression (shown as an image in the original publication) can be summarized.
The access energy of (b) mainly comprises the calculation of c_t and the calculation of h_t. For h_t, only one element-wise multiplication is required, so the data read involves a total of 2·O_len_i input values and the data written out involves O_len_i values; the access energy is obtained in the same way as for the tanh/sigmoid calculation above (the expression appears as an image in the original publication). For c_t, an accumulation is required in addition to the multiplications: each output value needs two multiplications whose products are then summed, so one product has to be read out into the buffer and then imported into the register in the PE to complete the summation. The access energy of this part is obtained similarly (the expression appears as an image in the original publication).
The computation energy consumption of the gating is calculated as follows. Three kinds of operations occur in the gating: tanh/sigmoid, multiplication and addition. Let the corresponding per-element operation energies be e_tanh/sigmoid, e_multiply and e_add, for example all equal to 1. The numbers of times the three operations are executed can be calculated as 5·O_len_i, 3·O_len_i and O_len_i respectively, from which the gating computation energy is obtained (the expression appears as an image in the original publication).
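The gating computation-energy count just described translates directly into a few lines of code. The sketch below assumes, as an illustration, that the quoted counts (5·O_len_i tanh/sigmoid evaluations, 3·O_len_i multiplications and O_len_i additions) are per sample and scales them by the batch size; the unit energies of 1 follow the example values in the text.

```python
# Sketch of the RNN gating computation energy: per sample, 5*O_len tanh/sigmoid
# evaluations, 3*O_len multiplications and O_len additions, each assumed to cost
# one energy unit. The per-sample interpretation and batch scaling are assumptions.

E_TANH_SIGMOID = 1
E_MULTIPLY = 1
E_ADD = 1

def gating_operation_energy(o_len: int, batch_size: int) -> int:
    per_sample = (5 * o_len * E_TANH_SIGMOID
                  + 3 * o_len * E_MULTIPLY
                  + o_len * E_ADD)
    return batch_size * per_sample

print(gating_operation_energy(o_len=256, batch_size=32))
```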
the energy consumption in the pooling layer energy consumption model is calculated as follows:
for pooling operation, the energy consumption of a single layer, single sample, can be considered first, since there is no multiplexing of data between different layers, different samples. Similarly, the energy consumption of the pooling operation is divided into a discussion of reading and writing energy consumption and calculating energy consumption.
An energy consumption model of the pooling operation is established as follows; if a multi-layer, multi-sample result is needed, the per-layer, per-sample results are simply summed.
E4 = E4_IO + E4_operation    (41)
The pooling layer read/write energy consumption E4_IO is calculated as follows.
Thinker supports max pooling, which reduces the height and width of the output map while keeping the number of channels. The height and width of the output map are related to those of the input map through the pooling size (the expressions appear as images in the original publication). The total number of pooled blocks is then:
X = H_out_i·W_out_i·Ch_in_i    (42)
the data read-write energy consumption of the pooling operation can be divided into two types of input data and output data, and each type needs to complete read-write interaction from a DRAM (dynamic random access memory), a cache and a PE (provider edge) array. Since data does not need to be repeatedly imported, the energy consumption model of the part is simple, namely:
Figure BDA0001863240100000153
wherein:
Figure BDA0001863240100000154
for the read-in energy consumption of the input data,
Figure BDA0001863240100000155
is the write-out power consumption of the output data.
The computation energy consumption involves the total number of data items that have to be processed. It should be noted that this total is again not equal to the size of the input feature map; it has to be inferred backwards from the size of the output map and the kernel size, namely:
E4_operation = p_i·p_i·H_out_i·W_out_i·Ch_out_i·e_operation    (44)
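Equations (42) and (44) of the pooling model translate directly into code. The sketch below is illustrative; the unit operation energy and the example sizes are assumptions.

```python
# Sketch of the pooling-layer counts of equations (42) and (44). The total number
# of elements feeding the pooling windows is inferred backwards from the output
# map size and the kernel size p_i. The unit energy is an assumed value.

E_OPERATION = 1

def pooling_block_count(h_out: int, w_out: int, ch_in: int) -> int:
    """Equation (42): total number of pooled blocks X."""
    return h_out * w_out * ch_in

def pooling_operation_energy(p: int, h_out: int, w_out: int, ch_out: int) -> int:
    """Equation (44): p_i * p_i elements are visited per output point."""
    return p * p * h_out * w_out * ch_out * E_OPERATION

print(pooling_block_count(14, 14, 128), pooling_operation_energy(2, 14, 14, 128))
```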
As an optional implementation manner of the present invention, the neural network time model includes a time overhead, and the time overhead includes a convolutional layer time overhead and a fully-connected layer time overhead.
As an optional implementation manner of the present invention, the time overhead is calculated as Tz = max(T_IO, T_operation), where T_IO is the read/write time and T_operation is the computation time.
As an optional implementation manner of the present invention, the neural network time model includes a convolutional layer time model and a fully-connected layer time model.
For the Thinker chip, the time overhead mainly comes from the computation-heavy convolution operations and the data-heavy fully-connected operations (including the fully-connected operations inside the FC layers and inside the RNN). Because of the 'hiding' effect of time (the smaller of the read/write time and the computation time is hidden behind the larger one), the less time-consuming RNN-gating and pooling operations do not require a time model. Time modeling is therefore carried out for the convolutional layer and the fully-connected layer.
As an optional implementation manner of the present invention, the convolutional layer time model and the fully-connected layer time model each include a time overhead, calculated as Tz = max(T_IO, T_operation), where T_IO is the read/write time and T_operation is the computation time.
The convolutional layer time model includes the following.
The read/write time is calculated as follows. In the Thinker calculation flow, the convolutional layers are computed one at a time, for a single layer of a single sample, so the time required for one layer of convolution in one sample is analysed first.
Data reading and writing passes through several levels such as the DRAM, the cache and the PEs. Once the data has been imported into the on-chip buffer, its further transfer is very fast and can be ignored, so T_IO reduces to the time spent on the interaction between the DRAM and the cache.
In the Thinker chip, the input/output data and the weights are imported from the DRAM into two different buffers; the buffers have different bandwidths, which can be adapted to the architecture of the neural network. The bandwidth shared by the input and output data is BW_dataconv, and the bandwidth of the weight data is BW_weightconv. For one layer of the network, the total amount of input and output data is H_in_i·W_in_i·Ch_in_i·α + H_out_i·W_out_i·Ch_out_i·β, and the total amount of weight data is given by an expression shown as an image in the original publication. The read/write time of the input/output data and that of the weight data can therefore be calculated separately, and the larger of the two is T_IO (equation (45), shown as an image in the original publication).
The computation time T_operation is calculated as follows. For one convolutional layer of one sample, the computation requires several rounds of operation of the PE array: in one round the PE array computes several output points in parallel, and after several rounds all output points of the layer have been computed. Let the number of rounds required to finish one layer be round_convlayer and the time required per round be t_round_conv; then the computation time is:
T_operation = round_convlayer · t_round_conv    (46)
The expression for round_convlayer appears as an image in the original publication. Since Thinker reuses the input data row by row, H_out_i·W_out_i is the total number of rows that must be input; each round computes a certain number of rows (given as an image in the original publication), so a corresponding number of rounds is needed to process the whole input, and for one group of input data the same weights have to be imported a certain number of times (also given as an image). Each output point involves a certain number of multiply-add operations, so the time of one round of computation is the time required to perform these multiply-add operations sequentially (the corresponding expressions appear as images in the original publication).
The time to compute one convolutional layer of one sample is thus determined. The operation time of all convolutional layers in one sample is obtained by summing the per-layer times; to obtain the convolutional-layer operation time of all samples in a batch, the total time of one sample is simply multiplied by the batch size BS.
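The structure of the convolutional time model, Tz = max(T_IO, T_operation) with T_operation = round_convlayer · t_round_conv, can be sketched as follows. Because equations (45), (47) and (48) are only available as images, the round count and per-round time below are simplified stand-ins, and all bandwidths and sizes in the example are illustrative.

```python
import math

# Sketch of the convolutional-layer time model: the layer time is the maximum of
# the read/write time and the computation time ("hiding effect"), and the
# computation time is the number of PE-array rounds times the per-round time.
# The round count and per-round time here are simplified stand-ins for the
# image-only expressions of the original text; all values are illustrative.

def layer_time(t_io: float, t_operation: float) -> float:
    """Tz = max(T_IO, T_operation)."""
    return max(t_io, t_operation)

def conv_io_time(data_bytes: int, weight_bytes: int,
                 bw_data: float, bw_weight: float) -> float:
    """T_IO: larger of the data transfer time and the weight transfer time."""
    return max(data_bytes / bw_data, weight_bytes / bw_weight)

def conv_operation_time(output_points: int, macs_per_point: int,
                        pe_parallelism: int, t_mac: float) -> float:
    """T_operation = round_convlayer * t_round_conv (simplified round count)."""
    rounds = math.ceil(output_points / pe_parallelism)
    return rounds * macs_per_point * t_mac

t_io = conv_io_time(data_bytes=2_000_000, weight_bytes=600_000,
                    bw_data=1e9, bw_weight=5e8)
t_op = conv_operation_time(output_points=128 * 28 * 28, macs_per_point=9 * 64,
                           pe_parallelism=256, t_mac=1e-9)
print(layer_time(t_io, t_op))
```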
The full connection layer time model includes:
since different samples of a batch in the FC layer are calculated in parallel, one layer of FC operation time of one batch of data is directly analyzed.
The read/write time calculation only takes into account the data interaction between the DRAM and the cache. Let the bandwidth allocated to the input/output data be BW_data_FC and the bandwidth allocated to the weights be BW_weight_FC. The total amount of input and output data is (FC_i·α·RD_DRAM_in_FC + FC_{i+1}·β)·BS, and the total amount of weight data imported is FC_i·FC_{i+1}·γ·RD_DRAM_weight_FC; the read/write time is the larger of the data transfer time and the weight transfer time under these bandwidths (equation (49), shown as an image in the original publication).
the calculation of the calculation time includes: roundFClayerAnd tround_FC
Figure BDA0001863240100000182
Wherein:
Figure BDA0001863240100000183
it will be appreciated that the computation time for all FC layers in a batch need only be summed layer-by-layer.
The hardware parameters of the models are also adjustable, so the running time and energy consumption of the same network under different hardware parameters can be evaluated by the algorithm. By adjusting the various hardware parameters (such as bandwidth, cache size, PE array size and number of channels), an optimal solution under a hardware area constraint can be obtained, minimizing time and energy consumption from the perspective of hardware design. Predicting time and energy consumption layer by layer while analysing the modeling parameters that dominate the time and energy overheads can therefore also help the design of the hardware to a certain extent.
As an optional implementation manner of the present invention, the performing dual-objective optimization on the neural network energy consumption model and the neural network time model includes:
performing cache segmentation on the neural network time model by using a TM calculation process;
carrying out array segmentation on the neural network energy consumption model by using an AP (array-partitioning) calculation process;
and selecting the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition.
The Thinker chip supports two calculation flows, AP and TM. AP stands for Array-Partitioning, and TM stands for Time-Multiplexing. The difference between the two is the order of computation for the same network inference process. In the TM calculation flow, the network inference proceeds layer by layer, with the whole PE array computing the same layer at the same time. In the AP calculation flow, the partitioning of the array is adjusted, so the PE array may compute several layers at the same time. Because the AP calculation flow can compute several types of layers simultaneously, it balances the characteristics of convolution (large computation, small data volume) and fully-connected operation (small computation, large data volume), and can therefore shorten the time. The TM calculation flow, on the other hand, reuses data more often and therefore re-imports data fewer times, so its energy consumption is smaller.
As an optional implementation manner of the present invention, the selecting of the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition includes:
setting a task time Tmax;
calculating the time T_TM required when the neural network time model is partitioned using the TM calculation process cache segmentation;
calculating the energy consumption E_AP required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation;
selecting the minimum required energy consumption E_AP0 and calculating the time T_AP0 required to reach the minimum energy consumption E_AP0;
comparing Tmax and T_AP0, and judging whether the set task time meets the requirement according to the comparison result;
if the set task time meets the requirement, comparing Tmax with T_TM, and outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result.
As an optional implementation manner of the present invention, the judging whether the set task time meets the requirement according to the comparison result includes:
if Tmax > T_AP0, judging that the set task time meets the requirement;
otherwise, judging that the task cannot be completed within the specified time, and resetting the task time.
As an optional implementation manner of the present invention, the outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result includes:
if Tmax < T_TM, adjusting the modeling parameters and carrying out array segmentation on the neural network energy consumption model again;
calculating the energy consumption E_AP' required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation, and selecting the modeling parameters and array segmentation method corresponding to the minimum energy consumption E_AP0' as the dual-objective optimization result;
otherwise, selecting the modeling parameters and the cache segmentation method corresponding to the TM calculation process as the dual-objective optimization result (a sketch of this selection logic is given below).
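The selection procedure just described can be summarized in a short sketch. It assumes that the candidate cache segmentation (TM flow) and the candidate array segmentations (AP flow) have already been evaluated into times and energies; the data structures and example values are illustrative and not prescribed by the patent.

```python
# Sketch of the dual-objective selection described above, assuming the TM cache
# segmentation and the AP array segmentations have already been evaluated into
# (time, energy) pairs. All data structures and values here are illustrative.

from typing import List, Optional, Tuple

def select_configuration(
    t_max: float,                                # task time budget Tmax
    t_tm: float,                                 # T_TM: time of the TM cache segmentation
    ap_candidates: List[Tuple[float, float]],    # (T_AP, E_AP) for each array segmentation
) -> Optional[str]:
    # Pick the array segmentation with minimum energy E_AP0 and note its time T_AP0.
    t_ap0, e_ap0 = min(ap_candidates, key=lambda te: te[1])

    if t_max <= t_ap0:
        # Even the most energy-efficient AP segmentation misses the deadline:
        # the task time has to be reset.
        return None

    if t_max < t_tm:
        # The TM flow is too slow: keep the AP flow; in the full method the
        # modeling parameters would be adjusted and the AP search repeated.
        return f"AP segmentation, energy {e_ap0}, time {t_ap0}"

    # Otherwise the TM cache segmentation meets the deadline and is preferred,
    # since the TM flow re-imports less data and therefore uses less energy.
    return f"TM segmentation, time {t_tm}"

print(select_configuration(t_max=1.0, t_tm=1.2,
                           ap_candidates=[(0.8, 500.0), (0.6, 650.0)]))
```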
In actual execution, in order to execute different layers at the same time, not only the number of PEs assigned to each task but also the allocation of the cache has to be adjusted. Since the total buffer size and the array size of the Thinker chip are given, and only vertical partitioning of the array is reasonable, the parameters that need to be searched include the number of array columns allocated to CONV (given as an image in the original publication), the data buffer size S_CONVdatabuffer and the weight buffer size S_convweightbuffer.
Modeling shows that the time overhead of network computation has a 'hiding effect': the time of a layer can be split into the computation time and the data transmission time, and the total time of the layer is the maximum of the two, while the data transmission time itself is the larger of the input/output transmission time and the weight transmission time. To study the time overhead further, the dominant factor of the time consumption of each layer has to be identified. Taking a typical LRCN neural network as an example, for FC operation the weight read/write time dominates in both the TM and the AP0 calculation flows, whereas for CONV operation the computation time basically dominates; the fully-connected layer has a small computation amount but a large data amount.
In the TM calculation flow, the data read/write time, the weight read/write time and the computation time differ greatly. The computation time of the convolutional layers far exceeds the data and weight read/write times, so while the PE array is busy computing for long stretches the data transfer is idle; during FC calculation, in contrast, the PE array is idle because of the long time spent on weight reading and writing. In the layer-by-layer time overhead of AP0, the data read/write time, the weight read/write time and the computation time are well balanced, so the total time can be reduced effectively even though the convolution-operation time, which was originally the smaller component, increases. The time overhead of the FC layer is dominated by weight reading and writing; to shorten the FC time, the proportion of the buffer allocated to the FC weights has to be increased, or the total time has to be reduced by increasing the total buffer size or the data transmission bandwidth.
Hardware parameters of the neural network energy consumption model and the neural network time model are also variable, so that the operation time and energy consumption of the same network under different hardware parameters can be tested by adjusting an algorithm, and certain help can be brought to the design of hardware. By adjusting various hardware parameters (such as bandwidth, cache size, PE array size, and number of channels), it is possible to obtain an optimal solution under the hardware area constraint, minimizing time and power consumption from a hardware design perspective.
Therefore, cache segmentation is carried out on the neural network time model using the TM calculation process, array segmentation is carried out on the neural network energy consumption model using the AP (array-partitioning) calculation process, and the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition is selected; this both helps future hardware design and allows different optimal solutions to be obtained under different task-time conditions.
In this embodiment, time and energy consumption modeling is performed on the neural network from the perspective of the hardware computing process of the network, time and energy consumption are predicted layer by layer, the modeling parameters that dominate the time and energy overheads are analysed, and a dual-objective optimization of time and energy consumption is performed on the neural network by adjusting the modeling parameters, the array segmentation method and the cache segmentation method, thereby improving the neural network model. Furthermore, with this layer-by-layer modeling, different optimal solutions can be obtained under different task-time conditions.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.
It should be noted that the present invention is not limited to the above-mentioned preferred embodiments, and those skilled in the art can obtain other products in various forms without departing from the spirit of the present invention, but any changes in shape or structure can be made within the scope of the present invention with the same or similar technical solutions as those of the present invention.

Claims (8)

1. A neural network optimization method, comprising:
presetting modeling parameters, wherein the modeling parameters comprise network parameters and hardware parameters;
constructing a neural network energy consumption model based on the modeling parameters;
constructing a neural network time model based on the modeling parameters;
performing dual-objective optimization on the neural network energy consumption model and the neural network time model, including:
performing cache segmentation on the neural network time model by using a TM calculation process;
carrying out array segmentation on the neural network energy consumption model by using an AP (array-partitioning) calculation process;
the selecting of the array segmentation method with the minimum corresponding energy consumption when the cache segmentation meets the requirement condition comprising: setting a task time Tmax;
calculating the time T_TM required when the neural network time model is partitioned using the TM calculation process cache segmentation;
calculating the energy consumption E_AP required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation;
selecting the minimum required energy consumption E_AP0 and calculating the time T_AP0 required to reach the minimum energy consumption E_AP0;
comparing Tmax and T_AP0, and judging whether the set task time meets the requirement according to the comparison result;
if the set task time meets the requirement, comparing Tmax with T_TM, and outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result.
2. The neural network optimization method according to claim 1, wherein the energy consumption calculation formula is E = V × T × e, where V is the data volume to be read/written/calculated, T is the number of times the data needs to be read/written/calculated repeatedly, and e is the unit energy consumption.
3. The neural network optimization method of claim 2, wherein the unit energy consumption comprises read-write energy consumption and computational energy consumption.
4. The neural network optimization method of claim 1, wherein the constructing the neural network energy consumption model comprises: and modeling the energy consumption of the neural network layer by layer to obtain a convolutional layer energy consumption model, a full-connection layer energy consumption model, an RNN layer energy consumption model and a pooling layer energy consumption model.
5. The neural network optimization method of claim 1, wherein the neural network time model comprises a time overhead, the time overhead comprising a convolutional layer time overhead and a fully-connected layer time overhead.
6. The neural network optimization method of claim 5, wherein the time overhead is calculated as Tz = max(T_IO, T_operation), where T_IO is the read/write time and T_operation is the computation time.
7. The neural network optimization method of claim 1, wherein the judging whether the set task time meets the requirement according to the comparison result comprises:
if Tmax > T_AP0, judging that the set task time meets the requirement;
otherwise, judging that the task cannot be completed within the specified time, and resetting the task time.
8. The neural network optimization method of claim 1, wherein the outputting a result of the dual-objective optimization of the neural network energy consumption model and the neural network time model according to the comparison result comprises:
if Tmax < T_TM, adjusting the modeling parameters and carrying out array segmentation on the neural network energy consumption model again;
calculating the energy consumption E_AP' required when the neural network energy consumption model is partitioned using the AP calculation process array segmentation, and selecting the modeling parameters and array segmentation method corresponding to the minimum energy consumption E_AP0' as the dual-objective optimization result;
otherwise, selecting the modeling parameters and the cache segmentation method corresponding to the TM calculation process as the dual-objective optimization result.
CN201811344189.0A 2018-11-13 2018-11-13 Neural network optimization method Withdrawn - After Issue CN109472361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811344189.0A CN109472361B (en) 2018-11-13 2018-11-13 Neural network optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811344189.0A CN109472361B (en) 2018-11-13 2018-11-13 Neural network optimization method

Publications (2)

Publication Number Publication Date
CN109472361A CN109472361A (en) 2019-03-15
CN109472361B true CN109472361B (en) 2020-08-28

Family

ID=65671820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811344189.0A Withdrawn - After Issue CN109472361B (en) 2018-11-13 2018-11-13 Neural network optimization method

Country Status (1)

Country Link
CN (1) CN109472361B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738318B (en) * 2019-09-11 2023-05-26 北京百度网讯科技有限公司 Network structure operation time evaluation and evaluation model generation method, system and device
CN110929860B (en) * 2019-11-07 2020-10-23 深圳云天励飞技术有限公司 Convolution acceleration operation method and device, storage medium and terminal equipment
CN111753950B (en) * 2020-01-19 2024-02-27 杭州海康威视数字技术股份有限公司 Forward time consumption determination method, device and equipment
CN112085195B (en) * 2020-09-04 2022-09-23 西北工业大学 X-ADMM-based deep learning model environment self-adaption method
CN112468533B (en) * 2020-10-20 2023-01-10 安徽网萌科技发展股份有限公司 Agricultural product planting-oriented edge learning model online segmentation method and system
CN113377546B (en) * 2021-07-12 2022-02-01 中科弘云科技(北京)有限公司 Communication avoidance method, apparatus, electronic device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102773981A (en) * 2012-07-16 2012-11-14 南京航空航天大学 Implementation method of energy-saving and optimizing system of injection molding machine
CN105302973A (en) * 2015-11-06 2016-02-03 重庆科技学院 MOEA/D algorithm based aluminum electrolysis production optimization method
CN106427589A (en) * 2016-10-17 2017-02-22 江苏大学 Electric car driving range estimation method based on prediction of working condition and fuzzy energy consumption

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10621486B2 (en) * 2016-08-12 2020-04-14 Beijing Deephi Intelligent Technology Co., Ltd. Method for optimizing an artificial neural network (ANN)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102773981A (en) * 2012-07-16 2012-11-14 南京航空航天大学 Implementation method of energy-saving and optimizing system of injection molding machine
CN105302973A (en) * 2015-11-06 2016-02-03 重庆科技学院 MOEA/D algorithm based aluminum electrolysis production optimization method
CN106427589A (en) * 2016-10-17 2017-02-22 江苏大学 Electric car driving range estimation method based on prediction of working condition and fuzzy energy consumption

Also Published As

Publication number Publication date
CN109472361A (en) 2019-03-15

Similar Documents

Publication Publication Date Title
CN109472361B (en) Neural network optimization method
CN108564168B (en) Design method for neural network processor supporting multi-precision convolution
US20220051087A1 (en) Neural Network Architecture Using Convolution Engine Filter Weight Buffers
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
US11775430B1 (en) Memory access for multiple circuit components
US11500959B2 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
JP2020521195A (en) Scheduling neural network processing
Zhou et al. Transpim: A memory-based acceleration via software-hardware co-design for transformer
CN110738316B (en) Operation method and device based on neural network and electronic equipment
Mittal A survey of accelerator architectures for 3D convolution neural networks
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
KR20180123846A (en) Logical-3d array reconfigurable accelerator for convolutional neural networks
CN113361695B (en) Convolutional neural network accelerator
CN111105023A (en) Data stream reconstruction method and reconfigurable data stream processor
TWI775210B (en) Data dividing method and processor for convolution operation
CN112380793A (en) Turbulence combustion numerical simulation parallel acceleration implementation method based on GPU
Yan et al. FPGAN: an FPGA accelerator for graph attention networks with software and hardware co-optimization
WO2019182059A1 (en) Model generation device, model generation method, and program
CN115668222A (en) Data processing method and device of neural network
US20230025068A1 (en) Hybrid machine learning architecture with neural processing unit and compute-in-memory processing elements
CN113031853A (en) Interconnection device, operation method of interconnection device, and artificial intelligence accelerator system
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
KR20240036594A (en) Subsum management and reconfigurable systolic flow architectures for in-memory computation
US11954580B2 (en) Spatial tiling of compute arrays with shared control
CN113392959A (en) Method for reconstructing architecture in computing system and computing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
AV01 Patent right actively abandoned

Granted publication date: 20200828
Effective date of abandoning: 20210125