WO2024124866A1 - Data processing method and electronic device - Google Patents


Info

Publication number
WO2024124866A1
Authority
WO
WIPO (PCT)
Prior art keywords
operations
bit
arithmetic
neural network
multiplication
Prior art date
Application number
PCT/CN2023/103990
Other languages
French (fr)
Inventor
Aydarkhanov RUSLAN
Kirill Igorevich SOLODSKIKH
Vladimir Mikhailovich KRYZHANOVSKIY
Dehua SONG
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Publication of WO2024124866A1 publication Critical patent/WO2024124866A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/48Indexing scheme relating to groups G06F7/48 - G06F7/575
    • G06F2207/4802Special implementations
    • G06F2207/4818Threshold devices
    • G06F2207/4824Neural networks

Definitions

  • Embodiments of the present application relate to the field of artificial intelligence (AI) , and more specifically, to a data processing method and an electronic device.
  • Accumulators are widely used in various computing devices. An accumulator with a lower bit width requires less computation and occupies a smaller area on a computing device, such as a chip. However, saturation or overflow will take place when a result exceeds the bit width range of the accumulator, which will distort the result. The lower the accumulator bit width, the higher the probability of saturation or overflow.
  • A neural network is taken as an example. A neural network contains operations that include dot products between two vectors. When the result of a dot product exceeds the bit width range of the accumulator, saturation or overflow occurs. The distorted result will affect the performance of the neural network. If the rate of saturation or overflow is high enough, the neural network performance will degrade significantly.
  • Embodiments of the present application provide a data processing method and an electronic device.
  • the technical solutions can be conducive to avoiding saturation and/or overflow when accumulation is performed.
  • an embodiment of the present application provides a data processing method including: performing one or more multiplication operations and one or more accumulation operations on multiple operands; and performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations to obtain a result shifted N bit (s) to the right relative to a result of a first arithmetic operation on the multiple operands, where the first arithmetic operation includes the one or more multiplication operations and the one or more accumulation operations, and N is a positive integer.
  • the multiplication can be performed on two operands.
  • the objects of the accumulation are based on the results of the multiplication.
  • the first arithmetic operation refers to an arithmetic operation that can be implemented by multiplication and accumulation.
  • the first arithmetic operation satisfies a*b+c, where a, b and c are the multiple operands.
  • the first arithmetic operation is any of the following types of arithmetic operations: a dot product operation, a matrix multiplication operation, or a convolution operation.
  • the first arithmetic operation can be an integer arithmetic operation.
  • the numbers in the multiple operands are integers.
  • the one or more accumulation operations can be implemented with M-bit arithmetic.
  • N can be used to make the result within a range represented by M bits.
  • M is a positive integer.
  • N can be used to make the result within a range represented by M bits, or N can be used to make the result fit into the accumulator bit width. It is conducive to avoiding saturation and/or overflow when accumulation is achieved with M-bit arithmetic.
  • the solutions in the embodiment of the present application are conducive to avoiding saturation and/or overflow when accumulation is achieved with a lower bit width.
  • the right bit shift inserted into the first arithmetic operation allows the accumulation to be achieved with lower bit-width accumulators, thereby reducing the requirement for the capacity of the accumulators.
  • the reduction of the capacity of the accumulator can lead to faster computing speed, lower power consumption and smaller chip area.
  • the multiple operands are K-bit integers
  • the one or more accumulation operations can be implemented with M-bit integer arithmetic, M < 4*K, K is a positive integer, and M is a positive integer.
  • M = 2*K.
  • the performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations includes: performing an N-bit right shift operation after every multiplication.
  • the one or more accumulation operations are implemented as cascade summations involving multiple summation iterations
  • the performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations includes: performing right bit shift operations before at least two summation iterations of the multiple summation iterations.
  • the incremental bit shift mode enables the right-most bit to be truncated gradually during the accumulation. Compared with truncating the right-most N bit (s) after multiplication at one time, the precision of the results obtained by this mode is higher. In other words, the incremental bit shift mode can significantly reduce errors compared to the single N-bit shift mode.
  • the performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations includes: performing right bit shift operations before every summation iteration.
  • the data processing method is applied to inference of a quantized neural network, and the quantized neural network includes the first arithmetic operation.
  • models implemented with lower bit-width accumulators can be obtained quickly by inserting right bit shift operations into the first arithmetic operations of the original quantized neural network.
  • the embodiment of the present application can work without additional training, which leads to an efficient and fast conversion.
  • the models implemented with lower bit-width accumulators have similar performance to the original quantized neural network. If the original quantized neural network is robust to small perturbation, the models converted by the method in the embodiment of the present application will keep their performance similar to the original quantized neural network using one or more greater bit-width accumulators.
  • the N corresponding to all first arithmetic operations in a first layer of the quantized neural network are the same.
  • the N corresponding to at least two first arithmetic operations in a second layer of the quantized neural network are different.
  • N corresponding to at least two first arithmetic operations can be determined separately, which is conducive to finding more appropriate N, so as to avoid overflow and ensure performance.
  • N is determined according to the maximum number of bits used to represent the results of the J first arithmetic operations based on a first dataset, where J is a positive integer.
  • N can be obtained with a small calibration dataset, which can lead to an efficient and fast conversion from the original quantized neural networks to the quantized neural networks including first arithmetic operations with bit shift operations.
  • an embodiment of the present application provides a training method of a neural network implemented by one or more computing devices, including: training a first neural network, where in forward propagation during the training, the following steps are performed on a result of a first arithmetic operation in the first neural network: dividing the result of the first arithmetic operation by a scale S, where S>0; rounding the division result; and multiplying the rounding result by the scale S, where the multiplication result is used to replace the result of the first arithmetic operation in the forward propagation.
  • Right bit shift operations inserted into a first arithmetic operation may degrade the performance of models sensitive to perturbations introduced by bit shift.
  • precision reduction caused by right bit shift operations is simulated in the training process.
  • the models sensitive to perturbations introduced by bit shift could be trained by the simulation of bit shift for floating point numbers in the neural network to adapt to these perturbations, which is conducive to keeping the inference performance of the quantized neural network using one or more smaller sized accumulators similar to the original quantized neural network using one or more greater sized accumulators.
  • the first arithmetic operation is any of the following types of arithmetic operations: a dot product operation, a matrix multiplication operation, or a convolution operation.
  • the scale S is used to make the rounding result within a range represented by M bits, and M is a positive integer.
  • M is a bit width of an accumulator used for the first arithmetic operation in a quantized neural network corresponding to the first neural network.
  • multiple operands of the first arithmetic operation in the quantized neural network are K-bit integers, where M < 4*K, K is a positive integer, and M is a positive integer.
  • M = 2*K.
  • an embodiment of the present application provides an electronic device, including a function of implementing the method in the first aspect.
  • the function may be implemented by hardware, or may be implemented by the hardware executing corresponding software.
  • the hardware or the software includes one or more modules corresponding to the function.
  • an embodiment of the present application provides an electronic device, including a function of implementing the method in the second aspect.
  • the function may be implemented by hardware, or may be implemented by the hardware executing corresponding software.
  • the hardware or the software includes one or more modules corresponding to the function.
  • an embodiment of the present application provides a computer-readable storage medium having instructions which, when run on a computer, cause the computer to perform the method in the first aspect or any possible design of the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium having instructions which, when run on a computer, cause the computer to perform the method in the second aspect or any possible design of the second aspect.
  • an electronic device including a processor and a memory.
  • the processor is connected to the memory, the memory is configured to store instructions, and the processor is configured to execute the instructions.
  • the processor executes the instructions stored in the memory, the processor is caused to perform the method in any possible design of the first aspect or the second aspect.
  • a chip system including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to invoke the computer program from the memory and run the computer program, so that a server on which a chip is disposed performs the method in any possible design of the first aspect or the second aspect.
  • the chip can include a circuit.
  • a computer program product which, when run on an electronic device, causes the electronic device to perform the method in any possible design of the first aspect or the second aspect.
  • an embodiment of the present application provides a computer-readable storage medium storing instructions for execution by a device, where the instructions include: performing a multiplication operation on a first operand and a second operand; performing a right bit shift operation on a result of the multiplication operation; and performing an accumulation operation on a result of the right bit shift operation and a third operand.
  • the first operand and the second operand are K-bit integers, where the accumulation operation is implemented with M-bit integer arithmetic, M < 4*K, K is a positive integer, and M is a positive integer.
  • M = 2*K.
  • FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a system according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an example of a neural network layer.
  • FIG. 4 is a flowchart of an embodiment of a data processing method according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an example of a dot product according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an example of a dot product with an incremental bit shift mode according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the representative distribution of the maximum bit of all accumulators of a single neural network layer.
  • FIG. 8 is a schematic diagram of the representative distribution of the maximum bit of all accumulators of a single neural network layer according to an embodiment of the present application.
  • FIG. 9 is a flowchart of an example method for training a neural network according to an embodiment of the present application.
  • FIG. 10 is a schematic block diagram of an electronic device 1800 according to an embodiment of the present application.
  • FIG. 11 is a schematic block diagram of an electronic device 1900 according to an embodiment of the present application.
  • A fully-connected (FC) layer, also called a linear layer, is the simplest and oldest neural network block, and it is still frequently used in modern neural networks.
  • the FC layer performs linear projection of input data X of a shape N_b × C_in with a weight tensor W of a shape C_in × C_out and produces output data Y of a shape N_b × C_out, where N_b is a size of a batch, C_in is the number of input data channels, and C_out is the number of output data channels.
  • the input data and the output data are 2-dimensional (2D) tensors in this example; however, they can be of any shape as long as the last dimension (the feature space size) is equal to C_in.
  • In that case, all leading dimensions can be flattened into a single one before the layer is applied and unflattened afterwards.
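  • The following minimal NumPy sketch is not part of the original description; the sizes and variable names are illustrative. It shows the FC projection and the flatten/unflatten handling described above:
```python
import numpy as np

# Illustrative sizes; any batch shape works as long as the last dimension equals C_in.
N_b, C_in, C_out = 4, 16, 8

X = np.random.randn(N_b, C_in)     # input data of shape N_b x C_in
W = np.random.randn(C_in, C_out)   # weight tensor of shape C_in x C_out
Y = X @ W                          # linear projection, output data of shape N_b x C_out
assert Y.shape == (N_b, C_out)

# Higher-dimensional inputs: flatten all leading dimensions, apply the layer, then restore the shape.
X3 = np.random.randn(2, 3, C_in)
Y3 = (X3.reshape(-1, C_in) @ W).reshape(2, 3, C_out)
```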
  • a CNN is a subclass of deep neural networks (DNN) which typically consists of the following operations: convolution, batch normalization, non-linear activation (sigmoid, ReLU, etc. ) , fully-connected layers, element-wise operations (addition, multiplication) , down sampling (max pooling, average pooling) and up sampling (transposed convolution, interpolation) .
  • the convolution is a widely-used operation in a modern neural network which convolves input data of a shape N_b × C_in × H_in × W_in (or N_b × H_in × W_in × C_in in some implementations) with a convolutional kernel of a size K_e × K_e × C_in × C_out and produces output data of a shape N_b × C_out × H_out × W_out, where N_b is the size of the batch, H_in and H_out are a height of the input data and a height of the output data, respectively, W_in and W_out are a width of the input data and a width of the output data, respectively, C_in is the number of the input data channels, and C_out is the number of the output data channels ("filters").
  • Each of the C_out filters is a rectangular cuboid of a size K_e × K_e × C_in which slides along the image and is point-wise multiplied with the underlying pixels. The values are added together, and a single number is provided as the output for each cuboid at each location. H_out and W_out depend on the size of the kernel, the stride (the size of the sliding window step) and the padding (extra pixels added at the edges of the image).
  • the constructed convolutions have inductive biases: locality and translation equivariance, which allows them to perform better on spatially correlated data (images).
  • a CNN is usually constructed in a pyramidal way. It starts from high resolution input data with a small number of input data channels. At each stage, the resolution is decreased by some factors via pooling or convolution with the stride > 1, and the number of channels is increased to compress information inside a local embedding.
  • One of the most well- known examples of CNN architectures of such kind is a residual neural network (ResNet) family.
  • There are two major numerical types implemented in hardware: integer and floating-point types. The integer type is used to represent integer numbers, such as 0, 1, 2, or 3. The floating-point type is used to represent real numbers, such as 0.1, -3.1, or 100.0134.
  • Deep learning has reached significant breakthroughs in many practical problems, such as computer vision, natural language processing, and speech recognition.
  • computer vision tasks can include classification, Face ID, person re-identification, car brand recognition, object detection, semantic and instance segmentation, or the like.
  • neural networks As a main instrument of the deep learning, neural networks have been widely used in various fields.
  • the following is a simplified example to illustrate the neural network model.
  • the description vector could be some characteristics of objects to be processed by a neural network.
  • the neural network can be a classifier used to determine whether to give the credit to a client.
  • the description vector could be some characteristics of the client.
  • the classifier will determine to give the credit to the client; otherwise, the classifier will determine not to give the credit to the client.
  • a neural network usually contains thousands or millions of parameters represented by the floating-point type, which means there are a large number of floating-point operations. If floating-point arithmetic can be switched to integer arithmetic, there will be significant acceleration. But in general, it is not so straightforward to convert the floating-point parameters vector and description vector (or inputs) to an integer parameters vector and description vector by scaling as in the example above.
  • Quantization is a powerful method for acceleration and compression of deep neural network (DNN) models.
  • In quantization, some of the weights and/or activations are represented as low-precision integer numbers instead of 32-bit floating-point values.
  • Quantization enables deep neural networks to be deployed on more types of devices, such as edge devices or embedded devices.
  • Quantized NN requires less storage space, which means it can be more easily distributed over resource-limited devices.
  • a lower bit width can also be used if hardware supports low-bit calculations. But the typical problem is that the quality of the neural network can degrade, so it needs to be either fine-tuned or retrained from scratch.
  • Quantization-aware training (QAT) trains the network with quantization in the loop, so that the network parameters can better adapt to the loss caused by quantization.
  • In QAT, a fake quant module is inserted into the network to simulate the rounding and clamping operations of the quantized model in the inference process, so as to improve the adaptability of the network to the effects of quantization during training and obtain higher precision for the quantized network.
  • During QAT, all calculations are implemented with floating-point arithmetic, and the quantized neural network can be obtained after QAT.
  • the result w_q after processing by the fake quant module can meet the following formula: w_q = clamp(round(w/s), q_min, q_max) × s.
  • w is a floating-point weight or activation.
  • s is the quantization scale.
  • q_max is the maximum value of the quantization range, and q_min is the minimum value of the quantization range. round() represents the rounding function.
  • clamp() represents the clamping function, where clamp can meet the following formula: clamp(x, q_min, q_max) = min(max(x, q_min), q_max).
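  • The following minimal Python sketch is not part of the original description; it illustrates the fake quant step with the variables w, s, q_min and q_max used above, assuming the standard fake-quantization form:
```python
import numpy as np

def fake_quant(w, s, q_min, q_max):
    """Simulate quantization of a floating-point weight/activation w with scale s."""
    q = np.clip(np.round(w / s), q_min, q_max)   # round to the integer grid and clamp to the range
    return q * s                                  # map back to a floating-point value

# Example: 8-bit quantization with an illustrative scale; the last value is clamped to q_max.
w = np.array([0.13, -0.72, 2.0])
print(fake_quant(w, s=0.01, q_min=-128, q_max=127))   # [ 0.13 -0.72  1.27]
```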
  • In back propagation (BP), the rounding operation is not differentiable, so a straight-through estimator (STE) is typically used to pass gradients through the fake quant module.
  • FIG. 1 depicts a schematic diagram of a system architecture 100.
  • a data acquisition device 160 is configured to collect training data.
  • the training data can include a training image and a processing result corresponding to the training image, such as a classification result corresponding to the training image, which can be a result of manual pre-annotation.
  • After the training data is collected, a database 130 obtains the training data from the data acquisition device 160, and a training device 120 trains a target model/rule 101 based on the training data maintained in the database 130.
  • the target model/rule 101 can be a neural network model provided by the embodiment of the present application. It should be noted that in practical applications, the training data maintained in the database 130 may not be collected by the data acquisition device 160, but may also be received from other devices. In addition, it should be noted that the training device 120 does not necessarily train the target model/rule 101 completely based on the training data maintained in the database 130. It is also possible to obtain training data from the cloud or other places for model training. The above description should not be used as a limitation to the embodiment of the present application.
  • the target model/rule 101 trained according to the training device 120 can be applied to different systems or devices, such as an execution device 110 shown in FIG. 1.
  • the execution device 110 can be a terminal, such as a mobile terminal, a tablet computer, a laptop computer, an augmented reality (AR) device, a virtual reality (VR) device, an on-board terminal, or a server or cloud terminal.
  • the execution device 110 configures an input/output (I/O) interface 112 for data interaction with external devices.
  • the data can be input to the I/O interface 112 through a client device 140.
  • the input data in the embodiment of the present application can include data to be processed input by the client device 140.
  • the execution device 110 can call data, code, or the like, in a data storage system 150 for corresponding processing by a calculation module 111, and can also store the corresponding processed data, instructions, or the like, in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the processing result of the data obtained above, to the client device 140, so as to provide it to a user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be configured to achieve the above goals or tasks.
  • FIG. 1 is only a schematic diagram of a system architecture provided by the embodiment of the application.
  • a positional relationship between devices and modules shown in FIG. 1 does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 can also be placed in the execution device 110.
  • the embodiment of the present application provides a system 200.
  • the system 200 includes a local device 201, a local device 202, an execution device 210 and a data storage system 250, where the local device 201 and the local device 202 are connected with the execution device 210 through a communication network.
  • the execution device 210 may be implemented by one or more servers. Alternatively, the execution device 210 may be configured in conjunction with other computing devices, such as data storage devices, routers, load balancers, or the like. The execution device 210 may be provided on a physical site or distributed on a plurality of physical sites. The execution device 210 may use data in the data storage system 250 or call a program code in the data storage system 250 to implement the method provided by the embodiment of the present application, the calculation of the neural network provided by the embodiment of the present application, or the training of the neural network provided by the embodiment of the present application.
  • Each local device can represent any computing device, such as a personal computer, a computer workstation, a smart phone, a tablet computer, a smart camera, a smart car or other types of cellular phone, a media consumption device, a wearable device, a set-top box or a game console.
  • the local device can interact with the execution device 210 through a communication network of any communication mechanism/communication standard.
  • the communication network can be a wide area network, a local area network, a point-to-point connection, or any combination of them.
  • the local device 201 and the local device 202 acquire relevant parameters of the neural network from the execution device 210, deploy the neural network on the local device 201 and the local device 202, and use the neural network for data processing like image processing, speech processing or text processing.
  • the neural network can be directly deployed on the execution device 210.
  • the execution device 210 obtains the data to be processed from the local device 201 and the local device 202, and processes the data using the neural network.
  • the execution device 210 can be a cloud device. In this case, the execution device 210 can be deployed on the cloud. Alternatively, the execution device 210 can be a terminal device. In this case, the execution device 210 can be deployed on the user terminal.
  • Accumulators are widely used in various computing devices. An accumulator with a lower bit width requires less computation, which means it is faster and more power-efficient, and it occupies a smaller area on a computing device, such as a chip. However, saturation or overflow will take place when a result exceeds the bit width range of the accumulator, which will distort the result. The lower the accumulator bit width, the higher the probability of saturation or overflow.
  • a neural network is taken as an example. There are some operations including dot products in the neural network.
  • the most commonly used neural network layers are linear layers and convolutional layers containing matrix multiplication operations and convolution operations, both of which consist of a set of dot products between two vectors.
  • the dot product is done incrementally by multiplying two numbers at a time and adding the result to the previously accumulated sum which is called an accumulator.
  • FIG. 3 depicts a schematic diagram of an example of a neural network layer.
  • inputs and weights are quantized to 8-bit integers.
  • the multiplication result of two 8-bit integer numbers requires 16 bits, and the sum of these intermediate multiplications may require an expanded bit width depending on the length of the input vectors and the actual values.
  • a 32-bit integer accumulator is used with 8-bit integer inputs, because its capacity is sufficient in practical scenarios.
  • the accumulator is implemented as a 32-bit integer which means all the additions in the dot product are completed with 32-bit integer arithmetic.
  • the results of the intermediate multiplications in the dot product are accumulated in the 32-bit integer. For example, as shown in FIG. 3, the inputs and weights are the vector (I1, I2, I3, I4) and the vector (W1, W2, W3, W4), respectively.
  • I1, I2, I3, I4, W1, W2, W3 and W4 are represented as 8-bit integers.
  • the values of the corresponding position in the two vectors are multiplied to obtain C1, C2, C3 and C4.
  • C1, C2, C3 and C4 are represented as 16-bit integers.
  • C1, C2, C3 and C4 are added to obtain A.
  • A is represented as a 32-bit integer in FIG. 3. If the input of the next layer needs to be quantized into an 8-bit integer, the result of the dot product can be quantized into 8-bit integer outputs as shown in FIG. 3.
  • When the result exceeds the bit width range of the accumulator, saturation or overflow occurs. The distorted result will affect the performance of the neural network. If the rate of saturation or overflow is high enough, the neural network performance will degrade significantly. In some related technical solutions, the bit width of the inputs is further reduced. A smaller bit width of the inputs reduces the bit width of the intermediate multiplications and, therefore, of the final accumulator. However, the bit width of the inputs affects the precision of the inputs, and the additional loss of precision will lead to a quality drop.
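  • The following Python simulation is not part of the original description; the values and helper names are illustrative. It shows how a dot product of 8-bit integers can overflow a 16-bit accumulator and distort the result:
```python
def wrap_int16(v):
    """Wrap a Python integer into the signed 16-bit two's-complement range."""
    v &= 0xFFFF
    return v - 0x10000 if v >= 0x8000 else v

def dot_int8_acc16(x, w):
    """Dot product of 8-bit integers accumulated with simulated 16-bit wrap-around arithmetic."""
    acc = 0
    for xi, wi in zip(x, w):
        acc = wrap_int16(acc + xi * wi)   # every addition is forced into the 16-bit range
    return acc

x = [120] * 64
w = [120] * 64
print(dot_int8_acc16(x, w))                 # distorted by overflow (prints 4096)
print(sum(a * b for a, b in zip(x, w)))     # the exact value 921600 needs more than 16 bits
```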
  • the embodiment of the present application provides a data processing method to avoid computational errors introduced by saturation and/or overflow by inserting the bit shift between multiplication and accumulation.
  • FIG. 4 depicts a flow chart diagram of an example method 400 for data processing according to an embodiment of the present application.
  • the method 400 shown in FIG. 4 can be executed by one or more computing devices.
  • computing devices can be cloud service devices or terminal devices, such as computers or servers, or a system composed of cloud service devices and terminal devices.
  • the method 400 may be performed by the execution device 110 in FIG. 1, or the execution device 210 or the local device in FIG. 2.
  • computing devices can be chips or circuits configured in cloud service devices or terminal devices.
  • step number in the embodiment of the present application is only for the convenience of description, and does not limit the execution order of the steps.
  • the method 400 shown in FIG. 4 includes at least step 401 and step 402.
  • the multiplication can be performed on two operands.
  • the objects of the accumulation are based on the results of the multiplication.
  • the method 400 can be understood as follows: insert one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations originally included in the first arithmetic operation. “Multiply-shift-accumulate” is used as a building block of the first arithmetic operation instead of the standard “multiply-accumulate” (MAC) operation. The result of the first arithmetic operation with one or more right bit shift operations inserted inside is shifted N bit(s) to the right compared with the result of the first arithmetic operation without the one or more right bit shift operations.
  • the one or more right shift operations inserted into the first operation can also be referred to as the one or more right shift operations inside the first operation.
  • the final accumulated result is shifted N bit (s) to the right relative to the first arithmetic operation result.
  • the final accumulated result is the result obtained in step 402.
  • the first arithmetic operation refers to an arithmetic operation that can be implemented by multiplication and accumulation.
  • the first arithmetic operation satisfies a*b+c, where a, b and c are the multiple operands.
  • the first arithmetic operation is any of the following types of arithmetic operations: a dot product operation, a matrix multiplication operation, or a convolution operation.
  • Matrix multiplication and convolution can be regarded as including a set of dot products.
  • the embodiments of the present application take the first arithmetic operation as the dot product operation as an example to explain, without limiting the solutions in the embodiments of the present application.
  • the multiple operands of the dot product can be two vectors.
  • the dot product is done incrementally by multiplying two numbers from the two vectors at a time and adding the result to the previously accumulated sum stored in an accumulator to update the accumulated sum in the accumulator. Multiplication on the corresponding numbers in the two vectors is performed. The objects of the accumulation are based on the results of multiplication.
  • When saturation or overflow occurs, the most significant left-most bits are discarded whereas the least significant right-most bits are kept.
  • one or more right bit shift operations are performed between multiplication and accumulation. This helps to keep the most significant left-most bits of the first arithmetic result, which is conducive to avoiding saturation and/or overflow when accumulation is performed.
  • the embodiment of the present application does not require changing the effective bit width of the first arithmetic operation inputs, which avoids the loss of precision caused by the reduction of the bit width of the inputs, and is conducive to ensuring the calculation quality.
  • the first arithmetic operation is an integer arithmetic operation.
  • the numbers in the multiple operands are integers.
  • the one or more accumulation operations can be implemented with M-bit arithmetic.
  • N can be used to make the result within a range represented by M bits.
  • M is a positive integer.
  • M is the bit width of one or more accumulators used to perform the one or more accumulation operations.
  • N can be used to make the result within a range represented by M bits, or N can be used to make the result fit into the accumulator bit-width. It is conducive to avoiding saturation and/or overflow when accumulation is achieved with M-bit arithmetic.
  • the solutions in the embodiment of the present application are conducive to avoiding saturation and/or overflow when accumulation is achieved with a lower bit width.
  • the right bit shift inserted into the first arithmetic operation allows the accumulation to be achieved with lower bit-width accumulators, thereby reducing the requirement for the capacity of the accumulators.
  • the reduction of the capacity of the accumulator can lead to faster computing speed, lower power consumption, and a smaller chip area.
  • the multiple operands of the first arithmetic operation can be K-bit integers, and the one or more accumulation operations can be implemented with M-bit integer arithmetic, where M < 4*K and K is a positive integer.
  • M = 2*K.
  • the one or more right bit shift operations in step 402 are described as follows.
  • step 402 can include: performing an N-bit right shift operation after every multiplication.
  • the bit shift mode described above can be called a single bit shift mode.
  • a single N-bit right shift operation is performed on the result of every multiplication.
  • the result of the final accumulation can be shifted N bit (s) to the right relative to the first arithmetic operation result.
  • the result of the final accumulation can be a value shifted by N bit (s) to the right relative to the dot product of the two vectors.
  • FIG. 5 depicts a schematic diagram of an example of a dot product with an N-bit shift.
  • multiple operands of a dot product operation are vector#1 (I1, I2, I3, I4) and vector#2 (W1, W2, W3, W4) .
  • Multiplication is performed on the values of the corresponding position in the two vectors to obtain C1, C2, C3, and C4.
  • the results of the multiplication (C1, C2, C3, and C4) are shifted N bit (s) to the right respectively to obtain S1, S2, S3, and S4.
  • S1, S2, S3, and S4 are added to obtain A’.
  • A’ is a value shifted by N bit (s) to the right.
  • I1, I2, I3, I4, W1, W2, W3 and W4 can be represented as 8-bit integers.
  • C1, C2, C3, and C4 are represented as 16-bit integers.
  • S1, S2, S3, and S4 are represented as 16-bit integers.
  • A’ is represented as a 16-bit integer. In this case, M can be 16.
  • FIG. 5 is only an example and does not limit the solutions in the embodiments of the present application.
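  • The following minimal Python sketch is not part of the original description; the vector values and the choice n = 5 are illustrative. It shows the single bit shift mode of FIG. 5, in which every product is shifted N bit(s) to the right before accumulation so that the running sum fits into a 16-bit accumulator:
```python
def dot_shift_acc16(x, w, n):
    """Multiply-shift-accumulate: each product is shifted n bits to the right before accumulation."""
    acc = 0
    for xi, wi in zip(x, w):
        shifted = (xi * wi) >> n              # discard the n right-most bits of the 16-bit product
        acc += shifted
        assert -0x8000 <= acc <= 0x7FFF       # the running sum now fits into a 16-bit accumulator
    return acc                                # shifted n bits to the right relative to the exact dot product

x = [120] * 64
w = [120] * 64
print(dot_shift_acc16(x, w, n=5))             # 28800, no overflow in a 16-bit accumulator
print(sum(a * b for a, b in zip(x, w)) >> 5)  # 921600 >> 5 = 28800
```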
  • the one or more accumulation operations can be implemented as cascade summations involving multiple summation iterations.
  • step 402 can include: performing right bit shift operations before at least two summation iterations of the multiple summation iterations.
  • the bit shift mode described above can be called an incremental bit shift mode.
  • right bit shift operations are done incrementally before at least two summation iterations.
  • the total number of bits of right bit shift before the at least two summation iterations is N, in which way, the result of the final accumulation can be shifted N bit (s) to the right relative to the first arithmetic operation result.
  • the result of the final accumulation can be a value shifted by N bit (s) to the right relative to the dot product of the two vectors.
  • the number of bits shifted before the at least two summation iterations can be the same or different.
  • step 402 can include: performing right bit shift operations before every summation iteration.
  • the number of bits shifted before different summation iterations can be the same or different.
  • step 402 can include: performing N_1-bit right shift operations before every summation iteration.
  • N_1 is a positive integer.
  • N_1 can be 1.
  • N can be equal to the number of the summation iterations in the cascade summations.
  • FIG. 6 depicts a schematic diagram of an example of a dot product with an incremental 2-bit shift.
  • multiple operands of a dot product operation are vector (I1, I2, I3, I4) and vector (W1, W2, W3, W4) .
  • Multiplication is performed on the values of the corresponding position in the two vectors to obtain C1, C2, C3, and C4.
  • In FIG. 6, there are two summation iterations. Before the first summation iteration, the results of the multiplication (C1, C2, C3, and C4) are each shifted 1 bit to the right to obtain S’1, S’2, S’3, and S’4. In the first summation iteration, S’1 and S’2 are added to obtain A”1, and S’3 and S’4 are added to obtain A”2.
  • Before the second summation iteration, A”1 and A”2 are each shifted 1 bit to the right to obtain S”1 and S”2.
  • In the second summation iteration, S”1 and S”2 are added to obtain A”’.
  • A”’ is a value shifted by 2 bits to the right.
  • I1, I2, I3, I4, W1, W2, W3 and W4 can be represented as 8-bit integers.
  • C1, C2, C3, C4, S’1, S’2, S’3, S’4, A”1, A”2, S”1 and S”2 are represented as 16-bit integers.
  • A”’ is represented as a 16-bit integer.
  • M can be 16.
  • FIG. 6 is only an example and does not limit the solutions in the embodiments of the present application.
  • the incremental bit shift mode enables the right-most bit to be truncated gradually during the accumulation. Compared with truncating the right-most N bit (s) after multiplication at one time, the precision of the results obtained by this mode is higher. In other words, the incremental bit shift mode can significantly reduce errors compared to the single N-bit shift mode.
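  • The following minimal Python sketch is not part of the original description; the values are illustrative and, for simplicity, the vector length is assumed to be a power of two. It shows the incremental bit shift mode of FIG. 6, with a 1-bit right shift before every summation iteration of a cascade summation:
```python
def dot_incremental_shift(x, w, bits_per_iteration=1):
    """Cascade summation with a right shift before every summation iteration."""
    values = [xi * wi for xi, wi in zip(x, w)]         # intermediate products (16-bit for 8-bit inputs)
    assert len(values) and len(values) & (len(values) - 1) == 0, "power-of-two length for simplicity"
    total_shift = 0
    while len(values) > 1:
        values = [v >> bits_per_iteration for v in values]                      # shift before this iteration
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]  # pairwise sums
        total_shift += bits_per_iteration
    return values[0], total_shift

x = [100, -50, 75, 20]
w = [30, 40, -10, 25]
result, n = dot_incremental_shift(x, w)
print(result, n)                                # 187, shifted n = 2 bits to the right
print(sum(a * b for a, b in zip(x, w)) >> n)    # single-shift reference: 750 >> 2 = 187
```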
  • the embodiment of the present application is applicable to any arithmetic operations containing multiplication and accumulation, such as a dot product, which could benefit from the reduced capacity accumulator. Moreover, it can be applied to any bit-width setups.
  • the method 400 can be applied to inference of a quantized neural network.
  • the quantized neural network includes the first arithmetic operation.
  • Input data and output data of a quantized neural network are related to a task of the quantized neural network.
  • the input of the quantized neural network can be image data.
  • the computer vision task can include image classification, image detection, image segmentation, and image recognition or image generation.
  • the computer vision task can be image classification, in which case, the output of the quantized neural network can be used to indicate the classification to which the image belongs.
  • the computer vision task can be image recognition, in which case the output of the quantized neural network can be used to confirm the identity of the object in the image.
  • the input of the quantized neural network can be text data.
  • the text processing task can include text recognition or text translation.
  • the text processing task can be text translation, in which case, the output of the quantized neural network can be the translation result of the input text.
  • the input of the quantized neural network can be voice data.
  • the voice processing task can include speech recognition, in which case, the output of the quantized neural network can be the recognition result of the input voice.
  • a type of quantized neural network in the embodiment of the present application is not limited.
  • FIG. 7 depicts a schematic diagram of the representative distribution of the maximum bit of all accumulators of a single neural network layer in a quantized neural network.
  • FIG. 8 depicts a schematic diagram of the representative distribution of the maximum bit of all accumulators of a single neural network layer based on the solutions in the embodiment of the present application.
  • the above example distribution is obtained from a quantized neural network with 8-bit integer weights and activations and 32-bit accumulators.
  • In this example, the values in the accumulators may need up to 18 bits to be represented.
  • With 16-bit accumulators, some of the accumulators would experience overflow.
  • A 2-bit right shift can be inserted into the dot products of the layer according to the solution in the embodiment of the application. In this case, as shown in FIG. 8, the distribution of the maximum bit stays very similar but is shifted by 2 bits to the right, which can fit into a 16-bit integer type.
  • the output of a network layer in the quantized neural network is typically re-quantized to become an input to the following network layer.
  • the numbers in the accumulator with a large bit width will be scaled down to fit into the type of a smaller bit width.
  • the output of the network layer is represented as a 32-bit integer, which does not match the bit width of the input of the following network layer.
  • the 32-bit integer needs to be re-quantized to an 8-bit integer. In the case of scaling integers by the multiples of 2, it is equivalent to a right bit shift operation.
  • the method 400 When the method 400 is applied to a network layer of a quantized neural network, the output of the network layer will be right-shifted.
  • the output of the original quantized neural network can be approximately obtained by adjusting the re-quantization scales and offsets accordingly. For every bit shifted, the re-quantization parameters need to be scaled by 2.
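  • The following hedged sketch is not part of the original description; the re-quantization form out = round(acc × scale) + zero_point and the parameter names are assumptions of this illustration. It shows how the re-quantization scale can be adjusted by a factor of 2 for every shifted bit:
```python
def adjust_requant_scale(scale, n):
    """Compensate a multiplicative re-quantization scale for an accumulator result that is
    already shifted n bits to the right: the accumulated value is ~2**n times smaller, so
    the scale grows by 2**n (illustrative; whether an offset also changes depends on where
    it is applied in the re-quantization formula)."""
    return scale * (2 ** n)

# Example: a layer output originally re-quantized as out_int8 = round(acc * scale) + zero_point.
scale, zero_point, n = 0.0042, 3, 2
print(adjust_requant_scale(scale, n), zero_point)   # 0.0168; a zero point applied after scaling is unchanged
```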
  • models implemented with lower bit-width accumulators can be obtained quickly by inserting one or more right bit shift operations into the first arithmetic operations of the original quantized neural network.
  • the embodiment of the present application can work without additional training, which leads to an efficient and fast conversion.
  • a small calibration dataset can be used to estimate the optimal bit shift, which is the best value of N.
  • the models implemented with lower bit-width accumulators have similar performance to the original quantized neural network. If the original quantized neural network is robust to small perturbation, the models converted by the method in the embodiment of the present application will keep their performance similar to the original quantized neural network using one or more accumulators with a full size, such as 32 bits. Others have a small quality degradation after being converted by the method in the embodiment of the present application. Models sensitive to perturbations introduced by the bit shift could be further trained to adapt to these perturbations. The details are described in method 900.
  • N corresponding to all first arithmetic operations in the quantized neural network can be the same.
  • N corresponding to a first arithmetic operation refers to the number of bits shifted caused by the one or more right bit shift operations inserted into the first arithmetic operation.
  • the number of bits shifted caused by the one or more right bit shift operations inserted into the first arithmetic operation can also be called the number of bits shifted inside the first arithmetic operation.
  • the number of bits shifted inside all first arithmetic operations in the quantized neural network can be the same. In other words, a single N can be used for all first arithmetic operations in a quantized neural network.
  • the first arithmetic operations in a quantized neural network can be of the same type or of different types. For example, there can be matrix multiplication and convolution in a quantized neural network.
  • N corresponding to all first arithmetic operations in a first layer of the quantized neural network can be the same.
  • the first layer can be any network layer in the quantized neural network, as long as the network layer includes multiple first arithmetic operations. “The first” of the first network layer is only used to limit that the network layer has a plurality of first arithmetic operations, and has no other limiting effect.
  • a single N can be used for the whole layer.
  • a single N corresponds to the whole layer.
  • the N corresponding to different network layers can be independent.
  • the N corresponding to the multiple layers can be different.
  • N corresponding to at least two first arithmetic operations in a second layer of the quantized neural network are different.
  • the second layer can be any network layer in the neural network, as long as the network layer includes multiple first arithmetic operations. “The second” of the second network layer is only used to limit that the network layer has a plurality of first arithmetic operations, and has no other limiting effect.
  • N corresponding to at least two first arithmetic operations can be determined separately, which is conducive to finding more appropriate N, so as to avoid overflow and ensure performance.
  • a convolution layer is taken as an example.
  • the output channels of a convolution layer can be multiple.
  • the output feature maps of a convolution layer can be multiple.
  • the convolution operations between each filter in the convolution layer and the input data of the convolution layer are independent. Accordingly, the accumulators for each output channel of a convolutional layer can be independent of each other.
  • Individual N can be used for convolution operations corresponding to different filters in a convolution layer. In other words, individual N can be used for each output channel. N corresponding to different filters in a convolution layer can be the same or different.
  • N_2 and N_3, corresponding to two different filters, can be the same or different.
  • the above only takes the convolution layer as an example, and other network layers with multiple independent outputs are also applicable to the above solution. That is, for different outputs of a network layer, the number of bits shifted to the right can be the same or different.
  • individual N can be used for each output channel or each output feature of a network layer.
  • each output feature of an FC layer is independent of each other.
  • the accumulators for each output feature of a FC layer can be independent of each other.
  • Individual N can be used for each output feature.
  • N can be the same or different.
  • the following is an exemplary description of determining the number of bits shifted inside a first arithmetic operation, that is, the value of N.
  • N can be determined according to the number of bits required to represent the results of J first arithmetic operations corresponding to the same N.
  • J is a positive integer.
  • J can be an integer greater than 1.
  • the number of bits shifted to the right inside some first arithmetic operations of a quantized neural network may be the same.
  • the N corresponding to these first arithmetic operations can be determined according to the number of bits required to represent the results of these first arithmetic operations.
  • N is determined according to the maximum number of bits used to represent the results of the J first arithmetic operations based on a first dataset.
  • N is determined according to the maximum number of bits used to represent the results of the J first arithmetic operations based on a first dataset and the bit width of one or more accumulators used for the J first arithmetic operations.
  • the J first arithmetic operations correspond to the same value of N, which means the number of bits shifted to the right inside these first arithmetic operations is the same.
  • the first dataset can be called a calibration dataset.
  • the size of the first dataset can be selected as required.
  • N can be obtained with a small calibration dataset, which can lead to an efficient and fast conversion from the original quantized neural networks to the quantized neural networks including first arithmetic operations with bit shift operations.
  • the J first arithmetic operations can belong to a quantized neural network.
  • the first dataset can be understood as a dataset of the quantized neural network.
  • the data in the first dataset may belong to a training dataset or a test dataset of the quantized neural network.
  • the first dataset also can be obtained in other ways. The embodiment of the present application does not limit this.
  • the maximum result of the J first arithmetic operations based on a first dataset can be expressed as a_max.
  • a max is the maximum result of the J first arithmetic operations based on the first dataset.
  • Acc is the minimum bit width of one or more accumulators used for the J first arithmetic operations.
  • the bit width of one or more accumulators can be the same, in which case Acc is the bit width of the one or more accumulators used for the J first arithmetic operations.
  • Acc can be M.
  • N can also be determined by other forms of formulas, which is not limited in the embodiments of the present application.
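  • As one possible illustration, and not the patent's exact formula, the following Python sketch derives N from the largest magnitude a_max observed on the calibration dataset and the accumulator bit width Acc, in the spirit of the description above:
```python
import math

def estimate_shift(a_max, acc_bits):
    """Illustrative reconstruction (not the exact formula of the disclosure): smallest right
    shift N such that the largest observed magnitude a_max, shifted N bits to the right,
    fits into a signed acc_bits-wide accumulator."""
    bits_needed = math.ceil(math.log2(abs(a_max) + 1)) + 1   # magnitude bits plus a sign bit
    return max(0, bits_needed - acc_bits)

# Example: calibration shows results needing up to 18 bits; the accumulator is 16 bits wide -> N = 2.
print(estimate_shift(a_max=100000, acc_bits=16))
```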
  • a single N can be used for the whole quantized neural network.
  • J can be the number of the first arithmetic operations in the quantized neural network.
  • a single N can be used for the whole network layer of a quantized neural network.
  • the N corresponding to the different layers can be independent.
  • J can be the number of the first arithmetic operations in one network layer.
  • the J corresponding to different network layers can be the same or different.
  • individual N can be used for each output channel or each output feature.
  • J can be the number of the first arithmetic operations for each output channel or each output feature.
  • the embodiment of the present application is implemented at the cost of truncating the rightmost bit at earlier stages. Some models may be sensitive to such perturbations introduced by the bit shift.
  • FIG. 9 depicts a flow chart diagram of an example method 900 for training a neural network according to an embodiment of the present application.
  • the method 900 shown in FIG. 9 can be executed by one or more computing devices.
  • computing devices can be cloud service devices or terminal devices, such as computers or servers, or a system composed of cloud service devices and terminal devices.
  • the method 900 may be performed by the training device 120 in FIG. 1, or the execution device 210 or the local device in FIG. 2.
  • the method 900 can include: training a first neural network. In forward propagation during training, performing the following steps on a result of a first arithmetic operation in the first neural network:
  • the first neural network can be any type of neural network.
  • Step 901 to step 903 can be understood as a simulation of the bit shift for floating point numbers in the neural network. In other words, step 901 to step 903 are inserted in the forward propagation.
  • the method 900 can be called bit shift-aware training.
  • the multiplication result can meet the following formula: A' = round(A/S) × S, where A' is the multiplication result used to replace the result of the first arithmetic operation in the forward propagation.
  • A is the result of the first arithmetic operation.
  • scale S can be chosen in such a way that the significand of the maximum absolute value of the accumulator is mapped to 2^(M-1) after the first re-scaling.
  • S can be 2^N.
  • the first arithmetic operation can be implemented by multiplication and accumulation.
  • the first arithmetic operation is any of the following types of arithmetic operations: a dot product operation, a matrix multiplication operation, or a convolution operation.
  • the scale S is used to make the rounding result within a range represented by M bits.
  • M is a positive integer.
  • M can be a bit width of an accumulator used for the first arithmetic operation in a quantized neural network corresponding to the first neural network.
  • multiple operands of the first arithmetic operation in the quantized neural network are K-bit integers, where M < 4*K, K is a positive integer, and M is a positive integer.
  • M = 2*K.
  • Right bit shift operations inside a first operation may degrade the performance of models sensitive to perturbations introduced by the bit shift.
  • precision reduction caused by right bit shift operations is simulated in the training process by the above method.
  • the models sensitive to perturbations introduced by the bit shift could be trained by the simulation of bit shift for floating point numbers in the neural network to adapt to these perturbations, which is conducive to keeping the inference performance of the quantized neural network using one or more smaller sized accumulators similar to the original quantized neural network using one or more greater sized accumulators.
  • the training of the first neural network can include: performing QAT on the neural network.
  • QAT with a simulation of bit shift for floating point numbers can be used to obtain the quantized neural network.
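  • The following minimal NumPy sketch is not part of the original description; it illustrates the forward-pass simulation of method 900, in which the result of the first arithmetic operation is divided by the scale S, rounded, and multiplied back by S:
```python
import numpy as np

def simulate_bit_shift(a, s):
    """Forward-pass simulation: replace the floating-point result a of the first arithmetic
    operation by round(a / S) * S, e.g. with S = 2**N, to mimic the precision loss of an
    N-bit right shift. In a training framework the rounding would typically be bypassed in
    back propagation with a straight-through estimator."""
    return np.round(a / s) * s

# Example: simulate a 2-bit right shift (S = 4) on a dot-product result.
x = np.random.randn(128).astype(np.float32)
w = np.random.randn(128).astype(np.float32)
a = float(np.dot(x, w))
print(a, simulate_bit_shift(a, s=2.0 ** 2))
```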
  • the device embodiments provided in the embodiments of the present application will be described below with reference to FIG. 10 and FIG. 11.
  • the device provided in the embodiments of the present application can be used to perform the methods described above, such as method 400 or method 900.
  • the specific description can refer to method 400 or method 900. In order to avoid unnecessary repetition, it is appropriate to omit part of the description when describing the device embodiments.
  • FIG. 10 is a schematic block diagram of an electronic device 1800 according to an embodiment of the present application. As shown in FIG. 10, the electronic device 1800 includes: a calculation module 1801 and a shift module 1802.
  • the calculation module 1801 is configured to perform one or more multiplication operations and one or more accumulation operations on multiple operands.
  • the shift module 1802 is configured to perform one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations to obtain a result shifted N bit (s) to the right relative to a result of a first arithmetic operation on the multiple operands, where the first arithmetic operation includes the one or more multiplication operations and the one or more accumulation operations, and N is a positive integer.
  • the first arithmetic operation is any of the following types of arithmetic operations: a dot product operation, a matrix multiplication operation, or a convolution operation.
  • the shift module 1802 is further configured to perform an N-bit right shift operation after every multiplication.
  • the one or more accumulation operations are implemented as cascade summations involving multiple summation iterations, and the shift module 1802 is further configured to perform right bit shift operations before at least two summation iterations of the multiple summation iterations.
  • the electronic device is applied to inference of a quantized neural network, and the quantized neural network includes the first arithmetic operation.
  • the multiple operands are K-bit integers
  • the one or more accumulation operations are implemented with M-bit integer arithmetic, M < 4*K
  • K is a positive integer
  • M is a positive integer
  • the calculation module 1801 can include one or more accumulators configured to perform the accumulation, where M is the bit width of the one or more accumulators.
  • M = 2*K.
  • the N corresponding to all first arithmetic operations in a first layer of the quantized neural network are the same.
  • the N corresponding to at least two first arithmetic operations in a second layer of the quantized neural network are different.
  • module herein may be implemented in software and/or hardware without specific limitation.
  • a “module” may be a software program, a hardware circuit, or a combination of the above functions.
  • calculation module 1801 can be configured to execute any step in the data processing method and the shift module 1802 can be configured to execute any step in the data processing method.
  • the steps to be implemented by the calculation module 1801 and shift module 1802 can be specified as required to implement all functions of the electronic device 1800.
  • FIG. 11 is a schematic block diagram of an electronic device 1900 according to an embodiment of the present application.
  • the electronic device 1900 may include a transceiver 1901, a processor 1902, and a memory 1903.
  • the memory 1903 may be configured to store codes, instructions, or the like, executed by the processor 1902.
  • the processor 1902 may be an integrated circuit chip and has a signal processing capability. In an implementation process, steps of the foregoing method embodiments may be completed by using a hardware integrated logic circuit in a processor, or by using instructions in a form of software.
  • the processor may be a general purpose processor, a digital signal processor (digital signal processor, DSP) , an application specific integrated circuit (application specific integrated circuit, ASIC) , a field programmable gate array (field programmable gate array, FPGA) , or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor may implement or perform the methods, the steps, and the logical block diagrams that are disclosed in the embodiments of the present application.
  • the general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the steps of the methods disclosed with reference to the embodiments of the present application may be directly performed and completed by a hardware decoding processor, or may be performed and completed by using a combination of hardware in the decoding processor and a software module.
  • the software module may be located in a mature storage medium in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory, and the processor reads information in the memory and completes the steps of the foregoing methods in combination with hardware in the processor.
  • the memory 1903 in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both a volatile memory and a nonvolatile memory.
  • the nonvolatile memory may be a read-only memory (Read-Only Memory, ROM) , a programmable read-only memory (Programmable ROM, PROM) , an erasable programmable read-only memory (Erasable PROM, EPROM) , an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM) , or a flash memory.
  • the volatile memory may be a random access memory (Random Access Memory, RAM) and is used as an external cache.
  • RAMs may be used, and are, for example, a static random access memory (Static RAM, SRAM) , a dynamic random access memory (Dynamic RAM, DRAM) , a synchronous dynamic random access memory (Synchronous DRAM, SDRAM) , a double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM) , an enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM) , a synchronous link dynamic random access memory (Synchronous link DRAM, SLDRAM) , and a direct rambus random access memory (Direct Rambus RAM, DR RAM) .
  • the memory in the systems and the methods described in the specification includes but is not limited to these memories and a memory of any other appropriate type.
  • An embodiment of the present application further provides a system chip, where the system chip includes an input/output interface, at least one processor, at least one memory, and a bus.
  • the at least one memory is configured to store instructions
  • the at least one processor is configured to invoke the instructions of the at least one memory to perform operations in the methods in the foregoing embodiments.
  • An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a program instruction for performing any of the foregoing methods.
  • the storage medium may be specifically the memory.
  • An embodiment of the present application further provides a computer program product which, when run on an electronic device, causes the electronic device to perform any of the foregoing methods.
  • An embodiment of the present application further provides a computer-readable storage medium, storing one or more instructions for execution by a device, wherein the one or more instructions include: performing a multiplication operation on a first operand and a second operand; performing a right bit shift operation on a result of the multiplication operation; and performing an accumulation operation on a result of the right bit shift operation and a third operand.
  • the foregoing steps can belong to one instruction.
  • for example, there is an instruction that instructs to perform “multiply-bit shift-add” .
  • this instruction can be used to perform an operation such as a*b±c with a right bit shift on the multiplication result before the accumulation, as sketched below.
  • the specific description about “multiply-shift-add” can refer to method 400.
  • the first operand, the second operand and the third operand are examples of the multiple operands.
  • the specific description can refer to method 400.
  • the first operand and the second operand are K-bit integers
  • the accumulation operation is implemented with M-bit integer arithmetic, M < 4*K
  • K is a positive integer
  • M is a positive integer
  • M = 2*K.
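A minimal Python sketch of such a “multiply-bit shift-add” primitive is given below. The function name `multiply_shift_add`, the choice M = 16 and the overflow check are illustrative assumptions rather than the actual instruction definition.

```python
def multiply_shift_add(a, b, c, n, m=16):
    """Sketch of a "multiply-bit shift-add" primitive: multiply the first and second
    operands, shift the product N bit(s) to the right, then accumulate the shifted
    result with the third operand using M-bit integer arithmetic."""
    product = a * b                # product of two K-bit integers needs at most 2*K bits
    shifted = product >> n         # right bit shift on the multiplication result
    result = shifted + c           # accumulation with the third operand
    assert -2 ** (m - 1) <= result < 2 ** (m - 1), "would exceed M-bit integer arithmetic"
    return result

print(multiply_shift_add(a=100, b=90, c=-500, n=2))  # (100*90 >> 2) + (-500) = 1750
```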
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the described apparatus embodiment is merely an example.
  • the unit division is merely logical function division and may be other division in actual implementation.
  • a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces.
  • the indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
  • the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve objectives of the solutions of the embodiments.
  • functional units in the embodiments of the present application may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
  • When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions in the present application essentially, or the part contributing to the prior art, or some of the technical solutions may be implemented in a form of a software product.
  • the computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM) , a random access memory (Random Access Memory, RAM) , a magnetic disk, or an optical disc.

Abstract

Embodiments of the present application provide a data processing method and an electronic device. The method includes: performing one or more right bit shift operations between one or more multiplication operations and one or more accumulation operations to obtain a result shifted N bit (s) to the right relative to a result of a first arithmetic operation, where the first arithmetic operation includes the one or more multiplication operations and the one or more accumulation operations. The above technical solution can be conducive to avoiding saturation and/or overflow when accumulation is performed.

Description

DATA PROCESSING METHOD AND ELECTRONIC DEVICE
TECHNICAL FIELD
Embodiments of the present application relate to the field of artificial intelligence (AI) , and more specifically, to a data processing method and an electronic device.
BACKGROUND
Accumulators are widely used in various computing devices. An accumulator with a lower bit width requires less computation and occupies a smaller area on a computing device, such as a chip. However, saturation or overflow will take place when a result exceeds the bit width range of the accumulator, which will distort the result. The lower the accumulator bit width, the higher the probability of saturation or overflow.
A neural network is taken as an example. There are some operations including dot products between two vectors in the neural network. When the result of a dot product exceeds the bit width range of the accumulator, the saturation or overflow occurs. The distorted result will affect the performance of the neural network. If the rate of saturation or overflow is high enough, the neural network performance will degrade significantly.
SUMMARY
Embodiments of the present application provide a data processing method and an electronic device. The technical solutions can be conducive to avoiding saturation and/or overflow when accumulation is performed.
According to a first aspect, an embodiment of the present application provides a data processing method including: performing one or more multiplication operations and one or more accumulation operations on multiple operands; and performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations to obtain a result shifted N bit (s) to the right relative to a result of a first arithmetic operation on the multiple operands, where the first arithmetic operation includes the one or more multiplication operations and the one or more accumulation operations, and N is a positive integer.
When saturation and/or overflow occurs, the most significant left-most bits are discarded whereas the least significant right-most bits are kept. According to the above technical solution, one or more right bit shift operations are performed between multiplication and accumulation. This helps to keep the most significant left-most bits of the first arithmetic result, which is conducive to avoiding saturation and/or overflow when accumulation is performed.
The multiplication can be performed on two operands. The objects of the accumulation are based on the results of the multiplication.
The first arithmetic operation refers to an arithmetic operation that can be implemented by multiplication and accumulation.
In a possible design, the first arithmetic operation satisfies: a*b±c, where a, b and c are the multiple operands.
In a possible design, the first arithmetic operation is any of the following types of arithmetic operations: a dot product operation, a matrix multiplication operation, or a convolution operation.
The first arithmetic operation can be an integer arithmetic operation.
In this case, the numbers in the multiple operands are integers.
In a possible design, the one or more accumulation operations can be implemented with M-bit arithmetic. N can be used to make the result within a range represented by M bits. M is a positive integer.
According to the above technical solution, N can be used to make the result within a range represented by M bits, or N can be used to make the result fit into the accumulator bit width. It is conducive to avoiding saturation and/or overflow when accumulation is achieved with M-bit arithmetic.
In this way, the solutions in the embodiment of the present application are conducive to avoiding saturation and/or overflow when accumulation is achieved with a lower bit width. In other words, the right bit shift inserted into the first arithmetic operation allows the accumulation to be achieved with lower bit-width accumulators, thereby reducing the requirement for the capacity of the accumulators. The reduction of the capacity of the accumulator can lead to faster computing speed, lower power consumption and smaller chip area.
In a possible design, the multiple operands are K-bit integers, the one or more accumulation operations can be implemented with M-bit integer arithmetic, M<4*K, K is a positive integer, and M is a positive integer.
In a possible design, M=2*K.
In a possible design, K=8, M=16.
In a possible design, K=4, M=8.
In a possible design, the performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations, includes: performing an N-bit right shift operation  after every multiplication.
In a possible design, the one or more accumulation operations are implemented as cascade summations involving multiple summation iterations, and the performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations, includes: performing right bit shift operations before at least two summation iterations of the multiple summation iterations.
According to the above technical solution, the incremental bit shift mode enables the right-most bit to be truncated gradually during the accumulation. Compared with truncating the right-most N bit (s) after multiplication at one time, the precision of the results obtained by this mode is higher. In other words, the incremental bit shift mode can significantly reduce errors compared to the single N-bit shift mode.
Optionally, the performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations, includes: performing right bit shift operations before every summation iteration.
In a possible design, the data processing method is applied to inference of a quantized neural network, and the quantized neural network includes the first arithmetic operation.
According to the above technical solution, models implemented with lower bit-width accumulators can be obtained quickly by inserting right bit shift operations into the first arithmetic operations of the original quantized neural network. The embodiment of the present application can work without additional training, which leads to an efficient and fast conversion.
Meanwhile, the models implemented with lower bit-width accumulators have similar performance to the original quantized neural network. If the original quantized neural network is robust to small perturbation, the models converted by the method in the embodiment of the present application will keep their performance similar to the original quantized neural network using one or more greater bit-width accumulators.
In a possible design, the N corresponding to all first arithmetic operations in a first layer of the quantized neural network are the same.
In a possible design, the N corresponding to at least two first arithmetic operations in a second layer of the quantized neural network are different.
According to the above technical solution, N corresponding to at least two first arithmetic operations can be determined separately, which is conducive to finding more appropriate N, so as to avoid overflow and ensure performance.
In a possible design, N is determined according to the maximum number of bits used to represent the results of the J first arithmetic operations based on a first dataset, where J is a positive integer.
According to the above technical solution, N can be obtained with a small calibration dataset, which can lead to an efficient and fast conversion from the original quantized neural networks to the quantized neural networks including first arithmetic operations with bit shift operations.
According to a second aspect, an embodiment of the present application provides a training method of a neural network implemented by one or more computing devices, including: training a first neural network, where in forward propagation during the training, the following steps are performed on a result of a first arithmetic operation in the first neural network: dividing the result of the first arithmetic operation by a scale S, where S>0; rounding the division result; and multiplying the rounding result by the scale S, where the multiplication result is used to replace the result of the first arithmetic operation in the forward propagation.
Right bit shift operations inserted into a first arithmetic operation may degrade the performance of models sensitive to perturbations introduced by bit shift. According to the above technical solution, precision reduction caused by right bit shift operations is simulated in the training process. In this way, the models sensitive to perturbations introduced by bit shift could be trained by the simulation of bit shift for floating point numbers in the neural network to adapt to these perturbations, which is conducive to keeping the inference performance of the quantized neural network using one or more smaller sized accumulators similar to the original quantized neural network using one or more greater sized accumulators.
In a possible design, the first arithmetic operation is any of the following types of arithmetic operations: a dot product operation, a matrix multiplication operation, or a convolution operation.
In a possible design, the scale S is used to make the rounding result within a range represented by M bits, and M is a positive integer.
In a possible design, M is a bit width of an accumulator used for the first arithmetic operation in a quantized neural network corresponding to the first neural network.
In a possible design, multiple operands of the first arithmetic operation in the quantized neural network are K-bit integers, where M<4*K, K is a positive integer, and M is a positive integer.
In a possible design, M=2*K.
In a possible design, K=8, M=16.
In a possible design, K=4, M=8.
According to a third aspect, an embodiment of the present application provides an electronic device, including a function of implementing the method in the first aspect. The function may be implemented by hardware, or may be implemented by the hardware executing corresponding software. The hardware or the software includes one or more modules corresponding to the function.
According to a fourth aspect, an embodiment of the present application provides an electronic device, including a function of implementing the method in the second aspect. The function may be implemented by hardware, or may be implemented by the hardware executing corresponding software. The hardware or the software includes one or more modules corresponding to the function.
According to a fifth aspect, an embodiment of the present application provides a computer-readable storage medium having instructions which, when run on a computer, cause the computer to perform the method in the first aspect or any possible design of the first aspect.
According to a sixth aspect, an embodiment of the present application provides a computer-readable storage medium having instructions which, when run on a computer, cause the computer to perform the method in the second aspect or any possible design of the second aspect.
According to a seventh aspect, there is provided an electronic device, including a processor and a memory. The processor is connected to the memory, the memory is configured to store instructions, and the processor is configured to execute the instructions. When the processor executes the instructions stored in the memory, the processor is caused to perform the method in any possible design of the first aspect or the second aspect.
According to an eighth aspect, there is provided a chip system, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to invoke the computer program from the memory and run the computer program, so that a server on which a chip is disposed performs the method in any possible design of the first aspect or the second aspect.
In a possible design, the chip can include a circuit.
According to a ninth aspect, there is provided a computer program product which, when run on an electronic device, causes the electronic device to perform the method in any possible design of the first aspect or the second aspect.
According to a tenth aspect, an embodiment of the present application provides a computer-readable storage medium storing instructions for execution by a device, where the instructions include: performing a multiplication operation on a first operand and a second operand; performing a right bit shift operation on a result of the multiplication operation; and performing an accumulation operation on a result of the right bit shift operation and a third operand.
In a possible design, the first operand and the second operand are K-bit integers, where the accumulation operation is implemented with M-bit integer arithmetic, M<4*K, K is a positive integer, and M is a positive integer.
In a possible design, M=2*K.
In a possible design, K=8, M=16.
In a possible design, K=4, M=8.
DESCRIPTION OF DRAWINGS
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present application.
FIG. 2 is a schematic diagram of a system according to an embodiment of the present application.
FIG. 3 is a schematic diagram of an example of a neural network layer.
FIG. 4 is a flowchart of an embodiment of a data processing method according to an embodiment of the present application.
FIG. 5 is a schematic diagram of an example of a dot product according to an embodiment of the present application.
FIG. 6 is a schematic diagram of an example of a dot product with an incremental bit shift mode according to an embodiment of the present application.
FIG. 7 is a schematic diagram of the representative distribution of the maximum bit of all accumulators of a single neural network layer.
FIG. 8 is a schematic diagram of the representative distribution of the maximum bit of all accumulators of a single neural network layer according to an embodiment of the present application.
FIG. 9 is a flowchart of an example method for training a neural network according to an embodiment of the present application.
FIG. 10 a schematic block diagram of an electronic device 1800 according to an embodiment of the present application.
FIG. 11 is a schematic block diagram of an electronic device 1900 according to an embodiment of the present application.
DESCRIPTION OF EMBODIMENTS
The following describes the technical solutions in the present application with reference to the accompanying drawings.
In order to better describe the solutions of embodiments in the present application, concepts and terms that may be involved in the present application are described below.
(1) Fully-connected (FC) or Linear layer
A fully-connected (FC) or linear layer is the simplest and the oldest neural network block, which is still frequently used in all modern neural networks.
The FC layer performs linear projection of input data X of a shape Nb×Cin with a weight tensor W of a shape Cin×Cout and produces output data Y of a shape Nb×Cout, where Nb is a size of a batch, Cin is the number of input data channels, and Cout is the number of output data channels. The transformation can be written as Y=XW [+b] , where b is a bias tensor of a size Cout, which can be optionally added to each of the Nb rows of a resulting matrix Y. Here, the input data and the output data are 2-dimensional (2D) tensors; however, they can be of any shape as long as the last dimension (feature space size) is equal to Cin. The leading dimensions can be flattened into a single dimension before the layer is applied and unflattened afterwards.
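For readers who prefer code, a minimal NumPy sketch of this projection is given below; the function name and the example shapes (Nb = 8, Cin = 32, Cout = 64) are illustrative assumptions.

```python
import numpy as np

def fully_connected(x, w, b=None):
    """Sketch of the FC/linear layer: project input of shape (Nb, Cin) with a weight
    tensor of shape (Cin, Cout) to an output of shape (Nb, Cout), optionally adding a
    bias of size Cout to every row."""
    y = x @ w                  # Y = XW
    if b is not None:
        y = y + b              # the bias is broadcast over the Nb rows
    return y

x = np.random.randn(8, 32)     # Nb = 8, Cin = 32
w = np.random.randn(32, 64)    # Cin = 32, Cout = 64
b = np.random.randn(64)
print(fully_connected(x, w, b).shape)  # (8, 64) = (Nb, Cout)
```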
(2) Convolutional neural network (CNN)
A CNN is a subclass of deep neural networks (DNN) which typically consists of the following operations: convolution, batch normalization, non-linear activation (sigmoid, ReLU, etc. ) , fully-connected layers, element-wise operations (addition, multiplication) , down sampling (max pooling, average pooling) and up sampling (transposed convolution, interpolation) .
The convolution is a widely-used operation in a modern neural network which convolves input data of a shape Nb×Cin×Hin×Win (or Nb×Hin×Win×Cin in some implementations) with a convolutional kernel of a size Ke×Ke×Cin×Cout and produces output data of a shape Nb×Cout×Hout×Wout, where Nb is the size of the batch, Hin and Hout are a height of the input data and a height of the output data, respectively, Win and Wout are a width of the input data and a width of the output data, respectively, Cin is the number of the input data channels, Cout is the number of the output data channels ( “filters” ) . In other words, for each of the Cout filters, a rectangular cuboid of a size Ke×Ke×Cin slides along the image and is point-wise multiplied with the underlying pixels. Values are added together and a number is provided as an output for each cuboid at each location. Hout and Wout depend on a size of kernel, stride (size of sliding window step) and padding (extra pixels added at edges of the image) .
Compared to fully-connected networks, convolutions constructed this way have inductive biases: locality and translation equivariance, which allows them to perform better on spatially correlated data (images) .
A CNN is usually constructed in a pyramidal way. It starts from high resolution input data with a small number of input data channels. At each stage, the resolution is decreased by some factors via pooling or convolution with the stride > 1, and the number of channels is increased to compress information inside a local embedding. One of the most well- known examples of CNN architectures of such kind is a residual neural network (ResNet) family.
(3) Quantized neural network
There are two major numerical types implemented on hardware: integer and floating-point types. Integer type is used to represent integer numbers, such as 0, 1, 2, or 3. Floating-point type is used to represent real numbers, such as 0.1, -3.1, or 100.0134.
The fact is that integer arithmetic works faster on hardware than floating-point arithmetic. This fact means that adding two integers on hardware, such as 2 + 2, is faster than adding two floating-point numbers, such as 2.2 + 2.1. This is also true for other operations such as division and multiplication.
Deep learning has reached significant breakthroughs in many practical problems, such as computer vision, natural language processing, and speech recognition. For example, computer vision tasks can include classification, Face ID, person re-identification, car brand recognition, object detection, semantic and instance segmentation, or the like.
As a main instrument of the deep learning, neural networks have been widely used in various fields.
The following is a simplified example to illustrate the neural network model.
Suppose there are two vectors: a parameter vector (p_1, p_2, p_3) and a description vector (d_1, d_2, d_3) . The description vector could be some characteristics of objects to be processed by a neural network. For example, the neural network can be a classifier used to determine whether to give credit to a client, in which case the description vector could be some characteristics of the client. The classifier can be constructed as follows. The dot product of the two vectors is estimated: (p_1, p_2, p_3) * (d_1, d_2, d_3) = p_1*d_1 + p_2*d_2 + p_3*d_3 = r. If the result r is greater than some value s, the classifier will determine to give the credit to the client; otherwise, the classifier will determine not to give the credit to the client. Real numbers are then assigned to the variables. For instance, (p_1, p_2, p_3) = (0.1, 0.2, 0.4) , (d_1, d_2, d_3) = (1.2, 1.5, 1.1) and s = 0.5. (p_1, p_2, p_3) = (0.1, 0.2, 0.4) means that p_1=0.1, and so on. (d_1, d_2, d_3) = (1.2, 1.5, 1.1) means that d_1=1.2, and so on. For these values the dot product of the two vectors can be obtained. r = 0.1*1.2 + 0.2*1.5 + 0.4*1.1 = 0.12 + 0.3 + 0.44 = 0.86 > 0.5, which means the classifier will determine to give the credit to the client. In this case, floating-point multiplication and addition are used to perform classifier evaluation. When the parameter vector and description vector are scaled by multiplying them by 10, (10*p_1, 10*p_2, 10*p_3) = (1, 2, 4) and (10*d_1, 10*d_2, 10*d_3) = (12, 15, 11) are obtained. Classifier evaluation is performed: 1*12 + 2*15 + 4*11 = 86. Then the result is divided by 10*10 to recover the original scale and obtain the same result 0.86. In this case, the main evaluation is performed using integers (1, 12, 2, 15, 4, and 11) .
In practice, a neural network usually contains thousands or millions of parameters represented by the floating-point type, which means there are a large number of floating-point operations. If floating-point arithmetic can be switched to integer arithmetic, there will be significant acceleration. But in general, it is not so straightforward to convert the floating-point parameter vector and description vector (or inputs) to the integer parameter vector and description vector by scaling as in the example above. The range and precision of numbers represented in the hardware are limited by the number of bits. If there is a large difference between values in the vectors, some values may exceed the range that can be represented by the integer type after scaling. For example, 8-bit integers can only represent 2^8 = 256 integer numbers, such as integer numbers from -128 to 127. If there is a vector (0.1, 0.0001) , the scale parameter needs to be equal to 10000 to obtain integer vector 10000* (0.1, 0.0001) = (1000, 1) . But 1000 is outside of the range of 8-bit integers. So in general the parameters and the scale parameters need to be set up in a way to obtain integers.
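The overflow issue described above can be illustrated with a short Python sketch; the helper name `scale_to_integers` and the check against the 8-bit range are assumptions made for this example, reusing the values from the text.

```python
def scale_to_integers(values, scale):
    """Sketch of the scaling step: multiply floating-point values by a scale parameter
    and round to integers, then check whether every scaled value fits the 8-bit
    integer range [-128, 127]."""
    ints = [round(v * scale) for v in values]
    fits_int8 = all(-128 <= q <= 127 for q in ints)
    return ints, fits_int8

print(scale_to_integers([0.1, 0.2, 0.4], 10))      # ([1, 2, 4], True)
print(scale_to_integers([0.1, 0.0001], 10000))     # ([1000, 1], False): 1000 does not fit
```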
The procedure of searching scale parameters and tuning parameters of the model to obtain integers is called quantization. Quantization is a powerful method for acceleration and compression of deep neural network (DNN) models. In typical quantized DNN, some part of weights or/and activations are represented as low-precision integer numbers instead of 32-bit floating-point value.
Quantization enables deep neural networks to be deployed on more types of devices, such as edge devices or embedded devices.
On one hand, reducing the bit width of NN weights and/or activations is beneficial for memory reduction. Quantized NN requires less storage space, which means it can be more easily distributed over resource-limited devices.
On the other hand, arithmetic with lower bit-width numbers is faster for hardware and consumes less energy. Floating-point calculations are harder and may not always be supported on microcontrollers on ultra low-power embedded devices.
A lower bit width can also be used if hardware supports low-bit calculations. But the typical problem is that the quality of the neural network can degrade, so it needs to be either fine-tuned or retrained from scratch.
Quantization-aware training (QAT) is to train the network in the process of quantization, so that the network parameters can better adapt to the loss caused by quantization. Specifically, QAT is to insert a fake quant module into the network to simulate the rounding and clamping operations of the quantized model in the inference process, so as to improve the adaptability of the network to the effect of quantization during the training process and obtain higher precision of the quantized network. In the process of QAT, all calculations are implemented by floating-point calculation. The quantized neural network can be obtained after QAT.
For example, for a floating-point weight or activation w in the neural network, the result after processing by the fake quant module can meet the following formula: w_q = clamp (round (w/s) , qmin, qmax) *s.
w is a floating-point weight or activation. s is the quantization scale. qmax is the maximum value of the quantization range, and qmin is the minimum value of the quantization range. round () represents the function of rounding. clamp () represents the function of clamping, where clamp can meet the following formula: clamp (x, qmin, qmax) = min (max (x, qmin) , qmax) .
Straight through estimator (STE) can be used to solve the problem that a neural network containing rounding functions is difficult to train. Specifically, STE is used for the rounding operator in back propagation (BP) : the gradient of the rounding operator is approximated as 1 in the backward pass. It can be understood as follows. STE directly skips the fake quant process and avoids the rounding. The gradient of a network layer is directly transmitted back to the weight before the fake quant module.
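A hedged PyTorch sketch of a fake quant module with STE is shown below; PyTorch, the function name and the parameter values are illustrative assumptions and not part of the original description.

```python
import torch

def fake_quant(w, s, qmin, qmax):
    """Sketch of a fake quant module: divide by the quantization scale s, round,
    clamp to [qmin, qmax], and multiply back by s. The straight-through estimator
    (STE) lets the gradient pass through the rounding operator as if it were the
    identity function."""
    scaled = w / s
    rounded = scaled + (torch.round(scaled) - scaled).detach()  # STE for the rounding
    clamped = torch.clamp(rounded, qmin, qmax)                  # min(max(x, qmin), qmax)
    return clamped * s

w = torch.randn(4, requires_grad=True)
out = fake_quant(w, s=0.05, qmin=-128, qmax=127)
out.sum().backward()   # gradients reach w despite the rounding in the forward pass
print(out, w.grad)
```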
FIG. 1 depicts a schematic diagram of a system architecture 100. As shown in FIG. 1, a data acquisition device 160 is configured to collect training data. For example, if the training data is image data, the training data can include the training image and processing results corresponding to a training image such as a classification result corresponding to the training image, which can be a result of manual pre-annotation.
After collecting the training data, a database 130 obtains the training data from the data acquisition device 160, and a training device 120 trains a target model/rule 101 based on the training data maintained in the database 130.
The target model/rule 101 can be a neural network model provided by the embodiment of the present application. It should be noted that in practical applications, the training data maintained in the database 130 may not be collected by the data acquisition device 160, but may also be received from other devices. In addition, it should be noted that the training device 120 does not necessarily train the target model/rule 101 completely based on the training data maintained in the database 130. It is also possible to obtain training data from the cloud or other places for model training. The above description should not be used as a limitation to the embodiment of the present application.
The target model/rule 101 trained according to the training device 120 can be applied to different systems or devices, such as an execution device 110 shown in FIG. 1. The execution device 110 can be a terminal, such as a mobile terminal, a tablet computer, a laptop computer, augmented reality (AR) , virtual reality (VR) , an on-board terminal, or a  server or cloud terminal. In FIG. 1, the execution device 110 configures an input/output (I/O) interface 112 for data interaction with external devices. The data can be input to the I/O interface 112 through a client device 140. The input data in the embodiment of the present application can include data to be processed input by the client device 140.
The execution device 110 can call data, code, or the like, in a data storage system 150 for corresponding processing by a calculation module 111, and can also store the corresponding processed data, instructions, or the like, in the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the processing result of the data obtained above, to the client device 140, so as to provide it to a user.
It is worth noting that the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be configured to achieve the above goals or tasks.
It is worth noting that FIG. 1 is only a schematic diagram of a system architecture provided by the embodiment of the application. A positional relationship between devices and modules shown in FIG. 1 does not constitute any limitation. For example, in FIG. 1, the data storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 can also be placed in the execution device 110.
As shown in FIG. 2, the embodiment of the present application provides a system 200. The system 200 includes a local device 201, a local device 202, an execution device 210 and a data storage system 250, where the local device 201 and the local device 202 are connected with the execution device 210 through a communication network.
The execution device 210 may be implemented by one or more servers. Alternatively, the execution device 210 may be configured in conjunction with other computing devices, such as data storage devices, routers, load balancers, or the like. The execution device 210 may be provided on a physical site or distributed on a plurality of physical sites. The execution device 210 may use data in the data storage system 250 or call a program code in the data storage system 250 to implement the method provided by the embodiment of the present application, the calculation of the neural network provided by the embodiment of the present application, or the training of the neural network provided by the embodiment of the present application.
Users can operate their respective user devices (such as a local device 201 and a local device 202) to interact with an execution device 210. Each local device can represent any computing device, such as a personal computer, a computer workstation, a smart phone, a tablet computer, a smart camera, a smart car or other types of cellular phone, a media consumption device, a wearable device, a set-top box or a game console.
The local device can interact with the execution device 210 through a communication network of any  communication mechanism/communication standard. The communication network can be a wide area network, a local area network, a point-to-point connection, or any combination of them.
In some implementations, the local device 201 and the local device 202 acquire relevant parameters of the neural network from the execution device 210, deploy the neural network on the local device 201 and the local device 202, and use the neural network for data processing like image processing, speech processing or text processing.
In some implementations, the neural network can be directly deployed on the execution device 210. The execution device 210 obtains the data to be processed from the local device 201 and the local device 202, and processes the data using the neural network.
The execution device 210 can be a cloud device. In this case, the execution device 210 can be deployed on the cloud. Alternatively, the execution device 210 can be a terminal device. In this case, the execution device 210 can be deployed on the user terminal.
Accumulators are widely used in various computing devices. An accumulator with a lower bit width requires less computation, which means it is faster and more power-efficient. An accumulator with a lower bit width also occupies a smaller area on a computing device, such as a chip. However, saturation or overflow will take place when a result exceeds the bit width range of the accumulator, which will distort the result. The lower the accumulator bit width, the higher the probability of saturation or overflow.
A neural network is taken as an example. There are some operations including dot products in the neural network. The most commonly used neural network layers are linear layers and convolutional layers containing matrix multiplication operations and convolution operations both of which consist of a set of dot products between two vectors. The dot product is done incrementally by multiplying two numbers at a time and adding the result to the previously accumulated sum which is called an accumulator.
FIG. 3 depicts a schematic diagram of an example of a neural network layer. As shown in FIG. 3, inputs and weights are quantized to 8-bit integers. The multiplication result of two 8-bit integer numbers requires 16 bits. And the sum of these intermediate multiplications may require an expanded bit width depending on the length of the input vectors and the actual values. Commonly, as shown in FIG. 3, a 32-bit integer accumulator is used with 8-bit integer inputs, because its capacity is sufficient in practical scenarios. The accumulator is implemented as a 32-bit integer which means all the additions in the dot product are completed with 32-bit integer arithmetic. In other words, the results of intermediate multiplication in the dot product are accumulated in the 32-bit integer. For example, as shown in FIG. 3, the inputs and weights are vector (I1, I2, I3, I4) and vector (W1, W2, W3, W4) , respectively. I1, I2, I3, I4, W1, W2, W3 and W4 are represented as 8-bit integers. The values of the corresponding position in the two vectors are multiplied to obtain C1, C2, C3 and C4. C1, C2, C3 and C4 are represented as 16-bit integers. C1, C2, C3 and C4 are added to obtain A. A is represented as a 32-bit integer in FIG. 3. If the input of the next layer needs to be quantized into an 8-bit integer, the result of the dot product can be quantized into 8-bit integer outputs as shown in FIG. 3.
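The flow of FIG. 3 can be sketched in Python as follows; the operand values and the range check are illustrative assumptions, not values from the original figure.

```python
def dot_product_baseline(inputs, weights):
    """Standard "multiply-accumulate" flow of FIG. 3: 8-bit integer inputs and weights,
    16-bit intermediate products, and a 32-bit integer accumulator."""
    acc = 0
    for i, w in zip(inputs, weights):
        c = i * w   # the product of two 8-bit integers needs at most 16 bits
        acc += c    # all additions are completed with 32-bit integer arithmetic
    assert -2 ** 31 <= acc < 2 ** 31, "the 32-bit accumulator would saturate or overflow"
    return acc

I = [100, -50, 120, 7]   # I1..I4 as 8-bit integers (illustrative values)
W = [90, 60, -110, 30]   # W1..W4 as 8-bit integers (illustrative values)
print(dot_product_baseline(I, W))  # 9000 - 3000 - 13200 + 210 = -6990
```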
When the result exceeds the bit width range of the accumulator, saturation or overflow occurs. The distorted result will affect the performance of the neural network. If the rate of saturation or overflow is high enough, the neural network performance will degrade significantly. In some related technical solutions, the bit width of inputs is further reduced. A smaller bit width of inputs reduces the bit width of the intermediate multiplication and, therefore, the final accumulator. However, the bit width of the inputs affects the precision of the inputs. The additional loss of precision will lead to the quality drop.
The embodiment of the present application provides a data processing method to avoid computational errors introduced by saturation and/or overflow by inserting the bit shift between multiplication and accumulation.
FIG. 4 depicts a flow chart diagram of an example method 400 for data processing according to an embodiment of the present application. The method 400 shown in FIG. 4 can be executed by one or more computing devices. In a possible implementation, computing devices can be cloud service devices or terminal devices, such as computers or servers, or a system composed of cloud service devices and terminal devices. For example, the method 400 may be performed by the execution device 110 in FIG. 1, or the execution device 210 or the local device in FIG. 2. In a possible implementation, computing devices can be chips or circuits configured in cloud service devices or terminal devices.
It should be understood that the step number in the embodiment of the present application is only for the convenience of description, and does not limit the execution order of the steps.
The method 400 shown in FIG. 4 includes at least step 401 and step 402.
401, performing one or more multiplication operations and one or more accumulation operations on multiple operands;
402, performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations to obtain a result shifted N bit (s) to the right relative to a result of a first arithmetic operation on the multiple operands, where the first arithmetic operation includes the one or more multiplication operations and the one or more accumulation operations, and N is a positive integer.
The multiplication can be performed on two operands. The objects of the accumulation are based on the results of the multiplication.
The method 400 can be understood as follows: insert one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations originally included in the first arithmetic operation. “Multiply-shift-accumulate” is used as a building block of the first arithmetic operation instead of the standard “multiply-accumulate (MAC) ” operation. The result of the first arithmetic operation with one or more right bit shift operations inserted inside is shifted N bit (s) to the right compared with the result of the first arithmetic operation without the one or more right bit shift operations.
In the embodiment of the present application, the one or more right shift operations inserted into the first operation can also be referred to as the one or more right shift operations inside the first operation.
Specifically, one or more right bit shift operations are inserted between the initial multiplication operation and the final accumulation operation originally included in the first arithmetic operation. In this case, the final accumulated result will be shifted N bit (s) to the right relative to the first arithmetic operation result. The final accumulated result is the result obtained in step 402.
The first arithmetic operation refers to an arithmetic operation that can be implemented by multiplication and accumulation.
Optionally, the first arithmetic operation satisfies: a*b±c, where a, b and c are the multiple operands.
Optionally, the first arithmetic operation is any of the following types of arithmetic operations: a dot product operation, a matrix multiplication operation, or a convolution operation.
Matrix multiplication and convolution can be regarded as including a set of dot products. In order to facilitate description, the embodiments of the present application take the first arithmetic operation as the dot product operation as an example to explain, without limiting the solutions in the embodiments of the present application.
For example, the multiple operands of the dot product can be two vectors. The dot product is done incrementally by multiplying two numbers from the two vectors at a time and adding the result to the previously accumulated sum stored in an accumulator to update the accumulated sum in the accumulator. Multiplication on the corresponding numbers in the two vectors is performed. The objects of the accumulation are based on the results of multiplication.
When saturation and/or overflow occurs, the most significant left-most bits are discarded whereas the least significant right-most bits are kept. According to the embodiment of the present application, one or more right bit shift operations are performed between multiplication and accumulation. This helps to keep the most significant left-most bits of the first arithmetic result, which is conducive to avoiding saturation and/or overflow when accumulation is performed.
Meanwhile, the embodiment of the present application does not require changing the effective bit width of the first arithmetic operation inputs, which avoids the loss of precision caused by the reduction of the bit width of the inputs, and is conducive to ensuring the calculation quality.
In the method 400, the first arithmetic operation is an integer arithmetic operation.
In this case, the numbers in the multiple operands are integers.
Optionally, the one or more accumulation operations can be implemented with M-bit arithmetic. N can be used to make the result within a range represented by M bits. M is a positive integer.
In other words, M is the bit width of one or more accumulators used to perform the one or more accumulation operations.
According to the embodiment of the present application, N can be used to make the result within a range represented by M bits, or N can be used to make the result fit into the accumulator bit-width. It is conducive to avoiding saturation and/or overflow when accumulation is achieved with M-bit arithmetic.
In this way, the solutions in the embodiment of the present application are conducive to avoiding saturation and/or overflow when accumulation is achieved with a lower bit width. In other words, the right bit shift inserted into the first arithmetic operation allows the accumulation to be achieved with lower bit-width accumulators, thereby reducing the requirement for the capacity of the accumulators. The reduction of the capacity of the accumulator can lead to faster computing speed, lower power consumption, and a smaller chip area.
Optionally, the multiple operands of the first arithmetic operation can be K-bit integers, and the one or more accumulation operations can be implemented with M-bit integer arithmetic. M<4*K. K is a positive integer.
Optionally, M=2*K.
Optionally, K=8, M=16.
Optionally, K=4, M=8.
The one or more right bit shift operations in step 402 are described as follows.
In some implementations, step 402 can include: performing an N-bit right shift operation after every multiplication.
The bit shift mode described above can be called a single bit shift mode. A single N-bit right shift operation is performed on the result of every multiplication. In this way, the result of the final accumulation can be shifted N bit (s) to the right relative to the first arithmetic operation result. Taking a dot product operation as an example, the result of the final accumulation can be a value shifted by N bit (s) to the right relative to the dot product of the two vectors.
FIG. 5 depicts a schematic diagram of an example of a dot product with an N-bit shift.
For example, as shown in FIG. 5, multiple operands of a dot product operation are vector#1 (I1, I2, I3, I4) and vector#2 (W1, W2, W3, W4) . Multiplication is performed on the values of the corresponding position in the two vectors to obtain C1, C2, C3, and C4. The results of the multiplication (C1, C2, C3, and C4) are shifted N bit (s) to the right  respectively to obtain S1, S2, S3, and S4. S1, S2, S3, and S4 are added to obtain A’. Compared to the value of A in FIG. 3, A’ is a value shifted by N bit (s) to the right.
As shown in FIG. 5, I1, I2, I3, I4, W1, W2, W3 and W4 can be represented as 8-bit integers. C1, C2, C3, and C4 are represented as 16-bit integers. S1, S2, S3, and S4 are represented as 16-bit integers. A’ is represented as a 16-bit integer. In this case, M can be 16.
It should be understood that FIG. 5 is only an example and does not limit the solutions in the embodiments of the present application.
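A minimal Python sketch of the single bit shift mode is given below, assuming 8-bit operands, a 16-bit accumulator (M = 16) and N = 2; the function name and the operand values are illustrative.

```python
def dot_product_single_shift(inputs, weights, n, m=16):
    """Single bit shift mode of FIG. 5: every 16-bit product is shifted N bit(s) to the
    right before it is accumulated, so the accumulation stays within M-bit integer
    arithmetic (M = 16 here)."""
    acc = 0
    for i, w in zip(inputs, weights):
        c = i * w   # 16-bit intermediate product of two 8-bit operands
        s = c >> n  # N-bit right shift after every multiplication
        acc += s    # accumulation with M-bit integer arithmetic
        assert -2 ** (m - 1) <= acc < 2 ** (m - 1), "the M-bit accumulator would overflow"
    return acc      # shifted roughly N bit(s) to the right of the exact dot product

I = [100, -50, 120, 7]
W = [90, 60, -110, 30]
exact = sum(i * w for i, w in zip(I, W))
print(dot_product_single_shift(I, W, n=2), exact >> 2)  # -1748 and -1748 in this example
```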
In some implementations, the one or more accumulation operations can be implemented as cascade summations involving multiple summation iterations. In this case, step 402 can include: performing right bit shift operations before at least two summation iterations of the multiple summation iterations.
The bit shift mode described above can be called an incremental bit shift mode. In the process of cascade summations, right bit shift operations are done incrementally before at least two summation iterations. The total number of bits of right bit shift before the at least two summation iterations is N, in which way, the result of the final accumulation can be shifted N bit (s) to the right relative to the first arithmetic operation result. Taking a dot product operation as an example, the result of the final accumulation can be a value shifted by N bit (s) to the right relative to the dot product of the two vectors.
The number of bits shifted before the at least two summation iterations can be the same or different.
Optionally, step 402 can include: performing right bit shift operations before every summation iteration.
The number of bits shifted before different summation iterations can be the same or different.
As an example, step 402 can include: performing N1-bit right shift operations before every summation iteration. N1 is a positive integer.
For example, N1 can be 1. In this case, N can be equal to the number of the summation iterations in the cascade summations.
FIG. 6 depicts a schematic diagram of an example of a dot product with an incremental 2-bit shift.
For example, as shown in FIG. 6, multiple operands of a dot product operation are vector (I1, I2, I3, I4) and vector (W1, W2, W3, W4) . Multiplication is performed on the values of the corresponding position in the two vectors to obtain C1, C2, C3, and C4. In FIG. 6, there are two summation iterations. Before the first summation iteration, the results of the multiplication (C1, C2, C3, and C4) are shifted 1 bit to the right respectively to obtain S’1, S’2, S’3, and S’4. In the first summation iteration, S’1 and S’2 are added to obtain A” 1, and S’3 and S’4 are added to obtain A” 2. Before the second summation iteration, the results of the first summation iteration, A” 1 and A” 2, are shifted 1 bit to the right respectively to obtain S” 1 and S” 2. In the second summation iteration, S” 1 and S” 2 are added to obtain A”’ . Compared to the value of A in  FIG. 3, A”’ is a value shifted by 2 bits to the right.
As shown in FIG. 6, I1, I2, I3, I4, W1, W2, W3 and W4 can be represented as 8-bit integers. C1, C2, C3, C4, S’1, S’2, S’3, S’4, A” 1, A” 2, S” 1 and S” 2 are represented as 16-bit integers. A” ’ is represented as a 16-bit integer. In this case, M can be 16.
It should be understood that FIG. 6 is only an example and does not limit the solutions in the embodiments of the present application.
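A minimal Python sketch of the incremental bit shift mode is given below, assuming a vector length that is a power of two and a 16-bit accumulator; names and values are illustrative.

```python
def dot_product_incremental_shift(inputs, weights, m=16):
    """Incremental bit shift mode of FIG. 6: the accumulation is a cascade summation and
    a 1-bit right shift is applied before every summation iteration. With 4 operands
    there are two iterations, so the result is shifted 2 bits to the right in total.
    The vector length is assumed to be a power of two to keep the sketch short."""
    terms = [i * w for i, w in zip(inputs, weights)]  # 16-bit intermediate products
    while len(terms) > 1:
        terms = [t >> 1 for t in terms]               # 1-bit right shift before this iteration
        terms = [terms[k] + terms[k + 1] for k in range(0, len(terms), 2)]
        for t in terms:
            assert -2 ** (m - 1) <= t < 2 ** (m - 1)  # partial sums stay within M bits
    return terms[0]

I = [100, -50, 120, 7]
W = [90, 60, -110, 30]
print(dot_product_incremental_shift(I, W))  # close to the exact dot product shifted 2 bits right
```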
The incremental bit shift mode enables the right-most bit to be truncated gradually during the accumulation. Compared with truncating the right-most N bit (s) after multiplication at one time, the precision of the results obtained by this mode is higher. In other words, the incremental bit shift mode can significantly reduce errors compared to the single N-bit shift mode.
The embodiment of the present application is applicable to any arithmetic operations containing multiplication and accumulation, such as a dot product, which could benefit from the reduced capacity accumulator. Moreover, it can be applied to any bit-width setups.
Optionally, the method 400 can be applied to inference of a quantized neural network. And the quantized neural network includes the first arithmetic operation.
Input data and output data of a quantized neural network are related to a task of the quantized neural network.
If the quantized neural network is used for a computer vision task, the input of the quantized neural network can be image data. The computer vision task can include image classification, image detection, image segmentation, image recognition, or image generation. For example, the computer vision task can be image classification, in which case, the output of the quantized neural network can be used to indicate the classification to which the image belongs. For another example, the computer vision task can be image recognition, in which case, the output of the quantized neural network can be used to confirm the identity of the object in the image.
If the quantized neural network model is used for a text processing task, the input of the quantized neural network can be text data. The text processing task can include text recognition or text translation. For example, the text processing task can be text translation, in which case, the output of the quantized neural network can be the translation result of the input text.
For another example, if the quantized neural network model is used for a voice processing task, the input of the quantized neural network can be voice data. The voice processing task can include speech recognition, in which case, the output of the quantized neural network can be the recognition result of the input voice.
A type of quantized neural network in the embodiment of the present application is not limited.
FIG. 7 depicts a schematic diagram of the representative distribution of the maximum bit of all accumulators of a single neural network layer in a quantized neural network. FIG. 8 depicts a schematic diagram of the representative distribution of the maximum bit of all accumulators of a single neural network layer based on the solutions in the embodiment of the present application. The above example distribution is obtained from a quantized neural network with 8-bit integer weights and activations and 32-bit accumulators. As shown in FIG. 7, the values in the accumulators may need up to 18 bits to represent. With 16-bit accumulators, some of the accumulators would experience overflow. The 2-bit right shift can be inserted into the dot products of the layer according to the solution in the embodiment of the application. In this case, as shown in FIG. 8, the distribution of the maximum bit will stay very similar but shifted by 2 bits to the right, which can fit into a 16-bit integer type.
The output of a network layer in the quantized neural network is typically re-quantized to become an input to the following network layer. The numbers in the accumulator with a large bit width are scaled down to fit into a type with a smaller bit width. For example, as shown in FIG. 3, the output of the network layer is represented as a 32-bit integer, which does not match the bit width of the input of the following network layer, so the 32-bit integer needs to be re-quantized to an 8-bit integer. When integers are scaled down by a multiple of 2, the scaling is equivalent to a right bit shift operation. When the method 400 is applied to a network layer of a quantized neural network, the output of the network layer will be right-shifted. The output of the original quantized neural network can be approximately recovered by adjusting the re-quantization scales and offsets accordingly: for every bit shifted, the re-quantization parameters need to be scaled by 2.
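For instance, the scale adjustment can be sketched as follows. The function and parameter names are illustrative, and depending on the exact re-quantization formulation the offsets may need a corresponding adjustment as well; the application only states that the parameters are scaled by 2 per shifted bit.

```python
def adjust_requantization_scale(scale, n):
    """Compensate an n-bit right shift inside a layer: the accumulator values
    are now 2**n times smaller, so the re-quantization scale is multiplied by
    2**n to approximately recover the original layer output."""
    return scale * (2 ** n)
```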
According to the embodiment of the present application, models implemented with lower bit-width accumulators can be obtained quickly by inserting one or more right bit shift operations into the first arithmetic operations of the original quantized neural network. The embodiment of the present application can work without additional training, which leads to an efficient and fast conversion. A small calibration dataset can be used to estimate the optimal bit shift, which is the best value of N.
Meanwhile, the models implemented with lower bit-width accumulators have performance similar to the original quantized neural network. If the original quantized neural network is robust to small perturbations, the model converted by the method in the embodiment of the present application will keep its performance close to that of the original quantized neural network using one or more accumulators with a full size, such as 32 bits. Other models may show a small quality degradation after being converted by the method in the embodiment of the present application. Models sensitive to the perturbations introduced by the bit shift could be further trained to adapt to these perturbations. The details are described in method 900.
Optionally, N corresponding to all first arithmetic operations in the quantized neural network can be the same.
N corresponding to a first arithmetic operation refers to the number of bits shifted caused by the one or more right bit shift operations inserted into the first arithmetic operation. The number of bits shifted caused by the one or more right bit shift operations inserted into the first arithmetic operation can also be called the number of bits shifted inside the first arithmetic operation.
There can be multiple first arithmetic operations in a quantized neural network. The number of bits shifted inside all first arithmetic operations in the quantized neural network can be the same. In other words, a single N can be used for all first arithmetic operations in a quantized neural network.
The first arithmetic operations in a quantized neural network can be of the same type or of different types. For example, there can be matrix multiplication and convolution in a quantized neural network.
Optionally, N corresponding to all first arithmetic operations in a first layer of the quantized neural network can be the same.
There can be multiple first arithmetic operations in one network layer of the quantized neural network. The first layer can be any network layer in the quantized neural network, as long as the network layer includes multiple first arithmetic operations. The term "first" in "first layer" is only used to indicate that the network layer includes a plurality of first arithmetic operations, and has no other limiting effect.
A single N can be used for the whole layer. In other words, a single N corresponds to the whole layer.
In addition, for multiple layers including the first arithmetic operations, the N corresponding to different network layers can be independent. The N corresponding to the multiple layers can be different.
Optionally, N corresponding to at least two first arithmetic operations in a second layer of the quantized neural network are different.
The second layer can be any network layer in the neural network, as long as the network layer includes multiple first arithmetic operations. The term "second" in "second layer" is only used to indicate that the network layer includes a plurality of first arithmetic operations, and has no other limiting effect.
In this way, N corresponding to at least two first arithmetic operations can be determined separately, which is conducive to finding more appropriate N, so as to avoid overflow and ensure performance.
A convolution layer is taken as an example. A convolution layer can have multiple output channels; in other words, it can produce multiple output feature maps. The convolution operations between each filter in the convolution layer and the input data of the convolution layer are independent. Accordingly, the accumulators for each output channel of a convolution layer can be independent of each other. An individual N can be used for the convolution operations corresponding to different filters in a convolution layer. In other words, an individual N can be used for each output channel. The N corresponding to different filters in a convolution layer can be the same or different.
For example, there are two filters (filter#1 and filter#2) in a convolution layer. An N2-bit right shift is performed inside all dot product operations included in the convolution operation of the filter#1 and input data of the convolution layer. An N3-bit right shift is performed inside all dot product operations included in the convolution operation of the filter#2 and input data of the convolution layer. N2 and N3 can be the same or different.
It should be noted that the above only takes the convolution layer as an example; the above solution is also applicable to other network layers with multiple independent outputs. That is, for different outputs of a network layer, the number of bits shifted to the right can be the same or different. In other words, an individual N can be used for each output channel or each output feature of a network layer. For example, each output feature of an FC layer is independent of the others. Accordingly, the accumulators for each output feature of an FC layer can be independent of each other. An individual N can be used for each output feature. For different output features of an FC layer, N can be the same or different.
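As a sketch of the per-channel case (illustrative names and an im2col-style data layout are assumed), each output channel of a convolution layer can use its own shift amount inside its dot products:

```python
import numpy as np

def shifted_dot(x, w, n):
    """Dot product with an n-bit right shift after every multiplication."""
    acc = np.int16(0)
    for a, b in zip(x, w):
        acc = np.int16(acc + ((np.int16(a) * np.int16(b)) >> n))
    return acc

def conv_output_channels(patches, filters, shifts):
    """patches: int8 array (num_positions, patch_size) -- im2col-style layout
       filters: int8 array (Cout, patch_size), one filter per output channel
       shifts:  one N per output channel (per filter), same or different"""
    out = np.zeros((patches.shape[0], filters.shape[0]), dtype=np.int16)
    for k in range(filters.shape[0]):            # each output channel uses its own shift
        for p in range(patches.shape[0]):
            out[p, k] = shifted_dot(patches[p], filters[k], shifts[k])
    return out
```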
The following is an exemplary description of determining the number of bits shifted inside a first arithmetic operation, that is, the value of N.
In some implementations, N can be determined according to the number of bits required to represent the results of J first arithmetic operations corresponding to the same N. J is a positive integer.
J can be an integer greater than 1. As mentioned above, the number of bits shifted to the right inside some first arithmetic operations of a quantized neural network may be the same. The N corresponding to these first arithmetic operations can be determined according to the number of bits required to represent the results of these first arithmetic operations.
Optionally, N is determined according to the maximum number of bits used to represent the results of the J first arithmetic operations based on a first dataset.
Specifically, N is determined according to the maximum number of bits used to represent the results of the J first arithmetic operations based on a first dataset and the bit width of one or more accumulators used for the J first arithmetic operations.
The J first arithmetic operations correspond to the same value of N, which means the number of bits shifted to the right inside these first arithmetic operations is the same.
The first dataset can be called a calibration dataset.
The size of the first dataset can be selected as required.
N can be obtained with a small calibration dataset, which can lead to an efficient and fast conversion from the original quantized neural networks to the quantized neural networks including first arithmetic operations with bit shift  operations.
The J first arithmetic operations can belong to a quantized neural network. In this case, the first dataset can be understood as a dataset of the quantized neural network. For instance, the data in the first dataset may belong to a training dataset or a test dataset of the quantized neural network. Or the first dataset also can be obtained in other ways. The embodiment of the present application does not limit this.
Specifically, statistics of the accumulators used to perform the J first arithmetic operations can be collected to determine the value of N.
For example, the maximum result of the J first arithmetic operations based on a first dataset can be expressed as follows:
Amax = max (A1, A2, …, AJ) ;
Amax is the maximum result of the J first arithmetic operations based on the first dataset. Ai can be the maximum result of the first arithmetic operation i based on the first dataset, i=1, 2, …, J.
For example, N can be expressed as follows:
N=max (0, [log2 (Amax+1) - (Acc-1) ] ) ;
Acc is the minimum bit width of the one or more accumulators used for the J first arithmetic operations. For example, the bit widths of the one or more accumulators can be the same, in which case, Acc is the bit width of the one or more accumulators used for the J first arithmetic operations. Acc can be M.
It should be understood that the above formula is only an example, and N can also be determined by other forms of formulas, which is not limited in the embodiments of the present application.
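Under the assumption that the square brackets in the formula above denote rounding up to the nearest integer, the statistics-based choice of N can be sketched as follows (function and variable names are illustrative):

```python
import math

def choose_shift(max_results, acc_bits):
    """Pick N for a group of J first arithmetic operations that share one N.

    max_results: per-operation maximum accumulator results A_1 ... A_J observed
                 on the calibration (first) dataset
    acc_bits:    bit width Acc of the accumulators used for these operations
    """
    a_max = max(max_results)                          # A_max over the group
    needed_bits = math.ceil(math.log2(a_max + 1))     # bits needed to represent A_max
    return max(0, needed_bits - (acc_bits - 1))       # N = max(0, [log2(A_max+1)] - (Acc-1))

# e.g. a group of operations whose results need up to 17 bits, with 16-bit accumulators
print(choose_shift([100000], acc_bits=16))            # -> 2
```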
As an example, a single N can be used for the whole quantized neural network. In this case, J can be the number of the first arithmetic operations in the quantized neural network.
As another example, a single N can be used for the whole network layer of a quantized neural network. For different network layers including the first arithmetic operations, the N corresponding to the different layers can be independent. In this case, J can be the number of the first arithmetic operations in one network layer. For different network layers, the J can be the same or different. As shown in FIG. 7, the maximum bit of all accumulators of a single network layer in a quantized neural network can be 18, which means log2 (Amax+1) =18.
The above way can be called the per-tensor approach.
As another example, individual N can be used for each output channel or each output feature. In this case, J  can be the number of the first arithmetic operations for each output channel or each output feature.
The above way can be called the per-channel approach, which can be naturally combined with per-channel quantization.
It should be understood that the above is only an example, and N can also be obtained in other ways.
The embodiment of the present application is implemented at the cost of truncating the rightmost bit at earlier stages. Some models may be sensitive to such perturbations introduced by the bit shift.
FIG. 9 depicts a flow chart diagram of an example method 900 for training a neural network according to an embodiment of the present application. The method 900 shown in FIG. 9 can be executed by one or more computing devices. In a possible implementation, computing devices can be cloud service devices or terminal devices, such as computers or servers, or a system composed of cloud service devices and terminal devices. For example, the method 900 may be performed by the training device 120 in FIG. 1, or the execution device 210 or the local device in FIG. 2.
The method 900 can include: training a first neural network. In forward propagation during training, performing the following steps on a result of a first arithmetic operation in the first neural network:
901, dividing the result of the first arithmetic operation by a scale S, where S>0;
902, rounding the division result; and
903, multiplying the rounding result by the scale S, where the multiplication result is used to replace the result of the first arithmetic operation in the forward propagation.
The first neural network can be any type of neural network.
During training, the first neural network uses floating point arithmetic, in which case, the first arithmetic operation can be a floating point arithmetic operation. Step 901 to step 903 can be understood as a simulation of the bit shift for floating point numbers in the neural network. In other words, step 901 to step 903 are inserted in the forward propagation.
The method 900 can be called bit shift-aware training.
For example, the multiplication result A' can meet the following formula:
A' = round (A/S) *S;
A is the result of the first arithmetic operation, and round (·) denotes the rounding in step 902.
As an example, scale S can be chosen in such a way that the significand of the maximum absolute value of the accumulator is mapped to 2^(M-1) after the first re-scaling. For example, S can be 2^N.
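A possible sketch of this simulation in PyTorch is shown below. The straight-through estimator used for the backward pass is an assumption on our part; the application only specifies the forward-propagation steps 901 to 903.

```python
import torch

def simulate_bit_shift(a, n):
    """Simulate the precision loss of an n-bit right shift on a floating-point
    accumulator result `a` during training (steps 901 to 903): divide by
    S = 2**n, round, and multiply back by S. The straight-through estimator
    keeps gradients flowing through the rounding during backpropagation."""
    s = 2.0 ** n
    rounded = torch.round(a / s) * s
    return a + (rounded - a).detach()   # forward: rounded value; backward: identity
```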
The first arithmetic operation can be implemented by multiplication and accumulation.
Optionally, the first arithmetic operation is any of the following types of arithmetic operations: a dot product  operation, a matrix multiplication operation, or a convolution operation.
Optionally, the scale S is used to make the rounding result within a range represented by M bits. M is a positive integer.
Further, M can be a bit width of an accumulator used for the first arithmetic operation in a quantized neural network corresponding to the first neural network.
Optionally, multiple operands of the first arithmetic operation in the quantized neural network are K-bit integers, where M<4*K, K is a positive integer, and M is a positive integer.
Optionally, M=2*K.
Optionally, K=8, M=16.
Optionally, K=4, M=8.
Right bit shift operations inside a first arithmetic operation may degrade the performance of models sensitive to the perturbations introduced by the bit shift. According to the solution in the embodiment of the present application, the precision reduction caused by right bit shift operations is simulated in the training process by the above method. In this way, the models sensitive to the perturbations introduced by the bit shift can be trained, through the simulation of the bit shift for floating point numbers in the neural network, to adapt to these perturbations, which is conducive to keeping the inference performance of the quantized neural network using one or more smaller-sized accumulators similar to that of the original quantized neural network using one or more larger-sized accumulators.
Optionally, the training of the first neural network can include: performing quantization-aware training (QAT) on the first neural network.
In other words, QAT with a simulation of bit shift for floating point numbers can be used to obtain the quantized neural network.
The device embodiments provided in the embodiments of the present application will be described below with reference to FIG. 10 and FIG. 11. The device provided in the embodiments of the present application can be used to perform the methods described above, such as method 400 or method 900. The specific description can refer to method 400 or method 900. In order to avoid unnecessary repetition, it is appropriate to omit part of the description when describing the device embodiments.
FIG. 10 is a schematic block diagram of an electronic device 1800 according to an embodiment of the present application. As shown in FIG. 10, the electronic device 1800 includes: a calculation module 1801 and a shift module 1802.
The calculation module 1801 is configured to perform one or more multiplication operations and one or more accumulation operations on multiple operands.
The shift module 1802 is configured to perform one or more right bit shift operations between the one or more  multiplication operations and the one or more accumulation operations to obtain a result shifted N bit (s) to the right relative to a result of a first arithmetic operation on the multiple operands, where the first arithmetic operation includes the one or more multiplication operations and the one or more accumulation operations, and N is a positive integer.
Optionally, the first arithmetic operation is any of the following types of arithmetic operations: a dot product operation, a matrix multiplication operation, or a convolution operation.
Optionally, the shift module 1802 is further configured to perform an N-bit right shift operation after every multiplication.
Optionally, the one or more accumulation operations are implemented as cascade summations involving multiple summation iterations, and the shift module 1802 is further configured to perform right bit shift operations before at least two summation iterations of the multiple summation iterations.
Optionally, the electronic device is applied to inference of a quantized neural network, and the quantized neural network includes the first arithmetic operation.
Optionally, the multiple operands are K-bit integers, the one or more accumulation operations are implemented with M-bit integer arithmetic, M<4*K, K is a positive integer, and M is a positive integer.
For example, the calculation module 1801 can include one or more accumulators configured to perform the accumulation, where M is the bit width of the one or more accumulators.
Optionally, M=2*K.
Optionally, K=8, M=16.
Optionally, K=4, M=8.
Optionally, the N corresponding to all first arithmetic operations in a first layer of the quantized neural network are the same.
Optionally, the N corresponding to at least two first arithmetic operations in a second layer of the quantized neural network are different.
The term "module" herein may be implemented in software and/or hardware without specific limitation. For example, a "module" may be a software program, a hardware circuit, or a combination of the above functions.
It should be noted that in other embodiments, the calculation module 1801 can be configured to execute any step in the data processing method and the shift module 1802 can be configured to execute any step in the data processing method. The steps to be implemented by the calculation module 1801 and shift module 1802 can be specified as required to implement all functions of the electronic device 1800.
FIG. 11 is a schematic block diagram of an electronic device 1900 according to an embodiment of the present  application.
As shown in FIG. 11, the electronic device 1900 may include a transceiver 1901, a processor 1902, and a memory 1903. The memory 1903 may be configured to store codes, instructions, or the like, executed by the processor 1902.
It should be understood that the processor 1902 may be an integrated circuit chip and has a signal processing capability. In an implementation process, steps of the foregoing method embodiments may be completed by using a hardware integrated logic circuit in a processor, or by using instructions in a form of software. The processor may be a general purpose processor, a digital signal processor (digital signal processor, DSP) , an application specific integrated circuit (application specific integrated circuit, ASIC) , a field programmable gate array (field programmable gate array, FPGA) , or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, the steps, and the logical block diagrams that are disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed with reference to the embodiments of the present application may be directly performed and completed by a hardware decoding processor, or may be performed and completed by using a combination of hardware in the decoding processor and a software module. The software module may be located in a mature storage medium in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads information in the memory and completes the steps of the foregoing methods in combination with hardware in the processor.
It may be understood that the memory 1903 in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both a volatile memory and a nonvolatile memory. The nonvolatile memory may be a read-only memory (Read-Only Memory, ROM) , a programmable read-only memory (Programmable ROM, PROM) , an erasable programmable read-only memory (Erasable PROM, EPROM) , an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM) , or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM) and is used as an external cache. By way of example rather than limitation, many forms of RAMs may be used, and are, for example, a static random access memory (Static RAM, SRAM) , a dynamic random access memory (Dynamic RAM, DRAM) , a synchronous dynamic random access memory (Synchronous DRAM, SDRAM) , a double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM) , an enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM) , a synchronous link dynamic random access memory (Synchronous link DRAM, SLDRAM) , and a direct rambus random access memory (Direct Rambus RAM, DR RAM) .
It should be noted that the memory in the systems and the methods described in the specification includes but is not limited to these memories and a memory of any other appropriate type.
An embodiment of the present application further provides a system chip, where the system chip includes an input/output interface, at least one processor, at least one memory, and a bus. The at least one memory is configured to store instructions, and the at least one processor is configured to invoke the instructions of the at least one memory to perform operations in the methods in the foregoing embodiments.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a program instruction for performing any of the foregoing methods.
Optionally, the storage medium may be specifically the memory.
An embodiment of the present application further provides a computer program product which, when run on an electronic device, causes the electronic device to perform any of the foregoing methods.
An embodiment of the present application further provides a computer-readable storage medium, storing one or more instructions for execution by a device, wherein the one or more instructions include: performing a multiplication operation on a first operand and a second operand; performing a right bit shift operation on a result of the multiplication operation; and performing an accumulation operation on a result of the right bit shift operation and a third operand.
Optionally, the foregoing steps can belong to one instruction.
In other words, there is an instruction that instructs to perform “multiply-bit shift-add” . For example, this instruction can be used to perform an operation such as a*b±c. The specific description about “multiply-shift-add” can refer to method 400.
The first operand, the second operand and the third operand are examples of the multiple operands. The specific description can refer to method 400.
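A software model of such a fused instruction might look as follows (illustrative Python/NumPy sketch; the operands are assumed to be 8-bit with a 16-bit accumulator):

```python
import numpy as np

def multiply_shift_add(a, b, c, n):
    """Software model of a fused 'multiply - bit shift - add' instruction:
    the product of a and b is right-shifted by n bits before being added
    to the accumulator value c."""
    prod = np.int16(a) * np.int16(b)   # a, b assumed to be 8-bit operands
    return np.int16(c + (prod >> n))
```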
Optionally, the first operand and the second operand are K-bit integers, the accumulation operation is implemented with M-bit integer arithmetic, M<4*K, K is a positive integer, and M is a positive integer.
Optionally, M=2*K.
Optionally, K=8, M=16 or K=4, M=8.
A person of ordinary skill in the art may be aware that, in combination with the examples described in the embodiments disclosed in the specification, units and algorithm steps can be implemented by electronic hardware or a combination of computer software and the electronic hardware. Whether functions are performed by hardware or software depends on particular applications and design constraints of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that  the implementation goes beyond the scope of the present application.
It may be clearly understood by a person skilled in the art that, for a purpose of a convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, refer to a corresponding process in the foregoing method embodiment. Details are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely an example. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may be or may not be physically separate, and parts displayed as units may be or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions in the present application essentially, or the part contributing to the prior art, or some of the technical solutions may be implemented in a form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM) , a random access memory (Random Access Memory, RAM) , a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present application, but are not intended to limit a protection scope of the present application. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the  claims.

Claims (27)

  1. A data processing method implemented by one or more computing devices, comprising:
    performing one or more multiplication operations and one or more accumulation operations on multiple operands; and
    performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations to obtain a result shifted N bit (s) to the right relative to a result of a first arithmetic operation on the multiple operands, wherein the first arithmetic operation comprises the one or more multiplication operations and the one or more accumulation operations, and N is a positive integer.
  2. The method according to claim 1, wherein the first arithmetic operation is any of the following types of arithmetic operations: a dot product operation, a matrix multiplication operation, or a convolution operation.
  3. The method according to claim 1 or 2, wherein the performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations, comprises:
    performing an N-bit right shift operation after every multiplication.
  4. The method according to claim 1 or 2, wherein the one or more accumulation operations are implemented as cascade summations involving multiple summation iterations, and the performing one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations, comprises:
    performing right bit shift operations before at least two summation iterations of the multiple summation iterations.
  5. The method according to any one of claims 1 to 4, wherein the multiple operands are K-bit integers, one or more accumulation operations are implemented with M-bit integer arithmetic, M<4*K, K is a positive integer, and M is a positive integer.
  6. The method according to claim 5, wherein M=2*K.
  7. The method according to claim 6, wherein K=8, M=16 or K=4, M=8.
  8. The method according to any one of claims 1 to 7, wherein the data processing method is applied to inference of a quantized neural network, and the quantized neural network comprises the first arithmetic operation.
  9. The method according to claim 8, wherein the N corresponding to all first arithmetic operations in a first layer of the quantized neural network are the same.
  10. The method according to claim 8, wherein the N corresponding to at least two first arithmetic operations in a second layer of the quantized neural network are different.
  11. An electronic device, comprising:
    a calculation module configured to perform one or more multiplication operations and one or more accumulation operations on multiple operands; and
    a shift module configured to perform one or more right bit shift operations between the one or more multiplication operations and the one or more accumulation operations to obtain a result shifted N bit (s) to the right relative to a result of a first arithmetic operation on the multiple operands, wherein the first arithmetic operation comprises the one or more multiplication operations and the one or more accumulation operations, and N is a positive integer.
  12. The electronic device according to claim 11, wherein the first arithmetic operation is any of the following types of arithmetic operations: a dot product operation, a matrix multiplication operation, or a convolution operation.
  13. The electronic device according to claim 11 or 12, wherein the shift module is further configured to:
    perform an N-bit right shift operation after every multiplication.
  14. The electronic device according to claim 11 or 12, wherein the one or more accumulation operations are implemented as cascade summations involving multiple summation iterations, and the shift module is further configured to:
    perform right bit shift operations before at least two summation iterations of the multiple summation iterations.
  15. The electronic device according to any one of claims 11 to 14, wherein the multiple operands are K-bit integers, the one or more accumulation operations are implemented with M-bit integer arithmetic, M<4*K, K is a positive integer, and M is a positive integer.
  16. The electronic device according to claim 15, wherein M=2*K.
  17. The electronic device according to claim 16, wherein K=8, M=16 or K=4, M=8.
  18. The electronic device according to any one of claims 11 to 17, wherein the electronic device is applied to inference of a quantized neural network, and the quantized neural network comprises the first arithmetic operation.
  19. The electronic device according to claim 18, wherein the N corresponding to all first arithmetic operations in a first layer of the quantized neural network are the same.
  20. The electronic device according to claim 18, wherein the N corresponding to at least two first arithmetic operations in a second layer of the quantized neural network are different.
  21. An electronic device, comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program from the memory and run the computer program, so that a computer on which a chip is disposed performs the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium having instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 10.
  23. A computer program product which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 10.
  24. A computer-readable storage medium, storing one or more instructions for execution by a device, wherein the one or more instructions comprise:
    performing a multiplication operation on a first operand and a second operand;
    performing a right bit shift operation on a result of the multiplication operation; and
    performing an accumulation operation on a result of the right bit shift operation and a third operand.
  25. The computer-readable storage medium according to claim 24, wherein the first operand and the second operand are K-bit integers, the accumulation operation is implemented with M-bit integer arithmetic, M<4*K, K is a positive integer, and M is a positive integer.
  26. The computer-readable storage medium according to claim 25, wherein M=2*K.
  27. The computer-readable storage medium according to claim 26, wherein K=8, M=16 or K=4, M=8.