Disclosure of Invention
(I) technical problems to be solved by the invention
The technical problem solved by the invention is how to reduce the number of times data must be read from memory.
(II) the technical scheme adopted by the invention
A data batching method for a neural network, the data batching method comprising:
acquiring a memory bandwidth and selecting original channel data of N continuous frame images according to the memory bandwidth;
splicing the original channel data of the N continuous frame images to form a plurality of recombined data strips, wherein each recombined data strip comprises the original channel data of the N continuous frame images at the same pixel position;
and sequentially inputting a plurality of recombined data strips into the parallel computing unit array for convolution operation, wherein all original channel data of the same recombined data strip enter the computing unit at the same time.
Preferably, each of the reassembled data strips further includes zero padding data, and a data bit width of each of the reassembled data strips is equal to the memory bandwidth.
Preferably, the data batch processing method further includes:
and storing the multiple recombined data strips into a memory.
Preferably, the method for sequentially inputting the plurality of recombined data strips into the parallel computing unit array to perform convolution operation includes:
performing multiply-add operation on the original channel data of each continuous frame image in each recombined data strip and the same weight data respectively;
and storing the result of the multiply-add operation of each continuous frame image in each recombined data strip into different registers.
Preferably, the memory bandwidth is 128 bits, N is 5, and the raw channel data at each pixel position of each of the consecutive frame images includes red channel data, green channel data, and blue channel data.
The present application also discloses a data batch processing apparatus for a neural network, the data batch processing apparatus comprising:
the data acquisition module is used for acquiring the memory bandwidth and selecting the original channel data of N continuous frame images according to the memory bandwidth;
the data recombination module is used for splicing the original channel data of the N continuous frame images to form a plurality of recombined data strips, wherein each recombined data strip comprises the original channel data of the N continuous frame images at the same pixel position;
and the convolution calculation module is used for reading the multiple recombined data strips and carrying out convolution operation on the multiple recombined data strips in sequence, wherein all original channel data of the same recombined data strip are read by the convolution calculation module at the same time.
Preferably, the data batch processing device further comprises a memory, and the memory is used for receiving and storing the multiple recombined data strips formed by the data recombination module.
Preferably, the convolution calculation module includes:
the multiplier-adder unit is used for respectively carrying out multiplication-addition operation on the original channel data of each continuous frame image in each recombined data strip and the same weight data;
and the storage unit is used for storing the result of the multiply-add operation of each continuous frame image in each recombined data strip.
The invention also discloses a computer-readable storage medium storing a data batch processing program for a neural network, which, when executed by a processor, implements the data batch processing method for a neural network described above.
The invention also discloses a computer device comprising a computer-readable storage medium, a processor, and a data batch processing program for a neural network stored in the computer-readable storage medium, wherein the program, when executed by the processor, implements the data batch processing method for a neural network described above.
(III) advantageous effects
The invention discloses a data batch processing method for a neural network, which has the following technical effects compared with the traditional calculation method:
(1) the optimized data structure oriented to the three-dimensional array enables fast buffering of data and avoids reading the same weights repeatedly across different frame images, thereby greatly reducing the number of accesses to the off-chip memory;
(2) the approach is novel in starting from the characteristics of the input data: it is highly effective when the input to the first convolutional layer is largely static, such as background images or surveillance video, and shows great potential when the convolution kernels of a deep neural network are very deep.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Before describing the various embodiments of the present application in detail, the inventive concept is first briefly stated: in the prior art, convolution is calculated on each frame of a picture in sequence, so the weight data and image data must be read repeatedly, which wastes computing resources.
Example one
Specifically, as shown in fig. 1, the data batch processing method for the neural network according to the first embodiment includes the following steps:
step S10: acquiring a memory bandwidth and selecting original channel data of N continuous frame images according to the memory bandwidth;
step S20: splicing the original channel data of the N continuous frame images to form a plurality of recombined data strips, wherein each recombined data strip comprises the original channel data of the N continuous frame images at the same pixel position;
step S30: and sequentially inputting a plurality of recombined data strips into the parallel computing unit array for convolution operation, wherein all original channel data of the same recombined data strip enter the computing unit at the same time.
In step S10, taking a memory bandwidth of 128 bits as an example: in the prior art, each read from memory fetches the original channel data of one pixel point, comprising red channel data R, green channel data G and blue channel data B, each color channel occupying 8 bits for 24 bits in total; thus only 24 bits of data are read each time and the memory bandwidth is wasted. Based on the characteristics of the convolution calculation process combined with those of the image data, the original channel data of N continuous frame images are selected according to the actually used memory bandwidth and spliced, so that more channel data can be read from the memory in each access and the utilization efficiency of the memory bandwidth is improved.
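As a quick sanity check (an illustrative sketch, not part of the claimed method), the choice N = 5 for a 128-bit bandwidth follows directly from the 24 bits occupied by one pixel's RGB data:

```python
# Sketch: how many frames' worth of one pixel's RGB data fit in one memory word.
# Assumes 8-bit color channels (24 bits per pixel), as in the embodiment.
MEMORY_BANDWIDTH_BITS = 128
BITS_PER_PIXEL = 3 * 8  # R, G, B at 8 bits each

n_frames = MEMORY_BANDWIDTH_BITS // BITS_PER_PIXEL                # 128 // 24 = 5
padding_bits = MEMORY_BANDWIDTH_BITS - n_frames * BITS_PER_PIXEL  # 8 zero bits

print(n_frames, padding_bits)  # → 5 8
```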
Illustratively, the splicing process of step S20 is described in detail taking a memory bandwidth of 128 bits and N equal to 5 as an example. The original channel data of the 5 continuous frame images at the same pixel position are spliced to form one recombined data strip; for example, the original channel data of the 5 continuous frame images at the first pixel point are spliced to form a recombined data strip with a data bit width of 120 bits. As a preferred embodiment, the formed recombined data strip is zero-padded so that its data bit width equals the memory bandwidth; for example, eight zero bits are appended to the end of the 120-bit recombined data strip to form a 128-bit recombined data strip.
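The splicing and zero-padding of step S20 can be sketched as follows (a minimal illustration; the function name and byte-level layout are assumptions, not taken from the patent):

```python
def build_strip(pixels, bandwidth_bits=128):
    """Splice the (R, G, B) bytes of one pixel position taken from N
    consecutive frames into a single strip, then zero-pad the strip up to
    the memory bandwidth. Sketch of step S20; names are illustrative."""
    strip = bytearray()
    for r, g, b in pixels:                  # one (R, G, B) triple per frame
        strip += bytes([r, g, b])
    strip += bytes((bandwidth_bits // 8) - len(strip))  # zero padding
    return bytes(strip)

# 5 frames' RGB values at the same pixel position -> 120 bits + 8 zero bits
strip = build_strip([(10, 20, 30)] * 5)
print(len(strip) * 8)  # → 128
```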
As a preferred embodiment, a Block Memory (model 128-32) from Xilinx is adopted to read the original channel data of each image. However, the Block Memory can only read four color-channel values, i.e. 32 bits, at a time, while only 24 bits are actually needed, so the data read out of the Block Memory must be further recombined. Illustratively, the original channel data of the 5 continuous frame images are temporarily stored in 5 buffers respectively, and, as shown in fig. 3, the color channel data are arranged sequentially by pixel position, i.e. R0G0B0R1G1B1R2G2B2R3G3B3... When reading with the Block Memory, each buffer yields R0G0B0R1 on the first read, G1B1R2G2 on the second read, and B2R3G3B3 on the third read. A first register is provided: the R1 obtained in the first read is stored in the first register for the next splicing, while the R0G0B0 of the 5 images are spliced and zero-padded to form the recombined data strip of the first pixel point, i.e. R0G0B0R0G0B0R0G0B0R0G0B0R0G0B0 followed by 0, and a second register is provided to store this strip, completing the recombination of the first pixel point's original channel data. Similarly, when recombining the original channel data of the second pixel point, the G1B1 of the second read is combined with the R1 held in the first register to form the recombined data strip of the second pixel point, i.e. R1G1B1R1G1B1R1G1B1R1G1B1R1G1B1 followed by 0, and the R2G2 of the second read is stored in the first register for the next recombination.
Similarly, when recombining the original channel data of the third pixel point, the B2 of the third read is spliced with the R2G2 stored in the first register to form the recombined data strip of the third pixel point, i.e. R2G2B2R2G2B2R2G2B2R2G2B2R2G2B2 followed by 0, and the R3G3B3 of the third read is spliced to form the recombined data strip of the fourth pixel point, i.e. R3G3B3R3G3B3R3G3B3R3G3B3R3G3B3 followed by 0. Three reads thus complete the recombination of the original channel data of four pixel points, forming four recombined data strips.
The above steps are repeated until the original channel data of all pixel points of the 5 continuous frame images have been recombined, and all the resulting recombined data strips are stored in the memory for subsequent calculation.
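The 32-bit read / 24-bit pixel mismatch described above can be modeled with a small carry buffer standing in for the "first register" (an illustrative sketch under the stated assumptions; the function name and byte-stream model are not from the patent):

```python
def regroup_channels(raw, word_bytes=4):
    """Sketch of the Block Memory recombination: `raw` is one frame's byte
    stream R0 G0 B0 R1 G1 B1 ..., read 4 bytes (32 bits) at a time. A carry
    buffer (playing the role of the 'first register') holds leftover bytes,
    so every three reads yield the RGB triples of four pixel points."""
    carry = b""
    pixels = []
    for i in range(0, len(raw), word_bytes):
        carry += raw[i:i + word_bytes]      # one 32-bit Block Memory read
        while len(carry) >= 3:              # emit complete 24-bit pixels
            pixels.append(tuple(carry[:3]))
            carry = carry[3:]
    return pixels

stream = bytes(range(12))  # R0 G0 B0  R1 G1 B1  R2 G2 B2  R3 G3 B3
print(regroup_channels(stream))  # → [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]
```

Note how three 4-byte reads produce exactly four 3-byte pixels, matching the three-reads-per-four-pixels cadence of the embodiment.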
In step S30, taking a 32 × 64 parallel computing unit array as an example, the array includes 32 × 64 computing units, 64 data caches TB and 32 weight caches WB, where each data cache TB stores multiple sets of recombined data strips, and the weight data stored in each weight cache WB is shared by the 64 data caches.
As a preferred embodiment, taking a sliding window of size 2 × 2 in the convolution calculation as an example, each data cache stores the recombined data strips of four adjacent pixel points, namely R0G0B0R0G0B0R0G0B0R0G0B0R0G0B00, R1G1B1R1G1B1R1G1B1R1G1B1R1G1B10, R2G2B2R2G2B2R2G2B2R2G2B2R2G2B20 and R3G3B3R3G3B3R3G3B3R3G3B3R3G3B30. During the convolution calculation, the whole of a recombined data strip is written into the computing units at the same time, which on the one hand improves the utilization efficiency of the memory bandwidth and on the other hand reduces the number of memory reads.
Specifically, as shown in fig. 2, the method for sequentially inputting multiple recombined data strips into the parallel computing unit array to perform convolution operation includes:
step S31: performing multiply-add operation on the original channel data of each continuous frame image in each recombined data strip and the same weight data respectively;
step S32: and storing the result of the multiply-add operation of each continuous frame image in each recombined data strip into different registers.
Illustratively, taking 5 continuous frame images as an example, each recombined data strip includes the original channel data of the 5 continuous frame images at one pixel position, and 5 third registers are provided to store the multiply-add results of the 5 continuous frame images respectively. As shown in fig. 5, for the original channel data of the first pixel point of the first continuous frame image, the multiply-add result is F0 = W00*R0 + W01*G0 + W02*B0, and this result is stored in the corresponding third register; once the multiply-add results of all pixel points within the sliding window of the first continuous frame image have been obtained, they are added together. Similarly, the multiply-add result of the first pixel point of the second continuous frame image is F1 = W00*R0 + W01*G0 + W02*B0 (computed on that frame's own channel values) and is stored in its corresponding third register, and so on, with each result stored in a different third register. In the convolution calculation, the same recombined data strip corresponds to the same weight data, so the weight data can be shared and need not be read repeatedly; and because all the original channel data of the same recombined data strip are read into the computing unit at one time, repeated reading of the image data is avoided and the number of memory accesses is reduced.
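Steps S31 and S32 can be sketched as follows (a minimal illustration under the stated assumptions; the function and variable names are not from the patent):

```python
def convolve_strip(strip, weights):
    """Multiply-add sketch of steps S31/S32: each frame's (R, G, B) triple
    in a recombined data strip is multiplied by the SAME weights
    (W00, W01, W02), and each frame's partial sum goes into its own
    'third register'. Illustrative only."""
    w00, w01, w02 = weights
    registers = []                       # one register per continuous frame
    for r, g, b in strip:                # strip: one (R, G, B) per frame
        registers.append(w00 * r + w01 * g + w02 * b)
    return registers

# The same weight group serves all 5 frames in the strip: read once, used 5 times.
regs = convolve_strip([(1, 2, 3)] * 5, (2, 3, 4))
print(regs)  # → [20, 20, 20, 20, 20]
```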
Example two
As shown in fig. 4, the data batch processing apparatus for a neural network in the second embodiment includes a data obtaining module 100, a data reconstructing module 200, and a convolution calculating module 300, where the data obtaining module 100 is configured to obtain a memory bandwidth and select original channel data of N consecutive frame images according to the memory bandwidth; the data reorganization module 200 is configured to splice original channel data of the N consecutive frame images to form multiple reorganization data strips, where each reorganization data strip includes original channel data of the N consecutive frame images at the same pixel position; the convolution calculation module 300 is configured to read a plurality of recombined data strips and perform convolution operation on the plurality of recombined data strips in sequence, where all original channel data of the same recombined data strip are read by the convolution calculation module at the same time. The data batch processing apparatus further includes a memory 400, and the memory 400 is configured to receive and store multiple recombined data strips formed by the data recombining module 200.
Specifically, the data acquisition module 100 includes a plurality of buffers, which read and temporarily store the original channel data of the corresponding images from the memory 400 according to the memory bandwidth. Taking a memory bandwidth of 128 bits and N equal to 5 as an example, 5 different buffers are adopted to read and store the original channel data of the 5 continuous frame images from the memory, with the color channel data arranged sequentially by pixel position, i.e. R0G0B0R1G1B1R2G2B2R3G3B3.
The data reorganization module 200 includes a Block Memory, illustratively a Block Memory of model 128-32 from Xilinx, a first register, a second register, and a counter. The Block Memory can only read four color-channel values, i.e. 32 bits, at a time, while only 24 bits are actually needed, so the data read out of the Block Memory must also be recombined. On the first read, the Block Memory reads R0G0B0R1 from each buffer; the R1 of each image is stored in the first register, while the R0G0B0 of the images are spliced and zero-padded to form the recombined data strip of the first pixel point, i.e. R0G0B0R0G0B0R0G0B0R0G0B0R0G0B00, which is stored in the second register while the counter is set to 0. Similarly, on the second read the Block Memory reads G1B1R2G2 from each buffer; the R2G2 is stored in the first register, and the R1 previously held in the first register is spliced with the G1B1 of the second read and zero-padded to form the recombined data strip of the second pixel point, i.e. R1G1B1R1G1B1R1G1B1R1G1B1R1G1B10, which is stored in the second register while the counter is set to 1. On the third read the Block Memory reads B2R3G3B3 from each buffer; the R2G2 previously held in the first register is spliced with the B2 of the third read and zero-padded to form the recombined data strip of the third pixel point, i.e. R2G2B2R2G2B2R2G2B2R2G2B2R2G2B20, which is stored in the second register while the counter is set to 2; the R3G3B3 of the third read is then spliced and zero-padded to form the recombined data strip of the fourth pixel point, i.e. R3G3B3R3G3B3R3G3B3R3G3B3R3G3B30, which is stored in the second register while the counter is set to 3.
Therefore, every three reads recombine the original channel data of four pixel points into four recombined data strips. The above steps are repeated until the original channel data of all pixel points of the 5 continuous frame images have been recombined, and all the resulting recombined data strips are stored in the memory for subsequent calculation.
Further, as shown in fig. 5, taking a 32 × 64 parallel computing unit array as an example, the parallel computing unit array includes 32 × 64 computing units PE, 64 data caches TB and 32 weight caches WB, where each data cache TB stores multiple sets of reassembly data strips, weight data stored in each weight cache WB is shared by 64 data caches, and the convolution computing module is the computing unit PE.
The convolution calculation module comprises a multiplier-adder unit and a storage unit, wherein the multiplier-adder unit is used for respectively carrying out multiplication-addition operation on the original channel data of each continuous frame image in each recombined data strip and the same weight data, and the storage unit is used for storing the multiplication-addition operation result of each continuous frame image in each recombined data strip.
Illustratively, the multiplier-adder unit includes a multiplier and an adder, and the storage unit includes a data selector 301, a data distributor 302, and 5 third registers 303. For example, for the original channel data of the first pixel point of the first continuous frame image, the multiplier calculates W00*R0, the data selector 301 reads data from the corresponding third register 303, and the adder performs the addition; since the initial value of the third register is zero, the adder's result is W00*R0, which the data distributor 302 then writes to the third register 303. The multiplier next calculates W01*G0, the data selector 301 reads W00*R0 from the corresponding third register 303, the adder produces W00*R0 + W01*G0, and the data distributor 302 writes this result to the third register 303. Finally, the multiplier calculates W02*B0, the data selector 301 reads W00*R0 + W01*G0 from the corresponding third register 303, the adder produces F0 = W00*R0 + W01*G0 + W02*B0, and the data distributor 302 writes F0 to the third register 303. By analogy, the convolution calculation of each item of original channel data is completed. Since the same recombined data strip corresponds to the same weight data, e.g. W00, W01, W02 must each be reused five times, the weight data can be multiplexed by providing an additional address pointer and a counter and holding the address pointer in place while a group of weight data has been used fewer than 5 times.
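The address-pointer/counter scheme for weight multiplexing can be sketched as follows (an illustrative model only; the function name and the flat address sequence are assumptions, not the patent's hardware description):

```python
def weight_addresses(n_strips, reuse=5):
    """Sketch of the weight-multiplexing scheme: the address pointer only
    advances after a weight group has been used `reuse` times (once per
    frame in the strip), so each group is fetched from memory once and
    reused across the whole recombined data strip."""
    addr, count, seq = 0, 0, []
    for _ in range(n_strips * reuse):
        seq.append(addr)                 # this cycle uses weight group `addr`
        count += 1
        if count == reuse:               # group exhausted after 5 uses:
            count = 0                    # reset the counter and
            addr += 1                    # advance the address pointer
    return seq

print(weight_addresses(2))  # → [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```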
The application also discloses a computer readable storage medium, which stores a data batch processing program for the neural network, and the data batch processing program for the neural network realizes the data batch processing method for the neural network when being executed by a processor.
The present application also discloses a computer device. At the hardware level, as shown in fig. 6, the terminal includes a processor 12, an internal bus 13, a network interface 14, and a computer-readable storage medium 11. The processor 12 reads the corresponding computer program from the computer-readable storage medium and then runs it, forming a request processing apparatus at the logical level. Of course, besides software implementations, the one or more embodiments of this specification do not exclude other implementations, such as logic devices or combinations of software and hardware; that is, the execution subject of the following processing flow is not limited to the logic units and may also be hardware or logic devices. The computer-readable storage medium 11 stores a data batch processing program for a neural network which, when executed by a processor, implements the data batch processing method for a neural network described above.
Computer-readable storage media, including both volatile and nonvolatile, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Although a few embodiments of the present invention have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the claims and their equivalents, and that such changes and modifications are intended to be within the scope of the invention.