Disclosure of Invention
(I) technical problems to be solved by the invention
The technical problem solved by the invention is how to reduce the number of times data must be read from memory.
(II) the technical scheme adopted by the invention
A data batching method for a neural network, the data batching method comprising:
acquiring a memory bandwidth and selecting original channel data of N continuous frame images according to the memory bandwidth;
splicing the original channel data of the N continuous frame images to form a plurality of recombined data strips, wherein each recombined data strip comprises the original channel data of the N continuous frame images at the same pixel position;
and sequentially inputting a plurality of recombined data strips into the parallel computing unit array for convolution operation, wherein all original channel data of the same recombined data strip enter the computing unit at the same time.
Preferably, each of the reassembled data strips further includes zero padding data, and a data bit width of each of the reassembled data strips is equal to the memory bandwidth.
Preferably, the data batch processing method further includes:
and storing the multiple recombined data strips into a memory.
Preferably, the method for sequentially inputting the plurality of recombined data strips into the parallel computing unit array to perform convolution operation includes:
performing multiply-add operation on the original channel data of each continuous frame image in each recombined data strip and the same weight data respectively;
and storing the result of the multiply-add operation of each continuous frame image in each recombined data strip into different registers.
Preferably, the memory bandwidth is 128 bits, N is 5, and the raw channel data at each pixel position of each of the consecutive frame images includes red channel data, green channel data, and blue channel data.
The present application also discloses a data batch processing apparatus for a neural network, the data batch processing apparatus comprising:
the data acquisition module is used for acquiring the memory bandwidth and selecting the original channel data of N continuous frame images according to the memory bandwidth;
the data recombination module is used for splicing the original channel data of the N continuous frame images to form a plurality of recombined data strips, wherein each recombined data strip comprises the original channel data of the N continuous frame images at the same pixel position;
and the convolution calculation module is used for reading the multiple recombined data strips and carrying out convolution operation on the multiple recombined data strips in sequence, wherein all original channel data of the same recombined data strip are read by the convolution calculation module at the same time.
Preferably, the data batch processing device further comprises a memory, and the memory is used for receiving and storing the multiple recombined data strips formed by the data recombination module.
Preferably, the convolution calculation module includes:
the multiplier-adder unit is used for respectively carrying out multiplication-addition operation on the original channel data of each continuous frame image in each recombined data strip and the same weight data;
and the storage unit is used for storing the result of the multiply-add operation of each continuous frame image in each recombined data strip.
The invention also discloses a computer-readable storage medium storing a data batch processing program for a neural network, which, when executed by a processor, implements the data batch processing method for a neural network described above.
The invention also discloses a computer device comprising a computer-readable storage medium, a processor, and a data batch processing program for a neural network stored in the computer-readable storage medium, wherein the program, when executed by the processor, implements the data batch processing method for a neural network described above.
(III) advantageous effects
The invention discloses a data batch processing method for a neural network, which has the following technical effects compared with the traditional calculation method:
(1) the optimized data structure oriented to the three-dimensional array enables fast buffering of data and avoids reading the same weights repeatedly across different frame images, thereby greatly reducing the number of accesses to the off-chip memory;
(2) the approach is novel in starting from the characteristics of the input data: it is highly effective when the input to the first convolutional layer is largely static, such as background images or surveillance video, and shows great potential when the convolution kernels of a deep neural network are very deep.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Before describing the various embodiments of the present application in detail, the inventive concept is first briefly stated: in the prior art, convolution is calculated on each frame of a picture in sequence, so the weight data and image data must be read repeatedly, which wastes computing resources.
Example one
Specifically, as shown in fig. 1, the data batch processing method for the neural network according to the first embodiment includes the following steps:
step S10: acquiring a memory bandwidth and selecting original channel data of N continuous frame images according to the memory bandwidth;
step S20: splicing the original channel data of the N continuous frame images to form a plurality of recombined data strips, wherein each recombined data strip comprises the original channel data of the N continuous frame images at the same pixel position;
step S30: and sequentially inputting a plurality of recombined data strips into the parallel computing unit array for convolution operation, wherein all original channel data of the same recombined data strip enter the computing unit at the same time.
In step S10, taking a memory bandwidth of 128 bits as an example: in the prior art, each read from memory fetches the original channel data of one pixel point, comprising red channel data R, green channel data G and blue channel data B, each color channel occupying 8 bits for 24 bits in total; thus only 24 bits of data are read each time and the memory bandwidth is wasted. Based on the characteristics of the convolution calculation process combined with those of the image data, the original channel data of N continuous frame images are selected according to the actually used memory bandwidth and spliced, so that more channel data can be read from the memory in each access and the utilization efficiency of the memory bandwidth is improved.
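As a quick sanity check (an illustrative sketch, not part of the claimed method), the choice N = 5 for a 128-bit bandwidth follows directly from the 24 bits occupied by one pixel's RGB data:

```python
# Sketch: how many frames' worth of one pixel's RGB data fit in one memory word.
# Assumes 8-bit color channels (24 bits per pixel), as in the embodiment.
MEMORY_BANDWIDTH_BITS = 128
BITS_PER_PIXEL = 3 * 8  # R, G, B at 8 bits each

n_frames = MEMORY_BANDWIDTH_BITS // BITS_PER_PIXEL                # 128 // 24 = 5
padding_bits = MEMORY_BANDWIDTH_BITS - n_frames * BITS_PER_PIXEL  # 8 zero bits

print(n_frames, padding_bits)  # → 5 8
```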
Illustratively, the splicing process of step S20 is described in detail taking a memory bandwidth of 128 bits and N equal to 5 as an example. The original channel data of the 5 continuous frame images at the same pixel position are spliced to form one recombined data strip; for example, the original channel data of the 5 continuous frame images at the first pixel point are spliced to form a recombined data strip with a data bit width of 120 bits. As a preferred embodiment, the formed recombined data strip is zero-padded so that its data bit width equals the memory bandwidth; for example, eight zero bits are appended to the end of the 120-bit recombined data strip to form a 128-bit recombined data strip.
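The splicing and zero-padding of step S20 can be sketched as follows (a minimal illustration; the function name and byte-level layout are assumptions, not taken from the patent):

```python
def build_strip(pixels, bandwidth_bits=128):
    """Splice the (R, G, B) bytes of one pixel position taken from N
    consecutive frames into a single strip, then zero-pad the strip up to
    the memory bandwidth. Sketch of step S20; names are illustrative."""
    strip = bytearray()
    for r, g, b in pixels:                  # one (R, G, B) triple per frame
        strip += bytes([r, g, b])
    strip += bytes((bandwidth_bits // 8) - len(strip))  # zero padding
    return bytes(strip)

# 5 frames' RGB values at the same pixel position -> 120 bits + 8 zero bits
strip = build_strip([(10, 20, 30)] * 5)
print(len(strip) * 8)  # → 128
```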
As a preferred embodiment, a Block Memory (model 128-32) from Xilinx is adopted to read the original channel data of each image. However, the Block Memory can only read four color-channel values, i.e. 32 bits, at a time, while only 24 bits are actually needed, so the data read out of the Block Memory must be further recombined. Illustratively, the original channel data of the 5 continuous frame images are temporarily stored in 5 buffers respectively, and, as shown in fig. 3, the color channel data are arranged sequentially by pixel position, i.e. R0G0B0R1G1B1R2G2B2R3G3B3... When reading with the Block Memory, each buffer yields R0G0B0R1 on the first read, G1B1R2G2 on the second read, and B2R3G3B3 on the third read. A first register is provided: the R1 obtained in the first read is stored in the first register for the next splicing, while the R0G0B0 of the 5 images are spliced and zero-padded to form the recombined data strip of the first pixel point, i.e. R0G0B0R0G0B0R0G0B0R0G0B0R0G0B0 followed by 0, and a second register is provided to store this strip, completing the recombination of the first pixel point's original channel data. Similarly, when recombining the original channel data of the second pixel point, the G1B1 of the second read is combined with the R1 held in the first register to form the recombined data strip of the second pixel point, i.e. R1G1B1R1G1B1R1G1B1R1G1B1R1G1B1 followed by 0, and the R2G2 of the second read is stored in the first register for the next recombination.
Similarly, when recombining the original channel data of the third pixel point, the B2 of the third read is spliced with the R2G2 stored in the first register to form the recombined data strip of the third pixel point, i.e. R2G2B2R2G2B2R2G2B2R2G2B2R2G2B2 followed by 0, and the R3G3B3 of the third read is spliced to form the recombined data strip of the fourth pixel point, i.e. R3G3B3R3G3B3R3G3B3R3G3B3R3G3B3 followed by 0. Three reads thus complete the recombination of the original channel data of four pixel points, forming four recombined data strips.
The above steps are repeated until the original channel data of all pixel points of the 5 continuous frame images have been recombined, and all the resulting recombined data strips are stored in the memory for subsequent calculation.
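The 32-bit read / 24-bit pixel mismatch described above can be modeled with a small carry buffer standing in for the "first register" (an illustrative sketch under the stated assumptions; the function name and byte-stream model are not from the patent):

```python
def regroup_channels(raw, word_bytes=4):
    """Sketch of the Block Memory recombination: `raw` is one frame's byte
    stream R0 G0 B0 R1 G1 B1 ..., read 4 bytes (32 bits) at a time. A carry
    buffer (playing the role of the 'first register') holds leftover bytes,
    so every three reads yield the RGB triples of four pixel points."""
    carry = b""
    pixels = []
    for i in range(0, len(raw), word_bytes):
        carry += raw[i:i + word_bytes]      # one 32-bit Block Memory read
        while len(carry) >= 3:              # emit complete 24-bit pixels
            pixels.append(tuple(carry[:3]))
            carry = carry[3:]
    return pixels

stream = bytes(range(12))  # R0 G0 B0  R1 G1 B1  R2 G2 B2  R3 G3 B3
print(regroup_channels(stream))  # → [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]
```

Note how three 4-byte reads produce exactly four 3-byte pixels, matching the three-reads-per-four-pixels cadence of the embodiment.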
In step S30, taking a 32 × 64 parallel computing unit array as an example, the array includes 32 × 64 computing units, 64 data caches TB and 32 weight caches WB, where each data cache TB stores multiple sets of recombined data strips, and the weight data stored in each weight cache WB is shared by the 64 data caches.
As a preferred embodiment, taking a sliding window of size 2 × 2 in the convolution calculation as an example, each data cache stores the recombined data strips of four adjacent pixel points, namely R0G0B0R0G0B0R0G0B0R0G0B0R0G0B00, R1G1B1R1G1B1R1G1B1R1G1B1R1G1B10, R2G2B2R2G2B2R2G2B2R2G2B2R2G2B20 and R3G3B3R3G3B3R3G3B3R3G3B3R3G3B30. During the convolution calculation, the whole of a recombined data strip is written into the computing units at the same time, which on the one hand improves the utilization efficiency of the memory bandwidth and on the other hand reduces the number of memory reads.
Specifically, as shown in fig. 2, the method for sequentially inputting multiple recombined data strips into the parallel computing unit array to perform convolution operation includes:
step S31: performing multiply-add operation on the original channel data of each continuous frame image in each recombined data strip and the same weight data respectively;
step S32: and storing the result of the multiply-add operation of each continuous frame image in each recombined data strip into different registers.
Illustratively, taking 5 continuous frame images as an example, each recombined data strip includes the original channel data of the 5 continuous frame images at one pixel position, and 5 third registers are provided to store the multiply-add results of the 5 continuous frame images respectively. As shown in fig. 5, for the original channel data of the first pixel point of the first continuous frame image, the multiply-add result is F0 = W00*R0 + W01*G0 + W02*B0, and this result is stored in the corresponding third register; once the multiply-add results of all pixel points within the sliding window of the first continuous frame image have been obtained, they are added together. Similarly, the multiply-add result of the first pixel point of the second continuous frame image is F1 = W00*R0 + W01*G0 + W02*B0 (computed on that frame's own channel values) and is stored in its corresponding third register, and so on, with each result stored in a different third register. In the convolution calculation, the same recombined data strip corresponds to the same weight data, so the weight data can be shared and need not be read repeatedly; and because all the original channel data of the same recombined data strip are read into the computing unit at one time, repeated reading of the image data is avoided and the number of memory accesses is reduced.
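Steps S31 and S32 can be sketched as follows (a minimal illustration under the stated assumptions; the function and variable names are not from the patent):

```python
def convolve_strip(strip, weights):
    """Multiply-add sketch of steps S31/S32: each frame's (R, G, B) triple
    in a recombined data strip is multiplied by the SAME weights
    (W00, W01, W02), and each frame's partial sum goes into its own
    'third register'. Illustrative only."""
    w00, w01, w02 = weights
    registers = []                       # one register per continuous frame
    for r, g, b in strip:                # strip: one (R, G, B) per frame
        registers.append(w00 * r + w01 * g + w02 * b)
    return registers

# The same weight group serves all 5 frames in the strip: read once, used 5 times.
regs = convolve_strip([(1, 2, 3)] * 5, (2, 3, 4))
print(regs)  # → [20, 20, 20, 20, 20]
```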
Example two
As shown in fig. 4, the data batch processing apparatus for a neural network in the second embodiment includes a data obtaining module 100, a data reconstructing module 200, and a convolution calculating module 300, where the data obtaining module 100 is configured to obtain a memory bandwidth and select original channel data of N consecutive frame images according to the memory bandwidth; the data reorganization module 200 is configured to splice original channel data of the N consecutive frame images to form multiple reorganization data strips, where each reorganization data strip includes original channel data of the N consecutive frame images at the same pixel position; the convolution calculation module 300 is configured to read a plurality of recombined data strips and perform convolution operation on the plurality of recombined data strips in sequence, where all original channel data of the same recombined data strip are read by the convolution calculation module at the same time. The data batch processing apparatus further includes a memory 400, and the memory 400 is configured to receive and store multiple recombined data strips formed by the data recombining module 200.
Specifically, the data acquisition module 100 includes a plurality of buffers, which read and temporarily store the original channel data of the corresponding images from the memory 400 according to the memory bandwidth. Taking a memory bandwidth of 128 bits and N equal to 5 as an example, 5 different buffers are adopted to read and store the original channel data of the 5 continuous frame images from the memory, with the color channel data arranged sequentially by pixel position, i.e. R0G0B0R1G1B1R2G2B2R3G3B3.
The data reorganization module 200 includes a Block Memory, illustratively a Block Memory of model 128-32 from Xilinx, a first register, a second register, and a counter. The Block Memory can only read four color-channel values, i.e. 32 bits, at a time, while only 24 bits are actually needed, so the data read out of the Block Memory must also be recombined. On the first read, the Block Memory reads R0G0B0R1 from each buffer; the R1 of each image is stored in the first register, while the R0G0B0 of the images are spliced and zero-padded to form the recombined data strip of the first pixel point, i.e. R0G0B0R0G0B0R0G0B0R0G0B0R0G0B00, which is stored in the second register while the counter is set to 0. Similarly, on the second read the Block Memory reads G1B1R2G2 from each buffer; the R2G2 is stored in the first register, and the R1 previously held in the first register is spliced with the G1B1 of the second read and zero-padded to form the recombined data strip of the second pixel point, i.e. R1G1B1R1G1B1R1G1B1R1G1B1R1G1B10, which is stored in the second register while the counter is set to 1. On the third read the Block Memory reads B2R3G3B3 from each buffer; the R2G2 previously held in the first register is spliced with the B2 of the third read and zero-padded to form the recombined data strip of the third pixel point, i.e. R2G2B2R2G2B2R2G2B2R2G2B2R2G2B20, which is stored in the second register while the counter is set to 2; the R3G3B3 of the third read is then spliced and zero-padded to form the recombined data strip of the fourth pixel point, i.e. R3G3B3R3G3B3R3G3B3R3G3B3R3G3B30, which is stored in the second register while the counter is set to 3.
Therefore, every three reads recombine the original channel data of four pixel points into four recombined data strips. The above steps are repeated until the original channel data of all pixel points of the 5 continuous frame images have been recombined, and all the resulting recombined data strips are stored in the memory for subsequent calculation.
Further, as shown in fig. 5, taking a 32 × 64 parallel computing unit array as an example, the parallel computing unit array includes 32 × 64 computing units PE, 64 data caches TB and 32 weight caches WB, where each data cache TB stores multiple sets of reassembly data strips, weight data stored in each weight cache WB is shared by 64 data caches, and the convolution computing module is the computing unit PE.
The convolution calculation module comprises a multiplier-adder unit and a storage unit, wherein the multiplier-adder unit is used for respectively carrying out multiplication-addition operation on the original channel data of each continuous frame image in each recombined data strip and the same weight data, and the storage unit is used for storing the multiplication-addition operation result of each continuous frame image in each recombined data strip.
Illustratively, the multiplier-adder unit includes a multiplier and an adder, and the storage unit includes a data selector 301, a data distributor 302, and 5 third registers 303. For example, for the original channel data of the first pixel point of the first continuous frame image, the multiplier calculates W00*R0, the data selector 301 reads data from the corresponding third register 303, and the adder performs the addition; since the initial value of the third register is zero, the adder's result is W00*R0, which the data distributor 302 then writes to the third register 303. The multiplier next calculates W01*G0, the data selector 301 reads W00*R0 from the corresponding third register 303, the adder produces W00*R0 + W01*G0, and the data distributor 302 writes this result to the third register 303. Finally, the multiplier calculates W02*B0, the data selector 301 reads W00*R0 + W01*G0 from the corresponding third register 303, the adder produces F0 = W00*R0 + W01*G0 + W02*B0, and the data distributor 302 writes F0 to the third register 303. By analogy, the convolution calculation of each item of original channel data is completed. Since the same recombined data strip corresponds to the same weight data, e.g. W00, W01, W02 must each be reused five times, the weight data can be multiplexed by providing an additional address pointer and a counter and holding the address pointer in place while a group of weight data has been used fewer than 5 times.
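The address-pointer/counter scheme for weight multiplexing can be sketched as follows (an illustrative model only; the function name and the flat address sequence are assumptions, not the patent's hardware description):

```python
def weight_addresses(n_strips, reuse=5):
    """Sketch of the weight-multiplexing scheme: the address pointer only
    advances after a weight group has been used `reuse` times (once per
    frame in the strip), so each group is fetched from memory once and
    reused across the whole recombined data strip."""
    addr, count, seq = 0, 0, []
    for _ in range(n_strips * reuse):
        seq.append(addr)                 # this cycle uses weight group `addr`
        count += 1
        if count == reuse:               # group exhausted after 5 uses:
            count = 0                    # reset the counter and
            addr += 1                    # advance the address pointer
    return seq

print(weight_addresses(2))  # → [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```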
The application also discloses a computer readable storage medium, which stores a data batch processing program for the neural network, and the data batch processing program for the neural network realizes the data batch processing method for the neural network when being executed by a processor.
The present application also discloses a computer device. At the hardware level, as shown in fig. 6, the terminal includes a processor 12, an internal bus 13, a network interface 14, and a computer-readable storage medium 11. The processor 12 reads the corresponding computer program from the computer-readable storage medium and then runs it, forming a request processing apparatus at the logical level. Of course, besides software implementations, the one or more embodiments of this specification do not exclude other implementations, such as logic devices or combinations of software and hardware; that is, the execution subject of the following processing flow is not limited to the logic units and may also be hardware or logic devices. The computer-readable storage medium 11 stores a data batch processing program for a neural network which, when executed by a processor, implements the data batch processing method for a neural network described above.
Computer-readable storage media, including both volatile and nonvolatile, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Although a few embodiments of the present invention have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the claims and their equivalents, and that such changes and modifications are intended to be within the scope of the invention.