CN117789157A - Real-time target detection method and system


Info

Publication number
CN117789157A
Authority
CN
China
Prior art keywords
convolution
execution
target detection
array
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211164904.9A
Other languages
Chinese (zh)
Inventor
冯黎
高小龙
姜珊
黄建强
李辉
张秋磊
李建飞
李博伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Machinery Equipment Research Institute
Original Assignee
Beijing Machinery Equipment Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Machinery Equipment Research Institute filed Critical Beijing Machinery Equipment Research Institute
Priority to CN202211164904.9A
Publication of CN117789157A
Legal status: Pending


Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a real-time target detection method and system, belongs to the technical field of image recognition, and solves the problem of the low detection speed of target detection networks. The method comprises the following steps: receiving an image to be detected, preprocessing it to obtain a feature map to be detected in a preset format, and storing the feature map to be detected into the DDR (Double Data Rate memory) of the PS (Processing System) end; loading a preset weight pre-stored in the DDR of the PS end into the DDR of the PL (Programmable Logic) end; determining the size of a convolution layer PE array according to the number of available DSPs at the PL end, the number of convolution kernels of a plurality of convolution layers included in the target detection network and the input feature maps of the plurality of convolution layers; performing convolution operations on the plurality of convolution layers in the target detection network using the convolution layer PE array, obtaining the target detection result once the target detection network finishes its operations; and determining the position range and category of each target in the image to be detected according to the target detection result and displaying them on a visualization device for the user to view. The speed at which the target detection network detects images is improved.

Description

Real-time target detection method and system
Technical Field
The invention relates to the technical field of image recognition, in particular to a real-time target detection method and system.
Background
An automatic driving system refers to a highly centralized train operation control system that fully automates the work otherwise performed by the train driver. The automatic driving system provides functions such as automatic train wake-up, start-up and sleep, automatic entry into and exit from the depot, automatic cleaning, automatic driving, automatic stopping, automatic opening and closing of the doors and automatic fault recovery, and supports multiple operation modes such as normal operation, degraded operation and interrupted operation.
The main function of the automatic driving system is to accurately identify objects around the vehicle in real time to ensure safe and correct control decisions. The target detection network based on the deep learning target detection and semantic segmentation algorithm has become very important in the field of automatic driving, and the rapid reasoning speed is a key for ensuring the safety of automatic driving.
Target detection networks in the prior art have slow inference speeds and are cumbersome to maintain, which hinders development in the autonomous driving field.
Disclosure of Invention
In view of the above analysis, the present invention aims to provide a real-time target detection method and system, which are used for solving the problems of slow detection speed and difficult maintenance of a target detection network in the prior art.
In one aspect, an embodiment of the present invention provides a real-time target detection method, where the method includes:
receiving an image to be detected, preprocessing it to obtain a feature map to be detected in a preset format, and storing the feature map to be detected into the DDR (Double Data Rate memory) of the PS (Processing System) end;
loading a preset weight pre-stored in the DDR of the PS end into the DDR of the PL end, wherein the preset weight is weight data pre-trained according to a target detection network;
determining the size of a convolution layer PE array according to the number of available DSPs at the PL (Programmable Logic) end, the number of convolution kernels of a plurality of convolution layers included in the target detection network and the input feature maps of the plurality of convolution layers; the size of the convolution layer PE array is C×K×L, where C is the execution depth, K is the number of execution convolution kernels, and L is the execution column parallelism;
performing convolution operation on a plurality of convolution layers in a target detection network by using a convolution layer PE array to obtain a target detection result after the target detection network operation is completed; in the convolution operation process, the feature map to be detected and the preset weight are scheduled to the PL end to participate in the operation;
and determining the position range and the category of each target in the image to be detected according to the target detection result, and displaying the position range and the category on the visualization equipment for the user to view.
Based on a further improvement of the above method, the determining the size of the convolutional layer PE array according to the number of available DSPs at the PL end, the number of convolutional kernels of a plurality of convolutional layers included in the target detection network, and the input feature map of the plurality of convolutional layers includes:
determining a plurality of execution depths and a plurality of execution column parallelisms according to the sizes of the input feature maps of the plurality of convolution layers included in the target detection network;
determining a plurality of execution convolution kernel numbers according to the numbers of convolution kernels of the plurality of convolution layers included in the target detection network;
and determining the size of the convolution layer PE array according to the number of available DSPs at the PL end, the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers.
Based on a further improvement of the above method, the determining the size of the convolution layer PE array according to the number of available DSPs at the PL end, the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers includes:
selecting one execution depth, one execution column parallelism and one execution convolution kernel number from the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers respectively, such that the selected execution depth, execution column parallelism and execution convolution kernel number together with the number of available DSPs satisfy the following condition:
C×K×L ≤ 2×N;
wherein C×K×L represents the size of the convolution layer PE array, N represents the number of available DSPs at the PL end, C represents the execution depth, K represents the number of execution convolution kernels, and L represents the execution column parallelism.
Based on a further improvement of the above method, the performing convolution operation on the plurality of convolution layers in the target detection network by using the convolution layer PE array includes:
for each of the plurality of convolution layers, dividing the input feature map H1×W1×P1 of the current convolution layer along the column direction into W1/(L×S) regions according to the execution column parallelism L and the execution step S of the convolution kernel of the current convolution layer, wherein each region includes an input feature map of size H1×(L×S)×P1;
and sequentially carrying out the following steps on each region according to the direction from the first column to the last column of the input feature diagram of the current convolution layer:
dividing the input feature map H1×(L×S)×P1 of the current region along the row direction into H1/S sub-regions according to the number of rows X of the convolution kernel of the current convolution layer and the execution step S of the convolution kernel of the current convolution layer, wherein each sub-region includes an input feature map of size X×(L×S)×P1; the convolution kernel size of the current convolution layer is X×Y×P1, where X is the number of rows, Y is the number of columns, and P1 is the depth;
and carrying out convolution operation on each sub-region in sequence according to the directions from the first row to the last row of the input feature diagram of the current convolution layer.
Based on a further improvement of the above method, the performing convolution operation on each sub-region sequentially according to the direction from the first row to the last row of the input feature diagram of the current convolution layer includes:
dividing the current sub-region X×(L×S)×P1 into P1/C spaces according to the execution depth C of the convolution layer PE array, wherein the size of the input feature map included in each space is X×(L×S)×C;
and sequentially carrying out the following processing on each space according to the direction from the first layer depth to the last layer depth of the input feature map of the current convolution layer:
determining the number M/K of kernel execution times according to the number M of the convolution kernels of the current convolution layer and the number K of the execution convolution kernels of the PE array of the convolution layer;
and scheduling K convolution kernels of the current convolution layer and the weight data of the corresponding preset weights each time to perform a convolution operation on the input feature map X×(L×S)×C included in the current space, performing M/K kernel executions until all M convolution kernels of the current convolution layer have been executed.
Based on a further improvement of the above method, the scheduling of K convolution kernels of the current convolution layer and the weight data of the corresponding preset weights each time to perform a convolution operation on the input feature map X×(L×S)×C included in the current space includes:
sequentially performing X×Y cycles of convolution operations on the corresponding input feature map of the current space, in the direction from the first row to the X-th row and from the first column to the Y-th column of the convolution kernel of the current convolution layer;
and in each cycle, according to the execution step of the convolution kernels, selecting the weight data of the preset weights at one row, one column and C depths of each of the K convolution kernels, multiplying them with the feature map data of one row, L columns and C depths of the input feature map included in the corresponding current space, and then accumulating, wherein each convolution kernel obtains L intermediate results.
Based on a further improvement of the method, a first image buffer space and a second image buffer space are preset in the DDR of the PS end; the first image buffer space is used for storing a plurality of feature maps to be detected; and each feature map to be detected is quantized to 8-bit fixed point, reordered in the depth, row and column directions, and each reordered feature map is mapped and stored to the second image buffer space.
Based on further improvement of the method, the preset weight is obtained through training of the following steps:
training the target detection network with the public KITTI data set to obtain the trained weight data;
performing 8-bit fixed-point quantization on the trained weight data, and storing the fixed-point-quantized weight data into an SD card in depth, row, column order;
and loading and storing the weight data in the SD card into the DDR at the PS end to serve as a preset weight.
In another aspect, an embodiment of the present invention provides a real-time object detection system, including:
the image data receiving module is used for receiving the image to be detected, preprocessing it to obtain a feature map to be detected in a preset format, and storing the feature map to be detected into the PS-end DDR;
the weight data loading module is used for loading preset weights pre-stored in the DDR of the PS end into the DDR of the PL end, wherein the preset weights are weight data pre-trained according to the target detection network;
the PE array determining module is used for determining the size of the convolution layer PE array according to the number of available DSPs at the PL end, the number of convolution kernels of the plurality of convolution layers included in the target detection network and the input feature maps of the plurality of convolution layers; the size of the convolution layer PE array is C×K×L, where C is the execution depth, K is the number of execution convolution kernels, and L is the execution column parallelism;
the PE array calculation module is used for carrying out convolution operation on a plurality of convolution layers in the target detection network by utilizing the convolution layer PE array to obtain a target detection result after the target detection network operation is completed; in the convolution operation process, the feature map to be detected and the preset weight are scheduled to the PL end to participate in the operation;
and the detection result display module is used for determining the position range and the category of each target in the image to be detected according to the target detection result and displaying the position range and the category on the visual equipment for the user to check.
Based on a further improvement of the above system, the PE array determining module includes:
the depth and column parallelism determining module is used for determining a plurality of execution depths and a plurality of execution column parallelisms according to the sizes of the input feature maps of the plurality of convolution layers included in the target detection network;
the convolution kernel number determining module is used for determining a plurality of execution convolution kernel numbers according to the numbers of convolution kernels of the plurality of convolution layers included in the target detection network;
and the PE array determining submodule is used for determining the size of the convolution layer PE array according to the number of available DSPs at the PL end, the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers.
Compared with the prior art, the invention has at least one of the following beneficial effects:
1. The detection speed of the target detection network of the prior art is improved, so that the target detection network can recognize the image to be detected more quickly.
2. The target detection network is easier to maintain and more portable, which facilitates iterative upgrading of the target detection algorithm.
In the invention, the technical schemes can be mutually combined to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
FIG. 1 is a flow chart of a real-time target detection method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a target detection network structure based on YOLO V3 tiny algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of convolution operation of a PE array with a convolution layer according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a real-time object detection system according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and together with the description serve to explain the principles of the invention, and are not intended to limit the scope of the invention.
In one embodiment of the present invention, a real-time object detection method is disclosed, as shown in fig. 1.
Step S101: and receiving the image to be detected, preprocessing the image to be detected to obtain a feature image to be detected in a preset format, and storing the feature image to be detected into the DDR (double data Rate) of the PS end.
Specifically, the image to be detected may be received from an image acquisition device, such as a camera, video camera, scanner or other device with photographing capability (cell phone, tablet computer, etc.). Because the images to be detected acquired by different image acquisition devices do not necessarily meet the input format requirement of the target detection network, the acquired image to be detected needs to be preprocessed; the preprocessing may include demosaicing, gamma correction, video scaling and the like. The preprocessed image to be detected meets the input format requirement of the target detection network.
It should be noted that fig. 2 shows a target detection network based on the YOLO V3 tiny algorithm; the network includes 23 layers, where layers denotes the layer index, filters denotes the number of convolution kernels, size/strd denotes the size/execution step (stride) of the convolution kernels, input denotes the input feature map and output denotes the output feature map. The input image needs to meet the preset format of 416×416×3, that is, feature map data with 416 rows, 416 columns and a depth of 3.
The preset format may be an input format requirement of the target detection network. The feature map to be detected corresponding to the pre-processed image to be detected can be stored in a Double Data Rate (DDR) of a Processing System (PS).
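As a rough illustration of this step, the following Python sketch scales an acquired frame to the 416×416×3 preset format. It assumes OpenCV is available, and the choice of bilinear interpolation is an assumption, since the patent names demosaicing, gamma correction and video scaling without fixing the algorithms.

```python
import cv2
import numpy as np

def preprocess(image_bgr: np.ndarray) -> np.ndarray:
    """Scale an acquired image to the network's preset input format.

    Returns a 416 x 416 x 3 uint8 feature map matching the 416-row,
    416-column, 3-depth input of the YOLO V3 tiny network in fig. 2.
    """
    # Video scaling to the preset resolution; steps such as demosaicing
    # or gamma correction would precede this call where needed.
    resized = cv2.resize(image_bgr, (416, 416),
                         interpolation=cv2.INTER_LINEAR)  # assumed method
    return resized.astype(np.uint8)
```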
Preferably, a first image buffer space and a second image buffer space are preset in the DDR of the PS end; the first image buffer space is used for storing a plurality of feature maps to be detected; and each feature map to be detected is quantized to 8-bit fixed point, reordered in the depth, row and column directions, and each reordered feature map is mapped and stored to the second image buffer space.
For example: after preprocessing the image to be detected, 416×416×3 RGB-format data are obtained, each datum being one byte, i.e. 8 bits. These data constitute the AXI-Stream video stream data written into DDR. In DDR, one byte of data is stored per address. Addresses 0x01000000, 0x01500000 and 0x02000000 are the head addresses of the DDR addresses for three-frame buffering, and together they form the first image buffer space. One picture has 416×416×3 = 519168 bytes of data. Storing an image at address 0x01000000 means storing the first byte of the image at address 0x01000000, the second byte at address 0x01000001, and so on until all 519168 bytes are stored.
Three-frame buffering means that images are saved at three different DDR addresses. In the embodiment of the invention, the first acquired frame is stored in DDR at head address 0x01000000, the second frame at 0x01500000, and the third frame at 0x02000000. Three-frame buffering is a reasonable design because single-frame buffering has the defect that, when the data source is input continuously, the frame buffer may hold the overlapped result of two or more frames of image data; three-frame buffering is therefore required.
The acquired image data is stored to the DDR in depth -> width -> height order. That is, the RGB three bytes of the first row, first column are stored, then the RGB three bytes of the first row, second column, until the 416th column of the first row is stored; then the first column of the second row, and so on. When the convolution layer PE array of the embodiment of the invention performs convolution operations, data needs to be fetched by rows, i.e. the images need to be reordered into depth -> height -> width sequence and placed into DDR.
The three acquired frames are reordered and then placed into the PS-end DDR at head addresses 0x02500000, 0x03000000 and 0x03500000; these addresses constitute the second image buffer space.
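A minimal sketch of the buffering and reordering described above, using NumPy; the helper names are illustrative, and the byte-level layout follows the address scheme given in this embodiment.

```python
import numpy as np

# Head addresses of the first (raw) and second (reordered) image buffer
# spaces, as given in the embodiment.
RAW_BUFFERS = (0x01000000, 0x01500000, 0x02000000)
REORDERED_BUFFERS = (0x02500000, 0x03000000, 0x03500000)

def raw_address(frame_index: int, byte_offset: int) -> int:
    """One byte per DDR address: frame head address plus byte offset."""
    return RAW_BUFFERS[frame_index % 3] + byte_offset

def reorder_for_pe_array(frame: np.ndarray) -> bytes:
    """Turn depth->width->height storage (depth fastest, then column,
    then row) into depth->height->width storage, so the convolution
    layer PE array can fetch feature data by rows."""
    # (height, width, depth) -> (width, height, depth); flattening the
    # result row-major yields the depth->height->width byte sequence.
    return np.ascontiguousarray(frame.transpose(1, 0, 2)).tobytes()
```

For a 416×416×3 frame this produces the same 519168 bytes, only reordered.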
Step S102: loading a preset weight pre-stored in the DDR of the PS end into the DDR of the PL end, wherein the preset weight is weight data pre-trained according to the target detection network.
Specifically, weight data of the target detection network pre-trained is pre-stored in the DDR at the PS end. After the feature map to be detected in the preset format is obtained, loading preset weights to a programmable logic (Programmable Logic, PL) end.
Preferably, the preset weight can be obtained through training by the following steps:
training the target detection network with the public KITTI data set to obtain the trained weight data;
performing 8-bit fixed-point quantization on the trained weight data, and storing the fixed-point-quantized weight data into an SD card in depth, row, column order;
and loading and storing the weight data in the SD card into the DDR at the PS end to serve as a preset weight.
It can be understood that the public KITTI data set is used to train the YOLO V3 tiny algorithm; the trained weight data and bias data are separated into two binary files, and the CONV layers and BN layers of the network weights can be fused. The newly fused convolution layers are quantized to 8-bit fixed-point numbers. The quantized weight data and bias data are rearranged according to the memory-access order, stored into a bin-format file and placed on an SD card, and the weight data is read into address 0x06000000 of the DDR at the Processing System (PS) end by the Vitis development software. The weight data in the PS-end DDR is then loaded in full into the Programmable Logic (PL)-end DDR (start address 0x00000000), and the weight data is remapped to the AXI bus address (AXI address signal minus 0x06000000) when read.
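The following sketch outlines the CONV+BN fusion and 8-bit fixed-point quantization described above. The BN folding formula is standard; the per-tensor fraction-bit allocation and the axis order of the final rearrangement are assumptions, since the patent only states 8-bit fixed-point quantization and a depth, row, column storage order.

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BN layer into the preceding CONV layer (w: (M, C, X, Y))."""
    scale = gamma / np.sqrt(var + eps)        # one factor per output channel
    w_fused = w * scale[:, None, None, None]
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused

def quantize_fixed8(x, frac_bits):
    """8-bit fixed point: round to int8 with `frac_bits` fractional bits
    (the bit allocation here is an assumption)."""
    q = np.round(x * (1 << frac_bits))
    return np.clip(q, -128, 127).astype(np.int8)

def reorder_and_dump(w_q, path="weights.bin"):
    """Rearrange (M, C, X, Y) weights so depth varies fastest, then row,
    then column (axis order assumed), and dump the bin file for the SD
    card."""
    np.ascontiguousarray(w_q.transpose(0, 3, 2, 1)).tofile(path)
```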
Step S103: determining the size of the convolution layer PE array according to the number of available DSPs at the PL end, the number of convolution kernels of the plurality of convolution layers included in the target detection network and the input feature maps of the plurality of convolution layers; the size of the convolution layer PE array is C×K×L, where C is the execution depth, K is the number of execution convolution kernels, and L is the execution column parallelism.
Specifically, the reasoning process of the target detection network can be completed by utilizing the logic resource of the PL end. It is thus necessary to determine the number of available DSPs on the PL side, and in order to increase the inference speed of the target detection network, it is necessary to utilize the number of available DSPs on the PL side as large as possible.
Specifically, as can be seen from fig. 2, the target detection network includes 13 convolution layers conv, namely layers 0, 2, 4, 6, 8, 10, 12, 13, 14, 15, 18, 21 and 22; the numbers of convolution kernels included in the convolution layers are 16, 32, 64, ..., 255, and the input feature maps of the convolution layers are 416×416×3, 208×208×16, 104×104×32, ..., 26×26×256, respectively.
The size of the convolution layer PE array is then determined from the number of available DSPs at the PL end, the numbers of convolution kernels of the plurality of convolution layers included in the target detection network and the input feature maps of the plurality of convolution layers, where the size of the convolution layer PE array is C×K×L, C is the execution depth, K is the number of execution convolution kernels, and L is the execution column parallelism.
Preferably, the determining the size of the convolution layer PE array according to the number of available DSPs at the PL end, the number of convolution kernels of the plurality of convolution layers included in the target detection network and the input feature maps of the plurality of convolution layers includes:
determining a plurality of execution depths and a plurality of execution column parallelisms according to the sizes of the input feature maps of the plurality of convolution layers included in the target detection network;
determining a plurality of execution convolution kernel numbers according to the numbers of convolution kernels of the plurality of convolution layers included in the target detection network;
and determining the size of the convolution layer PE array according to the number of available DSPs at the PL end, the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers.
It will be appreciated that, as shown in fig. 2, layers 2, 4, 6 and 8 of the target detection network are selected to perform convolution operation of the convolution layer PE array, and the number of available DSPs at the PL end is 1000.
The numbers of convolution kernels of layers 2, 4, 6 and 8 of the target detection network are 32, 64, 128 and 256 respectively, so a plurality of execution convolution kernel numbers such as 32, 16, 8, 4 and 2 can be selected; each execution convolution kernel number is a common divisor of the kernel counts, i.e. a divisor of the smallest kernel count among the selected convolution layers.
According to the 208 columns of the layer-2 input feature map 208×208×16, the 104 columns of the layer-4 input feature map 104×104×32, the 52 columns of the layer-6 input feature map 52×52×64 and the 26 columns of the layer-8 input feature map 26×26×128, execution column parallelisms of 13 and 26 can be selected; each execution column parallelism is a common divisor of the column counts, i.e. a divisor of the smallest column count among the input feature maps.
According to the 16 depths of the layer-2 input feature map 208×208×16, the 32 depths of the layer-4 input feature map 104×104×32, the 64 depths of the layer-6 input feature map 52×52×64 and the 128 depths of the layer-8 input feature map 26×26×128, a plurality of execution depths such as 16, 8 and 4 can be selected, the selected execution depths being common divisors of the depths, i.e. divisors of the smallest depth among the input feature maps.
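Under the stated divisibility rule, the candidate parameters can be enumerated as below; this is a small sketch of the selection logic, with the layer values taken from fig. 2 (the candidate lists also include 1, which the text omits).

```python
from functools import reduce
from math import gcd

def common_divisors(values):
    """Common divisors of all values = divisors of their gcd; here the
    gcd equals the smallest value, since the larger counts are its
    multiples."""
    g = reduce(gcd, values)
    return [d for d in range(1, g + 1) if g % d == 0]

# Layers 2, 4, 6 and 8 of the network in fig. 2:
kernel_counts = common_divisors([32, 64, 128, 256])        # 1, 2, 4, 8, 16, 32
column_parallelisms = common_divisors([208, 104, 52, 26])  # 1, 2, 13, 26
execution_depths = common_divisors([16, 32, 64, 128])      # 1, 2, 4, 8, 16
```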
Preferably, the determining the size of the convolution layer PE array according to the number of available DSPs at the PL end, the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers includes:
selecting one execution depth, one execution column parallelism and one execution convolution kernel number from the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers respectively, such that the selected execution depth, execution column parallelism and execution convolution kernel number together with the number of available DSPs satisfy the following condition:
C×K×L ≤ 2×N;
wherein C×K×L represents the size of the convolution layer PE array, N represents the number of available DSPs at the PL end, C represents the execution depth, K represents the number of execution convolution kernels, and L represents the execution column parallelism.
It can be appreciated that, since the convolution operation is completed using the logic resources of the PL end, the selected execution depth, execution column parallelism and execution convolution kernel number must satisfy, together with the number of available DSPs, the condition C×K×L ≤ 2×N.
Specifically, to make better use of the number of available DSPs at the PL end, one execution depth, one execution column parallelism and one execution convolution kernel number may be selected from the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers such that, under the constraint C×K×L ≤ 2×N, the combination maximizing C×K×L is chosen. The DSP computing resources can then be fully utilized, the convolution operations of the target detection network can be completed faster, and the detection speed of the target detection network is significantly improved.
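A brute-force sketch of this selection, under the assumption that maximizing C×K×L subject to C×K×L ≤ 2N is the only criterion (ties between equal-size combinations are broken by enumeration order, which the patent does not specify):

```python
def choose_pe_array(depths, kernel_nums, col_parallelisms, n_dsp):
    """Pick (C, K, L) maximizing C*K*L under C*K*L <= 2*N, so the
    available DSP resources are used as fully as possible."""
    best = None
    for c in depths:
        for k in kernel_nums:
            for l in col_parallelisms:
                size = c * k * l
                if size <= 2 * n_dsp and (best is None or size > best[0]):
                    best = (size, c, k, l)
    return best  # (C*K*L, C, K, L)

depths = [1, 2, 4, 8, 16]            # candidates from the previous sketch
kernel_nums = [1, 2, 4, 8, 16, 32]
col_parallelisms = [1, 2, 13, 26]
print(choose_pe_array(depths, kernel_nums, col_parallelisms, 1000))
# -> (1664, 2, 32, 26); the 4x16x13 array used in fig. 3 (832 slots)
#    also satisfies the bound.
```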
Compared with the prior art, the method has the advantages that the convolution layer PE arrays with different sizes are determined according to the number of available DSPs at the PL end, the number of convolution kernels of a plurality of convolution layers included in the target detection network and the input feature map, so that the applicability of the target detection network on different hardware can be improved.
Step S104: performing convolution operation on a plurality of convolution layers in a target detection network by using a convolution layer PE array to obtain a target detection result after the target detection network operation is completed; in the convolution operation process, the feature map to be detected and the preset weight are scheduled to the PL end to participate in the operation.
Specifically, the convolution layer PE array specifies how the convolution operations in the target detection network are executed: under the execution conditions of the convolution layer PE array, each execution selects an input feature map slice of L columns and C depths together with K convolution kernels to participate in the convolution operation.
The convolution operation is performed on a plurality of convolution layers in the target detection network using the convolution layer PE array; for example, layers 2, 4, 6 and others among the 13 convolution layers in fig. 2 may be selected to participate in the convolution operations, and the remaining layers of the target detection network may be computed in other ways. After the inference of the target detection network is completed, the finished target detection result is obtained.
It can be appreciated that when reasoning is performed on the target detection network, the feature map to be detected and the weight data of the corresponding preset weight need to be scheduled.
Step S105: and determining the position range and the category of each target in the image to be detected according to the target detection result, and displaying the position range and the category on the visualization equipment for the user to view.
Specifically, the target detection network can output the position and category of each target in the image; according to the target detection result, each target is framed on the image and its category information is displayed near the frame, after which the image is sent to the visualization device, where the framed targets and categories are displayed for the user to view. It can be appreciated that during automatic driving, the on-board system of the automobile can decide its next action according to the target detection result.
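A minimal visualization sketch, assuming detections arrive as (x1, y1, x2, y2, label) tuples derived from the target detection result and that OpenCV drives the display; both assumptions go beyond what the patent specifies.

```python
import cv2
import numpy as np

def draw_detections(image: np.ndarray, detections) -> np.ndarray:
    """Frame each detected target and print its category near the box."""
    for x1, y1, x2, y2, label in detections:
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, label, (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return image  # send this frame on to the visualization device
```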
In implementation, a target detection network is selected in advance, its trained weight data is stored in the PS-end DDR, and the convolution layers of the target detection network that are to be computed by the convolution layer PE array provided by the embodiment of the invention are determined. During actual target detection, the target detection network performs inference on the received image to be detected in real time to obtain the target detection result.
Compared with the prior art, the real-time target detection method provided by the embodiment of the invention can carry out convolution operation on the convolution layer of the target detection network in the prior art according to the convolution layer PE array, fully utilizes the number of available DSPs at the PL end, and improves the detection speed of the target detection network. Meanwhile, the embodiment of the invention participates in a plurality of convolution layer operations of the target detection network according to the convolution layer PE arrays determined by the target detection network, so that the target detection network can be maintained more conveniently, the portability is stronger, and the iteration upgrading of a target detection algorithm is facilitated.
Further, the performing convolution operation on the plurality of convolution layers in the target detection network by using the convolution layer PE array includes:
for each of the plurality of convolution layers, dividing the input feature map H1×W1×P1 of the current convolution layer along the column direction into W1/(L×S) regions according to the execution column parallelism L and the execution step S of the convolution kernel of the current convolution layer, wherein each region includes an input feature map of size H1×(L×S)×P1;
and sequentially carrying out the following steps on each region according to the direction from the first column to the last column of the input feature diagram of the current convolution layer:
dividing the input feature map H1×(L×S)×P1 of the current region along the row direction into H1/S sub-regions according to the number of rows X of the convolution kernel of the current convolution layer and the execution step S of the convolution kernel of the current convolution layer, wherein each sub-region includes an input feature map of size X×(L×S)×P1; the convolution kernel size of the current convolution layer is X×Y×P1, where X is the number of rows, Y is the number of columns, and P1 is the depth;
and carrying out convolution operation on each sub-region in sequence according to the directions from the first row to the last row of the input feature diagram of the current convolution layer.
Specifically, as shown in fig. 3, the convolution layer PE array C×K×L is 4×16×13 (execution depth C=4, K=16 execution convolution kernels, execution column parallelism L=13), the input feature map of the current convolution layer is 416×416×8, the output feature map is 416×416×32, the number of convolution kernels of the current convolution layer is 32, the convolution kernel size is 3×3×8, and the execution step of the convolution kernel of the current convolution layer is 1.
According to the execution column parallelism of 13 and the execution step 1 of the convolution kernel of the current convolution layer, the input feature map 416×416×8 of the current convolution layer is divided into 32 regions, and the size of the input feature map included in each region is 416×13×8.
For each of the 32 regions, dividing the input feature map 416×13×8 of the current region into 416 sub-regions according to 3 rows of the current convolution kernel and the execution step length 1 of the convolution kernel of the current convolution layer, where each sub-region includes an input feature map size of 3×13×8.
And carrying out convolution operation on each sub-region in sequence according to the directions from the first row to the last row of the input feature diagram of the current convolution layer.
Preferably, the performing convolution operation on each sub-area sequentially according to the direction from the first row to the last row of the input feature diagram of the current convolution layer includes:
dividing the current sub-region X×(L×S)×P1 into P1/C spaces according to the execution depth C of the convolution layer PE array, wherein the size of the input feature map included in each space is X×(L×S)×C;
and sequentially carrying out the following processing on each space according to the direction from the first layer depth to the last layer depth of the input feature map of the current convolution layer:
determining the number M/K of kernel execution times according to the number M of the convolution kernels of the current convolution layer and the number K of the execution convolution kernels of the PE array of the convolution layer;
and scheduling K convolution kernels of the current convolution layer and the weight data of the corresponding preset weights each time to perform a convolution operation on the input feature map X×(L×S)×C included in the current space, performing M/K kernel executions until all M convolution kernels of the current convolution layer have been executed.
Specifically, in each sub-region, the size of an input feature map of the sub-region is 3×13×8, the current sub-region 3×13×8 is divided into 2 spaces according to the execution depth 4 of the convolutional layer PE array, the size of an input feature map included in each space is 3×13×4, and the 2 spaces are sequentially processed according to the direction from the first layer depth to the last layer depth of the input feature map of the current convolutional layer.
Since the number of convolution kernels of the current convolution layer is 32 and the number of execution convolution kernels of the convolution layer PE array is 16, the number of kernel executions is 2. Only 16 convolution kernels of the current convolution layer and the weight data of the corresponding preset weights can be scheduled at a time to perform the convolution operation on the 3×13×4 input of the current space; executing this twice completes all 32 convolution kernels of the current convolution layer.
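The loop nest implied by the region / sub-region / space / kernel-group decomposition can be sketched as follows; the function is an illustrative model of the schedule (padding in the row direction is assumed, so that the sub-region count is H1/S as in the example above):

```python
def schedule_conv_layer(h1, w1, p1, m, s, C, K, L):
    """Yield the PE-array invocations for one convolution layer:
    W1/(L*S) column regions, H1/S row sub-regions, P1/C depth spaces
    and M/K kernel groups per space; each invocation processes one
    X*(L*S)*C input tile against K convolution kernels."""
    for col_region in range(w1 // (L * s)):          # first to last column
        for sub_region in range(h1 // s):            # first to last row
            for space in range(p1 // C):             # first to last depth
                for kernel_group in range(m // K):   # K kernels at a time
                    yield (col_region, sub_region, space, kernel_group)

# Example from fig. 3: 416x416x8 input, 32 kernels, stride 1, C=4, K=16,
# L=13 -> 32 regions x 416 sub-regions x 2 spaces x 2 kernel groups.
tiles = list(schedule_conv_layer(416, 416, 8, 32, 1, C=4, K=16, L=13))
assert len(tiles) == 32 * 416 * 2 * 2
```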
Preferably, the scheduling of K convolution kernels of the current convolution layer and the weight data of the corresponding preset weights each time to perform a convolution operation on the input feature map X×(L×S)×C included in the current space includes:
sequentially performing X×Y cycles of convolution operations on the corresponding input feature map of the current space, in the direction from the first row to the X-th row and from the first column to the Y-th column of the convolution kernel of the current convolution layer;
and in each cycle, according to the execution step of the convolution kernels, selecting the weight data of the preset weights at one row, one column and C depths of each of the K convolution kernels, multiplying them with the feature map data of one row, L columns and C depths of the input feature map included in the corresponding current space, and then accumulating, wherein each convolution kernel obtains L intermediate results.
Specifically, as shown in fig. 3, 9 cycles of convolution operations are performed sequentially on the corresponding input feature map of the current space, in the direction from the first row to the 3rd row and from the first column to the 3rd column of the convolution kernel of the current convolution layer.
In each cycle, the weight data of the preset weights at one row, one column and 4 depths of the 16 convolution kernels are selected, multiplied with the feature map data of one row, 13 columns and 4 depths of the current space 3×13×4 and accumulated, and each convolution kernel obtains 13 intermediate results.
Exemplary:
in the 1 st period, the weight data of the first row, the first column and the 4 depth of the current 16 convolution kernels are selected to be multiplied with the feature map data of the first row, the 13 column and the 4 depth of the current space, 13×16×4 operations are needed to be calculated in the process, and then 13 intermediate results of the 16 convolution kernels are obtained by adding the corresponding convolution kernels.
In the 2 nd period, the weight data of the second row, the first column and the 4 depth of the current 16 convolution kernels are selected to be multiplied with the feature map data of the second row, the 13 column and the 4 depth of the current space, 13×16×4 operations are needed to be calculated in the process, and then 13 intermediate results of the 16 convolution kernels are obtained by adding the corresponding convolution kernels.
In the 3 rd period, the weight data of the third row, the first column and the 4 depth of the current 16 convolution kernels are selected to be multiplied with the feature map data of the third row, the 13 column and the 4 depth of the current space, 13×16×4 operations are needed to be calculated in the process, and then 13 intermediate results of the 16 convolution kernels are obtained by adding the corresponding convolution kernels.
In the 4 th period, the weight data of the first row, the second column and the 4 depth of the current 16 convolution kernels are selected to be multiplied with the feature map data of the first row, the 13 column and the 4 depth of the current space, 13×16×4 operations are needed to be calculated in the process, and then 13 intermediate results of the 16 convolution kernels are obtained by adding the corresponding convolution kernels.
In the 5 th period, the weight data of the second row, the second column and the 4 depth of the current 16 convolution kernels are selected to be multiplied with the feature map data of the first row, the 13 column and the 4 depth of the current space, 13×16×4 operations are needed to be calculated in the process, and then 13 intermediate results of the 16 convolution kernels are obtained by adding the corresponding convolution kernels.
In the 6 th period, the weight data of the third row, the second column and the 4 depth of the current 16 convolution kernels are selected to be multiplied with the feature map data of the first row, the 13 column and the 4 depth of the current space, 13×16×4 operations are needed to be calculated in the process, and then 13 intermediate results of the 16 convolution kernels are obtained by adding the corresponding convolution kernels.
In the 7 th period, the weight data of the first row, the third column and the 4 depth of the current 16 convolution kernels are selected to be multiplied with the feature map data of the first row, the 13 column and the 4 depth of the current space, 13×16×4 operations are needed to be calculated in the process, and then 13 intermediate results of the 16 convolution kernels are obtained by adding the corresponding convolution kernels.
In the 8 th period, the weight data of the second row, the third column and the 4 depth of the current 16 convolution kernels are selected to be multiplied with the feature map data of the first row, the 13 column and the 4 depth of the current space, 13×16×4 operations are needed to be calculated in the process, and then 13 intermediate results of the 16 convolution kernels are obtained by adding the corresponding convolution kernels.
In the 9 th period, the weight data of the third row, the third column and the 4 depth of the current 16 convolution kernels are selected to be multiplied with the feature map data of the first row, the 13 column and the 4 depth of the current space, 13×16×4 operations are needed to be calculated in the process, and then 13 intermediate results of the 16 convolution kernels are obtained by adding the corresponding convolution kernels.
It should be noted that the 13 columns used in cycles 1-3 are columns 1-13, the 13 columns used in cycles 4-6 are columns 2-14, and the 13 columns used in cycles 7-9 are columns 3-15.
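The nine cycles above can be modeled compactly; the sketch below assumes the tile already contains the extra (Y−1)·S columns the sliding window reaches, and that accumulation is done in 32-bit integers (the patent does not state the accumulator width).

```python
import numpy as np

def pe_array_cycles(tile, kernels, s=1, X=3, Y=3, C=4, K=16, L=13):
    """X*Y-cycle convolution of one input tile against K kernels.

    tile:    (X, L + (Y-1)*s, C) int8 feature data of the current space
    kernels: (K, X, Y, C) int8 weight data
    Each cycle multiplies one row/column/C-depth weight slice of every
    kernel with one row, L columns and C depths of the tile, i.e.
    L*K*C = 13*16*4 = 832 multiply-adds per cycle in this example.
    """
    assert tile.shape == (X, L + (Y - 1) * s, C)
    assert kernels.shape == (K, X, Y, C)
    partial = np.zeros((K, L), dtype=np.int32)  # L intermediate results per kernel
    for y in range(Y):            # first to last kernel column
        for x in range(X):        # first to last kernel row
            w = kernels[:, x, y, :].astype(np.int32)               # (K, C)
            cols = tile[x, y * s : y * s + L, :].astype(np.int32)  # (L, C)
            partial += w @ cols.T  # accumulate the (K, L) products
    return partial

tile = np.random.randint(-128, 128, size=(3, 15, 4), dtype=np.int8)
kernels = np.random.randint(-128, 128, size=(16, 3, 3, 4), dtype=np.int8)
out = pe_array_cycles(tile, kernels)   # (16, 13) intermediate results
```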
Compared with the prior art, the real-time target detection method provided by the embodiment of the invention determines the convolution layer PE arrays with different sizes according to the number of available DSPs at the PL end, the number of convolution kernels of a plurality of convolution layers included in the target detection network and the input characteristic diagram, and can improve the applicability of the target detection network on different hardware.
In one embodiment of the present invention, a real-time object detection system is disclosed, as shown in FIG. 4.
The image data receiving module 401 is configured to receive an image to be detected, obtain a feature image to be detected in a preset format after preprocessing, and store the feature image to be detected in the PS end DDR;
the weight data loading module 402 is configured to load a preset weight pre-stored in the DDR at the PS end into the DDR at the PL end, where the preset weight is weight data pre-trained according to the target detection network;
the PE array determining module 403 is configured to determine a size of a PE array of a convolutional layer according to a number of available DSPs at the PL end, a number of convolutional kernels of a plurality of convolutional layers included in the target detection network, and an input feature map of the plurality of convolutional layers; the size of the PE array of the convolution layer is C, K and L, C is the execution depth, K is the number of the convolution kernels to be executed, and L is the parallelism of the execution lines;
the PE array calculation module 404 is configured to perform convolution operation on a plurality of convolution layers in the target detection network by using the convolution layer PE array, so as to obtain a target detection result after the target detection network operation is completed; in the convolution operation process, the feature map to be detected and the preset weight are scheduled to the PL end to participate in the operation;
and the detection result display module 405 is configured to determine a position range and a category of each target in the image to be detected according to the target detection result, and display the position range and the category on the visualization device for the user to view.
Further, the PE array determining module 403 includes:
the depth and column parallelism determining module is used for determining a plurality of execution depths and a plurality of execution column parallelisms according to the sizes of the input feature maps of the plurality of convolution layers included in the target detection network;
the convolution kernel number determining module is used for determining a plurality of execution convolution kernel numbers according to the numbers of convolution kernels of the plurality of convolution layers included in the target detection network;
and the PE array determining submodule is used for determining the size of the convolution layer PE array according to the number of available DSPs at the PL end, the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers.
Those skilled in the art will appreciate that all or part of the processes implementing the methods of the above embodiments may be implemented by a computer program to instruct related hardware, and the program may be stored in a computer readable storage medium. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims (10)

1. A method for real-time target detection, the method comprising:
receiving an image to be detected, preprocessing it to obtain a feature map to be detected in a preset format, and storing the feature map to be detected into the DDR (Double Data Rate memory) of the PS (Processing System) end;
loading a preset weight pre-stored in the DDR of the PS end into the DDR of the PL end, wherein the preset weight is weight data pre-trained according to a target detection network;
determining the size of a convolution layer PE array according to the number of available DSPs at the PL end, the number of convolution kernels of a plurality of convolution layers included in the target detection network and the input feature maps of the plurality of convolution layers; the size of the convolution layer PE array is C×K×L, where C is the execution depth, K is the number of execution convolution kernels, and L is the execution column parallelism;
performing convolution operation on a plurality of convolution layers in a target detection network by using a convolution layer PE array to obtain a target detection result after the target detection network operation is completed; in the convolution operation process, the feature map to be detected and the preset weight are scheduled to the PL end to participate in the operation;
and determining the position range and the category of each target in the image to be detected according to the target detection result, and displaying the position range and the category on the visualization equipment for the user to view.
2. The method for real-time object detection according to claim 1, wherein determining the size of the convolutional layer PE array according to the number of available DSPs at the PL end, the number of convolutional kernels of a plurality of convolutional layers included in the object detection network, and the input feature map of the plurality of convolutional layers comprises:
determining a plurality of execution depths and a plurality of execution column parallelisms according to the sizes of the input feature maps of the plurality of convolution layers included in the target detection network;
determining a plurality of execution convolution kernel numbers according to the numbers of convolution kernels of the plurality of convolution layers included in the target detection network;
and determining the size of the convolution layer PE array according to the number of available DSPs at the PL end, the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers.
3. The method of claim 2, wherein determining the size of the convolution layer PE array based on the number of available DSPs at the PL end, the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers comprises:
selecting one execution depth, one execution column parallelism and one execution convolution kernel number from the plurality of execution depths, the plurality of execution column parallelisms and the plurality of execution convolution kernel numbers respectively, such that the selected execution depth, execution column parallelism and execution convolution kernel number together with the number of available DSPs satisfy the following condition:
C×K×L ≤ 2×N;
wherein C×K×L represents the size of the convolution layer PE array, N represents the number of available DSPs at the PL end, C represents the execution depth, K represents the number of execution convolution kernels, and L represents the execution column parallelism.
4. The method for real-time object detection according to claim 1, wherein the performing convolution operation on the plurality of convolution layers in the object detection network by using the convolution layer PE array includes:
for each of the plurality of convolution layers, dividing the input feature map H1×W1×P1 of the current convolution layer along the column direction into W1/(L×S) regions according to the execution column parallelism L and the execution step S of the convolution kernel of the current convolution layer, wherein each region includes an input feature map of size H1×(L×S)×P1;
and sequentially carrying out the following steps on each region according to the direction from the first column to the last column of the input feature diagram of the current convolution layer:
dividing the input feature map H1×(L×S)×P1 of the current region along the row direction into H1/S sub-regions according to the number of rows X of the convolution kernel of the current convolution layer and the execution step S of the convolution kernel of the current convolution layer, wherein each sub-region includes an input feature map of size X×(L×S)×P1; the convolution kernel size of the current convolution layer is X×Y×P1, where X is the number of rows, Y is the number of columns, and P1 is the depth;
and carrying out convolution operation on each sub-region in sequence according to the directions from the first row to the last row of the input feature diagram of the current convolution layer.
5. The method for real-time object detection according to claim 4, wherein the step of sequentially performing convolution operation on each sub-region according to the direction from the first row to the last row of the input feature map of the current convolution layer comprises:
dividing the current sub-region X×(L×S)×P1 into P1/C spaces according to the execution depth C of the convolution layer PE array, wherein the size of the input feature map included in each space is X×(L×S)×C;
and sequentially carrying out the following processing on each space according to the direction from the first layer depth to the last layer depth of the input feature map of the current convolution layer:
determining the number M/K of kernel execution times according to the number M of the convolution kernels of the current convolution layer and the number K of the execution convolution kernels of the PE array of the convolution layer;
and scheduling K convolution kernels of the current convolution layer and the weight data of the corresponding preset weights each time to perform a convolution operation on the input feature map X×(L×S)×C included in the current space, performing M/K kernel executions until all M convolution kernels of the current convolution layer have been executed.
6. The method for real-time target detection according to claim 5, wherein the scheduling of K convolution kernels of the current convolution layer and the weight data of the corresponding preset weights each time to perform a convolution operation on the input feature map X×(L×S)×C included in the current space comprises:
sequentially performing X×Y cycles of convolution operations on the corresponding input feature map of the current space, in the direction from the first row to the X-th row and from the first column to the Y-th column of the convolution kernel of the current convolution layer;
and in each cycle, according to the execution step S of the convolution kernels, selecting the weight data of the preset weights at one row, one column and C depths of each of the K convolution kernels, multiplying them with the feature map data of one row, L columns and C depths of the input feature map included in the corresponding current space, and then accumulating, wherein each convolution kernel obtains L intermediate results.
7. The real-time object detection method according to any one of claims 1 to 6, wherein a first image buffer space and a second image buffer space are preset in the DDR of the PS end; the first image buffer space is used for storing a plurality of feature maps to be detected; and each feature map to be detected is subjected to 8-bit fixed-point quantization, reordered in the depth, row and column directions, and mapped and stored to the second image buffer space.
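A hedged sketch of the quantize-and-reorder step of claim 7; the Q1.7 fixed-point format and the exact nesting of the depth/row/column reorder (depth varying fastest here) are assumptions, since the claim fixes neither:

    import numpy as np

    def quantize_and_reorder(fmap, frac_bits=7):
        # 8-bit fixed-point quantization of an H x W x P float feature
        # map, then serialization with depth varying fastest, then row,
        # then column -- ready to map into the second image buffer space.
        q = np.clip(np.round(fmap * (1 << frac_bits)), -128, 127).astype(np.int8)
        return q.transpose(1, 0, 2).ravel()   # col outer, row middle, depth inner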
8. The method for real-time object detection according to any one of claims 1 to 6, wherein the preset weights are obtained by the following training process:
training the target detection network with the public KITTI data set to obtain trained weight data;
performing 8-bit fixed-point quantization on the trained weight data, and storing the fixed-point quantized weight data in an SD card in the order of depth, row and column;
and loading the weight data in the SD card into the DDR of the PS end and storing it as the preset weights.
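The weight-side counterpart, sketched under the same assumptions; the serialization below reads "the order of depth, row and column" as depth varying fastest, and the Q-format and file layout are illustrative:

    import numpy as np

    def pack_weights(kernels, frac_bits=7):
        # kernels: list of M trained float arrays, each X x Y x P1.
        # Quantize to 8-bit fixed point, then serialize each kernel with
        # depth fastest, then row, then column, and concatenate into the
        # byte stream that would be written to the SD card.
        out = []
        for w in kernels:
            q = np.clip(np.round(w * (1 << frac_bits)), -128, 127).astype(np.int8)
            out.append(q.transpose(1, 0, 2).ravel())
        return np.concatenate(out)

    # e.g. open("weights.bin", "wb").write(pack_weights(trained_kernels).tobytes())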
9. A real-time object detection system, the system comprising:
an image data receiving module, configured to receive an image to be detected, preprocess it to obtain a feature map to be detected in a preset format, and store the feature map to be detected in the DDR of the PS end;
a weight data loading module, configured to load preset weights pre-stored in the DDR of the PS end into the DDR of the PL end, wherein the preset weights are weight data pre-trained for the target detection network;
a PE array determining module, configured to determine the size of the convolution layer PE array according to the number of available DSPs at the PL end, the number of convolution kernels of the plurality of convolution layers included in the target detection network, and the input feature maps of the plurality of convolution layers; the size of the convolution layer PE array is C × K × L, where C is the execution depth, K is the number of execution convolution kernels, and L is the execution column parallelism;
a PE array calculation module, configured to perform the convolution operation on the plurality of convolution layers in the target detection network by using the convolution layer PE array, and obtain a target detection result after the target detection network operation is completed; during the convolution operation, the feature map to be detected and the preset weights are scheduled to the PL end to participate in the operation;
and a detection result display module, configured to determine the position range and category of each target in the image to be detected according to the target detection result, and display them on the visualization device for the user to view.
10. The real-time object detection system according to claim 9, wherein the PE array determining module comprises:
a depth and column parallelism determining module, configured to determine a plurality of execution depths and a plurality of execution column parallelisms according to the sizes of the input feature maps of the plurality of convolution layers included in the target detection network;
a convolution kernel number determining module, configured to determine a plurality of execution convolution kernel numbers according to the number of convolution kernels of the plurality of convolution layers included in the target detection network;
and a PE array determining submodule, configured to determine the size of the convolution layer PE array according to the number of available DSPs at the PL end, the plurality of execution depths, the plurality of execution column parallelisms, and the plurality of execution convolution kernel numbers.
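Claim 10 names the inputs to the PE-array sizing decision but not the selection rule; one plausible rule is an exhaustive search that maximizes C×K×L under the DSP budget, assuming one DSP per multiply-accumulate. Everything about this sketch is an assumption:

    def choose_pe_array(dsp_available, depth_options, col_par_options, kernel_counts):
        # Pick (C, K, L) maximizing PE utilization subject to
        # C * K * L <= dsp_available, one DSP per MAC assumed.
        best, best_size = None, 0
        for C in depth_options:          # candidate execution depths
            for L in col_par_options:    # candidate column parallelisms
                for K in kernel_counts:  # candidate kernel batch sizes
                    size = C * K * L
                    if best_size < size <= dsp_available:
                        best, best_size = (C, K, L), size
        return best

    # e.g. choose_pe_array(2520, [4, 8, 16], [7, 14, 26], [8, 16, 32])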
CN202211164904.9A 2022-09-23 2022-09-23 Real-time target detection method and system Pending CN117789157A (en)

Priority Applications (1)

Application Number    Priority Date    Filing Date    Title
CN202211164904.9A     2022-09-23       2022-09-23     Real-time target detection method and system

Publications (1)

Publication Number    Publication Date
CN117789157A          2024-03-29

Family

ID=90395065

Country Status (1)

Country    Link
CN (1)     CN117789157A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination