CN113807361A - Neural network, target detection method, neural network training method and related products - Google Patents

Neural network, target detection method, neural network training method and related products

Info

Publication number
CN113807361A
Authority
CN
China
Prior art keywords
feature
target
candidate
reference vectors
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110920174.XA
Other languages
Chinese (zh)
Other versions
CN113807361B (en)
Inventor
曹锡鹏
袁鹏
冯柏岚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110920174.XA priority Critical patent/CN113807361B/en
Publication of CN113807361A publication Critical patent/CN113807361A/en
Application granted granted Critical
Publication of CN113807361B publication Critical patent/CN113807361B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a neural network for target detection, a target detection method, a neural network training method and related products. The neural network for target detection comprises an encoding network and a decoding network, and the decoding network comprises a first decoding layer. The encoding network is used for extracting features from an image to be detected to obtain a plurality of first feature maps. The first decoding layer is used for performing attention processing on a target first feature map and a plurality of target reference vectors to obtain a plurality of first reference vectors; performing target detection on the plurality of first reference vectors to obtain a plurality of first candidate boxes and a plurality of first categories; processing the feature map corresponding to each first candidate box and each first reference vector to obtain a plurality of second reference vectors; and obtaining a plurality of second candidate boxes based on the plurality of second reference vectors and the plurality of first candidate boxes. The embodiment of the application helps to improve target detection accuracy.

Description

Neural network, target detection method, neural network training method and related products
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a neural network, a target detection method, a neural network training method and a related product.
Background
Object detection is one of the most fundamental tasks in the field of computer vision; its goal is to enable a model to accurately predict the positions of objects in an image and their corresponding categories. With the development of deep learning, deep neural networks have been successfully applied to object detection, giving rise to well-known detection networks such as Faster RCNN, YOLO and RetinaNet. Traditional object detection relies on complex post-processing to output robust prediction boxes, for example: eliminating redundant candidate boxes with the non-maximum suppression (NMS) algorithm; and, in classical algorithms such as Faster RCNN, using anchors to assist in predicting candidate boxes. However, when these algorithms are deployed in industry, such manual processing steps require rich experience or repeated attempts to obtain a good detection result, which imposes a large debugging cost on engineers; for different training data, multiple tests are often needed to determine the NMS threshold and the number and ratios of anchors.
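For reference, the greedy NMS post-processing mentioned above can be sketched as follows (a minimal illustration; the IoU threshold is exactly the kind of quantity that must be tuned per dataset):

```python
import numpy as np

def iou(box, boxes):
    # box: [x1, y1, x2, y2]; boxes: (N, 4). Returns IoU of `box` against each row.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Greedily keep the highest-scoring box, then drop all boxes that overlap it
    # by more than the threshold; repeat on the survivors.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```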
To address these problems, Carion et al. proposed the end-to-end detection transformer (DETR), which is structurally simple, requires no post-processing, and has therefore attracted wide attention from researchers. However, this end-to-end detector only utilizes the global information of the image, which results in low detection accuracy.
Therefore, how to improve the accuracy of end-to-end target detection is a problem that urgently needs to be solved.
Disclosure of Invention
The application provides a neural network, a target detection method, a neural network training method and related products, which improve target detection accuracy by fusing global information with multi-scale local information.
In a first aspect, an embodiment of the present application provides a neural network for target detection, where the neural network includes an encoding network and a decoding network, and the decoding network includes a first decoding layer. The encoding network is used for extracting features from an image to be detected to obtain a plurality of first feature maps, where the first feature maps have different resolutions. The first decoding layer is used for performing attention processing on a target first feature map and a plurality of target reference vectors to obtain a plurality of first reference vectors, where the target first feature map is any first feature map whose resolution is smaller than a first threshold among the plurality of first feature maps, and the plurality of target reference vectors are network parameters of the neural network; performing target detection on the plurality of first reference vectors to obtain a plurality of first candidate boxes and a plurality of first categories; processing the feature map corresponding to each first candidate box and each first reference vector to obtain a plurality of second reference vectors, where the feature map corresponding to each first candidate box represents the features framed by that first candidate box in at least one of the plurality of first feature maps, and the plurality of first candidate boxes correspond one-to-one to the plurality of first reference vectors; and obtaining a plurality of second candidate boxes based on the plurality of second reference vectors and the plurality of first candidate boxes.
It should be noted that when the decoding network contains only one decoding layer, i.e. only the first decoding layer, the first decoding layer is in essence the decoding network.
It can be seen that the neural network of the present application performs attention processing on the plurality of target reference vectors and the target first feature map, so that each first reference vector contains global information of the image to be detected. The plurality of first candidate boxes and first categories are therefore obtained by performing target detection based on the global information of the image, and because global information can perceive the region of the target in the image, the obtained first candidate boxes have relatively high accuracy. Further, the feature map corresponding to each first candidate box is the local feature framed by that candidate box on the feature map, so the finally obtained second reference vectors contain both global and local information of the image to be detected; since local information facilitates precise localization of the target, performing box regression with the second reference vectors yields second candidate boxes of higher accuracy, improving target detection accuracy. Furthermore, the box regression using the second reference vectors is performed relative to the first candidate boxes, which amounts to refining approximate box positions rather than predicting box positions from scratch, further improving the accuracy of the second candidate boxes and of target detection. In addition, the feature map of each first candidate box may be obtained by cropping local features from some or all of the first feature maps, so that it contains multi-scale local information, further improving target detection accuracy.
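As a concrete illustration of this dataflow, a PyTorch-style sketch follows. The module sizes, the number of heads, and the use of ROI align for the per-box feature maps are our assumptions for illustration; the patent text does not fix these operators:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class FirstDecodingLayer(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_classes=80):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes)   # predicts the first categories
        self.box_head = nn.Linear(dim, 4)             # predicts the first candidate boxes
        self.fuse = nn.Linear(2 * dim, dim)           # fuses per-box features into the vectors
        self.refine = nn.Linear(dim, 4)               # offsets regressed relative to the first boxes

    def forward(self, feat_map, target_ref_vectors):
        # feat_map: (b, c, h, w) target first feature map; target_ref_vectors: (b, n, c).
        b, c, h, w = feat_map.shape
        mem = feat_map.flatten(2).permute(0, 2, 1)                     # (b, h*w, c)
        first_ref, _ = self.cross_attn(target_ref_vectors, mem, mem)   # attention processing
        first_boxes = self.box_head(first_ref).sigmoid()               # (b, n, 4), normalized x1y1x2y2
        first_cats = self.cls_head(first_ref)
        # Feature map corresponding to each first candidate box: crop local features
        # with ROI align (a single feature level is shown for brevity).
        scale = torch.tensor([w, h, w, h], dtype=first_boxes.dtype, device=first_boxes.device)
        box_feats = roi_align(feat_map, [bx * scale for bx in first_boxes], output_size=7)
        box_feats = box_feats.mean(dim=(2, 3)).view(b, -1, c)          # one pooled vector per box
        second_ref = self.fuse(torch.cat([first_ref, box_feats], dim=-1))
        second_boxes = first_boxes + self.refine(second_ref)           # refine, not predict from scratch
        return second_boxes, first_cats, second_ref
```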
In some possible embodiments, the decoding network further includes a second decoding layer. The second decoding layer is used for performing attention processing on the target first feature map and the plurality of second reference vectors to obtain a plurality of third reference vectors; obtaining a plurality of third candidate boxes and a plurality of second categories based on the plurality of third reference vectors and the plurality of second candidate boxes; processing the feature map corresponding to each third candidate box and each third reference vector to obtain a plurality of fourth reference vectors, where the feature map corresponding to each third candidate box represents the features framed by that third candidate box in at least one of the plurality of first feature maps, and the plurality of third candidate boxes correspond one-to-one to the plurality of third reference vectors; and obtaining a plurality of fourth candidate boxes based on the plurality of fourth reference vectors and the plurality of third candidate boxes.
It can be seen that in this embodiment the decoding network may include two decoding layers: on the basis of the first decoding layer, the second decoding layer further corrects and adjusts the second candidate boxes obtained by the first decoding layer, so that the second categories are more accurate than the first categories and the fourth candidate boxes are more accurate than the second candidate boxes. The target detection result is then determined using the fourth candidate boxes and second categories obtained by the second decoding layer, improving the accuracy of target detection.
In some possible embodiments, the encoding network includes a backbone network and a feature pyramid network. The backbone network is used for extracting features from the image to be detected to obtain a plurality of second feature maps, where the second feature maps have different resolutions; the feature pyramid network is used for extracting features from the plurality of second feature maps to obtain the plurality of first feature maps.
In some possible embodiments, the encoding network includes a backbone network, a feature pyramid network, and an encoder. The backbone network is used for extracting features from the image to be detected to obtain a plurality of second feature maps, where the second feature maps have different resolutions; the feature pyramid network is used for extracting features from the plurality of second feature maps to obtain a plurality of third feature maps, where the third feature maps have different resolutions; the encoder is used for encoding a plurality of first feature vectors together with a target position code to obtain a plurality of second feature vectors, where the plurality of first feature vectors are obtained by tiling a target third feature map, the target third feature map is any third feature map whose resolution is smaller than a second threshold, and the target position code is a network parameter of the neural network. The plurality of first feature maps include a fourth feature map and the third feature maps other than the target third feature map, where the fourth feature map is obtained by combining the plurality of second feature vectors in the reverse order of the tiling of the target third feature map.
It can be seen that, in the embodiment of the present application, combining the backbone network and the feature pyramid network with an encoder makes it possible to capture the spatial relationships between pixels. The backbone network and the feature pyramid network are built on convolutions, and because a convolutional network extracts local features, its receptive field is limited. Consequently, if two strongly related objects in an image are far apart, their features cannot be associated. In the present application, the feature map output by the feature pyramid network is further fed into the encoder, which associates global context within the image and thus gains the ability to extract feature relationships among all pixels. For example, given a picture of a person riding a horse, features extracted with an encoder are more likely to represent the association between the person and the horse than features extracted with a convolutional network alone. Target detection in the present application can therefore exploit the spatial relationships between pixels, improving detection accuracy.
In a second aspect, an embodiment of the present application provides a target detection method, including: performing feature extraction on an image to be detected to obtain a plurality of first feature maps, where the first feature maps have different resolutions; performing attention processing on a target first feature map and a plurality of target reference vectors to obtain a plurality of first reference vectors, where the target first feature map is any first feature map whose resolution is smaller than a first threshold among the plurality of first feature maps, and each target reference vector is used for representing the features of an object; performing target detection according to the plurality of first reference vectors to obtain a plurality of first candidate boxes and a plurality of first categories; processing the feature map corresponding to each first candidate box and each first reference vector to obtain a plurality of second reference vectors, where the feature map corresponding to each first candidate box represents the features framed by that first candidate box in at least one of the plurality of first feature maps, and the plurality of first candidate boxes correspond one-to-one to the plurality of first reference vectors; obtaining a plurality of second candidate boxes according to the plurality of second reference vectors and the plurality of first candidate boxes; and obtaining a target detection result of the image to be detected based on the plurality of second candidate boxes and the plurality of first categories.
It can be seen that, because attention processing is performed on the plurality of target reference vectors and the target first feature map, each first reference vector contains global information of the image to be detected. The plurality of first candidate boxes and first categories are therefore obtained by performing target detection based on global information, and because global information can perceive the region of the target in the image, the obtained first candidate boxes have relatively high accuracy. Further, the feature map corresponding to each first candidate box is the local feature framed by that candidate box on the feature map, so the finally obtained second reference vectors contain both global and local information of the image; since local information facilitates precise localization, performing box regression with the second reference vectors yields second candidate boxes of higher accuracy, improving detection accuracy. Furthermore, the box regression using the second reference vectors is performed relative to the first candidate boxes, which amounts to refining approximate box positions rather than predicting them from scratch, further improving accuracy. In addition, the feature map of each first candidate box may be obtained by cropping local features from some or all of the first feature maps, so that it contains multi-scale local information, further improving target detection accuracy.
In some possible embodiments, before obtaining the target detection result of the image to be detected based on the plurality of second candidate boxes and the plurality of first categories, the method further includes: performing attention processing on the target first feature map and the plurality of second reference vectors to obtain a plurality of third reference vectors; obtaining a plurality of third candidate boxes and a plurality of second categories according to the plurality of third reference vectors and the plurality of second candidate boxes; processing the feature map corresponding to each third candidate box and each third reference vector to obtain a plurality of fourth reference vectors, where the feature map corresponding to each third candidate box represents the features framed by that third candidate box in at least one of the plurality of first feature maps, and the plurality of third candidate boxes correspond one-to-one to the plurality of third reference vectors; and obtaining a plurality of fourth candidate boxes according to the plurality of fourth reference vectors and the plurality of third candidate boxes. Obtaining the target detection result of the image to be detected based on the plurality of second candidate boxes and the plurality of first categories then includes: obtaining the target detection result of the image to be detected based on the plurality of fourth candidate boxes and the plurality of second categories.
It can be seen that in this embodiment, after the plurality of first categories and the plurality of second candidate boxes are obtained, target detection may be performed again based on the plurality of second candidate boxes; that is, the first detection result is corrected and adjusted to obtain the plurality of second categories and the plurality of fourth candidate boxes.
In some possible embodiments, performing feature extraction on the image to be detected to obtain the plurality of first feature maps includes: inputting the image to be detected into a backbone network for feature extraction to obtain a plurality of second feature maps, where the second feature maps have different resolutions; and inputting the plurality of second feature maps into a feature pyramid network for feature extraction to obtain the plurality of first feature maps.
In some possible embodiments, performing feature extraction on the image to be detected to obtain the plurality of first feature maps includes: inputting the image to be detected into a backbone network for feature extraction to obtain a plurality of second feature maps, where the second feature maps have different resolutions; inputting the plurality of second feature maps into a feature pyramid network for feature extraction to obtain a plurality of third feature maps, where the third feature maps have different resolutions; tiling a target third feature map to obtain a plurality of first feature vectors, where the target third feature map is any third feature map whose resolution is smaller than a second threshold; inputting the plurality of first feature vectors and a target position code into an encoder for encoding to obtain a plurality of second feature vectors, where the target position code is used for representing the spatial relationships among the pixels in the image to be detected; combining the plurality of second feature vectors in the reverse order of the tiling of the target third feature map to obtain a fourth feature map; and taking the fourth feature map together with the third feature maps other than the target third feature map as the plurality of first feature maps.
It can be seen that, in the embodiment of the present application, combining the backbone network and the feature pyramid network with an encoder makes it possible to capture the spatial relationships between pixels. Because a convolutional network extracts local features, its receptive field is limited, so if two strongly related objects in an image are far apart, their features cannot be associated. In the present application, the feature map output by the feature pyramid network is further fed into the encoder, which associates global context within the image and thus gains the ability to extract feature relationships among all pixels; for example, given a picture of a person riding a horse, features extracted with an encoder are more likely to represent the association between the person and the horse than features extracted with a convolutional network alone. Target detection can therefore exploit the spatial relationships between pixels, improving detection accuracy.
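The tile-encode-recombine sequence above can be sketched as follows (a simplified illustration; the transformer encoder, the dimensions, and the learned position code are assumptions, not the patent's specification):

```python
import torch
import torch.nn as nn

def encode_low_res_map(feat_map, encoder, pos_code):
    # feat_map: (b, c, h, w), the third feature map whose resolution is below the threshold.
    b, c, h, w = feat_map.shape
    first_vecs = feat_map.flatten(2).permute(0, 2, 1)    # tiling: h*w vectors of dimension c
    second_vecs = encoder(first_vecs + pos_code)         # position code injects pixel relations
    # Vector combination: rebuild the fourth feature map in the reverse order of tiling.
    return second_vecs.permute(0, 2, 1).reshape(b, c, h, w)

# Usage sketch: the encoded map replaces the lowest-resolution pyramid output.
b, c, h, w = 1, 256, 16, 16
layer = nn.TransformerEncoderLayer(d_model=c, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)
pos_code = torch.zeros(1, h * w, c, requires_grad=True)  # learned target position code
fourth_map = encode_low_res_map(torch.randn(b, c, h, w), encoder, pos_code)
```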
In a third aspect, an embodiment of the present application provides a neural network training method, where the neural network is used for target detection, the neural network includes an encoding network and a decoding network, and the decoding network includes a first decoding layer. The method includes: inputting a training image into the encoding network for feature extraction to obtain a plurality of fifth feature maps, where the fifth feature maps have different resolutions; inputting a target fifth feature map and a plurality of initial reference vectors into the first decoding layer for target detection to obtain a plurality of sixth candidate boxes and a plurality of third categories, where the target fifth feature map is any fifth feature map whose resolution is smaller than a first threshold, and the plurality of initial reference vectors are obtained by initialization. The first decoding layer is used for processing the target fifth feature map and the plurality of initial reference vectors to obtain a plurality of fifth reference vectors; performing target detection on the plurality of fifth reference vectors to obtain a plurality of fifth candidate boxes and the plurality of third categories; processing the feature map corresponding to each fifth candidate box and each fifth reference vector to obtain a plurality of sixth reference vectors, where the feature map corresponding to each fifth candidate box represents the features framed by that fifth candidate box in at least one of the plurality of fifth feature maps; and obtaining the plurality of sixth candidate boxes based on the plurality of sixth reference vectors and the plurality of fifth candidate boxes, where the plurality of fifth candidate boxes correspond one-to-one to the plurality of fifth reference vectors. The neural network is then trained according to the plurality of fifth candidate boxes, the plurality of sixth candidate boxes, the plurality of third categories and the labels of the training image.
It can be seen that, in the embodiment of the present application, when the decoding network performs target detection, the plurality of initial reference vectors and the target fifth feature map are first processed through the attention mechanism to obtain the plurality of fifth reference vectors, so that each fifth reference vector contains global information; because global information can perceive the region of the target in the image, the obtained fifth candidate boxes have relatively high accuracy. Furthermore, the feature map corresponding to each fifth candidate box is the local information framed by that candidate box on the training image, so each sixth reference vector contains both global and local information of the training image, and local information facilitates precise localization of the target. Performing box regression with the plurality of sixth reference vectors thus realizes box regression based on both global and local information of the training image, so the obtained sixth candidate boxes have higher accuracy and the resulting loss is smaller, which speeds up convergence of the neural network. A neural network trained to perform box regression using both global and local information of the training image can subsequently perform target detection based on both global and local information of the image, improving target detection accuracy.
In some possible embodiments, the decoding network further includes a second decoding layer. Before the neural network is trained according to the plurality of fifth candidate boxes, the plurality of sixth candidate boxes, the plurality of third categories and the labels of the training image, the method further includes: inputting the target fifth feature map, the plurality of sixth reference vectors and the plurality of sixth candidate boxes into the second decoding layer for target detection to obtain a plurality of eighth candidate boxes and a plurality of fourth categories. The second decoding layer is used for performing attention processing on the target fifth feature map and the plurality of sixth reference vectors to obtain a plurality of seventh reference vectors; obtaining a plurality of seventh candidate boxes and the plurality of fourth categories based on the plurality of seventh reference vectors and the plurality of sixth candidate boxes; processing the feature map corresponding to each seventh candidate box and each seventh reference vector to obtain a plurality of eighth reference vectors, where the feature map corresponding to each seventh candidate box represents the features framed by that seventh candidate box in at least one of the plurality of fifth feature maps; and obtaining the plurality of eighth candidate boxes based on the plurality of eighth reference vectors and the plurality of seventh candidate boxes. Training the neural network then includes: training the neural network according to the plurality of fifth candidate boxes, the plurality of sixth candidate boxes, the plurality of third categories, the plurality of seventh candidate boxes, the plurality of eighth candidate boxes, the plurality of fourth categories and the labels of the training image.
It can be seen that in this embodiment the decoding network further includes a second decoding layer, through which target detection is performed again: on the basis of the first decoding layer, the detection result obtained by the first decoding layer is corrected and adjusted, yielding fourth categories and eighth candidate boxes of higher accuracy. Training the neural network with the plurality of eighth candidate boxes and the plurality of fourth categories can then speed up training, and the trained neural network can perform multiple iterations of target detection to obtain a more accurate detection result.
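A hedged sketch of how the candidate boxes and categories from each decoding layer might enter the training loss follows (the `matcher`, `box_loss` and `cls_loss` below are hypothetical placeholders; the patent does not specify the matching scheme or the loss terms):

```python
def detection_loss(layer_outputs, targets, matcher, box_loss, cls_loss):
    # layer_outputs: one dict per decoding layer, e.g.
    #   [{'boxes': [fifth_boxes, sixth_boxes], 'logits': third_cat_logits},
    #    {'boxes': [seventh_boxes, eighth_boxes], 'logits': fourth_cat_logits}]
    # so that the intermediate and refined boxes of every layer are all supervised.
    total = 0.0
    for out in layer_outputs:
        match = matcher(out, targets)  # assign predictions to ground-truth targets
        for boxes in out['boxes']:
            total = total + box_loss(boxes, targets, match)
        total = total + cls_loss(out['logits'], targets, match)
    return total
```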
In some possible embodiments, the encoding network includes a backbone network and a feature pyramid network, and inputting the training image into the encoding network for feature extraction to obtain the plurality of fifth feature maps includes: inputting the training image into the backbone network for feature extraction to obtain a plurality of sixth feature maps, where the sixth feature maps have different resolutions; and inputting the plurality of sixth feature maps into the feature pyramid network for feature extraction to obtain the plurality of fifth feature maps.
In some possible embodiments, the encoding network includes a backbone network, a feature pyramid network, and an encoder, and inputting the training image into the encoding network for feature extraction to obtain the plurality of fifth feature maps includes: inputting the training image into the backbone network for feature extraction to obtain a plurality of sixth feature maps, where the sixth feature maps have different resolutions; inputting the plurality of sixth feature maps into the feature pyramid network for feature extraction to obtain a plurality of seventh feature maps, where the seventh feature maps have different resolutions; tiling a target seventh feature map to obtain a plurality of third feature vectors, where the target seventh feature map is any seventh feature map whose resolution is smaller than a second threshold; inputting the plurality of third feature vectors and an initial position code into the encoder for encoding to obtain a plurality of fourth feature vectors, where the initial position code is obtained by initialization; combining the plurality of fourth feature vectors in the reverse order of the tiling of the target seventh feature map to obtain an eighth feature map; and taking the eighth feature map together with the seventh feature maps other than the target seventh feature map as the plurality of fifth feature maps.
It can be seen that, in the embodiment of the present application, the associations between pixels can be captured by combining the feature pyramid network with an encoder. Because a convolutional network extracts local features, its receptive field is limited, so if two strongly related objects in an image are far apart, their features cannot be associated. In the present application, the high-level feature map output by the feature pyramid network is further fed into the encoder, which associates global context within the image and thus gains the ability to extract feature relationships among all pixels; for example, given a picture of a person riding a horse, features extracted with an encoder are more likely to represent the association between the person and the horse than features extracted with a pure convolutional network. Target detection can therefore exploit the associations between pixels, improving detection accuracy and further speeding up convergence of the neural network.
In a fourth aspect, an embodiment of the present application provides a target detection apparatus, including an acquisition unit and a processing unit. The acquisition unit is used for acquiring an image to be detected. The processing unit is used for performing feature extraction on the image to be detected to obtain a plurality of first feature maps, where the first feature maps have different resolutions; performing attention processing on a target first feature map and a plurality of target reference vectors to obtain a plurality of first reference vectors, where the target first feature map is any first feature map whose resolution is smaller than a first threshold among the plurality of first feature maps, and each target reference vector is used for representing the features of an object; performing target detection according to the plurality of first reference vectors to obtain a plurality of first candidate boxes and a plurality of first categories; processing the feature map corresponding to each first candidate box and each first reference vector to obtain a plurality of second reference vectors, where the feature map corresponding to each first candidate box represents the features framed by that first candidate box in at least one of the plurality of first feature maps, and the plurality of first candidate boxes correspond one-to-one to the plurality of first reference vectors; obtaining a plurality of second candidate boxes according to the plurality of second reference vectors and the plurality of first candidate boxes; and obtaining a target detection result of the image to be detected based on the plurality of second candidate boxes and the plurality of first categories.
In some possible embodiments, before obtaining the target detection result of the image to be detected based on the plurality of second candidate boxes and the plurality of first categories, the processing unit is further configured to: perform attention processing on the target first feature map and the plurality of second reference vectors to obtain a plurality of third reference vectors; obtain a plurality of third candidate boxes and a plurality of second categories according to the plurality of third reference vectors and the plurality of second candidate boxes; process the feature map corresponding to each third candidate box and each third reference vector to obtain a plurality of fourth reference vectors, where the feature map corresponding to each third candidate box represents the features framed by that third candidate box in at least one of the plurality of first feature maps, and the plurality of third candidate boxes correspond one-to-one to the plurality of third reference vectors; and obtain a plurality of fourth candidate boxes according to the plurality of fourth reference vectors and the plurality of third candidate boxes. Obtaining the target detection result of the image to be detected based on the plurality of second candidate boxes and the plurality of first categories then includes: obtaining the target detection result of the image to be detected based on the plurality of fourth candidate boxes and the plurality of second categories.
In some possible embodiments, in performing feature extraction on the image to be detected to obtain the plurality of first feature maps, the processing unit is specifically configured to: input the image to be detected into a backbone network for feature extraction to obtain a plurality of second feature maps, where the second feature maps have different resolutions; and input the plurality of second feature maps into a feature pyramid network for feature extraction to obtain the plurality of first feature maps.
In some possible embodiments, in performing feature extraction on the image to be detected to obtain the plurality of first feature maps, the processing unit is specifically configured to: input the image to be detected into a backbone network for feature extraction to obtain a plurality of second feature maps, where the second feature maps have different resolutions; input the plurality of second feature maps into a feature pyramid network for feature extraction to obtain a plurality of third feature maps, where the third feature maps have different resolutions; tile a target third feature map to obtain a plurality of first feature vectors, where the target third feature map is any third feature map whose resolution is smaller than a second threshold; input the plurality of first feature vectors and a target position code into an encoder for encoding to obtain a plurality of second feature vectors, where the target position code is used for representing the spatial relationships among the pixels in the image to be detected; combine the plurality of second feature vectors in the reverse order of the tiling of the target third feature map to obtain a fourth feature map; and take the fourth feature map together with the third feature maps other than the target third feature map as the plurality of first feature maps.
In a fifth aspect, an embodiment of the present application provides a neural network training apparatus, where the neural network is used for target detection and includes an encoding network and a decoding network, and the training apparatus includes an acquisition unit and a processing unit. The acquisition unit is configured to acquire a training image. The processing unit is used for inputting the training image into the encoding network for feature extraction to obtain a plurality of fifth feature maps, where the fifth feature maps have different resolutions; and inputting a target fifth feature map and a plurality of initial reference vectors into the first decoding layer for target detection to obtain a plurality of sixth candidate boxes and a plurality of third categories, where the target fifth feature map is any fifth feature map whose resolution is smaller than a first threshold, and the plurality of initial reference vectors are obtained by initialization. The first decoding layer is used for processing the target fifth feature map and the plurality of initial reference vectors to obtain a plurality of fifth reference vectors; performing target detection on the plurality of fifth reference vectors to obtain a plurality of fifth candidate boxes and the plurality of third categories; processing the feature map corresponding to each fifth candidate box and each fifth reference vector to obtain a plurality of sixth reference vectors, where the feature map corresponding to each fifth candidate box represents the features framed by that fifth candidate box in at least one of the plurality of fifth feature maps; and obtaining the plurality of sixth candidate boxes based on the plurality of sixth reference vectors and the plurality of fifth candidate boxes, where the plurality of fifth candidate boxes correspond one-to-one to the plurality of fifth reference vectors. The processing unit is further used for training the neural network according to the plurality of fifth candidate boxes, the plurality of sixth candidate boxes, the plurality of third categories and the labels of the training image.
In some possible embodiments, the decoding network further includes a second decoding layer. Before training the neural network according to the plurality of fifth candidate boxes, the plurality of sixth candidate boxes, the plurality of third categories and the labels of the training image, the processing unit is further configured to: input the target fifth feature map, the plurality of sixth reference vectors and the plurality of sixth candidate boxes into the second decoding layer for target detection to obtain a plurality of eighth candidate boxes and a plurality of fourth categories. The second decoding layer is used for performing attention processing on the target fifth feature map and the plurality of sixth reference vectors to obtain a plurality of seventh reference vectors; obtaining a plurality of seventh candidate boxes and the plurality of fourth categories based on the plurality of seventh reference vectors and the plurality of sixth candidate boxes; processing the feature map corresponding to each seventh candidate box and each seventh reference vector to obtain a plurality of eighth reference vectors, where the feature map corresponding to each seventh candidate box represents the features framed by that seventh candidate box in at least one of the plurality of fifth feature maps; and obtaining the plurality of eighth candidate boxes based on the plurality of eighth reference vectors and the plurality of seventh candidate boxes. Training the neural network then includes: training the neural network according to the plurality of fifth candidate boxes, the plurality of sixth candidate boxes, the plurality of third categories, the plurality of seventh candidate boxes, the plurality of eighth candidate boxes, the plurality of fourth categories and the labels of the training image.
In some possible embodiments, the encoding network includes a backbone network and a feature pyramid network; in inputting the training image into the encoding network for feature extraction to obtain the plurality of fifth feature maps, the processing unit is specifically configured to: input the training image into the backbone network for feature extraction to obtain a plurality of sixth feature maps, where the sixth feature maps have different resolutions; and input the plurality of sixth feature maps into the feature pyramid network for feature extraction to obtain the plurality of fifth feature maps.
In some possible embodiments, the encoding network includes a backbone network, a feature pyramid network, and an encoder; in inputting the training image into the encoding network for feature extraction to obtain the plurality of fifth feature maps, the processing unit is specifically configured to: input the training image into the backbone network for feature extraction to obtain a plurality of sixth feature maps, where the sixth feature maps have different resolutions; input the plurality of sixth feature maps into the feature pyramid network for feature extraction to obtain a plurality of seventh feature maps, where the seventh feature maps have different resolutions; tile a target seventh feature map to obtain a plurality of third feature vectors, where the target seventh feature map is any seventh feature map whose resolution is smaller than a second threshold; input the plurality of third feature vectors and an initial position code into the encoder for encoding to obtain a plurality of fourth feature vectors, where the initial position code is obtained by initialization; combine the plurality of fourth feature vectors in the reverse order of the tiling of the target seventh feature map to obtain an eighth feature map; and take the eighth feature map together with the seventh feature maps other than the target seventh feature map as the plurality of fifth feature maps.
In a sixth aspect, an embodiment of the present application provides an electronic device, including: a memory for storing a program; and a processor for executing the program stored in the memory, where the processor is configured to implement the method of the second or third aspect when the program stored in the memory is executed.
In a seventh aspect, the present application provides a computer-readable medium storing program code for execution by a device, where the program code includes instructions for implementing the method in the second or third aspect.
In an eighth aspect, the present application provides a computer program product containing instructions, which when run on a computer, causes the computer to implement the method of the second or third aspect.
In a ninth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to implement the method of the second or third aspect.
Optionally, as an implementation, the chip may further include a memory storing instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor implements the method of the second or third aspect.
Drawings
Fig. 1 is a schematic diagram of applying the target detection method to an automatic driving scenario according to an embodiment of the present application;
fig. 2 is a schematic diagram of a system architecture according to an embodiment of the present application;
fig. 3 is a schematic diagram of a chip hardware structure according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a neural network for target detection according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of another neural network for target detection according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a first decoding layer according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of another neural network for target detection according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an i-th decoding layer according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of another neural network for target detection according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of another neural network for target detection according to an embodiment of the present application;
fig. 11 is a schematic flowchart of a target detection method according to an embodiment of the present application;
fig. 12 is a schematic diagram of dynamic interaction according to an embodiment of the present application;
fig. 13 is a schematic diagram of mapping feature vectors to convolution parameters according to an embodiment of the present application;
fig. 14 is a schematic flowchart of a neural network training method according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a target detection apparatus according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of a neural network training apparatus according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of another target detection apparatus according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of another neural network training apparatus according to an embodiment of the present application.
Detailed Description
To facilitate an understanding of the present application, the related art to which the present application relates will first be explained and illustrated.
In the process of target detection by the DETR model, each query feature (object query) needs to interact with the global information of the image to be detected. This process is complex, so the DETR model converges slowly during training. In addition, because only the global information of the image is utilized during detection, the detection accuracy is low.
The Sparse R-CNN model, by contrast, uses only local features for prediction during target detection and lacks global information, so its detection accuracy is also low.
Therefore, how to improve target detection accuracy is a problem that urgently needs to be solved.
To facilitate understanding of the present application, the related terms used in the present application are first introduced.
Attention mechanism: the attention mechanism takes three inputs, a query vector (Query, Q), a key vector (Key, K) and a value vector (Value, V); its core is to calculate the correlation between corresponding elements of Q and K to obtain an attention matrix, and then weight V by the attention matrix. This process can be represented by equation (1):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (1)$$

wherein softmax is a normalization operation and $\sqrt{d_k}$ is a preset hyperparameter (the scaling factor).
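In code, equation (1) corresponds to the following minimal sketch (tensor shapes are illustrative):

```python
import math
import torch

def attention(q, k, v):
    # q, k, v: (..., seq_len, d). The correlation of Q and K forms the attention
    # matrix, which then weights V, as in equation (1).
    d_k = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return attn @ v
```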
Self-attention mechanism: an attention mechanism in which Q, K and V are all derived from the same input data.
Multi-head attention mechanism: attention is computed on Q, K and V over a plurality of subspaces, and the attention results obtained from the subspaces are finally spliced together.
ROI align: cropping, from a feature map, the features framed by a candidate box on that feature map.
For example, suppose the original image resolution is 800 × 800, the resolution of a feature map is 25 × 25, and the detected candidate box is 665 × 665. The candidate box is first mapped onto the feature map: with a stride of 800/25 = 32, its size on the feature map is 665/32 = 20.78, i.e. 20.78 × 20.78. Then, based on a preset pooling resolution, the mapped region is divided for the pooling operation into a number of equally sized small regions. If the pooling resolution is pooled_w = 7 and pooled_h = 7, i.e. a 7 × 7 feature map is produced after pooling, the 20.78 × 20.78 region mapped onto the feature map is divided into 49 equally sized small regions, each of size 20.78/7 = 2.97, i.e. 2.97 × 2.97. Next, each small region is sampled according to a set number of sampling points to obtain its pixel value: if the number of sampling points is 4, each 2.97 × 2.97 small region is divided into four equal parts, the center point of each part is taken, and the pixel value of each center point is computed by bilinear interpolation, giving four pixel values per small region. Finally, the maximum of the four pixel values is taken as the pixel value of that 2.97 × 2.97 small region; the 49 small regions thus yield 49 pixel values, which are combined into a 7 × 7 feature map. This 7 × 7 feature map is the local information corresponding to the candidate box cropped from the feature map.
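An equivalent crop-and-pool operation is available off the shelf, e.g. in torchvision; the sketch below mirrors the numbers of the example above (note that torchvision's roi_align averages the sampling points in each bin rather than taking their maximum, but the cropping geometry is the same):

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 25, 25)         # 25 x 25 map from an 800 x 800 image
boxes = torch.tensor([[0., 0., 0., 665., 665.]])  # (batch_index, x1, y1, x2, y2), image coords
pooled = roi_align(feature_map, boxes,
                   output_size=(7, 7),            # pooled_w = pooled_h = 7
                   spatial_scale=25 / 800,        # maps image coords onto the 25 x 25 map
                   sampling_ratio=2)              # 2 x 2 = four sampling points per bin
print(pooled.shape)                               # torch.Size([1, 256, 7, 7])
```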
Feature map tiling: tiling a feature map into a plurality of one-dimensional sequences, where each one-dimensional sequence can be treated as a vector. For example, the feature map

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$$

is tiled into three one-dimensional sequences, namely $(a_{11}\ a_{12}\ a_{13})$, $(a_{21}\ a_{22}\ a_{23})$ and $(a_{31}\ a_{32}\ a_{33})$. It should be noted that tiling may be done by rows, i.e. each row of the feature map is taken as a one-dimensional sequence so that the number of sequences equals the number of rows, or by columns, i.e. each column is taken as a one-dimensional sequence so that the number of sequences equals the number of columns. The present application is described using row-wise tiling as an example. It should also be noted that if the feature map is three-dimensional, i.e. there are feature maps on a plurality of channels, the two-dimensional feature map on each channel may be tiled to obtain the one-dimensional sequences corresponding to that channel.
Vector combination: combining a plurality of vectors into a feature map in the reverse order of the feature map tiling. For example, given the three vectors $(a_{11}\ a_{12}\ a_{13})$, $(a_{21}\ a_{22}\ a_{23})$ and $(a_{31}\ a_{32}\ a_{33})$ above, combining them in the reverse order of the tiling restores the original feature map.
Vector splicing: splicing a plurality of lower-dimensional vectors into one higher-dimensional vector. For example, splicing the three vectors $(a_{11}\ a_{12}\ a_{13})$, $(a_{21}\ a_{22}\ a_{23})$ and $(a_{31}\ a_{32}\ a_{33})$ yields the spliced vector $(a_{11}\ a_{12}\ a_{13}\ a_{21}\ a_{22}\ a_{23}\ a_{31}\ a_{32}\ a_{33})$.
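These three operations correspond directly to simple array manipulations, as the following sketch illustrates (illustrative values):

```python
import numpy as np

A = np.array([[11, 12, 13],
              [21, 22, 23],
              [31, 32, 33]])

rows = [A[i] for i in range(A.shape[0])]  # feature map tiling (by rows): three 1-D sequences
combined = np.stack(rows, axis=0)         # vector combination: reverse order of the tiling
spliced = np.concatenate(rows)            # vector splicing: (11 12 13 21 22 23 31 32 33)
assert (combined == A).all()
```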
First, application scenarios of the present application are described. The target detection method provided by the embodiments of the application can be applied to scenarios such as automatic driving, image annotation, and action recognition. Several of these scenarios are briefly introduced below.
Automatic driving scene:
as shown in fig. 1, the target detection method of the present application is deployed in a vehicle-mounted terminal or a vehicle-mounted chip of an artificial intelligence driving vehicle, so that the image shot by a vehicle-mounted camera can be detected by the target detection method of the present application, and the positions of objects such as traffic signs, traffic lights, pedestrians, other vehicles, etc. in the image can be identified. And inputting the identified entity as input data into a decision module of the automatic driving system so as to support the automatic driving vehicle to complete automatic driving.
Because the target detection method of the present application has high detection accuracy, applying it to the automatic driving scenario can improve the safety of automatic driving.
Target identification scenario on terminal:
the target detection method is applied to the terminal or a chip of the terminal, when a user shoots a picture, the target detection can be carried out on the picture shot by the terminal camera through the target detection method, and people, trees and other objects in the picture are identified; these objects can then be marked out for viewing by the user. Because the target detection precision of this application is higher, therefore the precision of the object of annotating is higher, and user experience is high.
Target identification scenario on AR device:
the target detection method of the present application is applied to an AR device or a chip of the AR device. When a user wears the AR device, target detection can be performed on the image presented in the AR device through the target detection method of the present application, identifying the objects in the image and marking them for the user to view.
Target recognition scenario on medical device:
when the target detection method of the present application is applied to a medical device, a doctor can use it to perform target detection on a medical image and identify the targets in the image, for example, the position of each organ or of a lesion. The targets are then marked for the doctor to view, assisting the doctor in medical diagnosis and improving diagnostic precision.
Image annotation scene:
network training generally requires labeled training images; however, before a training image can be labeled, the targets in it must first be identified so that they can be annotated manually.
Because the target detection method of the present application has high detection precision, applying it to the image annotation scenario allows high-quality training images to be constructed, thereby improving the precision of network training.
The target detection method of the present application will be described from the network training side and the network application side with reference to the accompanying drawings:
the network training method provided by the embodiments of the present application relates to computer vision processing, and may specifically be applied to data processing methods such as data training, machine learning, and deep learning. It performs symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on training data (such as the training images in the present application) to finally obtain a trained neural network for target detection. In addition, the target detection method provided in the embodiments of the present application may use this trained neural network: input data (such as the image to be detected in the present application) is fed into the trained neural network for target detection, and output data (such as the plurality of second candidate frames in the present application) is obtained. It should be noted that the neural network training method and the target detection method provided in the embodiments of the present application are inventions based on the same concept, and may also be understood as two parts of a system or two stages of an overall process, such as a neural network training stage and a neural network application stage.
Referring to fig. 2, fig. 2 is a schematic diagram of a system architecture provided by an embodiment of the present application. In the system architecture 100, the data acquisition device 160 is configured to acquire training images, where each training image carries labels indicating the true position and true class of each target in the image. The training images are stored in the database 130, and the training device 120 trains the neural network/rule 101 for target detection based on the training images maintained in the database 130.
How the training device 120 obtains the neural network/rule 101 for target detection based on the training data is described in more detail below. The neural network/rule 101 for target detection can be used to implement the target detection method provided by the embodiments of the present application: by inputting the image to be detected into the neural network/rule 101 for target detection, a plurality of second candidate frames and a plurality of first classes can be obtained. The neural network/rule 101 for target detection in the embodiments of the present application includes a backbone network, a feature pyramid encoding network, and a decoding network. It should be noted that, in practical applications, the training images maintained in the database 130 are not necessarily all acquired by the data acquisition device 160; they may also be received from other devices. It should also be noted that the training device 120 does not necessarily train the neural network/rule 101 solely on the training images maintained in the database 130; it may also obtain training images from the cloud or elsewhere. The above description should not be taken as a limitation on the embodiments of the present application.
The neural network/rule 101 for target detection obtained by the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 2. The execution device 110 may be a terminal, such as a mobile phone, a tablet computer, a laptop computer, an AR/VR device, or a vehicle-mounted terminal, or may be a server or a cloud. In fig. 2, the execution device 110 is configured with an I/O interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 140. In this embodiment of the present application, the input data may include: an original image.
The preprocessing module 113 is configured to preprocess the input data (such as the original image) received by the I/O interface 112. In this embodiment of the present application, the preprocessing module 113 may be configured to perform image processing operations such as cropping, scaling, enhancement, and gray-value conversion on the original image to obtain the image to be detected.
In the process of preprocessing the input data by the execution device 110 or performing relevant processing such as calculation by the calculation module 111 of the execution device 110, the execution device 110 may call data, codes, and the like in the data storage system 150 for corresponding processing, and may store data, instructions, and the like obtained by corresponding processing into the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the target detection result obtained as described above, to the client device 140, thereby providing it to the user.
It is worth noting that the training device 120 may generate the corresponding neural network/rule 101 for target detection based on different training data for different targets or different tasks, and the corresponding neural network/rule 101 for target detection may be used to implement the target detection, so as to provide the user with the required result.
In the case shown in fig. 2, the user may manually enter the input data through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112; if the client device 140 is required to obtain the user's authorization before automatically sending the input data, the user may set the corresponding permission in the client device 140. The user can view the result output by the execution device 110 at the client device 140, presented, for example, as display, sound, or action. The client device 140 may also serve as a data collection terminal, collecting the input data of the I/O interface 112 and the output results of the I/O interface 112 as new sample data and storing them in the database 130. Of course, the input data and the output results shown in the figure may also be stored directly into the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the position relationship between the devices, modules, and the like shown in fig. 2 does not constitute any limitation, for example, in fig. 2, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
Referring to fig. 3, fig. 3 is a schematic diagram of the hardware structure of a chip according to an embodiment of the present invention, where the chip includes a neural network processor 50. The chip may be provided in the execution device 110 shown in fig. 2 to complete the computation work of the calculation module 111, or in the training device 120 shown in fig. 2 to complete the training work of the training device 120 and output the neural network/rule 101 for target detection.
The Neural Network Processor (NPU) 50 is mounted as a coprocessor on a host Central Processing Unit (CPU), which allocates tasks to it. The core of the NPU 50 is the arithmetic circuit 503; the controller 504 controls the arithmetic circuit 503 to fetch data from the memories (the weight memory 502 or the input memory 501) and perform operations.
In some implementations, the arithmetic circuit 503 internally includes a plurality of processing units (PEs).
In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 503 fetches the weight matrix B from the weight memory 502 and buffers it on each PE in the arithmetic circuit 503. The arithmetic circuit 503 takes the input matrix a and the weight matrix B from the input memory 501 to perform matrix operation, and stores a partial result or a final result of the obtained matrix in an accumulator (accumulator) 508.
The vector calculation unit 507 may further process the output of the arithmetic circuit 503, for example with vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. The vector calculation unit 507 may be used for the network computation of the non-convolution/non-fully-connected (FC) layers of a neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, vector calculation unit 507 stores the processed output vector to unified memory 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, e.g., a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuitry 503, for example, for use in subsequent layers in a neural network.
For example, in the present application, the arithmetic circuit 503 performs feature extraction on an image to be detected to obtain a plurality of first feature maps; performing target detection based on the plurality of first feature maps and the plurality of target reference vectors to obtain a target detection result;
a unified memory 506 for storing input data and output data, such as the image to be detected and the target detection result of the present application;
a Direct Memory Access Controller (DMAC) 505 is used to transfer input data in the external memory to the input memory 501 and/or the unified memory 506, to store the weight data from the external memory into the weight memory 502, and to store the data in the unified memory 506 into the external memory.
A Bus Interface Unit (BIU) 510 for implementing interaction between the main CPU, the DMAC, and the instruction fetch memory 509 through a Bus;
an instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504;
the controller 504 is configured to call the instruction cached in the instruction fetch memory 509, so as to control the working process of the operation circuit 503.
Generally, the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are all On-Chip memories. The external memory is a memory external to the NPU, and may be a Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), a High Bandwidth Memory (HBM), or another readable and writable memory.
First, the neural network for target detection provided by the present application is constructed based on the transformer architecture, and is hereinafter referred to simply as the neural network. Illustratively, the encoding network in the neural network for target detection of the present application is constructed based on a transformer encoder, and the decoding network of the present application is constructed based on a transformer decoder.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a neural network for target detection according to an embodiment of the present disclosure. As shown in fig. 4, the neural network for target detection includes an encoding network and a decoding network, and the decoding network includes a first decoding layer. It should be noted that, when the decoding network includes only one decoding layer, i.e., only the first decoding layer, the first decoding layer can be understood as the decoding network itself.
The encoding network is used for extracting features of an image to be detected to obtain a plurality of first feature maps, wherein the sizes of the first feature maps are different;
the first decoding layer is used for performing attention processing on a target first feature map and a plurality of target reference vectors to obtain a plurality of first reference vectors, where the target first feature map is any one of the plurality of first feature maps whose resolution is smaller than a first threshold; that is, a first feature map whose resolution is smaller than the first threshold is selected, and attention processing is performed on it together with the plurality of target reference vectors. The plurality of target reference vectors are network parameters of the neural network for target detection, that is, parameters obtained when the neural network for target detection is trained; the training process of the neural network for target detection is described later and is not elaborated here;
the first decoding layer is further configured to perform target detection on the plurality of first reference vectors to obtain a plurality of first candidate frames and a plurality of first categories, that is, the first decoding layer performs regression and category prediction on each first reference vector to obtain the first candidate frame and the first category corresponding to each first reference vector. Therefore, the plurality of first candidate frames, the plurality of first classes and the plurality of first reference vectors are in one-to-one correspondence;
the first decoding layer is further configured to process a feature map corresponding to each of the plurality of first candidate frames and each of the plurality of first reference vectors to obtain a plurality of second reference vectors, where the feature map corresponding to each of the plurality of first candidate frames represents features of the first candidate frame framed in at least one first feature map of the plurality of first feature maps;
the first decoding layer is further configured to obtain a plurality of second candidate frames based on the plurality of second reference vectors and the plurality of first candidate frames, that is, predict an offset value of the first candidate frame corresponding to the second reference vector through the first decoding layer and each second reference vector, and offset the first candidate frame based on the offset value to obtain the second candidate frame corresponding to the second reference vector.
In one embodiment of the present application, as shown in fig. 5, the decoding network further comprises a second decoding layer; similar to the function of the first decoding layer, the second decoding layer is used for performing attention processing on the target first feature map and the plurality of second reference vectors to obtain a plurality of third reference vectors; obtaining a plurality of third candidate frames and a plurality of second classes based on the plurality of third reference vectors and the plurality of second candidate frames; processing a feature map corresponding to each third candidate frame in the plurality of third candidate frames and each third reference vector in the plurality of third reference vectors to obtain a plurality of fourth reference vectors, wherein the feature map corresponding to each third candidate frame represents features of the third candidate frame framed in at least one first feature map in the plurality of first feature maps, and the plurality of third candidate frames correspond to the plurality of third reference vectors one to one; and obtaining a plurality of fourth candidate frames based on the plurality of fourth reference vectors and the plurality of third candidate frames.
In one embodiment of the present application, the first decoding layer and the second decoding layer are both constructed based on a transformer decoder, and their internal structures are similar. For example, the first decoding layer and the second decoding layer each include an attention layer, a dynamic interaction layer (which may also be referred to as a dynamic head), and four feed-forward networks, where the attention layer includes a self-attention layer and a multi-head attention layer.
The functions of the sub-layers in the first decoding layer will be described below by taking the structure of the first decoding layer as an example, and the functions of the sub-layers in the second decoding layer are similar to this and will not be described again.
As shown in fig. 6, for the first decoding layer: the self-attention layer is configured to perform self-attention processing on the plurality of target reference vectors to obtain a plurality of self-attention-processed reference vectors; the multi-head attention layer is configured to perform multi-head attention processing on the plurality of self-attention-processed reference vectors and the target first feature map to obtain a plurality of multi-head-attention-processed reference vectors; the first feedforward network is configured to perform feature enhancement processing on the plurality of reference vectors output by the multi-head attention layer to obtain the plurality of first reference vectors, where the feature enhancement mainly enhances the foreground features in the image to be detected; the second feedforward network is configured to perform target detection on the plurality of first reference vectors to obtain the plurality of first candidate frames and the plurality of first classes; the dynamic interaction layer is configured to perform dynamic interaction processing on the plurality of first reference vectors and the feature maps corresponding to the plurality of first candidate frames to obtain a plurality of dynamically-interacted reference vectors; the third feedforward network is configured to perform feature enhancement on the plurality of dynamically-interacted reference vectors to obtain the plurality of second reference vectors; and the fourth feedforward network is configured to process the plurality of second reference vectors and the plurality of first candidate frames to obtain the plurality of second candidate frames.
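As a rough illustration of this sub-layer arrangement, the following PyTorch-style sketch shows one possible forward pass of such a decoding layer up to the first box regression (the dynamic interaction layer and the third/fourth feedforward networks are sketched separately later). All module names, dimensions, and head structures are assumptions for illustration and are not prescribed by the present application:

```python
import torch
import torch.nn as nn

class DecodingLayerSketch(nn.Module):
    # A minimal sketch of one decoding layer: self-attention over the reference
    # vectors, multi-head cross-attention against the tiled target first
    # feature map, a feature-enhancement feedforward network, and heads that
    # produce the first classes and first candidate frames.
    def __init__(self, d_model=256, n_heads=8, n_classes=80):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn1 = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, d_model))   # feature enhancement
        self.cls_head = nn.Linear(d_model, n_classes)            # first classes
        self.box_head = nn.Linear(d_model, 4)                    # first candidate frames

    def forward(self, ref_vectors, feat_tokens):
        # ref_vectors: (B, M, d_model) target reference vectors;
        # feat_tokens: (B, H*W, d_model), the tiled target first feature map.
        q, _ = self.self_attn(ref_vectors, ref_vectors, ref_vectors)
        q, _ = self.cross_attn(q, feat_tokens, feat_tokens)
        first_ref = self.ffn1(q)              # plurality of first reference vectors
        logits = self.cls_head(first_ref)     # plurality of first classes
        boxes = self.box_head(first_ref)      # plurality of first candidate frames
        return first_ref, logits, boxes
```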
In one embodiment of the present application, the decoding network may have a larger number of decoding layers in addition to the first decoding layer and the second decoding layer, wherein each decoding layer has the same internal structure as the first decoding layer. The input data of each decoding layer are respectively a plurality of reference vectors output by the third feedforward network of the previous decoding layer and a plurality of candidate frames output by the last feedforward network of the previous decoding layer.
As shown in fig. 7, the decoding network includes N (N is greater than or equal to 3) decoding layers, i.e., decoding layer 1, decoding layer 2, ..., decoding layer N shown in fig. 7. Two frame regression processes and one classification process are performed in each decoding layer, and the second frame regression of each decoding layer is completed on the basis of the first.
Specifically, as shown in fig. 7, the target detection process of the N decoding layers mainly includes the following steps. When the plurality of target reference vectors are input to the first decoding layer (decoding layer 1) to perform the first target detection operation, a plurality of classes (i.e., first classes), a plurality of first candidate frames D_1, and a plurality of second candidate frames C_1 corresponding to the first operation are obtained. Then, the plurality of second candidate frames C_1 and the plurality of second reference vectors A_1 obtained by the first operation are used as input data of the second decoding layer to perform the second target detection operation, yielding a plurality of classes (second classes), a plurality of first candidate frames D_2, and a plurality of second candidate frames C_2; the plurality of second candidate frames C_2 and the plurality of second reference vectors A_2 obtained by the second operation are then used as input data of the third decoding layer to perform the third target detection operation. By analogy, the plurality of second candidate frames C_{N-1} and the plurality of second reference vectors A_{N-1} obtained by the (N-1)th target detection operation of the (N-1)th decoding layer are used as input data of the Nth decoding layer, and the Nth target detection operation is performed to obtain a plurality of classes (Nth classes), a plurality of first candidate frames D_N, and a plurality of second candidate frames C_N corresponding to the Nth operation.
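The cascade described above can be summarized with the following sketch, under the same illustrative assumptions as the decoding-layer sketch above; each decoding layer is treated here as a callable that refines the reference vectors and candidate frames from the previous layer:

```python
def cascaded_detection(decoding_layers, target_ref_vectors, feat_tokens):
    # decoding_layers: the N decoding layers. Each layer is assumed to take
    # the previous reference vectors A (A_0 being the target reference
    # vectors) and candidate frames C (None for the first layer), and to
    # return the refined triple (A_i, C_i, classes_i).
    A, C, classes = target_ref_vectors, None, None
    for layer in decoding_layers:
        A, C, classes = layer(A, C, feat_tokens)
    return C, classes   # the second candidate frames C_N and the final classes
```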
The following describes the functions of each sub-layer when the ith decoding layer implements target detection, taking the ith decoding layer as an example, with reference to fig. 8.
As shown in fig. 8, for the ith decoding layer (i.e., decoding layer i): the self-attention layer is configured to perform self-attention processing on the plurality of second reference vectors A_{i-1}; the multi-head attention layer is configured to perform multi-head attention processing on the plurality of self-attention-processed reference vectors and the target first feature map; the first feedforward network is configured to perform feature enhancement processing on the plurality of reference vectors obtained by the multi-head attention processing to obtain a plurality of first reference vectors B_i; and the second feedforward network is configured to obtain, based on the plurality of first reference vectors B_i and the plurality of second candidate frames C_{i-1} obtained from the (i-1)th decoding layer, a plurality of classes and a plurality of first candidate frames D_i corresponding to the ith decoding layer.
Further, an ROI align operation is performed on the plurality of first candidate frames D_i to obtain, from the plurality of first feature maps, the feature map (local information) corresponding to each first candidate frame D_i; the dynamic interaction layer is configured to perform dynamic interaction between the feature maps corresponding to the plurality of first candidate frames D_i and the plurality of first reference vectors B_i; the third feedforward network is configured to perform feature enhancement on the dynamic interaction result to obtain a plurality of second reference vectors A_i; and the fourth feedforward network is configured to obtain, based on the plurality of first candidate frames D_i and the plurality of second reference vectors A_i, a plurality of second candidate frames C_i corresponding to the ith decoding layer.
In one embodiment of the present application, as shown in fig. 9, the coding network in the present application includes a backbone network (backbone) and a feature pyramid network FPN. The backbone network is used for performing feature extraction on the image to be detected to obtain a plurality of second feature maps, wherein the resolution ratios of the second feature maps are different; and the characteristic pyramid network is used for extracting the characteristics of the second characteristic graphs to obtain a plurality of first characteristic graphs.
In one embodiment of the present application, as shown in fig. 10, the encoding network in the present application includes a backbone network, an encoder, and a feature pyramid network FPN. Optionally, the encoder is a transformer encoder.
Illustratively, the backbone network is configured to perform feature extraction on the image to be detected to obtain a plurality of second feature maps, where the resolutions of the plurality of second feature maps are different; the feature pyramid network is configured to perform feature extraction on the plurality of second feature maps to obtain a plurality of third feature maps, where the resolutions of the plurality of third feature maps are different; the encoder is configured to encode a plurality of first feature vectors together with a target position code to obtain a plurality of second feature vectors, where the plurality of first feature vectors are obtained by tiling a target third feature map, the target third feature map is any one of the plurality of third feature maps whose resolution is smaller than a second threshold, and the target position code is a network parameter of the neural network for target detection used to characterize the spatial relationships among the pixel points in the image to be detected;
accordingly, the plurality of first feature maps include a fourth feature map and the third feature maps other than the target third feature map among the plurality of third feature maps, where the fourth feature map is obtained by combining the plurality of second feature vectors in the reverse order of tiling the target third feature map.
It should be additionally noted that the networks shown in fig. 9 and fig. 10 may each include N decoding layers, i.e., the decoding layer 1, decoding layer 2, ..., decoding layer N shown in fig. 9 and fig. 10.
Referring to fig. 11, fig. 11 is a schematic flowchart of a target detection method according to an embodiment of the present disclosure. The target detection method of the present application is realized by the above neural network for target detection. The method comprises the following steps:
1101: and performing feature extraction on the image to be detected to obtain a plurality of first feature maps, wherein the resolution ratios of the plurality of first feature maps are different.
Optionally, the feature extraction of the image to be detected may be implemented by the above coding network. Therefore, the plurality of first feature maps are first feature maps output by network layers of different depths of the coding network.
For example, as shown in fig. 9, when the coding network includes a backbone network and a feature pyramid network FPN, the image to be detected may be input to the backbone network for feature extraction, so as to obtain a plurality of second feature maps, where resolutions of the plurality of second feature maps are different, that is, the plurality of second feature maps are output from network layers of different depths of the backbone network. Then, inputting each second feature map into a network layer corresponding to the resolution of the second feature map in the FPN for feature extraction, so as to obtain a plurality of first feature maps.
Exemplarily, as shown in fig. 10, when the encoding network includes a backbone network, a feature pyramid network, and an encoder, after the plurality of second feature maps are obtained through the backbone network, each second feature map is input to the network layer of the FPN corresponding to its resolution for feature extraction, yielding a plurality of third feature maps with different resolutions. A target third feature map is then tiled to obtain a plurality of first feature vectors, where the target third feature map is any one of the third feature maps whose resolution is smaller than the second threshold; the present application is described by taking the target third feature map as the third feature map with the smallest resolution. Next, the plurality of first feature vectors and the target position code are input to the encoder for encoding to obtain a plurality of second feature vectors; that is, the plurality of first feature vectors are fused with the target position code, and the fusion result is input to the encoder, where the target position code is obtained by training and is used to characterize the spatial relationships among the pixel points in the image to be detected. Finally, the plurality of second feature vectors are combined in the reverse order of tiling the target third feature map to obtain a fourth feature map, and the fourth feature map together with the third feature maps other than the target third feature map are taken as the plurality of first feature maps.
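A minimal sketch of this encoder path, assuming the target third feature map is the smallest one, the position code is fused by addition, and the encoder is a standard transformer encoder (all of which are illustrative assumptions):

```python
import torch
import torch.nn as nn

def encode_smallest_map(third_feature_map, pos_encoding, encoder):
    # third_feature_map: (B, C, H, W) -- the target third feature map.
    # pos_encoding: a learned (H*W, C) parameter (the target position code).
    # encoder: e.g. an nn.TransformerEncoder built from batch_first layers.
    B, C, H, W = third_feature_map.shape
    tokens = third_feature_map.flatten(2).permute(0, 2, 1)  # tiling -> (B, H*W, C)
    tokens = tokens + pos_encoding                          # fuse with the position code
    encoded = encoder(tokens)                               # plurality of second feature vectors
    # Combine in the reverse order of the tiling to obtain the fourth feature map.
    return encoded.permute(0, 2, 1).reshape(B, C, H, W)

# Example usage with illustrative sizes:
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2)
fmap = torch.randn(1, 256, 8, 8)
pos = nn.Parameter(torch.zeros(64, 256))
fourth = encode_smallest_map(fmap, pos, enc)   # (1, 256, 8, 8)
```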
1102: and performing attention processing on the target first feature map and the plurality of target reference vectors to obtain a plurality of first reference vectors.
The target first feature map is any one of the plurality of first feature maps whose resolution is smaller than the first threshold; the present application is described by taking the target first feature map as the first feature map with the smallest resolution among the plurality of first feature maps. The plurality of target reference vectors are obtained by training, and each target reference vector is used to characterize the features of an object, for example, a face contour or a cup contour.
Illustratively, attention processing is performed on the target first feature map and the plurality of target reference vectors through the first decoding layer to obtain the plurality of first reference vectors. Specifically, the plurality of target reference vectors are processed by the self-attention layer of the first decoding layer, that is, the plurality of target reference vectors are used as Q, K, and V respectively for self-attention processing; multi-head attention processing is then performed on the self-attention result and the target first feature map by the multi-head attention layer of the first decoding layer, that is, the self-attention result is used as Q and the target first feature map as K and V; finally, feature enhancement is performed on the multi-head attention result through a feedforward network of the first decoding layer to obtain the plurality of first reference vectors.
1103: and performing target detection according to the first reference vectors to obtain a plurality of first candidate frames and a plurality of first categories.
For example, a plurality of first candidate frames may be predicted according to a plurality of first reference vectors, that is, target detection is performed according to each first reference vector, a position of each first candidate frame in an image to be detected is predicted, and classification is performed based on each first reference vector to obtain a plurality of first classes, where the plurality of first classes correspond to the plurality of first candidate frames one to one.
Optionally, the plurality of first reference vectors are input to a feed-forward network of the first decoding layer to perform target detection, so as to obtain a plurality of first candidate frames and a plurality of first classes. Illustratively, the characterization form of each first candidate frame is pixel coordinates of four vertexes of each first candidate frame in the image to be detected, and the position of the candidate frame is characterized by the pixel coordinates of the four vertexes.
1104: and processing the feature map corresponding to each first candidate frame in the plurality of first candidate frames and each first reference vector in the plurality of first reference vectors to obtain a plurality of second reference vectors.
The feature map corresponding to each first candidate frame characterizes features of the first candidate frame selected in at least one first feature map in the multiple first feature maps, and the multiple first candidate frames are in one-to-one correspondence with the multiple first reference vectors.
Illustratively, at least one first feature map is selected from the plurality of first feature maps, and an ROI align operation is then performed on each first candidate frame with the at least one first feature map to obtain the feature map corresponding to each first candidate frame. It should be noted that, when the number of the at least one first feature map is greater than or equal to 2, the ROI align operation is performed on the first candidate frame with each of the selected first feature maps, so as to crop the feature map corresponding to the first candidate frame from each first feature map; finally, the feature maps cropped from the first feature maps are spliced to obtain the feature map corresponding to the first candidate frame.
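The ROI align operation mentioned above has a common off-the-shelf implementation; the sketch below uses torchvision's roi_align to crop a fixed-size feature map for each first candidate frame (all sizes, scales, and box values are illustrative):

```python
import torch
from torchvision.ops import roi_align

# first_feature_map: (B, C, H, W); each candidate frame is given in the
# (batch_index, x1, y1, x2, y2) format expected by roi_align.
first_feature_map = torch.randn(1, 256, 32, 32)
boxes = torch.tensor([[0, 4.0, 4.0, 12.0, 12.0],
                      [0, 8.0, 2.0, 20.0, 10.0]])

# Crop a fixed-size (7 x 7) feature map for each candidate frame; the
# spatial_scale maps image coordinates to feature-map coordinates (here a
# 256 x 256 image downsampled to a 32 x 32 feature map).
per_box_features = roi_align(first_feature_map, boxes,
                             output_size=(7, 7), spatial_scale=32 / 256)
print(per_box_features.shape)   # torch.Size([2, 256, 7, 7])
```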
And dynamically interacting the feature map of each first candidate frame with the first reference vector corresponding to each first candidate frame to obtain a second reference vector corresponding to each first reference vector, so that a plurality of second reference vectors can be obtained by a plurality of first candidate frames. For example, the feature map of each first candidate box and the first reference vector corresponding to each first candidate box are input to the dynamic interaction layer of the first decoding layer for dynamic interaction, so as to obtain a second reference vector corresponding to each first reference vector.
The process of dynamic interaction is described below in conjunction with the figures.
Illustratively, each first reference vector is mapped into a two-dimensional matrix, where the number of columns of the matrix is the dimension of the feature map corresponding to the first candidate frame corresponding to that first reference vector, and the number of rows is the number of convolution kernels set for the dynamic interaction process. The size of the convolution kernel in the present application is 1 × 1. Illustratively, each first reference vector is first mapped, through a fully-connected layer of the neural network for target detection, into a long vector whose dimension is the product of the number of rows and the number of columns of the two-dimensional matrix; the long vector is then reshaped to obtain the two-dimensional matrix. Further, the two-dimensional matrix is tiled to obtain the convolution parameters corresponding to each first reference vector; that is, the two-dimensional matrix is tiled into a plurality of one-dimensional sequences, and each one-dimensional sequence serves as one convolution parameter corresponding to the first reference vector. As shown in fig. 12, convolution parameter 1, convolution parameter 2, ..., convolution parameter m are the mapped convolution parameters corresponding to each candidate frame.
As shown in fig. 12, since the dimension of each one-dimensional sequence is the same as the dimension of the feature map corresponding to the first candidate frame corresponding to the first reference vector, each one-dimensional sequence can serve as the parameters of one 1 × 1 convolution kernel. The feature map is then convolved with the plurality of 1 × 1 convolution kernels to obtain the target feature map corresponding to each first reference vector. Next, the target feature map corresponding to each first reference vector is tiled to obtain a plurality of one-dimensional sequences, and these sequences are spliced to obtain the target feature vector corresponding to each first reference vector. Finally, the target feature vector is mapped to obtain the second reference vector corresponding to each first reference vector; thus, after dynamic interaction processing is performed on the plurality of first reference vectors, the plurality of second reference vectors are obtained.
For example, as shown in fig. 13, if the dimension of the feature map corresponding to the first reference vector is 4 (i.e., 4 channels), and the preset number of 1 × 1 convolution kernels is 4, the first reference vector is mapped to a 1 × 16 long vector. As shown in fig. 13, the 1 × 16 long vector is reshaped into a 4 × 4 two-dimensional matrix, which ensures that the number of rows of the matrix equals the number of 1 × 1 convolution kernels and the number of columns equals the dimension of the feature map corresponding to the first reference vector. The two-dimensional matrix is then tiled, yielding the parameters of four 1 × 1 convolution kernels. As shown in fig. 13, the first row of the matrix serves as the parameters of one 1 × 1 convolution kernel; since the dimension of the feature map corresponding to the first reference vector equals the number of columns of the matrix, the feature map can be convolved with this 1 × 1 kernel to obtain the target feature map corresponding to that kernel. Convolving the feature map corresponding to the first reference vector with the four 1 × 1 kernels yields a 4-channel target feature map with a resolution of 4 × 4. The target feature maps are then tiled and spliced: the target feature map of each channel is tiled into a 1 × 16 one-dimensional sequence, and the 1 × 16 sequences of the 4 channels are spliced to obtain the 1 × 64 target feature vector corresponding to the first reference vector. Finally, the 1 × 64 target feature vector is mapped into a second reference vector with the same dimension as the first reference vector, giving the second reference vector corresponding to each first reference vector.
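The worked example above can be reproduced in a few lines; in this sketch the two linear mappings (reference vector to convolution parameters, and spliced vector back to a reference vector) are assumed to be learned fully-connected layers, and the sizes follow the 4-channel example (the reference-vector dimension of 64 is an illustrative assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_ref, n_kernels, d_feat = 64, 4, 4   # d_feat = channels of the ROI feature map
to_params = nn.Linear(d_ref, n_kernels * d_feat)  # reference vector -> conv parameters
to_ref = nn.Linear(n_kernels * 4 * 4, d_ref)      # spliced vector -> reference vector

first_ref = torch.randn(d_ref)            # one first reference vector
roi_feat = torch.randn(1, d_feat, 4, 4)   # feature map of the matching first candidate frame

# Map the reference vector to an (n_kernels x d_feat) matrix: each row is the
# parameter set of one 1 x 1 convolution kernel over d_feat input channels.
weight = to_params(first_ref).reshape(n_kernels, d_feat, 1, 1)
target_feat = F.conv2d(roi_feat, weight)  # (1, n_kernels, 4, 4) target feature map

# Tile each channel, splice the per-channel sequences, and map back to get the
# second reference vector corresponding to this first reference vector.
spliced = target_feat.flatten()           # the 1 x 64 target feature vector
second_ref = to_ref(spliced)
```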
It can be seen that mapping the first reference vector B_i to the parameters of the 1 × 1 convolution kernels is equivalent to using these convolution kernels to weight the feature map corresponding to the first reference vector B_i, so that the important information in the local information framed by each first candidate frame can be retained, improving the prediction precision of the subsequent second candidate frames.
1105: and obtaining a plurality of second candidate frames according to the plurality of second reference vectors and the plurality of first candidate frames.
Illustratively, the offset corresponding to each first candidate frame is determined according to the second reference vector corresponding to that frame; the first candidate frame is then shifted according to the offset to obtain the corresponding second candidate frame, thereby obtaining the plurality of second candidate frames. For example, the plurality of second reference vectors and the plurality of first candidate frames may be input to a feed-forward network to obtain the offset of each first candidate frame; since each first candidate frame is characterized by the pixel coordinates of its four vertices in the image to be detected, the pixel coordinates of the four vertices are shifted based on the offset to obtain four new vertices, i.e., the second candidate frame corresponding to the first candidate frame.
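A trivially small sketch of this shifting step, assuming each candidate frame is stored as the eight pixel coordinates of its four vertices and the feed-forward network outputs one offset per coordinate (both assumptions for illustration):

```python
import torch

def shift_frames(first_frames: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    # first_frames, offsets: (M, 8) -- four (x, y) vertices per first candidate
    # frame and the per-coordinate offsets predicted from the second reference
    # vectors. Adding them yields the second candidate frames.
    return first_frames + offsets
```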
1106: and obtaining a target detection result of the image to be detected based on the plurality of second candidate frames and the plurality of first categories.
Exemplarily, confidence detection is performed based on the plurality of second candidate frames and the plurality of first classes; that is, confidence detection is performed based on each second candidate frame and its corresponding first class to obtain the confidence corresponding to each second candidate frame. The second candidate frames whose confidence is greater than the confidence threshold, together with their corresponding first classes, are then taken as the target detection result of the image to be detected.
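A small sketch of this filtering step; here the confidence of a second candidate frame is taken as its maximum class probability, which is one plausible reading of the confidence detection described above (the application does not fix the exact confidence computation):

```python
import torch

def filter_detections(second_boxes, class_logits, conf_threshold=0.5):
    # second_boxes: (M, 4) or (M, 8) candidate frames; class_logits: (M, n_classes).
    probs = class_logits.softmax(dim=-1)
    conf, labels = probs.max(dim=-1)       # per-frame confidence and class
    keep = conf > conf_threshold           # retain frames above the threshold
    return second_boxes[keep], labels[keep], conf[keep]
```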
In one embodiment of the present application, the decoding network further comprises a second decoding layer. Therefore, before a target detection result of an image to be detected is obtained based on a plurality of second candidate frames and a plurality of first categories, attention processing is performed on a plurality of target first feature maps and a plurality of second reference vectors to obtain a plurality of third reference vectors, namely, similar to the first decoding layer, the target first feature maps and the plurality of second reference vectors are processed through a self-attention layer, a multi-head attention layer and a first feedforward network of the second decoding layer in sequence to obtain a plurality of third reference vectors; then, obtaining a plurality of third candidate frames and a plurality of second classes according to the plurality of third reference vectors and the plurality of second candidate frames, namely processing the plurality of third reference vectors and the plurality of second candidate frames through a second feed-forward network of a second decoding layer to obtain a plurality of third candidate frames and a plurality of second classes; then, processing a feature map corresponding to each third candidate frame in the plurality of third candidate frames and each third reference vector in the plurality of third reference vectors to obtain a plurality of fourth reference vectors, wherein the feature map corresponding to each third candidate frame represents features framed by the third candidate frame in at least one first feature map in the plurality of first feature maps, and the plurality of third candidate frames and the plurality of third reference vectors are in one-to-one correspondence, that is, the feature map corresponding to each third candidate frame in the plurality of third candidate frames and each third reference vector in the plurality of third reference vectors are processed through a dynamic interaction layer of a second decoding layer and a third feed-forward network respectively to obtain a plurality of fourth reference vectors; and finally, obtaining a plurality of fourth candidate frames according to the plurality of fourth reference vectors and the plurality of third candidate frames, namely processing the plurality of fourth reference vectors and the plurality of third candidate frames through a fourth feedforward network of a second decoding layer to obtain the plurality of fourth candidate frames.
It should be noted that the candidate frames obtained by the target detection performed by the second decoding layer are obtained on the basis of those of the first decoding layer. Relatively speaking, the confidences of the fourth candidate frame, the third candidate frame, the second candidate frame, and the first candidate frame decrease in that order. Therefore, the target detection result of the image to be detected can be obtained based on the plurality of fourth candidate frames and the plurality of second classes: similarly, the confidence corresponding to each fourth candidate frame is determined, and the fourth candidate frames whose confidence is greater than the confidence threshold, together with their corresponding second classes, are taken as the target detection result of the image to be detected.
In an embodiment of the present application, when a decoding network of a neural network for target detection includes N decoding layers, the N decoding layers may successively perform N target detection operations. The following describes a process of each decoding layer for implementing the target detection operation for the ith time by taking the ith decoding layer as an example, where i is an integer from 1 to N, and N is an integer greater than or equal to 1.
Illustratively, as shown in fig. 8, a plurality of second reference vectors A_{i-1} and the target first feature map are processed by the ith decoding layer to obtain a plurality of first reference vectors B_i, where the target first feature map is any one of the plurality of first feature maps whose resolution is smaller than the first threshold, and the plurality of second reference vectors A_{i-1} are obtained by performing the (i-1)th target detection operation through the (i-1)th decoding layer. It should be noted that, when i = 1, i.e., when the first target detection operation is performed by the first decoding layer, the plurality of second reference vectors A_{i-1} are the plurality of target reference vectors described above.
Specifically, the plurality of second reference vectors A_{i-1} are processed by the self-attention layer of the ith decoding layer; then, the self-attention result and the target first feature map are subjected to multi-head attention processing by the multi-head attention layer of the ith decoding layer; finally, feature enhancement is performed on the multi-head attention result by a feedforward network of the ith decoding layer to obtain the plurality of first reference vectors B_i.
Then, the plurality of first reference vectors B_i and the plurality of second candidate frames C_{i-1} are processed by the second feed-forward network of the ith decoding layer to obtain a plurality of first candidate frames D_i and the classes corresponding to the ith target detection, where the plurality of second candidate frames C_{i-1} are obtained by performing the (i-1)th target detection operation through the (i-1)th decoding layer.
It should be noted that, when i = 1, i.e., when the first target detection operation is performed, the frame regression is performed for the first time and there are no second candidate frames C_{i-1} to refer to. Therefore, the plurality of first candidate frames D_1 can be obtained directly from the plurality of first reference vectors B_1; that is, the first decoding layer performs target detection directly on the first reference vectors B_1 and predicts the pixel coordinates of each first candidate frame D_1 in the image to be detected, obtaining the first candidate frame D_1 corresponding to each first reference vector B_1. When i > 1, frame regression has already been performed, so the second candidate frames C_{i-1} can be referred to: the offset corresponding to each second candidate frame C_{i-1} is determined from its corresponding first reference vector B_i, and each second candidate frame C_{i-1} is shifted according to its offset to obtain the corresponding first candidate frame D_i, thereby obtaining the plurality of first candidate frames D_i.
In general, when the frame regression is performed for the first time, the candidate frame regression can be directly performed according to the obtained reference vector; when the position information of the candidate frame already exists, the offset of the candidate frame can be determined on the basis of the existing candidate frame, and then the candidate frame is offset according to the offset of the candidate frame to obtain a new frame.
Further, the feature map corresponding to each first candidate frame D_i among the plurality of first candidate frames D_i and the plurality of first reference vectors B_i are subjected to dynamic interaction processing by the dynamic interaction layer of the ith decoding layer; feature enhancement is then performed on the dynamic interaction result by a feedforward network of the ith decoding layer to obtain a plurality of second reference vectors A_i.
Further, the plurality of second reference vectors A_i and the plurality of first candidate frames D_i are processed by the fourth feedforward network of the ith decoding layer to obtain a plurality of second candidate frames C_i. Illustratively, the offset corresponding to each first candidate frame D_i is determined from its corresponding second reference vector A_i; the first candidate frame D_i is shifted according to the offset to obtain the corresponding second candidate frame C_i, thereby obtaining the plurality of second candidate frames C_i.
Finally, the target detection result of the image to be detected is determined from the plurality of second candidate frames C_N and the plurality of classes obtained by performing the Nth target detection operation through the Nth decoding layer.
Optionally, in the process of the ith target detection, the targets in the candidate frames are classified according to the plurality of first reference vectors B_i to obtain a plurality of classes. Since each first reference vector B_i has been processed by the attention mechanism and therefore contains the global information of the image to be detected, the first reference vectors B_i can be used directly to classify the targets in the candidate frames. Of course, since each second reference vector A_i contains not only the global information of the image to be detected but also the local information framed by its corresponding candidate frame, the second reference vectors A_i can also be used to classify the targets in the candidate frames. The present application is mainly described by taking classification with the plurality of first reference vectors B_i as an example.
It should be noted that, in the training process, each time the target detection operation is performed, the first reference vectors obtained in that operation need to be used for classification, so as to compute the loss of the first frame regression in that operation. Therefore, in order to stay aligned with the training process, in the neural network application process the classification is generally performed based on the plurality of first reference vectors B_i, yielding the plurality of first classes corresponding to the ith target detection operation.
It should be noted that, for classification, as long as the reference vector used for classification contains the global information of the image to be detected, the classification result obtained is relatively accurate. The plurality of first classes obtained in any one target detection pass are therefore relatively accurate; in other words, the classes obtained in the N target detection passes should be the same. Hence, in practical applications, the plurality of classes obtained in any one target detection pass can be used as the classes of the image to be detected.
It should be noted that the target detection of the present application is an end-to-end target detection process. Therefore, the plurality of second reference vectors A_{i-1}, the plurality of second reference vectors A_i, the plurality of first reference vectors B_i, the plurality of second candidate frames C_{i-1}, the plurality of first candidate frames D_i, the plurality of second candidate frames C_i, and the plurality of first classes correspond to one another one-to-one, and the plurality of second reference vectors A_{i-1}, the plurality of second reference vectors A_i, and the plurality of first reference vectors B_i have the same dimension.
It should be noted that, since the plurality of first candidate frames D_i are obtained on the basis of the plurality of second candidate frames C_{i-1}, the confidence of each first candidate frame D_i is higher than that of its corresponding second candidate frame C_{i-1}; likewise, since the plurality of second candidate frames C_i are obtained on the basis of the plurality of first candidate frames D_i, the confidence of each second candidate frame C_i is higher than that of its corresponding first candidate frame D_i.
Referring to fig. 14, fig. 14 is a schematic flowchart of a neural network training method according to an embodiment of the present disclosure. The neural network is used for target detection. The neural network for object detection includes an encoding network and a decoding network. The method comprises the following steps:
1401: inputting the training image into a coding network for feature extraction to obtain a plurality of fifth feature maps, wherein the resolution of the fifth feature maps is different.
The training image is an image sample carrying labels, where the labels identify the true positions of the frames in the training image and the true classes of the objects in the frames. In the present application, a labeled frame is referred to as a real frame, and the class of the object in a real frame is referred to as a real class.
As shown in fig. 9, the encoding network may include a backbone network and a feature pyramid network. In this case, the training image is input into the backbone network for feature extraction to obtain a plurality of sixth feature maps with different resolutions; each of the sixth feature maps is then input to the network layer of the feature pyramid network corresponding to its resolution for feature extraction, yielding the plurality of fifth feature maps.
For example, as shown in fig. 10, when the encoding network includes a backbone network, a feature pyramid network, and an encoder, after the plurality of sixth feature maps are obtained, each sixth feature map is input to the network layer of the feature pyramid network corresponding to its resolution for feature extraction, yielding a plurality of seventh feature maps with different resolutions. A target seventh feature map is tiled to obtain a plurality of third feature vectors, where the target seventh feature map is any one of the plurality of seventh feature maps whose resolution is smaller than the second threshold. The plurality of third feature vectors and an initial position code are input to the encoder for encoding to obtain a plurality of fourth feature vectors, where the initial position code is obtained by random initialization and its value is continuously adjusted during the training of the neural network; when the training of the neural network for target detection is completed, the resulting position code is the target position code, and the plurality of fourth feature vectors correspond one-to-one to the plurality of third feature vectors. The plurality of fourth feature vectors are combined in the reverse order of tiling the target seventh feature map to obtain an eighth feature map. Finally, the eighth feature map and the seventh feature maps other than the target seventh feature map are taken as the plurality of fifth feature maps.
1402: Inputting the target fifth feature map and a plurality of initial reference vectors into the first decoding layer for target detection to obtain a plurality of sixth candidate frames and a plurality of third categories.
The initial reference vectors are obtained by random initialization; the value of each initial reference vector is continuously adjusted during training, and when the training of the neural network for target detection is completed, the resulting reference vectors are the plurality of target reference vectors.
Illustratively, similar to the processing of the target first feature map and the plurality of target reference vectors by the first decoding layer on the application side, self-attention processing is first performed on the plurality of initial reference vectors by the self-attention layer of the first decoding layer; then, multi-head attention processing is performed on the target fifth feature map and the self-attention processing result by the multi-head attention layer of the first decoding layer; next, feature enhancement processing is performed on the multi-head attention processing result through a feedforward network of the first decoding layer to obtain a plurality of fifth reference vectors; then, target detection is performed on the plurality of fifth reference vectors through a feedforward network of the first decoding layer to obtain a plurality of fifth candidate frames and a plurality of third categories; then, dynamic interaction processing is performed, by the dynamic interaction layer of the first decoding layer, on the feature map corresponding to each fifth candidate frame in the plurality of fifth candidate frames and the corresponding fifth reference vector in the plurality of fifth reference vectors, and feature enhancement is performed on the dynamic interaction processing result through a feedforward network of the first decoding layer to obtain a plurality of sixth reference vectors, wherein the feature map corresponding to each fifth candidate frame represents the features framed by that fifth candidate frame in at least one fifth feature map in the plurality of fifth feature maps, and the plurality of fifth candidate frames correspond to the plurality of fifth reference vectors one to one; finally, the plurality of sixth reference vectors and the plurality of fifth candidate frames are processed by a feedforward network of the first decoding layer to obtain the plurality of sixth candidate frames.
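A minimal sketch of this data flow, assuming PyTorch and torchvision, is given below. The dynamic interaction layer is reduced to pooling the features framed by each fifth candidate frame on the target fifth feature map and mixing them with the corresponding fifth reference vector through a linear layer; this simplification and all module sizes are illustrative assumptions, not the patented design:

import torch
import torch.nn as nn
from torchvision.ops import roi_align

class FirstDecodingLayerSketch(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_classes=80):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.cls_head = nn.Linear(dim, num_classes)  # third categories
        self.box_head = nn.Linear(dim, 4)            # fifth candidate frames
        self.refine_head = nn.Linear(dim, 4)         # refinement into sixth candidate frames
        self.interact = nn.Linear(2 * dim, dim)      # stand-in dynamic interaction

    def forward(self, feat_map, init_refs):
        # feat_map: (B, C, Hf, Wf) target fifth feature map; init_refs: (B, N, C)
        b, c, hf, wf = feat_map.shape
        tokens = feat_map.flatten(2).transpose(1, 2)            # (B, Hf*Wf, C)
        q, _ = self.self_attn(init_refs, init_refs, init_refs)  # self-attention
        q, _ = self.cross_attn(q, tokens, tokens)               # multi-head attention
        refs5 = self.ffn(q)                                     # fifth reference vectors
        logits = self.cls_head(refs5)                           # third categories
        cx, cy, bw, bh = self.box_head(refs5).sigmoid().unbind(-1)
        boxes5 = torch.stack([(cx - bw / 2) * wf, (cy - bh / 2) * hf,
                              (cx + bw / 2) * wf, (cy + bh / 2) * hf], dim=-1)
        # dynamic interaction: pool the features each fifth frame encloses, then mix
        pooled = roi_align(feat_map, list(boxes5), output_size=1).flatten(1)
        mixed = self.interact(torch.cat([refs5.reshape(-1, c), pooled], dim=-1))
        refs6 = self.ffn(mixed.view(b, -1, c))                  # sixth reference vectors
        boxes6 = boxes5 + self.refine_head(refs6)               # sixth candidate frames
        return boxes5, boxes6, logits, refs6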
1403: Training the neural network according to the plurality of fifth candidate frames, the plurality of sixth candidate frames, the plurality of third categories and the label of the training image.
Illustratively, the loss corresponding to the first decoding layer is determined according to the plurality of fifth candidate frames, the plurality of sixth candidate frames, the plurality of third categories and the label of the training image; the neural network for target detection is then trained based on this loss until the neural network converges or a preset number of training iterations is reached, at which point the training of the neural network for target detection stops.
The process of determining the loss corresponding to the first decoding layer is described below.
Illustratively, the label of the training image includes at least one real frame and at least one real category corresponding to the at least one real frame. The loss corresponding to each real frame is determined according to the plurality of fifth candidate frames, the plurality of sixth candidate frames, the plurality of third categories and the label of the training image, and the losses corresponding to the real frames are taken as the loss corresponding to the first decoding layer.
Illustratively, a first loss between a real frame A and each fifth candidate frame is obtained, wherein the first loss includes an intersection ratio between the real frame A and each fifth candidate frame, a positioning loss between the real frame A and each fifth candidate frame, and a classification loss between the real category corresponding to the real frame A and the third category corresponding to each fifth candidate frame, the real frame A being any one of the at least one real frame. A target fifth candidate frame is then determined according to the first loss corresponding to each fifth candidate frame: for example, the losses in the first loss corresponding to each fifth candidate frame are weighted to obtain a first target loss corresponding to each fifth candidate frame, the fifth candidate frame with the smallest first target loss is taken as the target fifth candidate frame, and the first loss corresponding to the target fifth candidate frame is taken as the first loss corresponding to the real frame A.
Similarly, a second loss between the real frame A and each sixth candidate frame in the plurality of sixth candidate frames is obtained, wherein the second loss includes an intersection ratio between the real frame A and each sixth candidate frame, a positioning loss between the real frame A and each sixth candidate frame, and a classification loss between the real category corresponding to the real frame A and the third category corresponding to each sixth candidate frame. For example, the losses in the second loss corresponding to each sixth candidate frame may be weighted to obtain a second target loss corresponding to each sixth candidate frame; the sixth candidate frame with the smallest second target loss is then taken as the target sixth candidate frame, and the second loss corresponding to the target sixth candidate frame is taken as the second loss corresponding to the real frame A.
the positioning loss represents a difference between a position of the real frame in the image to be detected and a position of the first candidate frame or the second candidate frame in the image to be detected, for example, the positioning loss in the first loss may be a difference between pixel coordinates of four vertices of the real frame and pixel coordinates of four vertices of the first candidate frame.
Finally, the first loss between each real frame and its target fifth candidate frame and the second loss between each real frame and its target sixth candidate frame are taken as the losses corresponding to the first decoding layer.
For example, suppose 10 real frames are marked in the training image, together with the real category of each real frame, and the first decoding layer outputs 100 fifth candidate frames, 100 sixth candidate frames and 100 third categories. Each real frame is matched in turn against each fifth candidate frame to obtain the first loss between that real frame and each fifth candidate frame; the classification loss, intersection ratio and positioning loss in the first loss are then weighted to obtain the first target loss corresponding to each fifth candidate frame, and the fifth candidate frame with the smallest first target loss is taken as the target fifth candidate frame corresponding to that real frame. Similarly, the target sixth candidate frame corresponding to each real frame may be obtained. Finally, the first loss between each of the 10 real frames and its target fifth candidate frame and the second loss between each real frame and its target sixth candidate frame are taken as the losses corresponding to the first decoding layer.
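The per-real-frame matching in this example can be sketched as follows, assuming PyTorch and torchvision; the three loss terms and the weights w_iou, w_l1 and w_cls are illustrative placeholders, not the patent's exact formulation:

import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def match_real_frame(real_box, real_cls, cand_boxes, cand_logits,
                     w_iou=2.0, w_l1=5.0, w_cls=1.0):
    # real_box: (4,), real_cls: int, cand_boxes: (N, 4), cand_logits: (N, K)
    iou = box_iou(real_box[None], cand_boxes)[0]          # intersection ratio
    loc = (real_box[None] - cand_boxes).abs().sum(-1)     # positioning loss
    target = torch.full((len(cand_boxes),), real_cls, dtype=torch.long)
    cls = F.cross_entropy(cand_logits, target, reduction="none")  # classification loss
    # weight the three terms into one target loss per candidate, keep the smallest
    target_loss = w_iou * (1.0 - iou) + w_l1 * loc + w_cls * cls
    best = int(target_loss.argmin())
    return best, target_loss[best]

In the example above, this routine would run once per real frame over the 100 fifth candidate frames (yielding the target fifth candidate frame) and once over the 100 sixth candidate frames (yielding the target sixth candidate frame).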
In one embodiment of the present application, the decoding network further includes a second decoding layer. In this case, before the neural network is trained according to the plurality of fifth candidate frames, the plurality of sixth candidate frames, the plurality of third categories and the label of the training image, the target fifth feature map, the plurality of sixth reference vectors and the plurality of sixth candidate frames are input into the second decoding layer for target detection to obtain a plurality of eighth candidate frames and a plurality of fourth categories. The second decoding layer is used for performing attention processing on the target fifth feature map and the plurality of sixth reference vectors to obtain a plurality of seventh reference vectors; obtaining a plurality of seventh candidate frames and a plurality of fourth categories based on the plurality of seventh reference vectors and the plurality of sixth candidate frames; processing the feature map corresponding to each seventh candidate frame in the plurality of seventh candidate frames and each seventh reference vector in the plurality of seventh reference vectors to obtain a plurality of eighth reference vectors, wherein the feature map corresponding to each seventh candidate frame represents the features framed by that seventh candidate frame in at least one fifth feature map in the plurality of fifth feature maps; and obtaining the plurality of eighth candidate frames based on the plurality of eighth reference vectors and the plurality of seventh candidate frames.
Accordingly, the neural network may be trained based on the plurality of fifth candidate frames, the plurality of sixth candidate frames, the plurality of third categories, the plurality of seventh candidate frames, the plurality of eighth candidate frames, the plurality of fourth categories and the label of the training image. Specifically, the loss corresponding to the first decoding layer is determined according to the plurality of fifth candidate frames, the plurality of sixth candidate frames, the plurality of third categories and the label of the training image, and the loss corresponding to the second decoding layer is determined according to the plurality of seventh candidate frames, the plurality of eighth candidate frames, the plurality of fourth categories and the label of the training image; then, the neural network is trained according to the loss corresponding to the first decoding layer and the loss corresponding to the second decoding layer. The loss of the second decoding layer is determined in a manner similar to that of the first decoding layer, and will not be repeated here.
It should be noted that, in the training process, the neural network may be trained sequentially using the three losses corresponding to each real frame obtained by each decoding layer; or the three losses corresponding to each real frame may be weighted to obtain a final loss corresponding to each real frame, and the neural network is trained according to the final loss corresponding to each real frame; or the final losses corresponding to the real frames obtained by each decoding layer may be weighted to obtain the final loss corresponding to each decoding layer, and the neural network is trained using the final loss corresponding to each decoding layer; or, when the decoding network includes N decoding layers, the final losses of the N decoding layers may be weighted to obtain a target loss, and the neural network is trained using the target loss.
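For instance, the last option can be sketched as follows, with equal layer weights as an illustrative assumption:

def target_loss(per_layer_final_losses, layer_weights=None):
    # per_layer_final_losses: one final loss tensor per decoding layer
    if layer_weights is None:
        layer_weights = [1.0] * len(per_layer_final_losses)
    return sum(w * l for w, l in zip(layer_weights, per_layer_final_losses))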
It should be understood that the present application only illustrates the process of performing neural network training using one training image, and in practical applications, a plurality of training images are required to perform iterative training on the neural network until the neural network converges to obtain a trained neural network.
For the neural network structure shown in fig. 10 and several existing neural networks, a ResNet network is used as the backbone network; the existing neural networks and the neural network of the present application are each trained on the Common Objects in Context (COCO) dataset published by Microsoft, yielding the training results of each network shown in tables 1 and 2.
Table 1: when the ResNet network is ResNet-50, the training effect of each neural network is as follows:
(Table 1 is provided as an image in the original publication; see the following paragraph for its key results.)
It can be seen that when the backbone network is ResNet-50, the neural network of the present application achieves 46.3 AP, the best performance among the compared networks; compared with DETR, its target detection accuracy is 4.3 AP higher, and it needs only 36 training epochs to converge, shortening the training schedule. In addition, the neural network structure of the present application reaches 30.5 AP on small-target detection (i.e., APs), higher than the other neural networks, so the detection accuracy for small targets is improved.
Table 2: when the ResNet network is ResNet-101, the training effect of each neural network is as follows:
Neural network    Backbone network   Epochs   AP     AP50   AP75   APs    APm    APl    FPS
Faster R-CNN      ResNet-101         36       42.0   62.5   45.9   25.2   45.6   54.6   20
DETR              ResNet-101         500      43.5   63.8   46.4   21.9   48.0   61.8   20
Sparse R-CNN      ResNet-101         36       45.6   64.6   49.5   28.3   48.3   61.6   18
TSP-RCNN          ResNet-101         36       44.8   63.8   49.2   29.0   47.9   57.1   -
SMCA              ResNet-101         50       44.4   65.2   48.0   24.3   48.5   61.0   -
This application  ResNet-101         36       47.4   66.7   51.9   30.8   50.7   62.2   13
It can be seen that when the backbone network is ResNet-101, the neural network of the present application achieves 47.4 AP, the best performance among the compared networks; compared with DETR, it is 3.9 AP higher, so its target detection accuracy is higher. The neural network structure of the present application also reaches 30.8 AP on small-target detection, higher than the other neural networks, improving the detection accuracy for small targets.
Referring to fig. 15, fig. 15 is a schematic structural diagram of a target detection apparatus according to an embodiment of the present disclosure. The object detection apparatus 1500 includes an acquisition unit 1501 and a processing unit 1502;
an acquisition unit 1501 for acquiring an image to be detected;
the processing unit 1502 is configured to perform feature extraction on an image to be detected to obtain a plurality of first feature maps, where resolutions of the plurality of first feature maps are different;
performing attention processing on a target first feature map and a plurality of target reference vectors to obtain a plurality of first reference vectors, wherein the target first feature map is any one of the first feature maps with the resolution lower than a first threshold value, and each target reference vector is used for representing the feature of an object;
performing target detection according to the first reference vectors to obtain a plurality of first candidate frames and a plurality of first categories;
processing a feature map corresponding to each of the plurality of first candidate frames and each of the plurality of first reference vectors to obtain a plurality of second reference vectors, wherein the feature map corresponding to each of the plurality of first candidate frames represents features of the first candidate frame framed in at least one first feature map of the plurality of first feature maps, and the plurality of first candidate frames and the plurality of first reference vectors are in one-to-one correspondence;
obtaining a plurality of second candidate frames according to the plurality of second reference vectors and the plurality of first candidate frames;
and obtaining a target detection result of the image to be detected based on the plurality of second candidate frames and the plurality of first categories.
Referring to fig. 16, fig. 16 is a schematic structural diagram of a neural network training device according to an embodiment of the present application. The neural network is used for target detection. The neural network training device 1600 comprises an acquisition unit 1601 and a processing unit 1602;
an acquisition unit 1601 configured to acquire a training image;
a processing unit 1602, configured to input a training image into the coding network to perform feature extraction, so as to obtain a plurality of fifth feature maps, where resolutions of the plurality of fifth feature maps are different; inputting a target fifth feature map and initial reference vectors into the first decoding layer for target detection to obtain a plurality of sixth candidate frames and a plurality of third categories, wherein the target fifth feature map is any one of the fifth feature maps with the resolution smaller than a first threshold value, and the initial reference vectors are obtained through initialization; the first decoding layer is configured to perform attention processing on the target fifth feature map and the plurality of initial reference vectors to obtain a plurality of fifth reference vectors; performing target detection on the fifth reference vectors to obtain a plurality of fifth candidate frames and a plurality of third categories; processing a feature map corresponding to each of the fifth candidate boxes and a fifth reference vector in the fifth reference vectors to obtain a plurality of sixth reference vectors, wherein the feature map corresponding to each of the fifth candidate boxes characterizes features of the fifth candidate box framed in at least one fifth feature map in the fifth feature maps; obtaining a plurality of sixth candidate frames based on the plurality of sixth reference vectors and the plurality of fifth candidate frames, wherein the plurality of fifth candidate frames and the plurality of fifth reference vectors are in one-to-one correspondence; training the neural network according to the fifth candidate boxes, the sixth candidate boxes, the third classes and the labels of the training images.
Fig. 17 is a schematic structural diagram of another object detection apparatus provided in the embodiment of the present application. The object detection apparatus 1700 shown in fig. 17 includes a memory 1701, a processor 1702, a communication interface 1703, and a bus 1704. The memory 1701, the processor 1702, and the communication interface 1703 are communicatively connected to each other via the bus 1704.
The memory 1701 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 1701 may store a program; when the program stored in the memory 1701 is executed by the processor 1702, the processor 1702 and the communication interface 1703 are configured to perform the steps of the target detection method of the embodiments of the present application.
The processor 1702 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more Integrated circuits, and is configured to execute related programs to implement the functions required to be executed by the units in the object detection apparatus according to the embodiment of the present disclosure, or to execute the object detection method according to the embodiment of the present disclosure.
The processor 1702 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the target detection method of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1702. The processor 1702 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1701, and the processor 1702 reads the information in the memory 1701 and, in combination with its hardware, completes the functions required to be executed by the units included in the object detection apparatus 1500 of the embodiments of the present application, or executes the target detection method of the method embodiments of the present application.
Communication interface 1703 enables communication between apparatus 1700 and other devices or a communication network using transceiver means, such as, but not limited to, a transceiver. For example, the image to be detected may be acquired through the communication interface 1703.
The bus 1704 may include a pathway to transfer information between various components of the device 1700 (e.g., the memory 1701, the processor 1702, and the communication interface 1703).
It is to be understood that the acquisition unit 1501 in the object detection apparatus 1500 corresponds to the communication interface 1703 in the object detection apparatus 1700; the processing unit 1502 in the object detection apparatus 1500 may correspond to the processor 1702.
Referring to fig. 18, fig. 18 is a schematic structural diagram of a neural network training device according to an embodiment of the present application. The neural network training device 1800 shown in fig. 18 includes a memory 1801, a processor 1802, a communication interface 1803, and a bus 1804. The memory 1801, the processor 1802, and the communication interface 1803 are communicatively connected to each other via a bus 1804.
The Memory 1801 may be a Read Only Memory (ROM), a static Memory device, a dynamic Memory device, or a Random Access Memory (RAM). The memory 1801 may store a program, and the processor 1802 and the communication interface 1803 are configured to perform the steps of the neural network training method of the embodiments of the present application when the program stored in the memory 1801 is executed by the processor 1802.
The processor 1802 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more Integrated circuits, and is configured to execute related programs to implement the functions that the units in the neural network training device 1600 of the embodiment of the present Application need to execute, or to execute the neural network training method of the embodiment of the present Application.
The processor 1802 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the neural network training method of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1802. The processor 1802 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1801, and the processor 1802 reads the information in the memory 1801 and, in combination with its hardware, completes the functions required to be executed by the units included in the neural network training apparatus 1600 of the embodiments of the present application, or performs the neural network training method of the embodiments of the present application.
The communication interface 1803 enables communication between the apparatus 1800 and other devices or communication networks using transceiver means, such as, but not limited to, a transceiver. For example, training images may be acquired through the communication interface 1803.
The bus 1804 may include a pathway to transfer information between various components of the apparatus 1800 (e.g., memory 1801, processor 1802, communication interface 1803).
It is to be appreciated that the acquisition unit 1601 in the neural network training device 1600 is equivalent to the communication interface 1803 in the neural network training device 1800 and that the processing unit 1602 may be equivalent to the processor 1802.
It should be noted that although the target detection apparatus 1700 and the neural network training apparatus 1800 shown in fig. 17 and 18 only illustrate memories, processors, and communication interfaces, in a specific implementation process, those skilled in the art will understand that the target detection apparatus 1700 and the neural network training apparatus 1800 also include other devices necessary for normal operation. Also, according to particular needs, those skilled in the art will appreciate that the object detection apparatus 1700 and the neural network training apparatus 1800 may also include hardware components for performing other additional functions. Furthermore, it should be understood by those skilled in the art that the object detection apparatus 1700 and the neural network training apparatus 1800 may also include only the components necessary to implement the embodiments of the present application, and need not include all of the components shown in fig. 17 or fig. 18.
It is understood that the neural network training apparatus 1800 corresponds to the training device 120 in fig. 2, and the object detecting apparatus 1700 corresponds to the executing device 110 in fig. 2. Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (23)

1. A neural network for object detection, the neural network comprising an encoding network and a decoding network, the decoding network comprising a first decoding layer;
the encoding network is used for extracting features of an image to be detected to obtain a plurality of first feature maps, wherein the resolution ratios of the first feature maps are different;
the first decoding layer is configured to perform attention processing on a target first feature map and a plurality of target reference vectors to obtain a plurality of first reference vectors, where the target first feature map is any one of the first feature maps with a resolution lower than a first threshold in the plurality of first feature maps, and the plurality of target reference vectors are network parameters of the neural network;
performing target detection on the plurality of first reference vectors to obtain a plurality of first candidate frames and a plurality of first categories;
processing a feature map corresponding to each of the plurality of first candidate frames and each of the plurality of first reference vectors to obtain a plurality of second reference vectors, wherein the feature map corresponding to each of the plurality of first candidate frames represents features of the first candidate frame framed in at least one first feature map of the plurality of first feature maps, and the plurality of first candidate frames and the plurality of first reference vectors are in one-to-one correspondence;
and obtaining a plurality of second candidate frames based on the plurality of second reference vectors and the plurality of first candidate frames.
2. The neural network of claim 1,
the decoding network further comprises a second decoding layer;
the second decoding layer is used for performing attention processing on the target first feature map and the plurality of second reference vectors to obtain a plurality of third reference vectors;
obtaining a plurality of third candidate frames and a plurality of second classes based on the plurality of third reference vectors and the plurality of second candidate frames;
processing a feature map corresponding to each of the third candidate frames and each of the third reference vectors to obtain a plurality of fourth reference vectors, wherein the feature map corresponding to each of the third candidate frames represents features of the third candidate frame framed in at least one first feature map of the first feature maps, and the third candidate frames and the third reference vectors are in one-to-one correspondence;
obtaining a plurality of fourth candidate frames based on the plurality of fourth reference vectors and the plurality of third candidate frames.
3. The neural network of claim 1 or 2,
the coding network comprises a backbone network and a characteristic pyramid network;
the backbone network is used for extracting features of the image to be detected to obtain a plurality of second feature maps, wherein the resolution ratios of the second feature maps are different;
the feature pyramid network is used for extracting features of the second feature maps to obtain the first feature maps.
4. The neural network of claim 1 or 2,
the coding network comprises a backbone network, a characteristic pyramid network and a coder;
the backbone network is used for extracting features of the image to be detected to obtain a plurality of second feature maps, wherein the resolution ratios of the second feature maps are different;
the feature pyramid network is used for performing feature extraction on the plurality of second feature maps to obtain a plurality of third feature maps, wherein the resolution ratios of the plurality of third feature maps are different;
the encoder is configured to encode a plurality of first feature vectors and a target position code to obtain a plurality of second feature vectors, where the plurality of first feature vectors are obtained by tiling a target third feature map, the target third feature map is any one of the third feature maps with a resolution smaller than a second threshold in the plurality of third feature maps, and the target position code is a network parameter of the neural network;
the plurality of first feature maps include a fourth feature map and the third feature maps, of the plurality of third feature maps, other than the target third feature map, and the fourth feature map is obtained by combining the plurality of second feature vectors in the reverse order of tiling the target third feature map.
5. A method of object detection, comprising:
performing feature extraction on an image to be detected to obtain a plurality of first feature maps, wherein the resolution ratios of the first feature maps are different;
performing attention processing on a target first feature map and a plurality of target reference vectors to obtain a plurality of first reference vectors, wherein the target first feature map is any one of the first feature maps with the resolution lower than a first threshold value, and each target reference vector is used for representing the feature of an object;
performing target detection according to the first reference vectors to obtain a plurality of first candidate frames and a plurality of first categories;
processing a feature map corresponding to each of the plurality of first candidate frames and each of the plurality of first reference vectors to obtain a plurality of second reference vectors, wherein the feature map corresponding to each of the plurality of first candidate frames represents features of the first candidate frame framed in at least one first feature map of the plurality of first feature maps, and the plurality of first candidate frames and the plurality of first reference vectors are in one-to-one correspondence;
obtaining a plurality of second candidate frames according to the plurality of second reference vectors and the plurality of first candidate frames;
and obtaining a target detection result of the image to be detected based on the plurality of second candidate frames and the plurality of first categories.
6. The method according to claim 5, wherein before obtaining the target detection result of the image to be detected based on the plurality of second candidate frames and the plurality of first categories, the method further comprises:
performing attention processing on the target first feature map and the plurality of second reference vectors to obtain a plurality of third reference vectors;
obtaining a plurality of third candidate frames and a plurality of second classes according to the plurality of third reference vectors and the plurality of second candidate frames;
processing a feature map corresponding to each of the third candidate frames and each of the third reference vectors to obtain a plurality of fourth reference vectors, wherein the feature map corresponding to each of the third candidate frames represents features of the third candidate frame framed in at least one first feature map of the first feature maps, and the third candidate frames and the third reference vectors are in one-to-one correspondence;
obtaining a plurality of fourth candidate frames according to the plurality of fourth reference vectors and the plurality of third candidate frames;
the obtaining of the target detection result of the image to be detected based on the plurality of second candidate frames and the plurality of first categories includes:
and obtaining a target detection result of the image to be detected based on the fourth candidate frames and the second classes.
7. The method according to claim 5 or 6, wherein the extracting features of the image to be detected to obtain a plurality of first feature maps comprises:
inputting the image to be detected into a backbone network for feature extraction to obtain a plurality of second feature maps, wherein the resolution ratios of the second feature maps are different;
and inputting the plurality of second feature maps into a feature pyramid network for feature extraction to obtain the plurality of first feature maps.
8. The method according to claim 5 or 6, wherein the extracting features of the image to be detected to obtain a plurality of first feature maps comprises:
inputting the image to be detected into a backbone network for feature extraction to obtain a plurality of second feature maps, wherein the resolution ratios of the second feature maps are different;
inputting the second feature maps into a feature pyramid network for feature extraction to obtain third feature maps, wherein the third feature maps have different resolutions;
tiling a target third feature map to obtain a plurality of first feature vectors, wherein the target third feature map is any one of the third feature maps with the resolution smaller than a second threshold value;
inputting the plurality of first feature vectors and target position codes into an encoder for encoding to obtain a plurality of second feature vectors, wherein the target position codes are used for representing the spatial relationship among all pixel points in the image to be detected;
combining the plurality of second feature vectors according to the reverse order of tiling the target third feature map to obtain a fourth feature map;
and taking the fourth feature map and a third feature map except the target third feature map in the plurality of third feature maps as the plurality of first feature maps.
9. A neural network training method, wherein the neural network is used for object detection, wherein the neural network comprises an encoding network and a decoding network, wherein the decoding network comprises a first decoding layer, and wherein the method comprises:
inputting a training image into the coding network for feature extraction to obtain a plurality of fifth feature maps, wherein the resolution ratios of the fifth feature maps are different;
inputting a target fifth feature map and initial reference vectors into the first decoding layer for target detection to obtain a plurality of sixth candidate frames and a plurality of third categories, wherein the target fifth feature map is any one of the fifth feature maps with the resolution smaller than a first threshold value, and the initial reference vectors are obtained through initialization;
the first decoding layer is configured to perform attention processing on the target fifth feature map and the plurality of initial reference vectors to obtain a plurality of fifth reference vectors; performing target detection on the fifth reference vectors to obtain a plurality of fifth candidate frames and a plurality of third categories; processing a feature map corresponding to each of the fifth candidate boxes and a fifth reference vector in the fifth reference vectors to obtain a plurality of sixth reference vectors, wherein the feature map corresponding to each of the fifth candidate boxes characterizes features of the fifth candidate box framed in at least one fifth feature map in the fifth feature maps; obtaining a plurality of sixth candidate frames based on the plurality of sixth reference vectors and the plurality of fifth candidate frames, wherein the plurality of fifth candidate frames and the plurality of fifth reference vectors are in one-to-one correspondence;
training the neural network according to the fifth candidate boxes, the sixth candidate boxes, the third classes and the labels of the training images.
10. The method of claim 9,
the decoding network further comprises a second decoding layer; before training the neural network according to the fifth candidate boxes, the sixth candidate boxes, the third classes, and the label of the training image, the method further includes:
inputting the target fifth feature map, the sixth reference vectors and the sixth candidate boxes to the second decoding layer for target detection to obtain eighth candidate boxes and fourth classes;
the second decoding layer is configured to perform attention processing on the target fifth feature map and the sixth reference vectors to obtain seventh reference vectors; obtaining a plurality of seventh candidate frames and a plurality of fourth categories based on the plurality of seventh reference vectors and the plurality of sixth candidate frames; processing a feature map corresponding to each of the seventh candidate frames in the plurality of seventh candidate frames and each of the seventh reference vectors to obtain a plurality of eighth reference vectors, wherein the feature map corresponding to each of the seventh candidate frames represents features framed by the seventh candidate frame in at least one fifth feature map in the plurality of fifth feature maps; obtaining a plurality of eighth candidate frames based on the plurality of eighth reference vectors and the plurality of seventh candidate frames;
the training the neural network according to the fifth candidate boxes, the sixth candidate boxes, the third classes, and the label of the training image includes:
training the neural network according to the fifth candidate boxes, the sixth candidate boxes, the third classes, the seventh candidate boxes, the eighth candidate boxes, the fourth classes, and the label of the training image.
11. The method according to claim 9 or 10,
the coding network comprises a backbone network and a characteristic pyramid network; inputting the training image into the coding network for feature extraction to obtain a plurality of fifth feature maps, wherein the method comprises the following steps:
inputting the training image into the backbone network to perform feature extraction on the training image to obtain a plurality of sixth feature maps, wherein the sixth feature maps have different resolutions;
and inputting the sixth feature maps into the feature pyramid network for feature extraction to obtain fifth feature maps.
12. The method according to claim 9 or 10,
the coding network comprises a backbone network, a characteristic pyramid network and a coder;
inputting the training image into the coding network for feature extraction to obtain a plurality of fifth feature maps, wherein the method comprises the following steps:
inputting the training image into the backbone network to perform feature extraction on the training image to obtain a plurality of sixth feature maps, wherein the sixth feature maps have different resolutions;
inputting the sixth feature maps into the feature pyramid network for feature extraction to obtain seventh feature maps, wherein the seventh feature maps have different resolutions;
tiling a target seventh feature map to obtain a plurality of third feature vectors, wherein the target seventh feature map is any one of the seventh feature maps with resolution smaller than a second threshold value;
inputting the plurality of third feature vectors and initial position codes into the encoder for encoding to obtain a plurality of fourth feature vectors, wherein the initial position codes are obtained by initialization;
combining the plurality of fourth feature vectors according to the reverse order of tiling the target seventh feature map to obtain an eighth feature map;
and taking the eighth feature map and a seventh feature map of the plurality of seventh feature maps except the target seventh feature map as the plurality of fifth feature maps.
13. An object detection device, comprising: an acquisition unit and a processing unit;
the acquisition unit is used for acquiring an image to be detected;
the processing unit is used for extracting features of an image to be detected to obtain a plurality of first feature maps, wherein the resolution ratios of the first feature maps are different;
performing attention processing on a target first feature map and a plurality of target reference vectors to obtain a plurality of first reference vectors, wherein the target first feature map is any one of the first feature maps with the resolution lower than a first threshold value, and each target reference vector is used for representing the feature of an object;
performing target detection according to the first reference vectors to obtain a plurality of first candidate frames and a plurality of first categories;
processing a feature map corresponding to each of the plurality of first candidate frames and each of the plurality of first reference vectors to obtain a plurality of second reference vectors, wherein the feature map corresponding to each of the plurality of first candidate frames represents features of the first candidate frame framed in at least one first feature map of the plurality of first feature maps, and the plurality of first candidate frames and the plurality of first reference vectors are in one-to-one correspondence;
obtaining a plurality of second candidate frames according to the plurality of second reference vectors and the plurality of first candidate frames;
and obtaining a target detection result of the image to be detected based on the plurality of second candidate frames and the plurality of first categories.
14. The apparatus according to claim 13, wherein before the processing unit obtains the target detection result of the image to be detected based on the plurality of second candidate frames and the plurality of first classes, the processing unit is further configured to:
performing attention processing on the target first feature map and the plurality of second reference vectors to obtain a plurality of third reference vectors;
obtaining a plurality of third candidate frames and a plurality of second classes according to the plurality of third reference vectors and the plurality of second candidate frames;
processing a feature map corresponding to each of the third candidate frames and each of the third reference vectors to obtain a plurality of fourth reference vectors, wherein the feature map corresponding to each of the third candidate frames represents features of the third candidate frame framed in at least one first feature map of the first feature maps, and the third candidate frames and the third reference vectors are in one-to-one correspondence;
obtaining a plurality of fourth candidate frames according to the plurality of fourth reference vectors and the plurality of third candidate frames;
the obtaining of the target detection result of the image to be detected based on the plurality of second candidate frames and the plurality of first categories includes:
and obtaining a target detection result of the image to be detected based on the fourth candidate frames and the second classes.
15. The apparatus according to claim 13 or 14, wherein in terms of the processing unit performing feature extraction on the image to be detected to obtain a plurality of first feature maps, the processing unit is specifically configured to:
inputting the image to be detected into a backbone network for feature extraction to obtain a plurality of second feature maps, wherein the resolution ratios of the second feature maps are different;
and inputting the plurality of second feature maps into a feature pyramid network for feature extraction to obtain the plurality of first feature maps.
16. The apparatus according to claim 13 or 14, wherein in terms of the processing unit performing feature extraction on the image to be detected to obtain a plurality of first feature maps, the processing unit is specifically configured to:
inputting the image to be detected into a backbone network for feature extraction to obtain a plurality of second feature maps, wherein the resolution ratios of the second feature maps are different;
inputting the second feature maps into a feature pyramid network for feature extraction to obtain third feature maps, wherein the third feature maps have different resolutions;
tiling a target third feature map to obtain a plurality of first feature vectors, wherein the target third feature map is any one of the third feature maps with the resolution smaller than a second threshold value;
inputting the plurality of first feature vectors and target position codes into an encoder for encoding to obtain a plurality of second feature vectors, wherein the target position codes are used for representing the spatial relationship among all pixel points in the image to be detected;
combining the plurality of second feature vectors according to the reverse order of tiling the target third feature map to obtain a fourth feature map;
and taking the fourth feature map and a third feature map except the target third feature map in the plurality of third feature maps as the plurality of first feature maps.
17. A neural network training device is characterized in that the neural network is used for target detection, the neural network comprises an encoding network and a decoding network, and the neural network training device comprises an acquisition unit and a processing unit;
the acquisition unit is used for acquiring a training image;
the processing unit is used for inputting a training image into the coding network for feature extraction to obtain a plurality of fifth feature maps, wherein the resolution ratios of the fifth feature maps are different;
inputting a target fifth feature map and initial reference vectors into the first decoding layer for target detection to obtain a plurality of sixth candidate frames and a plurality of third categories, wherein the target fifth feature map is any one of the fifth feature maps with the resolution smaller than a first threshold value, and the initial reference vectors are obtained through initialization;
the first decoding layer is configured to process the target fifth feature map and the plurality of initial reference vectors to obtain a plurality of fifth reference vectors; performing target detection on the fifth reference vectors to obtain a plurality of fifth candidate frames and a plurality of third categories; processing a feature map corresponding to each of the fifth candidate boxes and a fifth reference vector in the fifth reference vectors to obtain a plurality of sixth reference vectors, wherein the feature map corresponding to each of the fifth candidate boxes characterizes features of the fifth candidate box framed in at least one fifth feature map in the fifth feature maps; obtaining a plurality of sixth candidate frames based on the plurality of sixth reference vectors and the plurality of fifth candidate frames, wherein the plurality of fifth candidate frames and the plurality of fifth reference vectors are in one-to-one correspondence;
training the neural network according to the fifth candidate boxes, the sixth candidate boxes, the third classes and the labels of the training images.
18. The apparatus of claim 17,
the decoding network further comprises a second decoding layer; before the processing unit trains the neural network according to the fifth candidate frames, the sixth candidate frames, the third categories, and the labels of the training images, the processing unit is further configured to:
inputting the target fifth feature map, the sixth reference vectors and the sixth candidate boxes to the second decoding layer for target detection to obtain eighth candidate boxes and fourth classes;
the second decoding layer is configured to perform attention processing on the target fifth feature map and the sixth reference vectors to obtain seventh reference vectors; obtaining a plurality of seventh candidate frames and a plurality of fourth categories based on the plurality of seventh reference vectors and the plurality of sixth candidate frames; processing a feature map corresponding to each of the seventh candidate frames in the plurality of seventh candidate frames and each of the seventh reference vectors to obtain a plurality of eighth reference vectors, wherein the feature map corresponding to each of the seventh candidate frames represents features framed by the seventh candidate frame in at least one fifth feature map in the plurality of fifth feature maps; obtaining a plurality of eighth candidate frames based on the plurality of eighth reference vectors and the plurality of seventh candidate frames;
the training the neural network according to the fifth candidate boxes, the sixth candidate boxes, the third classes, and the label of the training image includes:
training the neural network according to the fifth candidate boxes, the sixth candidate boxes, the third classes, the seventh candidate boxes, the eighth candidate boxes, the fourth classes, and the label of the training image.
19. The apparatus of claim 17 or 18,
the coding network comprises a backbone network and a characteristic pyramid network; in the aspect that the processing unit inputs the training image to the coding network for feature extraction to obtain a plurality of fifth feature maps, the processing unit is specifically configured to:
inputting the training image into the backbone network to perform feature extraction on the training image to obtain a plurality of sixth feature maps, wherein the sixth feature maps have different resolutions;
and inputting the sixth feature maps into the feature pyramid network for feature extraction to obtain fifth feature maps.
20. The apparatus of claim 17 or 18,
the coding network comprises a backbone network, a characteristic pyramid network and a coder;
in the aspect that the training image is input to the coding network for feature extraction to obtain a plurality of fifth feature maps, the processing unit is specifically configured to:
inputting the training image into the backbone network to perform feature extraction on the training image to obtain a plurality of sixth feature maps, wherein the sixth feature maps have different resolutions;
inputting the sixth feature maps into the feature pyramid network for feature extraction to obtain seventh feature maps, wherein the seventh feature maps have different resolutions;
tiling a target seventh feature map to obtain a plurality of third feature vectors, wherein the target seventh feature map is any one of the seventh feature maps with resolution smaller than a second threshold value;
inputting the plurality of third feature vectors and initial position codes into the encoder for encoding to obtain a plurality of fourth feature vectors, wherein the initial position codes are obtained by initialization;
combining the plurality of fourth feature vectors according to the reverse order of tiling the target seventh feature map to obtain an eighth feature map;
and taking the eighth feature map and a seventh feature map of the plurality of seventh feature maps except the target seventh feature map as the plurality of fifth feature maps.
21. An electronic device, comprising: a memory for storing a program; a processor for executing programs stored in the memory; the processor is configured to implement the method of any one of claims 5-8 or claims 9-12 when the program stored in the memory is executed.
22. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code for execution by a device, the program code comprising instructions for implementing the method of any of claims 5-8 or claims 9-12.
23. A computer program product, characterized in that the computer program product, when run on a computer, causes the computer to implement the method of any of claims 5-8 or claims 9-12.
CN202110920174.XA 2021-08-11 2021-08-11 Neural network, target detection method, neural network training method and related products Active CN113807361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110920174.XA CN113807361B (en) 2021-08-11 2021-08-11 Neural network, target detection method, neural network training method and related products

Publications (2)

Publication Number Publication Date
CN113807361A 2021-12-17
CN113807361B (en) 2023-04-18

Family

ID=78942968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110920174.XA Active CN113807361B (en) 2021-08-11 2021-08-11 Neural network, target detection method, neural network training method and related products

Country Status (1)

Country Link
CN (1) CN113807361B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3582142A1 (en) * 2018-06-15 2019-12-18 Université de Liège Image classification using neural networks
CN110033000A (en) * 2019-03-21 2019-07-19 Huazhong University of Science and Technology Text detection and recognition method for bill images
CN111968191A (en) * 2019-05-20 2020-11-20 Disney Enterprises, Inc. Automatic image synthesis using a comb neural network architecture
WO2021051369A1 (en) * 2019-09-20 2021-03-25 Intel Corporation Convolutional neural network loop filter based on classifier
US20210209395A1 (en) * 2020-06-12 2021-07-08 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, electronic device, and storage medium for recognizing license plate
CN112598579A (en) * 2020-12-28 2021-04-02 Suzhou Keda Special Video Co., Ltd. Image super-resolution method and device for monitoring scene and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIANG, HONGYI et al.: "A Survey of Object Detection Models and Their Optimization Methods", Acta Automatica Sinica *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023116507A1 * 2021-12-22 2023-06-29 Beijing Wodong Tianjun Information Technology Co., Ltd. Target detection model training method and apparatus, and target detection method and apparatus
CN115046363A * 2022-06-30 2022-09-13 Guilin University of Electronic Technology Intelligent refrigerator based on machine vision and spectrum detection and operation method
CN115046363B (en) * 2022-06-30 2024-05-07 Guilin University of Electronic Technology Intelligent refrigerator based on machine vision and spectrum detection and operation method
CN114898187A * 2022-07-12 2022-08-12 Nanjing Houmo Intelligent Technology Co., Ltd. Target detection network construction method and device for cloud edge
CN114898187B (en) * 2022-07-12 2022-10-28 Nanjing Houmo Intelligent Technology Co., Ltd. Target detection network construction method and device for cloud edge
CN115100419A * 2022-07-20 2022-09-23 Institute of Automation, Chinese Academy of Sciences Target detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113807361B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
US10614574B2 (en) Generating image segmentation data using a multi-branch neural network
CN111402130B (en) Data processing method and data processing device
CN112308200B (en) Searching method and device for neural network
CN111583097A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN113191489B (en) Training method of binary neural network model, image processing method and device
US11769227B2 (en) Generating synthesized digital images utilizing a multi-resolution generator neural network
KR20220038996A (en) Method and apparatus of embedding feature
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
WO2023015409A1 (en) Object pose detection method and apparatus, computer device, and storage medium
CN116453067B (en) Sprinting timing method based on dynamic visual identification
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN112183649A (en) Algorithm for predicting pyramid feature map
CN114359289A (en) Image processing method and related device
CN111738074B (en) Pedestrian attribute identification method, system and device based on weak supervision learning
CN115147648A (en) Tea shoot identification method based on improved YOLOv5 target detection
KR20230071052A (en) Apparatus and method for image processing
CN114387346A (en) Image recognition and prediction model processing method, three-dimensional modeling method and device
CN112446292B (en) 2D image salient object detection method and system
CN117036658A (en) Image processing method and related equipment
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant