CN113449586A - Target detection method, target detection device, computer equipment and storage medium

Info

Publication number: CN113449586A
Application number: CN202110387750.9A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Prior art keywords: target, target detection, feature map, detected, image
Inventors: 张少林, 宁欣, 田伟娟
Applicants/Assignees: Shenzhen Wave Kingdom Co., Ltd.; Beijing Wave Wisdom Security And Safety Technology Co., Ltd.

Classifications

    • G06F18/241 Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/047 Neural networks; probabilistic or stochastic networks
    • G06N3/08 Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a target detection method, a target detection device, computer equipment and a storage medium. The method comprises the following steps: acquiring an image to be detected; inputting the image to be detected into a trained target detection model, wherein the target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit; extracting a first feature map corresponding to the image to be detected through the preprocessing unit, and performing attention-based pooling on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map; performing feature extraction on the first low-dimensional feature map through the feature extraction unit to obtain target capsule information corresponding to the image to be detected; and carrying out target detection on the target capsule information through the prediction unit to obtain a target detection result corresponding to the image to be detected. By adopting the method, the computational complexity and the memory complexity can be reduced when performing target detection on a smaller target object or a partially occluded target object.

Description

Target detection method, target detection device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a target detection method, an apparatus, a computer device, and a storage medium.
Background
Target detection refers to detecting the target objects in an image and predicting the position and category of each target object. As an important branch of computer vision and digital image processing, target detection is widely applied in fields such as robot navigation, intelligent video surveillance, industrial inspection and aerospace; by replacing manual effort with computer vision, it reduces the consumption of human resources and is therefore of great practical significance. Target detection is also a basic algorithm in the field of identity recognition and plays an important role in subsequent tasks such as face recognition, gait recognition, crowd counting and instance segmentation. With the wide application of deep learning, target detection technology has developed rapidly. In a conventional target detection method, target detection is achieved by extracting the feature map corresponding to the target objects in an image, for example by the target detector DETR (DEtection TRansformer, set-prediction-based target detection).
However, smaller target objects must be detected on high-resolution feature maps, and performing target detection on such maps in the conventional manner incurs high computational complexity. How to reduce the computational complexity for small target objects in the target detection process is therefore a technical problem to be solved.
Disclosure of Invention
In view of the above, it is necessary to provide a target detection method, an apparatus, a computer device and a storage medium capable of reducing the computational complexity for small target objects in the target detection process.
A method of target detection, the method comprising:
acquiring an image to be detected;
inputting the image to be detected into a trained target detection model, wherein the target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit;
extracting a first feature map corresponding to the image to be detected through the preprocessing unit, and performing attention-based pooling on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map;
performing feature extraction on the first low-dimensional feature map through the feature extraction unit to obtain target capsule information corresponding to the image to be detected;
and carrying out target detection on the target capsule information through the prediction unit to obtain a target detection result corresponding to the image to be detected.
In one embodiment, the extracting, by the preprocessing unit, the first feature map corresponding to the image to be detected includes:
and performing feature extraction on the image to be detected through a convolutional neural network in the preprocessing unit, and determining the feature maps output by the last two convolutional layers of the convolutional neural network as the first feature map corresponding to the image to be detected.
In one embodiment, the performing attention-based pooling processing on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map includes:
performing multi-head attention calculation on the first feature map to obtain a multi-head attention value corresponding to the first feature map;
and carrying out normalization processing on the multi-head attention value to obtain a first low-dimensional feature map corresponding to the first feature map.
In one embodiment, the feature extraction unit comprises an encoding unit and a decoding unit; the feature extraction of the first low-dimensional feature map by the feature extraction unit to obtain the target capsule information corresponding to the image to be detected comprises the following steps:
global feature extraction is carried out on the first low-dimensional feature map through the coding unit to obtain global feature information, and capsule conversion is carried out on the global feature information to obtain initial capsule information;
inputting the initial capsule information into the decoding unit, extracting the category characteristics of the initial capsule information to obtain category characteristic information, and performing capsule conversion on the category characteristic information to obtain target capsule information.
In one embodiment, the performing, by the prediction unit, the target detection on the target capsule information to obtain the target detection result corresponding to the image to be detected includes:
performing target detection on the target capsule information through the prediction unit based on attention routing to obtain a first detection result;
performing linear transformation on the target capsule information through the prediction unit to obtain a second detection result;
and fusing the first detection result and the second detection result to obtain a target detection result corresponding to the image to be detected.
In one embodiment, before the acquiring the image to be detected, the method further includes:
acquiring a sample image set;
inputting the sample image set into a target detection model to be trained, extracting a second feature map corresponding to the sample image set through a preprocessing unit in the target detection model to be trained, and performing attention-based pooling processing on the second feature map to obtain a second low-dimensional feature map corresponding to the second feature map;
performing feature extraction on the second low-dimensional feature map through a feature extraction unit in the target detection model to be trained to obtain target capsule information corresponding to the sample image set;
performing target detection on target capsule information corresponding to the sample image set through a prediction unit in the target detection model to be trained to obtain a target detection result corresponding to the sample image set;
calculating a loss value of the target detection model to be trained according to a target detection result corresponding to the sample image set, and updating network parameters of the target detection model to be trained according to the loss value until a preset condition is met to obtain the trained target detection model.
In one embodiment, the sample image set is labeled with target label information; the calculating a loss value of the target detection model to be trained according to the target detection result corresponding to the sample image set includes:
performing bipartite matching between the target detection result corresponding to the sample image set and the target label information to obtain a matching result;
and calculating the loss value of the target detection model to be trained according to the matching result.
In one embodiment, the loss values of the target detection model to be trained include a target position offset loss value, a classification loss value and a matching loss value.
An object detection apparatus, the apparatus comprising:
the image acquisition module is used for acquiring an image to be detected;
the feature extraction module is used for inputting the image to be detected into a trained target detection model, and the target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit; extracting a first feature map corresponding to the image to be detected through the preprocessing unit, and performing attention-based pooling on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map; performing feature extraction on the first low-dimensional feature map through the feature extraction unit to obtain target capsule information corresponding to the image to be detected;
and the target detection module is used for carrying out target detection on the target capsule information through the prediction unit to obtain a target detection result corresponding to the image to be detected.
A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, the processor implementing the steps in the various method embodiments described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the respective method embodiment described above.
According to the above target detection method, apparatus, computer device and storage medium, the first feature map corresponding to the image to be detected is extracted by the preprocessing unit of the trained target detection model, and attention-based pooling is performed on the first feature map to obtain the first low-dimensional feature map corresponding to the first feature map. Feature extraction is then performed on the first low-dimensional feature map by the feature extraction unit to obtain the target capsule information corresponding to the image to be detected, and target detection is performed on the target capsule information by the prediction unit to obtain the target detection result corresponding to the image to be detected. Through the attention-based pooling of the first feature map, irrelevant information in the first feature map can be removed so that only information relevant to target detection is attended to, thereby reducing the computational complexity. When performing target detection on a smaller target object or a partially occluded target object, both the computational complexity and the memory complexity can be reduced.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a target detection method;
FIG. 2 is a schematic flow chart diagram of a method for object detection in one embodiment;
FIG. 3 is a flowchart illustrating a step of performing attention-based pooling on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map in an embodiment;
FIG. 4 is a schematic flowchart illustrating a step of performing feature extraction on the first low-dimensional feature map by the feature extraction unit to obtain target capsule information corresponding to an image to be detected in one embodiment;
FIG. 5 is a flowchart illustrating a step of performing target detection on target capsule information by a prediction unit to obtain a target detection result corresponding to an image to be detected in one embodiment;
FIG. 6 is a block diagram of an embodiment of an object detection device;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The target detection method provided by the application can be applied to computer equipment, and the computer equipment can be a terminal or a server. It can be understood that the target detection method provided by the present application can be applied to a terminal, can also be applied to a server, can also be applied to a system comprising the terminal and the server, and is implemented through interaction between the terminal and the server.
The target detection method provided by the application can reduce the computational complexity for small target objects and is suitable for multiple target detection application scenarios. For example, in a face recognition scenario, the method can improve face detection precision and reduce the false detection rate; in a vehicle detection scenario, the vehicles in a surveillance image can be identified more accurately.
The target detection method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 and the server 104 communicate via a network. When target detection is needed, the server 104 obtains the image to be detected sent by the terminal 102 and inputs it into a trained target detection model comprising a preprocessing unit, a feature extraction unit and a prediction unit. A first feature map corresponding to the image to be detected is extracted by the preprocessing unit, and attention-based pooling is performed on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map; feature extraction is performed on the first low-dimensional feature map by the feature extraction unit to obtain target capsule information corresponding to the image to be detected; and target detection is performed on the target capsule information by the prediction unit to obtain a target detection result corresponding to the image to be detected. The terminal 102 may be, but is not limited to, a personal computer, a laptop, a smartphone, a tablet or a portable wearable device. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in fig. 2, an object detection method is provided, which is described by taking the application of the method to a server as an example, and includes the following steps:
step 202, obtaining an image to be detected.
And 204, inputting the image to be detected into a trained target detection model, wherein the target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit.
The image to be detected is an image which needs to be subjected to target detection. Target detection refers to detecting the target objects in an image and predicting the position and category of each target object; the target objects, such as human faces, vehicles or buildings, are determined according to the actual application scenario.
The server may obtain a target detection request sent by the terminal and parse the request to obtain the image to be detected. The image to be detected, stored in the terminal in advance, may be an image of a target object acquired by an image sensor; the position, size and acquisition angle of the target object in the image may be arbitrary. For example, the target object may occupy only a small portion of the image to be detected. In addition, due to the acquisition angle, the target object may appear inclined in the image, or its proportions may be distorted relative to its real proportions; for example, two parallel sides of a rectangular target object may have different lengths in the image to be detected, with the other two sides no longer parallel.
In one embodiment, the target object in the image to be detected may have a regular-shaped frame, the target object may have a fixed number of vertices, and the vertices are connected to form the frame of the target object. For example, the frame of the target object may be a square, rectangle, or the like having four vertices.
After the server acquires the image to be detected, a pre-stored trained target detection model is called, wherein the trained target detection model is obtained by training a sample image set marked with a target class label. The target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit, and can effectively reduce the calculation complexity of a small target object.
And step 206, extracting a first feature map corresponding to the image to be detected through the preprocessing unit, and performing attention-based pooling on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map.
The first feature map may be used to detect smaller objects in the image or partially occluded objects. The first low-dimensional feature map refers to a low-dimensional feature representation corresponding to the first feature map.
The preprocessing unit in the target detection model is used to obtain the first low-dimensional feature map corresponding to the first feature map. Specifically, the preprocessing unit extracts the first feature map corresponding to the image to be detected and performs attention-based pooling on it to obtain the first low-dimensional feature map. The first feature map may include two feature maps with different resolutions, with the higher-resolution one used for detecting smaller or partially occluded target objects in the image. The first feature map contains the global image information of the image to be detected.
To reduce the computational and memory complexity of the higher-resolution feature map, i.e., to reduce the computational complexity for small or partially occluded target objects during target detection, the server performs attention-based pooling on the first feature map before feature extraction. Attention-based pooling removes irrelevant points with small response values from the first feature map, i.e., sparsifies it, so that only information relevant to target detection is extracted; this reduces the amount of calculation and thereby the computational and memory complexity.
And 208, performing feature extraction on the first low-dimensional feature map through a feature extraction unit to obtain target capsule information corresponding to the image to be detected.
The target capsule information refers to the capsule representation of the target object in the image to be detected, namely, the characteristic information of the target object is represented by the capsule.
The feature extraction unit in the target detection model is configured to extract target capsule information of a target object in an image to be detected, where the target capsule information may include a plurality of capsule vectors, each capsule vector is configured to represent feature information of a corresponding target object, each capsule vector includes a plurality of dimensions, and each dimension is configured to represent pose information of a local feature of a corresponding target object. Therefore, the local characteristic information of the target object in the image to be detected can be accurately reflected through the target capsule information, so that the target object in the image to be detected can be represented through the local characteristic information.
And step 210, performing target detection on the target capsule information through a prediction unit to obtain a target detection result corresponding to the image to be detected.
After the feature extraction unit extracts the target capsule information corresponding to the image to be detected, the target capsule information is taken as the input of the prediction unit, which performs target detection according to the target capsule information and predicts the position and corresponding category of each target object in the image to be detected. The position of a target object refers to the frame corresponding to the target object.
According to the principle of brain cognition, during cognition the brain first performs saliency-driven attention triggered by external stimuli; this requires no active intervention and is a screening process from the bottom layer to the upper layer, i.e., bottom-up inference is performed first. Then, according to the target information of the upper layer, the corresponding lower-layer information is screened, i.e., top-down information transmission. Therefore, in this embodiment, the prediction unit may process the target capsule information through both a capsule transmission manner and a fully connected manner. The capsule transmission manner realizes bottom-up inference: it uses the local feature information of a target object and predicts the target object's position and category from that local feature information. The fully connected manner realizes top-down information transmission: it uses the overall information of a target object and predicts the position and category from that overall information. By combining the two manners, the local feature information and the overall information of the target object are fully utilized, effectively improving the accuracy of target detection.
The conventional target detector DETR (DEtection TRansformer, set-prediction-based target detection) suffers from slow convergence and high computational complexity; in particular, target detection for small target objects must be performed on high-resolution feature maps, which results in high computational and memory complexity. In this embodiment, the first feature map corresponding to the image to be detected is extracted by the preprocessing unit of the trained target detection model, and attention-based pooling is performed on the first feature map to obtain the first low-dimensional feature map corresponding to the first feature map. Feature extraction is then performed on the first low-dimensional feature map by the feature extraction unit to obtain the target capsule information, and target detection is performed on the target capsule information by the prediction unit to obtain the target detection result corresponding to the image to be detected. Through the attention-based pooling of the first feature map, irrelevant information can be removed so that only information relevant to target detection is attended to, thereby reducing the computational complexity. When performing target detection on a smaller target object or a partially occluded target object, both the computational and memory complexity can be reduced.
In one embodiment, the extracting, by the preprocessing unit, the first feature map corresponding to the image to be detected includes: performing feature extraction on the image to be detected through a convolutional neural network in the preprocessing unit, and determining the feature maps output by the last two convolutional layers of the convolutional neural network as the first feature map corresponding to the image to be detected.
The preprocessing unit of the trained target detection model may include a convolutional neural network (CNN). The convolutional neural network is used to extract the first feature map corresponding to the image to be detected so that the subsequent feature extraction unit can extract the target capsule information corresponding to the image to be detected. The convolutional neural network may include multiple network layers, such as an input layer, multiple convolutional layers, a pooling layer, a fully connected layer, and so on. The feature maps output by the last two convolutional layers may be determined as the first feature map corresponding to the image to be detected. As the convolutional layers process the input in sequence, the resolution of the output feature maps gradually decreases; thus, the resolution of the feature map output by the penultimate convolutional layer is higher than that of the feature map output by the last convolutional layer. For convenience of description, the penultimate convolutional layer may be referred to as layer a-1, and its output feature map as F_{a-1}, of size [bs, d_{a-1}, h_{a-1}, w_{a-1}], where bs denotes the batch size (number of samples in a batch), d_{a-1} the feature dimension of F_{a-1}, h_{a-1} the height of F_{a-1}, and w_{a-1} the width of F_{a-1}. The last convolutional layer is referred to as layer a, and its output feature map as F_a, of size [bs, d_a, h_a, w_a], where bs denotes the batch size, d_a the feature dimension of F_a, h_a the height of F_a, and w_a the width of F_a. The sizes of F_{a-1} and F_a satisfy h_{a-1} > h_a and w_{a-1} > w_a.
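As a minimal illustration of such a preprocessing backbone, the following PyTorch sketch builds a toy CNN and returns the feature maps of its last two convolutional stages; the layer names, channel counts and strides are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Toy CNN whose last two convolutional stages play the roles of
    layer a-1 (higher resolution, F_{a-1}) and layer a (lower resolution, F_a)."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.layer_a_minus_1 = nn.Sequential(  # penultimate conv stage
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.layer_a = nn.Sequential(          # last conv stage
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        x = self.stem(x)
        f_a_minus_1 = self.layer_a_minus_1(x)  # [bs, d_{a-1}, h_{a-1}, w_{a-1}]
        f_a = self.layer_a(f_a_minus_1)        # [bs, d_a, h_a, w_a], h_a < h_{a-1}
        return f_a_minus_1, f_a

img = torch.randn(2, 3, 256, 256)              # batch of 2 images to be detected
f_am1, f_a = Backbone()(img)
print(f_am1.shape, f_a.shape)                  # [2,128,64,64] and [2,256,32,32]
```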
Further, since the feature extraction unit uses no recurrent or convolutional structure, in order for it to exploit the sequence information of the image to be detected, information expressing the absolute or relative position of each element in the image must be introduced. For example, the first feature map may be position-encoded using the convolutional neural network, and the encoded first feature map then subjected to attention-based pooling. Position encoding refers to encoding the positions of the elements included in the first feature map.
In this embodiment, the first feature map includes not only the feature map output by the last convolutional layer but also the feature map output by the second last convolutional layer with a higher resolution, thereby enabling target detection of a small target object.
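For illustration, one common way to realize such position encoding is the fixed sinusoidal scheme used by DETR-style detectors, sketched below for a flattened feature map; this particular scheme is an assumption for illustration, since the patent does not fix the encoding form:

```python
import math
import torch

def sinusoidal_position_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Fixed sine/cosine encoding of each element's absolute position,
    returned as a [seq_len, dim] tensor to be added to flattened features."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # [seq_len, 1]
    i = torch.arange(0, dim, 2, dtype=torch.float32)               # even indices
    freq = torch.exp(-math.log(10000.0) * i / dim)                 # [dim/2]
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

# flattened feature map of h*w = 64*64 positions, feature dimension 128
pe = sinusoidal_position_encoding(64 * 64, 128)
```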
In one embodiment, as shown in fig. 3, the step of performing attention-based pooling on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map includes:
step 302, performing multi-head attention calculation on the first feature map to obtain a multi-head attention value corresponding to the first feature map.
And step 304, carrying out normalization processing on the multi-head attention value to obtain a first low-dimensional feature map corresponding to the first feature map.
The preprocessing unit in the trained target detection model may include a convolutional neural network and a pooling unit. The convolutional neural network is used to extract the first feature map corresponding to the image to be detected; the first feature map includes the feature maps output by the last two convolutional layers, which may be denoted as the feature map F_{a-1} output by layer a-1 and the feature map F_a output by layer a. The pooling unit performs attention-based pooling on the feature maps F_{a-1} and F_a separately. Attention-based pooling consists of performing multi-head attention calculation on a feature map and normalizing the resulting multi-head attention value. The pooling unit may first perform attention-based pooling on the feature map F_{a-1} and then on the feature map F_a.
Taking the attention-based pooling of the feature map F_{a-1} as an example, the pooling unit may adopt a multi-head attention mechanism to perform multi-head attention calculation on F_{a-1} to obtain the multi-head attention value corresponding to the first feature map, and normalize the multi-head attention value to obtain the first low-dimensional feature map corresponding to the first feature map. The attention-based pooling process PMA(Z) may be expressed by the following formulas:
PMA(Z) = LayerNorm(S + Multihead(S, Z, Z))    (1)
Multihead(S, Z, Z) = softmax(S·Z^T / √dim)·Z    (2)
where S denotes the first low-dimensional feature map, Z denotes the key and value vectors, i.e., the feature map F_{a-1}, Multihead(S, Z, Z) denotes the multi-head attention value, LayerNorm denotes the normalization processing, 1/√dim is the scale factor, and dim denotes the dimension of the feature map F_{a-1}.
It will be appreciated that the feature map F_a is subjected to attention-based pooling in the same manner to obtain the first low-dimensional feature map corresponding to F_a. This yields the first low-dimensional feature map corresponding to F_{a-1} and the first low-dimensional feature map corresponding to F_a.
In this embodiment, by performing multi-head attention calculation and normalization on the first feature map, only information related to target detection is extracted, which effectively reduces the amount of calculation as well as the computation consumed by attention in the subsequent feature extraction unit, thereby effectively reducing the computational and memory complexity.
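A minimal sketch of the attention-based pooling of formulas (1) and (2), assuming PyTorch's nn.MultiheadAttention as the Multihead operator and a small set of learned seed vectors S; the number of seeds, heads and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """PMA(Z) = LayerNorm(S + Multihead(S, Z, Z)): pool a long sequence Z
    (the flattened feature map) into a short low-dimensional summary."""
    def __init__(self, dim: int, num_seeds: int = 64, num_heads: int = 4):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(num_seeds, dim))  # S
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: [bs, h*w, dim] -- flattened feature map F_{a-1} (keys/values)
        s = self.seed.unsqueeze(0).expand(z.size(0), -1, -1)   # [bs, num_seeds, dim]
        pooled, _ = self.attn(query=s, key=z, value=z)         # Multihead(S, Z, Z)
        return self.norm(s + pooled)                           # formula (1)

z = torch.randn(2, 64 * 64, 128)      # bs=2, 4096 positions, dim=128
low_dim = AttentionPooling(128)(z)    # [2, 64, 128]: far fewer positions
```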
In one embodiment, as shown in fig. 4, the step of extracting the features of the first low-dimensional feature map by the feature extraction unit to obtain the target capsule information corresponding to the image to be detected includes:
and 402, performing global feature extraction on the first low-dimensional feature map through a coding unit to obtain global feature information, and performing capsule conversion on the global feature information to obtain initial capsule information.
And step 404, inputting the initial capsule information into a decoding unit, performing class feature extraction on the initial capsule information to obtain class feature information, and performing capsule conversion on the class feature information to obtain target capsule information.
The feature extraction unit may be a Transformer network based on capsule representation. The feature extraction unit includes an encoding unit and a decoding unit, each of which contains a capsule conversion unit for converting information into the form of capsules.
The encoding unit in the feature extraction unit is used to extract global feature information from the first low-dimensional feature map, such as color features, texture features, shape features, and the like; the global feature information is then subjected to capsule conversion by the capsule conversion unit in the encoding unit to obtain initial capsule information. The initial capsule information refers to the capsule representation corresponding to the global feature information. Performing capsule conversion on the global feature information means gathering the feature information of the same target object in the global feature information to generate a capsule. A capsule is embodied as a capsule vector, and therefore covers multiple local features of the target object. The modular length (norm) of the capsule vector of each capsule represents the existence probability of each local feature in the target object, and the dimensions of the capsule vector represent the pose information corresponding to each local feature of the target object. The first low-dimensional feature map includes the first low-dimensional feature map corresponding to F_{a-1} and that corresponding to F_a; therefore, the initial capsule information includes the initial capsule representation corresponding to F_{a-1} and that corresponding to F_a. The initial capsule representation corresponding to F_{a-1} is denoted P_{a-1}, of size [bs, num_{a-1}, d_{a-1}/num_{a-1}, s_{a-1}], where bs denotes the batch size, num_{a-1} the number of capsules in the initial capsule representation corresponding to F_{a-1}, and d_{a-1}/num_{a-1} the capsule vector of each capsule in that representation. The initial capsule representation corresponding to F_a is denoted P_a, of size [bs, num_a, d_a/num_a, s_a], where bs denotes the batch size, num_a the number of capsules, and d_a/num_a the capsule vector of each capsule. By performing capsule conversion on the global feature information, feature information of the same instance type in the global feature information can be grouped into one class, such as the eyes, mouth and nose of the same target object.
The decoding unit in the feature extraction unit is used to extract the category feature information in the initial capsule information and the boundary information of the target objects; the extracted category feature information and boundary information are subjected to capsule conversion by the capsule conversion unit in the decoding unit to obtain the target capsule information. The target capsule information includes the target capsule representation O_{a-1} corresponding to P_{a-1} and the target capsule representation O_a corresponding to P_a. The size of O_{a-1} may be [100, bs, num_{a-1}, d_{a-1}/num_{a-1}], where 100 denotes the number of capsules to be detected in the image to be detected. The size of O_a may be [100, bs, num_a, d_a/num_a]. By performing capsule conversion on the extracted category feature information and the boundary information of the target objects, the feature information and corresponding boundary information of the same target object can be clustered together; therefore, the target capsule information includes, for each target object, its local feature information, its feature information, and the corresponding boundary information.
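A minimal sketch of what one capsule conversion step could look like, assuming the common reshape-plus-squash formulation from capsule networks; the patent only specifies that feature information is gathered into capsule vectors whose norm acts as an existence probability, so the squash nonlinearity and grouping sizes below are assumptions:

```python
import torch

def squash(v: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Squash nonlinearity: keeps the vector's orientation (pose) while
    mapping its norm into (0, 1) so it can act as an existence probability."""
    sq = (v * v).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * v / torch.sqrt(sq + eps)

def to_capsules(features: torch.Tensor, num_caps: int) -> torch.Tensor:
    """Group a [bs, d] feature vector into num_caps capsules of size d/num_caps."""
    bs, d = features.shape
    caps = features.view(bs, num_caps, d // num_caps)  # [bs, num, d/num]
    return squash(caps)

feats = torch.randn(2, 128)
caps = to_capsules(feats, num_caps=16)   # [2, 16, 8]
probs = caps.norm(dim=-1)                # per-capsule existence probability
```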
In this embodiment, by adding the process of capsule conversion to the encoding unit and the decoding unit, different postures of the target object can be accurately identified, the target object can be represented by local feature information, the local representation of the target object is more accurate, and the accuracy of target detection is improved.
In one embodiment, as shown in fig. 5, the step of performing target detection on the target capsule information by the prediction unit to obtain a target detection result corresponding to the image to be detected includes:
and 502, performing target detection on target capsule information through a prediction unit based on attention routing to obtain a first detection result.
And step 504, performing linear transformation on the target capsule information through a prediction unit to obtain a second detection result.
And 506, fusing the first detection result and the second detection result to obtain a target detection result corresponding to the image to be detected.
Since the target capsule information includes the target capsule representation O_{a-1} corresponding to P_{a-1} and the target capsule representation O_a corresponding to P_a, the prediction unit may perform target detection on O_{a-1} and O_a respectively. Target detection here refers to identifying the target objects in the image to be detected according to the target capsule information and predicting the position and category of each target object. The target detection process of the prediction unit includes two detection modes: one performs target detection according to local feature information in a capsule transmission manner, and the other performs target detection according to overall information in a fully connected manner. The prediction unit may process the target capsule information through both detection modes simultaneously and fuse their results, so that the target detection accuracy is higher. The first detection result includes the first detection results corresponding to O_{a-1} and O_a, and the second detection result includes the second detection results corresponding to O_{a-1} and O_a.
Taking the target capsule representation O_{a-1} as an example, when performing target detection in the capsule transmission manner, the prediction unit may perform target detection on the target capsule representations O_{a-1} and O_a respectively based on a bottom-up attention routing algorithm. The bottom-up attention routing algorithm is a routing algorithm based on the multi-head attention mechanism, which needs to obtain the probability values assigned from lower-layer capsules to upper-layer capsules, such as the probability that a lower-layer capsule is assigned to the face of a target object; the lower-layer capsules may be the capsules in the target capsule representation O_{a-1}, and the upper-layer capsules may be the capsules in the first detection result after target detection. The bottom-up attention routing algorithm takes the number of capsules in O_{a-1} as the number of heads of a multi-head attention mechanism and, along the dimension corresponding to the number of capsules, uses the multi-head attention mechanism to calculate the correlation between each upper-layer capsule and the affine-transformed lower-layer capsules, thereby realizing information transfer between capsules and obtaining the first detection result corresponding to O_{a-1}; the calculation formula can be as shown in (2). The first detection result may include multiple capsules together with the category and position of the target object corresponding to each capsule; the number of these capsules equals the number of capsules to be detected in the target capsule information. The fully connected manner is a top-down information transmission mode: when performing target detection in the fully connected manner, the prediction unit performs a linear transformation on the target capsule representation O_{a-1} and determines the category and position of each target object from its overall information, obtaining the second detection result corresponding to O_{a-1}.
Through the above target detection modes, the first and second detection results corresponding to the target capsule representations O_{a-1} and O_a can be obtained by prediction. The first detection result is predicted from the local feature information of the target objects, and the second detection result is predicted from their overall information; therefore, by fusing the first and second detection results, the prediction unit makes full use of both the local feature information and the overall information of the target objects, effectively improving the accuracy of the target detection result.
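A minimal sketch of the two prediction paths and their fusion, with simple stand-ins: a multi-head attention layer for the bottom-up attention routing, a linear transform for the top-down fully connected path, and averaging as the fusion rule; all of these concrete choices are illustrative assumptions rather than the patent's exact design:

```python
import torch
import torch.nn as nn

class PredictionUnit(nn.Module):
    def __init__(self, dim: int, num_classes: int, num_heads: int = 4):
        super().__init__()
        # capsule-transmission path: attention-based routing over capsules
        self.routing = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.route_cls = nn.Linear(dim, num_classes + 1)  # classes + "no object"
        self.route_box = nn.Linear(dim, 4)                # (cx, cy, w, h)
        # fully connected path: linear transform of the whole capsule
        self.fc_cls = nn.Linear(dim, num_classes + 1)
        self.fc_box = nn.Linear(dim, 4)

    def forward(self, o: torch.Tensor):
        # o: [bs, num_caps, dim] -- target capsule representation (O_{a-1} or O_a)
        routed, _ = self.routing(o, o, o)                 # bottom-up routing
        cls1, box1 = self.route_cls(routed), self.route_box(routed).sigmoid()
        cls2, box2 = self.fc_cls(o), self.fc_box(o).sigmoid()
        # fuse the local-feature (routing) and overall (fc) predictions
        return (cls1 + cls2) / 2, (box1 + box2) / 2

o = torch.randn(2, 100, 128)          # 100 capsules to be detected
logits, boxes = PredictionUnit(128, num_classes=80)(o)
```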
In one embodiment, before acquiring the image to be detected, the method further comprises: acquiring a sample image set; inputting the sample image set into a target detection model to be trained, extracting a second feature map corresponding to the sample image set through a preprocessing unit in the target detection model to be trained, and performing attention-based pooling processing on the second feature map to obtain a second low-dimensional feature map corresponding to the second feature map; performing feature extraction on the second low-dimensional feature map through a feature extraction unit in the target detection model to be trained to obtain target capsule information corresponding to the sample image set; performing target detection on target capsule information corresponding to the sample image set through a prediction unit in a target detection model to be trained to obtain a target detection result corresponding to the sample image set; and calculating a loss value of the target detection model to be trained according to the target detection result corresponding to the sample image set, and updating the network parameters of the target detection model to be trained according to the loss value until a preset condition is met to obtain the trained target detection model.
The sample image set refers to the training samples used to train the target detection model and may include a plurality of sample images, where a plurality means two or more. The sample image set may be selected according to the application scenario; for a vehicle detection scenario, for example, the sample images may include vehicles, pedestrians, and the like. In one embodiment, the sample image set may be stored in any of several locations, such as a database or the terminal, so that when the target detection model is trained, the corresponding sample image set is obtained from the database or the terminal.
The target detection model performs target detection in the same manner during training and during application. The target detection model to be trained includes a preprocessing unit, a feature extraction unit and a prediction unit. The preprocessing unit in the target detection model to be trained is used to extract the second feature map corresponding to each sample image in the sample image set; the second feature map refers to the feature maps output by the last two convolutional layers of the convolutional neural network when the convolutional neural network in the preprocessing unit performs feature extraction on each sample image. The preprocessing unit performs attention-based pooling on the second feature map to obtain the second low-dimensional feature map corresponding to the second feature map, where the second low-dimensional feature map is the low-dimensional feature representation corresponding to the second feature map. The second low-dimensional feature map is taken as the input of the feature extraction unit, and feature extraction is performed to obtain the target capsule information corresponding to the sample image set; this target capsule information includes the local feature information of the target objects in the sample image set. Target detection is then performed on the target capsule information by the prediction unit in the target detection model to be trained. Specifically, the prediction unit may process the target capsule information through both the capsule transmission manner and the fully connected manner: the capsule transmission manner realizes bottom-up inference using the local feature information of the target objects, and the fully connected manner realizes top-down information transmission using their overall information. Combining the two manners makes full use of the local feature information and the overall information of the target objects and effectively improves the accuracy of the target detection result.
Further, when the prediction unit in the target detection model to be trained performs target detection in the capsule transmission manner, it may perform target detection on the target capsule information corresponding to the sample image set using the bottom-up attention routing algorithm to obtain the corresponding detection result. Specifically, the sample image set may include a number of categories, and the prediction unit may expand the target capsule information according to the number of categories using the bottom-up attention routing algorithm, so as to ensure that the dimension of the output capsules corresponds to the number of categories, that is, the detection result covers target objects of all categories. The expansion may add one dimension to the original dimensions of the target capsule information and copy the first four dimensions according to the number of categories. The number of capsules in the target capsule information is then taken as the number of heads of a multi-head attention mechanism, and along the dimension corresponding to the number of capsules, the multi-head attention mechanism calculates the correlation between each output capsule and the affine-transformed capsules of the target capsule information, thereby realizing information transmission between capsules and obtaining the corresponding detection result.
After the target detection result is obtained, the loss value of the target detection model to be trained can be calculated according to the target detection result corresponding to the sample image set. A loss value is a parameter for evaluating the prediction effect of a model: the smaller the loss value, the better the prediction effect. Correspondingly, the loss value of the target detection model to be trained is a parameter for evaluating the target detection effect of the network, and the smaller the loss value, the better the target detection effect.
The network parameters of the target detection model to be trained are updated according to the loss value and a preset network parameter update method to obtain an updated target detection model. After each update, it is judged whether the updated target detection model to be trained meets the preset condition. If so, model training is stopped and the updated model is taken as the trained target detection model. If not, the process returns to the step of inputting the sample image set into the target detection model to be trained, until the preset condition is met and the updated model is determined to be the trained target detection model. The preset network parameter update method may be any error-correction algorithm such as gradient descent or back propagation, for example the Adam (Adaptive Moment Estimation) algorithm. The preset condition may be that the network loss value reaches a loss threshold, or that the number of iterations reaches an iteration threshold, which is not limited here.
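A minimal sketch of this update loop, assuming PyTorch's Adam optimizer and a stopping condition on either the loss value or the iteration count; `model`, `detection_loss` and `loader` are hypothetical names standing for the target detection model to be trained, its loss, and the sample image loader:

```python
import torch

def train(model, detection_loss, loader, max_iters=10_000, loss_threshold=0.05):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    it = 0
    for images, labels in loader:
        logits, boxes = model(images)                 # target detection result
        loss = detection_loss(logits, boxes, labels)  # loss value of the model
        opt.zero_grad()
        loss.backward()                               # back propagation
        opt.step()                                    # update network parameters
        it += 1
        # preset condition: loss below threshold or iteration count reached
        if loss.item() < loss_threshold or it >= max_iters:
            break
    return model
```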
In this embodiment, the preprocessing unit of the target detection model to be trained extracts the second feature map corresponding to the sample image set and performs attention-based pooling on it to obtain the second low-dimensional feature map. Feature extraction is then performed on the second low-dimensional feature map to obtain the target capsule information, and target detection is performed on the target capsule information by the prediction unit to obtain the target detection result corresponding to the sample image set. Through the attention-based pooling of the second feature map, the computational complexity and memory consumption can be reduced while effective information is extracted, the training time is greatly shortened, and model convergence is accelerated. When the trained target detection model performs target detection on a smaller target object or a partially occluded target object, both the computational and memory complexity can be reduced.
In one embodiment, the sample image set is labeled with target label information. Calculating the loss value of the target detection model to be trained according to the target detection result corresponding to the sample image set includes: performing bipartite matching between the target detection result corresponding to the sample image set and the target label information to obtain a matching result; and calculating the loss value of the target detection model to be trained according to the matching result.
Target label information is marked in the sample image set, and the target label information comprises a category label of a target object and a frame corresponding to the target object in each sample image in the sample image set.
Specifically, the Hungarian algorithm may be adopted to perform bipartite matching between the target detection result corresponding to the sample image set and the target label information; the Hungarian algorithm yields a unique matching between the target detection result and the target label information, and combining the prediction unit with the Hungarian algorithm allows multiple target objects to be predicted in parallel. The loss value of the target detection model to be trained is then calculated according to the matching result. In one embodiment, the loss values of the target detection model to be trained include a target position offset loss value, a classification loss value and a matching loss value. The target position offset loss value is the loss of position fitting between the frame of a target object in the target detection result and the frame of the corresponding target object in the target label information, and is used to improve the accuracy of target object frame detection; it may be an IoU (Intersection over Union) loss. The classification loss value, i.e., the category loss value, may adopt the common cross-entropy loss to realize the multi-class classification of the target detection model and directly output the category of each target object. The matching loss value is used to realize the unique matching between the frame of a target object in the target detection result and the frame of the corresponding target object in the target label information; it is obtained by measuring the distance between matching results and is used to improve the matching accuracy between the target detection result and the target label information.
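A minimal sketch of the bipartite matching step, using SciPy's implementation of the Hungarian algorithm (scipy.optimize.linear_sum_assignment) on a cost matrix combining a classification term and a box-distance term; the cost weights and the L1 box distance are illustrative assumptions:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    w_cls=1.0, w_box=5.0):
    """Uniquely match predicted objects to ground-truth label information.
    pred_logits: [num_pred, num_classes+1], pred_boxes: [num_pred, 4]
    gt_labels:   [num_gt],                  gt_boxes:   [num_gt, 4]"""
    prob = pred_logits.softmax(-1)                     # class probabilities
    cost_cls = -prob[:, gt_labels]                     # [num_pred, num_gt]
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # L1 box distance
    cost = w_cls * cost_cls + w_box * cost_box
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx                            # one-to-one matching

logits = torch.randn(100, 81)  # 100 predicted capsules, 80 classes + no-object
boxes = torch.rand(100, 4)
match = hungarian_match(logits, boxes, torch.tensor([3, 17]), torch.rand(2, 4))
```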
It should be understood that although the steps in the flowcharts of fig. 2 to 5 are shown in the order indicated by the arrows, they are not necessarily performed strictly in that order. Unless explicitly stated otherwise, the steps are not limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2 to 5 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided an object detection apparatus including: an image acquisition module 602, a pre-processing module 604, a feature extraction module 606, and a target detection module 608, wherein:
an image obtaining module 602, configured to obtain an image to be detected.
The preprocessing module 604 is configured to input the image to be detected into a trained target detection model, where the target detection model includes a preprocessing unit, a feature extraction unit, and a prediction unit; and to extract, through the preprocessing unit, a first feature map corresponding to the image to be detected and perform attention-based pooling on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map.
The feature extraction module 606 is configured to perform feature extraction on the first low-dimensional feature map through the feature extraction unit to obtain target capsule information corresponding to the image to be detected.
And the target detection module 608 is configured to perform target detection on the target capsule information through the prediction unit to obtain a target detection result corresponding to the image to be detected.
In one embodiment, the preprocessing module 604 is further configured to perform feature extraction on the image to be detected through a convolutional neural network in the preprocessing unit, and determine the feature maps output by the last two convolutional layers of the convolutional neural network as the first feature map corresponding to the image to be detected.
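The patent does not name a specific convolutional neural network, so the sketch below assumes a torchvision ResNet-50 and, as a stand-in for the "last two convolutional layers", taps its last two convolutional stages:

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Tap the outputs of the last two convolutional stages as the first feature map.
backbone = create_feature_extractor(
    resnet50(weights=None), return_nodes={"layer3": "c4", "layer4": "c5"})

image = torch.randn(1, 3, 800, 800)   # placeholder image to be detected
feats = backbone(image)               # feats["c4"]: (1, 1024, 50, 50); feats["c5"]: (1, 2048, 25, 25)
```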
In one embodiment, the preprocessing module 604 is further configured to perform multi-head attention calculation on the first feature map to obtain a multi-head attention value corresponding to the first feature map, and perform normalization processing on the multi-head attention value to obtain the first low-dimensional feature map corresponding to the first feature map.
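A minimal sketch of this attention-based pooling, assuming the multi-head attention values are gathered by a small set of learned query tokens and the normalization is a layer normalization; the token count and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim=256, heads=8, num_tokens=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fmap):                      # fmap: (B, C, H, W)
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, H*W, C) flattened map
        q = self.queries.expand(fmap.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # multi-head attention values
        return self.norm(pooled)                  # normalized low-dimensional feature map
```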
In one embodiment, the feature extraction unit includes an encoding unit and a decoding unit, and the feature extraction module 606 is further configured to perform global feature extraction on the first low-dimensional feature map through the encoding unit to obtain global feature information, perform capsule conversion on the global feature information to obtain initial capsule information, input the initial capsule information into the decoding unit, extract the category features of the initial capsule information to obtain category feature information, and perform capsule conversion on the category feature information to obtain the target capsule information.
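A rough sketch of this encoding/decoding flow, under the assumption that the encoding and decoding units are transformer stacks and that "capsule conversion" is a linear projection followed by the standard capsule squash nonlinearity; all sizes are illustrative:

```python
import torch
import torch.nn as nn

def squash(s, eps=1e-8):
    # Classic capsule nonlinearity: preserves direction, maps length into [0, 1).
    n2 = (s * s).sum(-1, keepdim=True)
    return (n2 / (1.0 + n2)) * s / (n2.sqrt() + eps)

class CapsuleFeatureExtractor(nn.Module):
    def __init__(self, dim=256, num_caps=100, layers=3):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        dec = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)     # encoding unit
        self.decoder = nn.TransformerDecoder(dec, layers)     # decoding unit
        self.caps_queries = nn.Parameter(torch.randn(num_caps, dim))
        self.to_caps = nn.Linear(dim, dim)                    # "capsule conversion"

    def forward(self, low_dim_feats):                  # (B, T, dim) low-dim feature map
        global_feats = self.encoder(low_dim_feats)            # global feature information
        initial_caps = squash(self.to_caps(global_feats))     # initial capsule information
        q = self.caps_queries.expand(low_dim_feats.size(0), -1, -1)
        class_feats = self.decoder(q, initial_caps)           # category feature information
        return squash(self.to_caps(class_feats))              # target capsule information
```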
In one embodiment, the target detection module 608 is further configured to perform target detection on the target capsule information through the prediction unit based on attention routing to obtain a first detection result, perform a linear transformation on the target capsule information through the prediction unit to obtain a second detection result, and fuse the first detection result and the second detection result to obtain the target detection result corresponding to the image to be detected.
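A hedged sketch of the prediction unit: self-attention stands in for the attention routing in the first branch, a plain linear head forms the second branch, and the two results are fused by averaging; the fusion rule and head shapes are assumptions, not fixed by the patent:

```python
import torch
import torch.nn as nn

class PredictionUnit(nn.Module):
    def __init__(self, caps_dim=256, num_classes=80):
        super().__init__()
        self.route = nn.MultiheadAttention(caps_dim, 8, batch_first=True)
        self.head1 = nn.Linear(caps_dim, num_classes + 4)  # branch after routing
        self.head2 = nn.Linear(caps_dim, num_classes + 4)  # plain linear branch

    def forward(self, caps):                        # caps: (B, N, caps_dim)
        routed, _ = self.route(caps, caps, caps)    # attention-based routing
        first = self.head1(routed)                  # first detection result
        second = self.head2(caps)                   # second detection result
        fused = 0.5 * (first + second)              # fused target detection result
        cls_logits, boxes = fused[..., :-4], fused[..., -4:]
        return cls_logits, boxes.sigmoid()          # class scores, normalized boxes
```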
In one embodiment, the above apparatus further comprises:
and the sample acquisition module is used for acquiring the sample image set.
And the sample preprocessing module is used for inputting the sample image set into the target detection model to be trained, extracting a second feature map corresponding to the sample image set through a preprocessing unit in the target detection model to be trained, and performing attention-based pooling on the second feature map to obtain a second low-dimensional feature map corresponding to the second feature map.
And the sample characteristic extraction module is used for performing characteristic extraction on the second low-dimensional characteristic graph through a characteristic extraction unit in the target detection model to be trained to obtain target capsule information corresponding to the sample image set.
And the sample target detection module is used for carrying out target detection on the target capsule information corresponding to the sample image set through a prediction unit in the target detection model to be trained to obtain a target detection result corresponding to the sample image set.
And the parameter updating module is used for calculating a loss value of the target detection model to be trained according to the target detection result corresponding to the sample image set, and updating the network parameters of the target detection model to be trained according to the loss value until a preset condition is met to obtain the trained target detection model.
In one embodiment, the sample image set is labeled with target label information; the parameter updating module is further configured to perform bipartite matching between the target detection result corresponding to the sample image set and the target label information to obtain a matching result, and to calculate the loss value of the target detection model to be trained according to the matching result.
In one embodiment, the parameter updating module is further configured to calculate a target position offset loss value, a classification loss value, and a matching loss value of the target detection model to be trained.
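Tying the training-side modules above together, a compact sketch of the training procedure might look as follows; it reuses `detection_loss` from the matching sketch earlier, and the model interface, data loader, and fixed epoch count (standing in for the "preset condition") are assumptions:

```python
import torch

def train(model, loader, epochs=50, lr=1e-4, device="cuda"):
    model.to(device).train()
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(epochs):                 # preset condition: epoch cap
        for images, gt_labels, gt_boxes in loader:
            logits, boxes = model(images.to(device))   # forward the sample batch
            loss = sum(                                 # matching-based loss per image
                detection_loss(l, b, y.to(device), g.to(device))
                for l, b, y, g in zip(logits, boxes, gt_labels, gt_boxes))
            optim.zero_grad()
            loss.backward()                             # update network parameters
            optim.step()
    return model
```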
For specific limitations of the target detection apparatus, reference may be made to the limitations of the target detection method above, which are not repeated here. Each module in the target detection apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor in the computer device, or may be stored in software form in a memory in the computer device, so that the processor can invoke them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the nonvolatile storage medium. The database of the computer device is used for storing data of the target detection method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a target detection method.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of a portion of the structure related to the present solution and does not limit the computer devices to which the present solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the foregoing method embodiments when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the foregoing method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a nonvolatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include nonvolatile and/or volatile memory. Nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is specific and detailed, but should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (11)

1. A method of object detection, the method comprising:
acquiring an image to be detected;
inputting the image to be detected into a trained target detection model, wherein the target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit;
extracting a first feature map corresponding to the image to be detected through the preprocessing unit to obtain a first low-dimensional feature map corresponding to the first feature map;
performing feature extraction on the first low-dimensional feature map through the feature extraction unit to obtain target capsule information corresponding to the image to be detected;
and carrying out target detection on the target capsule information through the prediction unit to obtain a target detection result corresponding to the image to be detected.
2. The method according to claim 1, wherein the extracting, by the preprocessing unit, the first feature map corresponding to the image to be detected comprises:
performing feature extraction on the image to be detected through a convolutional neural network in the preprocessing unit, and determining the feature maps output by the last two convolutional layers of the convolutional neural network as the first feature map corresponding to the image to be detected.
3. The method according to claim 1, wherein the performing attention-based pooling processing on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map comprises:
performing multi-head attention calculation on the first feature map to obtain a multi-head attention value corresponding to the first feature map;
and carrying out normalization processing on the multi-head attention value to obtain a first low-dimensional feature map corresponding to the first feature map.
4. The method of claim 1, wherein the feature extraction unit comprises an encoding unit and a decoding unit; the feature extraction of the first low-dimensional feature map by the feature extraction unit to obtain the target capsule information corresponding to the image to be detected comprises the following steps:
global feature extraction is carried out on the first low-dimensional feature map through the coding unit to obtain global feature information, and capsule conversion is carried out on the global feature information to obtain initial capsule information;
inputting the initial capsule information into the decoding unit, extracting the category characteristics of the initial capsule information to obtain category characteristic information, and performing capsule conversion on the category characteristic information to obtain target capsule information.
5. The method according to claim 1, wherein the performing, by the prediction unit, the target detection on the target capsule information to obtain a target detection result corresponding to the image to be detected comprises:
performing target detection on the target capsule information through the prediction unit based on attention routing to obtain a first detection result;
performing linear transformation on the target capsule information through the prediction unit to obtain a second detection result;
and fusing the first detection result and the second detection result to obtain a target detection result corresponding to the image to be detected.
6. The method according to claim 1, wherein prior to said acquiring an image to be detected, said method further comprises:
acquiring a sample image set;
inputting the sample image set into a target detection model to be trained, extracting a second feature map corresponding to the sample image set through a preprocessing unit in the target detection model to be trained, and performing attention-based pooling processing on the second feature map to obtain a second low-dimensional feature map corresponding to the second feature map;
performing feature extraction on the second low-dimensional feature map through a feature extraction unit in the target detection model to be trained to obtain target capsule information corresponding to the sample image set;
performing target detection on target capsule information corresponding to the sample image set through a prediction unit in the target detection model to be trained to obtain a target detection result corresponding to the sample image set;
calculating a loss value of the target detection model to be trained according to a target detection result corresponding to the sample image set, and updating network parameters of the target detection model to be trained according to the loss value until a preset condition is met to obtain the trained target detection model.
7. The method of claim 6, wherein the sample image set is labeled with target label information; the calculating a loss value of the target detection model to be trained according to the target detection result corresponding to the sample image set includes:
performing bipartite matching between the target detection result corresponding to the sample image set and the target label information to obtain a matching result;
and calculating the loss value of the target detection model to be trained according to the matching result.
8. The method of claim 7, wherein the loss values of the target detection model to be trained comprise a target position offset loss value, a classification loss value, and a matching loss value.
9. An object detection apparatus, characterized in that the apparatus comprises:
the image acquisition module is used for acquiring an image to be detected;
the image detection device comprises a preprocessing module, a feature extraction module and a prediction unit, wherein the preprocessing module is used for inputting the image to be detected into a trained target detection model, and the target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit; extracting a first feature map corresponding to the image to be detected through the preprocessing unit to obtain a first low-dimensional feature map corresponding to the first feature map;
the feature extraction module is used for performing feature extraction on the first low-dimensional feature map through the feature extraction unit to obtain target capsule information corresponding to the image to be detected;
and the target detection module is used for carrying out target detection on the target capsule information through the prediction unit to obtain a target detection result corresponding to the image to be detected.
10. A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202110387750.9A 2021-04-12 2021-04-12 Target detection method, target detection device, computer equipment and storage medium Pending CN113449586A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110387750.9A CN113449586A (en) 2021-04-12 2021-04-12 Target detection method, target detection device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113449586A true CN113449586A (en) 2021-09-28

Family

ID=77809467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110387750.9A Pending CN113449586A (en) 2021-04-12 2021-04-12 Target detection method, target detection device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113449586A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109410917A (en) * 2018-09-26 2019-03-01 河海大学常州校区 Voice data classification method based on modified capsule network
CN109344920A (en) * 2018-12-14 2019-02-15 汇纳科技股份有限公司 Customer attributes prediction technique, storage medium, system and equipment
CN110782008A (en) * 2019-10-16 2020-02-11 北京百分点信息科技有限公司 Training method, prediction method and device of deep learning model
CN111737401A (en) * 2020-06-22 2020-10-02 首都师范大学 Key phrase prediction method based on Seq2set2Seq framework
CN111897957A (en) * 2020-07-15 2020-11-06 四川大学 Capsule neural network integrating multi-scale feature attention and text classification method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963148A (en) * 2021-10-29 2022-01-21 北京百度网讯科技有限公司 Object detection method, and training method and device of object detection model
CN113963148B (en) * 2021-10-29 2023-08-08 北京百度网讯科技有限公司 Object detection method, object detection model training method and device
CN114067125A (en) * 2021-11-16 2022-02-18 杭州欣禾圣世科技有限公司 Target detection method, system and device based on full-inference neural network
CN117315651A (en) * 2023-09-13 2023-12-29 深圳市大数据研究院 Affine-consistency-transporter-based multi-class cell detection classification method and affine-consistency-transporter-based multi-class cell detection classification device
CN117315651B (en) * 2023-09-13 2024-06-14 深圳市大数据研究院 Affine-consistency-transporter-based multi-class cell detection classification method and affine-consistency-transporter-based multi-class cell detection classification device
CN117475241A (en) * 2023-12-27 2024-01-30 山西省水利建筑工程局集团有限公司 Geological mutation detection system and method for tunnel excavation of cantilever type heading machine

Similar Documents

Publication Publication Date Title
CN111860670B (en) Domain adaptive model training method, image detection method, device, equipment and medium
CN111950329B (en) Target detection and model training method, device, computer equipment and storage medium
CN110889325B (en) Multitasking facial motion recognition model training and multitasking facial motion recognition method
CN113449586A (en) Target detection method, target detection device, computer equipment and storage medium
CN108446585B (en) Target tracking method and device, computer equipment and storage medium
CN111738231B (en) Target object detection method and device, computer equipment and storage medium
CN110674712A (en) Interactive behavior recognition method and device, computer equipment and storage medium
CN112288770A (en) Video real-time multi-target detection and tracking method and device based on deep learning
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
CN109977912B (en) Video human body key point detection method and device, computer equipment and storage medium
CN110176024B (en) Method, device, equipment and storage medium for detecting target in video
CN110046577B (en) Pedestrian attribute prediction method, device, computer equipment and storage medium
CN111191533A (en) Pedestrian re-identification processing method and device, computer equipment and storage medium
CN110930434A (en) Target object tracking method and device, storage medium and computer equipment
CN114359787A (en) Target attribute identification method and device, computer equipment and storage medium
CN111738074B (en) Pedestrian attribute identification method, system and device based on weak supervision learning
CN112348116A (en) Target detection method and device using spatial context and computer equipment
CN111860582B (en) Image classification model construction method and device, computer equipment and storage medium
CN111259838B (en) Method and system for deeply understanding human body behaviors in service robot service environment
CN113706481A (en) Sperm quality detection method, sperm quality detection device, computer equipment and storage medium
CN112115860A (en) Face key point positioning method and device, computer equipment and storage medium
CN110866428A (en) Target tracking method and device, electronic equipment and storage medium
CN113537020B (en) Complex SAR image target identification method based on improved neural network
CN110222752B (en) Image processing method, system, computer device, storage medium and chip
CN112990107B (en) Hyperspectral remote sensing image underwater target detection method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination