CN115205547A - Target image detection method and device, electronic equipment and storage medium


Info

Publication number
CN115205547A
CN115205547A
Authority
CN
China
Prior art keywords
feature map
target
feature
deep
shallow
Prior art date
Legal status
Pending
Application number
CN202210916890.5A
Other languages
Chinese (zh)
Inventor
王学彬
王秋明
Current Assignee
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yuanjian Information Technology Co Ltd
Priority to CN202210916890.5A
Publication of CN115205547A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection


Abstract

The application provides a target image detection method and apparatus, an electronic device, and a storage medium, applied to the technical field of image detection. The detection method comprises the following steps: acquiring a target image to be detected; inputting the target image to be detected into a backbone network of a pre-trained target detection model, performing feature extraction on the target image to be detected, and determining a shallow feature map and a deep feature map of the target image to be detected; inputting the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, weighting the shallow feature map and the deep feature map with different weights respectively, and determining a fusion feature map; and inputting the fusion feature map into a head network of the target detection model, and determining the target object in the target image to be detected. Because the shallow feature map and the deep feature map are weighted with different weights in the feature pyramid network to obtain the fusion feature map, the detection efficiency and accuracy for the target image are improved.

Description

Target image detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image detection technologies, and in particular, to a method and an apparatus for detecting a target image, an electronic device, and a storage medium.
Background
In the image auditing process, auditors facing a huge daily data volume usually rely on manual review, which is time-consuming and labor-intensive; alternatively, a deep learning neural network is used to audit images. Such a network comprises a backbone network, a neck network, and a head network, where the neck network serves as the bridge between the other two, so a good neck network can transfer features more effectively.
At present, the neck network in image detection methods generally adopts a Feature Pyramid Network (FPN for short), whose main function is to fuse shallow local features and deep semantic features. However, in the process of feature fusion, the importance of different features often differs, so the fused features are not optimal, which affects the head network's detection result on the image. How to improve the accuracy of image detection has therefore become a pressing technical problem.
Disclosure of Invention
In view of the above, an object of the present application is to provide a target image detection method and apparatus, an electronic device, and a storage medium, in which the shallow feature map and the deep feature map of a target image are input into a feature pyramid network and weighted there with different weights to obtain a fusion feature map, so as to improve the detection efficiency and accuracy for the target image.
The embodiment of the application provides a target image detection method, which comprises the following steps:
acquiring a target image to be detected;
inputting the target image to be detected to a backbone network of a pre-trained target detection model, and performing feature extraction on the target image to be detected to determine a shallow feature map and a deep feature map of the target image to be detected;
inputting the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, and performing weighting processing on the shallow feature map and the deep feature map with different weights respectively to determine a fusion feature map;
and inputting the fusion feature map into a head network of the target detection model, and determining the target object in the target image to be detected.
In a possible implementation manner, the inputting the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, and performing weighting processing on the shallow feature map and the deep feature map with different weights to determine a fused feature map includes:
inputting the shallow layer feature map and the deep layer feature map into the feature pyramid network for feature splicing to determine a spliced feature map;
inputting the spliced feature map into a convolution layer of the feature pyramid network, and performing convolution processing on the spliced feature map to determine a first target feature map;
inputting the first target feature map into a pooling layer of the feature pyramid network, and performing global average pooling on the first target feature map to determine a second target feature map;
inputting the second target feature map into an activation layer of the feature pyramid network, and performing activation processing on the second target feature map to determine a first target weight and a second target weight;
and inputting the first target weight and the second target weight into a fusion layer of the feature pyramid network, and performing weighting processing on the shallow feature map by using the first target weight and performing weighting processing on the deep feature map by using the second target weight to determine the fusion feature map.
In a possible implementation manner, the inputting the stitched feature map into a convolution layer of the feature pyramid network, performing convolution processing on the stitched feature map, and determining a first target feature map includes:
acquiring an initialized two-dimensional matrix, wherein the dimensionality of the initialized two-dimensional matrix is determined according to the number of convolution kernels and the number of feature vectors of the spliced feature map in different dimensions;
determining target numerical values of the feature vectors of the spliced feature maps under different dimensions based on feature information corresponding to the feature vectors of the spliced feature maps under different dimensions;
adding target values of the feature vectors of the spliced feature map under different dimensions to the initialized two-dimensional matrix to generate a target two-dimensional matrix;
and each convolution kernel screening out the corresponding feature vectors of the spliced feature map in different dimensions according to the target values of the feature vectors of the spliced feature map in different dimensions in the target two-dimensional matrix, and performing convolution processing to determine the first target feature map.
In a possible implementation manner, the inputting the second target feature map into an activation layer of the feature pyramid network, performing activation processing on the second target feature map, and determining a first target weight and a second target weight includes:
normalizing the second target feature map by using a sigmoid activation function to determine a weight vector;
and dividing based on the dimensionality of the weight vector to determine the first target weight and the second target weight.
In a possible implementation manner, the inputting the first target weight and the second target weight to a fusion layer of the feature pyramid network, and performing weighting processing on the shallow feature map by using the first target weight and the deep feature map by using the second target weight to determine the fusion feature map includes:
weighting the shallow feature map by using the first target weight to determine a weighted shallow feature map;
weighting the deep feature map by using the second target weight to determine a weighted deep feature map;
and performing feature addition on the weighted shallow feature map and the weighted deep feature map to determine the fusion feature map.
In one possible embodiment, the target detection model is trained by:
the method comprises the steps of obtaining a plurality of sample pictures and label information corresponding to each sample picture, and dividing the sample pictures into a training set and a verification set;
training an initial training model for multiple times by using the training set and label information corresponding to the sample pictures corresponding to the training set, and determining the target detection model;
and testing the target detection model by using the verification set to verify the detection accuracy of the target detection model.
The embodiment of the present application further provides a device for detecting a target image, where the device includes:
the acquisition module is used for acquiring a target image to be detected;
the characteristic extraction module is used for inputting the target image to be detected to a backbone network of a pre-trained target detection model, extracting the characteristics of the target image to be detected and determining a shallow characteristic diagram and a deep characteristic diagram of the target image to be detected;
the feature fusion module is used for inputting the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, respectively weighting the shallow feature map and the deep feature map by different weights, and determining a fusion feature map;
and the determining module is used for inputting the fusion feature map into a head network of the target detection model and determining a target object in the target image to be detected.
In a possible implementation manner, when the feature fusion module is configured to input the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, perform weighting processing on the shallow feature map and the deep feature map with different weights, and determine a fusion feature map, the feature fusion module is specifically configured to:
inputting the shallow feature map and the deep feature map into the feature pyramid network for feature splicing to determine a spliced feature map;
inputting the spliced feature map into a convolution layer of the feature pyramid network, and performing convolution processing on the spliced feature map to determine a first target feature map;
inputting the first target feature map into a pooling layer of the feature pyramid network, and performing global average pooling on the first target feature map to determine a second target feature map;
inputting the second target feature map into an activation layer of the feature pyramid network, and performing activation processing on the second target feature map to determine a first target weight and a second target weight;
and inputting the first target weight and the second target weight into a fusion layer of the feature pyramid network, and performing weighting processing on the shallow feature map by using the first target weight and the deep feature map by using the second target weight to determine the fusion feature map.
An embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions being executed by the processor to perform the steps of the method of detecting a target image as described above.
The embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program performs the steps of the method for detecting a target image as described above.
The embodiment of the application provides a target image detection method and device, an electronic device, and a storage medium, wherein the detection method comprises the following steps: acquiring a target image to be detected; inputting the target image to be detected into a backbone network of a pre-trained target detection model, performing feature extraction on the target image to be detected, and determining a shallow feature map and a deep feature map of the target image to be detected; inputting the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, weighting the shallow feature map and the deep feature map with different weights respectively, and determining a fusion feature map; and inputting the fusion feature map into a head network of the target detection model, and determining the target object in the target image to be detected. The shallow feature map and the deep feature map of the target image are input into the feature pyramid network and weighted there with different weights to obtain a fusion feature map, which improves the detection efficiency and accuracy for the target image.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a flowchart of a method for detecting a target image according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram illustrating a convolution kernel performing convolution in a detection method for a target image according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating a process of determining a fusion feature map in a method for detecting a target image according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an apparatus for detecting a target image according to an embodiment of the present disclosure;
fig. 5 is a second schematic structural diagram of an apparatus for detecting a target image according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Further, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and that steps without logical context may be performed in reverse order or concurrently. In addition, one skilled in the art, under the guidance of the present disclosure, may add one or more other operations to the flowchart, or may remove one or more operations from the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
To enable those skilled in the art to use the present disclosure in conjunction with a particular application scenario "detecting images," the following embodiments are presented to enable those skilled in the art to apply the general principles defined herein to other embodiments and application scenarios without departing from the spirit and scope of the present disclosure.
The following method, apparatus, electronic device or computer-readable storage medium in the embodiments of the present application may be applied to any scene that needs to detect an image, and the embodiments of the present application do not limit a specific application scene.
First, an application scenario to which the present application is applicable will be described. The method and the device can be applied to the technical field of image detection.
Research shows that in current image detection methods, the neck network generally adopts a Feature Pyramid Network (FPN), whose main function is to fuse shallow local features and deep semantic features. However, in the process of feature fusion, the importance of different features often differs, so the fused features are not optimal, which affects the head network's detection result on the image; how to improve the accuracy of image detection has thus become a pressing technical problem.
Based on this, an embodiment of the present application provides a method for detecting a target image, where a shallow feature map and a deep feature map of the target image are input into a feature pyramid network, and weighting processing with different weights is performed on the shallow feature map and the deep feature map in the feature pyramid network to obtain a fused feature map, so as to improve detection efficiency and accuracy of the target image.
Referring to fig. 1, fig. 1 is a flowchart of a method for detecting a target image according to an embodiment of the present disclosure. As shown in fig. 1, a detection method provided in an embodiment of the present application includes:
s101: and acquiring a target image to be detected.
In the step, a target image to be detected is obtained. The target image to be detected may be an image that an auditor needs to review for contraband, or any other image that needs to be inspected and detected.
S102: inputting the target image to be detected to a backbone network of a pre-trained target detection model, performing feature extraction on the target image to be detected, and determining a shallow feature map and a deep feature map of the target image to be detected.
In the step, a target image to be detected is input into a backbone network of a pre-trained target detection model, and the backbone network performs feature extraction on the target image to be detected to determine a shallow feature map and a deep feature map of the target image to be detected.
Here, the shallow feature map is closer to the input target image to be detected and contains more pixel-level, fine-grained information, such as the color, texture, and edge information of the target image to be detected. The receptive field of the shallow feature map is small, and the overlap between receptive fields is also small, which ensures that the network captures more details.
Here, the deep feature map contains coarse-grained information, which is more abstract, i.e., semantic information. The receptive fields of the deep feature map are larger, and as the image information in the overlapping regions between receptive fields is compressed, information about the target image to be detected as a whole is acquired.
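For illustration only, the following is a minimal sketch (in PyTorch, which this application does not prescribe) of how a shallow feature map and a deep feature map might be tapped from intermediate stages of a backbone network; the choice of a torchvision ResNet-50 and of the tap points are assumptions, not details fixed by this embodiment.

```python
import torch
import torchvision

# Sketch: tap one early (shallow) and one late (deep) stage of a standard
# backbone. The backbone choice and tap points are assumptions for illustration.
backbone = torchvision.models.resnet50(weights=None)

def extract_feature_maps(image: torch.Tensor):
    """image: (N, 3, H, W) -> shallow (N, 256, H/4, W/4), deep (N, 2048, H/32, W/32)."""
    x = backbone.conv1(image)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    shallow = backbone.layer1(x)   # early stage: colors, textures, edges
    x = backbone.layer2(shallow)
    x = backbone.layer3(x)
    deep = backbone.layer4(x)      # late stage: abstract semantic information
    return shallow, deep
```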
In one possible embodiment, the target detection model is trained by:
a: the method comprises the steps of obtaining a plurality of sample pictures and label information corresponding to each sample picture, and dividing the sample pictures into a training set and a verification set.
Here, a plurality of sample pictures are collected from the network, each sample picture corresponding to its label information, and the plurality of sample pictures are divided into a training set and a verification set (e.g., at a ratio of 9:1).
B: and training an initial training model for multiple times by using the training set and the label information corresponding to the sample pictures corresponding to the training set, and determining the target detection model.
First, an initial training model is built; the initial training model is a neural network model. The initial training model is trained for multiple times using the training set and the label information corresponding to the sample pictures in the training set, to obtain the target detection model.
C: and testing the target detection model by using the verification set to verify the detection accuracy of the target detection model.
After the training is completed, the target detection model is tested with the verification set to verify its performance. If the performance of the target detection model is high, the model can be used to detect target images; if the performance is low, the model needs to be trained further.
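By way of illustration, a minimal training sketch covering the above steps A to C is given below (in PyTorch; the optimizer, learning rate, epoch count, and the helper methods compute_loss and predict are assumptions introduced here, not part of this application):

```python
import random
import torch

def train_target_detector(samples, labels, model, epochs=50, split=0.9):
    """Sketch of steps A-C: split the samples, train repeatedly, then validate."""
    indices = list(range(len(samples)))
    random.shuffle(indices)
    cut = int(len(indices) * split)                 # step A: e.g. 9:1 split
    train_idx, val_idx = indices[:cut], indices[cut:]

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    for _ in range(epochs):                         # step B: multiple training passes
        for i in train_idx:
            loss = model.compute_loss(samples[i], labels[i])  # hypothetical helper
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    correct = sum(model.predict(samples[i]) == labels[i] for i in val_idx)
    return correct / max(len(val_idx), 1)           # step C: validation accuracy
```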
In a specific embodiment, a plurality of sample gun pictures are acquired and made into a detection picture data set, which is divided into a training set and a verification set. An initial training model is built and trained multiple times using the training set and the label information corresponding to the sample gun pictures in the training set to obtain a target detection model. After training is completed, the target detection model is tested with the verification set to verify its performance, so that the model can detect contraband guns in target images.
S103: inputting the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, and performing weighting processing on the shallow feature map and the deep feature map with different weights respectively to determine a fusion feature map.
In the step, the shallow feature map and the deep feature map obtained from the backbone network are input into the feature pyramid network, where the shallow feature map and the deep feature map are weighted with different weights and feature fusion is performed on them to determine a fusion feature map.
Here, the weights used for weighting the shallow feature map and the deep feature map are different.
In a possible implementation manner, the inputting the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, and performing weighting processing on the shallow feature map and the deep feature map with different weights to determine a fused feature map includes:
a: and inputting the shallow feature map and the deep feature map into the feature pyramid network for feature splicing to determine a spliced feature map.
And inputting the shallow feature map and the deep feature map into the feature pyramid network, and splicing the shallow feature map and the deep feature map to obtain a spliced feature map.
And if the dimensions of the shallow feature map and the deep feature map are both W×H×C, the dimension of the spliced feature map is W×H×2C.
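As an illustrative sketch of the splicing step under the stated dimensions (the example sizes and the assumption that both maps already share the same spatial resolution are ours, not fixed by this embodiment):

```python
import torch

# Sketch of step a: channel-wise splicing of two W x H x C maps into W x H x 2C.
shallow = torch.randn(1, 256, 64, 64)           # (N, C, H, W) with C = 256
deep = torch.randn(1, 256, 64, 64)              # assumed already at the same resolution
spliced = torch.cat([shallow, deep], dim=1)     # channel dimension: C + C = 2C
assert spliced.shape == (1, 512, 64, 64)
```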
b: and inputting the spliced characteristic diagram into a convolution layer of the characteristic pyramid network, and performing convolution processing on the spliced characteristic diagram to determine a first target characteristic diagram.
Here, the stitched feature map is input to the convolution layer of the feature pyramid network, and the stitched feature map is subjected to convolution processing, so that the first target feature map is determined.
And the first target feature map is a feature map obtained by convolving the spliced feature map, with dimensions consistent with those of the spliced feature map.
In a possible implementation manner, the inputting the stitched feature map into the convolution layer of the feature pyramid network, performing convolution processing on the stitched feature map, and determining a first target feature map includes:
(1): Acquiring an initialized two-dimensional matrix, wherein the dimensionality of the initialized two-dimensional matrix is determined according to the number of convolution kernels and the number of feature vectors of the spliced feature map in different dimensions.
If the dimension of the spliced feature map is W×H×2C, the feature vectors of the spliced feature map in different dimensions are the feature vectors corresponding to the spliced feature map in the dimensions W×H×1, W×H×2, …, W×H×2C.
And if the dimension of the spliced feature map is W×H×2C, the dimension of the initialized two-dimensional matrix is 2C×2C.
The rows of the initialized two-dimensional matrix represent the 2C convolution kernels, and the columns of the initialized two-dimensional matrix represent the feature vectors of the spliced feature map in the 2C different dimensions.
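A minimal sketch of such an initialized two-dimensional matrix follows (the Gaussian initialization is an assumption; this application fixes only the 2C×2C shape and that the values are learned):

```python
import torch
import torch.nn as nn

# Sketch of the initialized two-dimensional matrix: row i corresponds to the
# i-th of the 2C convolution kernels, column j to the feature vector of the
# spliced feature map in the j-th of its 2C dimensions. Values are learned;
# the small Gaussian initialization here is an assumption.
C = 256
M = nn.Parameter(torch.randn(2 * C, 2 * C) * 0.01)  # becomes the target matrix after training
selection = M >= 0                                   # True where kernel i convolves dimension j
```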
(2): and determining target numerical values of the feature vectors of the spliced feature maps under different dimensions based on feature information corresponding to the feature vectors of the spliced feature maps under different dimensions.
Here, the target values of the feature vectors of the spliced feature map in different dimensions are determined according to the importance of the feature information corresponding to the feature vectors of the spliced feature map in different dimensions.
In the training and learning process of the initialized two-dimensional matrix, according to the importance of the feature information corresponding to the feature vectors of the spliced feature map in different dimensions, the feature vectors in the dimensions deemed to require convolution are selected in the initialized two-dimensional matrix: the target value of a feature vector in a dimension that needs to be convolved is assigned to be greater than 0, and the target value of a feature vector in a dimension that does not need to be convolved is assigned to be less than 0.
The importance degree of the corresponding feature information of the feature vectors of the spliced feature maps under different dimensions is determined according to expert experience or network learning.
And if the importance of the feature information corresponding to a feature vector of the spliced feature map in a given dimension is high, the target value of that feature vector is assigned to be greater than 0; if the importance is low, the target value is assigned to be less than 0.
For example, suppose the spliced feature map comes from an image of a contraband knife. The feature information of that image differs across dimensions: if the feature vector of the image in a certain dimension mainly contains color and texture information, the importance of the feature information in that dimension is low; if the feature vector in a certain dimension contains the contour of the knife, the importance of the feature information in that dimension is high.
(3): and adding the target numerical value of the feature vector of the spliced feature map under different dimensions to the initialized two-dimensional matrix to generate a target two-dimensional matrix.
Here, target values of the feature vectors of the stitched feature map in different dimensions are added to the initialized two-dimensional matrix, and a target two-dimensional matrix is generated.
Each row of the target two-dimensional matrix holds the target values corresponding to the feature vectors of the spliced feature map in the 2C different dimensions, and each column of the target two-dimensional matrix runs over the 2C convolution kernels.
(4): Each convolution kernel screens out the corresponding feature vectors of the spliced feature map in different dimensions according to the target values of the feature vectors of the spliced feature map in different dimensions in the target two-dimensional matrix, and performs convolution processing to determine the first target feature map.
That is, each convolution kernel selects, according to the target two-dimensional matrix, the feature vectors of the spliced feature map in its corresponding dimensions and convolves them to determine the first target feature map.
Here, each convolution kernel only convolves the feature vectors of the spliced feature map in the dimensions determined for it; it does not need to convolve the feature vectors in every dimension.
And if M(i, j) ≥ 0, where M denotes the target two-dimensional matrix, the i-th convolution kernel convolves the feature vector of the spliced feature map in the j-th dimension; if M(i, j) < 0, the i-th convolution kernel does not convolve the feature vector of the spliced feature map in the j-th dimension.
After the feature vectors of the spliced feature maps under different dimensionalities corresponding to each convolution kernel are convolved, a plurality of convolution feature maps are obtained, and the plurality of convolution feature maps are spliced to obtain a first target feature map.
Here, please refer to fig. 2, which is a schematic diagram of a convolution kernel performing convolution in the target image detection method provided by the embodiment of the present application. As shown in fig. 2, a black dot represents a target value greater than 0 for the feature vector of the spliced feature map in a given dimension of the target two-dimensional matrix, and a white dot represents a target value less than 0. The first set of squares corresponds to the feature vectors of the spliced feature map in the different dimensions of the target two-dimensional matrix, and the second set of squares corresponds to the convolution kernels. For example, the first, third, fourth, and seventh black dots in the first row of the target two-dimensional matrix indicate that the first convolution kernel needs to convolve the feature vectors of the spliced feature map in the first, third, fourth, and seventh dimensions; the first, fourth, fifth, and eighth black dots in the second row indicate that the second convolution kernel needs to convolve the feature vectors of the spliced feature map in the first, fourth, fifth, and eighth dimensions; and so on. In this way, the technical problem that the FPN fuses different feature vectors with the same weight during feature fusion, so that the fused features are not optimal, is avoided: only the feature vectors of the spliced feature map in the dimensions corresponding to each convolution kernel are selected, and the feature vectors in all dimensions do not need to be convolved, so the first target feature map carries stronger feature semantics.
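The following is a minimal sketch of step b under the above selection rule (the 1×1 kernel size and the hard masking of the convolution weights are assumptions; this application fixes only that the i-th kernel convolves exactly the dimensions j with M(i, j) ≥ 0):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveConv(nn.Module):
    """Sketch of step b: a 1x1 convolution whose i-th kernel only sees the
    channels j with M(i, j) >= 0; kernel size and masking scheme are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        self.M = nn.Parameter(torch.randn(channels, channels) * 0.01)
        self.conv = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, spliced: torch.Tensor) -> torch.Tensor:
        mask = (self.M >= 0).float()                      # 1 where kernel i uses channel j
        weight = self.conv.weight * mask.view(*mask.shape, 1, 1)
        return F.conv2d(spliced, weight)                  # first target map, same dims as input
```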
c: and inputting the first target feature map into a pooling layer of the feature pyramid network, and performing global average pooling on the first target feature map to determine a second target feature map.
Here, the first target feature map is input to the pooling layer, and the second target feature map is determined by performing global average pooling on the first target feature map.
d: and inputting the second target feature map into an activation layer of the feature pyramid network, and performing activation processing on the second target feature map to determine a first target weight and a second target weight.
Here, the second target feature map is input to an activation layer of the feature pyramid network, and activation processing is performed on the second target feature map to determine a first target weight and a second target weight.
In a possible implementation manner, the inputting the second target feature map into an activation layer of the feature pyramid network, performing activation processing on the second target feature map, and determining a first target weight and a second target weight includes:
firstly, the following steps: and normalizing the second target feature map by using a sigmoid activation function to determine a weight vector.
Here, normalization processing is performed on the second target feature map by using a sigmoid activation function, and a weight vector is determined.
II, secondly: and dividing based on the dimension of the weight vector to determine the first target weight and the second target weight.
Here, the first target weight and the second target weight are determined by dividing along the dimension of the weight vector. For example, if the dimension of the weight vector is 1×1×2C, it is divided into a first target weight and a second target weight each of size 1×1×C, wherein the first target weight and the second target weight have the same dimensions but different weight contents.
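A minimal sketch of steps c and d (global average pooling followed by sigmoid activation and splitting) might read as follows; the tensor layout (N, 2C, H, W) is a PyTorch convention assumed here:

```python
import torch

# Sketch of steps c and d: pool the (N, 2C, H, W) first target feature map to
# (N, 2C, 1, 1), normalize with sigmoid, and split into two (N, C, 1, 1) weights.
def derive_target_weights(first_target: torch.Tensor):
    pooled = first_target.mean(dim=(2, 3), keepdim=True)   # second target feature map
    weights = torch.sigmoid(pooled)                        # weight vector in (0, 1)
    c = weights.shape[1] // 2
    return weights[:, :c], weights[:, c:]                  # first / second target weight
```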
e: and inputting the first target weight and the second target weight into a fusion layer of the feature pyramid network, and performing weighting processing on the shallow feature map and the deep feature map by using the first target weight to determine the fusion feature map.
Here, the first target weight and the second target weight are input to the fusion layer of the feature pyramid network, and the shallow feature map and the deep feature map are subjected to weighting processing to determine the fusion feature map.
In a possible implementation manner, the inputting the first target weight and the second target weight to a fusion layer of the feature pyramid network, and performing weighting processing on the shallow feature map by using the first target weight and the deep feature map by using the second target weight to determine the fusion feature map includes:
1): and carrying out weighting processing on the shallow feature map by using the first target weight to determine the weighted shallow feature map.
Here, the shallow feature map is weighted with the first target weight, and the weighted shallow feature map is determined.
2): and weighting the deep feature map by using the second target weight to determine the weighted deep feature map.
Here, the deep feature map is weighted by the second target weight, and the weighted deep feature map is determined.
3): and performing characteristic addition on the weighted shallow characteristic diagram and the weighted deep characteristic diagram to determine the fusion characteristic diagram.
Here, the weighted shallow feature map and the weighted deep feature map are subjected to feature addition to determine a fused feature map.
Alternatively, the deep feature map may be weighted with the first target weight to determine the weighted deep feature map, and the shallow feature map may be weighted with the second target weight to determine the weighted shallow feature map.
Further, please refer to fig. 3, which is a schematic diagram of the process of determining the fusion feature map in the target image detection method provided by the embodiment of the present application. As shown in fig. 3, the dimensions of the shallow feature map and the deep feature map are both W×H×C; the two are spliced to obtain a spliced feature map of dimension W×H×2C. Each convolution kernel screens out, according to the target two-dimensional matrix, the feature vectors of the spliced feature map in its corresponding dimensions and performs convolution processing to determine the first target feature map, whose dimension is W×H×2C. The first target feature map is pooled to determine the second target feature map of dimension 1×1×2C, which is activated to determine a first target weight of size 1×1×C and a second target weight of size 1×1×C. The shallow feature map is weighted with the first target weight to determine the weighted shallow feature map, the deep feature map is weighted with the second target weight to determine the weighted deep feature map, and the two weighted maps are added to determine the fusion feature map. Because the feature semantics of the shallow feature map and the deep feature map differ, weighting them with equal weights would make the semantic information of the fused feature map inaccurate. Weighting the shallow feature map with the determined first target weight and the deep feature map with the second target weight avoids the technical problem that the FPN fuses different feature vectors with the same weight during feature fusion, so that the fused features are not optimal, and improves the accuracy of the semantic information of the fusion feature map.
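Putting the pieces together, a minimal sketch of the fig. 3 pipeline might look as follows; it reuses the SelectiveConv sketch above, and every layer choice beyond the described steps is an assumption:

```python
import torch
import torch.nn as nn

class WeightedFusionBlock(nn.Module):
    """Sketch of the fig. 3 pipeline: splice -> selective convolution ->
    global average pooling -> sigmoid -> split weights -> weighted addition.
    Reuses SelectiveConv from the sketch above."""

    def __init__(self, channels: int):
        super().__init__()
        self.selective_conv = SelectiveConv(2 * channels)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        spliced = torch.cat([shallow, deep], dim=1)                    # (N, 2C, H, W)
        first_target = self.selective_conv(spliced)                   # (N, 2C, H, W)
        weights = torch.sigmoid(first_target.mean(dim=(2, 3), keepdim=True))
        c = weights.shape[1] // 2
        w1, w2 = weights[:, :c], weights[:, c:]                       # (N, C, 1, 1) each
        return w1 * shallow + w2 * deep                               # fusion feature map
```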
S104: and inputting the fusion characteristic graph into a head network of the target detection model, and determining a target object in the target image to be detected.
In the step, the fusion characteristic diagram is input to a head network of a target detection model, and a target object in a target image to be detected is determined.
Here, if the target image is an image of contraband that needs to be checked by the auditor, the identified target object may be a knife, a gun, or other contraband.
In a specific embodiment, a target image to be detected is acquired and input to the backbone network of the target detection model, which outputs a shallow feature map and a deep feature map; the shallow feature map and the deep feature map are input into the feature pyramid network of the target detection model, weighted with different weights respectively, and a fusion feature map is output. The feature pyramid network is additionally provided with a two-dimensional matrix so that feature fusion of the shallow feature map and the deep feature map is carried out with non-equal weights. The fusion feature map is input into the head network of the target detection model, which performs prediction on it and detects the contraband contained in the target image to be detected.
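For orientation, the end-to-end flow of this embodiment can be sketched as follows; model.backbone, model.pyramid, and model.head are hypothetical handles for the three sub-networks named in the text:

```python
import torch

# Sketch of the end-to-end flow; model.backbone, model.pyramid and model.head
# are hypothetical handles for the three sub-networks named in the text.
@torch.no_grad()
def detect(model, image: torch.Tensor):
    shallow, deep = model.backbone(image)   # S102: extract shallow and deep feature maps
    fused = model.pyramid(shallow, deep)    # S103: weighted fusion in the pyramid network
    return model.head(fused)                # S104: predict the target objects
```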
The detection method of the target image provided by the embodiment of the application comprises the following steps: acquiring a target image to be detected; inputting the target image to be detected into a backbone network of a pre-trained target detection model, performing feature extraction on the target image to be detected, and determining a shallow feature map and a deep feature map of the target image to be detected; inputting the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, weighting the shallow feature map and the deep feature map with different weights respectively, and determining a fusion feature map; and inputting the fusion feature map into a head network of the target detection model, and determining the target object in the target image to be detected. Because the shallow feature map and the deep feature map of the target image are weighted with different weights in the feature pyramid network to obtain the fusion feature map, the detection efficiency and accuracy for the target image are improved.
Referring to fig. 4 and 5, fig. 4 is a schematic structural diagram of a device for detecting a target image according to an embodiment of the present disclosure, and fig. 5 is a second schematic structural diagram of the device for detecting a target image according to an embodiment of the present disclosure. As shown in fig. 4, the detection apparatus 400 of the target image includes:
an obtaining module 410, configured to obtain a target image to be detected;
the feature extraction module 420 is configured to input the target image to be detected to a backbone network of a pre-trained target detection model, perform feature extraction on the target image to be detected, and determine a shallow feature map and a deep feature map of the target image to be detected;
the feature fusion module 430 is configured to input the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, perform weighting processing on the shallow feature map and the deep feature map with different weights, and determine a fusion feature map;
the determining module 440 is configured to input the fusion feature map to the head network of the target detection model, and determine a target object in the target image to be detected.
Further, when the feature fusion module 430 is configured to input the shallow feature map and the deep feature map into the feature pyramid network of the target detection model, perform weighting processing on the shallow feature map and the deep feature map with different weights, and determine a fusion feature map, the feature fusion module 430 is specifically configured to:
inputting the shallow feature map and the deep feature map into the feature pyramid network for feature splicing to determine a spliced feature map;
inputting the spliced feature map into a convolution layer of the feature pyramid network, and performing convolution processing on the spliced feature map to determine a first target feature map;
inputting the first target feature map into a pooling layer of the feature pyramid network, and performing global average pooling on the first target feature map to determine a second target feature map;
inputting the second target feature map into an activation layer of the feature pyramid network, and performing activation processing on the second target feature map to determine a first target weight and a second target weight;
and inputting the first target weight and the second target weight into a fusion layer of the feature pyramid network, and performing weighting processing on the shallow feature map by using the first target weight and performing weighting processing on the deep feature map by using the second target weight to determine the fusion feature map.
Further, when the feature fusion module 430 is configured to input the stitched feature map into the convolution layer of the feature pyramid network, perform convolution processing on the stitched feature map, and determine a first target feature map, the feature fusion module 430 is specifically configured to:
acquiring an initialized two-dimensional matrix, wherein the dimensionality of the initialized two-dimensional matrix is determined according to the number of convolution kernels and the number of feature vectors of the spliced feature map in different dimensions;
determining target values of the feature vectors of the spliced feature map in different dimensions based on the feature information corresponding to the feature vectors of the spliced feature map in different dimensions;
adding the target values of the feature vectors of the spliced feature map in different dimensions to the initialized two-dimensional matrix to generate a target two-dimensional matrix;
and each convolution kernel screening out the corresponding feature vectors of the spliced feature map in different dimensions according to the target values of the feature vectors of the spliced feature map in different dimensions in the target two-dimensional matrix, and performing convolution processing to determine the first target feature map.
Further, when the feature fusion module 430 is configured to input the second target feature map to an activation layer of the feature pyramid network, perform activation processing on the second target feature map, and determine a first target weight and a second target weight, the feature fusion module 430 is specifically configured to:
normalizing the second target feature map by using a sigmoid activation function to determine a weight vector;
and dividing based on the dimension of the weight vector to determine the first target weight and the second target weight.
Further, when the feature fusion module 430 is configured to input the first target weight and the second target weight to a fusion layer of the feature pyramid network, perform weighting processing on the shallow feature map by using the first target weight and perform weighting processing on the deep feature map by using the second target weight, and determine the fusion feature map, the feature fusion module 430 is specifically configured to:
weighting the shallow feature map by using the first target weight to determine a weighted shallow feature map;
weighting the deep characteristic map by using the second target weight to determine a weighted deep characteristic map;
and performing feature addition on the weighted shallow feature map and the weighted deep feature map to determine the fusion feature map.
Further, as shown in fig. 5, the apparatus 400 for detecting a target image further includes a model training module 450, wherein the model training module 450 trains the target detection model by:
the method comprises the steps of obtaining a plurality of sample pictures and label information corresponding to each sample picture, and dividing the sample pictures into a training set and a verification set;
training an initial training model for multiple times by using the training set and label information corresponding to the sample pictures corresponding to the training set, and determining the target detection model;
and testing the target detection model by using the verification set to verify the detection accuracy of the target detection model.
The embodiment of the application provides a target image detection device, the detection device comprising: an acquisition module, configured to acquire a target image to be detected; a feature extraction module, configured to input the target image to be detected into a backbone network of a pre-trained target detection model, perform feature extraction on the target image to be detected, and determine a shallow feature map and a deep feature map of the target image to be detected; a feature fusion module, configured to input the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, weight the shallow feature map and the deep feature map with different weights respectively, and determine a fusion feature map; and a determining module, configured to input the fusion feature map into a head network of the target detection model and determine the target object in the target image to be detected. Because the shallow feature map and the deep feature map of the target image are weighted with different weights in the feature pyramid network to obtain the fusion feature map, the detection efficiency and accuracy for the target image are improved.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 6, the electronic device 600 includes a processor 610, a memory 620, and a bus 630.
The memory 620 stores machine-readable instructions executable by the processor 610, when the electronic device 600 runs, the processor 610 communicates with the memory 620 through the bus 630, and when the machine-readable instructions are executed by the processor 610, the steps of the method for detecting a target image in the method embodiment shown in fig. 1 may be performed.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for detecting a target image in the method embodiment shown in fig. 1 may be executed.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some communication interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-transitory computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications or changes may still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions may be made for some of their features, within the technical scope disclosed in the present application; such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present application and shall all be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for detecting a target image, the method comprising:
acquiring a target image to be detected;
inputting the target image to be detected to a backbone network of a pre-trained target detection model, performing feature extraction on the target image to be detected, and determining a shallow feature map and a deep feature map of the target image to be detected;
inputting the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, and performing weighting processing on the shallow feature map and the deep feature map with different weights respectively to determine a fusion feature map;
and inputting the fusion feature map into a head network of the target detection model, and determining a target object in the target image to be detected.
2. The detection method according to claim 1, wherein the inputting the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, weighting the shallow feature map and the deep feature map with different weights respectively, and determining a fusion feature map comprises:
inputting the shallow feature map and the deep feature map into the feature pyramid network for feature splicing to determine a spliced feature map;
inputting the spliced feature map into a convolution layer of the feature pyramid network, and performing convolution processing on the spliced feature map to determine a first target feature map;
inputting the first target feature map into a pooling layer of the feature pyramid network, and performing global average pooling on the first target feature map to determine a second target feature map;
inputting the second target feature map into an activation layer of the feature pyramid network, and performing activation processing on the second target feature map to determine a first target weight and a second target weight;
and inputting the first target weight and the second target weight to a fusion layer of the feature pyramid network, and performing weighting processing on the shallow feature map by using the first target weight and performing weighting processing on the deep feature map by using the second target weight to determine the fusion feature map.
3. The detection method according to claim 2, wherein the inputting the spliced feature map into a convolution layer of the feature pyramid network, performing convolution processing on the spliced feature map, and determining a first target feature map comprises:
acquiring an initialized two-dimensional matrix, wherein the dimensions of the initialized two-dimensional matrix are determined according to the number of convolution kernels and the number of feature vectors of the spliced feature map in different dimensions;
determining target values of the feature vectors of the spliced feature map in different dimensions based on feature information corresponding to the feature vectors of the spliced feature map in different dimensions;
adding the target values of the feature vectors of the spliced feature map in different dimensions to the initialized two-dimensional matrix to generate a target two-dimensional matrix;
and screening out, by each convolution kernel, the corresponding feature vectors of the spliced feature map in different dimensions according to the target values of those feature vectors in the target two-dimensional matrix, and performing convolution processing to determine the first target feature map.
4. The detection method according to claim 2, wherein the inputting the second target feature map into an activation layer of the feature pyramid network, performing activation processing on the second target feature map, and determining a first target weight and a second target weight includes:
normalizing the second target feature map by using a sigmoid activation function to determine a weight vector;
and dividing the weight vector based on its dimensions to determine the first target weight and the second target weight.
5. The detection method according to claim 2, wherein the inputting the first target weight and the second target weight into a fusion layer of the feature pyramid network, and performing weighting processing on the shallow feature map by using the first target weight and the deep feature map by using the second target weight to determine the fusion feature map comprises:
weighting the shallow feature map by using the first target weight to determine a weighted shallow feature map;
weighting the deep feature map by using the second target weight to determine a weighted deep feature map;
and performing feature addition on the weighted shallow feature map and the weighted deep feature map to determine the fusion feature map.
6. The detection method according to claim 1, wherein the target detection model is trained by:
the method comprises the steps of obtaining a plurality of sample pictures and label information corresponding to each sample picture, and dividing the sample pictures into a training set and a verification set;
training an initial training model for multiple times by using the training set and label information corresponding to the sample pictures corresponding to the training set, and determining the target detection model;
and testing the target detection model by using the verification set, and verifying the detection accuracy of the target detection model.
7. A device for detecting a target image, the device comprising:
the acquisition module is used for acquiring a target image to be detected;
the feature extraction module is used for inputting the target image to be detected into a backbone network of a pre-trained target detection model, performing feature extraction on the target image to be detected, and determining a shallow feature map and a deep feature map of the target image to be detected;
the feature fusion module is used for inputting the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, respectively performing weighting processing on the shallow feature map and the deep feature map by different weights, and determining a fusion feature map;
and the determining module is used for inputting the fusion feature map into a head network of the target detection model and determining a target object in the target image to be detected.
8. The detection device according to claim 7, wherein when the feature fusion module is configured to input the shallow feature map and the deep feature map into a feature pyramid network of the target detection model, perform weighting processing on the shallow feature map and the deep feature map with different weights, and determine a fusion feature map, the feature fusion module is specifically configured to:
inputting the shallow feature map and the deep feature map into the feature pyramid network for feature splicing to determine a spliced feature map;
inputting the spliced feature map into a convolution layer of the feature pyramid network, and performing convolution processing on the spliced feature map to determine a first target feature map;
inputting the first target feature map into a pooling layer of the feature pyramid network, and performing global average pooling on the first target feature map to determine a second target feature map;
inputting the second target feature map into an activation layer of the feature pyramid network, and performing activation processing on the second target feature map to determine a first target weight and a second target weight;
and inputting the first target weight and the second target weight into a fusion layer of the feature pyramid network, and performing weighting processing on the shallow feature map by using the first target weight and the deep feature map by using the second target weight to determine the fusion feature map.
9. An electronic device, comprising: a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor communicates with the memory via the bus; and when the machine-readable instructions are executed by the processor, the steps of the method for detecting a target image according to any one of claims 1 to 6 are performed.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for detecting a target image according to any one of claims 1 to 6 are performed.
CN202210916890.5A 2022-08-01 2022-08-01 Target image detection method and device, electronic equipment and storage medium Pending CN115205547A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210916890.5A CN115205547A (en) 2022-08-01 2022-08-01 Target image detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115205547A true CN115205547A (en) 2022-10-18

Family

ID=83585999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210916890.5A Pending CN115205547A (en) 2022-08-01 2022-08-01 Target image detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115205547A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115564320A (en) * 2022-12-06 2023-01-03 成都智元汇信息技术股份有限公司 Multi-intelligent-algorithm-oriented scheduling management method, device and medium
CN115564320B (en) * 2022-12-06 2023-04-07 成都智元汇信息技术股份有限公司 Multi-intelligent-algorithm-oriented scheduling management method, device and medium
CN116704206A (en) * 2023-06-12 2023-09-05 中电金信软件有限公司 Image processing method, device, computer equipment and storage medium
CN116935179A (en) * 2023-09-14 2023-10-24 海信集团控股股份有限公司 Target detection method and device, electronic equipment and storage medium
CN116935179B (en) * 2023-09-14 2023-12-08 海信集团控股股份有限公司 Target detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination