CN117953202A - Method and device for detecting object in image, electronic equipment and readable storage medium


Info

Publication number: CN117953202A
Application number: CN202410128403.8A
Authority: CN (China)
Prior art keywords: image, detected, feature map, detection, fusion
Legal status: Pending (an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 师平
Current Assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd

Classifications

    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06V 10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V 10/764: Recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/766: Recognition using pattern recognition or machine learning, using regression, e.g. by projecting features on hyperplanes
    • G06V 10/806: Fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Recognition using pattern recognition or machine learning, using neural networks
    • G06V 20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06N 3/048: Activation functions
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to the technical field of artificial intelligence and provides a method and apparatus for detecting objects in an image, an electronic device, and a readable storage medium. The method includes: acquiring an image to be detected; inputting the image to be detected into an object detection model for feature extraction at different scales to obtain a first scale feature map, a second scale feature map, and a third scale feature map of the image to be detected; fusing the third scale feature map with the second scale feature map to obtain a second fused feature map; fusing the second fused feature map with the first scale feature map to obtain a third fused feature map; performing object prediction on the third scale feature map, the second fused feature map, and the third fused feature map, respectively, to obtain a first, a second, and a third detection result of the image to be detected; and determining the number of objects in the image to be detected from the three detection results. This solves the prior-art problem of low object detection accuracy when objects are dense, improves the accuracy and robustness of the object detection model, and broadens the range of scenarios in which the model can be applied.

Description

Method and device for detecting object in image, electronic equipment and readable storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular to a method and apparatus for detecting an object in an image, an electronic device, and a readable storage medium.
Background
In recent years, with the development of artificial intelligence, image recognition technology has been applied ever more widely. Object detection is an important branch of image recognition and is used in many scenarios, such as video surveillance and security. In people counting for video surveillance, traditional human body detection suffers from low detection accuracy and interrupted tracking when bodies are heavily occluded, for example by clothing or by other people. Existing people counting methods fall mainly into two categories: regression-based and detection-based. Regression-based methods train a regression model to predict a density map from the input image and sum the per-pixel density over the whole map to obtain the final count; however, such methods are closely tied to the resolution of the network input and do not account for the positions of people within the density map. Detection-based methods feed the image directly into a pre-trained object detection framework and classify the bounding boxes belonging to people to obtain the final count. However, because human bodies vary in height and build, and occlusion readily occurs in crowded scenes, the detection accuracy of these methods is low.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a readable storage medium for detecting objects in an image, so as to solve the prior-art problem of low object detection accuracy caused by dense objects.
In a first aspect, an embodiment of the present disclosure provides a method for detecting objects in an image, including: acquiring an image to be detected; inputting the image to be detected into an object detection model to perform feature extraction at different scales, obtaining a first scale feature map, a second scale feature map, and a third scale feature map of the image to be detected; performing feature fusion on the third scale feature map and the second scale feature map of the image to be detected to obtain a second fused feature map of the image to be detected; performing feature fusion on the second fused feature map and the first scale feature map of the image to be detected to obtain a third fused feature map of the image to be detected; performing object prediction on the third scale feature map, the second fused feature map, and the third fused feature map, respectively, to obtain a first, a second, and a third detection result of the image to be detected; and determining the number of objects in the image to be detected according to the first, second, and third detection results.
In a second aspect, an embodiment of the present disclosure provides an apparatus for detecting objects in an image, including: an acquisition module for acquiring an image to be detected; a feature extraction module for inputting the image to be detected into an object detection model to perform feature extraction at different scales, obtaining a first scale feature map, a second scale feature map, and a third scale feature map of the image to be detected; a first fusion module for performing feature fusion on the third scale feature map and the second scale feature map to obtain a second fused feature map of the image to be detected; a second fusion module for performing feature fusion on the second fused feature map and the first scale feature map to obtain a third fused feature map of the image to be detected; a prediction module for performing object prediction on the third scale feature map, the second fused feature map, and the third fused feature map, respectively, to obtain a first, a second, and a third detection result of the image to be detected; and a determining module for determining the number of objects in the image to be detected according to the three detection results.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, there is provided a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the embodiments of the present disclosure are beneficial in the following respects. The method is applied to a trained object detection model: features are extracted from the image to be detected at different scales, yielding first, second, and third scale feature maps of different sizes. Each scale feature map is a representation of the image at a different scale, and each scale captures different details and context. Feature fusion of the third scale feature map with the second scale feature map yields the second fused feature map, and feature fusion of the second fused feature map with the first scale feature map yields the third fused feature map; fusing features in this way integrates multi-scale information and strengthens the feature representation, improving detection accuracy. Object prediction is then performed separately on the third scale feature map, the second fused feature map, and the third fused feature map: the model predicts the objects that may be present in the image together with their positions and categories, producing the first, second, and third detection results, each of which may include detection boxes and corresponding confidence scores. The final number of objects in the image is determined from these three detection results. By extracting features at different scales and fusing the extraction results, the method strengthens the feature fusion of the object detection model, allowing it to attend more closely to smaller objects and to integrate information from different levels and scales effectively. This solves the prior-art problem of low detection accuracy caused by dense objects, improves the accuracy and robustness of the object detection model, and broadens its range of application.
Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required for the embodiments or for the description of the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present disclosure; a person of ordinary skill in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure;
Fig. 2 is a flowchart of a method for detecting an object in an image according to an embodiment of the present disclosure;
Fig. 3 is a flowchart of another method for detecting an object in an image according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of an apparatus for detecting an object in an image according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details such as particular system configurations and techniques are set forth in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
An object detection method and apparatus in an image according to embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure. The application scenario may include terminal devices 1, 2, and 3, a server 4, and a network 5.
The terminal devices 1, 2, and 3 may be hardware or software. When they are hardware, they may be various electronic devices that have a display screen and support communication with the server 4, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, and the like; when they are software, they may be installed in such electronic devices. The terminal devices 1, 2, and 3 may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which the embodiments of the present disclosure do not limit. Further, various applications may be installed on the terminal devices 1, 2, and 3, such as data processing applications, instant messaging tools, social platform software, search applications, shopping applications, and the like.
The server 4 may be a server that provides various services, for example a background server that receives requests transmitted by a terminal device with which a communication connection has been established; the background server may receive and analyze such a request and generate a processing result. The server 4 may be a single server, a server cluster formed by multiple servers, or a cloud computing service center, which the embodiments of the present disclosure do not limit.
The server 4 may be hardware or software. When it is hardware, it may be any of various electronic devices that provide services to the terminal devices 1, 2, and 3. When it is software, it may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, providing services to the terminal devices 1, 2, and 3, which the embodiments of the present disclosure do not limit.
The network 5 may be a wired network using coaxial cable, twisted pair, or optical fiber, or a wireless network that interconnects communication devices without wiring, for example Bluetooth, Near Field Communication (NFC), or infrared, which the embodiments of the present disclosure do not limit.
A user can establish a communication connection with the server 4 via the network 5 through the terminal devices 1, 2, and 3 to receive or transmit information. Specifically, the server 4 acquires an image to be detected; inputs it into an object detection model to perform feature extraction at different scales, obtaining a first scale feature map, a second scale feature map, and a third scale feature map of the image to be detected; performs feature fusion on the third scale feature map and the second scale feature map to obtain a second fused feature map; performs feature fusion on the second fused feature map and the first scale feature map to obtain a third fused feature map; performs object prediction on the third scale feature map, the second fused feature map, and the third fused feature map, respectively, to obtain a first, a second, and a third detection result of the image to be detected; and determines the number of objects in the image to be detected according to the three detection results. It should be noted that the specific types, numbers, and combinations of the terminal devices 1, 2, and 3, the server 4, and the network 5 may be adjusted according to the actual requirements of the application scenario, which the embodiments of the present disclosure do not limit.
Fig. 2 is a flowchart of a method for detecting an object in an image according to an embodiment of the present disclosure. The method of Fig. 2 may be performed by the server of Fig. 1. As shown in Fig. 2, the method includes the following steps.
In step 201, an image to be detected is acquired.
In some embodiments, the image to be detected may be an image or a video frame that requires object detection. The method for detecting objects in an image provided by the present disclosure may be applied in an object detection model. Because acquired images may vary in size, the image to be detected may be resized to the input size required by the model before being fed in, for example 640×480 pixels, which can improve the accuracy and the operating efficiency of the object detection model.
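For illustration, a minimal preprocessing sketch in PyTorch (the framework is an assumption of this example; the disclosure mandates none, and the 640×480 size and bilinear interpolation are likewise only illustrative):

```python
import torch
import torch.nn.functional as F

def preprocess(image: torch.Tensor, size=(480, 640)) -> torch.Tensor:
    """Resize an image tensor of shape (C, H, W) to the model input size.

    `size` is (height, width); 640x480 pixels corresponds to (480, 640).
    """
    x = image.unsqueeze(0).float()  # add a batch dimension -> (1, C, H, W)
    x = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
    return x / 255.0                # scale pixel values to [0, 1]
```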
Step 202, inputting the image to be detected into an object detection model to perform feature extraction at different scales, obtaining a first scale feature map of the image to be detected, a second scale feature map of the image to be detected, and a third scale feature map of the image to be detected.
In some embodiments, the present disclosure performs feature extraction at different scales on the image to be detected through the object detection model, fuses the extracted features, performs object prediction on the fusion results, and finally obtains the number of objects in the image. The object detection model includes a feature extraction network composed of several feature processing modules and several residual processing modules, where each feature processing module includes a convolution layer, a normalization layer, and an activation function. By stacking these modules, feature information in the image can be captured at different levels and scales; the output of each feature processing module or residual processing module can be regarded as a scale feature map that becomes progressively more abstract and more semantic as depth increases, yielding the first, second, and third scale feature maps of the image to be detected. Introducing residual processing modules mitigates the vanishing and exploding gradient problems of deep neural networks: residual connections pass feature information directly to later layers, so gradients propagate and update better, improving the stability of the object detection model. Processing the image through multiple feature processing modules and residual processing modules thus lets the model capture feature information at multiple levels and scales, producing the three scale feature maps of different sizes and supporting the subsequent detection task.
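As a concrete illustration, a minimal PyTorch sketch of one feature processing module (convolution, normalization, activation) and one residual processing module is given below; the kernel sizes, the choice of BatchNorm, and the 0.1 Leaky ReLU slope are assumptions of this sketch, not details fixed by the disclosure:

```python
import torch
import torch.nn as nn

class FeatureProcessing(nn.Module):
    """One feature processing unit: convolution -> normalization -> activation."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.norm = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class ResidualProcessing(nn.Module):
    """Residual processing module: convolutions plus a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            FeatureProcessing(channels, channels, k=1),
            FeatureProcessing(channels, channels, k=3),
        )

    def forward(self, x):
        # The residual connection passes features directly to the next layer,
        # which eases gradient propagation in a deep network.
        return x + self.body(x)
```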
Step 203, performing feature fusion on the third scale feature map of the image to be detected and the second scale feature map of the image to be detected to obtain a second fused feature map of the image to be detected.
In some embodiments, the third scale feature map and the second scale feature map of the image to be detected correspond to different depths of feature extraction and contain different feature information; the third scale feature map carries higher-level feature information than the second scale feature map. The third scale feature map can be adjusted so that its size and channel count match those of the second scale feature map, and the adjusted map is then concatenated with the second scale feature map. This integrates feature information from different levels and strengthens the feature representation, producing a second fused feature map that contains the information of both maps, with richer semantics and a higher level of abstraction. It helps the model better understand the content of the image, capture more detail and context, and improve the accuracy and robustness of object detection in difficult scenes such as occlusion and small object scale, yielding a better detection result.
Step 204, performing feature fusion on the second fused feature map of the image to be detected and the first scale feature map of the image to be detected to obtain a third fused feature map of the image to be detected.
In some embodiments, the second fused feature map of the image to be detected carries higher-level feature information than the first scale feature map. The second fused feature map can be adjusted so that its size and channel count match those of the first scale feature map, and the adjusted map is then concatenated with the first scale feature map. This integrates feature information across scales and further strengthens the feature representation, producing a third fused feature map that contains the information of both the first scale feature map and the second fused feature map. Its richer semantics and higher level of abstraction help the model better understand the image, capture more detail and context, and improve detection accuracy and robustness in difficult scenes such as occlusion and small object scale.
Step 205, performing object prediction on the third scale feature map of the image to be detected, the second fused feature map of the image to be detected, and the third fused feature map of the image to be detected, respectively, to obtain a first detection result of the image to be detected, a second detection result of the image to be detected, and a third detection result of the image to be detected.
In some embodiments, the third scale feature map of the image to be detected is predicted to obtain the first detection result of the image to be detected. The first detection result is a series of detection boxes with corresponding confidences: each detection box corresponds to one object and is determined by a set of coordinates (the upper-left and lower-right corners of the box), and the confidence of each box represents how reliably the box contains an object.
Likewise, the second fused feature map of the image to be detected is predicted to obtain the second detection result of the image to be detected, and the third fused feature map of the image to be detected is predicted to obtain the third detection result of the image to be detected; both results take the same form of detection boxes with corresponding confidences.
By predicting on feature maps at multiple scales (the third scale feature map, the second fused feature map, and the third fused feature map of the image to be detected), multiple detection results are obtained. Each result is a prediction built from feature information at a different level and degree of abstraction; combining the detection results across the feature maps, together with their confidences, yields the final detection result. Post-processing such as non-maximum suppression can then be applied to remove redundant and overlapping detection boxes, finally giving the number of objects in the image to be detected and further improving detection accuracy and robustness. A sketch of such a suppression step follows.
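This is a standard greedy non-maximum suppression over corner-format boxes, continuing the PyTorch imports above; torchvision.ops.nms offers an equivalent, and the 0.5 IoU threshold is an assumed value:

```python
def nms(boxes: torch.Tensor, scores: torch.Tensor, iou_thr: float = 0.5) -> torch.Tensor:
    """Greedy NMS. boxes: (N, 4) in (x1, y1, x2, y2) format; returns kept indices."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = int(order[0])
        keep.append(i)
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection of the highest-scoring box with the remaining boxes.
        x1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]  # keep only boxes that do not overlap too much
    return torch.tensor(keep, dtype=torch.long)
```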
Step 206, determining the number of objects in the image to be detected according to the first detection result of the image to be detected, the second detection result of the image to be detected and the third detection result of the image to be detected.
In some embodiments, the first, second, and third detection results of the image to be detected each contain a series of detection boxes with corresponding confidences. Boxes whose confidence is below a threshold can be discarded, redundant and overlapping boxes can be removed by non-maximum suppression, the remaining boxes are taken as object detections, and the number of objects in the image is obtained by counting them. Considering all three detection results together and applying this post-processing determines the number of objects more accurately, improves detection accuracy and robustness in difficult scenes such as occlusion and small object scale, and yields a more accurate object count. A sketch of the counting step follows.
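A sketch of how the three detection results might be combined into a final count; the 0.25 confidence threshold is an assumed value, and `nms` is the helper sketched above:

```python
def count_objects(results, conf_thr: float = 0.25, iou_thr: float = 0.5) -> int:
    """results: list of (boxes, scores) pairs from the three prediction branches."""
    boxes = torch.cat([b for b, _ in results], dim=0)
    scores = torch.cat([s for _, s in results], dim=0)
    mask = scores >= conf_thr                        # drop low-confidence boxes
    keep = nms(boxes[mask], scores[mask], iou_thr)   # remove redundant/overlapping boxes
    return keep.numel()                              # remaining boxes = object count
```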
According to the method for detecting objects in an image provided by the embodiments of the present disclosure, feature extraction at different scales yields the first, second, and third scale feature maps of the image to be detected, each a representation of the image at a different scale capturing different details and context. Fusing the third scale feature map with the second scale feature map yields the second fused feature map, and fusing that with the first scale feature map yields the third fused feature map; this integrates multi-scale information and strengthens the feature representation, improving detection accuracy. Object prediction is performed separately on the third scale feature map, the second fused feature map, and the third fused feature map, producing the first, second, and third detection results, which may include detection boxes and confidence scores, and the final object count is determined from the three results. By extracting and fusing features at multiple scales, the method strengthens the feature fusion of the object detection model so that it attends more closely to smaller objects and integrates information from different levels and scales effectively, solving the prior-art problem of low detection accuracy caused by dense objects, improving the model's accuracy and robustness, and broadening its range of application.
In some embodiments, the object detection model includes a feature extraction network comprising a first feature processing module, first through eighth residual processing modules, and a second feature processing module. Inputting the image to be detected into the object detection model to extract features at different scales and obtain the first, second, and third scale feature maps then includes: inputting the image to be detected into the first feature processing module for feature processing to obtain a first feature processing result; inputting the first feature processing result into the first residual processing module, processing the output of the first residual processing module with the second residual processing module, and processing the output of the second residual processing module with the third residual processing module to obtain the first scale feature map of the image to be detected; inputting the first scale feature map into the fourth residual processing module, then processing its output successively with the fifth, sixth, and seventh residual processing modules to obtain the second scale feature map of the image to be detected; and inputting the second scale feature map into the eighth residual processing module and processing its output with the second feature processing module to obtain the third scale feature map of the image to be detected.
In some embodiments, the first feature processing module comprises one feature processing unit consisting of a convolution layer, a normalization layer, and a Leaky ReLU activation function layer, while the second feature processing module comprises five such units. The eight residual processing modules share the same structure; each includes a convolution layer, an activation function, and a residual connection. The first, second, and third scale feature maps of the image to be detected represent feature representations of the image at different levels and degrees of abstraction.
In some embodiments, the image to be detected is input into the first feature processing module for preliminary feature processing: the convolution layer extracts local features in the image, the normalization layer accelerates training and improves the model's convergence speed, and the Leaky ReLU activation function increases the model's nonlinear expressive power, yielding the first feature processing result. This result is input into the first residual processing module, where a series of convolution and activation operations produce the first residual processing result; that is fed into the second residual processing module for further residual processing, producing the second residual processing result, which in turn is processed by the third residual processing module to obtain the first scale feature map of the image to be detected. In this chain, each residual processing module applies nonlinear transformations such as convolution to the output of the previous module, strengthening the feature representation. The resulting first scale feature map contains feature information of the image at a certain scale and provides a valuable feature representation for the subsequent detection task; multi-stage residual processing helps the model learn and represent complex features better, improving detection accuracy and robustness.
In some embodiments, the first scale feature map is input into the fourth residual processing module, where convolution and activation operations further strengthen the representation and extract more complex features, producing the fourth residual processing result; this is fed into the fifth residual processing module, producing the fifth residual processing result. Likewise, the fifth result is processed by the sixth residual processing module, and the sixth result by the seventh residual processing module, yielding the second scale feature map of the image to be detected. Each module in the chain applies nonlinear transformations to the output of its predecessor, strengthening the representation; the resulting second scale feature map contains rich feature information at another scale and provides a valuable representation for the subsequent detection task.
In some embodiments, the second scale feature map is input into the eighth residual processing module, where convolution and nonlinear activation produce the eighth residual processing result; this is passed through the five feature processing units of the second feature processing module to obtain the third scale feature map of the image to be detected, which contains rich feature information at a further scale and provides a valuable representation for the subsequent detection task. Extracting features in this way helps the object detection model better understand and represent the image content and improves detection accuracy and robustness. A sketch of such a feature extraction network follows.
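Under the same assumptions as the building-block sketch above, the feature extraction network could be composed as follows. The channel widths and the stride-2 downsampling between stages are assumptions of this sketch; the disclosure implies only that the three maps differ in scale, since the fusion steps below up-sample them:

```python
class Backbone(nn.Module):
    """Feature extraction network producing the three scale feature maps."""
    def __init__(self, c: int = 64):
        super().__init__()
        self.stem = FeatureProcessing(3, c, s=2)       # first feature processing module
        self.stage1 = nn.Sequential(*[ResidualProcessing(c) for _ in range(3)])      # residual modules 1-3
        self.down1 = FeatureProcessing(c, 2 * c, s=2)  # assumed downsampling
        self.stage2 = nn.Sequential(*[ResidualProcessing(2 * c) for _ in range(4)])  # residual modules 4-7
        self.down2 = FeatureProcessing(2 * c, 4 * c, s=2)
        self.stage3 = ResidualProcessing(4 * c)        # residual module 8
        self.tail = nn.Sequential(*[FeatureProcessing(4 * c, 4 * c) for _ in range(5)])  # second feature processing module

    def forward(self, x):
        p1 = self.stage1(self.stem(x))                 # first scale feature map
        p2 = self.stage2(self.down1(p1))               # second scale feature map
        p3 = self.tail(self.stage3(self.down2(p2)))    # third scale feature map
        return p1, p2, p3
```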
In some embodiments, the object detection model includes a feature fusion network that in turn includes a third feature processing module, and fusing the third scale feature map with the second scale feature map to obtain the second fused feature map includes: up-sampling the third scale feature map of the image to be detected to obtain an up-sampling result; concatenating the up-sampling result with the second scale feature map to obtain an initial second fused feature map; and inputting the initial second fused feature map into the third feature processing module for feature processing to obtain the second fused feature map of the image to be detected.
In some embodiments, the third feature processing module includes five feature processing units, each consisting of a convolution layer, a normalization layer, and a Leaky ReLU activation function layer. The third scale feature map is up-sampled, for example by interpolation or transposed convolution, so that its size matches that of the second scale feature map, yielding the up-sampling result. This result is concatenated with the second scale feature map along the channel dimension, combining feature maps of different scales into a richer representation, the initial second fused feature map, which can contain both detail and context information of the image. The initial second fused feature map is then processed by the third feature processing module, whose five units each extract and refine features through a convolution layer, a normalization layer, and a Leaky ReLU activation function layer, helping the model learn more complex features and strengthening its nonlinear expressive power. The resulting second fused feature map integrates information from different scales and provides a stronger feature representation for the subsequent detection task; through this feature fusion network, the model effectively combines feature information of different scales, improving detection accuracy and robustness.
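A sketch of this fusion step, continuing the assumptions above; nearest-neighbour up-sampling stands in for the interpolation or transposed-convolution choice left open by the disclosure:

```python
class FusionBlock(nn.Module):
    """Up-sample the deeper map, concatenate with the shallower map, then
    refine the result with five feature processing units."""
    def __init__(self, c_deep: int, c_shallow: int, c_out: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.refine = nn.Sequential(
            FeatureProcessing(c_deep + c_shallow, c_out),
            *[FeatureProcessing(c_out, c_out) for _ in range(4)],
        )

    def forward(self, deep, shallow):
        x = torch.cat([self.up(deep), shallow], dim=1)  # initial fused feature map
        return self.refine(x)                           # fused feature map
```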
In some embodiments, the feature fusion network includes a fourth feature processing module, and fusing the second fused feature map with the first scale feature map to obtain the third fused feature map includes: up-sampling the second fused feature map of the image to be detected to obtain an up-sampling result; concatenating the up-sampling result with the first scale feature map to obtain an initial third fused feature map; convolving the initial third fused feature map to obtain a convolution result; and inputting the convolution result into the fourth feature processing module for feature processing to obtain the third fused feature map of the image to be detected.
In some embodiments, the fourth feature processing module includes five feature processing units, each consisting of a convolution layer, a normalization layer, and a Leaky ReLU activation function layer. The second fused feature map is up-sampled, for example by interpolation or transposed convolution, so that its size matches that of the first scale feature map, yielding the up-sampling result. This result is concatenated with the first scale feature map along the channel dimension, combining feature maps from different scales and stages into a richer representation, the initial third fused feature map, which can contain both detail and context information. A 1×1 convolution is then applied to this map, which helps further extract and integrate the features and adjust the channel count, producing a more abstract and useful representation, the convolution result of the initial third fused feature map. This convolution result is processed by the fourth feature processing module, whose five units each extract and refine features through a convolution layer, a normalization layer, and a Leaky ReLU activation function layer, yielding the third fused feature map of the image to be detected, which integrates information from different scales and provides a stronger representation for the subsequent detection task. Through this feature fusion network, the model effectively combines feature information of different scales, improving detection accuracy and robustness. A sketch of this variant follows.
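The second fusion step can be sketched the same way, with the 1×1 convolution inserted between the concatenation and the five feature processing units:

```python
class FusionBlockWithProj(nn.Module):
    """Up-sample + concatenate + 1x1 convolution + five feature processing units."""
    def __init__(self, c_deep: int, c_shallow: int, c_out: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.proj = nn.Conv2d(c_deep + c_shallow, c_out, kernel_size=1)  # 1x1 conv
        self.refine = nn.Sequential(*[FeatureProcessing(c_out, c_out) for _ in range(5)])

    def forward(self, deep, shallow):
        x = torch.cat([self.up(deep), shallow], dim=1)  # initial third fused feature map
        return self.refine(self.proj(x))                # third fused feature map
```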
Referring to Fig. 3, the feature extraction network and the feature fusion network of the object detection model include: a first feature processing module 301, a first residual processing module 302, a second residual processing module 303, a third residual processing module 304, a fourth residual processing module 305, a fifth residual processing module 306, a sixth residual processing module 307, a seventh residual processing module 308, an eighth residual processing module 309, a second feature processing module 310, a first up-sampling layer 311, a first concatenation module 312, a third feature processing module 313, a second up-sampling layer 314, a second concatenation module 315, a convolution layer 316, and a fourth feature processing module 317. The image to be detected is input into the first feature processing module 301 to obtain the first feature processing result. That result passes through the first residual processing module 302, the second residual processing module 303, and the third residual processing module 304 in turn to obtain the first scale feature map of the image to be detected. The first scale feature map then passes through the fourth residual processing module 305, the fifth residual processing module 306, the sixth residual processing module 307, and the seventh residual processing module 308 in turn to obtain the second scale feature map of the image to be detected. Finally, the second scale feature map is processed by the eighth residual processing module 309 and the second feature processing module 310 to obtain the third scale feature map of the image to be detected.
The third scale feature map of the image to be detected is input into the first upsampling layer 311 for upsampling to obtain an upsampling result of the third scale feature map of the image to be detected; the upsampling result of the third scale feature map of the image to be detected and the second scale feature map of the image to be detected are input into the first splicing processing module 312 for splicing to obtain an initial second fusion feature map of the image to be detected; and the initial second fusion feature map of the image to be detected is input into the third feature processing module 313 for feature processing to obtain a second fusion feature map of the image to be detected. The second fusion feature map of the image to be detected is input into the second upsampling layer 314 for upsampling to obtain an upsampling result of the second fusion feature map of the image to be detected; the upsampling result of the second fusion feature map of the image to be detected and the first scale feature map of the image to be detected are input into the second splicing processing module 315 for splicing to obtain a third initial fusion feature map of the image to be detected; the third initial fusion feature map of the image to be detected is input into the convolution layer 316 for convolution processing to obtain a convolution result of the third initial fusion feature map of the image to be detected; and the convolution result of the third initial fusion feature map of the image to be detected is input into the fourth feature processing module 317 for feature processing to obtain the third fusion feature map of the image to be detected. Through the feature extraction network and the feature fusion network of the object detection model, the object detection model can effectively extract and combine feature information of different scales. By performing feature extraction of different scales on the image to be detected and fusing the extraction results, the feature fusion capability of the object detection model is enhanced, so that the model pays more attention to information of smaller objects and can effectively integrate information from different levels and scales. This alleviates the problem of low object detection precision caused by object density in the prior art, improves the accuracy and robustness of the object detection model, and enlarges its application range.
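For orientation only, the wiring of fig. 3 can be summarized by the following schematic forward pass, assuming each numbered module 301-317 is registered as a submodule with compatible input/output shapes; this is a structural sketch, not the implementation itself:

import torch

def forward(self, image):
    x = self.module_301(image)                                    # first feature processing
    x = self.module_304(self.module_303(self.module_302(x)))     # residual modules 302-304
    scale1 = x                                                    # first scale feature map
    x = self.module_308(self.module_307(self.module_306(self.module_305(x))))  # 305-308
    scale2 = x                                                    # second scale feature map
    x = self.module_310(self.module_309(x))                       # 309-310
    scale3 = x                                                    # third scale feature map
    up3 = self.module_311(scale3)                                 # first upsampling layer
    fusion2 = self.module_313(torch.cat([up3, scale2], dim=1))    # splice 312, process 313
    up2 = self.module_314(fusion2)                                # second upsampling layer
    fusion3 = self.module_317(self.module_316(torch.cat([up2, scale1], dim=1)))  # 315-317
    return scale3, fusion2, fusion3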
In some embodiments, the object detection model includes an object prediction network, and the object prediction network includes a first deconvolution network, a second deconvolution network, and a third deconvolution network. Performing object prediction on the third scale feature map of the image to be detected, the second fusion feature map of the image to be detected, and the third fusion feature map of the image to be detected, respectively, to obtain a first detection result of the image to be detected, a second detection result of the image to be detected, and a third detection result of the image to be detected, includes: inputting the third scale feature map of the image to be detected into the first deconvolution network, and mapping the third scale feature map of the image to be detected, based on the first deconvolution network, to a first deconvolution result of the image to be detected having the same size as the image to be detected; performing a detection operation on the first deconvolution result of the image to be detected to obtain the first detection result of the image to be detected; inputting the second fusion feature map of the image to be detected into the second deconvolution network, and mapping the second fusion feature map of the image to be detected, based on the second deconvolution network, to a second deconvolution result of the image to be detected having the same size as the image to be detected; performing a detection operation on the second deconvolution result of the image to be detected to obtain the second detection result of the image to be detected; inputting the third fusion feature map of the image to be detected into the third deconvolution network, and mapping the third fusion feature map of the image to be detected, based on the third deconvolution network, to a third deconvolution result of the image to be detected having the same size as the image to be detected; and performing a detection operation on the third deconvolution result of the image to be detected to obtain the third detection result of the image to be detected.
In some embodiments, the third scale feature map of the image to be detected is input into the first deconvolution network, and the size of the third scale feature map of the image to be detected is gradually enlarged so as to gradually approach the size of the image to be detected; based on the first deconvolution network, the third scale feature map of the image to be detected is mapped to a first deconvolution result of the image to be detected having the same size as the image to be detected. Performing the detection operation on the first deconvolution result of the image to be detected may include performing detection frame prediction on the size-adjusted first deconvolution result of the image to be detected to identify and locate objects in the image to be detected, obtaining the first detection result of the image to be detected, where the first detection result includes a series of first detection frames and corresponding confidence levels. Specifically, if the dimension of the third scale feature map of the image to be detected is 20×15×18 and the resolution of the image to be detected is 640×480, the first detection result of the image to be detected includes 20×15×3=900 first detection frames.
In some embodiments, the second fusion feature map of the image to be detected is input into the second deconvolution network, and the size of the second fusion feature map of the image to be detected is gradually enlarged so as to gradually approach the size of the image to be detected; based on the second deconvolution network, the second fusion feature map of the image to be detected is mapped to a second deconvolution result of the image to be detected having the same size as the image to be detected. Performing the detection operation on the second deconvolution result of the image to be detected may include performing detection frame prediction on the size-adjusted second deconvolution result of the image to be detected to identify and locate objects in the image to be detected, obtaining the second detection result of the image to be detected, where the second detection result includes a series of second detection frames and corresponding confidence levels. Specifically, if the dimension of the second fusion feature map of the image to be detected is 80×60×18 and the resolution of the image to be detected is 640×480, the second detection result of the image to be detected includes 80×60×3=14400 second detection frames.
In some embodiments, the third fusion feature map of the image to be detected is input into the third deconvolution network, and the size of the third fusion feature map of the image to be detected is gradually enlarged so as to gradually approach the size of the image to be detected; based on the third deconvolution network, the third fusion feature map of the image to be detected is mapped to a third deconvolution result of the image to be detected having the same size as the image to be detected. Performing the detection operation on the third deconvolution result of the image to be detected may include performing detection frame prediction on the size-adjusted third deconvolution result of the image to be detected to identify and locate objects in the image to be detected, obtaining the third detection result of the image to be detected, where the third detection result includes a series of third detection frames and corresponding confidence levels. Specifically, if the dimension of the third fusion feature map of the image to be detected is 80×60×18 and the resolution of the image to be detected is 640×480, the third detection result of the image to be detected includes 80×60×3=14400 third detection frames. Through such multi-scale and multi-stage processing, the object detection model can combine feature information from different scales and stages, improving the accuracy and robustness of target recognition.
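The detection-frame counts in the examples above follow directly from the feature-map sizes, on the assumption that each feature-map cell predicts 3 anchor boxes (the 18 channels would then decompose as 3 anchors × 6 values, e.g. 4 box offsets, 1 objectness score, and 1 class score; this decomposition is an interpretation, not stated in the embodiment). A hypothetical transposed-convolution head that progressively enlarges a feature map toward the input resolution is also sketched:

import torch.nn as nn

def num_detection_frames(h, w, anchors_per_cell=3):
    # Each feature-map cell contributes anchors_per_cell detection frames.
    return h * w * anchors_per_cell

assert num_detection_frames(15, 20) == 900     # 20x15 third scale feature map
assert num_detection_frames(60, 80) == 14400   # 80x60 fusion feature maps

# Five stride-2 transposed convolutions enlarge a 20x15 map by 2**5 = 32x,
# i.e. to 640x480; an 80x60 map would need only three such steps (8x).
deconv_head = nn.Sequential(
    *[nn.ConvTranspose2d(18, 18, kernel_size=4, stride=2, padding=1) for _ in range(5)]
)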
In some embodiments, performing the detection operation on the first deconvolution result of the image to be detected to obtain the first detection result of the image to be detected includes: performing detection frame prediction on the first deconvolution result of the image to be detected to obtain a plurality of first detection frames of the image to be detected, and determining the confidence of each first detection frame of the image to be detected.
In some embodiments, a series of first candidate frames is generated using a preset anchor box technique based on the first deconvolution result of the image to be detected. The first candidate frames cover areas of the image to be detected where objects may exist. The position of each first candidate frame is fine-tuned so that it lies closer to the actual target boundary, yielding the plurality of first detection frames of the image to be detected. Specifically, if the dimension of the third scale feature map of the image to be detected is 20×15×18, the number of first detection frames is 20×15×3=900. The degree of matching between each first detection frame and the actual object is calculated to obtain a confidence score for each first detection frame; the confidence score reflects how well the first detection frame matches the actual object. The confidence of each first detection frame is obtained by combining the classification confidence and the bounding-box regression confidence.
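The candidate-frame generation and confidence combination described above might be sketched as follows; the grid-centred anchor placement and the multiplicative combination of classification and bounding-box confidences are common conventions assumed here, and all names are illustrative:

import torch

def generate_candidate_frames(feat_h, feat_w, stride, anchor_sizes):
    # Place every preset (w, h) anchor at the centre of each feature-map cell.
    frames = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for w, h in anchor_sizes:
                frames.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return torch.tensor(frames)  # shape: (feat_h * feat_w * len(anchor_sizes), 4)

def detection_confidence(cls_conf, box_conf):
    # Combine the classification confidence with the bounding-box regression
    # (objectness) confidence into one score per detection frame.
    return cls_conf * box_conf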
In some embodiments, performing the detection operation on the second deconvolution result of the image to be detected to obtain the second detection result of the image to be detected includes: performing detection frame prediction on the second deconvolution result of the image to be detected to obtain a plurality of second detection frames of the image to be detected, and determining the confidence of each second detection frame of the image to be detected.
In some embodiments, a series of second candidate frames is generated using a preset anchor box technique based on the second deconvolution result of the image to be detected. The second candidate frames cover areas of the image to be detected where objects may exist. The position of each second candidate frame is fine-tuned so that it lies closer to the actual target boundary, yielding the plurality of second detection frames of the image to be detected. Specifically, if the dimension of the second fusion feature map of the image to be detected is 80×60×18, the number of second detection frames is 80×60×3=14400. The degree of matching between each second detection frame and the actual object is calculated to obtain a confidence score for each second detection frame; the confidence score reflects how well the second detection frame matches the actual object. The confidence of each second detection frame is obtained by combining the classification confidence and the bounding-box regression confidence.
In some embodiments, performing the detection operation on the third deconvolution result of the image to be detected to obtain the third detection result of the image to be detected includes: performing detection frame prediction on the third deconvolution result of the image to be detected to obtain a plurality of third detection frames of the image to be detected, and determining the confidence of each third detection frame of the image to be detected.
In some embodiments, a series of third candidate frames is generated using a preset anchor box technique based on the third deconvolution result of the image to be detected. The third candidate frames cover areas of the image to be detected where objects may exist. The position of each third candidate frame is fine-tuned so that it lies closer to the actual target boundary, yielding the plurality of third detection frames of the image to be detected. Specifically, if the dimension of the third fusion feature map of the image to be detected is 80×60×18, the number of third detection frames is 80×60×3=14400. The degree of matching between each third detection frame and the actual object is calculated to obtain a confidence score for each third detection frame; the confidence score reflects how well the third detection frame matches the actual object. The confidence of each third detection frame is obtained by combining the classification confidence and the bounding-box regression confidence. By performing detection frame prediction on the first, second, and third deconvolution results of the image to be detected and determining the confidence of each detection frame, a basis is provided for subsequent object detection and object counting.
In some embodiments, determining the number of objects in the image to be detected based on the first detection result of the image to be detected, the second detection result of the image to be detected, and the third detection result of the image to be detected includes: comparing the confidence of each first detection frame of the image to be detected with a preset threshold, and determining the first detection frames whose confidence is greater than the preset threshold as first target detection frames; comparing the confidence of each second detection frame of the image to be detected with the preset threshold, and determining the second detection frames whose confidence is greater than the preset threshold as second target detection frames; comparing the confidence of each third detection frame of the image to be detected with the preset threshold, and determining the third detection frames whose confidence is greater than the preset threshold as third target detection frames; and obtaining the number of objects based on the number of first target detection frames, the number of second target detection frames, and the number of third target detection frames.
In some embodiments, the confidence of each first detection frame of the image to be detected is compared with the preset threshold, and the first detection frames of the image to be detected whose confidence is greater than the preset threshold are retained and determined as first target detection frames; the confidence of each second detection frame of the image to be detected is compared with the preset threshold, and the second detection frames of the image to be detected whose confidence is greater than the preset threshold are retained and determined as second target detection frames; and the confidence of each third detection frame of the image to be detected is compared with the preset threshold, and the third detection frames of the image to be detected whose confidence is greater than the preset threshold are retained and determined as third target detection frames. Comparing the confidence of each detection frame with the preset threshold screens out the target detection frames with high confidence. The preset threshold is not limited in the embodiments of the disclosure and is set according to actual test conditions. The number of objects is calculated based on the numbers of first, second, and third target detection frames screened out. If multiple target detection frames overlap heavily, a non-maximum suppression algorithm can be used to remove the redundant overlapping frames, ensuring that each object is counted only once, and the finally determined number of objects is output.
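A sketch of this screening and counting step is given below: the detection frames of the three detection results are filtered by the preset confidence threshold and then passed through non-maximum suppression so that each object is counted only once. torchvision.ops.nms is used as a stand-in, and the threshold values are placeholders:

import torch
from torchvision.ops import nms

def count_objects(frames_per_head, scores_per_head, conf_thresh=0.5, iou_thresh=0.45):
    frames = torch.cat(frames_per_head)     # (N, 4) frames from the three detection results
    scores = torch.cat(scores_per_head)     # (N,) confidences
    keep = scores > conf_thresh             # retain only the target detection frames
    frames, scores = frames[keep], scores[keep]
    kept = nms(frames, scores, iou_thresh)  # suppress heavily overlapping duplicates
    return int(kept.numel())                # final number of objects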
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 4 is a schematic diagram of an object detection device in an image according to an embodiment of the present disclosure. As shown in fig. 4, the object detection device in the image includes:
an acquisition module 401, configured to acquire an image to be detected;
The feature extraction module 402 is configured to input an image to be detected into the object detection model to perform feature extraction with different scales, so as to obtain a first scale feature map of the image to be detected, a second scale feature map of the image to be detected, and a third scale feature map of the image to be detected;
the first fusion module 403 is configured to perform feature fusion on the third scale feature map of the image to be detected and the second scale feature map of the image to be detected, so as to obtain a second fused feature map of the image to be detected;
the second fusion module 404 is configured to perform feature fusion on the second fusion feature map of the image to be detected and the first scale feature map of the image to be detected, so as to obtain a third fusion feature map of the image to be detected;
The prediction module 405 is configured to respectively perform object prediction on the third scale feature map of the image to be detected, the second fused feature map of the image to be detected, and the third fused feature map of the image to be detected, so as to obtain a first detection result of the image to be detected, a second detection result of the image to be detected, and a third detection result of the image to be detected;
the determining module 406 is configured to determine the number of objects in the image to be detected according to the first detection result of the image to be detected, the second detection result of the image to be detected, and the third detection result of the image to be detected.
According to the technical scheme provided by the embodiments of the disclosure, the feature extraction module 402 performs feature extraction of different scales on the image to be detected to obtain a first scale feature map of the image to be detected, a second scale feature map of the image to be detected, and a third scale feature map of the image to be detected, where the first scale feature map, the second scale feature map, and the third scale feature map differ in size. Each scale feature map represents the image to be detected at a different scale, and each scale captures different details and context information. The first fusion module 403 performs feature fusion based on the third scale feature map of the image to be detected and the second scale feature map of the image to be detected to obtain the second fusion feature map of the image to be detected, and the second fusion module 404 performs feature fusion based on the second fusion feature map of the image to be detected and the first scale feature map of the image to be detected to obtain the third fusion feature map of the image to be detected; feature fusion integrates multi-scale information and enhances the feature representation, thereby improving the accuracy of object detection. The prediction module 405 performs object prediction on the third scale feature map of the image to be detected, the second fusion feature map of the image to be detected, and the third fusion feature map of the image to be detected, respectively; the object detection model predicts the objects that may exist in the image to be detected together with their positions and categories, so as to obtain the first detection result of the image to be detected predicted from the third scale feature map, the second detection result of the image to be detected predicted from the second fusion feature map, and the third detection result of the image to be detected predicted from the third fusion feature map. These results may include detection frames and corresponding confidence scores. The determining module 406 determines the final number of objects in the image to be detected according to the detection results based on the three different feature maps. According to this device for detecting an object in an image, performing feature extraction of different scales on the image to be detected and fusing the extraction results enhances the feature fusion capability of the object detection model, so that the model pays more attention to information of smaller objects and can effectively integrate information from different levels and scales. This alleviates the problem of low object detection precision caused by object density in the prior art, improves the accuracy and robustness of the object detection model, and enlarges its application range.
In some embodiments, the object detection model includes a feature extraction network, where the feature extraction network includes a first feature processing module, a first residual processing module, a second residual processing module, a third residual processing module, a fourth residual processing module, a fifth residual processing module, a sixth residual processing module, a seventh residual processing module, an eighth residual processing module, and a second feature processing module, and the feature extraction module 402 is configured to input the image to be detected into the first feature processing module for feature processing, to obtain a first feature processing result of the image to be detected; input the first feature processing result of the image to be detected into the first residual processing module for residual processing, perform residual processing on the output result of the first residual processing module through the second residual processing module, and perform residual processing on the output result of the second residual processing module through the third residual processing module to obtain the first scale feature map of the image to be detected; input the first scale feature map of the image to be detected into the fourth residual processing module for residual processing, perform residual processing on the output result of the fourth residual processing module through the fifth residual processing module, perform residual processing on the output result of the fifth residual processing module through the sixth residual processing module, and perform residual processing on the output result of the sixth residual processing module through the seventh residual processing module to obtain the second scale feature map of the image to be detected; and input the second scale feature map of the image to be detected into the eighth residual processing module for residual processing, and perform feature processing on the output result of the eighth residual processing module through the second feature processing module to obtain the third scale feature map of the image to be detected.
In some embodiments, the object detection model includes a feature fusion network, where the feature fusion network includes a third feature processing module, and the first fusion module 403 is configured to perform up-sampling processing on the third scale feature map of the image to be detected to obtain an up-sampling result of the third scale feature map of the image to be detected; splice the up-sampling result of the third scale feature map of the image to be detected and the second scale feature map of the image to be detected to obtain an initial second fusion feature map of the image to be detected; and input the initial second fusion feature map of the image to be detected into the third feature processing module for feature processing to obtain the second fusion feature map of the image to be detected.
In some embodiments, the feature fusion network includes a fourth feature processing module, and the second fusion module 404 is configured to perform up-sampling processing on the second fusion feature map of the image to be detected to obtain an up-sampling result of the second fusion feature map of the image to be detected; splice the up-sampling result of the second fusion feature map of the image to be detected and the first scale feature map of the image to be detected to obtain a third initial fusion feature map of the image to be detected; perform convolution processing on the third initial fusion feature map of the image to be detected to obtain a convolution result of the third initial fusion feature map of the image to be detected; and input the convolution result of the third initial fusion feature map of the image to be detected into the fourth feature processing module for feature processing to obtain the third fusion feature map of the image to be detected.
In some embodiments, the object detection model includes an object prediction network, the object prediction network including a first deconvolution network, a second deconvolution network, and a third deconvolution network, and the prediction module 405 is configured to input the third scale feature map of the image to be detected into the first deconvolution network, and map the third scale feature map of the image to be detected, based on the first deconvolution network, to a first deconvolution result of the image to be detected having the same size as the image to be detected; perform a detection operation on the first deconvolution result of the image to be detected to obtain the first detection result of the image to be detected; input the second fusion feature map of the image to be detected into the second deconvolution network, and map the second fusion feature map of the image to be detected, based on the second deconvolution network, to a second deconvolution result of the image to be detected having the same size as the image to be detected; perform a detection operation on the second deconvolution result of the image to be detected to obtain the second detection result of the image to be detected; input the third fusion feature map of the image to be detected into the third deconvolution network, and map the third fusion feature map of the image to be detected, based on the third deconvolution network, to a third deconvolution result of the image to be detected having the same size as the image to be detected; and perform a detection operation on the third deconvolution result of the image to be detected to obtain the third detection result of the image to be detected.
In some embodiments, the prediction module 405 is configured to perform the detection operation on the first deconvolution result of the image to be detected to obtain the first detection result of the image to be detected by performing detection frame prediction on the first deconvolution result of the image to be detected to obtain a plurality of first detection frames of the image to be detected and determining the confidence of each first detection frame of the image to be detected; perform the detection operation on the second deconvolution result of the image to be detected to obtain the second detection result of the image to be detected by performing detection frame prediction on the second deconvolution result of the image to be detected to obtain a plurality of second detection frames of the image to be detected and determining the confidence of each second detection frame of the image to be detected; and perform the detection operation on the third deconvolution result of the image to be detected to obtain the third detection result of the image to be detected by performing detection frame prediction on the third deconvolution result of the image to be detected to obtain a plurality of third detection frames of the image to be detected and determining the confidence of each third detection frame of the image to be detected.
In some embodiments, the determining module 406 is configured to compare the confidence of each first detection frame of the image to be detected with a preset threshold, and determine the first detection frames whose confidence is greater than the preset threshold as first target detection frames; compare the confidence of each second detection frame of the image to be detected with the preset threshold, and determine the second detection frames whose confidence is greater than the preset threshold as second target detection frames; compare the confidence of each third detection frame of the image to be detected with the preset threshold, and determine the third detection frames whose confidence is greater than the preset threshold as third target detection frames; and obtain the number of objects based on the number of first target detection frames, the number of second target detection frames, and the number of third target detection frames.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the disclosure.
Fig. 5 is a schematic diagram of an electronic device 500 provided by an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 of this embodiment includes: a processor 501, a memory 502, and a computer program 503 stored in the memory 502 and executable on the processor 501. When executing the computer program 503, the processor 501 implements the steps of the method embodiments described above, or alternatively implements the functions of the modules/units in the apparatus embodiments described above.
The electronic device 500 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. Electronic device 500 may include, but is not limited to, a processor 501 and a memory 502. It will be appreciated by those skilled in the art that fig. 5 is merely an example of an electronic device 500 and is not limiting of the electronic device 500 and may include more or fewer components than shown, or different components.
The processor 501 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
The memory 502 may be an internal storage unit of the electronic device 500, for example, a hard disk or a memory of the electronic device 500. The memory 502 may also be an external storage device of the electronic device 500, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device 500. The memory 502 may also include both an internal storage unit and an external storage device of the electronic device 500. The memory 502 is used to store the computer program and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division of the above functional units and modules is illustrated; in practical applications, the above functions may be allocated to different functional units and modules as required, i.e., the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated units may be implemented in the form of hardware or in the form of software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium (e.g., a computer readable storage medium). Based on such understanding, the present disclosure may implement all or part of the flow of the methods of the above embodiments through a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium, and when executed by a processor, may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims (10)

1. A method for detecting an object in an image, the method being applied to a trained object detection model, comprising:
Acquiring an image to be detected;
inputting the image to be detected into the object detection model to perform feature extraction of different scales to obtain a first scale feature map of the image to be detected, a second scale feature map of the image to be detected and a third scale feature map of the image to be detected;
Performing feature fusion on the third scale feature map of the image to be detected and the second scale feature map of the image to be detected to obtain a second fusion feature map of the image to be detected;
Performing feature fusion on the second fusion feature map of the image to be detected and the first scale feature map of the image to be detected to obtain a third fusion feature map of the image to be detected;
respectively carrying out object prediction on the third scale feature map of the image to be detected, the second fusion feature map of the image to be detected and the third fusion feature map of the image to be detected to obtain a first detection result of the image to be detected, a second detection result of the image to be detected and a third detection result of the image to be detected;
And determining the number of objects in the image to be detected according to the first detection result of the image to be detected, the second detection result of the image to be detected and the third detection result of the image to be detected.
2. The method for detecting an object in an image according to claim 1, wherein the object detection model includes a feature extraction network, the feature extraction network includes a first feature processing module, a first residual processing module, a second residual processing module, a third residual processing module, a fourth residual processing module, a fifth residual processing module, a sixth residual processing module, a seventh residual processing module, an eighth residual processing module, and a second feature processing module, the inputting the image to be detected into the object detection model performs feature extraction of different scales, and a first scale feature map of the image to be detected, a second scale feature map of the image to be detected, and a third scale feature map of the image to be detected are obtained, including:
inputting the image to be detected into the first feature processing module for feature processing to obtain a first feature processing result of the image to be detected;
Inputting a first characteristic processing result of the image to be detected into the first residual processing module to carry out residual processing, carrying out residual processing on an output result of the first residual processing module through the second residual processing module, and carrying out residual processing on an output result of the second residual processing module through the third residual processing module to obtain a first scale characteristic diagram of the image to be detected;
Inputting the first scale feature map of the image to be detected to the fourth residual processing module for residual processing, carrying out residual processing on an output result of the fourth residual processing module through the fifth residual processing module, carrying out residual processing on an output result of the fifth residual processing module through the sixth residual processing module, and carrying out residual processing on an output result of the sixth residual processing module through the seventh residual processing module to obtain a second scale feature map of the image to be detected;
And inputting the second scale feature map of the image to be detected into the eighth residual processing module for residual processing, and performing feature processing on an output result of the eighth residual processing module through the second feature processing module to obtain a third scale feature map of the image to be detected.
3. The method for detecting an object in an image according to claim 1, wherein the object detection model includes a feature fusion network, the feature fusion network includes a third feature processing module, and the feature fusion of the third scale feature map of the image to be detected and the second scale feature map of the image to be detected to obtain a second fusion feature map of the image to be detected includes:
Performing up-sampling processing on the third-scale feature map of the image to be detected to obtain an up-sampling result of the third-scale feature map of the image to be detected;
Splicing the up-sampling result of the third scale feature map of the image to be detected and the second scale feature map of the image to be detected to obtain an initial second fusion feature map of the image to be detected;
And inputting the initial second fusion feature map of the image to be detected into the third feature processing module for feature processing to obtain a second fusion feature map of the image to be detected.
4. The method for detecting an object in an image according to claim 1, wherein the object detection model includes a feature fusion network, the feature fusion network includes a fourth feature processing module, and the feature fusion of the second fusion feature map of the image to be detected and the first scale feature map of the image to be detected to obtain a third fusion feature map of the image to be detected includes:
performing up-sampling processing on the second fusion feature map of the image to be detected to obtain an up-sampling result of the second fusion feature map of the image to be detected;
Splicing the up-sampling result of the second fusion feature map of the image to be detected and the first scale feature map of the image to be detected to obtain a third initial fusion feature map of the image to be detected;
Carrying out convolution processing on the third initial fusion feature map of the image to be detected to obtain a convolution result of the third initial fusion feature map of the image to be detected;
And inputting a convolution result of the third initial fusion feature map of the image to be detected into the fourth feature processing module for feature processing to obtain the third fusion feature map of the image to be detected.
5. The method for detecting an object in an image according to claim 1, wherein the object detection model includes an object prediction network, the object prediction network includes a first deconvolution network, a second deconvolution network, and a third deconvolution network, and the performing object prediction on the third scale feature map of the image to be detected, the second fusion feature map of the image to be detected, and the third fusion feature map of the image to be detected, respectively, to obtain a first detection result of the image to be detected, a second detection result of the image to be detected, and a third detection result of the image to be detected, includes:
inputting the third scale feature map of the image to be detected into the first deconvolution network, and mapping the third scale feature map of the image to be detected into a first deconvolution result of the image to be detected having the same size as the image to be detected, based on the first deconvolution network;
performing a detection operation on the first deconvolution result of the image to be detected to obtain the first detection result of the image to be detected;
inputting the second fusion feature map of the image to be detected into the second deconvolution network, and mapping the second fusion feature map of the image to be detected into a second deconvolution result of the image to be detected having the same size as the image to be detected, based on the second deconvolution network;
Performing a detection operation on the second deconvolution result of the image to be detected to obtain the second detection result of the image to be detected;
inputting the third fusion feature map of the image to be detected into the third deconvolution network, and mapping the third fusion feature map of the image to be detected into a third deconvolution result of the image to be detected having the same size as the image to be detected, based on the third deconvolution network;
and performing a detection operation on the third deconvolution result of the image to be detected to obtain the third detection result of the image to be detected.
6. The method for detecting an object in an image according to claim 5, wherein the performing a detection operation on the first deconvolution result of the image to be detected to obtain a first detection result of the image to be detected includes:
Performing detection frame prediction on the first deconvolution result of the image to be detected to obtain a plurality of first detection frames of the image to be detected, and determining the confidence coefficient of each first detection frame of the image to be detected;
And performing a detection operation on the second deconvolution result of the image to be detected to obtain a second detection result of the image to be detected, including:
performing detection frame prediction on the second deconvolution result of the image to be detected to obtain a plurality of second detection frames of the image to be detected, and determining the confidence coefficient of each second detection frame of the image to be detected;
and performing a detection operation on the third deconvolution result of the image to be detected to obtain a third detection result of the image to be detected, including:
And performing detection frame prediction on the third deconvolution result of the image to be detected to obtain a plurality of third detection frames of the image to be detected, and determining the confidence coefficient of each third detection frame of the image to be detected.
7. The method according to claim 6, wherein determining the number of objects in the image to be detected based on the first detection result of the image to be detected, the second detection result of the image to be detected, and the third detection result of the image to be detected, comprises:
Comparing the confidence coefficient of each first detection frame of the image to be detected with a preset threshold value, and determining the first detection frames with the confidence coefficient larger than the preset threshold value as first target detection frames;
comparing the confidence coefficient of each second detection frame of the image to be detected with the preset threshold value, and determining the second detection frames with the confidence coefficient larger than the preset threshold value as second target detection frames;
comparing the confidence coefficient of each third detection frame of the image to be detected with the preset threshold value, and determining the third detection frames with the confidence coefficient larger than the preset threshold value as third target detection frames;
The number of objects is obtained based on the number of first target detection frames, the number of second target detection frames, and the number of third target detection frames.
8. An in-image object detection apparatus, comprising:
The acquisition module is used for acquiring the image to be detected;
The feature extraction module is used for inputting the image to be detected into an object detection model to perform feature extraction of different scales to obtain a first scale feature map of the image to be detected, a second scale feature map of the image to be detected and a third scale feature map of the image to be detected;
The first fusion module is used for carrying out feature fusion on the third scale feature map of the image to be detected and the second scale feature map of the image to be detected to obtain a second fusion feature map of the image to be detected;
The second fusion module is used for carrying out feature fusion on the second fusion feature map of the image to be detected and the first scale feature map of the image to be detected to obtain a third fusion feature map of the image to be detected;
The prediction module is used for respectively carrying out object prediction on the third scale feature map of the image to be detected, the second fusion feature map of the image to be detected and the third fusion feature map of the image to be detected to obtain a first detection result of the image to be detected, a second detection result of the image to be detected and a third detection result of the image to be detected;
The determining module is used for determining the number of objects in the image to be detected according to the first detection result of the image to be detected, the second detection result of the image to be detected and the third detection result of the image to be detected.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.
10. A readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination