CN111832559B - Target detection method and device, storage medium and electronic device - Google Patents

Target detection method and device, storage medium and electronic device

Info

Publication number
CN111832559B
Authority
CN
China
Prior art keywords
frame
group
candidate
candidate frame
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010567707.6A
Other languages
Chinese (zh)
Other versions
CN111832559A (en)
Inventor
胡来丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202010567707.6A priority Critical patent/CN111832559B/en
Publication of CN111832559A publication Critical patent/CN111832559A/en
Application granted granted Critical
Publication of CN111832559B publication Critical patent/CN111832559B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G06V10/245 Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a target detection method and device, a storage medium and an electronic device. The method comprises: extracting candidate frames based on a deep learning network; determining the target group in which a candidate frame is located according to the relative position of the candidate frame and the real frame; and processing the candidate frame according to the target group in which it is located to obtain the target position and category. The invention solves the problem that, in target detection, the positions of candidate frames are more random than the real target, which hinders the position convergence and regression of the candidate frames, and thereby achieves the effect of facilitating position convergence and frame regression because the position of each candidate frame is fixed relative to its group.

Description

Target detection method and device, storage medium and electronic device
Technical Field
The invention relates to the field of deep learning and target detection, in particular to a target detection method and device, a storage medium and an electronic device.
Background
Deep-learning-based target detection mainly falls into two categories: single-stage detection, e.g., YOLO and SSD, and candidate-frame-based two-stage detection, e.g., the RCNN series. Two-stage schemes offer improved performance over single-stage ones, and many improved solutions based on RCNN have since emerged, such as RFCN, Cascade-RCNN, IoU-Net, FPN, and the like.
In a two-stage detection scheme, the positions of the candidate frames obtained in the training or testing stage are constrained only by the real target frames. Because the positions of the candidate frames are more random than the real targets (a candidate frame may, for example, sit far to the left or right of the target), the real target information they contain is inconsistent, which may cause classification confusion among the candidate frames; this randomness of position is also unfavorable for the position convergence and frame regression of the candidate frames.
Aiming at the problem in the related art that the positions of candidate frames are more random than the real target during target detection, which is unfavorable for the position convergence and regression of the candidate frames, no effective solution exists at present.
Disclosure of Invention
The embodiments of the invention provide a target detection method and device, a storage medium and an electronic device, which at least solve the problem in the related art that the position of a candidate frame is more random than the real target during target detection, which is unfavorable for the position convergence and regression of the candidate frame.
According to an embodiment of the present invention, there is provided a target detection method including: extracting candidate frames based on the deep learning network; determining a target group in which the candidate frame is positioned according to the relative positions of the candidate frame and the real frame; and processing the candidate frame according to the target group where the candidate frame is located to obtain a target position and a category.
In an alternative embodiment of the present invention, determining the target group in which the candidate frame is located according to the relative position of the candidate frame and the real frame includes: determining, through position grouping, the relative position of the candidate frame with respect to the real frame, and determining, according to the relative position, the target group in which the candidate frame is located as the target position.
In an alternative embodiment of the present invention, determining the target group in which the candidate frame is located according to the relative position of the candidate frame and the real frame includes: when the candidate frame is determined to be located at the upper-left position of the real frame, determining the target group in which the candidate frame is located to be the first position group; when the candidate frame is determined to be located at the upper-right position of the real frame, determining the target group in which the candidate frame is located to be the second position group; when the candidate frame is determined to be located at the lower-left position of the real frame, determining the target group in which the candidate frame is located to be the third position group; and when the candidate frame is determined to be located at the lower-right position of the real frame, determining the target group in which the candidate frame is located to be the fourth position group.
In an alternative embodiment of the present invention, determining the target group in which the candidate frame is located according to the relative position of the candidate frame and the real frame includes: when the center point of the candidate frame is determined to be located at the upper-left position of the center point of the real frame, determining the target group of the candidate frame to be the first position group; when the center point of the candidate frame is determined to be located at the upper-right position of the center point of the real frame, determining the target group of the candidate frame to be the second position group; when the center point of the candidate frame is determined to be located at the lower-left position of the center point of the real frame, determining the target group of the candidate frame to be the third position group; and when the center point of the candidate frame is determined to be located at the lower-right position of the center point of the real frame, determining the target group of the candidate frame to be the fourth position group.
In an optional embodiment of the present invention, processing the candidate frame according to the target group in which the candidate frame is located includes: determining the group classification of the candidate frame according to the position group in which the candidate frame is located, wherein the position grouping at least comprises the number of position groups, and the group classification at least comprises the number of group-classification categories; and performing regression on the candidate frame according to the group classification of the candidate frame to obtain a regression frame, wherein the regression frame at least comprises a regression frame dimension, the regression frame dimension is determined by the number of group-classification categories, and the number of group-classification categories is determined by the number of position groups.
In an optional embodiment of the present invention, performing regression on the candidate frame according to the group classification of the candidate frame to obtain a regression frame includes: obtaining a confidence group according to the group classification of the candidate frame, wherein the confidence group is used for indicating the classification results of the background and of the different position groups of the candidate frame; and determining a real category according to the confidence group, wherein the real category is used for indicating the real category of the classification result among the position groups of the candidate frame.
In an optional embodiment of the present invention, performing regression on the candidate frame according to the target group in which the candidate frame is located to obtain a regression frame includes: determining the background of the regression frame dimension according to the real category; and determining, within the background of the regression frame dimension, the group used for frame regression according to the category obtained by the position grouping.
According to another embodiment of the present invention, there is provided an object detection apparatus including: the extraction module is used for extracting candidate frames based on the deep learning network; the determining module is used for determining a target group where the candidate frame is located according to the relative positions of the candidate frame and the real frame; and the processing module is used for processing the candidate frame according to the target group where the candidate frame is positioned to obtain a target position and a category.
According to a further embodiment of the invention, there is also provided a storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the invention, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the method and the device, after a candidate frame is extracted, the target group of the candidate frame is determined according to the relative position of the candidate frame and the real frame, and the candidate frame is then processed according to its target group. This solves the problem that the position of a candidate frame is more random than the real target during target detection, which is unfavorable for the position convergence and regression of the candidate frame, and achieves the effect of facilitating position convergence and frame regression because the position of the candidate frame is fixed relative to its group.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
Fig. 1 is a block diagram of a hardware structure of a mobile terminal of a target detection method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a target detection method according to an embodiment of the invention;
FIG. 3 is a schematic diagram of the grouping of candidate frames according to an alternative embodiment of the present invention;
FIG. 4 is a block diagram of an object detection apparatus according to an embodiment of the present invention;
Fig. 5 is a flow chart of a method of object detection in accordance with an alternative embodiment of the present invention.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Example 1
The method according to the first embodiment of the present application may be implemented in a mobile terminal, a computer terminal or a similar computing device. Taking the mobile terminal as an example, fig. 1 is a block diagram of a hardware structure of the mobile terminal according to an embodiment of the present application. As shown in fig. 1, the mobile terminal 10 may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and not limiting of the structure of the mobile terminal described above. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1 or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to the object detection method in the embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of networks described above may include wireless networks provided by the communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
In this embodiment, there is provided a target detection method running on the mobile terminal, and fig. 2 is a flowchart of target detection according to an embodiment of the present invention, as shown in fig. 2, where the flowchart includes the following steps:
step S202, extracting candidate frames based on a deep learning network;
It should be noted that the deep-learning backbone network adopts a convolutional neural network; the specific network structure is not limited in the present application, and those skilled in the art may select one according to the actual situation.
Step S204, determining a target group where the candidate frame is located according to the relative positions of the candidate frame and the real frame;
Step S206, processing the candidate frame according to the target group in which the candidate frame is located to obtain a target position and a category.
After candidate frames are obtained through extraction in target detection, they are grouped according to their positions relative to the real frame, corresponding frame regression and category classification are then carried out per group, and the results are finally combined to obtain the final target position and category information; that is, position grouping, group classification and frame regression are performed, and the final target position and category are obtained by combination.
Through the above steps, after a candidate frame is extracted, the target group of the candidate frame is determined according to the relative position of the candidate frame and the real frame, and the candidate frame is then processed according to its target group. This solves the problem that the position of a candidate frame is more random than the real target during target detection, which is unfavorable for the position convergence and regression of the candidate frame, and improves the position convergence and frame regression of the frames.
In order to improve the grouping accuracy of the target group, determining the target group in which the candidate frame is located according to the relative position of the candidate frame and the real frame includes: determining, through position grouping, the relative position of the candidate frame with respect to the real frame, and determining, according to the relative position, the target group in which the candidate frame is located as the target position. That is, after position grouping, the relative position of the candidate frame with respect to the real frame is known, and the target group of the candidate frame is determined from that relative position.
Further, determining the target group in which the candidate frame is located according to the relative position of the candidate frame and the real frame includes: when the candidate frame is determined to be located at the upper-left position of the real frame, determining the target group in which the candidate frame is located to be the first position group; when the candidate frame is determined to be located at the upper-right position of the real frame, determining the target group in which the candidate frame is located to be the second position group; when the candidate frame is determined to be located at the lower-left position of the real frame, determining the target group in which the candidate frame is located to be the third position group; and when the candidate frame is determined to be located at the lower-right position of the real frame, determining the target group in which the candidate frame is located to be the fourth position group. In other words, the candidate frames are grouped into upper-left, upper-right, lower-left and lower-right positions with the real frame as the center, so as to determine the position group of each candidate frame.
Specifically, when the method is implemented, determining the target group in which the candidate frame is located according to the relative position of the candidate frame and the real frame includes: when the center point of the candidate frame is determined to be located at the upper-left position of the center point of the real frame, determining the target group of the candidate frame to be the first position group; when the center point of the candidate frame is determined to be located at the upper-right position of the center point of the real frame, determining the target group of the candidate frame to be the second position group; when the center point of the candidate frame is determined to be located at the lower-left position of the center point of the real frame, determining the target group of the candidate frame to be the third position group; and when the center point of the candidate frame is determined to be located at the lower-right position of the center point of the real frame, determining the target group of the candidate frame to be the fourth position group. As shown in fig. 3, the candidate frames are divided into 4 position groups (upper left, upper right, lower left, lower right) according to the position of their center point relative to the center point of the real target (gt); the groups are denoted offset = {ul, ur, ll, lr}, corresponding to roi_ul, roi_ur, roi_ll and roi_lr in fig. 3. The grouped target boxes are trained with a cross-entropy loss function, which aids classification and the convergence of box regression.
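A minimal sketch of this center-point grouping follows; the box format (x1, y1, x2, y2), image coordinates with y increasing downward, and the group index order {ul: 0, ur: 1, ll: 2, lr: 3} are illustrative assumptions rather than details fixed by the patent.

def box_center(box):
    # box given as (x1, y1, x2, y2)
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def position_group(candidate_box, gt_box):
    # Return the position group (offset) of a candidate frame relative to the real frame.
    cx, cy = box_center(candidate_box)
    gx, gy = box_center(gt_box)
    left = cx <= gx           # candidate center lies to the left of the gt center
    upper = cy <= gy          # image coordinates: smaller y is higher up
    if upper and left:
        return 0              # ul: upper left
    if upper:
        return 1              # ur: upper right
    if left:
        return 2              # ll: lower left
    return 3                  # lr: lower right

During training, such group labels would supervise the position-grouping branch with the cross-entropy loss described above.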
Further, a classification process is required. Processing the candidate frame according to the target group in which the candidate frame is located includes:
determining the group classification of the candidate frame according to the position group in which the candidate frame is located, wherein the position grouping at least comprises the number of position groups, and the group classification at least comprises the number of group-classification categories; and performing regression on the candidate frame according to the group classification of the candidate frame to obtain a regression frame, wherein the regression frame at least comprises a regression frame dimension, the regression frame dimension is determined by the number of group-classification categories, and the number of group-classification categories is determined by the number of position groups.
Specifically, group classification and frame regression are performed according to the number of group-classification categories and the regression frame dimension. The regression frame dimension is determined by the number of group-classification categories, which is in turn determined by the number of position groups (a head-sizing sketch follows below, and concrete values appear in steps S508 to S510 of the preferred embodiment).
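As an illustration of how these dimensions could size the network heads, the sketch below derives the group-classification and regression output widths from the number of object categories C and the number of position groups; the fully connected layers, the feature dimension of 1024 and the use of PyTorch are assumptions made only for this example, while the output widths (CG = C x Group + 1 and CG x 4) follow the scheme described here.

import torch.nn as nn

class GroupedDetectionHead(nn.Module):
    # Illustrative head sizing only: the layer types and feat_dim are assumptions;
    # the output widths follow CG = C * Group + 1 and BOXG = CG * 4.
    def __init__(self, feat_dim=1024, num_classes=20, num_groups=4):
        super().__init__()
        cg = num_classes * num_groups + 1            # group-classification categories (+1 background)
        self.cls_head = nn.Linear(feat_dim, cg)      # group-classification confidences
        self.reg_head = nn.Linear(feat_dim, cg * 4)  # one 4-d box per group-classification category

    def forward(self, roi_feat):
        return self.cls_head(roi_feat), self.reg_head(roi_feat)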
Further, a frame regression process is required, in which regression is performed on the candidate frame according to its group classification to obtain a regression frame, including: obtaining a confidence group according to the group classification of the candidate frame, wherein the confidence group indicates the classification results of the background and of the different position groups of the candidate frame; and determining the real category according to the confidence group, wherein the real category indicates the true category of the candidate frame's classification result.
Further, after the frame regression, the results need to be combined. Performing regression on the candidate frame according to the target group in which it is located to obtain a regression frame includes: determining the background of the regression frame dimension according to the real category; and determining, within the background of the regression frame dimension, the group used for frame regression according to the category obtained by the position grouping.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
Example 2
The embodiment also provides a target detection device, which is used for implementing the above embodiment and the preferred implementation manner, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of an object detection apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes:
an extraction module 40 for extracting candidate frames based on the deep learning network;
A determining module 42, configured to determine, according to the relative positions of the candidate frame and the real frame, a target group in which the candidate frame is located;
And a processing module 44, configured to process the candidate frame according to the target group where the candidate frame is located, so as to obtain a target position and a category.
In the above modules, after candidate frames are obtained through extraction in target detection, the positions of the candidate frames relative to the real frame are grouped, corresponding frame regression and category classification are then carried out per group, and the results are finally combined to obtain the final target position and category information; that is, the final target position and category are obtained through position grouping, group classification, frame regression and combination.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; or the above modules may be located in different processors in any combination.
In order to better understand the above-mentioned target detection procedure, the following description is given with reference to the preferred embodiments, but the technical solutions of the embodiments of the present invention are not limited thereto.
The preferred embodiment of the invention determines the target group in which a candidate frame is located according to the relative position of the candidate frame and the real frame, thereby solving the problem that the randomness of candidate frame positions is unfavorable for the position convergence and regression of the candidate frames.
As shown in fig. 5, the two-stage target detection method based on the candidate frame position grouping includes the following specific implementation steps:
Step S502, an image is input. The RGB image is scaled and mean-subtracted for normalization.
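A minimal preprocessing sketch consistent with this step is given below; the target scale (shorter side of 800 pixels) and the mean values are assumptions, as the patent only states that the RGB image is scaled and mean-subtracted.

import cv2
import numpy as np

MEAN_RGB = np.array([123.68, 116.78, 103.94], dtype=np.float32)  # assumed mean values

def preprocess(image_bgr, short_side=800):
    # Convert to RGB, scale so the shorter side equals short_side, subtract the mean.
    img = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB).astype(np.float32)
    scale = short_side / min(img.shape[:2])
    img = cv2.resize(img, None, fx=scale, fy=scale)
    return img - MEAN_RGB, scale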
Step S504, the backbone network processes the image. The backbone network is a convolutional neural network; common choices are VGG16, ResNet and the like. In particular, its RPN, proposal and RoIPooling layers may follow the corresponding network structure in RFCN as disclosed in the document R-FCN: Object Detection via Region-based Fully Convolutional Networks.
Step S506, position grouping. The candidate frames are divided into Group = 4 position groups (upper left, upper right, lower left and lower right), determined by the position of the candidate frame's center point relative to the center point of the real target (gt); the groups are denoted offset = {ul, ur, ll, lr}, corresponding to roi_ul, roi_ur, roi_ll and roi_lr in fig. 3. A cross-entropy loss function is used during training to assist classification and the convergence of frame regression;
Step S508, group classification. The number of classification categories is CG = C × Group + 1, where C is the number of object categories, Group is the number of position groups, and the additional 1 corresponds to the background.
Step S510, frame regression. The regression frame dimension is BOXG = CG × 4, recorded as {BG0, BG1, …, BGc}, where CG is the number of group-classification categories;
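As a worked example of these dimensions, assume C = 20 object categories (the value of C is an assumption) and Group = 4 position groups:

C, Group = 20, 4           # C is an assumed example value; Group = 4 position groups
CG = C * Group + 1         # 81 group-classification categories, including the background
BOXG = CG * 4              # 324 regression outputs per candidate frame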
step S512, combining the groups.
First, the confidence group {g0, g1, g2, …, gc} is obtained from the group classification, where g0 is the background (of dimension 1) and gi (1 ≤ i ≤ c) corresponds to the groups; the true category tc is then determined from these confidences;
Second, the regression output BGtc is determined within BOXG according to the true category tc;
Finally, the group within BGtc to be used for frame regression is determined according to the group obtained from the position classification offset, and the real frame of the target is finally obtained.
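A hedged sketch of this combination step is given below; the argmax selection of the true category and the layout of the regression output (the background box first, then Group boxes per object category) are illustrative assumptions consistent with steps S506 to S510, not a verbatim reproduction of the patent's formulas.

import numpy as np

def combine(scores, reg_output, offset, num_classes, num_groups=4):
    # scores:     (CG,) group-classification confidences {g0, g1, ...}, g0 = background
    # reg_output: (CG * 4,) frame-regression output
    # offset:     position-group index from the grouping branch (0 .. num_groups - 1)
    best = int(np.argmax(scores))
    if best == 0:
        return None, None                                       # background: no target
    tc = (best - 1) // num_groups                               # true object category (assumed index mapping)
    boxes = reg_output[4:].reshape(num_classes, num_groups, 4)  # drop the background box
    bg_tc = boxes[tc]                                           # BGtc: the Group boxes of category tc
    return tc, bg_tc[offset]                                    # pick the box of the offset group

The returned box would still need to be decoded relative to the candidate frame in the usual way, which is outside the scope of this sketch.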
An embodiment of the invention also provides a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:
S1, extracting candidate frames based on a deep learning network;
S2, determining a target group where the candidate frame is located according to the relative positions of the candidate frame and the real frame;
And S3, processing the candidate frame according to the target group where the candidate frame is located, and obtaining a target position and a category.
Optionally, the storage medium is further arranged to store a computer program for performing the steps of:
S1, determining, through position grouping, the relative position of the candidate frame with respect to the real frame, and determining, according to the relative position, the target group in which the candidate frame is located as the target position.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing a computer program.
An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
S1, extracting candidate frames based on a deep learning network;
S2, determining a target group where the candidate frame is located according to the relative positions of the candidate frame and the real frame;
And S3, processing the candidate frame according to the target group where the candidate frame is located, and obtaining a target position and a category.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a memory device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module for implementation. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A target detection method, characterized in that the method comprises:
Extracting candidate frames based on the deep learning network;
determining a target group in which the candidate frame is positioned according to the relative positions of the candidate frame and the real frame;
Processing the candidate frame according to the target group where the candidate frame is located to obtain a target position and a category;
wherein the processing of the candidate frame according to the target group in which the candidate frame is located includes: determining the group classification of the candidate frame according to the position group in which the candidate frame is located, wherein the position grouping at least comprises the number of position groups, and the group classification at least comprises the number of group-classification categories; and carrying out regression on the candidate frame according to the group classification of the candidate frame to obtain a regression frame, wherein the regression frame at least comprises a regression frame dimension, the regression frame dimension is determined by the number of group-classification categories, and the number of group-classification categories is determined by the number of position groups.
2. The method of claim 1, wherein determining the target group in which the candidate frame is located based on the relative positions of the candidate frame and the real frame comprises:
And determining the relative position of the candidate frame in the real frame through position grouping, and determining the target group in which the candidate frame is positioned as a target position according to the relative position.
3. The method of claim 1, wherein determining the target group in which the candidate frame is located based on the relative positions of the candidate frame and the real frame comprises:
When the candidate frame is determined to be positioned at the upper left position of the real frame, determining a target group in which the candidate frame is positioned as a first position group;
When the candidate frame is determined to be positioned at the upper right position of the real frame, determining that the target group where the candidate frame is positioned is a second position group;
When the candidate frame is determined to be positioned at the lower left position of the real frame, determining that the target group where the candidate frame is positioned is a third position group;
And when the candidate frame is determined to be positioned at the lower right position of the real frame, determining the target group of the candidate frame as a fourth position group.
4. A method according to claim 1 or 3, wherein determining the target group in which the candidate frame is located based on the relative positions of the candidate frame and the real frame comprises:
when the center point of the candidate frame is determined to be positioned at the upper left position of the center point of the real frame, determining the target group of the candidate frame as a first position group;
when the center point of the candidate frame is determined to be positioned at the upper right position of the center point of the real frame, determining the target group of the candidate frame as a second position group;
When the center point of the candidate frame is determined to be positioned at the lower left position of the center point of the real frame, determining the target group of the candidate frame as a third position group;
and when the center point of the candidate frame is determined to be positioned at the lower right position of the center point of the real frame, determining the target group of the candidate frame as a fourth position group.
5. The method of claim 1, wherein performing regression on the candidate frame according to the group classification of the candidate frame to obtain a regression frame comprises:
obtaining a confidence group according to the group classification of the candidate frame, wherein the confidence group is used for indicating the classification results of the background and of the different position groups of the candidate frame;
And determining a real category according to the confidence group, wherein the real category is used for indicating the real category of the classification result among the position groups of the candidate frame.
6. The method of claim 5, wherein regressing the candidate boxes according to their respective groups to obtain a regression box comprises:
determining the background of the regression frame dimension according to the real category;
And determining the grouping for carrying out frame regression in the background of the regression frame dimension according to the category obtained by the position grouping.
7. An object detection apparatus, comprising:
the extraction module is used for extracting candidate frames based on the deep learning network;
The determining module is used for determining a target group where the candidate frame is located according to the relative positions of the candidate frame and the real frame;
the processing module is used for processing the candidate frames according to the target group where the candidate frames are located to obtain target positions and categories;
The processing module is further configured to determine the group classification of the candidate frame according to the position group in which the candidate frame is located, wherein the position grouping at least includes the number of position groups, and the group classification at least includes the number of group-classification categories; and to carry out regression on the candidate frame according to the group classification of the candidate frame to obtain a regression frame, wherein the regression frame at least includes a regression frame dimension, the regression frame dimension is determined by the number of group-classification categories, and the number of group-classification categories is determined by the number of position groups.
8. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of claims 1 to 6 when run.
9. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of the claims 1 to 6.
CN202010567707.6A 2020-06-19 2020-06-19 Target detection method and device, storage medium and electronic device Active CN111832559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010567707.6A CN111832559B (en) 2020-06-19 2020-06-19 Target detection method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010567707.6A CN111832559B (en) 2020-06-19 2020-06-19 Target detection method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN111832559A CN111832559A (en) 2020-10-27
CN111832559B true CN111832559B (en) 2024-07-02

Family

ID=72899341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010567707.6A Active CN111832559B (en) 2020-06-19 2020-06-19 Target detection method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN111832559B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590489A (en) * 2017-09-28 2018-01-16 国家新闻出版广电总局广播科学研究院 Object detection method based on concatenated convolutional neutral net
CN109886286A (en) * 2019-01-03 2019-06-14 武汉精测电子集团股份有限公司 Object detection method, target detection model and system based on cascade detectors

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3502933B2 (en) * 2002-01-18 2004-03-02 防衛庁技術研究本部長 Image processing method for target extraction from infrared image, target extraction method, ground observation method for tracking the extraction target, flying object guidance method and their devices
US10997407B2 (en) * 2015-10-02 2021-05-04 Hewlett-Packard Development Company, L.P. Detecting document objects
CN108229509B (en) * 2016-12-16 2021-02-26 北京市商汤科技开发有限公司 Method and device for identifying object class and electronic equipment
CN106845406A (en) * 2017-01-20 2017-06-13 深圳英飞拓科技股份有限公司 Head and shoulder detection method and device based on multitask concatenated convolutional neutral net
CN108229307B (en) * 2017-11-22 2022-01-04 北京市商汤科技开发有限公司 Method, device and equipment for object detection
WO2020113412A1 (en) * 2018-12-04 2020-06-11 深圳大学 Target detection method and system
CN109934229B (en) * 2019-03-28 2021-08-03 网易有道信息技术(北京)有限公司 Image processing method, device, medium and computing equipment
CN110569875B (en) * 2019-08-07 2022-04-19 清华大学无锡应用技术研究院 Deep neural network target detection method based on feature multiplexing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590489A (en) * 2017-09-28 2018-01-16 国家新闻出版广电总局广播科学研究院 Object detection method based on concatenated convolutional neutral net
CN109886286A (en) * 2019-01-03 2019-06-14 武汉精测电子集团股份有限公司 Object detection method, target detection model and system based on cascade detectors

Also Published As

Publication number Publication date
CN111832559A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN108229509B (en) Method and device for identifying object class and electronic equipment
WO2020238054A1 (en) Method and apparatus for positioning chart in pdf document, and computer device
KR101880004B1 (en) Method and apparatus for identifying television channel information
CN109409377B (en) Method and device for detecting characters in image
CN110796135B (en) Target positioning method and device, computer equipment and computer storage medium
CN111833372B (en) Foreground target extraction method and device
CN110378249B (en) Text image inclination angle recognition method, device and equipment
CN110321892B (en) Picture screening method and device and electronic equipment
CN111651624B (en) Image retrieval method and apparatus for processing a web device and method for controlling the same
CN110457704B (en) Target field determination method and device, storage medium and electronic device
CN108319888A (en) The recognition methods of video type and device, terminal
CN111598176B (en) Image matching processing method and device
CN111401193B (en) Method and device for acquiring expression recognition model, and expression recognition method and device
CN114299363A (en) Training method of image processing model, image classification method and device
CN110059212A (en) Image search method, device, equipment and computer readable storage medium
CN113627334A (en) Object behavior identification method and device
CN113505720A (en) Image processing method and device, storage medium and electronic device
CN112434049A (en) Table data storage method and device, storage medium and electronic device
CN111832559B (en) Target detection method and device, storage medium and electronic device
CN116959113A (en) Gait recognition method and device
CN110414322B (en) Method, device, equipment and storage medium for extracting picture
CN115830342A (en) Method and device for determining detection frame, storage medium and electronic device
CN106304026B (en) The determination method and device of end message
CN111079468B (en) Method and device for identifying object by robot
CN112749150B (en) Error labeling data identification method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant