CN110619255A - Target detection method and device - Google Patents

Target detection method and device

Info

Publication number
CN110619255A
CN110619255A
Authority
CN
China
Prior art keywords
region
target
training sample
neural network
preset
Prior art date
Legal status
Granted
Application number
CN201810632279.3A
Other languages
Chinese (zh)
Other versions
CN110619255B (en)
Inventor
蔡晓蕙
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201810632279.3A
Publication of CN110619255A
Application granted
Publication of CN110619255B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a target detection method and device. The method includes: inputting an image to be detected into a pre-trained first cascaded convolutional neural network to obtain a first feature map of the image to be detected; extracting first candidate regions from the first feature map according to preset parameters; determining, through a pre-trained second cascaded convolutional neural network, first foreground regions among the first candidate regions and the confidence of each first foreground region; and performing regression, through a pre-trained third cascaded convolutional neural network, on the first foreground regions whose confidence satisfies a preset condition, so as to obtain the target regions in the image to be detected. The method improves the accuracy of target detection in multi-angle scenes.

Description

Target detection method and device
Technical Field
The present disclosure relates to image processing technologies, and in particular, to a target detection method and a target detection device.
Background
In the traditional computer vision field, target detection is a very active research direction. Detection of certain specific targets, such as face detection, pedestrian detection, and vehicle detection, is already a mature technology, but these techniques are applicable only to relatively simple scenes and their generalization ability is limited.
Among complex scenes, the most common is the large-angle scene: when the angle of the capturing camera is not fixed, targets in the image may appear at multiple angles (both large and small). At present, both traditional detection algorithms and deep-learning-based detection algorithms perform well on small-angle targets, but for multi-angle targets the detection performance degrades as the angle grows: the larger the target angle, the worse the detection performance.
The target angle refers to the angle between the plane of the target and the image plane. Taking vehicle detection as an example, the target angle may be the angle between the license-plate plane and the image plane in the captured image.
Therefore, how to detect multi-angle targets in an inclined scene becomes a technical problem to be solved urgently in the field of target detection.
Disclosure of Invention
In view of the above, the present application provides a target detection method and a device thereof.
Specifically, the method is realized through the following technical scheme:
according to a first aspect of embodiments of the present application, there is provided a target detection method, including:
inputting an image to be detected into a pre-trained first cascaded convolutional neural network to obtain a first feature map of the image to be detected;
extracting first candidate regions from the first feature map according to preset parameters, where the preset parameters include an x-axis direction angle and/or a y-axis direction angle;
determining, through a pre-trained second cascaded convolutional neural network, first foreground regions among the first candidate regions and the confidence of each first foreground region, where a first foreground region is a first candidate region whose confidence is greater than or equal to a preset confidence threshold;
and performing regression, through a pre-trained third cascaded convolutional neural network, on the first foreground regions whose confidence satisfies a preset condition to obtain the target regions in the image to be detected.
According to a second aspect of embodiments of the present application, there is provided an object detection apparatus, including:
a first extraction unit, configured to input an image to be detected into a pre-trained first cascaded convolutional neural network to obtain a first feature map of the image to be detected;
a second extraction unit, configured to extract first candidate regions from the first feature map according to preset parameters, where the preset parameters include an x-axis direction angle and/or a y-axis direction angle;
a determining unit, configured to determine, through a pre-trained second cascaded convolutional neural network, first foreground regions among the first candidate regions and the confidence of each first foreground region, where a first foreground region is a first candidate region whose confidence is greater than or equal to a preset confidence threshold;
and a processing unit, configured to perform regression, through a pre-trained third cascaded convolutional neural network, on the first foreground regions whose confidence satisfies a preset condition to obtain the target regions in the image to be detected.
According to a third aspect of embodiments herein, there is provided an object detection apparatus comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor, the processor being caused by the machine-executable instructions to:
inputting an image to be detected into a pre-trained first cascaded convolutional neural network to obtain a first feature map of the image to be detected;
extracting first candidate regions from the first feature map according to preset parameters, where the preset parameters include an x-axis direction angle and/or a y-axis direction angle;
determining, through a pre-trained second cascaded convolutional neural network, first foreground regions among the first candidate regions and the confidence of each first foreground region, where a first foreground region is a first candidate region whose confidence is greater than or equal to a preset confidence threshold;
and performing regression, through a pre-trained third cascaded convolutional neural network, on the first foreground regions whose confidence satisfies a preset condition to obtain the target regions in the image to be detected.
According to a fourth aspect of embodiments herein, there is provided a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to:
inputting an image to be detected into a pre-trained first cascaded convolutional neural network to obtain a first feature map of the image to be detected;
extracting first candidate regions from the first feature map according to preset parameters, where the preset parameters include an x-axis direction angle and/or a y-axis direction angle;
determining, through a pre-trained second cascaded convolutional neural network, first foreground regions among the first candidate regions and the confidence of each first foreground region, where a first foreground region is a first candidate region whose confidence is greater than or equal to a preset confidence threshold;
and performing regression, through a pre-trained third cascaded convolutional neural network, on the first foreground regions whose confidence satisfies a preset condition to obtain the target regions in the image to be detected.
According to the target detection method provided by the embodiments of the present application, an image to be detected is input into a pre-trained first cascaded convolutional neural network to obtain a first feature map of the image to be detected; first candidate regions carrying angle information are extracted from the first feature map according to preset parameters; first foreground regions among the first candidate regions and the confidence of each first foreground region are then determined through a pre-trained second cascaded convolutional neural network; and regression is performed, through a pre-trained third cascaded convolutional neural network, on the first foreground regions whose confidence satisfies a preset condition to obtain the target regions in the image to be detected. This improves the applicability of target detection to multi-angle scenes and the accuracy of target detection in such scenes.
Drawings
FIG. 1 is a flow chart illustrating a method of object detection according to an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method of training a cascaded convolutional neural network as shown in an exemplary embodiment of the present application;
FIG. 3 is a diagram illustrating a second candidate region and a labeling target region according to an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of an object detection device according to an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of an object detection device according to yet another exemplary embodiment of the present application;
fig. 6 is a schematic diagram of a hardware structure of an object detection apparatus according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
In order to make the technical solutions provided in the embodiments of the present application better understood and make the above objects, features and advantages of the embodiments of the present application more comprehensible, the technical solutions in the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic flow chart of a target detection method provided in an embodiment of the present application is shown, where the target detection method may be applied to a background server for video monitoring, and as shown in fig. 1, the method may include the following steps:
and S100, inputting the image to be detected into a pre-trained first cascade convolution neural network to obtain a first characteristic diagram of the image to be detected.
In the embodiment of the application, the image to be detected can be any frame image acquired by front-end equipment of video monitoring.
In the embodiment of the application, when the target needs to be detected, the image to be detected may be input to a first cascade convolutional neural network trained in advance to extract a feature map (referred to as a first feature map herein) of the image to be detected.
Wherein the first cascaded convolutional neural network may comprise a plurality of convolutional neural networks connected in series or in parallel for feature map extraction.
Step S110, extracting a first candidate region from the first feature map according to preset parameters.
In the embodiment of the present application, after obtaining the first feature map of the image to be detected, a candidate region (referred to as a first candidate region herein) may be extracted from the first feature map according to a preset parameter.
In order to ensure that the extracted candidate region is closer to the target region, the angle information needs to be considered when performing candidate region extraction.
Accordingly, in the embodiment of the present application, the parameters used for extracting candidate regions (referred to herein as preset parameters) may include, in addition to the scale, the aspect ratio, and the center point position (i.e. the abscissa and ordinate of the center point of the candidate region), an x-axis direction angle and/or a y-axis direction angle of the candidate region.
The x-axis direction angle refers to the angle (hereinafter the α angle) between the horizontal boundary of the candidate region and the horizontal boundary of the image to be detected; the y-axis direction angle refers to the angle (hereinafter the β angle) between the vertical boundary of the candidate region and the vertical boundary of the image to be detected.
In an example, the first candidate region is a parallelogram region, and the preset parameters may include a scale, an aspect ratio, a center point position, an x-axis direction angle, and a y-axis direction angle.
For example, assuming that the scale has n selectable values, the aspect ratio has m selectable values (e.g., 1:2, 1:3, etc.), the center point position has s × t selectable values, the α angle has u selectable values (e.g., π/6, π/3, π/2, etc.), and the β angle has v selectable values, a total of n × m × s × t × u × v candidate regions can be extracted.
It should be appreciated that, in the embodiment of the present application, the first candidate region is not limited to a parallelogram and may be another polygon carrying angle information. For example, the first candidate region may also be an isosceles trapezoid, in which case the preset parameters may include a scale, the lengths of the upper and lower sides, a center point position, and an x-axis direction angle or a y-axis direction angle.
For ease of understanding and description, the first candidate region is described as a parallelogram as an example in the following.
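As an illustration of how such angle-aware candidate regions could be enumerated, the Python sketch below generates parallelogram anchors from the preset parameters. The function name, the corner-point representation, and the way the α/β angles are turned into edge directions are assumptions made for this example, not the patent's implementation.

```python
import itertools
import math

def parallelogram_anchors(scales, aspect_ratios, centers_x, centers_y, alphas, betas):
    """Enumerate candidate regions as parallelograms given by four corner points.

    The "horizontal" edge is rotated by alpha (x-axis direction angle) and the
    "vertical" edge by beta (y-axis direction angle), both measured against the
    image boundaries.
    """
    anchors = []
    for s, r, cx, cy, a, b in itertools.product(
            scales, aspect_ratios, centers_x, centers_y, alphas, betas):
        w, h = s * math.sqrt(r), s / math.sqrt(r)        # edge lengths from scale and aspect ratio
        ex = (w / 2 * math.cos(a), w / 2 * math.sin(a))  # half of the "horizontal" edge vector
        ey = (h / 2 * math.sin(b), h / 2 * math.cos(b))  # half of the "vertical" edge vector
        anchors.append([
            (cx - ex[0] - ey[0], cy - ex[1] - ey[1]),
            (cx + ex[0] - ey[0], cy + ex[1] - ey[1]),
            (cx + ex[0] + ey[0], cy + ex[1] + ey[1]),
            (cx - ex[0] + ey[0], cy - ex[1] + ey[1]),
        ])
    return anchors

# With n scales, m aspect ratios, s*t center positions, u values of alpha and v values
# of beta, this enumerates the n*m*s*t*u*v candidate regions mentioned above.
anchors = parallelogram_anchors(
    scales=[64, 128], aspect_ratios=[1 / 2, 1 / 3],
    centers_x=range(0, 640, 16), centers_y=range(0, 480, 16),
    alphas=[math.pi / 6, math.pi / 3, math.pi / 2], betas=[math.pi / 3, math.pi / 2])
```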
Step S120, determining, through a pre-trained second cascaded convolutional neural network, first foreground regions among the first candidate regions and the confidence of each first foreground region.
In the embodiment of the application, after the first candidate regions are extracted from the first feature map, the first feature map labeled with the first candidate regions may be input to a pre-trained second cascaded convolutional neural network, and the second cascaded convolutional neural network determines the confidence of each first candidate region as a foreground region.
The second cascaded convolutional neural network may determine a first candidate region, in which the confidence of the foreground region is greater than or equal to a preset confidence threshold (which may be set according to an actual scene), as the foreground region (referred to herein as the first foreground region).
For example, assuming that the preset confidence threshold is 80%, the second cascaded convolutional neural network may determine the confidence of each first candidate region as a foreground region, and determine the first candidate region with the confidence greater than or equal to 80% as the first foreground region.
Wherein the second cascaded convolutional neural network may include a plurality of convolutional neural networks in series or in parallel for foreground region identification.
Step S130, performing regression, through a pre-trained third cascaded convolutional neural network, on the first foreground regions whose confidence satisfies a preset condition to obtain the target regions in the image to be detected.
In the embodiment of the application, after the first foreground regions and the confidence degrees of the first foreground regions are determined, the first foreground regions with the confidence degrees meeting the preset conditions can be determined.
As an embodiment, the confidence that satisfies the preset condition may include that the confidence is greater than or equal to the confidence threshold, that is, the confidence of each first foreground region output by the second cascaded convolutional neural network satisfies the preset condition.
As another embodiment, the confidence level satisfying the preset condition may include TOP N (i.e., N with the highest confidence level, where N is a positive integer, and a specific value of N may be set according to an actual scene) in the confidence levels of the first foreground region, that is, sorting the confidence levels of the first foreground region from high to low, and determining the TOP N confidence levels as the confidence levels satisfying the preset condition.
For example, assuming that the preset confidence threshold is 80% and N is 10, after the second cascaded convolutional neural network determines the confidence of each first candidate region as a foreground region, the first candidate regions with the confidence greater than or equal to 80% may be sorted in order from high confidence to low confidence, and the top 10 confidences are determined as the confidences meeting the preset condition according to the sorting result.
It should be noted that, when the number of the first foreground regions whose confidence degrees meet the preset condition is less than N, the confidence degrees of the first foreground regions may be directly determined as the confidence degrees meeting the preset condition.
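A minimal sketch of the selection logic just described, assuming the candidate regions and their confidences are held in two parallel lists; the function name and the default values are illustrative, and when fewer than N regions pass the threshold all of them are kept, as noted above.

```python
def select_foreground(candidates, confidences, conf_threshold=0.8, top_n=10):
    """Keep the candidates whose confidence meets the preset condition.

    `candidates` is a list of regions and `confidences` the matching list of scores.
    """
    # first foreground regions: confidence >= preset confidence threshold
    foreground = [(c, s) for c, s in zip(candidates, confidences) if s >= conf_threshold]
    # sort by confidence, highest first, and keep at most the TOP-N;
    # if fewer than N passed the threshold, all of them are kept
    foreground.sort(key=lambda item: item[1], reverse=True)
    return foreground[:top_n]
```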
In the embodiment of the present application, after the first foreground regions whose confidence satisfies the preset condition are determined, these first foreground regions may be input into a pre-trained third cascaded convolutional neural network for regression processing, so as to obtain the target regions (hereinafter referred to as detection target regions) in the image to be detected.
It can be seen that, in the flow of the method shown in fig. 1, when the candidate region is extracted, the extracted candidate region is a candidate region carrying angle information, and the angle information includes an angle in the x direction or/and an angle in the y direction, which ensures that the extracted candidate region is closer to the target region, improves the applicability of target detection to a multi-angle scene, and improves the accuracy of target detection in the multi-angle scene.
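To make the three-stage flow of fig. 1 concrete, the sketch below strings the steps together as PyTorch-style callables. `backbone`, `rpn_head`, and `regression_head` are invented placeholder modules standing in for the first, second, and third cascaded convolutional neural networks, so this is only an assumed shape of the pipeline, not the patented implementation.

```python
import torch

def detect(image, backbone, rpn_head, regression_head, anchors,
           conf_threshold=0.8, top_n=10):
    """Three-stage cascade: feature map -> foreground regions -> regressed target regions."""
    feature_map = backbone(image)                        # step S100: first feature map
    # step S110: the candidate regions carrying angle information are given by `anchors`
    confidences = rpn_head(feature_map, anchors)         # step S120: per-candidate confidence
    keep = confidences >= conf_threshold                 # first foreground regions
    foreground, fg_conf = anchors[keep], confidences[keep]
    order = torch.argsort(fg_conf, descending=True)[:top_n]   # TOP-N by confidence
    # step S130: regress the selected foreground regions into detection target regions
    return regression_head(feature_map, foreground[order])
```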
Referring to fig. 2, in an embodiment of the present application, the first cascaded convolutional neural network, the second cascaded convolutional neural network, and the third cascaded convolutional neural network, connected in cascade, may be trained in the following manner:
step S100a, inputting any training sample in the training set into the first cascade convolutional neural network to obtain a second feature map corresponding to the training sample.
In the embodiment of the present application, before target recognition is performed by the above convolutional neural networks (including the first cascaded convolutional neural network, the second cascaded convolutional neural network, and the third cascaded convolutional neural network), they need to be trained with a training set containing a certain number of training samples (which may be set according to the actual scene) until the networks converge, after which the target detection task can be performed.
Accordingly, in this embodiment, for any training sample in the training set, a feature map (referred to herein as a second feature map) of the training sample may be extracted by the first cascaded convolutional neural network.
The training sample may be a detection image marked with a target area.
And step S100b, determining a second candidate region in the second feature map according to preset parameters.
In this embodiment, the specific implementation of step S100b may refer to the relevant description in step S110, and the description of the embodiment of the present application is not repeated herein.
Note that, when the first candidate region in step S110 is a parallelogram, the second candidate region in step S100b is also a parallelogram.
And step S100c, determining a second foreground area in the second candidate area through a second cascade convolution neural network.
In this embodiment, after the second candidate regions in the second feature map are determined, the second feature map labeled with the second candidate regions may be input into the pre-trained second cascaded convolutional neural network, which determines the degree of overlap between each second candidate region and the target region labeled in advance in the training sample (hereinafter referred to as the labeled target region).
In one example, for any second candidate region and the labeled target region, the overlap ratio overlap between them may be determined as:
overlap = (S_candidate ∩ S_target) / (S_candidate ∪ S_target)
where S_candidate is the area of the candidate region, S_target is the area of the labeled target region, S_candidate ∩ S_target denotes the area of the overlapping part of the candidate region and the labeled target region, and S_candidate ∪ S_target denotes the total area covered by the candidate region and the labeled target region.
Taking fig. 3 as an example, if the region covered by parallelogram ABCD is the second candidate region, the region covered by rectangle EFGH is the labeled target region, and I and J are intersection points of parallelogram ABCD and rectangle EFGH, then the overlap ratio overlap of the second candidate region and the labeled target region is:
overlap = (S_ABCD ∩ S_EFGH) / (S_ABCD ∪ S_EFGH)
where S_ABCD ∩ S_EFGH is the area of the overlapping part (i.e. quadrilateral EJID) of candidate region ABCD and labeled target region EFGH, and S_ABCD ∪ S_EFGH is the total area covered by candidate region ABCD and labeled target region EFGH, i.e. the sum of the areas of parallelogram ABCD and rectangle EFGH minus the area of quadrilateral EJID.
In this embodiment, after the degree of overlap between each second candidate region and the labeled target region is determined, a second candidate region whose overlap with the labeled target region is greater than or equal to a preset overlap threshold (referred to herein as the first preset overlap threshold, which may be set according to the actual scene, e.g. 85%) may be determined as a foreground region (referred to herein as a second foreground region).
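Because the candidate regions are parallelograms rather than axis-aligned boxes, this overlap ratio amounts to a general polygon intersection-over-union. A minimal sketch using the shapely library follows; using shapely here is an assumption, since the patent does not specify how the areas are computed.

```python
from shapely.geometry import Polygon

def overlap_ratio(candidate_corners, target_corners):
    """overlap = area(candidate ∩ target) / area(candidate ∪ target).

    Both arguments are lists of (x, y) corner points, e.g. the four corners of a
    parallelogram candidate region and of a labeled target region.
    """
    candidate = Polygon(candidate_corners)
    target = Polygon(target_corners)
    inter = candidate.intersection(target).area
    union = candidate.union(target).area
    return inter / union if union > 0 else 0.0

# e.g. a parallelogram candidate vs. a rectangular labeled target, as in fig. 3
print(overlap_ratio([(0, 0), (4, 1), (5, 4), (1, 3)],
                    [(1, 0), (6, 0), (6, 3), (1, 3)]))
```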
And step S100d, performing regression processing on the second foreground region through a third cascaded convolutional neural network to obtain a target region in the training sample.
In this embodiment, after the second foreground regions in the training sample are determined and the first-class loss and the overlap loss corresponding to the training sample satisfy the requirements, the second foreground regions may be input into the third cascaded convolutional neural network for regression processing to obtain the target regions (hereinafter referred to as training target regions) in the training sample.
For any training target region, the number of feature points in the regression process may be determined according to the target edge characteristics, for example, for a rectangular region, the number of feature points in the regression process is 2.
Further, in this embodiment, in order to improve the detection accuracy of the first cascaded convolutional neural network and the second cascaded convolutional neural network, after the step S100c, the method may further include:
and according to the first class loss and/or the contact ratio loss corresponding to the training sample, performing parameter optimization on the network combination of the cascaded first cascade convolutional neural network and the cascaded second cascade convolutional neural network until the first class loss and the contact ratio loss corresponding to the training sample meet the requirements.
Performing parameter optimization on a network combination of the first cascaded convolutional neural network and the second cascaded convolutional neural network according to a first class loss and/or a contact ratio loss corresponding to the training sample, wherein the parameter optimization comprises the following steps: when the ratio of the number of the first type target areas in the training sample to the number of the pre-labeled target areas is larger than or equal to a preset ratio threshold, performing parameter optimization on the network combination of the first cascade convolutional neural network and the second cascade convolutional neural network in a cascade manner, so that the ratio of the number of the first type target areas in the training sample to the number of the pre-labeled target areas is smaller than the preset ratio threshold when the training sample is input after the parameter optimization; the first type target area is a pre-labeled target area without a corresponding second foreground area; and/or;
when the average value of the coincidence degrees of each second foreground region and the pre-labeled target region corresponding to each second foreground region in the training sample is smaller than a second preset coincidence degree threshold value, performing parameter optimization on the network combination of the cascaded first cascade convolutional neural network and the second cascade convolutional neural network, so that the average value of the coincidence degrees of each second foreground region and the pre-labeled target region corresponding to each second foreground region in the training sample is larger than or equal to the second preset coincidence degree threshold value when the training sample is input after the parameter optimization; wherein the second preset contact ratio threshold is greater than the first preset contact ratio threshold.
Wherein the performing parameter optimization on the network combination of the first cascaded convolutional neural network and the second cascaded convolutional neural network in cascade comprises: optimizing model parameters of the first cascaded convolutional neural network and/or the second cascaded convolutional neural network.
In this embodiment, after the second foreground region in the training sample is determined, before performing regression processing on the second foreground region, it may be determined whether the first class loss or the overlap ratio loss corresponding to the training sample meets the requirement.
The first category is a category of the second candidate region in the training sample, and may include a foreground region or a background region, and the first category loss may be a ratio of the number of the second foreground regions to the number of the labeling target regions.
In one example, the first class loss corresponding to the training samples satisfies the requirement, which may include:
the ratio of the number of the first type target areas in the training sample to the number of the pre-labeled target areas is smaller than a preset proportion threshold;
the first class loss corresponding to the training sample does not meet the requirement, and may include:
the ratio of the number of the first type target areas in the training sample to the number of the pre-labeled target areas is greater than or equal to a preset proportion threshold.
Specifically, after determining the second foreground region in the training sample, a ratio of the number of pre-labeled target regions (referred to herein as first type target regions) in the training sample for which there is no corresponding second foreground region to the number of pre-labeled target regions may be further determined.
For example, for a training sample, assume that the number of labeled target regions in the training sample is 10 (target regions 1 to 10), and that the second foreground regions extracted in steps S100a to S100c include second foreground regions corresponding to target regions 1 to 9 (i.e. second candidate regions whose overlap with any of target regions 1 to 9 is higher than the first preset overlap threshold). If labeled target region 10 has no corresponding second foreground region (i.e. there is no second candidate region whose overlap with labeled target region 10 is higher than the first preset overlap threshold, so labeled target region 10 is a first-type target region), then the ratio of the number of first-type target regions in the training sample to the number of pre-labeled target regions is 1/10 × 100% = 10%.
In this embodiment, after the ratio of the number of first-type target regions in the training sample is determined, it may be determined whether the ratio is smaller than a preset ratio threshold (which may be set according to the actual scene, e.g. 5% or 10%); if so, it is determined that the first-class loss corresponding to the training sample meets the requirement; otherwise, it is determined that the first-class loss corresponding to the training sample does not meet the requirement.
In one example, that the overlap loss corresponding to the training sample meets the requirement may include:
the average of the overlaps between each second foreground region in the training sample and its corresponding labeled target region is greater than or equal to a second preset overlap threshold, where the second preset overlap threshold is greater than the first preset overlap threshold;
that the overlap loss corresponding to the training sample does not meet the requirement includes:
the average of the overlaps between each second foreground region in the training sample and its corresponding labeled target region is smaller than the second preset overlap threshold.
Specifically, after the second foreground regions in the training sample are determined, the average of the overlaps between each second foreground region in the training sample and its corresponding labeled target region may further be determined.
For example, assume that a training sample includes 10 labeled target regions (labeled target regions 1–10) and 10 second foreground regions (second foreground regions 1–10), where labeled target region 1 corresponds to second foreground region 1, labeled target region 2 corresponds to second foreground region 2, …, and labeled target region 10 corresponds to second foreground region 10. If the overlap between labeled target region i and second foreground region i is overlap_i (i = 1, 2, …, 10), then the average of the overlaps between the second foreground regions and their corresponding labeled target regions in the training sample is (overlap_1 + overlap_2 + … + overlap_10) / 10.
In this embodiment, after the average of the overlaps between each second foreground region and its corresponding labeled target region in the training sample is determined, it may be determined whether the average is greater than or equal to a preset overlap threshold (referred to herein as the second preset overlap threshold, which may be set according to the actual scene), where the second preset overlap threshold is greater than the first preset overlap threshold.
If the average is greater than or equal to the second preset overlap threshold, it is determined that the overlap loss corresponding to the training sample meets the requirement; otherwise, it is determined that the overlap loss corresponding to the training sample does not meet the requirement.
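A sketch of the two checks just described on a single training sample: the first-class loss as the fraction of labeled target regions left without a matching second foreground region, and the overlap loss as the mean overlap of the matched pairs. The thresholds and data structures are assumptions made for illustration.

```python
def losses_satisfied(labeled_regions, matched_overlaps,
                     ratio_threshold=0.10, second_overlap_threshold=0.90):
    """Check the first-class loss and the overlap loss for one training sample.

    `matched_overlaps` maps the index of a labeled target region to the overlap ratio
    of its corresponding second foreground region; labeled regions absent from the map
    are first-type target regions (no corresponding second foreground region).
    """
    num_labeled = len(labeled_regions)
    num_unmatched = sum(1 for idx in range(num_labeled) if idx not in matched_overlaps)
    first_class_ok = (num_unmatched / num_labeled) < ratio_threshold

    overlaps = list(matched_overlaps.values())
    mean_overlap = sum(overlaps) / len(overlaps) if overlaps else 0.0
    overlap_ok = mean_overlap >= second_overlap_threshold

    # parameter optimization of the first/second cascaded networks continues
    # until both checks pass
    return first_class_ok and overlap_ok
```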
Further, in this embodiment, in order to improve the detection accuracy of the third cascaded convolutional neural network, after step S100d the method may further include:
when the average of the sums of distances between corresponding first-type and second-type feature points in the training sample is greater than a preset distance threshold, optimizing the coefficients of the third cascaded convolutional neural network and repeating the training of the third cascaded convolutional neural network until the average of the sums of distances between corresponding first-type and second-type feature points in the training sample is less than or equal to the preset distance threshold.
In this embodiment, after performing regression processing on the second foreground region, it is further required to determine whether the loss of the feature point of the training target region meets the requirement.
In an example, the feature points of each training target region in the training samples satisfy a loss requirement, which may include:
the average value of the sum of the distances between the corresponding first type feature points and the second type feature points in the training sample is less than or equal to a preset distance threshold;
the above-mentioned feature point loss of the training target region in the training sample does not meet the requirement, and may include:
and the average value of the sum of the distances between the corresponding first type characteristic points and the second type characteristic points in the training sample is greater than a preset distance threshold value.
Specifically, for any training target region in the training sample, the sum of distances between each feature point (referred to as a first type feature point herein) of the training target region and the corresponding feature point (referred to as a second type feature point herein) of the labeling target region may be determined, and further, the average of the sums of distances between the first type feature point and the corresponding second type feature point in the training sample may be determined.
For example, the average value D of the sums of distances between corresponding first-type and second-type feature points in the training sample may be determined as:
D = (1/m) · Σ_{j=1..m} d_j, where d_j = Σ_{i=1..n_j} √((x_detect_i − x_target_i)² + (y_detect_i − y_target_i)²)
where m is the number of training target regions in the training sample, n_j is the number of first-type feature points of training target region j in the training sample, d_j is the sum of the distances between each first-type feature point in training target region j and the corresponding second-type feature point in the corresponding labeled target region j, (x_detect_i, y_detect_i) are the coordinates of first-type feature point i in training target region j in the training sample, and (x_target_i, y_target_i) are the coordinates of second-type feature point i in the labeled target region j corresponding to training target region j.
In this embodiment, after determining an average value of the sum of distances between the first type feature point and the second type feature point corresponding to each other in the training sample, it may be determined whether the average value is greater than a preset distance threshold, and if so, it is determined that the feature point loss of the training target area in the training sample does not meet the requirement; otherwise, determining that the characteristic point loss of the training target area in the training sample meets the requirement.
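A sketch of this feature-point loss under the reconstruction above (the average over the m training target regions of the per-region sum of Euclidean distances); the function and parameter names are illustrative only.

```python
import math

def feature_point_loss(detected_regions, labeled_regions):
    """Average over regions of the summed distances between corresponding feature points.

    Each region is a list of (x, y) feature points; detected_regions[j] and
    labeled_regions[j] are assumed to hold corresponding points in the same order.
    """
    per_region_sums = []
    for detected, labeled in zip(detected_regions, labeled_regions):
        d_j = sum(math.dist(p, q) for p, q in zip(detected, labeled))
        per_region_sums.append(d_j)
    return sum(per_region_sums) / max(len(per_region_sums), 1)  # the average value D

# training of the third cascaded network would repeat until
# feature_point_loss(...) <= the preset distance threshold
```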
In the embodiment of the present application, after each training sample in the training set has been processed according to the above steps S100a to S100d, target detection may be performed with the trained first, second, and third cascaded convolutional neural networks according to the method flow of steps S100 to S130.
Further, in one embodiment of the present application, after the first-class loss and the overlap loss corresponding to the training sample satisfy the requirements, the method may further include:
classifying the second foreground area through a third cascaded convolutional neural network to obtain a target class in the training sample;
and if the second-class loss corresponding to the training sample does not meet the requirement, optimizing the coefficients of the third cascaded convolutional neural network and repeating the training of the third cascaded convolutional neural network until the second-class loss corresponding to the training sample meets the requirement.
In the embodiment of the application, when the target is detected, the type of the target can be detected.
Accordingly, in this embodiment, when the third cascaded convolutional neural network is trained, the third cascaded convolutional neural network may also be trained to recognize a target class (referred to herein as a second class).
The second category may include pedestrians, vehicles (including automobiles, non-automobiles, etc.), and the like.
In this embodiment, after the second foreground regions in the training sample are determined and the first-class loss and the overlap loss of the training sample satisfy the requirements, the second foreground regions may further be classified by the third cascaded convolutional neural network to determine the target classes in the training sample. That is, after the second foreground regions are input into the third cascaded convolutional neural network, not only can regression be performed to obtain the training target regions, but the target class (i.e. the second class) of each training target region can also be obtained.
After the target class of each training target area in the training sample is determined, whether the second class loss in the training sample meets the requirement can be further judged. If so, ending the training of the training sample; otherwise, optimizing the coefficient of the third cascade convolutional neural network, and repeating the training for the third cascade convolutional neural network until the second class loss corresponding to the training sample meets the requirement.
In other words, in this embodiment, when the third cascaded convolutional neural network is trained, it must be ensured that both the feature-point loss and the second-class loss corresponding to any training sample meet the requirements; otherwise, coefficient optimization and retraining are required.
In an example, the second class loss corresponding to the training sample satisfies the requirement, and may include:
the identification accuracy of the target category corresponding to the training sample is greater than or equal to a preset accuracy threshold;
the second type loss corresponding to the training sample does not meet the requirement, and may include:
the identification accuracy of the target category corresponding to the training sample is smaller than a preset accuracy threshold.
Specifically, for any training sample, after the target class identification is performed through the third cascaded convolutional neural network, the identification accuracy of the target class corresponding to the training sample can be determined.
For example, if a training sample includes 10 labeled target regions, the number of the training target regions determined by the third hierarchical convolutional neural network is 9, and the second class identified by 8 training target regions is consistent with the target class of the corresponding labeled target region, the identification accuracy of the target class corresponding to the training sample is 80%.
In this embodiment, after the recognition accuracy of the target class corresponding to the training sample is determined, it may be determined whether the recognition accuracy is greater than or equal to a preset accuracy threshold (which may be set according to an actual scene), and if so, it is determined that the loss of the second class corresponding to the training sample meets the requirement; otherwise, determining that the second category loss corresponding to the training sample does not meet the requirement.
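For completeness, a small sketch of the second-class loss check described above (recognition accuracy of the target classes against a preset accuracy threshold); the matching of training target regions to labeled regions is assumed to be given.

```python
def second_class_loss_satisfied(predicted_classes, labeled_classes, accuracy_threshold=0.8):
    """`predicted_classes` maps the index of a labeled target region to the class
    predicted for its matched training target region (unmatched labeled regions
    are simply absent from the map and count against the accuracy)."""
    correct = sum(1 for idx, cls in predicted_classes.items() if labeled_classes[idx] == cls)
    accuracy = correct / len(labeled_classes)   # e.g. 8 correct out of 10 labeled -> 80%
    return accuracy >= accuracy_threshold
```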
Accordingly, in this embodiment, after determining the first foreground regions in the first candidate regions and the confidence degrees of the first foreground regions through the pre-trained second cascade convolutional neural network, the method may further include:
and classifying the first foreground region with the reliability meeting the preset condition through a pre-trained third-level convolution neural network to obtain the target category of each target region in the image to be detected.
In this embodiment, if the target-class recognition capability of the third cascaded convolutional neural network has been trained during training of the cascaded convolutional neural networks, target-class recognition may also be performed when the trained cascaded convolutional neural networks are used for target detection.
Correspondingly, after the first foreground regions among the first candidate regions and the confidence of each first foreground region are determined through the pre-trained second cascaded convolutional neural network, regression and classification may be performed, through the pre-trained third cascaded convolutional neural network, on the first foreground regions whose confidence satisfies the preset condition, so as to obtain the detection target regions in the image to be detected and the target class of each detection target region.
In the embodiment of the present application, the image to be detected is input into the pre-trained first cascaded convolutional neural network to obtain a first feature map of the image to be detected; first candidate regions carrying angle information are extracted from the first feature map according to the preset parameters; first foreground regions among the first candidate regions and the confidence of each first foreground region are then determined through the pre-trained second cascaded convolutional neural network; and regression is performed, through the pre-trained third cascaded convolutional neural network, on the first foreground regions whose confidence satisfies the preset condition to obtain the target regions in the image to be detected, which improves the applicability of target detection to multi-angle scenes and the accuracy of target detection in such scenes.
The methods provided herein are described above. The following describes the apparatus provided in the present application:
referring to fig. 4, a schematic structural diagram of an object detection apparatus provided in an embodiment of the present application is shown in fig. 4, where the object detection apparatus may include:
a first extraction unit 410, configured to input an image to be detected into a pre-trained first cascaded convolutional neural network to obtain a first feature map of the image to be detected;
a second extraction unit 420, configured to extract first candidate regions from the first feature map according to preset parameters, where the preset parameters include an x-axis direction angle and/or a y-axis direction angle;
a determining unit 430, configured to determine, through a pre-trained second cascaded convolutional neural network, first foreground regions among the first candidate regions and the confidence of each first foreground region, where a first foreground region is a first candidate region whose confidence is greater than or equal to a preset confidence threshold;
and a processing unit 440, configured to perform regression, through a pre-trained third cascaded convolutional neural network, on the first foreground regions whose confidence satisfies a preset condition to obtain the target regions in the image to be detected.
In an alternative embodiment, the first candidate region is a parallelogram region, and the preset parameters include a scale, an aspect ratio, a center point position, an x-axis direction angle, and a y-axis direction angle.
In an optional embodiment, the first extracting unit 410 is further configured to, for any training sample in the training set, input the training sample into the first cascade convolutional neural network to obtain a second feature map corresponding to the training sample;
the second extracting unit 420 is further configured to determine a second candidate region in the second feature map according to the preset parameter;
the determining unit 430 is further configured to determine, through a second concatenated convolutional neural network, a second foreground region in the second candidate region; the second foreground region is a second candidate region, and the coincidence degree of the second foreground region and a target region which is labeled in advance in the training sample is higher than a first preset coincidence degree threshold value;
the processing unit 440 is further configured to perform regression processing on the second foreground region through a third cascaded convolutional neural network to obtain a target region in the training sample.
In an alternative embodiment, as shown in fig. 5, the apparatus further comprises:
a parameter optimization unit 450, configured to perform parameter optimization on the cascaded network combination of the first cascaded convolutional neural network and the second cascaded convolutional neural network according to the first-class loss and/or the overlap loss corresponding to the training sample, until the first-class loss and the overlap loss corresponding to the training sample meet the requirements.
In an optional implementation, the parameter optimization unit 450 is specifically configured to: when the ratio of the number of first-type target regions in the training sample to the number of pre-labeled target regions is greater than or equal to a preset ratio threshold, perform parameter optimization on the cascaded network combination of the first cascaded convolutional neural network and the second cascaded convolutional neural network, so that when the training sample is input again after parameter optimization, the ratio of the number of first-type target regions in the training sample to the number of pre-labeled target regions is smaller than the preset ratio threshold, where a first-type target region is a pre-labeled target region that has no corresponding second foreground region; and/or
when the average of the overlaps between each second foreground region in the training sample and its corresponding pre-labeled target region is smaller than a second preset overlap threshold, perform parameter optimization on the cascaded network combination of the first cascaded convolutional neural network and the second cascaded convolutional neural network, so that when the training sample is input again after parameter optimization, the average of the overlaps between each second foreground region in the training sample and its corresponding pre-labeled target region is greater than or equal to the second preset overlap threshold, where the second preset overlap threshold is greater than the first preset overlap threshold.
In an optional embodiment, the parameter optimization unit 450 is specifically configured to optimize model parameters of the first cascaded convolutional neural network and/or the second cascaded convolutional neural network.
In an optional implementation manner, the parameter optimization unit 450 is specifically configured to, for any second candidate region and a pre-labeled target region, determine a degree of overlap between the second candidate region and the pre-labeled target region according to the following formula:
overlap = (S_candidate ∩ S_target) / (S_candidate ∪ S_target)
where S_candidate is the area of the second candidate region, S_target is the area of the pre-labeled target region, S_candidate ∩ S_target is the area of the overlapping part of the second candidate region and the pre-labeled target region, and S_candidate ∪ S_target is the total area covered by the second candidate region and the pre-labeled target region.
In an optional embodiment, the parameter optimization unit 450 is further configured to, when an average value of the sum of the distances between the corresponding first-type feature points and the corresponding second-type feature points in the training sample is greater than a preset distance threshold, optimize coefficients of the third cascaded convolutional neural network, and repeat training for the third cascaded convolutional neural network until the average value of the sum of the distances between the corresponding first-type feature points and the corresponding second-type feature points in the training sample is less than or equal to the preset distance threshold;
the first type feature points are feature points of each target area in the training sample, and the second type feature points corresponding to the first type feature points are corresponding feature points of a pre-marked target area corresponding to each target area.
In an optional implementation, the parameter optimization unit 450 is specifically configured to determine the average value D of the sums of distances between corresponding first-type and second-type feature points in the training sample as:
D = (1/m) · Σ_{j=1..m} d_j, where d_j = Σ_{i=1..n_j} √((x_detect_i − x_target_i)² + (y_detect_i − y_target_i)²)
where m is the number of target regions in the training sample, n_j is the number of first-type feature points of target region j in the training sample, d_j is the sum of the distances between each first-type feature point in target region j in the training sample and the corresponding second-type feature point in the corresponding pre-labeled target region j, (x_detect_i, y_detect_i) are the coordinates of first-type feature point i in target region j in the training sample, and (x_target_i, y_target_i) are the coordinates of second-type feature point i in the pre-labeled target region j corresponding to target region j.
In an optional implementation, the processing unit 440 is further configured to classify the second foreground regions through the third cascaded convolutional neural network to obtain the target classes in the training sample;
the parameter optimization unit 450 is further configured to optimize a coefficient of the third cascaded convolutional neural network if the second class loss corresponding to the training sample does not meet the requirement, and repeat training on the third cascaded convolutional neural network until the second class loss corresponding to the training sample meets the requirement.
In an optional embodiment, the second class loss corresponding to the training sample meets the requirement, including:
the identification accuracy of the target category corresponding to the training sample is greater than or equal to a preset accuracy threshold;
the second category loss corresponding to the training sample does not meet the requirement, and the method comprises the following steps:
the identification accuracy of the target category corresponding to the training sample is smaller than a preset accuracy threshold.
In an optional implementation, the processing unit 440 is further configured to classify, through the pre-trained third cascaded convolutional neural network, the first foreground regions whose confidence satisfies the preset condition, so as to obtain the target class of each target region in the image to be detected.
Fig. 6 is a schematic diagram of a hardware structure of a target detection apparatus according to an embodiment of the present disclosure. The object detection apparatus may include a processor 601, a machine-readable storage medium 602 having machine-executable instructions stored thereon. The processor 601 and the machine-readable storage medium 602 may communicate via a system bus 603. Also, by reading and executing machine-executable instructions in the machine-readable storage medium 602 corresponding to the object detection logic, the processor 601 may perform the object detection method described above.
The machine-readable storage medium 602 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk or DVD), a similar storage medium, or a combination thereof.
Embodiments of the present application also provide a machine-readable storage medium, such as the machine-readable storage medium 602 in Fig. 6, comprising machine-executable instructions that are executable by the processor 601 in a target detection apparatus to implement the target detection method described above.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims (26)

1. A method of object detection, comprising:
inputting an image to be detected into a pre-trained first cascade convolution neural network to obtain a first characteristic diagram of the image to be detected;
extracting a first candidate region from the first feature map according to preset parameters; the preset parameters at least comprise an x-axis direction angle and/or a y-axis direction angle;
determining first foreground areas in the first candidate areas and confidence coefficients of the first foreground areas through a pre-trained second cascade convolution neural network; the first foreground region is a first candidate region with a confidence coefficient greater than or equal to a preset confidence coefficient threshold;
and performing regression processing on the first foreground region whose confidence meets a preset condition through a pre-trained third cascaded convolutional neural network, to obtain a target region in the image to be detected.
2. The method according to claim 1, wherein the first candidate region is a parallelogram region, and the preset parameters include a scale, an aspect ratio, a center point position, an x-axis direction angle, and a y-axis direction angle.
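Claim 2 parameterizes each first candidate region as a parallelogram by scale, aspect ratio, center point position, and x-/y-axis direction angles. The sketch below shows one plausible way to turn such preset parameters into the four corner points; the exact parameterization (side lengths derived from scale and aspect ratio, angles measured from the image axes, radians as the unit) is an assumption for illustration, not the application's own definition.

```python
import math

def parallelogram_corners(cx, cy, scale, aspect_ratio, angle_x, angle_y):
    """Build the four corners of a parallelogram candidate region.

    Illustrative assumptions (not fixed by the application): the first pair
    of sides has length `scale` and is rotated by `angle_x` (radians) from
    the image x-axis; the second pair has length `scale * aspect_ratio` and
    is rotated by `angle_y` (radians) from the image y-axis; (cx, cy) is the
    center point position.
    """
    w = scale                  # side length along the x-direction angle
    h = scale * aspect_ratio   # side length along the y-direction angle

    # Edge vector rotated by angle_x away from the x-axis.
    ux, uy = w * math.cos(angle_x), w * math.sin(angle_x)
    # Edge vector rotated by angle_y away from the y-axis.
    vx, vy = h * math.sin(angle_y), h * math.cos(angle_y)

    # Corners are the center offset by +/- half of each edge vector.
    return [
        (cx - 0.5 * (ux + vx), cy - 0.5 * (uy + vy)),
        (cx + 0.5 * (ux - vx), cy + 0.5 * (uy - vy)),
        (cx + 0.5 * (ux + vx), cy + 0.5 * (uy + vy)),
        (cx - 0.5 * (ux - vx), cy - 0.5 * (uy - vy)),
    ]

# Example: with both angles zero the region degenerates to an axis-aligned
# 32 x 16 rectangle centered at (100, 80).
corners = parallelogram_corners(cx=100, cy=80, scale=32, aspect_ratio=0.5,
                                angle_x=0.0, angle_y=0.0)
```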
3. The method of claim 1, wherein the cascaded first, second, and third convolutional neural networks are trained by:
inputting any training sample in a training set into a first cascade convolution neural network to obtain a second feature map corresponding to the training sample;
determining a second candidate region in the second feature map according to the preset parameters;
determining, by a second cascaded convolutional neural network, a second foreground region in the second candidate regions; wherein the second foreground region is a second candidate region whose degree of overlap with a pre-labeled target region in the training sample is higher than a first preset overlap threshold;
and performing regression processing on the second foreground region through a third cascaded convolutional neural network to obtain a target region in the training sample.
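During training (claim 3), a second candidate region is kept as a second foreground region when its overlap with some pre-labeled target region exceeds the first preset overlap threshold. A minimal sketch of that selection step follows; `overlap` stands for the region-overlap measure of claim 7, and the 0.5 threshold is only an illustrative value, not one stated by the application.

```python
def select_second_foreground_regions(candidates, labeled_targets, overlap,
                                     first_overlap_threshold=0.5):
    """Keep the second candidate regions whose overlap with some pre-labeled
    target region exceeds the first preset overlap threshold (sketch only).

    `overlap` is a callable implementing the region-overlap measure of
    claim 7; the threshold value is illustrative.
    """
    foreground = []
    for candidate in candidates:
        best = max((overlap(candidate, target) for target in labeled_targets),
                   default=0.0)
        if best > first_overlap_threshold:
            foreground.append(candidate)
    return foreground
```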
4. The method of claim 3, wherein after determining the second foreground region of the second candidate regions by the second cascaded convolutional neural network, the method further comprises:
and according to the first class loss and/or the overlap loss corresponding to the training sample, performing parameter optimization on the network combination of the cascaded first and second convolutional neural networks, until the first class loss and the overlap loss corresponding to the training sample meet the requirements.
5. The method of claim 4, wherein performing parameter optimization on the network combination of the cascaded first and second convolutional neural networks according to the first class loss and/or the overlap loss corresponding to the training sample comprises:
when the ratio of the number of first-type target regions in the training sample to the number of pre-labeled target regions is greater than or equal to a preset ratio threshold, performing parameter optimization on the network combination of the cascaded first and second convolutional neural networks, so that, when the training sample is input again after the parameter optimization, the ratio of the number of first-type target regions in the training sample to the number of pre-labeled target regions is smaller than the preset ratio threshold; wherein a first-type target region is a pre-labeled target region without a corresponding second foreground region; and/or,
when the average value of the degrees of overlap between each second foreground region in the training sample and its corresponding pre-labeled target region is smaller than a second preset overlap threshold, performing parameter optimization on the network combination of the cascaded first and second convolutional neural networks, so that, when the training sample is input again after the parameter optimization, the average value of the degrees of overlap between each second foreground region in the training sample and its corresponding pre-labeled target region is greater than or equal to the second preset overlap threshold; wherein the second preset overlap threshold is greater than the first preset overlap threshold.
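Claim 5 thus gives two alternative triggers for re-optimizing the first/second network combination: too many pre-labeled targets left without a matching second foreground region, or too low an average overlap between the foreground regions and their matched targets. A compact sketch of both checks follows; the 0.1 and 0.7 thresholds are illustrative values, not thresholds stated by the application.

```python
def needs_reoptimization(missed_targets, total_targets, overlaps,
                         ratio_threshold=0.1, second_overlap_threshold=0.7):
    """Return True when either trigger of claim 5 fires (illustrative values).

    missed_targets: number of first-type target regions, i.e. pre-labeled
                    targets with no corresponding second foreground region.
    total_targets:  number of pre-labeled target regions in the sample.
    overlaps:       overlap of each second foreground region with its
                    corresponding pre-labeled target region.
    """
    miss_ratio = missed_targets / total_targets if total_targets else 0.0
    mean_overlap = sum(overlaps) / len(overlaps) if overlaps else 0.0

    trigger_1 = miss_ratio >= ratio_threshold             # too many missed targets
    trigger_2 = mean_overlap < second_overlap_threshold   # foreground fits too loosely
    return trigger_1 or trigger_2
```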
6. The method according to claim 4 or 5,
performing parameter optimization on the network combination of the cascaded first and second convolutional neural networks includes:
optimizing model parameters of the first cascaded convolutional neural network and/or the second cascaded convolutional neural network.
7. A method according to claim 3, wherein for any second candidate region and pre-labelled target region, the degree of overlap of the second candidate region with the pre-labelled target region is determined by:
overlap = (S_candidate ∩ S_target) / (S_candidate ∪ S_target)
wherein S_candidate is the area of the second candidate region, S_target is the area of the pre-labeled target region, S_candidate ∩ S_target is the area of the overlapping portion of the second candidate region and the pre-labeled target region, and S_candidate ∪ S_target is the total area covered by the second candidate region and the pre-labeled target region.
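The overlap measure of claim 7 is the ratio of intersection area to union area of the two regions. Because the candidate regions here are parallelograms rather than axis-aligned boxes, one way to evaluate it for arbitrary convex quadrilaterals is with a polygon library; the sketch below uses shapely purely as an illustrative choice, not a library named by the application.

```python
from shapely.geometry import Polygon

def region_overlap(candidate_corners, target_corners):
    """overlap = area(intersection) / area(union) for two polygonal regions.

    Each argument is a list of (x, y) corner points, e.g. the parallelogram
    corners of a second candidate region and of a pre-labeled target region.
    """
    candidate = Polygon(candidate_corners)
    target = Polygon(target_corners)
    inter = candidate.intersection(target).area
    union = candidate.union(target).area
    return inter / union if union > 0 else 0.0

# Example: two unit squares shifted by half a unit -> overlap = 1/3.
a = [(0, 0), (1, 0), (1, 1), (0, 1)]
b = [(0.5, 0), (1.5, 0), (1.5, 1), (0.5, 1)]
print(region_overlap(a, b))  # 0.333...
```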
8. The method of claim 3, wherein after performing regression processing on the second foreground region through a third cascaded convolutional neural network to obtain a target region in the training sample, the method further comprises:
when the average value of the sum of the distances between the corresponding first type feature points and the second type feature points in the training sample is larger than a preset distance threshold, optimizing the coefficient of the third cascade convolution neural network, and repeating the training for the third cascade convolution neural network until the average value of the sum of the distances between the corresponding first type feature points and the second type feature points in the training sample is smaller than or equal to the preset distance threshold;
the first type feature points are feature points of each target area in the training sample, and the second type feature points corresponding to the first type feature points are corresponding feature points of a pre-marked target area corresponding to each target area.
9. The method of claim 8, wherein the average value D of the sum of the distances between the corresponding feature points of the first type and the feature points of the second type in the training sample is determined by the following formula:
where m is the number of target regions in the training sample, n_j is the number of first-type feature points of target region j in the training sample, d_j is the sum of the distances between each first-type feature point in target region j in the training sample and the corresponding second-type feature point in the pre-labeled target region j, (x_detect_i, y_detect_i) are the coordinates of the first-type feature point i in target region j in the training sample, and (x_target_i, y_target_i) are the coordinates of the corresponding second-type feature point i in the pre-labeled target region j.
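A small numeric sketch of the distance criterion of claims 8-9 follows. It assumes Euclidean distances between matched feature points and a plain mean of the per-region distance sums over the m target regions, which is one consistent reading of the variable definitions above rather than the application's own stated formula.

```python
import numpy as np

def average_feature_distance(detected_points, target_points):
    """Average value D of the per-region sums of feature-point distances.

    detected_points / target_points: lists of m arrays, one per target region;
    the j-th pair holds the n_j matched (x, y) feature points of the detected
    region and of its pre-labeled counterpart. Euclidean distance and a plain
    mean over regions are assumptions, not the application's stated formula.
    """
    sums = []
    for det, tgt in zip(detected_points, target_points):
        det = np.asarray(det, dtype=float)
        tgt = np.asarray(tgt, dtype=float)
        sums.append(np.linalg.norm(det - tgt, axis=1).sum())  # d_j
    return float(np.mean(sums)) if sums else 0.0              # D

# The third network keeps training while D stays above the preset distance threshold.
```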
10. The method of claim 3, wherein after the first class loss and the overlap loss corresponding to the training sample meet the requirements, the method further comprises:
classifying the second foreground area through the third cascade convolution neural network to obtain a target class in the training sample;
and if the second class loss corresponding to the training sample does not meet the requirement, optimizing the coefficients of the third cascaded convolutional neural network, and repeating the training for the third cascaded convolutional neural network until the second class loss corresponding to the training sample meets the requirement.
11. The method of claim 10, wherein that the second class loss corresponding to the training sample meets the requirement comprises:
the identification accuracy of the target category corresponding to the training sample is greater than or equal to a preset accuracy threshold;
and that the second class loss corresponding to the training sample does not meet the requirement comprises:
the identification accuracy of the target category corresponding to the training sample is smaller than the preset accuracy threshold.
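Claims 10-11 stop the classification training of the third network once the identification accuracy of the target categories reaches the preset accuracy threshold. The check itself is a one-liner, sketched here with an illustrative 0.9 threshold (not a value stated by the application).

```python
def second_class_loss_meets_requirement(predicted_classes, true_classes,
                                        accuracy_threshold=0.9):
    """True once identification accuracy reaches the preset threshold
    (0.9 is only an illustrative value)."""
    correct = sum(p == t for p, t in zip(predicted_classes, true_classes))
    accuracy = correct / len(true_classes) if true_classes else 0.0
    return accuracy >= accuracy_threshold
```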
12. The method of claim 10, wherein after determining the first foreground regions in the first candidate regions and the confidence levels of the first foreground regions through a pre-trained second cascaded convolutional neural network, further comprising:
and classifying the first foreground region whose confidence meets the preset condition through the pre-trained third cascaded convolutional neural network, to obtain the target category of each target region in the image to be detected.
13. An object detection device, comprising:
the first extraction unit is used for inputting an image to be detected into a pre-trained first cascade convolution neural network so as to obtain a first characteristic diagram of the image to be detected;
the second extraction unit is used for extracting a first candidate region from the first feature map according to preset parameters; the preset parameters comprise an x-axis direction angle and/or a y-axis direction angle;
the determining unit is used for determining first foreground areas in the first candidate areas and the confidence coefficients of the first foreground areas through a pre-trained second cascade convolution neural network; the first foreground region is a first candidate region with a confidence coefficient greater than or equal to a preset confidence coefficient threshold;
and the processing unit is used for performing regression processing on the first foreground region whose confidence meets a preset condition through a pre-trained third cascaded convolutional neural network, so as to obtain a target region in the image to be detected.
14. The apparatus of claim 13, wherein the first candidate region is a parallelogram region, and the preset parameters comprise a scale, an aspect ratio, a center point position, an x-axis direction angle, and a y-axis direction angle.
15. The apparatus of claim 13,
the first extraction unit is further configured to input any training sample in the training set into the first cascade convolutional neural network to obtain a second feature map corresponding to the training sample;
the second extraction unit is further configured to determine a second candidate region in the second feature map according to the preset parameter;
the determining unit is further configured to determine a second foreground region in the second candidate regions through a second cascaded convolutional neural network; wherein the second foreground region is a second candidate region whose degree of overlap with a pre-labeled target region in the training sample is higher than a first preset overlap threshold;
and the processing unit is further configured to perform regression processing on the second foreground region through a third cascaded convolutional neural network to obtain a target region in the training sample.
16. The apparatus of claim 15, further comprising:
and the parameter optimization unit is used for performing parameter optimization on the network combination of the cascaded first and second convolutional neural networks according to the first class loss and/or the overlap loss corresponding to the training sample, until the first class loss and the overlap loss corresponding to the training sample meet the requirements.
17. The apparatus of claim 16,
the parameter optimization unit is specifically configured to, when the ratio of the number of first-type target regions in the training sample to the number of pre-labeled target regions is greater than or equal to a preset ratio threshold, perform parameter optimization on the network combination of the cascaded first and second convolutional neural networks, so that, when the training sample is input again after the parameter optimization, the ratio of the number of first-type target regions in the training sample to the number of pre-labeled target regions is smaller than the preset ratio threshold; wherein a first-type target region is a pre-labeled target region without a corresponding second foreground region; and/or,
when the average value of the degrees of overlap between each second foreground region in the training sample and its corresponding pre-labeled target region is smaller than a second preset overlap threshold, perform parameter optimization on the network combination of the cascaded first and second convolutional neural networks, so that, when the training sample is input again after the parameter optimization, the average value of the degrees of overlap between each second foreground region in the training sample and its corresponding pre-labeled target region is greater than or equal to the second preset overlap threshold; wherein the second preset overlap threshold is greater than the first preset overlap threshold.
18. The apparatus of claim 16 or 17,
the parameter optimization unit is specifically configured to optimize model parameters of the first cascaded convolutional neural network and/or the second cascaded convolutional neural network.
19. The apparatus of claim 15,
the parameter optimization unit is specifically configured to, for any second candidate region and a pre-labeled target region, determine an overlap ratio overlap of the second candidate region and the pre-labeled target region according to the following formula:
overlap = (S_candidate ∩ S_target) / (S_candidate ∪ S_target)
wherein S_candidate is the area of the second candidate region, S_target is the area of the pre-labeled target region, S_candidate ∩ S_target is the area of the overlapping portion of the second candidate region and the pre-labeled target region, and S_candidate ∪ S_target is the total area covered by the second candidate region and the pre-labeled target region.
20. The apparatus of claim 16,
the parameter optimization unit is further configured to, when an average value of a sum of distances between the first type feature point and the second type feature point corresponding to the training sample is greater than a preset distance threshold, optimize a coefficient of the third cascade convolutional neural network, and repeat training for the third cascade convolutional neural network until the average value of the sum of distances between the first type feature point and the second type feature point corresponding to the training sample is less than or equal to the preset distance threshold;
the first type feature points are feature points of each target area in the training sample, and the second type feature points corresponding to the first type feature points are corresponding feature points of a pre-marked target area corresponding to each target area.
21. The apparatus of claim 20,
the parameter optimization unit is specifically configured to determine an average value D of sums of distances between corresponding first-type feature points and second-type feature points in the training sample by using the following formula:
where m is the number of target regions in the training sample, n_j is the number of first-type feature points of target region j in the training sample, d_j is the sum of the distances between each first-type feature point in target region j in the training sample and the corresponding second-type feature point in the pre-labeled target region j, (x_detect_i, y_detect_i) are the coordinates of the first-type feature point i in target region j in the training sample, and (x_target_i, y_target_i) are the coordinates of the corresponding second-type feature point i in the pre-labeled target region j.
22. The apparatus of claim 16,
the processing unit is further configured to perform classification processing on the second foreground region through the third cascade convolution neural network to obtain a target class in the training sample;
the parameter optimization unit is further configured to optimize a coefficient of the third cascade convolutional neural network if the second class loss corresponding to the training sample does not meet the requirement, and repeat training on the third cascade convolutional neural network until the second class loss corresponding to the training sample meets the requirement.
23. The apparatus of claim 22, wherein that the second class loss corresponding to the training sample meets the requirement comprises:
the identification accuracy of the target category corresponding to the training sample is greater than or equal to a preset accuracy threshold;
and that the second class loss corresponding to the training sample does not meet the requirement comprises:
the identification accuracy of the target category corresponding to the training sample is smaller than the preset accuracy threshold.
24. The apparatus of claim 22,
the processing unit is further configured to classify the first foreground region whose confidence meets the preset condition through the pre-trained third cascaded convolutional neural network, so as to obtain the target category of each target region in the image to be detected.
25. An object detection apparatus comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor, the processor being caused by the machine-executable instructions to:
inputting an image to be detected into a pre-trained first cascade convolution neural network to obtain a first characteristic diagram of the image to be detected;
extracting a first candidate region from the first feature map according to preset parameters; the preset parameters comprise an x-axis direction angle and/or a y-axis direction angle;
determining first foreground areas in the first candidate areas and confidence coefficients of the first foreground areas through a pre-trained second cascade convolution neural network; the first foreground region is a first candidate region with a confidence coefficient greater than or equal to a preset confidence coefficient threshold;
and performing regression processing on the first foreground region whose confidence meets a preset condition through a pre-trained third cascaded convolutional neural network, to obtain a target region in the image to be detected.
26. A machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to:
inputting an image to be detected into a pre-trained first cascade convolution neural network to obtain a first characteristic diagram of the image to be detected;
extracting a first candidate region from the first feature map according to preset parameters; the preset parameters comprise an x-axis direction angle and/or a y-axis direction angle;
determining first foreground areas in the first candidate areas and confidence coefficients of the first foreground areas through a pre-trained second cascade convolution neural network; the first foreground region is a first candidate region with a confidence coefficient greater than or equal to a preset confidence coefficient threshold;
and performing regression processing on the first foreground region whose confidence meets a preset condition through a pre-trained third cascaded convolutional neural network, to obtain a target region in the image to be detected.
CN201810632279.3A 2018-06-19 2018-06-19 Target detection method and device Active CN110619255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810632279.3A CN110619255B (en) 2018-06-19 2018-06-19 Target detection method and device

Publications (2)

Publication Number Publication Date
CN110619255A true CN110619255A (en) 2019-12-27
CN110619255B CN110619255B (en) 2022-08-26

Family

ID=68920534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810632279.3A Active CN110619255B (en) 2018-06-19 2018-06-19 Target detection method and device

Country Status (1)

Country Link
CN (1) CN110619255B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127204A (en) * 2016-06-30 2016-11-16 华南理工大学 A kind of multi-direction meter reading Region detection algorithms of full convolutional neural networks
CN107871117A (en) * 2016-09-23 2018-04-03 三星电子株式会社 Apparatus and method for detection object
CN106683091A (en) * 2017-01-06 2017-05-17 北京理工大学 Target classification and attitude detection method based on depth convolution neural network
CN107330027A (en) * 2017-06-23 2017-11-07 中国科学院信息工程研究所 A kind of Weakly supervised depth station caption detection method
CN107341517A (en) * 2017-07-07 2017-11-10 哈尔滨工业大学 The multiple dimensioned wisp detection method of Fusion Features between a kind of level based on deep learning
CN107545262A (en) * 2017-07-31 2018-01-05 华为技术有限公司 A kind of method and device that text is detected in natural scene image
CN108121986A (en) * 2017-12-29 2018-06-05 深圳云天励飞技术有限公司 Object detection method and device, computer installation and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIANQI MA et al.: "Arbitrary-Oriented Scene Text Detection via Rotation Proposals", arXiv *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723723A (en) * 2020-06-16 2020-09-29 东软睿驰汽车技术(沈阳)有限公司 Image detection method and device
CN112200003A (en) * 2020-09-14 2021-01-08 浙江大华技术股份有限公司 Method and device for determining feed feeding amount of pig farm
CN112200003B (en) * 2020-09-14 2024-02-20 浙江大华技术股份有限公司 Method and device for determining feed feeding amount in pig farm
CN112308046A (en) * 2020-12-02 2021-02-02 龙马智芯(珠海横琴)科技有限公司 Method, device, server and readable storage medium for positioning text region of image

Also Published As

Publication number Publication date
CN110619255B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
CN108960266B (en) Image target detection method and device
Abedin et al. License plate recognition system based on contour properties and deep learning model
US20070058856A1 (en) Character recoginition in video data
CN108921083B (en) Illegal mobile vendor identification method based on deep learning target detection
US9501703B2 (en) Apparatus and method for recognizing traffic sign board
CN107316036B (en) Insect pest identification method based on cascade classifier
CN103699905B (en) Method and device for positioning license plate
CN104239867B (en) License plate locating method and system
CN112016605B (en) Target detection method based on corner alignment and boundary matching of bounding box
CN110619255B (en) Target detection method and device
CN107194393B (en) Method and device for detecting temporary license plate
CN109657664B (en) License plate type identification method and device and electronic equipment
CN103679187B (en) Image-recognizing method and system
CN106778742B (en) Car logo detection method based on Gabor filter background texture suppression
Prates et al. Brazilian license plate detection using histogram of oriented gradients and sliding windows
CN111078946A (en) Bayonet vehicle retrieval method and system based on multi-target regional characteristic aggregation
CN104361359A (en) Vehicle recognition method based on image detection
CN110826415A (en) Method and device for re-identifying vehicles in scene image
CN110909598A (en) Deep learning-based method for recognizing illegal traffic driving of non-motor vehicle lane
CN114708304B (en) Cross-camera multi-target tracking method, device, equipment and medium
Rio-Alvarez et al. Effects of Challenging Weather and Illumination on Learning‐Based License Plate Detection in Noncontrolled Environments
Asgarian Dehkordi et al. Vehicle type recognition based on dimension estimation and bag of word classification
CN111444816A (en) Multi-scale dense pedestrian detection method based on fast RCNN
Sahu et al. A comparative analysis of deep learning approach for automatic number plate recognition
CN111832349A (en) Method and device for identifying error detection of carry-over object and image processing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant