CN112418344A - Training method, target detection method, medium and electronic device - Google Patents

Training method, target detection method, medium and electronic device

Info

Publication number
CN112418344A
Authority
CN
China
Prior art keywords
training
target
image processing
frame
processing model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011430940.6A
Other languages
Chinese (zh)
Other versions
CN112418344B (en)
Inventor
王海涛
袁德胜
游浩泉
成西锋
任晓双
崔龙
马卫民
林治强
党毅飞
李伟超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Winner Technology Co ltd
Original Assignee
Winner Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Winner Technology Co ltd filed Critical Winner Technology Co ltd
Priority to CN202011430940.6A priority Critical patent/CN112418344B/en
Publication of CN112418344A publication Critical patent/CN112418344A/en
Application granted granted Critical
Publication of CN112418344B publication Critical patent/CN112418344B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a training method, a target detection method, a medium and an electronic device. The training method comprises the following steps: acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data comprises a training image and the position, size, angle and category of a selection frame in the training image; and training the image processing model by using the training data set. The training method can solve the prior-art problems of inaccurate target positioning and a low recognition rate that arise when a horizontal frame is used to detect a tilted target.

Description

Training method, target detection method, medium and electronic device
Technical Field
The invention belongs to the field of data identification, relates to target detection methods, and particularly relates to a training method, a target detection method, a medium and an electronic device.
Background
With the development of technology and the growth of user demands, video monitoring is increasingly widely applied in daily life. In specific application scenarios, monitoring cameras are often installed at a top-down (overhead) view angle due to limitations of the installation environment, the detection range, installation manpower and the like. In this case, human bodies may be tilted to different angles in the picture, and the conventional horizontal rectangular detection frame cannot satisfactorily frame the target subject. Target detection for ordinary-view images has been studied extensively; in those scenes, target positions are mostly delimited by horizontal rectangular frames, and targets are located by regressing the frame parameters. Classical methods include Faster RCNN (Region-based Convolutional Neural Network), which achieves efficient, high-precision target detection through feature sharing and a Region Proposal Network (RPN), and the YoLo (You Only Look Once) model, which directly regresses target coordinates and classifies targets per grid cell; the latter greatly improves detection speed over Faster RCNN and is therefore widely applied in many fields. However, these methods usually select a horizontal frame as the candidate frame, which carries no angle information. Although such candidate frames can accurately and effectively locate objects that are vertical or horizontal in the image, when these methods are applied directly to human body detection at a top-down view angle, the lack of rotation invariance in the target detection model means the resulting target detection frames are usually not accurate enough; for example, the background area inside a located detection frame may be much larger than the area of the target itself, which leads to inaccurate target localization and a low target recognition rate.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, an object of the present invention is to provide a training method, an object detection method, a medium and an electronic device, which are used to solve the problem that a candidate frame selected in the prior art does not have angle information.
To achieve the above and other related objects, a first aspect of the present invention provides a training method for training an image processing model, where the image processing model is used to process a target image to obtain a selection box matching a target object included in the target image, the training method including: acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data comprises a training image and the position, the size, the angle and the category of a selection frame in the training image; and training the image processing model by using the training data set.
In an embodiment of the first aspect, before training the image processing model with the training data set, the training method further includes: and mapping the angle of the selection frame in the training data into a multidimensional array by using a Gaussian function.
In an embodiment of the first aspect, the image processing model includes: the shallow feature extraction module is used for acquiring shallow features of input data of the image processing model; the deep feature extraction module is used for acquiring deep features of input data of the image processing model; the fusion module is connected with the shallow feature extraction module and the deep feature extraction module and is used for fusing the shallow features and the deep features of the input data to obtain a feature map; and the post-processing module is connected with the fusion module and used for processing the feature map output by the fusion module to obtain a plurality of prediction frames.
In an embodiment of the first aspect, a method for implementing training of the image processing model by using the training data set includes: selecting training data from the training data set as current training data; processing the current training data by using the image processing model to obtain a prediction frame corresponding to the current training data; calculating a function value of a loss function according to a prediction box corresponding to the current training data; adjusting parameters and/or architecture of the image processing model according to the function value of the loss function; and repeating the process until the function value of the loss function meets a preset condition.
In an embodiment of the first aspect, the loss function is:

L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) + λ_class · L_cross-entropy(class_target, class_predict) ]

where S² is the number of grid cells of the feature map; λ_conf, λ_coor and λ_class are respectively three weight values; conf_target is the confidence of the target selection frame and conf_predict is the confidence of the prediction frame; coor_target denotes the center-point coordinates, width and height of the target selection frame and coor_predict those of the prediction frame; θ_target is the angle of the target selection frame and θ_predict is the angle of the prediction frame; L_MSE is the mean-square-error loss function, L_focalloss is the focal loss function, and L_cross-entropy is the cross-entropy loss function.
In an embodiment of the first aspect, a detection object of the image processing model is a human body, and the loss function is:
L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) ]

where S² is the number of grid cells of the feature map; λ_conf and λ_coor are respectively two weight values; conf_target is the confidence of the target selection frame and conf_predict is the confidence of the prediction frame; coor_target denotes the center-point coordinates, width and height of the target selection frame and coor_predict those of the prediction frame; θ_target is the angle of the target selection frame and θ_predict is the angle of the prediction frame; L_MSE is the mean-square-error loss function and L_focalloss is the focal loss function.
A second aspect of the present invention provides an object detection method, including: acquiring a target image to be detected, wherein the target image comprises a target object; according to the training method of any one of the first aspect of the invention, a trained image processing model is obtained; processing the target image by using the trained image processing model to obtain a prediction frame matched with the target object; and detecting the target image based on the prediction frame matched with the target object.
In an embodiment of the second aspect, an implementation method for obtaining the trained image processing model includes: training at least two image processing models by using the training method to obtain at least two alternative models; acquiring a test data set; and respectively testing each alternative model by using the test data set, and selecting the trained image processing model from the at least two alternative models according to the test result.
A third aspect of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of any one of the first aspects of the invention and/or the object detection method of any one of the second aspects of the invention.
A fourth aspect of the present invention provides an electronic apparatus, comprising: a memory storing a computer program; a processor, communicatively coupled to the memory, for executing the training method according to any of the first aspects of the present invention and/or the object detection method according to any of the second aspects of the present invention when the computer program is invoked; and the display is in communication connection with the processor and the memory and is used for displaying a GUI (graphical user interface) related to the training method and/or the target detection method.
As described above, the training method, the target detection method, the medium, and the electronic device according to the present invention have the following advantageous effects:
the training data adopted by the training method comprises the position, the size and the category of the selection frame and the angle of the selection frame; the image processing model obtained by adopting the training data can output a prediction frame with an angle, the detection of the inclined target can be realized based on the prediction frame with the angle, and the detection has higher positioning accuracy and recognition rate. Therefore, the training method can solve the problems of inaccurate target positioning and low recognition rate when the horizontal frame is adopted to detect the inclined target in the prior art, and is particularly suitable for target detection under a multi-angle scene with a top view angle.
Drawings
FIG. 1 is a flow chart of a training method according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating examples of multidimensional arrays involved in a training method according to an embodiment of the present invention.
Fig. 3A is a schematic structural diagram of an image processing model used in an embodiment of the training method of the present invention.
FIG. 3B is a schematic diagram of a portion of an algorithm of the training method according to an embodiment of the present invention.
FIG. 3C is a schematic diagram of a portion of an algorithm of the training method according to an embodiment of the present invention.
FIG. 3D is a schematic diagram of a portion of an algorithm of the training method according to an embodiment of the present invention.
Fig. 4 is a flowchart of step S12 according to an embodiment of the training method of the present invention.
FIG. 5A is a flowchart illustrating a method for detecting a target according to an embodiment of the present invention.
FIG. 5B is a flowchart illustrating the step S52 of the target detection method according to an embodiment of the present invention.
Fig. 5C is a flowchart illustrating the step S523 of the target detecting method according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Description of the element reference numerals
1 image processing model
11 shallow layer feature extraction module
12 deep layer characteristic extraction module
13 fusion module
14 post-processing module
600 electronic device
610 memory
620 processor
630 display
Steps S11 to S12
Steps S121 to S125
Steps S51 to S54
Steps S521 to S523
Steps S5231 to S5233
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments merely illustrate the basic idea of the present invention: they show only the components related to the invention and are not drawn according to the number, shape and size of the components in an actual implementation, where the type, quantity and proportion of components may vary arbitrarily and the component layout may be more complicated. Moreover, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions.
Existing detection methods for target objects usually select a horizontal or vertical frame as the candidate frame, which carries no angle information. When such methods are applied directly to human body detection at a top-down view angle, the detection target is often tilted in the image, so the obtained target detection frame tends to be inaccurate; for example, the background area inside the located detection frame may be far larger than the area of the target itself, causing inaccurate target positioning, a low target recognition rate, and similar problems. To address this, the invention provides a training method for training an image processing model in which the training data comprise not only the position, size and category of the selection frame but also its angle. An image processing model trained on such data can output prediction frames with angles; in a specific application, these angled prediction frames serve as candidate frames, a suitable target detection frame is selected from them, and detection of tilted targets is realized on that basis with higher positioning accuracy and recognition rate. Here, the selection frame refers to the frame of a target object in a training image; the prediction frame refers to a frame of a target object output by the image processing model; the candidate frames refer to the frames of a target object offered for selection in a practical application, usually several in number; and the target detection frame refers to the most suitable frame of the target object selected from the candidate frames.
In an embodiment of the present invention, the training method is used to train an image processing model. The image processing model is an artificial-intelligence-based model, such as a YoLo model or a TinyYoLoV3 model, used to process a target image to obtain a selection frame matching a target object contained in the target image. The input of the image processing model is a training image (during training) or an image to be detected (during application), and its output is the prediction frames corresponding to that image. Specifically, referring to fig. 1, the training method in this embodiment includes:
s11, obtaining a training data set, where the training data set includes a plurality of training data, and each training data includes a training image and a position, a size, an angle, and a category of a selection box in the training image. The training data set can be obtained from an internet source data set, and can also be obtained by manually marking the position, the size, the angle and the category of the selection box on the training image.
Preferably, the position, size and angle of a selection frame can be represented by a set of channels (x_center, y_center, w, h, θ), where the x_center and y_center channels represent the abscissa and ordinate of the center point of the selection frame; the w and h channels represent the width and height of the selection frame, where the width refers to the length of the long side and the height refers to the length of the short side; and the θ channel represents the angle of the selection frame. In a specific application, the included angle between the long side and the y axis can be used as the angle, with a value range of [0°, 180°). This definition avoids the ambiguity that arises with horizontal or vertical selection-frame definitions, where the same selection frame could be described either as (50, 50, 100, 50, 0°) or as (50, 50, 50, 100, 90°); the unambiguous definition is more conducive to network convergence during training. The category of the selection frame represents the category of the object to be detected, such as human body, vehicle or animal; in specific applications a single category or multiple categories can be defined as required. A minimal sketch of this long-side-first convention is given below.
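As an illustration only (not part of the patent text), the following sketch normalizes an arbitrarily described box into the convention above, with w always the long side and θ the angle between the long side and the y axis folded into [0°, 180°); the function name is hypothetical:

```python
def canonical_box(cx, cy, side_a, side_b, angle_a):
    """Return (cx, cy, w, h, theta) with w = long side, h = short side,
    and theta = angle between the long side and the y-axis in [0, 180)."""
    if side_a >= side_b:
        w, h, theta = side_a, side_b, angle_a
    else:
        # side_b is the long side; it is perpendicular to side_a
        w, h, theta = side_b, side_a, angle_a + 90.0
    return cx, cy, w, h, theta % 180.0

# Both descriptions of the same physical box map to one canonical tuple:
assert canonical_box(50, 50, 100, 50, 0) == canonical_box(50, 50, 50, 100, 90)
```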
S12, training the image processing model by using the training data set. The method for training the image processing model by using the training data set can be implemented by using the existing method, and details are not repeated here.
Based on the image processing model obtained after training by the method of this embodiment, an input target image can be processed to obtain one or more prediction frames, where each prediction frame consists of a group of channels and a corresponding category, the channels representing the abscissa and ordinate of the center point, the width, the height, and the angle of the prediction frame.
As can be seen from the above description, the training data used by the training method of this embodiment include the position, size and category of the selection frame as well as its angle. The image processing model obtained with such training data can output prediction frames with angles; using these angled prediction frames as candidate frames, a target image can be detected to obtain an angled target detection frame, thereby realizing detection of tilted targets with higher positioning accuracy and recognition rate. The method is therefore particularly suitable for target detection in multi-angle scenes at a top-down view angle.
In addition, in some embodiments where a horizontal channel and a vertical channel are used to define the selection frame, an exchangeability-of-edges (EoE) problem arises: near the 90° boundary, the long and short sides are not clearly distinguished, so the same channel must fit the long side in some cases and the short side in others, and the network cannot fit an actual box well; in some cases the horizontal and vertical channels may even predict similar values. In this embodiment, the w channel and the h channel define the width and height of the selection frame such that the w channel always corresponds to the long side and the h channel always corresponds to the short side, which resolves the EoE problem.
Moreover, compared with the mode that two models are adopted to respectively obtain the position and the angle of the prediction frame in some embodiments, the image processing model adopted by the embodiment can simultaneously complete the end-to-end detection of the target object in one forward propagation, output the position and the angle information of the prediction frame, and can accurately fit the detection target.
In an embodiment of the invention, before training the image processing model using the training data set, the training method further comprises: mapping the angle of the selection frame in the training data into a multidimensional array by using a Gaussian function. Preferably, the length of the multidimensional array is 180. Specifically, for any angle θ₀, the elements of the mapped array follow a Gaussian distribution centered at θ₀, with the distance measured circularly so that the two cases θ₀ ≤ 90° and θ₀ > 90° wrap around consistently. One way to implement the mapping in this embodiment assigns to the i-th element (i = 0, 1, …, 179):

g(i) = exp( − min(|i − θ₀|, 180 − |i − θ₀|)² / (2σ²) )

where σ is the standard deviation, whose value can be set according to actual requirements. In addition, please refer to fig. 2, which shows example multidimensional arrays for θ₀ = 0.3°, θ₀ = 45.2° and θ₀ = 89.5°. It should be noted that the above formula is only one possible choice for this embodiment; other mappings may also be adopted in specific applications. A sketch of this encoding is given after this paragraph.
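For illustration, a minimal sketch of the Gaussian angle encoding described above, assuming the circular-distance form of the formula (the function name and default σ are hypothetical):

```python
import numpy as np

def encode_angle(theta0: float, sigma: float = 6.0, bins: int = 180) -> np.ndarray:
    """Map an angle theta0 in [0, 180) to a `bins`-dimensional Gaussian label.

    The distance is circular with period 180 degrees, so the labels for
    0 and 180 degrees coincide, which avoids the PoA problem below."""
    idx = np.arange(bins, dtype=np.float64)
    diff = np.abs(idx - theta0)
    circular = np.minimum(diff, bins - diff)        # wrap-around distance
    return np.exp(-(circular ** 2) / (2.0 * sigma ** 2))

# A horizontal box gets one label whether its angle is written as 0 or 180:
assert np.allclose(encode_angle(0.0), encode_angle(180.0))
```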
In this embodiment, mapping the angle of the selection frame to a multidimensional array avoids the periodicity-of-angle (PoA) problem, in which an angled selection frame admits multiple angle definitions in application and the network therefore cannot fit (that is, a horizontal selection frame may be described with an angle of either 0° or 180°, making the network's angle prediction inaccurate). The principle is as follows: because the angles are mapped into multidimensional arrays, the arrays corresponding to a selection frame at 0° and at 180° are identical, so no multiple-definition problem exists and the PoA problem is avoided.
Referring to fig. 3A, in an embodiment of the present invention, the image processing model 1 includes a shallow feature extraction module 11, a deep feature extraction module 12, a fusion module 13, and a post-processing module 14. The shallow feature extraction module 11 obtains shallow features of the input data of the image processing model; the deep feature extraction module 12 obtains deep features of the input data; the fusion module 13 is connected to the shallow feature extraction module 11 and the deep feature extraction module 12 and fuses the shallow and deep features of the input data to obtain a feature map of the input data; and the post-processing module 14 is connected to the fusion module 13 and processes the feature map output by the fusion module 13 to obtain one or more prediction frames.
Specifically, referring to fig. 3B, fig. 3C and fig. 3D, the shallow feature extraction module 11 may be implemented with a TinyYoLoV3-based architecture; it is identical to the first half of the original TinyYoLoV3, and this part can load detection-network weights trained on other data sets (such as ImageNet) as pre-training weights. It should be noted that TinyYoLoV3 is a simplified version of YoLoV3 and is faster; YoLoV3, the third version of the YoLo series of target detection algorithms, treats object detection as a regression problem and completes the pipeline from raw image input to object position and category output with a single end-to-end network.
The deep feature extraction module 12 may further extract deep features of the input data. Preferably, the deep feature extraction module 12 includes an inverted residual block (Inverted Residual Block), implemented as shown in fig. 3C; it is suited to lightweight terminal networks and can extract features effectively while reducing the computation of the network, where k is the channel expansion coefficient and dw_conv is a depthwise (per-channel) convolution. A sketch of such a block is given below.
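As a rough illustration only, a MobileNetV2-style inverted residual block consistent with the description (expansion factor k, depthwise 3×3 convolution, linear projection, residual add); the exact layer configuration of fig. 3C is not reproduced here, so treat this as an assumption:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """1x1 expand by factor k -> depthwise 3x3 conv (dw_conv) -> 1x1 project,
    with a residual connection around the whole block."""
    def __init__(self, channels: int, k: int = 4):
        super().__init__()
        mid = channels * k
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            # groups=mid makes this a depthwise (per-channel) convolution
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)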
The fusion module 13 fuses the shallow features and the deep features to obtain a feature map and can be used to extract features for the angle channels. Specifically, the feature map may be represented in matrix form, in which case the fusion module 13 outputs a matrix of size (number of feature-map grid cells) × 186. An illustrative composition of the modules is sketched below.
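Purely as an illustration of how the four modules might compose end to end (the dataflow, class name and constructor arguments below are assumptions; the patent gives no layer-level wiring):

```python
import torch
import torch.nn as nn

class OrientedDetector(nn.Module):
    """Shallow features -> deep features -> fusion -> 186-channel head,
    producing one 186-channel vector per feature-map grid cell."""
    def __init__(self, shallow: nn.Module, deep: nn.Module,
                 fuse: nn.Module, head: nn.Module):
        super().__init__()
        self.shallow, self.deep = shallow, deep
        self.fuse, self.head = fuse, head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.shallow(x)      # shallow features of the input (module 11)
        d = self.deep(s)         # deeper features on top of them (module 12)
        f = self.fuse(s, d)      # fused feature map (module 13)
        return self.head(f)      # e.g. a 24 x 16 x 186 output (module 14)
```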
Referring to fig. 3D, taking the feature map of 24 × 16 × 186 channels as an example, the process of the post-processing module 14 processing the feature map output by the fusion module to obtain a plurality of prediction frames includes:
step 1, dividing channels, wherein 186 represents 186 channels, and each channel represents an abscissa x of the center point of the selection framecenterLongitudinal coordinate y of the center point of the selection framecenterWidth w of selection frame, height h of selection frame, detection confidence confpredictA 180-bit angle array and a selection box category. In particular, when the image processing model is used for detecting a human body, the category of the selection frame is only the human body frame, and at this time, the number of bits of the selection frame category is 1. In the feature map of the 24 × 16 × 186 channels, the number of the feature map grids is 24 × 16.
Step 2, encoding each channel. For the abscissa channel, the encoded output is (sigmoid(x_center) + X) / W, where X is the grid offset in the width direction (ranging from 1 to 24 in this embodiment) and W is the grid output width (24 in this embodiment). For the ordinate channel, the encoded output is (sigmoid(y_center) + Y) / H, where Y is the grid offset in the height direction (ranging from 1 to 16 in this embodiment) and H is the grid output height (16 in this embodiment). For the width and height channels, the encoded outputs are exp(w) and exp(h), respectively. For the angle channel, the encoded output is the index of the maximum bit after applying sigmoid to each element of the angle array, i.e. argmax(sigmoid(θ)). For the detection confidence channel, the encoded output is sigmoid(conf_predict).
Step 3, converting the encoded channel values into absolute coordinates to obtain the expected number of prediction frames. Specifically, the conversion is realized by multiplying the encoded output of the abscissa channel by the image width, the encoded output of the ordinate channel by the image height, the encoded output of the width channel by the image width, and the encoded output of the height channel by the image height, while the output values of the remaining channels are kept unchanged. A sketch of this decoding is given below.
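A minimal sketch of steps 1 to 3 for a single grid cell, assuming the channel order listed in step 1 and normalized offsets in step 2 (the function names and the zero-based grid indices are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_cell(raw, gx, gy, grid_w=24, grid_h=16, img_w=768, img_h=512):
    """Decode one 186-channel vector (channels: x, y, w, h, conf,
    180 angle bins, 1 class bit) at grid cell (gx, gy)."""
    x = (sigmoid(raw[0]) + gx) / grid_w * img_w     # absolute center x
    y = (sigmoid(raw[1]) + gy) / grid_h * img_h     # absolute center y
    w = np.exp(raw[2]) * img_w                      # long side, absolute
    h = np.exp(raw[3]) * img_h                      # short side, absolute
    conf = sigmoid(raw[4])                          # detection confidence
    theta = int(np.argmax(sigmoid(raw[5:185])))     # angle in degrees
    cls = sigmoid(raw[185])                         # human-body class score
    return x, y, w, h, theta, conf, cls
```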
According to the above description, the image processing model provided by this embodiment adds fusion of shallow features to the original TinyYoLoV3 model and is therefore better suited to detecting tilted selection frames, and introducing depthwise convolution into the model for feature extraction helps reduce the amount of computation. In addition, the model fuses features at multiple scales and can adapt to targets of many sizes. Because the width and height of the prediction frame are predicted directly by the model, no anchor boxes (a priori template bounding boxes) need to be added to the computation; consequently, when the model is trained for other scenes, anchor boxes do not need to be obtained by re-clustering. Furthermore, since the model is based on the TinyYoLoV3 architecture and uses inverted residual blocks to capture detail features, it is a lightweight small network that is easy to deploy.
Referring to fig. 4, in an embodiment of the present invention, a method for implementing training of the image processing model by using the training data set includes:
and S121, selecting training data from the training data set as current training data.
And S122, processing the current training data by using the image processing model to obtain a prediction frame corresponding to the current training data.
And S123, calculating a function value of a loss function according to the prediction box corresponding to the current training data. Optionally, one loss function used in this embodiment is:
L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) + λ_class · L_cross-entropy(class_target, class_predict) ]

where S² is the number of grid cells of the feature map, for example 24 × 16. λ_conf, λ_coor and λ_class are three weight values set according to actual requirements; preferably, when the center point of the target selection frame falls inside a given grid cell, the weights for that cell are λ_conf = 5, λ_coor = 1 and λ_class = 1, and otherwise λ_conf = 1, λ_coor = 0 and λ_class = 0, where the target selection frame refers to the selection frame in the training image contained in the current training data. conf_target, the confidence of the target selection frame, depends on the Euclidean distance D₁ between the coordinates of the target selection frame and those of the prediction frame: when D₁ is less than a threshold d_th, the confidence is

conf_target = 1 − (D₁ / d_th)^a,

and otherwise conf_target = 0, where a is a hyperparameter that adjusts the sharpness of the function (taking, for example, the value 2) and the threshold d_th can be selected according to actual requirements. conf_predict is the confidence of the prediction frame and can be output directly by the image processing model. coor_target denotes the center-point coordinates, width and height of the target selection frame, and coor_predict those of the prediction frame; θ_target is the angle of the target selection frame and θ_predict the angle of the prediction frame. L_MSE is the mean-square-error loss function, L_focalloss is the focal loss function, and L_cross-entropy is the cross-entropy loss function. A sketch of the target-confidence computation follows.
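For illustration, a sketch of conf_target under the reconstruction above (the exact falloff in the original formula image is not recoverable, so the 1 − (D₁/d_th)^a form is an assumption):

```python
def conf_target(d1: float, d_th: float, a: float = 2.0) -> float:
    """Target confidence as a function of the center distance D1 between
    the ground-truth selection frame and the prediction frame; zero once
    the distance reaches the threshold d_th, sharper for larger a."""
    if d1 >= d_th:
        return 0.0
    return 1.0 - (d1 / d_th) ** a
```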
In particular, when the detection object of the image processing model is a human body, the classification loss term λ_class · L_cross-entropy(class_target, class_predict) need not be considered in the specific application, and the loss function becomes:

L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) ]
and S124, adjusting parameters and/or a framework of the image processing model according to the function value of the loss function, where a specific adjustment method may be implemented by using an existing scheme, and details are not repeated here.
And S125, repeating steps S121 to S124 until the function value of the loss function meets a preset condition. Preferably, the training data selected in step S121 differ between repetitions. The preset condition may be set by the user according to requirements; for example, it may be that the function value of the loss function no longer decreases, or that its decrease is smaller than a preset threshold. A sketch of this loop is given below.
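Purely illustrative pseudocode for steps S121 to S125 (the helpers `dataset.sample` and `loss_fn`, and the stopping tolerance, are hypothetical stand-ins; the patent leaves the adjustment scheme to existing practice):

```python
def train(model, dataset, optimizer, loss_fn, max_steps=100_000, tol=1e-4):
    prev = float("inf")
    for _ in range(max_steps):
        batch = dataset.sample()                  # S121: pick training data
        preds = model(batch.images)               # S122: forward pass
        loss = loss_fn(preds, batch.targets)      # S123: loss function above
        optimizer.zero_grad()
        loss.backward()                           # S124: adjust parameters
        optimizer.step()
        if prev - loss.item() < tol:              # S125: preset condition met
            break
        prev = loss.item()
    return model
```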
Based on the above description of the training method, the invention also provides a target detection method. Referring to fig. 5A, in an embodiment of the present invention, the target detection method includes:
s51, acquiring a target image to be detected, wherein the target image comprises a target object; in particular, the target object includes only the category of human body. Preferably, this step further comprises preprocessing the target image to match the size of the target image with the input size of the image processing model, e.g. the size of the target image may be scaled to 768 × 512.
S52, obtaining a trained image processing model according to the training method of the invention. Specifically, referring to fig. 5B, an implementation method of step S52 in this embodiment includes:
s521, training at least two image processing models by using the training method to obtain at least two alternative models.
S522, a test data set is obtained, wherein the test data set includes a plurality of test data, and each test data includes a test image and a position, a size, an angle, and a category of a selection frame in the test image.
S523, testing each alternative model with the test data set, and selecting one of the at least two alternative models as the trained image processing model according to the test results.
And S53, processing the target image by using the trained image processing model to obtain a prediction frame matched with the target object. Each target image may correspond to multiple prediction frames, for example, each target image may be processed by using the trained image processing model to obtain 24 × 16 prediction frames.
S54, detecting the target image based on the prediction frames matched with the target object. Specifically, the prediction frames matched with the target object serve as candidate frames, and the most suitable candidate frame is selected as the target detection frame; this selection can be realized with an existing method such as non-maximum suppression (NMS). The target image is then detected with existing target detection techniques based on the target detection frame. A sketch of NMS over oriented frames follows.
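For illustration, a greedy NMS sketch over oriented boxes represented as polygons (the patent only names NMS; the use of shapely and the overlap threshold are assumptions):

```python
from shapely.geometry import Polygon

def rotated_nms(polys: list[Polygon], scores: list[float], iou_thresh: float = 0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(polys)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if polys[best].intersection(polys[i]).area
                 / max(polys[best].union(polys[i]).area, 1e-9) < iou_thresh]
    return keep
```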
Optionally, referring to fig. 5C, in this embodiment, an implementation method for selecting one of the at least two candidate models as the trained image processing model according to a test result includes:
s5231, processing the target image by using each of the candidate models. And each candidate model outputs a result after processing the target image, wherein the result is a plurality of prediction frames.
S5232, screening the prediction frames output by each candidate model to obtain the optimal prediction frame corresponding to each candidate model. One implementation of the screening is as follows: for any two prediction frames A and B, respectively obtain

IoU = |A ∩ B| / |A ∪ B|

and

GIoU = IoU − |C \ (A ∪ B)| / |C|,

where C is the minimum closure region of prediction frames A and B, namely the convex hull containing all vertices of the two frames, |C| is its area, and |C \ (A ∪ B)| is the area of the part of the closure region that belongs to neither prediction frame. The optimal prediction frame corresponding to each candidate model is then selected from the multiple prediction frames according to the GIoU. A sketch of this computation follows.
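A sketch of GIoU for the oriented frames above, using shapely for the polygon geometry (the corner construction assumes the long-side/y-axis angle convention from earlier; helper names are hypothetical):

```python
import math
from shapely.geometry import Polygon

def box_polygon(cx, cy, w, h, theta_deg):
    """Corners of an oriented box; w is the long side and theta_deg the
    angle between the long side and the y-axis."""
    t = math.radians(theta_deg)
    ux, uy = math.sin(t), math.cos(t)      # unit vector along the long side
    vx, vy = math.cos(t), -math.sin(t)     # unit vector along the short side
    return Polygon([(cx + sx * w / 2 * ux + sy * h / 2 * vx,
                     cy + sx * w / 2 * uy + sy * h / 2 * vy)
                    for sx, sy in ((1, 1), (1, -1), (-1, -1), (-1, 1))])

def giou(a: Polygon, b: Polygon) -> float:
    """GIoU = IoU - |C \\ (A u B)| / |C|, C being the convex closure of A and B."""
    union = a.union(b)
    hull = union.convex_hull               # minimum closure region C
    iou = a.intersection(b).area / union.area
    return iou - (hull.area - union.area) / hull.area
```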
S5233, respectively calculating the recall rate and the accuracy rate of each alternative model based on its corresponding optimal prediction frames, and selecting the trained image processing model from the alternative models according to the recall rate and the accuracy rate. Specifically, a comprehensive index can be obtained from the recall rate and the accuracy rate of each alternative model:

F = (1 + α²) · P · R / (α² · P + R)

The comprehensive index F balances the recall rate and the accuracy rate so as to evaluate the model effect. Here R is the recall rate, whose value is the number of correctly detected targets divided by the total number of targets; P is the accuracy rate, whose value is the number of correctly detected targets divided by the total number of detected targets; and the parameter α adjusts the relative importance of recall and accuracy in the assessment; in particular, when α = 1, recall and accuracy are equally important. An illustrative computation follows.
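For illustration, the comprehensive index as reconstructed above (the exact algebraic form in the original formula image is an assumption; at α = 1 it reduces to the familiar F1 score):

```python
def f_score(recall: float, precision: float, alpha: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; alpha trades off
    their importance, with alpha = 1 weighing them equally."""
    return ((1 + alpha ** 2) * precision * recall
            / (alpha ** 2 * precision + recall))
```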
According to the above description, the target detection method of this embodiment screens the prediction frames with GIoU and NMS to obtain the optimal prediction frame, making it better suited to screening rotated rectangular frames.
Based on the above description of the training method and the object detection method, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of the present invention and/or the object detection method of the present invention.
Based on the above description of the training method and the target detection method, the invention also provides an electronic device. Referring to fig. 6, in an embodiment of the invention, the electronic device 600 includes a memory 610, a processor 620 and a display 630. Wherein the memory 610 stores a computer program; the processor 620 is communicatively connected to the memory 610, and executes the training method of the present invention and/or the object detection method of the present invention when the computer program is called; the display 630 is communicatively coupled to the processor 620 and the memory 610, and is configured to display a GUI interactive interface associated with the training method and/or the object detection method.
The protection scope of the training method and the target detection method of the present invention is not limited to the execution sequence of the steps listed in this embodiment, and all the schemes of adding, subtracting, and replacing steps in the prior art according to the principle of the present invention are included in the protection scope of the present invention.
The training data adopted by the training method of the invention comprises the position, the size and the category of the selection frame, and also comprises the angle of the selection frame; the image processing model obtained by adopting the training data can output a prediction frame with an angle, the detection of the inclined target can be realized based on the prediction frame with the angle, and the detection has higher positioning accuracy and recognition rate. Therefore, the training method can solve the problems of inaccurate target positioning and low recognition rate when the horizontal frame is adopted to detect the inclined target in the prior art, and is particularly suitable for target detection under a multi-angle scene with a top view angle.
In addition, some embodiments of methods for detecting rotated objects in images, mostly aimed at scenes with low real-time requirements such as rotated-object detection in aerial images or tilted-text detection, follow one of two main ideas: pixel-level (pixel-to-pixel) segmentation methods, and methods based on tilted-bounding-box detection. The labeling process of a pixel-to-pixel method requires labeling every pixel (for example, binary-class labeling), which entails a large annotation workload, and the segmentation precision for small targets is low. Methods based on tilted-bounding-box detection are generally improvements on two-stage RCNN or Faster RCNN; their prediction speed often cannot meet the requirements of miniature terminals, and the angle definitions used in such methods often cause the PoA problem and the EoE problem. Addressing these issues, the training method of the present invention defines the width and height of the selection frame with the w channel and the h channel respectively, so that the w channel always corresponds to the long side and the h channel always corresponds to the short side, which solves the EoE problem; and it maps the angle to a multidimensional array so that the arrays for 0° and 180° are identical, eliminating multiple angle definitions and thereby solving the PoA problem. In addition, the image processing model adopted in the training method is based on the TinyYoLoV3 architecture and uses inverted residual blocks to capture detail features, so it is a lightweight small network that is easy to deploy.
In conclusion, the present invention effectively overcomes various disadvantages of the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A training method for training an image processing model, wherein the image processing model is used for processing a target image to obtain a selection box matched with a target object contained in the target image, and the training method comprises:
acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data comprises a training image and the position, the size, the angle and the category of a selection frame in the training image;
and training the image processing model by using the training data set.
2. The training method of claim 1, wherein prior to training the image processing model with the training data set, the training method further comprises:
and mapping the angle of the selection frame in the training data into a multidimensional array by using a Gaussian function.
3. The training method of claim 1, wherein the image processing model comprises:
the shallow feature extraction module is used for acquiring shallow features of input data of the image processing model;
the deep feature extraction module is used for acquiring deep features of input data of the image processing model;
the fusion module is connected with the shallow feature extraction module and the deep feature extraction module and is used for fusing the shallow features and the deep features of the input data to obtain a feature map;
and the post-processing module is connected with the fusion module and used for processing the feature map output by the fusion module to obtain a plurality of prediction frames.
4. A training method according to claim 3, wherein the method of training the image processing model using the training data set comprises:
selecting training data from the training data set as current training data;
processing the current training data by using the image processing model to obtain a prediction frame corresponding to the current training data;
calculating a function value of a loss function according to a prediction box corresponding to the current training data;
adjusting parameters and/or architecture of the image processing model according to the function value of the loss function;
and repeating the process until the function value of the loss function meets a preset condition.
5. Training method according to claim 4, characterized in that the loss function is:
L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) + λ_class · L_cross-entropy(class_target, class_predict) ]

wherein S² is the number of grid cells of the feature map; λ_conf, λ_coor and λ_class are respectively three weight values; conf_target is the confidence of the target selection frame and conf_predict is the confidence of the prediction frame; coor_target denotes the center-point coordinates, width and height of the target selection frame and coor_predict those of the prediction frame; θ_target is the angle of the target selection frame and θ_predict is the angle of the prediction frame; L_MSE is the mean-square-error loss function, L_focalloss is the focal loss function, and L_cross-entropy is the cross-entropy loss function.
6. The training method according to claim 4, wherein the detection object of the image processing model is a human body, and the loss function is:
L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) ]

wherein S² is the number of grid cells of the feature map; λ_conf and λ_coor are respectively two weight values; conf_target is the confidence of the target selection frame and conf_predict is the confidence of the prediction frame; coor_target denotes the center-point coordinates, width and height of the target selection frame and coor_predict those of the prediction frame; θ_target is the angle of the target selection frame and θ_predict is the angle of the prediction frame; L_MSE is the mean-square-error loss function and L_focalloss is the focal loss function.
7. An object detection method, characterized in that the object detection method comprises:
acquiring a target image to be detected, wherein the target image comprises a target object;
obtaining a trained image processing model according to the training method of any one of claims 1-6;
processing the target image by using the trained image processing model to obtain a prediction frame matched with the target object;
and detecting the target image based on the prediction frame matched with the target object.
8. The method of claim 7, wherein the step of obtaining the trained image processing model comprises:
training at least two image processing models by using the training method to obtain at least two alternative models;
acquiring a test data set;
and respectively testing each alternative model by using the test data set, and selecting the trained image processing model from the at least two alternative models according to the test result.
9. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program, when executed by a processor, implements a training method as claimed in any one of claims 1 to 6, and/or an object detection method as claimed in any one of claims 7 to 8.
10. An electronic device, characterized in that the electronic device comprises:
a memory storing a computer program;
a processor, communicatively coupled to the memory, that executes the training method of any of claims 1-6, and/or the object detection method of any of claims 7-8 when the computer program is invoked;
and the display is in communication connection with the processor and the memory and is used for displaying a GUI (graphical user interface) related to the training method and/or the target detection method.
CN202011430940.6A 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment Active CN112418344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011430940.6A CN112418344B (en) 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011430940.6A CN112418344B (en) 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112418344A true CN112418344A (en) 2021-02-26
CN112418344B CN112418344B (en) 2023-11-21

Family

ID=74774956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011430940.6A Active CN112418344B (en) 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112418344B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019128646A1 (en) * 2017-12-28 2019-07-04 深圳励飞科技有限公司 Face detection method, method and device for training parameters of convolutional neural network, and medium
EP3509014A1 (en) * 2018-01-05 2019-07-10 Whirlpool Corporation Detecting objects in images
CN108470077A (en) * 2018-05-28 2018-08-31 广东工业大学 A kind of video key frame extracting method, system and equipment and storage medium
WO2019227479A1 (en) * 2018-06-01 2019-12-05 华为技术有限公司 Method and apparatus for generating face rotation image
CN111950329A (en) * 2019-05-16 2020-11-17 长沙智能驾驶研究院有限公司 Target detection and model training method and device, computer equipment and storage medium
CN111079632A (en) * 2019-12-12 2020-04-28 上海眼控科技股份有限公司 Training method and device of text detection model, computer equipment and storage medium
CN111241947A (en) * 2019-12-31 2020-06-05 深圳奇迹智慧网络有限公司 Training method and device of target detection model, storage medium and computer equipment
CN111444918A (en) * 2020-04-01 2020-07-24 中移雄安信息通信科技有限公司 Image inclined text line detection model training and image inclined text line detection method
CN112036249A (en) * 2020-08-04 2020-12-04 汇纳科技股份有限公司 Method, system, medium and terminal for end-to-end pedestrian detection and attribute identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李佳禧; 邱东; 杨宏韬; 刘克平: "Research on workpiece recognition method based on improved YOLO v3", Modular Machine Tool & Automatic Manufacturing Technique (组合机床与自动化加工技术), no. 08

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494398A (en) * 2022-01-18 2022-05-13 深圳市联洲国际技术有限公司 Processing method and device for inclined target, storage medium and processor
CN114494398B (en) * 2022-01-18 2024-05-07 深圳市联洲国际技术有限公司 Processing method and device of inclined target, storage medium and processor

Also Published As

Publication number Publication date
CN112418344B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN109241913B (en) Ship detection method and system combining significance detection and deep learning
CN115601549B (en) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model
CN113673425B (en) Multi-view target detection method and system based on Transformer
CN108428220B (en) Automatic geometric correction method for ocean island reef area of remote sensing image of geostationary orbit satellite sequence
CN110765865B (en) Underwater target detection method based on improved YOLO algorithm
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
CN113449594A (en) Multilayer network combined remote sensing image ground semantic segmentation and area calculation method
CN110796143A (en) Scene text recognition method based on man-machine cooperation
US20220375192A1 (en) Optimization method, apparatus, device for constructing target detection network, medium and product
CN110427933A (en) A kind of water gauge recognition methods based on deep learning
CN110443279B (en) Unmanned aerial vehicle image vehicle detection method based on lightweight neural network
CN110728307A (en) Method for realizing small sample character recognition of X-ray image by self-generating data set and label
CN113313703A (en) Unmanned aerial vehicle power transmission line inspection method based on deep learning image recognition
CN115047455A (en) Lightweight SAR image ship target detection method
CN115937552A (en) Image matching method based on fusion of manual features and depth features
CN116645592A (en) Crack detection method based on image processing and storage medium
CN110851627B (en) Method for describing sun black subgroup in full-sun image
CN115393635A (en) Infrared small target detection method based on super-pixel segmentation and data enhancement
Laupheimer et al. The importance of radiometric feature quality for semantic mesh segmentation
CN114820668A (en) End-to-end building regular outline automatic extraction method based on concentric ring convolution
CN114708434A (en) Cross-domain remote sensing image semantic segmentation method based on adaptation and self-training in iterative domain
CN113657225B (en) Target detection method
CN112418344A (en) Training method, target detection method, medium and electronic device
CN110334581A (en) A kind of multi-source Remote Sensing Images change detecting method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 201203 No. 6, Lane 55, Chuanhe Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant after: Winner Technology Co.,Ltd.

Address before: 201505 Room 216, 333 Tingfeng Highway, Tinglin Town, Jinshan District, Shanghai

Applicant before: Winner Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant