CN112418344B - Training method, target detection method, medium and electronic equipment


Info

Publication number
CN112418344B
CN112418344B
Authority
CN
China
Prior art keywords
target
training
frame
image processing
processing model
Prior art date
Legal status
Active
Application number
CN202011430940.6A
Other languages
Chinese (zh)
Other versions
CN112418344A (en)
Inventor
王海涛
袁德胜
游浩泉
成西锋
任晓双
崔龙
马卫民
林治强
党毅飞
李伟超
Current Assignee
Winner Technology Co ltd
Original Assignee
Winner Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Winner Technology Co ltd filed Critical Winner Technology Co ltd
Priority to CN202011430940.6A priority Critical patent/CN112418344B/en
Publication of CN112418344A publication Critical patent/CN112418344A/en
Application granted granted Critical
Publication of CN112418344B publication Critical patent/CN112418344B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a training method, a target detection method, a medium and electronic equipment; the training method comprises the following steps: acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data comprises a training image and the position, the size, the angle and the category of a selection frame in the training image; and training the image processing model by using the training data set. The training method can solve the problems of inaccurate target positioning and low recognition rate in the prior art when a horizontal frame is adopted to detect the inclined target.

Description

Training method, target detection method, medium and electronic equipment
Technical Field
The invention belongs to the field of data identification, relates to a target detection method, and in particular relates to a training method, a target detection method, a medium and electronic equipment.
Background
With the development of technology and growing user demand, video monitoring is increasingly applied in daily life. In specific application scenarios, the monitoring camera is often installed at a top-view angle due to limitations of the installation environment, detection range, installation manpower and the like. In this case, the human body may appear tilted at different angles in the picture, and the conventional horizontal rectangular detection frame cannot meet the requirement of framing the target body. At present, target detection methods for common-view-angle images are well studied; in such scenes the target object is mostly delimited by a horizontal rectangular frame, and the target is located by regressing the frame parameters. Classical methods include: Faster RCNN (Region-based Convolutional Neural Network), which achieves efficient and high-precision target detection through feature sharing and a Region Proposal Network (RPN); and the YoLo (You Only Look Once) model, which detects targets by directly regressing target coordinates on a grid and classifying them, and which greatly improves detection speed compared with Faster RCNN, so that it is widely applied in many fields. However, these methods usually select a horizontal frame as the candidate frame, and such a candidate frame carries no angle information. Although they can locate vertically or horizontally oriented objects in an image relatively accurately and effectively, when they are applied directly to top-view human detection, the lack of rotation invariance of the target detection model becomes a problem: the target detection frames determined from these candidate regions are usually inaccurate, for example the background area inside the located detection frame may be far larger than the area of the target itself, so target positioning is inaccurate and the target recognition rate is low.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a training method, a target detection method, a medium and an electronic device, which are used for solving the problem that a candidate frame selected in the prior art does not have angle information.
To achieve the above and other related objects, a first aspect of the present invention provides a training method for training an image processing model for processing a target image to obtain a selection box matching a target object contained in the target image, the training method comprising: acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data comprises a training image and the position, the size, the angle and the category of a selection frame in the training image; and training the image processing model by using the training data set.
In an embodiment of the first aspect, before training the image processing model with the training dataset, the training method further comprises: the angles of the selection boxes in the training data are mapped into a multi-dimensional array using Gaussian functions.
In an embodiment of the first aspect, the image processing model comprises: the shallow feature extraction module is used for acquiring shallow features of input data of the image processing model; the deep feature extraction module is used for acquiring deep features of input data of the image processing model; the fusion module is connected with the shallow feature extraction module and the deep feature extraction module and is used for fusing the shallow features and the deep features of the input data to obtain a feature map; and the post-processing module is connected with the fusion module and is used for processing the feature images output by the fusion module to obtain a plurality of prediction frames.
In an embodiment of the first aspect, the training method for training the image processing model by using the training data set includes: selecting training data from the training data set as current training data; processing the current training data by using the image processing model to obtain a prediction frame corresponding to the current training data; calculating a function value of a loss function according to a prediction frame corresponding to the current training data; adjusting parameters and/or architecture of the image processing model according to the function value of the loss function; repeating the above process until the function value of the loss function meets the preset condition.
In an embodiment of the first aspect, the loss function is:
wherein S² is the number of grids of the feature map; λ_conf, λ_coor and λ_class are respectively three weight values; conf_target is the confidence of the target selection frame; conf_predict is the confidence of the prediction frame; coor_target is the center-point coordinates, width and height of the target selection frame; coor_predict is the center-point coordinates, width and height of the prediction frame; θ_target is the angle of the target selection frame; θ_predict is the angle of the prediction frame; L_MSE is the mean square error loss function; L_focalloss is the focal loss function; L_cross-entropy is the cross-entropy loss function.
In an embodiment of the first aspect, the detection object of the image processing model is a human body, and the loss function is:
wherein S² is the number of grids of the feature map; λ_conf and λ_coor are respectively two weight values; conf_target is the confidence of the target selection frame; conf_predict is the confidence of the prediction frame; coor_target is the center-point coordinates, width and height of the target selection frame; coor_predict is the center-point coordinates, width and height of the prediction frame; θ_target is the angle of the target selection frame; θ_predict is the angle of the prediction frame; L_MSE is the mean square error loss function; L_focalloss is the focal loss function.
A second aspect of the present invention provides a target detection method, the target detection method comprising: acquiring a target image to be detected, wherein the target image comprises a target object; the training method according to any one of the first aspect of the present invention obtains a trained image processing model; processing the target image by using the trained image processing model to obtain a prediction frame matched with the target object; the target image is detected based on a prediction box matched with the target object.
In an embodiment of the second aspect, the implementation method for obtaining the trained image processing model includes: training at least two image processing models by using the training method to obtain at least two alternative models; acquiring a test data set; and respectively testing each alternative model by using the test data set, and selecting the trained image processing model from the at least two alternative models according to a test result.
A third aspect of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method according to any of the first aspect of the present invention and/or the object detection method according to any of the second aspect of the present invention.
A fourth aspect of the present invention provides an electronic device comprising: a memory storing a computer program; a processor communicatively coupled to the memory for executing the training method of any one of the first aspects of the present invention and/or the target detection method of any one of the second aspects of the present invention when the computer program is invoked; and the display is in communication connection with the processor and the memory and is used for displaying a related GUI interactive interface of the training method and/or the target detection method.
As described above, the technical solutions of the training method, the target detection method, the medium and the electronic equipment of the present invention have the following beneficial effects:
the training data adopted by the training method comprises the position, the size and the category of the selection frame and also comprises the angle of the selection frame; the image processing model obtained by the training data can output an angular prediction frame, the detection of the inclined target can be realized based on the angular prediction frame, and the detection has higher positioning accuracy and recognition rate. Therefore, the training method can overcome the problems of inaccurate target positioning and low recognition rate when the horizontal frame is adopted to detect the inclined target in the prior art, and is particularly suitable for target detection under a top view multi-angle scene.
Drawings
FIG. 1 is a flow chart of a training method according to an embodiment of the invention.
FIG. 2 is a diagram showing an example of a multi-dimensional array involved in one embodiment of the training method according to the present invention.
FIG. 3A is a schematic diagram of an image processing model used in an embodiment of the training method according to the present invention.
FIG. 3B is a schematic diagram of a portion of an algorithm of the training method according to an embodiment of the invention.
FIG. 3C is a schematic diagram of a portion of an algorithm of the training method according to an embodiment of the invention.
FIG. 3D is a schematic diagram of a part of the training method according to an embodiment of the invention.
Fig. 4 is a flowchart showing step S12 of the training method according to an embodiment of the invention.
FIG. 5A is a flowchart illustrating an exemplary embodiment of a method for detecting an object according to the present invention.
Fig. 5B is a flowchart showing step S52 of the object detection method according to an embodiment of the invention.
Fig. 5C is a flowchart showing a step S523 of the object detection method according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Description of element reference numerals
1. Image processing model
11. Shallow layer feature extraction module
12. Deep feature extraction module
13. Fusion module
14. Post-processing module
600. Electronic equipment
610. Memory device
620. Processor and method for controlling the same
630. Display device
S11 to S12 steps
S121 to S125 steps
S51 to S54 steps
S521 to S523 steps
Steps S5231 to S5233
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other embodiments, and the details of the present description may be modified or varied in various respects without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other where no conflict arises.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention schematically; they show only the components related to the invention rather than the number, shape and size of the components in an actual implementation, where the form, quantity and proportion of each component may be changed arbitrarily and the component layout may be more complex. Moreover, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions.
The existing target object detection methods generally select a horizontal or vertical frame as the candidate frame, and such a candidate frame carries no angle information. When these methods are applied directly to top-view human detection, the detected target is often inclined in the image, and the obtained target detection frame is often inaccurate; for example, the background area inside the located detection frame may be far larger than the area of the target itself, leading to inaccurate target positioning, a low target recognition rate and similar problems. To solve this, the invention provides a training method for training an image processing model in which the training data include not only the position, size and category of the selection frame but also its angle. An image processing model trained on such data can output prediction frames with angles; in a specific application, the angled prediction frames serve as candidate frames, an appropriate target detection frame is selected from them, and the detection of inclined targets can then be realized with higher positioning accuracy and recognition rate. Here, the selection frame refers to the frame of a target object in a training image; the prediction frame refers to a frame of a target object output by the image processing model; the candidate frames refer to the frames of a target object offered for selection in a practical application, usually several in number; and the target detection frame refers to the most appropriate target-object frame selected from the candidate frames.
In an embodiment of the present invention, the training method is used for training an image processing model. The image processing model is based on artificial intelligence (for example, a YoLo model or a TinyYoLoV3 model) and is used for processing a target image to obtain a selection frame matched with a target object contained in the target image. The input of the image processing model is a training image (during training) or an image to be detected (during application), and its output is the prediction frame corresponding to that image. Specifically, referring to fig. 1, the training method in this embodiment includes:
s11, acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data comprises a training image and the position, the size, the angle and the category of a selection frame in the training image. The training data set can be obtained from an internet open source data set, and can also be obtained by manually marking the position, the size, the angle and the category of the selection frame on the training image.
Preferably, the position, size and angle of a selection frame can be described by a set of channels (x_center, y_center, w, h, θ), where the x_center channel and the y_center channel represent the abscissa and the ordinate of the center point of the selection frame respectively; the w channel and the h channel represent the width and the height of the selection frame respectively, where the width of the selection frame refers to the length of its long side and the height refers to the length of its short side; and the θ channel represents the angle of the selection frame, for which, in a specific application, the included angle between the long side and the y axis can be used, with a value range of [0°, 180°]. This definition avoids the ambiguity that arises when a horizontal or vertical selection frame is used, where the same selection frame could be described as either (50, 50, 100, 50, 0°) or (50, 50, 50, 100, 90°), and is thus more conducive to convergence of the network during training. The category of the selection frame represents the class of the object to be detected, such as human body, vehicle or animal; a single category or multiple categories can be defined as required in a specific application.
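As a concrete illustration, the following sketch builds one training annotation under this convention; the helper and the field names are illustrative assumptions, not the patent's own data format:

```python
import numpy as np

def canonical_box(cx, cy, w, h, theta_deg):
    """Normalize a rotated box so that w is always the long side and theta
    (the angle between the long side and the y axis) lies in [0, 180)."""
    if h > w:                      # swap so the first extent is the long side
        w, h = h, w
        theta_deg += 90.0          # the long side direction rotates by 90 degrees
    theta_deg %= 180.0             # 0 and 180 describe the same orientation
    return cx, cy, w, h, theta_deg

# One training annotation: image plus box position, size, angle and category.
annotation = {
    "image": "train_0001.jpg",
    "boxes": [canonical_box(50, 50, 50, 100, 90.0)],  # -> (50, 50, 100, 50, 0.0)
    "classes": ["human"],
}
```

Note how the two equivalent descriptions from the paragraph above, (50, 50, 100, 50, 0°) and (50, 50, 50, 100, 90°), collapse to the same canonical tuple.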
And S12, training the image processing model by using the training data set. The method for training the image processing model by using the training data set may be implemented by using an existing method, which is not described herein.
Based on the image processing model obtained after training by the training method in this embodiment, an input target image can be processed to obtain one or more prediction frames corresponding to the target image, where each prediction frame includes a set of channels and a class corresponding to the channels, and each channel is used to represent an abscissa and an ordinate of a center point of the prediction frame, a width, a height, and an angle, respectively.
As can be seen from the above description, the training data adopted in the training process of the training method according to the present embodiment includes the angle of the selection frame in addition to the position, size and category of the selection frame; the image processing model obtained by the training data can output an angled prediction frame, the angled prediction frame is used as a candidate frame to detect a target image, an angled target detection frame can be obtained, further, the detection of an inclined target can be achieved, and high positioning accuracy and recognition rate can be obtained.
In addition, in some embodiments the selection frames are defined by horizontal and vertical channels, which gives rise to the long-short edge exchange (Exchangeability of Edges, EoE) problem: because the long and short edges are left undefined, near the 90° boundary the network's predictions fail to fit the actual box, since the horizontal channel fits the long edge in some cases and the short edge in others, and the vertical channel behaves likewise, so that in some cases the horizontal and vertical channels predict similar values. In this embodiment, the width and the height of the selection frame are defined by the w channel and the h channel respectively, so that the w channel always corresponds to the long side of the selection frame and the h channel always corresponds to the short side, thereby resolving the EoE problem.
In addition, compared with embodiments in which two models are used to obtain the position and the angle of the prediction frame separately, the image processing model adopted in this embodiment can complete end-to-end detection of the target object in one forward propagation, outputting both the position and the angle information of the prediction frame so that it fits the detection target accurately.
In an embodiment of the invention, before training the image processing model using the training dataset, the training method further comprises: mapping the angles of the selection frames in the training data into a multi-dimensional array using Gaussian functions. Preferably, the length of the multi-dimensional array is 180. Specifically, for any angle θ_0, the elements of the mapped multi-dimensional array follow a Gaussian function with θ_0 as the mean. In this embodiment, the mapping is defined piecewise, with one expression for θ_0 ≤ 90° and another for θ_0 > 90°, where σ is the standard deviation and its value can be set according to actual requirements. In addition, referring to FIG. 2, exemplary multi-dimensional arrays according to this embodiment are shown for θ_0 = 0.3, θ_0 = 45.2 and θ_0 = 89.5. It should be noted that this piecewise formula is only one possible way of implementing the mapping in this embodiment; other ways may be adopted in specific applications.
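A minimal sketch of one such mapping, assuming a Gaussian wrapped with period 180° so that θ_0 = 0° and θ_0 = 180° yield identical arrays (the property exploited below to avoid the PoA problem); the patent's own piecewise expressions may differ in detail, and the value of σ here is arbitrary:

```python
import numpy as np

def angle_to_array(theta0, sigma=6.0, length=180):
    """Map an angle theta0 (degrees) to a 180-dimensional array whose elements
    follow a Gaussian with mean theta0, wrapped with period 180 so that
    theta0 = 0 and theta0 = 180 produce identical arrays."""
    i = np.arange(length, dtype=np.float64)
    d = np.abs(i - theta0)
    d = np.minimum(d, length - d)            # circular distance with period 180
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

arr = angle_to_array(89.5)   # peaks near index 89-90, cf. the FIG. 2 examples
```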
In this embodiment, mapping the angles of the selection frames into a multi-dimensional array avoids the angular periodicity (Periodicity of Angle, PoA) problem, namely the network failing to fit because an angled selection frame admits multiple definitions; for example, a horizontal selection frame can be defined as either 0° or 180°, which makes the predicted angle inaccurate. The principle is as follows: after the mapping, the multi-dimensional arrays corresponding to a selection frame at 0° and at 180° are identical, so the multiple definition of the angle disappears, and this embodiment therefore avoids the PoA problem.
Referring to fig. 3A, in an embodiment of the invention, the image processing model 1 includes a shallow feature extraction module 11, a deep feature extraction module 12, a fusion module 13, and a post-processing module 14. The shallow feature extraction module 11 is configured to obtain shallow features of input data of the image processing model, the deep feature extraction module 12 is configured to obtain deep features of the input data of the image processing model, and the fusion module 13 is connected to the shallow feature extraction module 11 and the deep feature extraction module 12, and is configured to fuse the shallow features and the deep features of the input data to obtain a feature map of the input data; the post-processing module 14 is connected to the fusion module 13, and is configured to process the feature map output by the fusion module 13 to obtain 1 or more prediction frames.
Specifically, referring to fig. 3B, fig. 3C and fig. 3D, the shallow feature extraction module 11 may be implemented with a TinyYoLoV3-based architecture, in which the shallow feature extraction module 11 is identical to the first half of the original TinyYoLoV3; this part can be loaded with detection network weights trained on other data sets (such as ImageNet) as pre-training weights. It should be noted that TinyYoLoV3 is a simplified version of YoLoV3 and is faster; YoLoV3 is the third version of the YoLo series of object detection algorithms, which solves object detection as a regression problem and completes, in a single end-to-end network, the mapping from the input original image to the output object locations and classes.
The deep feature extraction module 12 may be configured to further extract deep features of the input data. Preferably, the deep feature extraction module 12 includes an inverted residual block (Inverted Residual Block), whose implementation is shown in fig. 3C; it suits lightweight terminal networks and can extract features effectively while keeping the computational load of the network low, where k is the channel expansion coefficient and dw_conv is a split-channel convolution (depthwise separable convolution).
The fusion module 13 is configured to fuse the shallow features and the deep features to obtain a feature map, and can also be used for feature extraction of the angle channel. Specifically, the feature map may be represented in the form of a matrix; in this case the fusion module 13 outputs a matrix of size 186 × (number of feature map grids).
Referring further to fig. 3D, taking a 24×16×186 channel feature map as an example, the process of the post-processing module 14 processing the feature map output by the fusion module to obtain a plurality of prediction frames includes:
step 1, dividing channels, wherein 186 represents 186 channels, and represents the abscissa x of the center point of the selection frame center Ordinate y of the centre point of the selection frame center Width w of selection frame, height h of selection frame, detection confidence conf predict 180 bit angle array and select box category. In particular, when the image processing model is used for detecting a human body, the type of the selection frame is only a human body frame, and at this time, the number of bits of the selection frame type is 1. In the 24×16×186 channel feature map, the feature map grid number is 24×16.
Step 2, encoding each channel. For the abscissa channel, the encoded output is (sigmoid(x_center) + X)/W, where X is the grid offset in the width direction (range 1-24 in this embodiment) and W is the network output width (24 in this embodiment). For the ordinate channel, the encoded output is (sigmoid(y_center) + Y)/H, where Y is the grid offset in the height direction (range 1-16 in this embodiment) and H is the network output height (16 in this embodiment). For the width and height channels, the encoded outputs are exp(w) and exp(h) respectively. For the angle channel, the encoded output is the index of the largest bit after applying sigmoid to each element of the angle array, i.e. argmax(sigmoid(θ)). For the detection confidence channel, the encoded output is sigmoid(conf_predict).
Step 3, converting the encoded channel value into absolute coordinates to obtain a desired number of prediction frames, specifically: the conversion of the absolute coordinates can be realized by multiplying the code output of the abscissa channel by the image width, the code output of the ordinate channel by the image height, the code output of the width channel by the image width, the code output of the height channel by the image height, and the output values of the rest channels are kept unchanged.
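A sketch of the per-cell decoding in steps 1 to 3 above, for the 24×16 grid with 186 channels; the channel ordering, the zero-based grid offsets and the sigmoid on the category channel are assumptions made for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_cell(cell, X, Y, W=24, H=16, img_w=768, img_h=512):
    """Decode one 186-channel grid cell at zero-based grid offset (X, Y)
    into an absolute rotated box (cx, cy, w, h, theta, conf, cls)."""
    xc, yc, w, h, conf = cell[0], cell[1], cell[2], cell[3], cell[4]
    angle_bits = cell[5:185]                       # the 180-bit angle array
    cls = cell[185]                                # single human-body category bit
    cx = (sigmoid(xc) + X) / W * img_w             # steps 2 and 3 for the abscissa
    cy = (sigmoid(yc) + Y) / H * img_h             # steps 2 and 3 for the ordinate
    bw = np.exp(w) * img_w                         # width channel: exp, then scale
    bh = np.exp(h) * img_h                         # height channel: exp, then scale
    theta = float(np.argmax(sigmoid(angle_bits)))  # angle = index of largest bit
    return cx, cy, bw, bh, theta, sigmoid(conf), sigmoid(cls)
```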
As can be seen from the above description, the image processing model provided in this embodiment adds shallow feature fusion relative to the original TinyYoLoV3 model and is therefore better suited to detecting inclined selection frames; it also introduces split-channel (depthwise) convolution for feature extraction, which helps reduce the amount of computation. In addition, the image processing model in this embodiment has multi-scale fusion and can adapt to multiple target sizes; the width and the height of the prediction frame are predicted directly by the model, so no anchor frame (i.e. a priori template bounding box) needs to be added to the calculation, and the model does not need to re-cluster anchor frames when it is trained for other scenes. Furthermore, the image processing model in this embodiment is based on the TinyYoLoV3 architecture and obtains detail features with inverted residual blocks, so it is a lightweight small network that is convenient to deploy.
Referring to fig. 4, in an embodiment of the present invention, a method for training the image processing model by using the training data set includes:
s121, selecting training data from the training data set as current training data.
S122, processing the current training data by using the image processing model to obtain a prediction frame corresponding to the current training data.
S123, calculating a function value of a loss function according to the prediction frame corresponding to the current training data. Optionally, a loss function used in this embodiment is:
wherein S² is the number of grids of the feature map, for example 24×16. λ_conf, λ_coor and λ_class are three weight values that can be set according to actual requirements; preferably, when the center-point coordinates of the target selection frame fall inside a given grid, the weight values for that grid are λ_conf = 5, λ_coor = 1 and λ_class = 1; otherwise λ_conf = 1, λ_coor = 0 and λ_class = 0, where the target selection frame refers to the selection frame in the training image included in the current training data. conf_target, the confidence of the target selection frame, depends on the Euclidean distance D_1 between the center coordinates of the target selection frame and those of the prediction frame: when D_1 is less than a threshold d_th, the confidence is given by a function of D_1 whose sharpness is adjusted by a hyperparameter a (taking a value of, e.g., 2); otherwise conf_target = 0. The threshold d_th can be chosen according to actual requirements. conf_predict is the confidence of the prediction frame and may be output directly by the image processing model. coor_target is the center-point coordinates, width and height of the target selection frame; coor_predict is the center-point coordinates, width and height of the prediction frame; θ_target is the angle of the target selection frame; θ_predict is the angle of the prediction frame; L_MSE is the mean square error loss function, L_focalloss the focal loss function, and L_cross-entropy the cross-entropy loss function.
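One plausible form of this loss, assuming the focal loss applies to the confidence term, the MSE loss to the coordinate and angle terms, and the cross-entropy loss to the category term (the assignment of the angle term to the λ_coor weight is likewise an assumption), is:

$$L=\sum_{i=1}^{S^{2}}\Big[\lambda_{conf}\,L_{focalloss}(conf_{target},conf_{predict})+\lambda_{coor}\,L_{MSE}(coor_{target},coor_{predict})+\lambda_{coor}\,L_{MSE}(\theta_{target},\theta_{predict})+\lambda_{class}\,L_{cross\text{-}entropy}(class_{target},class_{predict})\Big]$$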
In particular, when the detection object of the image processing model is a human body, there is only a single category, so the classification loss term in the loss function vanishes; in a specific application, the loss function then retains only the confidence term and the coordinate and angle terms, weighted by λ_conf and λ_coor.
s124, adjusting parameters and/or architecture of the image processing model according to the function value of the loss function, wherein the specific adjustment method can be implemented by adopting the existing scheme, and details are not repeated here.
S125, repeating the steps S121-S124 until the function value of the loss function meets the preset condition. Preferably, in the repetition process, the training data selected in step S121 is different each time. The preset condition may be set by the user according to the requirement, for example, the function value of the loss function may not drop any more, or the drop width of the function value of the loss function is smaller than a preset threshold.
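A minimal sketch of the S121-S125 loop, assuming a PyTorch-style model and a loss function with the interface described above; the stopping rule shown (loss decrease below a threshold) is one instance of the preset condition mentioned in S125:

```python
import torch

def train(model, loss_fn, dataset, max_epochs=100, tol=1e-4):
    """Repeat S121-S124 until the loss function value meets the preset condition."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    prev_epoch_loss = float("inf")
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for image, target in dataset:        # S121: select current training data
            pred = model(image)              # S122: prediction frames for the data
            loss = loss_fn(pred, target)     # S123: function value of the loss
            opt.zero_grad()
            loss.backward()                  # S124: adjust model parameters
            opt.step()
            epoch_loss += loss.item()
        if prev_epoch_loss - epoch_loss < tol:   # S125: loss no longer decreasing
            break
        prev_epoch_loss = epoch_loss
    return model
```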
Based on the above description of the training method, the invention further provides a target detection method. Referring to fig. 5A, in an embodiment of the invention, the target detection method includes:
s51, acquiring a target image to be detected, wherein the target image comprises a target object; in particular, the target object includes only the category of human body. Preferably, the step further comprises preprocessing the target image to match the size of the target image with the input size of the image processing model, for example, the size of the target image may be scaled to 768×512.
S52, acquiring a trained image processing model according to the training method. Specifically, referring to fig. 5B, an implementation method of step S52 in this embodiment includes:
s521, training at least two image processing models by using the training method to obtain at least two alternative models.
S522, a test data set is obtained, wherein the test data set comprises a plurality of test data, and each test data comprises a test image and the position, the size, the angle and the category of a selection frame in the test image.
S523, testing each alternative model by using the test data set, and selecting one from the at least two alternative models as the trained image processing model according to a test result.
And S53, processing the target image by using the trained image processing model to obtain a prediction frame matched with the target object. Wherein each target image may correspond to a plurality of prediction frames, for example, each target image may be processed using the trained image processing model to obtain 24×16 prediction frames.
And S54, detecting the target image based on the prediction frames matched with the target object. Specifically, the prediction frames matched with the target object are taken as candidate frames, and the most appropriate candidate frame is selected from them as the target detection frame; the selection can be implemented with an existing method such as Non-Maximum Suppression (NMS). Based on the target detection frame, the detection of the target image can be realized with existing target detection techniques.
Optionally, referring to fig. 5C, in this embodiment, an implementation method for selecting one of the at least two candidate models as the trained image processing model according to the test result includes:
and S5231, processing the target image by utilizing each alternative model. And the result output after each candidate model processes the target image is a plurality of prediction frames.
S5232, screening the prediction frames output by each alternative model to obtain the optimal prediction frame corresponding to that model. One method for implementing the screening is as follows: for any two prediction frames A and B, respectively obtain IOU = |A∩B| / |A∪B| and GIOU = IOU − (|C| − |A∪B|) / |C|, wherein C is the minimum closure region of the prediction frames A and B, namely the convex hull of all vertices of the two prediction frames, and |C| − |A∪B| is the area of the closure region that does not belong to either prediction frame. The optimal prediction frame corresponding to each alternative model is then selected from the plurality of prediction frames according to the GIOU.
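A sketch of this GIOU computation for two rotated boxes, using shapely for the polygon areas; the corner construction assumes the angle convention defined earlier (θ measured between the long side and the y axis):

```python
import numpy as np
from shapely.geometry import Polygon

def box_polygon(cx, cy, w, h, theta_deg):
    """Corner polygon of a rotated box; theta is the long-side/y-axis angle."""
    t = np.deg2rad(theta_deg)
    u = np.array([np.sin(t), np.cos(t)]) * w / 2.0    # half of the long side
    v = np.array([np.cos(t), -np.sin(t)]) * h / 2.0   # half of the short side
    c = np.array([cx, cy])
    return Polygon([c + u + v, c + u - v, c - u - v, c - u + v])

def giou(a, b):
    pa, pb = box_polygon(*a), box_polygon(*b)
    inter = pa.intersection(pb).area
    union = pa.union(pb).area
    hull = pa.union(pb).convex_hull.area      # area of the minimum closure C
    return inter / union - (hull - union) / hull

print(giou((50, 50, 100, 50, 0), (55, 50, 100, 50, 10)))
```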
S5233, calculating the recall rate and accuracy rate of each alternative model based on its optimal prediction frames, and selecting the trained image processing model from the alternative models according to the recall rate and accuracy rate. Specifically, a comprehensive index F can be computed from the recall rate and accuracy rate of each alternative model; the index F weighs recall against accuracy so as to evaluate the model effect. Here R is the recall rate, whose value is the number of correctly detected targets divided by the total number of targets; P is the accuracy rate, whose value is the number of correctly detected targets divided by the total number of detected targets; and the parameter α adjusts the relative importance of recall and accuracy in the evaluation; in particular, when α = 1, recall and accuracy are equally important.
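Assuming the standard weighted F-measure, which reduces to the balanced F1 score when α = 1, the comprehensive index takes the form:

$$F = \frac{(1+\alpha^{2})\,P\,R}{\alpha^{2}\,P + R}$$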
As can be seen from the above description, the target detection method according to the present embodiment screens the prediction frame by using the GIOU and the NMS to obtain the best prediction frame, and thus is more suitable for the screening of the rotating rectangular frame.
Based on the above description of the training method and the target detection method, the present invention further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of the present invention and/or the target detection method of the present invention.
Based on the above description of the training method and the target detection method, the invention also provides electronic equipment. Referring to fig. 6, in an embodiment of the invention, the electronic device 600 includes a memory 610, a processor 620, and a display 630. Wherein the memory 610 stores a computer program; the processor 620 is communicatively coupled to the memory 610, and executes the training method of the present invention and/or the target detection method of the present invention when the computer program is invoked; the display 630 is communicatively coupled to the processor 620 and the memory 610 for displaying a GUI interactive interface associated with the training method and/or the target detection method.
The protection scope of the training method and the target detection method of the present invention is not limited to the execution order of the steps listed in this embodiment; all schemes obtained by adding, removing or replacing steps in accordance with the principles of the present invention are included in the protection scope of the present invention.
The training data adopted by the training method comprises the position, the size and the category of the selection frame and the angle of the selection frame; the image processing model obtained by the training data can output an angular prediction frame, the detection of the inclined target can be realized based on the angular prediction frame, and the detection has higher positioning accuracy and recognition rate. Therefore, the training method can overcome the problems of inaccurate target positioning and low recognition rate when the horizontal frame is adopted to detect the inclined target in the prior art, and is particularly suitable for target detection under a top view multi-angle scene.
In addition, in some embodiments, methods for detecting rotated objects in images mostly target scenes with low real-time requirements, such as rotated object detection in aerial images or inclined text detection. Their main ideas fall into two categories: pixel-level detection (Pixel-to-Pixel) methods, and detection methods based on inclined bounding boxes. The labeling process of the Pixel-to-Pixel methods needs to label every pixel (for example with binary labels), so the labeling workload is large and the segmentation precision for small targets is low; methods based on inclined bounding-box detection are usually improvements of two-stage RCNN or Faster RCNN, whose model prediction speed is often difficult to meet requirements on miniature terminals, and in addition the definition of angles in inclined bounding-box methods often causes the PoA and EoE problems. To solve these problems, the training method of the present invention defines the width and the height of the selection frame by the w channel and the h channel respectively, so that the w channel always corresponds to the long side of the selection frame and the h channel always corresponds to the short side, thereby resolving the EoE problem; and it resolves the PoA problem by mapping the angle into a multi-dimensional array such that the arrays corresponding to angles of 0° and 180° are identical, so that no multiple definition of the angle remains. In addition, the image processing model adopted in the training method is based on the TinyYoLoV3 architecture and obtains detail features with inverted residual blocks, so it is a lightweight small network that is convenient to deploy.
In summary, the present invention effectively overcomes the disadvantages of the prior art and has high industrial utility value.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Modifications and variations may be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations that can be accomplished by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.

Claims (7)

1. A training method for training an image processing model, characterized in that the image processing model is used for processing a target image to obtain a selection frame matched with a target object contained in the target image; the image processing model comprises a shallow feature extraction module, a deep feature extraction module, a fusion module and a post-processing module, wherein the shallow feature extraction module is used for obtaining shallow features of input data of the image processing model, the deep feature extraction module is used for obtaining deep features of the input data of the image processing model, the fusion module is connected with the shallow feature extraction module and the deep feature extraction module and is used for fusing the shallow features and the deep features of the input data to obtain a feature map, and the post-processing module is connected with the fusion module and is used for processing the feature map output by the fusion module to obtain a plurality of prediction frames; the training method comprises the following steps:
acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data comprises a training image and the position, the size, the angle and the category of a selection frame in the training image;
training the image processing model using the training dataset, comprising: selecting training data from the training data set as current training data, processing the current training data by using the image processing model to obtain a prediction frame corresponding to the current training data, calculating a function value of a loss function according to the prediction frame corresponding to the current training data, and adjusting parameters and/or architecture of the image processing model according to the function value of the loss function;
the loss function is:
wherein S² is the number of grids of the feature map; λ_conf, λ_coor and λ_class are respectively three weight values; conf_target is the confidence of the target selection frame; conf_predict is the confidence of the prediction frame; coor_target is the center-point coordinates, width and height of the target selection frame; coor_predict is the center-point coordinates, width and height of the prediction frame; θ_target is the angle of the target selection frame; θ_predict is the angle of the prediction frame; L_MSE is the mean square error loss function; L_focalloss is the focal loss function; L_cross-entropy is the cross-entropy loss function.
2. The training method of claim 1, wherein prior to training the image processing model with the training dataset, the training method further comprises:
the angles of the selection boxes in the training data are mapped into a multi-dimensional array using gaussian functions.
3. The training method of claim 1, wherein the detected object of the image processing model is a human body, and the loss function is:
wherein S² is the number of grids of the feature map; λ_conf and λ_coor are respectively two weight values; conf_target is the confidence of the target selection frame; conf_predict is the confidence of the prediction frame; coor_target is the center-point coordinates, width and height of the target selection frame; coor_predict is the center-point coordinates, width and height of the prediction frame; θ_target is the angle of the target selection frame; θ_predict is the angle of the prediction frame; L_MSE is the mean square error loss function; L_focalloss is the focal loss function.
4. A target detection method, characterized in that the target detection method comprises:
acquiring a target image to be detected, wherein the target image comprises a target object;
a training method according to any one of claims 1-3, wherein a trained image processing model is obtained;
processing the target image by using the trained image processing model to obtain a prediction frame matched with the target object;
the target image is detected based on a prediction box matched with the target object.
5. The method of claim 4, wherein the method of obtaining the trained image processing model comprises:
training at least two image processing models by using the training method to obtain at least two alternative models;
acquiring a test data set;
and respectively testing each alternative model by using the test data set, and selecting the trained image processing model from the at least two alternative models according to a test result.
6. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program, when executed by a processor, implements the training method of any of claims 1-3, and/or the object detection method of any of claims 4-5.
7. An electronic device, the electronic device comprising:
a memory storing a computer program;
a processor in communication with the memory, which when invoked performs the training method of any one of claims 1-3, and/or the object detection method of any one of claims 4-5;
and the display is in communication connection with the processor and the memory and is used for displaying a related GUI interactive interface of the training method and/or the target detection method.
CN202011430940.6A 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment Active CN112418344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011430940.6A CN112418344B (en) 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011430940.6A CN112418344B (en) 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112418344A CN112418344A (en) 2021-02-26
CN112418344B true CN112418344B (en) 2023-11-21

Family

ID=74774956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011430940.6A Active CN112418344B (en) 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112418344B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494398B (en) * 2022-01-18 2024-05-07 深圳市联洲国际技术有限公司 Processing method and device of inclined target, storage medium and processor

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019128646A1 (en) * 2017-12-28 2019-07-04 深圳励飞科技有限公司 Face detection method, method and device for training parameters of convolutional neural network, and medium
EP3509014A1 (en) * 2018-01-05 2019-07-10 Whirlpool Corporation Detecting objects in images
CN108470077A (en) * 2018-05-28 2018-08-31 广东工业大学 A kind of video key frame extracting method, system and equipment and storage medium
WO2019227479A1 (en) * 2018-06-01 2019-12-05 华为技术有限公司 Method and apparatus for generating face rotation image
CN111950329A (en) * 2019-05-16 2020-11-17 长沙智能驾驶研究院有限公司 Target detection and model training method and device, computer equipment and storage medium
CN111079632A (en) * 2019-12-12 2020-04-28 上海眼控科技股份有限公司 Training method and device of text detection model, computer equipment and storage medium
CN111241947A (en) * 2019-12-31 2020-06-05 深圳奇迹智慧网络有限公司 Training method and device of target detection model, storage medium and computer equipment
CN111444918A (en) * 2020-04-01 2020-07-24 中移雄安信息通信科技有限公司 Image inclined text line detection model training and image inclined text line detection method
CN112036249A (en) * 2020-08-04 2020-12-04 汇纳科技股份有限公司 Method, system, medium and terminal for end-to-end pedestrian detection and attribute identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a workpiece recognition method based on improved YOLO v3; Li Jiaxi; Qiu Dong; Yang Hongtao; Liu Keping; Modular Machine Tool & Automatic Manufacturing Technique (No. 08); full text *

Also Published As

Publication number Publication date
CN112418344A (en) 2021-02-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 201203 No. 6, Lane 55, Chuanhe Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant after: Winner Technology Co.,Ltd.

Address before: 201505 Room 216, 333 Tingfeng Highway, Tinglin Town, Jinshan District, Shanghai

Applicant before: Winner Technology Co.,Ltd.

GR01 Patent grant