CN112418344A - Training method, target detection method, medium and electronic device - Google Patents

Training method, target detection method, medium and electronic device

Info

Publication number
CN112418344A
Authority
CN
China
Prior art keywords
training
target
image processing
frame
processing model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011430940.6A
Other languages
Chinese (zh)
Other versions
CN112418344B (en)
Inventor
王海涛
袁德胜
游浩泉
成西锋
任晓双
崔龙
马卫民
林治强
党毅飞
李伟超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Winner Technology Co ltd
Original Assignee
Winner Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Winner Technology Co ltd filed Critical Winner Technology Co ltd
Priority to CN202011430940.6A priority Critical patent/CN112418344B/en
Publication of CN112418344A publication Critical patent/CN112418344A/en
Application granted granted Critical
Publication of CN112418344B publication Critical patent/CN112418344B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a training method, a target detection method, a medium and an electronic device. The training method comprises the following steps: acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data comprises a training image and the position, size, angle and category of a selection frame in the training image; and training the image processing model by using the training data set. The training method can solve the prior-art problems of inaccurate target positioning and a low recognition rate that arise when a horizontal frame is used to detect a tilted target.

Description

Training method, target detection method, medium and electronic device
Technical Field
The invention belongs to the field of data identification, relates to target detection methods, and particularly relates to a training method, a target detection method, a medium and an electronic device.
Background
With the development of technology and the growth of user demands, video monitoring is increasingly widely applied in daily life. In specific application scenarios, monitoring cameras are often installed at a top-down (overhead) view angle due to limitations of the installation environment, the detection range, installation manpower and the like. In this case, human bodies may be tilted to different angles in the picture, and the conventional horizontal rectangular detection frame cannot satisfactorily frame the target subject. Target detection for ordinary-view images has been studied extensively; in those scenes, target positions are mostly delimited by horizontal rectangular frames, and targets are located by regressing the frame parameters. Classical methods include Faster RCNN (Region-based Convolutional Neural Network), which achieves efficient, high-precision target detection through feature sharing and a Region Proposal Network (RPN), and the YoLo (You Only Look Once) model, which directly regresses target coordinates and classifies targets per grid cell; the latter greatly improves detection speed over Faster RCNN and is therefore widely applied in many fields. However, these methods usually select a horizontal frame as the candidate frame, which carries no angle information. Although such candidate frames can accurately and effectively locate objects that are vertical or horizontal in the image, when these methods are applied directly to human body detection at a top-down view angle, the lack of rotation invariance in the target detection model means the resulting target detection frames are usually not accurate enough; for example, the background area inside a located detection frame may be much larger than the area of the target itself, which leads to inaccurate target localization and a low target recognition rate.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, an object of the present invention is to provide a training method, an object detection method, a medium and an electronic device, which are used to solve the problem that a candidate frame selected in the prior art does not have angle information.
To achieve the above and other related objects, a first aspect of the present invention provides a training method for training an image processing model, where the image processing model is used to process a target image to obtain a selection box matching a target object included in the target image, the training method including: acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data comprises a training image and the position, the size, the angle and the category of a selection frame in the training image; and training the image processing model by using the training data set.
In an embodiment of the first aspect, before training the image processing model with the training data set, the training method further includes: and mapping the angle of the selection frame in the training data into a multidimensional array by using a Gaussian function.
In an embodiment of the first aspect, the image processing model includes: the shallow feature extraction module is used for acquiring shallow features of input data of the image processing model; the deep feature extraction module is used for acquiring deep features of input data of the image processing model; the fusion module is connected with the shallow feature extraction module and the deep feature extraction module and is used for fusing the shallow features and the deep features of the input data to obtain a feature map; and the post-processing module is connected with the fusion module and used for processing the feature map output by the fusion module to obtain a plurality of prediction frames.
In an embodiment of the first aspect, a method for implementing training of the image processing model by using the training data set includes: selecting training data from the training data set as current training data; processing the current training data by using the image processing model to obtain a prediction frame corresponding to the current training data; calculating a function value of a loss function according to a prediction box corresponding to the current training data; adjusting parameters and/or architecture of the image processing model according to the function value of the loss function; and repeating the process until the function value of the loss function meets a preset condition.
In an embodiment of the first aspect, the loss function is:

L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) + λ_class · L_cross-entropy(class_target, class_predict) ]

where S² is the number of grid cells of the feature map; λ_conf, λ_coor and λ_class are respectively three weight values; conf_target is the confidence of the target selection frame and conf_predict is the confidence of the prediction frame; coor_target denotes the center-point coordinates, width and height of the target selection frame and coor_predict those of the prediction frame; θ_target is the angle of the target selection frame and θ_predict is the angle of the prediction frame; L_MSE is the mean-square-error loss function, L_focalloss is the focal loss function, and L_cross-entropy is the cross-entropy loss function.
In an embodiment of the first aspect, a detection object of the image processing model is a human body, and the loss function is:
L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) ]

where S² is the number of grid cells of the feature map; λ_conf and λ_coor are respectively two weight values; conf_target is the confidence of the target selection frame and conf_predict is the confidence of the prediction frame; coor_target denotes the center-point coordinates, width and height of the target selection frame and coor_predict those of the prediction frame; θ_target is the angle of the target selection frame and θ_predict is the angle of the prediction frame; L_MSE is the mean-square-error loss function and L_focalloss is the focal loss function.
A second aspect of the present invention provides an object detection method, including: acquiring a target image to be detected, wherein the target image comprises a target object; according to the training method of any one of the first aspect of the invention, a trained image processing model is obtained; processing the target image by using the trained image processing model to obtain a prediction frame matched with the target object; and detecting the target image based on the prediction frame matched with the target object.
In an embodiment of the second aspect, an implementation method for obtaining the trained image processing model includes: training at least two image processing models by using the training method to obtain at least two alternative models; acquiring a test data set; and respectively testing each alternative model by using the test data set, and selecting the trained image processing model from the at least two alternative models according to the test result.
A third aspect of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of any one of the first aspects of the invention and/or the object detection method of any one of the second aspects of the invention.
A fourth aspect of the present invention provides an electronic apparatus, comprising: a memory storing a computer program; a processor, communicatively coupled to the memory, for executing the training method according to any of the first aspects of the present invention and/or the object detection method according to any of the second aspects of the present invention when the computer program is invoked; and the display is in communication connection with the processor and the memory and is used for displaying a GUI (graphical user interface) related to the training method and/or the target detection method.
As described above, the training method, the target detection method, the medium, and the electronic device according to the present invention have the following advantageous effects:
the training data adopted by the training method comprises the position, the size and the category of the selection frame and the angle of the selection frame; the image processing model obtained by adopting the training data can output a prediction frame with an angle, the detection of the inclined target can be realized based on the prediction frame with the angle, and the detection has higher positioning accuracy and recognition rate. Therefore, the training method can solve the problems of inaccurate target positioning and low recognition rate when the horizontal frame is adopted to detect the inclined target in the prior art, and is particularly suitable for target detection under a multi-angle scene with a top view angle.
Drawings
FIG. 1 is a flow chart of a training method according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating examples of multidimensional arrays involved in a training method according to an embodiment of the present invention.
Fig. 3A is a schematic structural diagram of an image processing model used in an embodiment of the training method of the present invention.
FIG. 3B is a schematic diagram of a portion of an algorithm of the training method according to an embodiment of the present invention.
FIG. 3C is a schematic diagram of a portion of an algorithm of the training method according to an embodiment of the present invention.
FIG. 3D is a schematic diagram of a portion of an algorithm of the training method according to an embodiment of the present invention.
Fig. 4 is a flowchart of step S12 according to an embodiment of the training method of the present invention.
FIG. 5A is a flowchart illustrating a method for detecting a target according to an embodiment of the present invention.
FIG. 5B is a flowchart illustrating the step S52 of the target detection method according to an embodiment of the present invention.
Fig. 5C is a flowchart illustrating the step S523 of the target detecting method according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Description of the element reference numerals
1 image processing model
11 shallow layer feature extraction module
12 deep layer characteristic extraction module
13 fusion module
14 post-processing module
600 electronic device
610 memory
620 processor
630 display
Steps S11 to S12
Steps S121 to S125
Steps S51 to S54
Steps S521 to S523
Steps S5231 to S5233
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments merely illustrate the basic idea of the present invention: they show only the components related to the invention and are not drawn according to the number, shape and size of the components in an actual implementation, where the type, quantity and proportion of components may vary arbitrarily and the component layout may be more complicated. Moreover, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions.
Existing detection methods for target objects usually select a horizontal or vertical frame as the candidate frame, which carries no angle information. When such methods are applied directly to human body detection at a top-down view angle, the detection target is often tilted in the image, so the obtained target detection frame tends to be inaccurate; for example, the background area inside the located detection frame may be far larger than the area of the target itself, causing inaccurate target positioning, a low target recognition rate, and similar problems. To address this, the invention provides a training method for training an image processing model in which the training data comprise not only the position, size and category of the selection frame but also its angle. An image processing model trained on such data can output prediction frames with angles; in a specific application, these angled prediction frames serve as candidate frames, a suitable target detection frame is selected from them, and detection of tilted targets is realized on that basis with higher positioning accuracy and recognition rate. Here, the selection frame refers to the frame of a target object in a training image; the prediction frame refers to a frame of a target object output by the image processing model; the candidate frames refer to the frames of a target object offered for selection in a practical application, usually several in number; and the target detection frame refers to the most suitable frame of the target object selected from the candidate frames.
In an embodiment of the present invention, the training method is used to train an image processing model. The image processing model is an artificial-intelligence-based model, such as a YoLo model or a TinyYoLoV3 model, used to process a target image to obtain a selection frame matching a target object contained in the target image. The input of the image processing model is a training image (during training) or an image to be detected (during application), and its output is the prediction frames corresponding to that image. Specifically, referring to fig. 1, the training method in this embodiment includes:
s11, obtaining a training data set, where the training data set includes a plurality of training data, and each training data includes a training image and a position, a size, an angle, and a category of a selection box in the training image. The training data set can be obtained from an internet source data set, and can also be obtained by manually marking the position, the size, the angle and the category of the selection box on the training image.
Preferably, the position, size and angle of a selection frame can be represented by a set of channels (x_center, y_center, w, h, θ), where the x_center and y_center channels represent the abscissa and ordinate of the center point of the selection frame; the w and h channels represent the width and height of the selection frame, where the width refers to the length of the long side and the height refers to the length of the short side; and the θ channel represents the angle of the selection frame. In a specific application, the included angle between the long side and the y axis can be used as the angle, with a value range of [0°, 180°). This definition avoids the ambiguity that arises with horizontal or vertical selection-frame definitions, where the same selection frame could be described either as (50, 50, 100, 50, 0°) or as (50, 50, 50, 100, 90°); the unambiguous definition is more conducive to network convergence during training. The category of the selection frame represents the category of the object to be detected, such as human body, vehicle or animal; in specific applications a single category or multiple categories can be defined as required. A minimal sketch of this long-side-first convention is given below.
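As an illustration only (not part of the patent text), the following sketch normalizes an arbitrarily described box into the convention above, with w always the long side and θ the angle between the long side and the y axis folded into [0°, 180°); the function name is hypothetical:

```python
def canonical_box(cx, cy, side_a, side_b, angle_a):
    """Return (cx, cy, w, h, theta) with w = long side, h = short side,
    and theta = angle between the long side and the y-axis in [0, 180)."""
    if side_a >= side_b:
        w, h, theta = side_a, side_b, angle_a
    else:
        # side_b is the long side; it is perpendicular to side_a
        w, h, theta = side_b, side_a, angle_a + 90.0
    return cx, cy, w, h, theta % 180.0

# Both descriptions of the same physical box map to one canonical tuple:
assert canonical_box(50, 50, 100, 50, 0) == canonical_box(50, 50, 50, 100, 90)
```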
S12, training the image processing model by using the training data set. The method for training the image processing model by using the training data set can be implemented by using the existing method, and details are not repeated here.
Based on the image processing model obtained after training by the method of this embodiment, an input target image can be processed to obtain one or more prediction frames, where each prediction frame consists of a group of channels and a corresponding category, the channels representing the abscissa and ordinate of the center point, the width, the height, and the angle of the prediction frame.
As can be seen from the above description, the training data used by the training method of this embodiment include the position, size and category of the selection frame as well as its angle. The image processing model obtained with such training data can output prediction frames with angles; using these angled prediction frames as candidate frames, a target image can be detected to obtain an angled target detection frame, thereby realizing detection of tilted targets with higher positioning accuracy and recognition rate. The method is therefore particularly suitable for target detection in multi-angle scenes at a top-down view angle.
In addition, in some embodiments where a horizontal channel and a vertical channel are used to define the selection frame, an exchangeability-of-edges (EoE) problem arises: near the 90° boundary, the long and short sides are not clearly distinguished, so the same channel must fit the long side in some cases and the short side in others, and the network cannot fit an actual box well; in some cases the horizontal and vertical channels may even predict similar values. In this embodiment, the w channel and the h channel define the width and height of the selection frame such that the w channel always corresponds to the long side and the h channel always corresponds to the short side, which resolves the EoE problem.
Moreover, compared with the mode that two models are adopted to respectively obtain the position and the angle of the prediction frame in some embodiments, the image processing model adopted by the embodiment can simultaneously complete the end-to-end detection of the target object in one forward propagation, output the position and the angle information of the prediction frame, and can accurately fit the detection target.
In an embodiment of the invention, before training the image processing model using the training data set, the training method further comprises: mapping the angle of the selection frame in the training data into a multidimensional array by using a Gaussian function. Preferably, the length of the multidimensional array is 180. Specifically, for any angle θ₀, the elements of the mapped array follow a Gaussian distribution centered at θ₀, with the distance measured circularly so that the two cases θ₀ ≤ 90° and θ₀ > 90° wrap around consistently. One way to implement the mapping in this embodiment assigns to the i-th element (i = 0, 1, …, 179):

g(i) = exp( − min(|i − θ₀|, 180 − |i − θ₀|)² / (2σ²) )

where σ is the standard deviation, whose value can be set according to actual requirements. In addition, please refer to fig. 2, which shows example multidimensional arrays for θ₀ = 0.3°, θ₀ = 45.2° and θ₀ = 89.5°. It should be noted that the above formula is only one possible choice for this embodiment; other mappings may also be adopted in specific applications. A sketch of this encoding is given after this paragraph.
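For illustration, a minimal sketch of the Gaussian angle encoding described above, assuming the circular-distance form of the formula (the function name and default σ are hypothetical):

```python
import numpy as np

def encode_angle(theta0: float, sigma: float = 6.0, bins: int = 180) -> np.ndarray:
    """Map an angle theta0 in [0, 180) to a `bins`-dimensional Gaussian label.

    The distance is circular with period 180 degrees, so the labels for
    0 and 180 degrees coincide, which avoids the PoA problem below."""
    idx = np.arange(bins, dtype=np.float64)
    diff = np.abs(idx - theta0)
    circular = np.minimum(diff, bins - diff)        # wrap-around distance
    return np.exp(-(circular ** 2) / (2.0 * sigma ** 2))

# A horizontal box gets one label whether its angle is written as 0 or 180:
assert np.allclose(encode_angle(0.0), encode_angle(180.0))
```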
In this embodiment, mapping the angle of the selection frame to a multidimensional array avoids the periodicity-of-angle (PoA) problem, in which an angled selection frame admits multiple angle definitions in application and the network therefore cannot fit (that is, a horizontal selection frame may be described with an angle of either 0° or 180°, making the network's angle prediction inaccurate). The principle is as follows: because the angles are mapped into multidimensional arrays, the arrays corresponding to a selection frame at 0° and at 180° are identical, so no multiple-definition problem exists and the PoA problem is avoided.
Referring to fig. 3A, in an embodiment of the present invention, the image processing model 1 includes a shallow feature extraction module 11, a deep feature extraction module 12, a fusion module 13, and a post-processing module 14. The shallow feature extraction module 11 obtains shallow features of the input data of the image processing model; the deep feature extraction module 12 obtains deep features of the input data; the fusion module 13 is connected to the shallow feature extraction module 11 and the deep feature extraction module 12 and fuses the shallow and deep features of the input data to obtain a feature map of the input data; and the post-processing module 14 is connected to the fusion module 13 and processes the feature map output by the fusion module 13 to obtain one or more prediction frames.
Specifically, referring to fig. 3B, fig. 3C and fig. 3D, the shallow feature extraction module 11 may be implemented with a TinyYoLoV3-based architecture; it is identical to the first half of the original TinyYoLoV3, and this part can load detection-network weights trained on other data sets (such as ImageNet) as pre-training weights. It should be noted that TinyYoLoV3 is a simplified version of YoLoV3 and is faster; YoLoV3, the third version of the YoLo series of target detection algorithms, treats object detection as a regression problem and completes the pipeline from raw image input to object position and category output with a single end-to-end network.
The deep feature extraction module 12 may further extract deep features of the input data. Preferably, the deep feature extraction module 12 includes an inverted residual block (Inverted Residual Block), implemented as shown in fig. 3C; it is suited to lightweight terminal networks and can extract features effectively while reducing the computation of the network, where k is the channel expansion coefficient and dw_conv is a depthwise (per-channel) convolution. A sketch of such a block is given below.
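As a rough illustration only, a MobileNetV2-style inverted residual block consistent with the description (expansion factor k, depthwise 3×3 convolution, linear projection, residual add); the exact layer configuration of fig. 3C is not reproduced here, so treat this as an assumption:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """1x1 expand by factor k -> depthwise 3x3 conv (dw_conv) -> 1x1 project,
    with a residual connection around the whole block."""
    def __init__(self, channels: int, k: int = 4):
        super().__init__()
        mid = channels * k
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            # groups=mid makes this a depthwise (per-channel) convolution
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)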
The fusion module 13 fuses the shallow features and the deep features to obtain a feature map and can be used to extract features for the angle channels. Specifically, the feature map may be represented in matrix form, in which case the fusion module 13 outputs a matrix of size (number of feature-map grid cells) × 186. An illustrative composition of the modules is sketched below.
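Purely as an illustration of how the four modules might compose end to end (the dataflow, class name and constructor arguments below are assumptions; the patent gives no layer-level wiring):

```python
import torch
import torch.nn as nn

class OrientedDetector(nn.Module):
    """Shallow features -> deep features -> fusion -> 186-channel head,
    producing one 186-channel vector per feature-map grid cell."""
    def __init__(self, shallow: nn.Module, deep: nn.Module,
                 fuse: nn.Module, head: nn.Module):
        super().__init__()
        self.shallow, self.deep = shallow, deep
        self.fuse, self.head = fuse, head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.shallow(x)      # shallow features of the input (module 11)
        d = self.deep(s)         # deeper features on top of them (module 12)
        f = self.fuse(s, d)      # fused feature map (module 13)
        return self.head(f)      # e.g. a 24 x 16 x 186 output (module 14)
```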
Referring to fig. 3D, taking the feature map of 24 × 16 × 186 channels as an example, the process of the post-processing module 14 processing the feature map output by the fusion module to obtain a plurality of prediction frames includes:
step 1, dividing channels, wherein 186 represents 186 channels, and each channel represents an abscissa x of the center point of the selection framecenterLongitudinal coordinate y of the center point of the selection framecenterWidth w of selection frame, height h of selection frame, detection confidence confpredictA 180-bit angle array and a selection box category. In particular, when the image processing model is used for detecting a human body, the category of the selection frame is only the human body frame, and at this time, the number of bits of the selection frame category is 1. In the feature map of the 24 × 16 × 186 channels, the number of the feature map grids is 24 × 16.
Step 2, encoding each channel. For the abscissa channel, the encoded output is (sigmoid(x_center) + X) / W, where X is the grid offset in the width direction (ranging from 1 to 24 in this embodiment) and W is the grid output width (24 in this embodiment). For the ordinate channel, the encoded output is (sigmoid(y_center) + Y) / H, where Y is the grid offset in the height direction (ranging from 1 to 16 in this embodiment) and H is the grid output height (16 in this embodiment). For the width and height channels, the encoded outputs are exp(w) and exp(h), respectively. For the angle channel, the encoded output is the index of the maximum bit after applying sigmoid to each element of the angle array, i.e. argmax(sigmoid(θ)). For the detection confidence channel, the encoded output is sigmoid(conf_predict).
Step 3, converting the encoded channel values into absolute coordinates to obtain the expected number of prediction frames. Specifically, the conversion is realized by multiplying the encoded output of the abscissa channel by the image width, the encoded output of the ordinate channel by the image height, the encoded output of the width channel by the image width, and the encoded output of the height channel by the image height, while the output values of the remaining channels are kept unchanged. A sketch of this decoding is given below.
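A minimal sketch of steps 1 to 3 for a single grid cell, assuming the channel order listed in step 1 and normalized offsets in step 2 (the function names and the zero-based grid indices are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_cell(raw, gx, gy, grid_w=24, grid_h=16, img_w=768, img_h=512):
    """Decode one 186-channel vector (channels: x, y, w, h, conf,
    180 angle bins, 1 class bit) at grid cell (gx, gy)."""
    x = (sigmoid(raw[0]) + gx) / grid_w * img_w     # absolute center x
    y = (sigmoid(raw[1]) + gy) / grid_h * img_h     # absolute center y
    w = np.exp(raw[2]) * img_w                      # long side, absolute
    h = np.exp(raw[3]) * img_h                      # short side, absolute
    conf = sigmoid(raw[4])                          # detection confidence
    theta = int(np.argmax(sigmoid(raw[5:185])))     # angle in degrees
    cls = sigmoid(raw[185])                         # human-body class score
    return x, y, w, h, theta, conf, cls
```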
According to the above description, the image processing model provided by this embodiment adds fusion of shallow features to the original TinyYoLoV3 model and is therefore better suited to detecting tilted selection frames, and introducing depthwise convolution into the model for feature extraction helps reduce the amount of computation. In addition, the model fuses features at multiple scales and can adapt to targets of many sizes. Because the width and height of the prediction frame are predicted directly by the model, no anchor boxes (a priori template bounding boxes) need to be added to the computation; consequently, when the model is trained for other scenes, anchor boxes do not need to be obtained by re-clustering. Furthermore, since the model is based on the TinyYoLoV3 architecture and uses inverted residual blocks to capture detail features, it is a lightweight small network that is easy to deploy.
Referring to fig. 4, in an embodiment of the present invention, a method for implementing training of the image processing model by using the training data set includes:
and S121, selecting training data from the training data set as current training data.
And S122, processing the current training data by using the image processing model to obtain a prediction frame corresponding to the current training data.
And S123, calculating a function value of a loss function according to the prediction box corresponding to the current training data. Optionally, one loss function used in this embodiment is:
L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) + λ_class · L_cross-entropy(class_target, class_predict) ]

where S² is the number of grid cells of the feature map, for example 24 × 16. λ_conf, λ_coor and λ_class are three weight values set according to actual requirements; preferably, when the center point of the target selection frame falls inside a given grid cell, the weights for that cell are λ_conf = 5, λ_coor = 1 and λ_class = 1, and otherwise λ_conf = 1, λ_coor = 0 and λ_class = 0, where the target selection frame refers to the selection frame in the training image contained in the current training data. conf_target, the confidence of the target selection frame, depends on the Euclidean distance D₁ between the coordinates of the target selection frame and those of the prediction frame: when D₁ is less than a threshold d_th, the confidence is

conf_target = 1 − (D₁ / d_th)^a,

and otherwise conf_target = 0, where a is a hyperparameter that adjusts the sharpness of the function (taking, for example, the value 2) and the threshold d_th can be selected according to actual requirements. conf_predict is the confidence of the prediction frame and can be output directly by the image processing model. coor_target denotes the center-point coordinates, width and height of the target selection frame, and coor_predict those of the prediction frame; θ_target is the angle of the target selection frame and θ_predict the angle of the prediction frame. L_MSE is the mean-square-error loss function, L_focalloss is the focal loss function, and L_cross-entropy is the cross-entropy loss function. A sketch of the target-confidence computation follows.
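For illustration, a sketch of conf_target under the reconstruction above (the exact falloff in the original formula image is not recoverable, so the 1 − (D₁/d_th)^a form is an assumption):

```python
def conf_target(d1: float, d_th: float, a: float = 2.0) -> float:
    """Target confidence as a function of the center distance D1 between
    the ground-truth selection frame and the prediction frame; zero once
    the distance reaches the threshold d_th, sharper for larger a."""
    if d1 >= d_th:
        return 0.0
    return 1.0 - (d1 / d_th) ** a
```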
In particular, when the detection object of the image processing model is a human body, the classification loss term λ_class · L_cross-entropy(class_target, class_predict) need not be considered in the specific application, and the loss function becomes:

L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) ]
and S124, adjusting parameters and/or a framework of the image processing model according to the function value of the loss function, where a specific adjustment method may be implemented by using an existing scheme, and details are not repeated here.
And S125, repeating steps S121 to S124 until the function value of the loss function meets a preset condition. Preferably, the training data selected in step S121 differ between repetitions. The preset condition may be set by the user according to requirements; for example, it may be that the function value of the loss function no longer decreases, or that its decrease is smaller than a preset threshold. A sketch of this loop is given below.
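Purely illustrative pseudocode for steps S121 to S125 (the helpers `dataset.sample` and `loss_fn`, and the stopping tolerance, are hypothetical stand-ins; the patent leaves the adjustment scheme to existing practice):

```python
def train(model, dataset, optimizer, loss_fn, max_steps=100_000, tol=1e-4):
    prev = float("inf")
    for _ in range(max_steps):
        batch = dataset.sample()                  # S121: pick training data
        preds = model(batch.images)               # S122: forward pass
        loss = loss_fn(preds, batch.targets)      # S123: loss function above
        optimizer.zero_grad()
        loss.backward()                           # S124: adjust parameters
        optimizer.step()
        if prev - loss.item() < tol:              # S125: preset condition met
            break
        prev = loss.item()
    return model
```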
Based on the above description of the training method, the invention also provides a target detection method. Referring to fig. 5A, in an embodiment of the present invention, the target detection method includes:
s51, acquiring a target image to be detected, wherein the target image comprises a target object; in particular, the target object includes only the category of human body. Preferably, this step further comprises preprocessing the target image to match the size of the target image with the input size of the image processing model, e.g. the size of the target image may be scaled to 768 × 512.
S52, obtaining a trained image processing model according to the training method of the invention. Specifically, referring to fig. 5B, an implementation method of step S52 in this embodiment includes:
s521, training at least two image processing models by using the training method to obtain at least two alternative models.
S522, a test data set is obtained, wherein the test data set includes a plurality of test data, and each test data includes a test image and a position, a size, an angle, and a category of a selection frame in the test image.
S523, testing each alternative model with the test data set, and selecting one of the at least two alternative models as the trained image processing model according to the test results.
And S53, processing the target image by using the trained image processing model to obtain a prediction frame matched with the target object. Each target image may correspond to multiple prediction frames, for example, each target image may be processed by using the trained image processing model to obtain 24 × 16 prediction frames.
S54, detecting the target image based on the prediction frames matched with the target object. Specifically, the prediction frames matched with the target object serve as candidate frames, and the most suitable candidate frame is selected as the target detection frame; this selection can be realized with an existing method such as non-maximum suppression (NMS). The target image is then detected with existing target detection techniques based on the target detection frame. A sketch of NMS over oriented frames follows.
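For illustration, a greedy NMS sketch over oriented boxes represented as polygons (the patent only names NMS; the use of shapely and the overlap threshold are assumptions):

```python
from shapely.geometry import Polygon

def rotated_nms(polys: list[Polygon], scores: list[float], iou_thresh: float = 0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(polys)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if polys[best].intersection(polys[i]).area
                 / max(polys[best].union(polys[i]).area, 1e-9) < iou_thresh]
    return keep
```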
Optionally, referring to fig. 5C, in this embodiment, an implementation method for selecting one of the at least two candidate models as the trained image processing model according to a test result includes:
s5231, processing the target image by using each of the candidate models. And each candidate model outputs a result after processing the target image, wherein the result is a plurality of prediction frames.
S5232, screening the prediction frames output by each candidate model to obtain the optimal prediction frame corresponding to each candidate model. One implementation of the screening is as follows: for any two prediction frames A and B, respectively obtain

IoU = |A ∩ B| / |A ∪ B|

and

GIoU = IoU − |C \ (A ∪ B)| / |C|,

where C is the minimum closure region of prediction frames A and B, namely the convex hull containing all vertices of the two frames, |C| is its area, and |C \ (A ∪ B)| is the area of the part of the closure region that belongs to neither prediction frame. The optimal prediction frame corresponding to each candidate model is then selected from the multiple prediction frames according to the GIoU. A sketch of this computation follows.
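A sketch of GIoU for the oriented frames above, using shapely for the polygon geometry (the corner construction assumes the long-side/y-axis angle convention from earlier; helper names are hypothetical):

```python
import math
from shapely.geometry import Polygon

def box_polygon(cx, cy, w, h, theta_deg):
    """Corners of an oriented box; w is the long side and theta_deg the
    angle between the long side and the y-axis."""
    t = math.radians(theta_deg)
    ux, uy = math.sin(t), math.cos(t)      # unit vector along the long side
    vx, vy = math.cos(t), -math.sin(t)     # unit vector along the short side
    return Polygon([(cx + sx * w / 2 * ux + sy * h / 2 * vx,
                     cy + sx * w / 2 * uy + sy * h / 2 * vy)
                    for sx, sy in ((1, 1), (1, -1), (-1, -1), (-1, 1))])

def giou(a: Polygon, b: Polygon) -> float:
    """GIoU = IoU - |C \\ (A u B)| / |C|, C being the convex closure of A and B."""
    union = a.union(b)
    hull = union.convex_hull               # minimum closure region C
    iou = a.intersection(b).area / union.area
    return iou - (hull.area - union.area) / hull.area
```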
S5233, respectively calculating the recall rate and the accuracy rate of each alternative model based on its corresponding optimal prediction frames, and selecting the trained image processing model from the alternative models according to the recall rate and the accuracy rate. Specifically, a comprehensive index can be obtained from the recall rate and the accuracy rate of each alternative model:

F = (1 + α²) · P · R / (α² · P + R)

The comprehensive index F balances the recall rate and the accuracy rate so as to evaluate the model effect. Here R is the recall rate, whose value is the number of correctly detected targets divided by the total number of targets; P is the accuracy rate, whose value is the number of correctly detected targets divided by the total number of detected targets; and the parameter α adjusts the relative importance of recall and accuracy in the assessment; in particular, when α = 1, recall and accuracy are equally important. An illustrative computation follows.
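For illustration, the comprehensive index as reconstructed above (the exact algebraic form in the original formula image is an assumption; at α = 1 it reduces to the familiar F1 score):

```python
def f_score(recall: float, precision: float, alpha: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; alpha trades off
    their importance, with alpha = 1 weighing them equally."""
    return ((1 + alpha ** 2) * precision * recall
            / (alpha ** 2 * precision + recall))
```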
According to the above description, the target detection method of this embodiment screens the prediction frames with GIoU and NMS to obtain the optimal prediction frame, making it better suited to screening rotated rectangular frames.
Based on the above description of the training method and the object detection method, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of the present invention and/or the object detection method of the present invention.
Based on the above description of the training method and the target detection method, the invention also provides an electronic device. Referring to fig. 6, in an embodiment of the invention, the electronic device 600 includes a memory 610, a processor 620 and a display 630. Wherein the memory 610 stores a computer program; the processor 620 is communicatively connected to the memory 610, and executes the training method of the present invention and/or the object detection method of the present invention when the computer program is called; the display 630 is communicatively coupled to the processor 620 and the memory 610, and is configured to display a GUI interactive interface associated with the training method and/or the object detection method.
The protection scope of the training method and the target detection method of the present invention is not limited to the execution sequence of the steps listed in this embodiment, and all the schemes of adding, subtracting, and replacing steps in the prior art according to the principle of the present invention are included in the protection scope of the present invention.
The training data adopted by the training method of the invention comprises the position, the size and the category of the selection frame, and also comprises the angle of the selection frame; the image processing model obtained by adopting the training data can output a prediction frame with an angle, the detection of the inclined target can be realized based on the prediction frame with the angle, and the detection has higher positioning accuracy and recognition rate. Therefore, the training method can solve the problems of inaccurate target positioning and low recognition rate when the horizontal frame is adopted to detect the inclined target in the prior art, and is particularly suitable for target detection under a multi-angle scene with a top view angle.
In addition, some embodiments of methods for detecting rotated objects in images, mostly aimed at scenes with low real-time requirements such as rotated-object detection in aerial images or tilted-text detection, follow one of two main ideas: pixel-level (pixel-to-pixel) segmentation methods, and methods based on tilted-bounding-box detection. The labeling process of a pixel-to-pixel method requires labeling every pixel (for example, binary-class labeling), which entails a large annotation workload, and the segmentation precision for small targets is low. Methods based on tilted-bounding-box detection are generally improvements on two-stage RCNN or Faster RCNN; their prediction speed often cannot meet the requirements of miniature terminals, and the angle definitions used in such methods often cause the PoA problem and the EoE problem. Addressing these issues, the training method of the present invention defines the width and height of the selection frame with the w channel and the h channel respectively, so that the w channel always corresponds to the long side and the h channel always corresponds to the short side, which solves the EoE problem; and it maps the angle to a multidimensional array so that the arrays for 0° and 180° are identical, eliminating multiple angle definitions and thereby solving the PoA problem. In addition, the image processing model adopted in the training method is based on the TinyYoLoV3 architecture and uses inverted residual blocks to capture detail features, so it is a lightweight small network that is easy to deploy.
In conclusion, the present invention effectively overcomes various disadvantages of the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A training method for training an image processing model, wherein the image processing model is used for processing a target image to obtain a selection box matched with a target object contained in the target image, and the training method comprises:
acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data comprises a training image and the position, the size, the angle and the category of a selection frame in the training image;
and training the image processing model by using the training data set.
2. The training method of claim 1, wherein prior to training the image processing model with the training data set, the training method further comprises:
and mapping the angle of the selection frame in the training data into a multidimensional array by using a Gaussian function.
3. The training method of claim 1, wherein the image processing model comprises:
the shallow feature extraction module is used for acquiring shallow features of input data of the image processing model;
the deep feature extraction module is used for acquiring deep features of input data of the image processing model;
the fusion module is connected with the shallow feature extraction module and the deep feature extraction module and is used for fusing the shallow features and the deep features of the input data to obtain a feature map;
and the post-processing module is connected with the fusion module and used for processing the feature map output by the fusion module to obtain a plurality of prediction frames.
4. A training method according to claim 3, wherein the method of training the image processing model using the training data set comprises:
selecting training data from the training data set as current training data;
processing the current training data by using the image processing model to obtain a prediction frame corresponding to the current training data;
calculating a function value of a loss function according to a prediction box corresponding to the current training data;
adjusting parameters and/or architecture of the image processing model according to the function value of the loss function;
and repeating the process until the function value of the loss function meets a preset condition.
5. Training method according to claim 4, characterized in that the loss function is:
L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) + λ_class · L_cross-entropy(class_target, class_predict) ]

wherein S² is the number of grid cells of the feature map; λ_conf, λ_coor and λ_class are respectively three weight values; conf_target is the confidence of the target selection frame and conf_predict is the confidence of the prediction frame; coor_target denotes the center-point coordinates, width and height of the target selection frame and coor_predict those of the prediction frame; θ_target is the angle of the target selection frame and θ_predict is the angle of the prediction frame; L_MSE is the mean-square-error loss function, L_focalloss is the focal loss function, and L_cross-entropy is the cross-entropy loss function.
6. The training method according to claim 4, wherein the detection object of the image processing model is a human body, and the loss function is:
L = Σ_{i=1}^{S²} [ λ_conf · L_MSE(conf_target, conf_predict) + λ_coor · ( L_MSE(coor_target, coor_predict) + L_focalloss(θ_target, θ_predict) ) ]

wherein S² is the number of grid cells of the feature map; λ_conf and λ_coor are respectively two weight values; conf_target is the confidence of the target selection frame and conf_predict is the confidence of the prediction frame; coor_target denotes the center-point coordinates, width and height of the target selection frame and coor_predict those of the prediction frame; θ_target is the angle of the target selection frame and θ_predict is the angle of the prediction frame; L_MSE is the mean-square-error loss function and L_focalloss is the focal loss function.
7. An object detection method, characterized in that the object detection method comprises:
acquiring a target image to be detected, wherein the target image comprises a target object;
obtaining a trained image processing model according to the training method of any one of claims 1-6;
processing the target image by using the trained image processing model to obtain a prediction frame matched with the target object;
and detecting the target image based on the prediction frame matched with the target object.
8. The method of claim 7, wherein the step of obtaining the trained image processing model comprises:
training at least two image processing models by using the training method to obtain at least two alternative models;
acquiring a test data set;
and respectively testing each alternative model by using the test data set, and selecting the trained image processing model from the at least two alternative models according to the test result.
9. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program, when executed by a processor, implements a training method as claimed in any one of claims 1 to 6, and/or an object detection method as claimed in any one of claims 7 to 8.
10. An electronic device, characterized in that the electronic device comprises:
a memory storing a computer program;
a processor, communicatively coupled to the memory, that executes the training method of any of claims 1-6, and/or the object detection method of any of claims 7-8 when the computer program is invoked;
and the display is in communication connection with the processor and the memory and is used for displaying a GUI (graphical user interface) related to the training method and/or the target detection method.
CN202011430940.6A 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment Active CN112418344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011430940.6A CN112418344B (en) 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011430940.6A CN112418344B (en) 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112418344A true CN112418344A (en) 2021-02-26
CN112418344B CN112418344B (en) 2023-11-21

Family

ID=74774956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011430940.6A Active CN112418344B (en) 2020-12-07 2020-12-07 Training method, target detection method, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112418344B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019128646A1 (en) * 2017-12-28 2019-07-04 深圳励飞科技有限公司 Face detection method, method and device for training parameters of convolutional neural network, and medium
EP3509014A1 (en) * 2018-01-05 2019-07-10 Whirlpool Corporation Detecting objects in images
CN108470077A (en) * 2018-05-28 2018-08-31 广东工业大学 A kind of video key frame extracting method, system and equipment and storage medium
WO2019227479A1 (en) * 2018-06-01 2019-12-05 华为技术有限公司 Method and apparatus for generating face rotation image
CN111950329A (en) * 2019-05-16 2020-11-17 长沙智能驾驶研究院有限公司 Target detection and model training method and device, computer equipment and storage medium
CN111079632A (en) * 2019-12-12 2020-04-28 上海眼控科技股份有限公司 Training method and device of text detection model, computer equipment and storage medium
CN111241947A (en) * 2019-12-31 2020-06-05 深圳奇迹智慧网络有限公司 Training method and device of target detection model, storage medium and computer equipment
CN111444918A (en) * 2020-04-01 2020-07-24 中移雄安信息通信科技有限公司 Image inclined text line detection model training and image inclined text line detection method
CN112036249A (en) * 2020-08-04 2020-12-04 汇纳科技股份有限公司 Method, system, medium and terminal for end-to-end pedestrian detection and attribute identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李佳禧; 邱东; 杨宏韬; 刘克平: "Research on workpiece recognition method based on improved YOLO v3", Modular Machine Tool & Automatic Manufacturing Technique (组合机床与自动化加工技术), no. 08

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494398A (en) * 2022-01-18 2022-05-13 深圳市联洲国际技术有限公司 Processing method and device for inclined target, storage medium and processor
CN114494398B (en) * 2022-01-18 2024-05-07 深圳市联洲国际技术有限公司 Processing method and device of inclined target, storage medium and processor

Also Published As

Publication number Publication date
CN112418344B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN109241913B (en) Ship detection method and system combining significance detection and deep learning
CN115601549B (en) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model
CN113673425B (en) Multi-view target detection method and system based on Transformer
CN108428220B (en) Automatic geometric correction method for ocean island reef area of remote sensing image of geostationary orbit satellite sequence
CN110765865B (en) Underwater target detection method based on improved YOLO algorithm
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
CN113449594A (en) Multilayer network combined remote sensing image ground semantic segmentation and area calculation method
CN110796143A (en) Scene text recognition method based on man-machine cooperation
US20220375192A1 (en) Optimization method, apparatus, device for constructing target detection network, medium and product
CN110427933A (en) A kind of water gauge recognition methods based on deep learning
CN110443279B (en) Unmanned aerial vehicle image vehicle detection method based on lightweight neural network
CN110728307A (en) Method for realizing small sample character recognition of X-ray image by self-generating data set and label
CN113313703A (en) Unmanned aerial vehicle power transmission line inspection method based on deep learning image recognition
CN115047455A (en) Lightweight SAR image ship target detection method
CN115937552A (en) Image matching method based on fusion of manual features and depth features
CN116645592A (en) Crack detection method based on image processing and storage medium
CN110851627B (en) Method for describing sun black subgroup in full-sun image
CN115393635A (en) Infrared small target detection method based on super-pixel segmentation and data enhancement
Laupheimer et al. The importance of radiometric feature quality for semantic mesh segmentation
CN114820668A (en) End-to-end building regular outline automatic extraction method based on concentric ring convolution
CN114708434A (en) Cross-domain remote sensing image semantic segmentation method based on adaptation and self-training in iterative domain
CN113657225B (en) Target detection method
CN112418344A (en) Training method, target detection method, medium and electronic device
CN110334581A (en) A kind of multi-source Remote Sensing Images change detecting method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 201203 No. 6, Lane 55, Chuanhe Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant after: Winner Technology Co.,Ltd.

Address before: 201505 Room 216, 333 Tingfeng Highway, Tinglin Town, Jinshan District, Shanghai

Applicant before: Winner Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant