CN111241947A - Training method and device of target detection model, storage medium and computer equipment


Info

Publication number
CN111241947A
Authority
CN
China
Prior art keywords
sample image
target
detection frame
prediction
position information
Prior art date
Legal status
Granted
Application number
CN201911422532.3A
Other languages
Chinese (zh)
Other versions
CN111241947B (en)
Inventor
岑俊毅
李立赛
傅东生
Current Assignee
Miracle Intelligent Network Co., Ltd.
Original Assignee
Miracle Intelligent Network Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Miracle Intelligent Network Co., Ltd.
Priority to CN201911422532.3A
Publication of CN111241947A
Application granted
Publication of CN111241947B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a training method and device for a target detection model, a computer-readable storage medium and a computer device. The method comprises: acquiring a feature map of a sample image during training, and determining initial detection frames in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio; adjusting the position of each initial detection frame to obtain the position information of a prediction detection frame, and adjusting the network parameters of the regression network according to the position information and the real position information in the annotation information of the sample image; predicting the probability that the target in each target detection area determined according to the position information of the prediction detection frames corresponds to each preset category; and, after the network parameters of the classification network are adjusted according to the real category information in the annotation information of the sample image and the prediction probability, obtaining a target detection model for performing target detection on an image. The scheme provided by the application enables the target detection model to identify the rotation angle of a target in an image, so that the located target detection frame is more accurate.

Description

Training method and device of target detection model, storage medium and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for training a target detection model, a computer-readable storage medium, and a computer device.
Background
Target detection, also called target extraction, is an image segmentation technique in the field of computer vision: it not only segments a target from an image, that is, locates the target's position, but also identifies the target's category.
When the position regression network of a target detection model is trained, a sliding window method is usually adopted to traverse regions in an image, and the regions are then screened to obtain candidate rectangular regions for target detection. However, the inventors have realized that these candidate rectangular regions are usually horizontal rectangles. Such candidate regions can accurately and effectively locate targets that are placed horizontally and regularly in the image, but when a rotated target or an irregularly shaped target is present, the target detection frame determined from these candidate regions is not accurate enough. For example, when a long, thin object (such as a pencil) appears in the image at an angle to the horizontal direction, a horizontal rectangular frame used for labeling may enclose a background area far larger than the area of the target itself, so that target localization is not accurate enough and the target recognition rate is low.
Disclosure of Invention
Therefore, it is necessary to provide a training method and apparatus for a target detection model, a computer-readable storage medium and a computer device, to solve the technical problem that existing target detection models locate rotated or irregularly shaped targets in an image inaccurately and with a low recognition rate.
A method of training an object detection model, comprising:
acquiring a sample image and annotation information, wherein the annotation information comprises real position information and real category information of a target in the sample image, and the real position information comprises a rotation angle of a rectangular surrounding frame corresponding to the target;
obtaining a feature map of the sample image through a feature extraction network of an initial model;
determining, through a region generation network of the initial model, an initial detection frame in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio;
adjusting the position of each initial detection frame through a regression network of an initial model to obtain the position information of a prediction detection frame, and adjusting the network parameters of the regression network according to the real position information in the labeling information and the position information of the prediction detection frame;
predicting, through a classification network of the initial model, the prediction probability that the target in the target detection area determined according to the position information of each prediction detection frame corresponds to each preset category;
and after network parameters of the classification network are adjusted according to the real category information and the prediction probability in the labeling information, a target detection model for carrying out target detection on the image is obtained.
An apparatus for training an object detection model, the apparatus comprising:
a sample image acquisition module, configured to acquire a sample image and annotation information, wherein the annotation information comprises real position information and real category information of a target in the sample image, and the real position information comprises a rotation angle of a rectangular surrounding frame corresponding to the target;
a feature map acquisition module, configured to obtain a feature map of the sample image through a feature extraction network of an initial model;
an initial detection frame generation module, configured to determine, through a region generation network of the initial model, an initial detection frame in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio;
a position regression module, configured to adjust the position of each initial detection frame through a regression network of the initial model to obtain the position information of a prediction detection frame, and adjust the network parameters of the regression network according to the real position information in the annotation information and the position information of the prediction detection frame;
and a classification module, configured to predict, through a classification network of the initial model, the prediction probability that the target in the target detection area determined according to the position information of each prediction detection frame corresponds to each preset category, and, after the network parameters of the classification network are adjusted according to the real category information in the annotation information and the prediction probability, obtain a target detection model for performing target detection on an image.
In one embodiment, the acquiring the sample image comprises: obtaining an original sample image; judging whether the aspect ratio of the original sample image is 1; if so, scaling the original sample image to a preset size in an equal proportion to obtain a sample image; if not, the original sample image is subjected to equal-scale scaling and then image pixels are supplemented, and a sample image with a preset size is obtained.
In one embodiment, the acquiring the sample image comprises: obtaining an original sample image; performing rotation processing on the original sample image according to a preset angle to obtain a sample image, and obtaining real labeling information of the sample image according to the rotation angle of a rectangular surrounding frame in the original sample image and the preset angle; or, performing vertical mirror image processing on the original sample image to obtain a sample image, and obtaining real annotation information of the sample image according to the rotation angle of the rectangular surrounding frame in the original sample image; or carrying out horizontal mirror image processing on the original sample image to obtain a sample image, and obtaining the real annotation information of the sample image according to the rotation angle of the rectangular surrounding frame in the original sample image.
In one embodiment, the step of determining the target aspect ratio comprises: obtaining sample images and the width and height information of the rectangular surrounding frame corresponding to the target in each sample image; counting the aspect ratio of each rectangular surrounding frame according to the width and height information; and clustering the counted aspect ratios to obtain the target aspect ratio from the clustering result.
In one embodiment, the adjusting the position of each of the initial detection frames to obtain the position information of the prediction detection frame includes: calculating the position offset of each initial detection frame according to the current network parameters of the regression network; obtaining the position information of a prediction detection frame according to the initial detection frame and the position offset; the position information includes coordinates of a geometric center point of the prediction detection frame, a width and a height of the prediction detection frame, and a rotation angle of the prediction detection frame.
In one embodiment, the method further comprises: determining the prediction detection frame according to the position information of the prediction detection frame; determining a rectangular surrounding frame corresponding to a target in the sample image according to the real position information; calculating the intersection ratio between the prediction detection frame and the rectangular surrounding frame; calculating a rotation angle difference between the prediction detection frame and the rectangular enclosure frame; when the intersection ratio is larger than a first threshold value and the rotation angle difference is smaller than a second threshold value, marking the sample image as a positive sample image; when the intersection ratio is smaller than a third threshold value or the rotation angle difference is larger than a second threshold value, the sample image is marked as a negative sample image.
In one embodiment, predicting the prediction probability that the target in the target detection area determined according to the position information of each prediction detection frame corresponds to each preset category includes: determining the target detection areas on the feature map according to the position information of each prediction detection frame; adjusting the target detection areas to the same preset scale and then acquiring the feature vector corresponding to each target detection area; and determining, according to the feature vectors, the prediction probability that each target detection area corresponds to each preset category.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the method of training an object detection model as described above.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the above-mentioned method of training an object detection model.
According to the method and the device for training the target detection model, on one hand, when the target detection model is trained, the labeling information of the sample image comprises the real position information and the real category information, and the real position information comprises the rotation angle, so that the trained target detection model can have the capability of identifying the rotation angle of the target in the image, and the positioned target detection frame is more accurate. On the other hand, in the process of training the target detection model, the rotation angle, the scale and the target aspect ratio which are used for generating the initial detection frame in the area generation network are initialized, the generation mode of the initial detection frame is enriched, the trained target detection model is more stable, and the generated initial detection frame is closer to the real target detection frame due to the fact that the initial detection frame is determined according to the preset rotation angle. Therefore, after the position of the initial detection frame is adjusted through the regression network, the prediction detection frame is obtained, the target detection area on the feature map is obtained according to the prediction detection frame, the network parameters of the regression network are adjusted according to the real position information in the labeling information and the position information of the prediction detection frame, and after the class probability of the target detection area is predicted through the classification network, the network parameters of the classification network can be adjusted according to the real class information and the prediction probability in the labeling information, so that the target detection model which can carry out target detection on the rotating target in the image and can position the target more accurately is obtained.
Drawings
FIG. 1 is a diagram of an exemplary environment in which a method for training a target detection model may be implemented;
FIG. 2 is a schematic flow chart diagram illustrating a method for training a target detection model according to one embodiment;
FIG. 3 is a schematic diagram of labeling a sample image in one embodiment;
FIG. 4 is a schematic flow chart illustrating labeling of a sample image according to an embodiment;
FIG. 5 is a diagram illustrating an embodiment of enhancing an original sample image to obtain a sample image;
FIG. 6 is a diagram illustrating an initial detection box determined from a feature map in one embodiment;
FIG. 7 is a schematic flow chart diagram illustrating a method for training a target detection model in an exemplary embodiment;
FIG. 8 is a block diagram showing the structure of a training apparatus for an object detection model according to an embodiment;
FIG. 9 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
FIG. 1 is a diagram of an exemplary implementation environment of a method for training a target detection model. Referring to FIG. 1, the training method of the target detection model is applied to a training system of the target detection model. The training system may include the terminal 110 and the server 120. The terminal 110 and the server 120 may be connected via a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
Specifically, the terminal 110 may take a sample image and communicate the sample image to the server 120. After obtaining the sample image, the server 120 trains the initial model by using the sample image to obtain a target detection model for performing target detection on the image.
In one embodiment, the server 120 may obtain the sample image and the annotation information, where the annotation information includes the real position information and real category information of the target in the sample image, and the real position information includes the rotation angle of the rectangular bounding box corresponding to the target; obtain the feature map of the sample image through the feature extraction network of the initial model; determine, through the region generation network of the initial model, initial detection frames in the feature map according to the preset rotation angle, preset scale and preset target aspect ratio; adjust the position of each initial detection frame through the regression network of the initial model to obtain the position information of the prediction detection frames, and adjust the network parameters of the regression network according to the real position information in the annotation information and the position information of the prediction detection frames; predict, through the classification network of the initial model, the prediction probability that the target in each target detection area determined according to the position information of the prediction detection frames corresponds to each preset category; and, after the network parameters of the classification network are adjusted according to the real category information in the annotation information and the prediction probability, obtain a target detection model for performing target detection on images.
In one embodiment, as shown in FIG. 2, a method of training an object detection model is provided. The method is described as applied to a computer device (such as the terminal 110 or the server 120 in fig. 1) as an example. The method may include the following steps S202 to S212.
S202, a sample image and annotation information are obtained, wherein the annotation information comprises real position information and real category information of a target in the sample image, and the real position information comprises a rotation angle of a rectangular surrounding frame corresponding to the target.
The sample image is an image used for training the initial model, and the model obtained through training on sample images has the capability of performing target detection on images. Target detection requires not only segmenting the target from the image, i.e., locating the target position, but also identifying the target class. The class information of the target in the sample image may be one or more of a plurality of preset classification classes, and the preset classification classes may be set in advance according to actual application requirements, such as human faces, vehicles, animals, and the like. The position information of the target in the sample image may be represented by the position information of a rectangular enclosure frame surrounding the target, such as the x-coordinate and y-coordinate of the geometric center point of the rectangular enclosure frame, the width w of the rectangular enclosure frame, and the height h of the rectangular enclosure frame; the geometric center point of the rectangular enclosure frame does not change after the frame is rotated around the geometric center point. In addition, in the embodiments provided in the present application, the position information further includes a rotation angle θ of the rectangular bounding box corresponding to the target, that is, the annotation information of the sample image can be represented by a set of data including x, y, w, h, and θ. The rotation angle θ may be the offset angle of the rectangular bounding box relative to its horizontal placement, for example, the included angle between a long side of the rectangular bounding box and the positive direction of the x-axis of the sample image, the included angle between a long side and the positive direction of the y-axis, the included angle between a short side and the positive direction of the x-axis, or the included angle between a short side and the positive direction of the y-axis. The value of the rotation angle may be any value between 0 degrees and 360 degrees. It can be understood that, since the annotation information of the sample image includes the rotation angle, the model obtained by training on the annotated sample images also has the capability of identifying the rotation angle of the target in an image, so that the target can be located more accurately according to the rotation angle.
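For concreteness, the annotation described above can be pictured as a small data structure. The following Python sketch is illustrative only (the names RotatedBox and Annotation and the example values are not from the patent); it simply records the (x, y, w, h, θ) position information together with the category information:

```python
from dataclasses import dataclass

@dataclass
class RotatedBox:
    x: float      # x-coordinate of the geometric center point
    y: float      # y-coordinate of the geometric center point
    w: float      # width of the rectangular bounding box
    h: float      # height of the rectangular bounding box
    theta: float  # rotation angle, any value in [0, 360) degrees

@dataclass
class Annotation:
    box: RotatedBox   # real position information
    category: str     # real category information

# A hypothetical pencil lying at 45 degrees to the x-axis of the image.
sample_label = Annotation(RotatedBox(x=240.0, y=180.0, w=300.0, h=24.0, theta=45.0),
                          category="pencil")
```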
Fig. 3 is a schematic diagram illustrating labeling of a sample image in an embodiment. Referring to FIG. 3, the target in the sample image is a pencil. The left side of FIG. 3 illustrates labeling of the target in the conventional technology: the rectangular bounding box is a horizontal rectangle that includes a large amount of background information, even more background information than target information, which may reduce the recognition rate and make target localization inaccurate. The right side of FIG. 3 illustrates labeling of the target according to an embodiment of the present application: the rectangular bounding box is a rectangle with a rotation angle, which describes the position of the target in the image more accurately.
The initial model may be a machine learning model that may learn from the sample images to provide the ability to identify the images. In embodiments provided herein, a computer device may learn, from a sample image, the ability to perform object detection on the image. In one embodiment, the computer device may set a model structure of the machine learning model in advance to obtain an initial model, and train the initial model through the sample image to obtain model parameters of the machine learning model. When the image needs to be subjected to target detection, the computer equipment can obtain model parameters obtained by training in advance, and then the model parameters are imported into the initial model to obtain a target detection model with the capability of performing target detection on the image.
In one embodiment, before labeling the sample image, the training sample may be expanded, and obtaining the sample image includes: obtaining an original sample image; judging whether the aspect ratio of the original sample image is 1; if so, scaling the original sample image to a preset size in an equal proportion to obtain a sample image; if not, the original sample image is subjected to equal-scale scaling and then image pixels are supplemented, and a sample image with a preset size is obtained.
Because the input images of the feature extraction network need to have the same size, and the input and output of the whole network are fixed, the sample images need to be preprocessed first. Specifically, it is determined whether the width and the height of the original sample image are the same; if so, the original sample image is scaled to a preset size, for example S × S. If the width and the height of the original sample image are different, then when the width is larger than the height, the width of the original sample image is first scaled to the preset size S, the height is then scaled to S' according to the aspect ratio of the original sample image, and pixels are supplemented to the upper area or the lower area of the image to make its height S, so that a sample image of size S × S is obtained; when the height is larger than the width, the height of the original sample image is first scaled to the preset size S, the width is then scaled to S' according to the aspect ratio, and pixels are supplemented to the left area or the right area of the image to make its width S, so that a sample image of size S × S is obtained. The proportional scaling ensures that the target in the sample image is not deformed; when an image, after scaling, is not long enough or wide enough to satisfy the requirement of a uniform sample image size, its length or width is supplemented with pixels.
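A minimal sketch of this preprocessing follows, assuming the Pillow library and padding toward the lower or right area (the function name and the default S are illustrative, not from the patent):

```python
from PIL import Image

def preprocess(img: Image.Image, S: int = 512) -> Image.Image:
    """Proportionally scale an original sample image and, if needed, pad it
    with pixels so that the result is exactly S x S."""
    w, h = img.size
    if w == h:
        return img.resize((S, S))          # aspect ratio 1: scale directly
    canvas = Image.new("RGB", (S, S))      # supplemented pixels default to black
    if w > h:
        s_prime = round(h * S / w)         # height after proportional scaling
        canvas.paste(img.resize((S, s_prime)), (0, 0))   # pad the lower area
    else:
        s_prime = round(w * S / h)         # width after proportional scaling
        canvas.paste(img.resize((s_prime, S)), (0, 0))   # pad the right area
    return canvas
```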
Fig. 4 is a schematic flow chart illustrating labeling of a sample image in an embodiment. Referring to fig. 4, the method includes the following steps:
S402, obtaining an original sample image;
S404, judging whether the aspect ratio of the original sample image is 1; if yes, go to step S406; if not, go to step S408;
S406, proportionally scaling the original sample image to the preset size S × S;
S408, judging whether the width of the original sample image is larger than the height; if yes, go to step S410a; if not, go to step S410b;
S410a, scaling the width of the original sample image to the preset size S, and then scaling the height of the sample image to S' according to the aspect ratio of the original sample image;
S412a, supplementing pixels to the upper area or the lower area of the original sample image to make its height S, obtaining a sample image of S × S;
S410b, scaling the height of the original sample image to the preset size S, and then scaling the width of the original sample image to S' according to the aspect ratio of the original sample image;
S412b, supplementing pixels to the left area or the right area of the original sample image to make its width S, obtaining a sample image of S × S;
S414, labeling the adjusted sample image.
In one embodiment, the method further includes the step of obtaining the target aspect ratio by counting the aspect ratio of the rectangular bounding box labeled in the sample image: obtaining sample images and width and height information of rectangular surrounding frames corresponding to targets in the sample images; counting the width-to-height ratio of each rectangular surrounding frame according to the width-to-height information; and clustering the counted aspect ratio to obtain the target aspect ratio in the clustering result.
Specifically, after the sample images are labeled, the computer device may obtain the width and height information of the rectangular bounding boxes labeled in the sample images, count the aspect ratios, and cluster the counted aspect ratios using a clustering algorithm. The number of clusters may be set as needed; for example, the K-means algorithm may be used to cluster the ratios into 3 classes, yielding 3 aspect ratio values, so that the obtained target aspect ratios are w1:h1, w2:h2 and w3:h3, for example 1:2, 1:3 and 1:4.
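A minimal 1-D K-means sketch of this statistic, assuming NumPy (the function name and its defaults are illustrative; the patent does not prescribe a particular implementation):

```python
import numpy as np

def cluster_aspect_ratios(wh_pairs, k=3, iters=100, seed=0):
    """Cluster the labeled boxes' aspect ratios (w / h) into k classes and
    return the cluster centers as the preset target aspect ratios."""
    ratios = np.array([w / h for w, h in wh_pairs], dtype=np.float64)
    rng = np.random.default_rng(seed)
    centers = rng.choice(ratios, size=k, replace=False)
    for _ in range(iters):
        # assign every ratio to its nearest center, then recompute the centers
        assign = np.argmin(np.abs(ratios[:, None] - centers[None, :]), axis=1)
        new = np.array([ratios[assign == j].mean() if np.any(assign == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return sorted(centers)

# The returned centers play the role of w1:h1, w2:h2 and w3:h3 above.
```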
It should be noted that the target aspect ratios obtained by the computer device are used to determine the initial detection frames from the feature map of the sample image in step S206; the more target aspect ratios there are, the more initial detection frames are determined.
In one embodiment, the method further comprises the step of enhancing the sample image, that is, the step of obtaining the sample image comprises: obtaining an original sample image; carrying out rotation processing on the original sample image according to a preset angle to obtain a sample image, and obtaining real labeling information of the sample image according to the rotation angle of a rectangular surrounding frame in the original sample image and the preset angle; or, carrying out vertical mirror image processing on the original sample image to obtain a sample image, and obtaining real annotation information of the sample image according to the rotation angle of the rectangular surrounding frame in the original sample image; or carrying out horizontal mirror image processing on the original sample image to obtain a sample image, and obtaining the real annotation information of the sample image according to the rotation angle of the rectangular surrounding frame in the original sample image.
Specifically, the computer device may rotate the original sample image by a preset angle, where the preset angle may be, for example, 30° or 60°. The computer device may also perform vertical mirroring or horizontal mirroring on the original sample image, and may further perform vertical or horizontal mirroring on a sample image that has already been rotated by the preset angle, to obtain a new sample image. The rotation angle in the annotation information of the processed sample image needs to be modified correspondingly, so that the obtained new sample image can be added to the training sample library for training the initial model.
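The angle bookkeeping can be sketched as follows, under the assumption that θ is measured counterclockwise from the positive x-axis. The patent allows several angle conventions, so the mirror formulas below are one plausible choice rather than the definitive ones, and the corresponding update of the box center coordinates is omitted:

```python
def rotate_label(theta: float, preset_angle: float) -> float:
    # Rotating the whole image by preset_angle rotates each box by the same amount.
    return (theta + preset_angle) % 360.0

def vertical_mirror_label(theta: float) -> float:
    # A top-to-bottom flip reflects the angle about the x-axis.
    return (-theta) % 360.0

def horizontal_mirror_label(theta: float) -> float:
    # A left-to-right flip reflects the angle about the y-axis.
    return (180.0 - theta) % 360.0
```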
Fig. 5 is a schematic diagram illustrating an original sample image being subjected to enhancement processing to obtain sample images in one embodiment. Referring to FIG. 5, enhanced sample images may be obtained by rotating the original sample image, mirroring it, and further mirroring the rotated picture.
In the embodiment, the sample images are subjected to image enhancement processing, so that the richness of the sample images can be improved, and the sample images are adopted to train the initial model, so that a more accurate and more stable target detection model can be obtained.
And S204, obtaining a characteristic diagram of the sample image through a characteristic extraction network of the initial model.
The feature map may be used to reflect the characteristics of the sample image. According to the characteristics of the sample image, the target in the sample image can be located, and the class to which the target belongs can be determined. The initial model comprises a feature extraction network, a region generation network, a regression network and a classification network. In the process of training the initial model, the computer device may input the sample image into the feature extraction network of the initial model and extract the image features of the sample image through the feature extraction network to obtain the feature map. The network parameters of the feature extraction network can be determined by training in advance and are kept unchanged during training. The feature extraction network may be, for example, a convolutional neural network. In addition, the initial model can be built based on the network architecture of Faster R-CNN.
And S206, determining, through the region generation network of the initial model, initial detection frames in the feature map according to the preset rotation angles, preset scales and preset target aspect ratios.
The region generation network extracts initial detection frames with rotation angles from the feature map. Specifically, for each position point belonging to the foreground on the feature map, corresponding initial detection frames are generated according to the preset rotation angles, preset scales and preset target aspect ratios. It can be understood that if there are m preset rotation angles, n preset scales and k target aspect ratios, then m × n × k initial detection frames can be generated for each position point by combination.
For example, the preset rotation angles include 8 rotation angles, which are {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°}. The preset scales include 3 scales, which are 128 × 128, 256 × 256 and 512 × 512, respectively. The target aspect ratios are determined by counting the aspect ratios of the rectangular bounding boxes labeled in the sample images; because the target aspect ratios are obtained by analyzing the real aspect ratios of the targets in the sample images, the aspect ratio of an initial detection frame determined according to a target aspect ratio fits the real aspect ratio of the target better, which can accelerate the convergence of the detection network and improve its accuracy. For example, there may be 3 target aspect ratios, {w1:h1, w2:h2, w3:h3}, so that 8 × 3 × 3 = 72 different initial detection frames may be generated for each location point.
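A sketch of this combinatorial generation follows. It assumes, as one common convention, that each preset scale fixes the anchor's area while the target aspect ratio fixes its shape; the names and that convention are illustrative, not from the patent:

```python
from itertools import product

PRESET_ANGLES = [0, 45, 90, 135, 180, 225, 270, 315]   # m = 8
PRESET_SCALES = [128, 256, 512]                        # n = 3
TARGET_RATIOS = [(1, 2), (1, 3), (1, 4)]               # k = 3, from clustering

def initial_boxes_at(x0: float, y0: float):
    """Generate the m * n * k = 72 initial detection frames for one
    foreground position point, as (x, y, w, h, theta) tuples."""
    boxes = []
    for theta, s, (rw, rh) in product(PRESET_ANGLES, PRESET_SCALES, TARGET_RATIOS):
        w = s * (rw / rh) ** 0.5   # keep area s*s while imposing ratio rw:rh
        h = s * (rh / rw) ** 0.5
        boxes.append((x0, y0, w, h, float(theta)))
    return boxes

assert len(initial_boxes_at(0.0, 0.0)) == 8 * 3 * 3    # 72 frames per point
```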
FIG. 6 is a diagram illustrating initial detection boxes determined from a feature map in one embodiment. Referring to FIG. 6, the size of the feature map is S × S. For a certain position point (X0, Y0) on the feature map, the region generation network extracts 72 initial detection boxes, of which FIG. 6 shows only 6; the rotation angle, scale and aspect ratio corresponding to these 6 initial detection boxes are:
0°, 128 × 128, 1;
45°, 128 × 128, 1;
90°, 128 × 128, 1;
45°, 256 × 128, 2;
90°, 256 × 256, 1;
45°, 256 × 512, 1/2.
In one embodiment, for an input sample image, the pixel points belonging to the foreground are obtained through a classification function in the region generation network, the position points on the feature map corresponding to those foreground pixel points are thereby determined, and initial detection frames are generated for each determined position point.
S208, adjusting the position of each initial detection frame through the regression network of the initial model to obtain the position information of the prediction detection frame, and adjusting the network parameters of the regression network according to the real position information in the labeling information and the position information of the prediction detection frame.
The regression network is used for adjusting the position of the generated initial detection frame according to the current network parameters to obtain the position information of the adjusted prediction detection frame. The position information of the prediction detection frame also includes the coordinates of the geometric center point of the prediction detection frame, the width and height of the prediction detection frame, and the rotation angle. The initial detection frame generally cannot accurately position the target in the sample image, the position information of the initial detection frame is adjusted through the current network parameters in the regression network, and the obtained prediction detection frame is closer to the target detection frame.
In one embodiment, adjusting the position of each initial detection frame to obtain the position information of the prediction detection frame includes: calculating the position offset of each initial detection frame according to the current network parameters of the regression network; obtaining the position information of a prediction detection frame according to the initial detection frame and the position offset; the position information includes coordinates of a geometric center point of the prediction detection frame, a width and a height of the prediction detection frame, and a rotation angle of the prediction detection frame.
Following the above example, the region generation network generates 72 initial detection frames for each foreground position point, and the position information of each prediction detection frame obtained after the regression network adjusts the position of an initial detection frame consists of five values: the coordinates x and y of the geometric center point, the width w and height h of the prediction detection frame, and the rotation angle θ. The regression network therefore has 72 × 5 = 360 output values for each point on the feature map.
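The patent does not spell out how the five offsets are applied; the sketch below follows the common Faster R-CNN box parametrization, extended with an additive angle term, and should be read as an assumption rather than the patented formula:

```python
import math

def apply_offsets(anchor, deltas):
    """Decode a prediction detection frame from an initial detection frame
    plus the regression network's offsets (dx, dy, dw, dh, dtheta)."""
    xa, ya, wa, ha, ta = anchor
    dx, dy, dw, dh, dt = deltas
    x = xa + dx * wa              # shift the center, scaled by the anchor size
    y = ya + dy * ha
    w = wa * math.exp(dw)         # rescale the width and height
    h = ha * math.exp(dh)
    theta = (ta + dt) % 360.0     # adjust the rotation angle
    return (x, y, w, h, theta)
```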
In an embodiment, the training method of the target detection model further includes: determining a prediction detection frame according to the position information of the prediction detection frame; determining a rectangular surrounding frame corresponding to the target in the sample image according to the real position information; calculating the intersection and parallel ratio between the prediction detection frame and the rectangular surrounding frame; calculating the rotation angle difference between the prediction detection frame and the rectangular surrounding frame; when the intersection ratio is larger than a first threshold value and the rotation angle difference is smaller than a second threshold value, marking the sample image as a positive sample image; and when the intersection ratio is smaller than a third threshold value or the rotation angle difference is larger than a second threshold value, marking the sample image as a negative sample image.
The intersection ratio (intersection over union) refers to the ratio of the overlapping area of the prediction detection frame and the real rectangular bounding box to their merged area. The overlapping area may be represented by the number of position points included in the overlapping region of the prediction detection frame and the real rectangular bounding box, and similarly, the merged area may be represented by the number of position points included in the merged region of the two. As mentioned above, the regression network outputs the position information of the prediction detection frame, which includes its rotation angle, so the difference between the rotation angle of the prediction detection frame and that of the real rectangular bounding box can be determined. The intersection ratio and the rotation angle difference reflect the accuracy of the prediction detection frame to a certain extent: the larger the intersection ratio, the higher the overlap between the two frames; the smaller the rotation angle difference, the closer the orientation of the prediction detection frame is to that of the real bounding box. If the intersection ratio between the prediction detection frame and the real rectangular bounding box is greater than the first threshold and their rotation angle difference is smaller than the second threshold, the prediction detection frame is close to the real rectangular bounding box, and the sample image can be marked as a positive sample image. When the intersection ratio is less than the third threshold or the rotation angle difference is greater than the second threshold, the sample image is marked as a negative sample image. The first threshold may be, for example, 0.7, the second threshold 22.5°, and the third threshold 0.3. If the intersection ratio or rotation angle difference between the prediction detection frame and the real rectangular bounding box satisfies any other condition, the sample image belongs to neither the positive nor the negative sample images and is not used for training.
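The intersection ratio of two rotated rectangles can be computed from their corner polygons. The sketch below uses the shapely geometry library for the polygon intersection and the thresholds quoted above (0.7, 22.5°, 0.3); the corner construction assumes θ is measured counterclockwise about the box center:

```python
import math
from shapely.geometry import Polygon

def corners(box):
    """Corner points of a rotated rectangle given as (x, y, w, h, theta)."""
    x, y, w, h, t = box
    c, s = math.cos(math.radians(t)), math.sin(math.radians(t))
    local = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + c * px - s * py, y + s * px + c * py) for px, py in local]

def label_sample(pred, gt, iou_hi=0.7, angle_max=22.5, iou_lo=0.3):
    """Mark a sample 'positive', 'negative' or None (unused for training)."""
    p, g = Polygon(corners(pred)), Polygon(corners(gt))
    iou = p.intersection(g).area / p.union(g).area
    d = abs(pred[4] - gt[4]) % 360.0
    angle_diff = min(d, 360.0 - d)        # smallest angular distance
    if iou > iou_hi and angle_diff < angle_max:
        return "positive"
    if iou < iou_lo or angle_diff > angle_max:
        return "negative"
    return None
```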
In this embodiment, since the number of foreground position points is large, the number of initial detection frames determined according to each foreground position point is large, the number of prediction detection frames obtained by regression is large, and in order to reduce the data amount in the training process, the sample images may be screened according to the above method, and the model is trained only by using the screened sample images.
In one embodiment, after obtaining the prediction detection boxes, the computer device may further filter all of them according to their degree of mutual overlap in order to reduce the calculation amount of the training process. The computer device may also cull prediction detection boxes that exceed the image boundaries.
Further, after obtaining the position information of the prediction detection frame, the computer device also obtains the position information of the target in the sample image, and the computer device may adjust the network parameter of the regression network according to a difference between the actual position information of the target in the annotation information and the position information of the prediction detection frame.
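The patent only states that the regression parameters are adjusted according to this difference; a common concrete choice, shown here as an assumption, is a smooth-L1 loss over the five box parameters (PyTorch):

```python
import torch
import torch.nn.functional as F

def regression_loss(pred_boxes: torch.Tensor, true_boxes: torch.Tensor) -> torch.Tensor:
    """Smooth-L1 loss between predicted and real (x, y, w, h, theta) rows;
    minimizing it adjusts the regression network's parameters."""
    return F.smooth_l1_loss(pred_boxes, true_boxes)

# e.g. regression_loss(torch.randn(8, 5, requires_grad=True), torch.randn(8, 5))
```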
S210, predicting, through the classification network of the initial model, the prediction probability that the target in each target detection area determined according to the position information of the prediction detection frames corresponds to each preset category.
Specifically, the input of the classification network includes a feature map and the determined position information of the prediction detection frame, and the computer device may determine a target detection region from the feature map according to the position information of the prediction detection frame, and predict the category of the sample image based on the target detection region.
In one embodiment, predicting the prediction probability that the target in the target detection area determined according to the position information of each prediction detection frame corresponds to each preset category includes: determining the target detection areas on the feature map according to the position information of each prediction detection frame; adjusting the target detection areas to the same preset scale and then obtaining the feature vector corresponding to each target detection area; and determining, according to the feature vectors, the prediction probability that each target detection area corresponds to each preset category.
Specifically, the computer device may cut target detection regions of different sizes from the feature map according to the position information of each prediction detection frame, adjust each target detection region to the same preset scale through ROI Pooling (Region of Interest Pooling), obtain the feature vector corresponding to each target detection region, and determine, through the fully connected layer and the normalization layer, the probability vector of each target detection region belonging to each preset category, thereby obtaining the prediction probability corresponding to each preset category.
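A simplified PyTorch sketch of this classification stage is shown below. It assumes the rotated target detection regions have already been cropped from the feature map as axis-aligned tensors (a production model would use a rotated ROI pooling operator), and all layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Pool each target detection region to the same preset scale, flatten it
    into a feature vector, and output per-category probabilities."""
    def __init__(self, channels: int = 256, pooled: int = 7, num_classes: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(pooled)          # same preset scale
        self.fc = nn.Linear(channels * pooled * pooled, num_classes)

    def forward(self, regions):                           # list of (C, h_i, w_i)
        vecs = torch.stack([self.pool(r).flatten() for r in regions])
        return torch.softmax(self.fc(vecs), dim=1)        # prediction probabilities
```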
S212, after network parameters of the classification network are adjusted according to the real category information and the prediction probability in the labeling information, a target detection model for carrying out target detection on the image is obtained.
Finally, after the class probabilities of the preset categories corresponding to the prediction detection frames in the sample image are determined, a loss function of the classification network is constructed from the class probabilities and the real class information of the target in the sample image, and the network parameters of the classification network are adjusted in the direction that minimizes the loss function. The computer device may repeat the above steps S202 to S212 on the current model over all sample images until a target detection model capable of performing target detection on images is obtained.
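One training update of the classification network might look like the following sketch; a negative log-likelihood loss over the predicted probabilities is assumed, since the patent does not name the loss function:

```python
import torch
import torch.nn.functional as F

def classification_step(head, optimizer, regions, true_labels):
    """Build the loss from the prediction probabilities and the real category
    labels, then adjust the network parameters in the direction that
    decreases (minimizes) the loss."""
    probs = head(regions)                            # (N, num_classes)
    loss = F.nll_loss(torch.log(probs + 1e-9), true_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```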
According to the training method of the target detection model, on one hand, when the target detection model is trained, the labeling information of the sample image comprises real position information and real category information, and the real position information comprises the rotation angle, so that the trained target detection model can have the capability of identifying the rotation angle of the target in the image, and the positioned target detection frame is more accurate. On the other hand, in the process of training the target detection model, the rotation angle, the scale and the target aspect ratio which are used for generating the initial detection frame in the area generation network are initialized, the generation mode of the initial detection frame is enriched, the trained target detection model is more stable, and the generated initial detection frame is closer to the real target detection frame due to the fact that the initial detection frame is determined according to the preset rotation angle. Therefore, after the position of the initial detection frame is adjusted through the regression network, the prediction detection frame is obtained, the target detection area on the feature map is obtained according to the prediction detection frame, the network parameters of the regression network are adjusted according to the real position information in the labeling information and the position information of the prediction detection frame, and after the class probability of the target detection area is predicted through the classification network, the network parameters of the classification network can be adjusted according to the real class information and the prediction probability in the labeling information, so that the target detection model which can carry out target detection on the rotating target in the image and can position the target more accurately is obtained.
In a specific embodiment, as shown in fig. 7, the method for training the target detection model includes the following steps:
s702, acquiring an original sample image.
S704, the original sample images and the width and height information of the rectangular surrounding frame corresponding to the target in each original sample image are obtained.
And S706, counting the aspect ratio of each rectangular surrounding frame according to the width and height information.
S708, clustering the counted aspect ratio to obtain the target aspect ratio in the clustering result.
S710, judging whether the aspect ratio of the original sample image is 1; if so, scaling the original sample image to a preset size in an equal proportion to obtain a sample image; if not, the original sample image is subjected to equal-scale scaling and then image pixels are supplemented, and a sample image with a preset size is obtained.
And S712, performing rotation processing on the sample image according to a preset angle to obtain a newly added sample image.
And S714, performing vertical mirror image processing on the sample image to obtain a newly added sample image.
And S716, performing horizontal mirror image processing on the sample image to obtain a newly added sample image.
And S718, obtaining the real labeling information of the newly added sample image according to the rotation angle of the rectangular surrounding frame in the sample image.
And S720, obtaining a characteristic diagram of the sample image through the characteristic extraction network of the initial model.
And S722, determining, through the region generation network of the initial model, initial detection frames in the feature map according to the preset rotation angles, preset scales and preset target aspect ratios.
S724, calculating the position offset of each initial detection frame according to the current network parameters of the regression network through the regression network of the initial model; obtaining the position information of a prediction detection frame according to the initial detection frame and the position offset; the position information includes coordinates of a geometric center point of the prediction detection frame, a width and a height of the prediction detection frame, and a rotation angle of the prediction detection frame.
And S726, adjusting the network parameters of the regression network according to the real position information in the labeling information and the position information of the prediction detection frame.
S728, the prediction detection frame is determined based on the position information of the prediction detection frame.
And S730, determining a rectangular surrounding frame corresponding to the target in the sample image according to the real position information.
S732, calculating the intersection ratio and the rotation angle difference between the prediction detection frame and the rectangular bounding frame.
S734, when the intersection ratio is greater than the first threshold and the rotation angle difference is less than the second threshold, the sample image is marked as a positive sample image.
And S736, when the intersection ratio is smaller than a third threshold value or the rotation angle difference is larger than a second threshold value, marking the sample image as a negative sample image.
S738, determining a target detection area on the feature map according to the position information of each prediction detection frame through the classification network of the initial model.
And S740, adjusting the target detection areas to the same preset scale, and then obtaining the feature vectors corresponding to the target detection areas.
And S742, determining the prediction probability of each preset category corresponding to the target detection area according to the feature vector.
S744, after network parameters of the classification network are adjusted according to the real category information and the prediction probability in the labeling information, a target detection model for carrying out target detection on the image is obtained.
FIG. 7 is a flowchart illustrating a method for training a target detection model according to an embodiment. It should be understood that although the steps in the flowchart of FIG. 7 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least a portion of the steps in FIG. 7 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and the order of their performance is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, a training apparatus 800 for an object detection model is provided, the apparatus including a sample image acquisition module 802, a feature map acquisition module 804, an initial detection frame generation module 806, a position regression module 808, and a classification module 810, wherein:
the sample image obtaining module 802 is configured to obtain a sample image and annotation information, where the annotation information includes real position information and real category information of a target in the sample image.
And the feature map obtaining module 804 is configured to obtain a feature map of the sample image through a feature extraction network of the initial model.
And an initial detection frame generation module 806, configured to determine, through the region generation network of the initial model, initial detection frames in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio.
And the position regression module 808 is configured to adjust the position of each initial detection frame through the regression network of the initial model to obtain the position information of the predicted detection frame, and adjust the network parameters of the regression network according to the actual position information in the labeling information and the position information of the predicted detection frame.
The classification module 810 is configured to predict, through the classification network of the initial model, the prediction probability that the target in each target detection region determined according to the position information of the prediction detection frames corresponds to each preset category, and, after the network parameters of the classification network are adjusted according to the real category information in the annotation information and the prediction probability, obtain a target detection model for performing target detection on images.
In one embodiment, the sample image acquisition module 802 is further configured to acquire an original sample image; judging whether the aspect ratio of the original sample image is 1; if so, scaling the original sample image to a preset size in an equal proportion to obtain a sample image; if not, the original sample image is subjected to equal-scale scaling and then image pixels are supplemented, and a sample image with a preset size is obtained.
In one embodiment, the sample image acquisition module 802 is further configured to acquire an original sample image; carrying out rotation processing on the original sample image according to a preset angle to obtain a sample image, and obtaining real labeling information of the sample image according to the rotation angle of a rectangular surrounding frame in the original sample image and the preset angle; or, carrying out vertical mirror image processing on the original sample image to obtain a sample image, and obtaining real annotation information of the sample image according to the rotation angle of the rectangular surrounding frame in the original sample image; or carrying out horizontal mirror image processing on the original sample image to obtain a sample image, and obtaining the real annotation information of the sample image according to the rotation angle of the rectangular surrounding frame in the original sample image.
In one embodiment, the apparatus further includes a statistical module configured to obtain the sample images and the width and height information of the rectangular bounding box corresponding to the target in each sample image; compute the aspect ratio of each rectangular bounding box from the width and height information; and cluster the computed aspect ratios to obtain the target aspect ratios from the clustering result.
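For example, the clustering can be a one-dimensional k-means over the collected aspect ratios. The sample values and the choice of three clusters below are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Aspect ratios (w / h) computed from the annotated rectangular bounding
# boxes across the sample set; these values are made up for illustration.
ratios = np.array([[0.31], [0.35], [0.48], [0.52], [0.97], [1.02], [2.9], [3.1]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ratios)
target_aspect_ratios = sorted(float(c) for c in kmeans.cluster_centers_.ravel())
print(target_aspect_ratios)  # roughly [0.4, 1.0, 3.0]; used as the preset target aspect ratios
```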
In one embodiment, the apparatus further includes a screening module configured to determine the prediction detection frame according to the position information of the prediction detection frame; determine the rectangular bounding box corresponding to the target in the sample image according to the real position information; calculate the intersection-over-union (IoU) between the prediction detection frame and the rectangular bounding box; calculate the rotation angle difference between the prediction detection frame and the rectangular bounding box; mark the sample image as a positive sample image when the IoU is greater than a first threshold and the rotation angle difference is smaller than a second threshold; and mark the sample image as a negative sample image when the IoU is smaller than a third threshold or the rotation angle difference is larger than the second threshold.
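A sketch of this screening logic, computing the rotated IoU via polygon intersection with the shapely library. The thresholds (first 0.7, second 15 degrees, third 0.3) and the treatment of the in-between case as ignored are illustrative assumptions:

```python
import numpy as np
from shapely.geometry import Polygon

def rect_polygon(cx, cy, w, h, theta_deg):
    """Corner polygon of a rotated rectangle (cx, cy, w, h, theta)."""
    rad = np.deg2rad(theta_deg)
    rot = np.array([[np.cos(rad), -np.sin(rad)],
                    [np.sin(rad),  np.cos(rad)]])
    corners = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                        [w / 2,  h / 2], [-w / 2,  h / 2]])
    return Polygon(corners @ rot.T + np.array([cx, cy]))

def label_sample(pred, gt, iou_hi=0.7, angle_tol=15.0, iou_lo=0.3):
    """Positive when IoU > first threshold and angle difference < second
    threshold; negative when IoU < third threshold or angle difference >
    second threshold. Threshold values are illustrative assumptions."""
    p, g = rect_polygon(*pred), rect_polygon(*gt)
    inter = p.intersection(g).area
    iou = inter / (p.area + g.area - inter)
    diff = abs(pred[4] - gt[4]) % 180
    angle_diff = min(diff, 180 - diff)          # smallest difference on a 180-degree cycle
    if iou > iou_hi and angle_diff < angle_tol:
        return "positive"
    if iou < iou_lo or angle_diff > angle_tol:
        return "negative"
    return "ignored"                            # neither rule fires: left out of this update

print(label_sample((100, 100, 80, 40, 10), (102, 98, 82, 38, 12)))  # positive
```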
In one embodiment, the classification module is further configured to determine the target detection areas on the feature map according to the position information of each prediction detection frame; adjust the target detection areas to the same preset scale and then obtain the feature vector corresponding to each target detection area; and determine, from each feature vector, the prediction probability that the corresponding target detection area belongs to each preset category.
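A toy sketch of this step: each rotated target detection area is warped to the same preset scale, flattened into a feature vector, and passed through a linear-softmax stand-in for the classification network. The 7 x 7 pooled size, the single-channel feature map, and the random parameters are illustration-only assumptions; the real classification network is learned during training.

```python
import cv2
import numpy as np

def extract_region(feature_map, box, preset=7):
    """Warp the rotated target detection area to a preset x preset patch
    (a crude stand-in for rotated ROI pooling on the feature map)."""
    cx, cy, w, h, theta = box
    src = cv2.boxPoints(((cx, cy), (w, h), theta))[:3].astype(np.float32)
    dst = np.float32([[0, preset - 1], [0, 0], [preset - 1, 0]])
    m = cv2.getAffineTransform(src, dst)        # maps three box corners to patch corners
    return cv2.warpAffine(feature_map, m, (preset, preset))

def classify(region, weights, bias):
    """Flatten the pooled region into a feature vector and softmax it into
    per-preset-category prediction probabilities."""
    logits = weights @ region.reshape(-1) + bias
    e = np.exp(logits - logits.max())           # numerically stable softmax
    return e / e.sum()

fmap = np.random.rand(64, 64).astype(np.float32)
region = extract_region(fmap, (32, 32, 20, 10, 30.0))
probs = classify(region, np.random.randn(5, 49), np.zeros(5))  # 5 preset categories
print(probs.sum())  # 1.0
```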
In the training apparatus 800, the annotation information of the sample image includes both real position information and real category information, and the real position information carries the rotation angle of the target's bounding box. The trained target detection model therefore learns to recognize the rotation angle of a target in an image, so the detection frames it localizes are more accurate. In addition, the region generation network is initialized with the preset rotation angle, scale, and target aspect ratio used to generate the initial detection frames; this enriches how initial detection frames are generated, makes training more stable, and, because the initial detection frames are determined according to preset rotation angles, brings them closer to the real target detection frames. After the regression network adjusts the positions of the initial detection frames to produce the prediction detection frames, the target detection areas on the feature map are obtained from those frames, and the regression network's parameters are adjusted according to the real position information in the annotation information and the position information of the prediction detection frames. After the classification network predicts the category probabilities of the target detection areas, its parameters are adjusted according to the real category information in the annotation information and the prediction probabilities. The result is a target detection model that can detect rotated targets in images and localize them more accurately.
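The position adjustment that turns an initial detection frame into a prediction detection frame can be sketched as decoding predicted offsets against the five-parameter box (centre coordinates, width, height, rotation angle). The offset parameterization below follows common rotated-detector practice and is an assumption of this sketch, not a quotation of this application.

```python
import numpy as np

def apply_deltas(anchors, deltas):
    """Decode predicted offsets (dx, dy, dw, dh, dtheta) against initial
    detection frames (cx, cy, w, h, theta in degrees)."""
    cx, cy, w, h, theta = anchors.T
    dx, dy, dw, dh, dt = deltas.T
    pred_cx = cx + dx * w             # centre shift, relative to anchor size
    pred_cy = cy + dy * h
    pred_w = w * np.exp(dw)           # log-space width/height offsets
    pred_h = h * np.exp(dh)
    pred_theta = (theta + dt) % 180   # rotation offset, wrapped to [0, 180)
    return np.stack([pred_cx, pred_cy, pred_w, pred_h, pred_theta], axis=1)

anchors = np.array([[100.0, 100.0, 80.0, 40.0, 30.0]])
deltas = np.array([[0.05, -0.02, 0.1, 0.0, 5.0]])
print(apply_deltas(anchors, deltas))  # [[104., 99.2, ~88.4, 40., 35.]]
```

A smooth-L1 loss between the decoded prediction detection frames (or the raw offsets) and the annotated rectangular bounding boxes could then drive the adjustment of the regression network's parameters.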
FIG. 9 is a diagram illustrating the internal structure of a computer device in one embodiment. The computer device may specifically be the computer device of FIG. 1. As shown in FIG. 9, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the training method of the target detection model. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the training method of the target detection model.
Those skilled in the art will appreciate that the architecture shown in FIG. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the present solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the training apparatus for the target detection model provided in the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in FIG. 9. The memory of the computer device may store the program modules constituting the training apparatus of the target detection model, such as the sample image acquisition module 802, the feature map acquisition module 804, the initial detection frame generation module 806, the position regression module 808, and the classification module 810 shown in FIG. 8. These program modules constitute a computer program that causes the processor to execute the steps of the training method of the target detection model of the embodiments of the present application described in this specification.
For example, the computer device shown in FIG. 9 may execute step S202 through the sample image acquisition module 802 in the training apparatus of the target detection model shown in FIG. 8. The computer device may execute step S204 through the feature map acquisition module 804, step S206 through the initial detection frame generation module 806, step S208 through the position regression module 808, and steps S210 and S212 through the classification module 810.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above training method of the target detection model. The steps here may be the steps in the training method of the target detection model of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of the above training method of the target detection model. The steps here may be the steps in the training method of the target detection model of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium; when executed, the program may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of training a target detection model, comprising:
acquiring a sample image and annotation information, wherein the annotation information comprises real position information and real category information of a target in the sample image, and the real position information comprises a rotation angle of a rectangular bounding box corresponding to the target;
obtaining a feature map of the sample image through a feature extraction network of an initial model;
determining, through a region generation network of the initial model, an initial detection frame in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio;
adjusting the position of each initial detection frame through a regression network of the initial model to obtain position information of a prediction detection frame, and adjusting network parameters of the regression network according to the real position information in the annotation information and the position information of the prediction detection frame;
predicting, through a classification network of the initial model, the prediction probability that the target corresponds to each preset category according to the target detection area determined from the position information of each prediction detection frame; and
obtaining, after adjusting network parameters of the classification network according to the real category information in the annotation information and the prediction probability, a target detection model for performing target detection on images.
2. The method of claim 1, wherein the obtaining a sample image comprises:
obtaining an original sample image;
determining whether the aspect ratio of the original sample image is 1; if so, proportionally scaling the original sample image to a preset size to obtain a sample image; and if not, proportionally scaling the original sample image and then padding image pixels to obtain a sample image of the preset size.
3. The method of claim 1, wherein the obtaining a sample image comprises:
obtaining an original sample image;
performing rotation processing on the original sample image by a preset angle to obtain a sample image, and obtaining the real annotation information of the sample image according to the rotation angle of a rectangular bounding box in the original sample image and the preset angle; or,
performing vertical mirror processing on the original sample image to obtain a sample image, and obtaining the real annotation information of the sample image according to the rotation angle of the rectangular bounding box in the original sample image; or,
performing horizontal mirror processing on the original sample image to obtain a sample image, and obtaining the real annotation information of the sample image according to the rotation angle of the rectangular bounding box in the original sample image.
4. The method of claim 1, wherein the step of determining the target aspect ratio comprises:
obtaining the sample images and width and height information of the rectangular bounding box corresponding to the target in each sample image;
computing the aspect ratio of each rectangular bounding box from the width and height information; and
clustering the computed aspect ratios to obtain the target aspect ratio from the clustering result.
5. The method of claim 1, wherein adjusting the position of each initial detection frame to obtain the position information of the prediction detection frame comprises:
calculating the position offset of each initial detection frame according to the current network parameters of the regression network;
obtaining the position information of a prediction detection frame according to the initial detection frame and the position offset; the position information includes coordinates of a geometric center point of the prediction detection frame, a width and a height of the prediction detection frame, and a rotation angle of the prediction detection frame.
6. The method of claim 5, further comprising:
determining the prediction detection frame according to the position information of the prediction detection frame;
determining the rectangular bounding box corresponding to the target in the sample image according to the real position information;
calculating the intersection-over-union between the prediction detection frame and the rectangular bounding box;
calculating the rotation angle difference between the prediction detection frame and the rectangular bounding box;
marking the sample image as a positive sample image when the intersection-over-union is greater than a first threshold and the rotation angle difference is smaller than a second threshold; and
marking the sample image as a negative sample image when the intersection-over-union is smaller than a third threshold or the rotation angle difference is larger than the second threshold.
7. The method according to claim 1, wherein predicting the prediction probability that the target corresponds to each preset category according to the target detection area determined from the position information of each prediction detection frame comprises:
determining a target detection area on the feature map according to the position information of each prediction detection frame;
after the target detection areas are adjusted to the same preset scale, acquiring a feature vector corresponding to each target detection area;
and determining the prediction probability of each preset category corresponding to the target detection area according to the feature vector.
8. An apparatus for training a target detection model, the apparatus comprising:
a sample image acquisition module, configured to acquire a sample image and annotation information, wherein the annotation information comprises real position information and real category information of a target in the sample image;
a feature map acquisition module, configured to obtain a feature map of the sample image through a feature extraction network of an initial model;
an initial detection frame generation module, configured to determine, through a region generation network of the initial model, an initial detection frame in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio;
a position regression module, configured to adjust the position of each initial detection frame through a regression network of the initial model to obtain position information of a prediction detection frame, and to adjust network parameters of the regression network according to the real position information in the annotation information and the position information of the prediction detection frame; and
a classification module, configured to predict, through a classification network of the initial model, the prediction probability that the target corresponds to each preset category according to the target detection area determined from the position information of each prediction detection frame; and to obtain, after adjusting network parameters of the classification network according to the real category information in the annotation information and the prediction probability, a target detection model for performing target detection on images.
9. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.
10. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
CN201911422532.3A 2019-12-31 2019-12-31 Training method and device for target detection model, storage medium and computer equipment Active CN111241947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911422532.3A CN111241947B (en) 2019-12-31 2019-12-31 Training method and device for target detection model, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911422532.3A CN111241947B (en) 2019-12-31 2019-12-31 Training method and device for target detection model, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN111241947A true CN111241947A (en) 2020-06-05
CN111241947B CN111241947B (en) 2023-07-18

Family

ID=70874291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911422532.3A Active CN111241947B (en) 2019-12-31 2019-12-31 Training method and device for target detection model, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN111241947B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376235A (en) * 2018-01-15 2018-08-07 深圳市易成自动驾驶技术有限公司 Image detecting method, device and computer readable storage medium
CN109815868A (en) * 2019-01-15 2019-05-28 腾讯科技(深圳)有限公司 A kind of image object detection method, device and storage medium
CN109961040A (en) * 2019-03-20 2019-07-02 深圳市华付信息技术有限公司 Identity card area positioning method, device, computer equipment and storage medium
CN110232311A (en) * 2019-04-26 2019-09-13 平安科技(深圳)有限公司 Dividing method, device and the computer equipment of hand images
CN110097018A (en) * 2019-05-08 2019-08-06 深圳供电局有限公司 Transformer substation instrument detection method and device, computer equipment and storage medium
CN110298298A (en) * 2019-06-26 2019-10-01 北京市商汤科技开发有限公司 Target detection and the training method of target detection network, device and equipment

Cited By (70)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680680A (en) * 2020-06-09 2020-09-18 创新奇智(合肥)科技有限公司 Object code positioning method and device, electronic equipment and storage medium
CN111680680B (en) * 2020-06-09 2023-10-13 创新奇智(合肥)科技有限公司 Target code positioning method and device, electronic equipment and storage medium
CN111797737A (en) * 2020-06-22 2020-10-20 重庆高新区飞马创新研究院 Remote sensing target detection method and device
CN113836977B (en) * 2020-06-24 2024-02-23 顺丰科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN113836977A (en) * 2020-06-24 2021-12-24 顺丰科技有限公司 Target detection method and device, electronic equipment and storage medium
CN111862001A (en) * 2020-06-28 2020-10-30 微医云(杭州)控股有限公司 Semi-automatic labeling method and device for CT image, electronic equipment and storage medium
CN111862001B (en) * 2020-06-28 2023-11-28 微医云(杭州)控股有限公司 Semi-automatic labeling method and device for CT images, electronic equipment and storage medium
WO2022000855A1 (en) * 2020-06-29 2022-01-06 魔门塔(苏州)科技有限公司 Target detection method and device
CN111898659A (en) * 2020-07-16 2020-11-06 北京灵汐科技有限公司 Target detection method and system
CN112001247A (en) * 2020-07-17 2020-11-27 浙江大华技术股份有限公司 Multi-target detection method, equipment and storage device
CN111814905A (en) * 2020-07-23 2020-10-23 上海眼控科技股份有限公司 Target detection method, target detection device, computer equipment and storage medium
CN112149684A (en) * 2020-08-19 2020-12-29 北京豆牛网络科技有限公司 Image processing method and image preprocessing method for target detection
CN112149684B (en) * 2020-08-19 2024-06-07 北京豆牛网络科技有限公司 Image processing method and image preprocessing method for target detection
CN112052787B (en) * 2020-09-03 2021-07-30 腾讯科技(深圳)有限公司 Target detection method and device based on artificial intelligence and electronic equipment
CN112052787A (en) * 2020-09-03 2020-12-08 腾讯科技(深圳)有限公司 Target detection method and device based on artificial intelligence and electronic equipment
CN111931864A (en) * 2020-09-17 2020-11-13 南京甄视智能科技有限公司 Method and system for multiple optimization of target detector based on vertex distance and cross-over ratio
CN112396122A (en) * 2020-09-17 2021-02-23 南京甄视智能科技有限公司 Method and system for multiple optimization of target detector based on vertex distance and cross-over ratio
CN112396122B (en) * 2020-09-17 2022-11-22 小视科技(江苏)股份有限公司 Method and system for multiple optimization of target detector based on vertex distance and cross-over ratio
CN112183529A (en) * 2020-09-23 2021-01-05 创新奇智(北京)科技有限公司 Quadrilateral object detection method, quadrilateral object model training method, quadrilateral object detection device, quadrilateral object model training device and storage medium
CN112115898A (en) * 2020-09-24 2020-12-22 深圳市赛为智能股份有限公司 Multi-pointer instrument detection method and device, computer equipment and storage medium
CN112461130A (en) * 2020-11-16 2021-03-09 北京平恒智能科技有限公司 Positioning method for visual inspection tool frame of adhesive product
CN112464785A (en) * 2020-11-25 2021-03-09 浙江大华技术股份有限公司 Target detection method and device, computer equipment and storage medium
CN112489011B (en) * 2020-11-27 2023-01-31 上海航天控制技术研究所 Intelligent assembling and adjusting method for star sensor optical machine component
CN112489011A (en) * 2020-11-27 2021-03-12 上海航天控制技术研究所 Intelligent assembling and adjusting method for star sensor optical machine component
CN112418344B (en) * 2020-12-07 2023-11-21 汇纳科技股份有限公司 Training method, target detection method, medium and electronic equipment
CN112418344A (en) * 2020-12-07 2021-02-26 汇纳科技股份有限公司 Training method, target detection method, medium and electronic device
CN112488118A (en) * 2020-12-18 2021-03-12 哈尔滨工业大学(深圳) Target detection method and related device
CN112488118B (en) * 2020-12-18 2023-08-08 哈尔滨工业大学(深圳) Target detection method and related device
CN112508975A (en) * 2020-12-21 2021-03-16 上海眼控科技股份有限公司 Image identification method, device, equipment and storage medium
CN112799055A (en) * 2020-12-28 2021-05-14 深圳承泰科技有限公司 Method and device for detecting detected vehicle and electronic equipment
CN112613570B (en) * 2020-12-29 2024-06-11 深圳云天励飞技术股份有限公司 Image detection method, image detection device, equipment and storage medium
WO2022142783A1 (en) * 2020-12-29 2022-07-07 华为云计算技术有限公司 Image processing method and related device
CN112613570A (en) * 2020-12-29 2021-04-06 深圳云天励飞技术股份有限公司 Image detection method, image detection device, equipment and storage medium
CN112686162A (en) * 2020-12-31 2021-04-20 北京每日优鲜电子商务有限公司 Method, device, equipment and storage medium for detecting clean state of warehouse environment
CN112686162B (en) * 2020-12-31 2023-12-15 鄂尔多斯市空港大数据运营有限公司 Method, device, equipment and storage medium for detecting clean state of warehouse environment
CN112801164A (en) * 2021-01-22 2021-05-14 北京百度网讯科技有限公司 Training method, device and equipment of target detection model and storage medium
CN112801164B (en) * 2021-01-22 2024-02-13 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of target detection model
CN112766418A (en) * 2021-03-02 2021-05-07 阳光财产保险股份有限公司 Image text direction classification method, device, equipment and storage medium
CN113128485A (en) * 2021-03-17 2021-07-16 北京达佳互联信息技术有限公司 Training method of text detection model, text detection method and device
CN113283345A (en) * 2021-05-27 2021-08-20 新东方教育科技集团有限公司 Blackboard writing behavior detection method, training method, device, medium and equipment
CN113283345B (en) * 2021-05-27 2023-11-24 新东方教育科技集团有限公司 Blackboard writing behavior detection method, training device, medium and equipment
CN113313111A (en) * 2021-05-28 2021-08-27 北京百度网讯科技有限公司 Text recognition method, device, equipment and medium
CN113343853A (en) * 2021-06-08 2021-09-03 深圳格瑞健康管理有限公司 Intelligent screening method and device for child dental caries
CN113343853B (en) * 2021-06-08 2024-06-14 深圳格瑞健康科技有限公司 Intelligent screening method and device for dental caries of children
CN113269188A (en) * 2021-06-17 2021-08-17 华南农业大学 General method for detecting mark points and pixel coordinates thereof
CN113269188B (en) * 2021-06-17 2023-03-14 华南农业大学 Mark point and pixel coordinate detection method thereof
CN113748430B (en) * 2021-06-28 2024-05-24 商汤国际私人有限公司 Training and detecting method, device, equipment and storage medium of object detection network
CN113748430A (en) * 2021-06-28 2021-12-03 商汤国际私人有限公司 Object detection network training and detection method, device, equipment and storage medium
CN113643323B (en) * 2021-08-20 2023-10-03 中国矿业大学 Target detection system under urban underground comprehensive pipe rack dust fog environment
CN113643323A (en) * 2021-08-20 2021-11-12 中国矿业大学 Target detection system under dust and fog environment of urban underground comprehensive pipe gallery
CN113744213A (en) * 2021-08-23 2021-12-03 上海明略人工智能(集团)有限公司 Method and system for detecting regularity of food balance, computer equipment and storage medium
CN113920068A (en) * 2021-09-23 2022-01-11 北京医准智能科技有限公司 Body part detection method and device based on artificial intelligence and electronic equipment
CN114022695A (en) * 2021-10-29 2022-02-08 北京百度网讯科技有限公司 Training method and device for detection model, electronic equipment and storage medium
CN114220063B (en) * 2021-11-17 2023-04-07 浙江大华技术股份有限公司 Target detection method and device
CN114220063A (en) * 2021-11-17 2022-03-22 浙江大华技术股份有限公司 Target detection method and device
TWI822282B (en) * 2021-12-02 2023-11-11 美商萬國商業機器公司 Computer-implemented method, computer system and computer program product for object detection considering tendency of object location
CN114462469A (en) * 2021-12-20 2022-05-10 浙江大华技术股份有限公司 Training method of target detection model, target detection method and related device
CN114462469B (en) * 2021-12-20 2023-04-18 浙江大华技术股份有限公司 Training method of target detection model, target detection method and related device
CN114862810A (en) * 2022-05-19 2022-08-05 苏州大学 Fruit counting method, device and storage medium
CN114862810B (en) * 2022-05-19 2024-07-02 苏州大学 Fruit counting method, device and storage medium
CN115482417B (en) * 2022-09-29 2023-08-08 珠海视熙科技有限公司 Multi-target detection model, training method, device, medium and equipment thereof
CN115482417A (en) * 2022-09-29 2022-12-16 珠海视熙科技有限公司 Multi-target detection model and training method, device, medium and equipment thereof
CN115375917A (en) * 2022-10-25 2022-11-22 杭州华橙软件技术有限公司 Target edge feature extraction method, device, terminal and storage medium
CN115375917B (en) * 2022-10-25 2023-03-24 杭州华橙软件技术有限公司 Target edge feature extraction method, device, terminal and storage medium
CN117611513A (en) * 2022-11-08 2024-02-27 郑州英视江河生态环境科技有限公司 Microscopic biological image processing method, device and system
CN116128954B (en) * 2022-12-30 2023-12-05 上海强仝智能科技有限公司 Commodity layout identification method, device and storage medium based on generation network
CN116128954A (en) * 2022-12-30 2023-05-16 上海强仝智能科技有限公司 Commodity layout identification method, device and storage medium based on generation network
CN116862980B (en) * 2023-06-12 2024-01-23 上海玉贲智能科技有限公司 Target detection frame position optimization correction method, system, medium and terminal for image edge
CN116862980A (en) * 2023-06-12 2023-10-10 上海玉贲智能科技有限公司 Target detection frame position optimization correction method, system, medium and terminal for image edge
CN117831082A (en) * 2023-12-29 2024-04-05 广电运通集团股份有限公司 Palm area detection method and device

Also Published As

Publication number Publication date
CN111241947B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN111241947B (en) Training method and device for target detection model, storage medium and computer equipment
CN110245662B (en) Detection model training method and device, computer equipment and storage medium
CN110569721B (en) Recognition model training method, image recognition method, device, equipment and medium
CN111950329B (en) Target detection and model training method, device, computer equipment and storage medium
CN109035299B (en) Target tracking method and device, computer equipment and storage medium
CN109271870B (en) Pedestrian re-identification method, device, computer equipment and storage medium
CN110852285B (en) Object detection method and device, computer equipment and storage medium
CN111523414B (en) Face recognition method, device, computer equipment and storage medium
CN110348294B (en) Method and device for positioning chart in PDF document and computer equipment
CN111079632A (en) Training method and device of text detection model, computer equipment and storage medium
US11900676B2 (en) Method and apparatus for detecting target in video, computing device, and storage medium
CN112989962B (en) Track generation method, track generation device, electronic equipment and storage medium
CN112241952B (en) Brain midline identification method, device, computer equipment and storage medium
WO2022134354A1 (en) Vehicle loss detection model training method and apparatus, vehicle loss detection method and apparatus, and device and medium
CN113706481A (en) Sperm quality detection method, sperm quality detection device, computer equipment and storage medium
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
Guo et al. Machine vision-based intelligent manufacturing using a novel dual-template matching: a case study for lithium battery positioning
CN112241705A (en) Target detection model training method and target detection method based on classification regression
CN112199984B (en) Target rapid detection method for large-scale remote sensing image
CN114419370A (en) Target image processing method and device, storage medium and electronic equipment
CN114005052A (en) Target detection method and device for panoramic image, computer equipment and storage medium
CN110472656B (en) Vehicle image classification method, device, computer equipment and storage medium
CN112164090A (en) Data processing method and device, electronic equipment and machine-readable storage medium
CN112926610A (en) Construction method of license plate image screening model and license plate image screening method
CN113780131B (en) Text image orientation recognition method, text content recognition method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant