CN111798456A - Instance segmentation model training method and device and instance segmentation method - Google Patents

Instance segmentation model training method and device and instance segmentation method

Info

Publication number
CN111798456A
CN111798456A CN202010454014.6A CN202010454014A CN111798456A CN 111798456 A CN111798456 A CN 111798456A CN 202010454014 A CN202010454014 A CN 202010454014A CN 111798456 A CN111798456 A CN 111798456A
Authority
CN
China
Prior art keywords
training
model
training set
detected
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010454014.6A
Other languages
Chinese (zh)
Inventor
卢运西
徐兆坤
黄银君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Cloud Computing Co Ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd filed Critical Suning Cloud Computing Co Ltd
Priority to CN202010454014.6A priority Critical patent/CN111798456A/en
Publication of CN111798456A publication Critical patent/CN111798456A/en
Priority to PCT/CN2021/095363 priority patent/WO2021238826A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a training method and device for an instance segmentation model, and an instance segmentation method. The method comprises the following steps: pruning a pre-constructed deep learning model; acquiring a training set and labeling it, wherein the training set is a set of RGBD images, acquired by different depth cameras, of a scene containing target objects, and each RGBD image comprises a depth map and a color map; and training the pruned deep learning model with the labeled training set to obtain an instance segmentation model. By pruning the network structure of an existing instance segmentation model, the whole model is made lighter, which improves both the training speed and the prediction speed of the model. Meanwhile, to prevent the prediction accuracy of the model from dropping because of the reduced number of network layers, the depth map is added as input, expanding the number of channels and improving the training accuracy and prediction accuracy of the model.

Description

Instance segmentation model training method and device and instance segmentation method
Technical Field
The invention belongs to the field of target detection, and particularly relates to a training method and device of an instance segmentation model and an instance segmentation method.
Background
With the continuous improvement of technology, techniques in the field of artificial intelligence have matured and been put into practical use, greatly improving people's quality of life. Nowadays, image and video acquisition systems are installed in many scenes; running state-of-the-art artificial intelligence techniques on these systems can greatly improve their understanding of image and video content, providing intelligent monitoring capability for scenes such as offline unmanned stores, security systems and public places. However, existing instance segmentation models use a large number of network layers to extract image features, so the whole training process is slow when the amount of data is large. Moreover, existing instance segmentation models are generally trained only on color maps, and the prediction accuracy of a model trained only on color maps is often insufficient in scenes such as offline unmanned stores, security systems and public places. Therefore, an efficient and fast deep learning segmentation algorithm providing the relevant technical capability is urgently needed.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a training method and device for an instance segmentation model, and an instance segmentation method.
The embodiment of the invention provides the following specific technical scheme:
In a first aspect, a method for training an instance segmentation model is disclosed, the method comprising:
pruning a pre-constructed deep learning model;
acquiring a training set and marking the training set, wherein the training set is a set of RGBD images with target objects in a scene acquired by different depth cameras, and the RGBD images comprise a depth image and a color image;
and training the pruned deep learning model by using the labeled training set to obtain an instance segmentation model.
Preferably, the method further comprises: preprocessing the training set before labeling, specifically comprising:
performing three-dimensional reconstruction according to the acquired depth map in the training set to obtain a first modeling result, and simultaneously performing three-dimensional reconstruction according to the acquired depth map in the RGBD image which is acquired by different depth cameras and does not have any target object in the scene corresponding to the training set to obtain a second modeling result;
according to the second modeling result, performing background removal processing on the first modeling result to obtain a foreground image containing a target object;
determining a truncation distance corresponding to each depth camera according to the foreground image containing the target object, wherein the truncation distance is a moving distance from the target object to the depth camera;
performing truncation processing on the depth map in each training set according to the truncation distance corresponding to each depth camera, and performing normalization processing on the depth map after the truncation processing;
and carrying out normalization processing on the color image in the training set.
Preferably, the labeling of the training set specifically includes:
calculating the integrity of each target object in the RGBD image with the target object;
and when the integrity of all the target objects is greater than a first preset value, labeling the target objects in the RGBD image with the target objects to obtain a target object detection frame, and generating corresponding labels.
Preferably, the training the pruned deep learning model by using the labeled training set to obtain the instance segmentation model specifically includes:
extracting features of the marked training set, and fusing the extracted features to obtain a feature region;
performing segmentation processing on the feature region to obtain a segmentation result of the feature region, and performing regression and classification processing on the feature region to obtain a detection frame of the feature region, a classification result corresponding to the detection frame, and an instance score associated with the segmentation result;
multiplying the segmentation result and the corresponding example score to obtain an example segmentation result;
utilizing the corresponding target object detection frame to perform truncation processing on the example segmentation result, calculating an error between the truncated example segmentation result and the corresponding label, and meanwhile calculating an error between the detection frame and the corresponding target object detection frame;
calculating a total loss value according to an error between the example segmentation result after the truncation processing and the corresponding label and an error between the detection frame and the corresponding target object detection frame;
and judging the total loss value, stopping training the deep learning model when the total loss value is smaller than a second preset value, and determining the corresponding deep learning model as the example segmentation model when the total loss value is smaller than the second preset value.
Preferably, the regression processing on the feature region specifically includes:
predicting the central point of the feature region, and calculating the width and the height of the feature region according to the central point to generate the detection frame;
the method further comprises the following steps:
and when the number of the generated detection frames is more than one, performing maximum pooling on each generated detection frame and storing the detection frames after the maximum pooling meeting the first preset condition.
Preferably, the training of the pruned deep learning model by using the labeled training set specifically further comprises:
and training the deep learning model according to the learning rate corresponding to the current total loss value.
Preferably, pruning the pre-constructed deep learning model specifically includes:
obtaining an influence factor corresponding to a network layer to be pruned in the deep learning model, wherein the influence factor is a scaling factor obtained by performing normalization calculation on the network layer to be pruned;
and when the influence factor is smaller than a third preset value, pruning the network layer corresponding to the influence factor.
In a second aspect, a method for instance segmentation is disclosed, the method comprising:
acquiring a picture to be detected;
inputting the picture to be detected into a pre-trained example segmentation model for identification, and outputting a detection frame and an example segmentation result of the picture to be detected;
wherein the pre-trained instance segmentation model is obtained by training based on the method of the first aspect.
Preferably, before the picture to be detected is input to a pre-trained example segmentation model for recognition, the method further includes:
acquiring the number of the pictures to be detected;
when the number of the pictures to be detected is more than one, splicing the pictures to be detected;
the inputting the picture to be detected into a pre-trained example segmentation model for identification, and outputting the detection frame and the example segmentation result of the picture to be detected specifically comprises:
inputting the spliced pictures to be detected into the example segmentation model for identification, and outputting detection frames and example segmentation results of all the pictures to be detected;
the method further comprises the following steps:
and splitting the detection frames and the example segmentation results of all the pictures to be detected to obtain the detection frames and the example segmentation results corresponding to each picture to be detected.
In a third aspect, an apparatus for training an instance segmentation model is disclosed, the apparatus comprising:
the pruning module is used for pruning the pre-constructed deep learning model;
the acquisition module is used for acquiring a training set; the training set is a set of RGBD images with target objects in a scene collected by different depth cameras, and the RGBD images comprise a depth map and a color map;
the preprocessing module is used for labeling the training set;
and the training module is used for training the pruned deep learning model by utilizing the labeled training set to obtain an example segmentation model.
The embodiment of the invention has the following beneficial effects:
1. According to the invention, the deep learning model is pruned so that the network structure is lighter and both training and prediction are fast; meanwhile, a depth map is added when training the deep learning network, expanding the number of channels and improving both training accuracy and prediction accuracy;
2. When training the deep learning model, the depth maps in the training data are truncated and normalized and the color maps are normalized, which improves the accuracy of the training data and therefore the training accuracy of the model;
3. The training data are labeled with a dedicated labeling strategy and data with low integrity are discarded, which improves the effectiveness of the labeled training data and therefore the training accuracy of the model;
4. In the instance segmentation model of the present application, the center point of a target object is predicted with an anchor-free method and the width and height are then obtained by regression to produce the detection frame; the detection frames are de-duplicated by maximum pooling, which improves detection robustness in crowded scenes and effectively avoids missing detection frames when people are densely packed;
5. When the model is used for prediction, the input data are spliced and the output results are split, which enables parallel processing, improves execution efficiency and makes efficient use of computing resources, making the method better suited to application scenarios involving video monitoring.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of a training method of an example segmentation model provided in embodiment 1 of the present application;
FIG. 2 is a block diagram of an example segmentation model provided in embodiment 1 of the present application;
FIG. 3 is a flow chart of an example segmentation method provided in embodiment 2 of the present application;
fig. 4 is a schematic structural diagram of a training apparatus for an example segmentation model provided in embodiment 3 of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As described in the background art, existing instance segmentation network algorithms have high complexity, and their real-time performance is difficult to guarantee. A more efficient and faster segmentation algorithm is therefore needed to better distinguish different individuals in videos and images and to obtain richer information about human bodies, so that target objects can be better identified in scenes such as unmanned stores, security systems and public places. Based on this, the present application provides a training method for an instance segmentation model, which yields a model that is lighter, faster in training and prediction, and more accurate.
Example 1
As shown in fig. 1, a training method of an example segmentation model includes the following steps:
s11, constructing a deep learning model;
the method builds a basic network structure based on a YOLACT model, and builds a deep learning model. The deep learning model comprises: the model has a specific structure as shown in fig. 2, and includes a content 18 network, an FPN network, two network branches (protonet and Pred _ heads) connected to the FPN network, a crop network, and the like. The content 18 network is used for extracting features, the FPN network is used for fusing the features, the protonet network branch is used for segmenting the characteristic diagram to obtain segmentation results including a foreground and a background, and the Pred _ headers are used for predicting the characteristic diagram to obtain a detection frame, a category and a confidence coefficient of a target object and an example segmentation score associated with the prediction results of the protonet network branch.
S12, pruning the deep learning model;
In order to make the network lighter, so that both training and prediction with the network are fast, certain network layers are selected for pruning.
In the application, a network layer is pruned by a coarse-grained method, and the method comprises the following specific steps:
1. acquiring an influence factor corresponding to a network layer to be pruned in the deep learning model, wherein the influence factor is a scaling factor obtained by carrying out normalization calculation on the network layer to be pruned;
wherein, the network layer to be pruned is a convolutional layer.
2. And when the influence factor is smaller than a preset value, pruning the network layer corresponding to the influence factor.
Specifically, a batch normalization layer is added after each convolution layer (i.e., after each layer of the ResNet-18 network), and the normalization is computed for each convolution layer. The calculation formula of the batch normalization layer includes a parameter gamma, which is a scaling factor; when gamma is smaller than a preset value, the corresponding channel is less important, and the corresponding network layer can therefore be pruned. In addition, a regularization term on gamma can be added to the training objective, so that automatic pruning is realized during model training.
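The following is a minimal Python/PyTorch sketch of this scaling-factor pruning scheme, in the spirit of the network-slimming approach cited in the non-patent literature of this application; the threshold and the regularization weight are illustrative assumptions.

import torch.nn as nn

def sparsity_penalty(model, lam=1e-4):
    # L1 regularization on the batch-norm scaling factors (gamma); adding this term
    # to the training loss drives the gamma of unimportant channels toward zero.
    return lam * sum(m.weight.abs().sum()
                     for m in model.modules() if isinstance(m, nn.BatchNorm2d))

def channels_to_prune(model, threshold=0.01):
    # Collect, per batch normalization layer, the channels whose |gamma| is smaller
    # than the preset value; the corresponding convolution channels can be pruned.
    plan = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            mask = m.weight.detach().abs() < threshold
            if mask.any():
                plan[name] = mask.nonzero(as_tuple=True)[0].tolist()
    return plan

# During training (sketch): loss = task_loss + sparsity_penalty(model)
# After (or periodically during) training: prune_plan = channels_to_prune(model)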
S13, acquiring and labeling a training set, wherein the training set is a set of RGBD images with target objects in a scene acquired by different depth cameras, and the RGBD images comprise a depth map and a color map;
the purpose of labeling is to preprocess the RGBD image, so that a target object detection frame and a label can be obtained.
The process of labeling the training set specifically includes:
s131, calculating the integrity of each target object in the RGBD image with the target object;
and S132, when the integrity of all the target objects is larger than a preset value, labeling the target objects in the RGBD image with the target objects to obtain a target object detection frame, and generating corresponding labels.
For example, the preset value may be 1/2, and therefore, when the integrity of the target object is greater than 1/2, the target object in the RGBD image with the target object is labeled and a corresponding label is generated. By the special marking strategy, the effectiveness of training data can be improved, and the precision of subsequent model training and prediction can be improved.
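A small Python sketch of this labeling strategy is given below; how integrity is actually measured is not specified by the patent, so the "visible area over full object area" interpretation, the field names and the 1/2 default are assumptions.

def label_training_image(objects, min_integrity=0.5):
    # 'objects' is assumed to be a list of dicts, each with a precomputed
    # 'integrity' in [0, 1] (e.g. visible area / full object area), a 'bbox'
    # (x1, y1, x2, y2) and a 'class_name'.
    if not objects:
        return None
    # S131/S132: only label the image when every target object is complete enough;
    # low-integrity samples are discarded to keep the training data effective.
    if any(obj["integrity"] <= min_integrity for obj in objects):
        return None
    return [{"bbox": obj["bbox"], "label": obj["class_name"]} for obj in objects]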
In order to further improve the accuracy of model training and prediction, the training set may be further processed, specifically including:
1. performing three-dimensional reconstruction according to the acquired depth map in the training set to obtain a first modeling result, and simultaneously performing three-dimensional reconstruction according to the acquired depth map in the RGBD image which is acquired by different depth cameras and does not have any target object in the scene corresponding to the training set to obtain a second modeling result;
specifically, the three-dimensional model including the target object and the three-dimensional model not including any target object are constructed by jointly calibrating the depth maps acquired by different depth cameras.
2. According to the second modeling result, performing background removal processing on the first modeling result to obtain a foreground image containing the target object;
3. determining a truncation distance corresponding to each depth camera according to a foreground image containing a target object, wherein the truncation distance is a moving distance from the target object to the depth camera;
since the moving range of the target object under each depth camera is not fixed, the truncation distance may be a dynamic range.
4. Carrying out truncation processing on the depth map in each training set through the truncation distance corresponding to each depth camera, and carrying out normalization processing on the depth map subjected to truncation processing;
by performing truncation processing on the depth map by using the truncation distance, some noise in the depth map can be filtered.
5. And carrying out normalization processing on the color map in the training set.
Through the processing process, the accuracy of the training data is improved, and therefore the training precision of the model is improved.
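A hedged Python sketch of the per-image part of steps 1-5 follows, covering only the truncation and normalization (the joint calibration and three-dimensional reconstruction are outside its scope); scaling to [0, 1] is an assumption, since the patent only states that the maps are normalized.

import numpy as np

def preprocess_rgbd(depth, color, near, far):
    # Truncate the depth map to the camera-specific truncation range [near, far],
    # which filters noise outside the range the target object moves in, then
    # normalize the truncated depth map and the color map.
    depth = np.clip(depth.astype(np.float32), near, far)
    depth = (depth - near) / max(far - near, 1e-6)
    color = color.astype(np.float32) / 255.0
    # Stack into a 4-channel RGBD input (color map + depth map).
    return np.concatenate([color, depth[..., None]], axis=-1)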
And S14, training the pruned deep learning model by using the labeled training set to obtain an example segmentation model.
Step S14 specifically includes:
s141, extracting features of the marked training set, and fusing the extracted features to obtain a feature region;
Specifically, the labeled training set is input into the ResNet-18 network, which includes a plurality of convolution layers that extract features from the training set at multiple scales. After feature extraction is completed, the multi-scale features are input into the FPN network to obtain feature regions.
The ResNet-18 network outputs two kinds of features: low-level features, which carry little semantic information but locate targets accurately, and high-level features, which carry rich semantic information but locate targets only coarsely. The FPN network is a feature pyramid network that fuses the two kinds of features, alleviating the multi-scale problem and improving target detection performance.
S142, carrying out segmentation processing on the feature region to obtain a segmentation result of the feature region, and carrying out regression and classification processing on the feature region to obtain a detection frame of the feature region, a classification result and confidence corresponding to the detection frame and an instance score associated with the segmentation result;
the method comprises the steps of connecting two branches of an FPN network, inputting characteristic regions into the two network branches (protonet and Pred _ heads) respectively, wherein the protonet network branches are used for segmenting characteristic regions to obtain segmentation results including foreground and background, and the Pred _ heads are used for predicting the characteristic regions to obtain detection frames, categories and confidence degrees of target objects and example scores related to the segmentation results.
S143, multiplying the segmentation result and the corresponding example score to obtain an example segmentation result;
s144, utilizing the corresponding target object detection frame to perform truncation processing on the example segmentation result, calculating an error between the truncated example segmentation result and the corresponding label, and meanwhile calculating an error between the detection frame and the corresponding target object detection frame;
s145, calculating a total loss value according to the error between the example segmentation result after the truncation processing and the corresponding label and the error between the detection frame and the corresponding target object detection frame;
in the scheme, the total loss value is the sum of the error between the example segmentation result after the truncation processing and the corresponding label and the error between the detection frame and the corresponding target object detection frame.
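An illustrative Python/PyTorch sketch of steps S143 to S145 (combining prototypes with the instance scores, truncating with the labeled detection frame, and summing the two errors) is given below; binary cross-entropy for the mask error and smooth-L1 for the box error are assumptions, as the patent only speaks of errors.

import torch
import torch.nn.functional as F

def training_losses(prototypes, coefs, pred_boxes, gt_masks, gt_boxes):
    # prototypes: (P, H, W) prototype masks for one image; coefs: (N, P) instance scores;
    # pred_boxes / gt_boxes: (N, 4); gt_masks: (N, H, W) binary label masks.
    # S143: instance segmentation results = prototypes weighted by the instance scores.
    masks = torch.sigmoid(torch.einsum("phw,np->nhw", prototypes, coefs))
    mask_err = 0.0
    for mask, gt_mask, box in zip(masks, gt_masks, gt_boxes):
        x1, y1, x2, y2 = (int(v) for v in box)
        # S144: truncate the instance segmentation result with the labeled detection frame
        # and compute the error against the corresponding label.
        mask_err = mask_err + F.binary_cross_entropy(mask[y1:y2, x1:x2], gt_mask[y1:y2, x1:x2])
    box_err = F.smooth_l1_loss(pred_boxes, gt_boxes)   # error between predicted and labeled boxes
    return mask_err + box_err                          # S145: total loss = sum of the two errors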
And S146, judging the total loss value, stopping training the deep learning model when the total loss value is smaller than a second preset value, and determining the corresponding deep learning model as an example segmentation model when the total loss value is smaller than the second preset value.
When the total loss value is less than the preset value, the whole model has converged, and training can be stopped at this point.
In addition, in the process of training the model, the deep learning model is optimally trained by using a gradient descent algorithm, and in order to improve the convergence rate of the model, corresponding learning rates can be set for loss values in different stages, and the method specifically comprises the following steps:
and training the deep learning model according to the learning rate corresponding to the current loss value.
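A minimal illustration of such a loss-dependent learning-rate schedule follows; both the thresholds and the learning rates are assumptions, since the patent does not specify the mapping.

def learning_rate_for_loss(total_loss, schedule=((1.0, 1e-3), (0.5, 1e-4), (0.0, 1e-5))):
    # Return the learning rate associated with the stage the current total loss falls in.
    for threshold, lr in schedule:
        if total_loss >= threshold:
            return lr
    return schedule[-1][1]

# Applied to a PyTorch optimizer (sketch):
# for group in optimizer.param_groups:
#     group["lr"] = learning_rate_for_loss(total_loss.item())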
After the model training is completed, the model may be verified to ensure the prediction accuracy of the model, and specifically, the method may include the following implementation steps:
1. acquiring a verification set and marking the verification set, wherein the verification set is a set of RGBD images with target objects acquired by different depth cameras, and the RGBD images comprise a depth image and a color image;
2. inputting the marked verification set into an example segmentation model to obtain an output result;
The output result can be produced at a fixed interval, for example once every 5 iterations of the model, which keeps the model verification process reasonable and efficient.
3. And comparing the output result with the real result to verify the example segmentation model.
Example 2
Based on the example segmentation model obtained by training in the above embodiment 1, an embodiment of the present invention further provides an example segmentation method, as shown in fig. 3, the method includes:
s31, acquiring a picture to be detected;
and S32, inputting the picture to be detected into a pre-trained example segmentation model for identification, and outputting a detection frame and an example segmentation result of the picture to be detected.
The identification process of the picture to be detected may refer to the model training process in embodiment 1. Before the detection frame and instance segmentation result of the picture to be detected are output, the confidence needs to be compared with a preset value; with reference to fig. 2, after this comparison the Crop module outputs only the detection frames whose confidence is higher than the preset value, together with the corresponding instance segmentation results.
The pre-trained example segmentation model is obtained by training based on the method described in embodiment 1.
In order to improve the prediction speed of different pictures, the scheme further comprises the following steps:
s41, before the pictures to be detected are input to a pre-trained example segmentation model for identification, the number of the pictures to be detected is obtained;
s42, when the number of the pictures to be detected is more than one, splicing the pictures to be detected;
s43, inputting the spliced pictures to be detected into an example segmentation model for identification, and outputting detection frames and example segmentation results of all the pictures to be detected;
and S44, splitting the detection frames and the example segmentation results of all the pictures to be detected to obtain the detection frames and the example segmentation results corresponding to each picture to be detected.
Based on the processing process (splicing the pictures to be detected before prediction and splitting the prediction result after prediction), a plurality of pictures can be predicted at the same time, so that the parallelization capability of model prediction is greatly improved, the efficient utilization of computing resources is improved, and the method is more suitable for the application scene related to video monitoring.
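A minimal Python/PyTorch sketch of steps S41 to S44 follows; stacking the pictures along the batch dimension is one possible reading of "splicing" and is an assumption, as is the (boxes, masks) output signature of the model.

import torch

def predict_many(model, images):
    # S42: splice the pictures to be detected into a single input.
    batch = torch.stack(images, dim=0)
    with torch.no_grad():
        boxes, masks = model(batch)        # S43: one parallel forward pass for all pictures
    # S44: split the joint outputs so that each picture gets its own detection frames
    # and instance segmentation results.
    return [(boxes[i], masks[i]) for i in range(len(images))]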
Example 3
Based on the foregoing embodiment 1, an embodiment of the present invention further provides a training apparatus for an example segmentation model, as shown in fig. 4, the apparatus includes:
a pruning module 41, configured to prune the pre-constructed deep learning model;
an obtaining module 42, configured to obtain a training set; the training set is a set of RGBD images with target objects in a scene collected by different depth cameras, and the RGBD images comprise a depth map and a color map;
a preprocessing module 43, configured to label the training set;
and the training module 44 is configured to train the pruned deep learning model by using the labeled training set to obtain an instance segmentation model.
Further, the preprocessing module 43 is further configured to preprocess the training set before labeling, and specifically includes:
performing three-dimensional reconstruction according to the acquired depth map in the training set to obtain a first modeling result, and simultaneously performing three-dimensional reconstruction according to the acquired depth map in the RGBD image which is acquired by different depth cameras and does not have any target object in the scene corresponding to the training set to obtain a second modeling result;
according to the second modeling result, performing background removal processing on the first modeling result to obtain a foreground image containing the target object;
determining a truncation distance corresponding to each depth camera according to a foreground image containing a target object, wherein the truncation distance is a moving distance from the target object to the depth camera;
carrying out truncation processing on the depth map in the training set through the truncation distance corresponding to each depth camera, and carrying out normalization processing on the depth map after the truncation processing;
and carrying out normalization processing on the color map in the training set.
Further, the preprocessing module 43 is specifically configured to:
calculating the integrity of each target object in the RGBD image with the target object;
and when the integrity of all the target objects is greater than a first preset value, labeling the target objects in the RGBD image with the target objects to obtain a target object detection frame, and generating corresponding labels.
Further, the training module 44 specifically includes:
the feature extraction and fusion module 441 is configured to perform feature extraction on the labeled training set, and fuse features obtained after extraction to obtain a feature region;
a prediction module 442, configured to perform segmentation processing on the feature region to obtain a segmentation result about the feature region, and perform regression and classification processing on the feature region to obtain a detection frame about the feature region, a classification result corresponding to the detection frame, and an instance score associated with the segmentation result;
a processing module 443, configured to multiply the segmentation result and the corresponding instance score to obtain an instance segmentation result;
the processing module 443 is further configured to perform truncation processing on the instance segmentation result by using the corresponding target object detection box;
a calculating module 444, configured to calculate an error between the truncated instance segmentation result and the corresponding tag, and calculate an error between the detection frame and the corresponding target object detection frame;
the calculating module 444 is further configured to calculate a total loss value according to an error between the truncated instance segmentation result and the corresponding tag, and an error between the detection frame and the corresponding target object detection frame;
the determining module 445 is configured to determine the total loss value, stop training the deep learning model when the total loss value is smaller than a second preset value, and determine the corresponding deep learning model as the example segmentation model when the total loss value is smaller than the second preset value.
Further, the prediction module 442 is specifically configured to:
predicting the central point of the characteristic region, and calculating the width and the height of the characteristic region according to the central point to generate a detection frame;
the prediction module 442 is further configured to:
and when the number of the generated detection frames is more than one, performing maximum pooling on each generated detection frame and storing the detection frames after the maximum pooling meeting the first preset condition.
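As a sketch of how the prediction module's anchor-free decoding and maximum-pooling de-duplication might look in Python/PyTorch; the CenterNet-style heatmap and width/height heads, the 3x3 pooling window and the score threshold used as the "first preset condition" are all assumptions consistent with, but not mandated by, the text.

import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, score_thresh=0.3):
    # heatmap: (1, 1, H, W) predicted center-point scores; wh: (1, 2, H, W) width/height.
    # Maximum pooling keeps only local maxima, de-duplicating overlapping detections.
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    keep = (pooled == heatmap) & (heatmap > score_thresh)
    ys, xs = torch.nonzero(keep[0, 0], as_tuple=True)
    boxes = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        # Regress width and height at each retained center point to form the detection frame.
        w, h = wh[0, 0, y, x].item(), wh[0, 1, y, x].item()
        boxes.append((x - w / 2, y - h / 2, x + w / 2, y + h / 2))
    return boxes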
Further, the training module 44 is further configured to: and training the deep learning model according to the learning rate corresponding to the current total loss value.
Further, the pruning module 41 is specifically configured to:
acquiring an influence factor corresponding to a network layer to be pruned in the deep learning model, wherein the influence factor is a scaling factor obtained by carrying out normalization calculation on the network layer to be pruned;
and when the influence factor is smaller than a third preset value, pruning the network layer corresponding to the influence factor.
It should be noted that the training apparatus for the instance segmentation model provided in this embodiment is only illustrated with the above division of functional modules; in practical applications, the functions may be distributed to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the training apparatus for the instance segmentation model of this embodiment and the training method embodiment of the instance segmentation model in embodiment 1 belong to the same concept; their specific implementation processes and beneficial effects are described in detail in the training method embodiment and are not repeated herein.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only show some embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for training an instance segmentation model, the method comprising:
pruning a pre-constructed deep learning model;
acquiring a training set and marking the training set, wherein the training set is a set of RGBD images with target objects in a scene acquired by different depth cameras, and the RGBD images comprise a depth image and a color image;
and training the pruned deep learning model by using the labeled training set to obtain an instance segmentation model.
2. The method of claim 1, further comprising: preprocessing the training set before labeling, specifically comprising:
performing three-dimensional reconstruction according to the acquired depth map in the training set to obtain a first modeling result, and simultaneously performing three-dimensional reconstruction according to the acquired depth map in the RGBD image which is acquired by different depth cameras and does not have any target object in the scene corresponding to the training set to obtain a second modeling result;
according to the second modeling result, performing background removal processing on the first modeling result to obtain a foreground image containing a target object;
determining a truncation distance corresponding to each depth camera according to the foreground image containing the target object, wherein the truncation distance is a moving distance from the target object to the depth camera;
carrying out truncation processing on the depth map in the training set through the truncation distance corresponding to each depth camera, and carrying out normalization processing on the depth map after the truncation processing;
and carrying out normalization processing on the color image in the training set.
3. The method of claim 1, wherein labeling the training set specifically comprises:
calculating the integrity of each target object in the RGBD image with the target object;
and when the integrity of all the target objects is greater than a first preset value, labeling the target objects in the RGBD image with the target objects to obtain a target object detection frame, and generating corresponding labels.
4. The method of claim 3, wherein training the pruned deep learning model with the labeled training set to obtain the instance segmentation model specifically comprises:
extracting features of the marked training set, and fusing the extracted features to obtain a feature region;
performing segmentation processing on the feature region to obtain a segmentation result of the feature region, and performing regression and classification processing on the feature region to obtain a detection frame of the feature region, a classification result corresponding to the detection frame, and an instance score associated with the segmentation result;
multiplying the segmentation result and the corresponding example score to obtain an example segmentation result;
utilizing the corresponding target object detection frame to perform truncation processing on the example segmentation result, calculating an error between the truncated example segmentation result and the corresponding label, and meanwhile calculating an error between the detection frame and the corresponding target object detection frame;
calculating a total loss value according to an error between the example segmentation result after the truncation processing and the corresponding label and an error between the detection frame and the corresponding target object detection frame;
and judging the total loss value, stopping training the deep learning model when the total loss value is smaller than a second preset value, and determining the corresponding deep learning model as the example segmentation model when the total loss value is smaller than the second preset value.
5. The method according to claim 4, wherein performing regression processing on the feature region specifically comprises:
predicting the central point of the feature region, and calculating the width and the height of the feature region according to the central point to generate the detection frame;
the method further comprises the following steps:
and when the number of the generated detection frames is more than one, performing maximum pooling on each generated detection frame and storing the detection frames after the maximum pooling meeting the first preset condition.
6. The method of claim 4, wherein training the pruned deep learning model using the labeled training set further comprises:
and training the deep learning model according to the learning rate corresponding to the current total loss value.
7. The method according to any one of claims 1 to 6, wherein pruning the pre-constructed deep learning model specifically comprises:
obtaining an influence factor corresponding to a network layer to be pruned in the deep learning model, wherein the influence factor is a scaling factor obtained by performing normalization calculation on the network layer to be pruned;
and when the influence factor is smaller than a third preset value, pruning the network layer corresponding to the influence factor.
8. An instance segmentation method, the method comprising:
acquiring a picture to be detected;
inputting the picture to be detected into a pre-trained example segmentation model for identification, and outputting a detection frame and an example segmentation result of the picture to be detected;
the pre-trained example segmentation model is obtained by training based on the method of any one of claims 1-7.
9. The method according to claim 8, wherein before inputting the picture to be detected into a pre-trained instance segmentation model for recognition, the method further comprises:
acquiring the number of the pictures to be detected;
when the number of the pictures to be detected is more than one, splicing the pictures to be detected;
the inputting the picture to be detected into a pre-trained example segmentation model for identification, and outputting the detection frame and the example segmentation result of the picture to be detected specifically comprises:
inputting the spliced pictures to be detected into the example segmentation model for identification, and outputting detection frames and example segmentation results of all the pictures to be detected;
the method further comprises the following steps:
and splitting the detection frames and the example segmentation results of all the pictures to be detected to obtain the detection frames and the example segmentation results corresponding to each picture to be detected.
10. An apparatus for training an instance segmentation model, the apparatus comprising:
the pruning module is used for pruning the pre-constructed deep learning model;
the acquisition module is used for acquiring a training set; the training set is a set of RGBD images with target objects in a scene collected by different depth cameras, and the RGBD images comprise a depth map and a color map;
the preprocessing module is used for labeling the training set;
and the training module is used for training the pruned deep learning model by utilizing the labeled training set to obtain an example segmentation model.
CN202010454014.6A 2020-05-26 2020-05-26 Instance segmentation model training method and device and instance segmentation method Pending CN111798456A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010454014.6A CN111798456A (en) 2020-05-26 2020-05-26 Instance segmentation model training method and device and instance segmentation method
PCT/CN2021/095363 WO2021238826A1 (en) 2020-05-26 2021-05-24 Method and apparatus for training instance segmentation model, and instance segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010454014.6A CN111798456A (en) 2020-05-26 2020-05-26 Instance segmentation model training method and device and instance segmentation method

Publications (1)

Publication Number Publication Date
CN111798456A true CN111798456A (en) 2020-10-20

Family

ID=72806274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010454014.6A Pending CN111798456A (en) 2020-05-26 2020-05-26 Instance segmentation model training method and device and instance segmentation method

Country Status (2)

Country Link
CN (1) CN111798456A (en)
WO (1) WO2021238826A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112330709A (en) * 2020-10-29 2021-02-05 奥比中光科技集团股份有限公司 Foreground image extraction method and device, readable storage medium and terminal equipment
CN113139983A (en) * 2021-05-17 2021-07-20 北京华捷艾米科技有限公司 Human image segmentation method and device based on RGBD
WO2021238826A1 (en) * 2020-05-26 2021-12-02 苏宁易购集团股份有限公司 Method and apparatus for training instance segmentation model, and instance segmentation method
CN113781500A (en) * 2021-09-10 2021-12-10 中国科学院自动化研究所 Method and device for segmenting cabin segment image instance, electronic equipment and storage medium
CN116721342A (en) * 2023-06-05 2023-09-08 淮阴工学院 Hybrid rice quality recognition device based on deep learning

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155427A (en) * 2021-12-17 2022-03-08 成都交大光芒科技股份有限公司 Visual monitoring self-adaptive on-off state identification method and system for contact network switch
CN114550117A (en) * 2022-02-21 2022-05-27 京东鲲鹏(江苏)科技有限公司 Image detection method and device
CN114612825B (en) * 2022-03-09 2024-03-19 云南大学 Target detection method based on edge equipment
CN115052154B (en) * 2022-05-30 2023-04-14 北京百度网讯科技有限公司 Model training and video coding method, device, equipment and storage medium
CN115100579B (en) * 2022-08-09 2024-03-01 郑州大学 Intelligent video damage segmentation system in pipeline based on optimized deep learning
CN115760748B (en) * 2022-11-14 2023-06-16 江苏科技大学 Ice circumferential crack size measurement method based on deep learning
CN115993365B (en) * 2023-03-23 2023-06-13 山东省科学院激光研究所 Belt defect detection method and system based on deep learning
CN116993660A (en) * 2023-05-24 2023-11-03 淮阴工学院 PCB defect detection method based on improved EfficientDet
CN116433747B (en) * 2023-06-13 2023-08-18 福建帝视科技集团有限公司 Construction method and detection device for detection model of wall thickness of bamboo tube

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403430A (en) * 2017-06-15 2017-11-28 中山大学 A kind of RGBD image, semantics dividing method
US20200065976A1 (en) * 2018-08-23 2020-02-27 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598340A (en) * 2018-11-15 2019-04-09 北京知道创宇信息技术有限公司 Method of cutting out, device and the storage medium of convolutional neural networks
CN109949316B (en) * 2019-03-01 2020-10-27 东南大学 Power grid equipment image weak supervision example segmentation method based on RGB-T fusion
CN110378345B (en) * 2019-06-04 2022-10-04 广东工业大学 Dynamic scene SLAM method based on YOLACT instance segmentation model
CN110782467B (en) * 2019-10-24 2023-05-30 新疆农业大学 Horse body ruler measuring method based on deep learning and image processing
CN111798456A (en) * 2020-05-26 2020-10-20 苏宁云计算有限公司 Instance segmentation model training method and device and instance segmentation method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403430A (en) * 2017-06-15 2017-11-28 中山大学 A kind of RGBD image, semantics dividing method
US20200065976A1 (en) * 2018-08-23 2020-02-27 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DANIEL BOLYA et al.: "YOLACT: Real-time Instance Segmentation", https://arxiv.org/pdf/1904.02689v2.pdf *
ZHUANG LIU et al.: "Learning Efficient Convolutional Networks through Network Slimming", https://arxiv.org/pdf/1708.06519v1.pdf *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021238826A1 (en) * 2020-05-26 2021-12-02 苏宁易购集团股份有限公司 Method and apparatus for training instance segmentation model, and instance segmentation method
CN112330709A (en) * 2020-10-29 2021-02-05 奥比中光科技集团股份有限公司 Foreground image extraction method and device, readable storage medium and terminal equipment
CN113139983A (en) * 2021-05-17 2021-07-20 北京华捷艾米科技有限公司 Human image segmentation method and device based on RGBD
CN113781500A (en) * 2021-09-10 2021-12-10 中国科学院自动化研究所 Method and device for segmenting cabin segment image instance, electronic equipment and storage medium
CN113781500B (en) * 2021-09-10 2024-04-05 中国科学院自动化研究所 Method, device, electronic equipment and storage medium for segmenting cabin image instance
CN116721342A (en) * 2023-06-05 2023-09-08 淮阴工学院 Hybrid rice quality recognition device based on deep learning
CN116721342B (en) * 2023-06-05 2024-06-11 淮阴工学院 Hybrid rice quality recognition device based on deep learning

Also Published As

Publication number Publication date
WO2021238826A1 (en) 2021-12-02

Similar Documents

Publication Publication Date Title
CN111798456A (en) Instance segmentation model training method and device and instance segmentation method
CN109919031B (en) Human behavior recognition method based on deep neural network
CN111310731B (en) Video recommendation method, device, equipment and storage medium based on artificial intelligence
CN106845621B (en) Dense population number method of counting and system based on depth convolutional neural networks
CN112380921A (en) Road detection method based on Internet of vehicles
CN112183240B (en) Double-current convolution behavior identification method based on 3D time stream and parallel space stream
CN113297956B (en) Gesture recognition method and system based on vision
CN112818821B (en) Human face acquisition source detection method and device based on visible light and infrared light
CN113269224A (en) Scene image classification method, system and storage medium
CN111415338A (en) Method and system for constructing target detection model
CN113221770A (en) Cross-domain pedestrian re-identification method and system based on multi-feature hybrid learning
CN113516146A (en) Data classification method, computer and readable storage medium
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN115424017A (en) Building internal and external contour segmentation method, device and storage medium
CN112907138B (en) Power grid scene early warning classification method and system from local to whole perception
CN114519863A (en) Human body weight recognition method, human body weight recognition apparatus, computer device, and medium
CN113673308A (en) Object identification method, device and electronic system
CN115131826B (en) Article detection and identification method, and network model training method and device
CN116580232A (en) Automatic image labeling method and system and electronic equipment
CN115937492A (en) Transformer equipment infrared image identification method based on feature identification
CN115187906A (en) Pedestrian detection and re-identification method, device and system
CN115311680A (en) Human body image quality detection method and device, electronic equipment and storage medium
CN114387496A (en) Target detection method and electronic equipment
CN114329050A (en) Visual media data deduplication processing method, device, equipment and storage medium
CN112541469A (en) Crowd counting method and system based on self-adaptive classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20201020