WO2024113782A1 - Image instance segmentation method and system, device and nonvolatile readable storage medium - Google Patents

Image instance segmentation method and system, device and nonvolatile readable storage medium

Info

Publication number
WO2024113782A1
WO2024113782A1 (PCT/CN2023/101908)
Authority
WO
WIPO (PCT)
Prior art keywords
network
segmentation
decoder
segmentation network
network architectures
Prior art date
Application number
PCT/CN2023/101908
Other languages
French (fr)
Chinese (zh)
Inventor
周镇镇
张潇澜
Original Assignee
苏州元脑智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州元脑智能科技有限公司 filed Critical 苏州元脑智能科技有限公司
Publication of WO2024113782A1 publication Critical patent/WO2024113782A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations

Definitions

  • the present application relates to the field of image processing, and in particular to an image instance segmentation method, system, device and non-volatile readable storage medium.
  • Image semantic segmentation technology has become an important research direction in the field of computer vision and is widely used in practical application scenarios such as mobile robots, autonomous driving, drones, and medical diagnosis.
  • image segmentation technology is mainly divided into two research directions: semantic segmentation and instance segmentation.
  • Semantic segmentation refers to the division of each pixel in the image into a corresponding category, that is, pixel-level classification, so it is also called dense classification; instance segmentation is to distinguish different instances of the same category based on semantic segmentation.
  • the image segmentation neural network models designed by experts have achieved a high level of accuracy, such as Mask RCNN (Region-based Convolutional Neural Networks), DeepLab, U-net series algorithms and other neural networks.
  • the DeepLab series is one of the more influential branches in the field of semantic segmentation, and DeepLabV3+ is currently among its better variants. Researchers have therefore begun to explore the automatic design of neural networks through Neural Architecture Search (NAS).
  • relevant researchers mainly focus on neural network architecture search algorithms, automatically establish neural networks, and quickly apply them in practice.
  • existing neural architecture search algorithms use reinforcement-learning and evolutionary search methods to search for architectures, evaluate the sampled network architectures through performance evaluation methods, and then obtain the best model structure by optimizing the evaluation metrics.
  • the former is realized mainly by maximizing the reward obtained while the Neural Architecture Search framework interacts with its environment.
  • the main representative algorithms include NASNet, MetaQNN (Quantum Neural Network), BlockQNN, etc.; the latter is mainly a general evolutionary algorithm to simulate the laws of biological inheritance and evolution to realize the NAS process.
  • the main representative algorithms include NEAT (Evolving Neural Networks through Augmenting Topologies), DeepNEAT, CoDeepNEAT, etc.
  • Neural network architecture search can automatically design customized neural networks for specific tasks, which has far-reaching implications.
  • however, for tasks such as image segmentation that require dense pixel-by-pixel classification, practical applications are constrained by limited computing resources and time.
  • the traditional neural network models used in existing image instance segmentation methods have large parameter counts, which makes them impossible to use directly on edge devices.
  • for example, in autonomous driving scenarios, current vehicle-side chips cannot carry existing high-precision neural networks, while directly using neural networks with fewer parameters leads to inaccurate recognition of dense images and difficulty in delineating the edges of different categories.
  • an embodiment of the present application proposes an image instance segmentation method, comprising the following steps:
  • using the controller network to search for multiple decoder structures and using each decoder structure and a fixed encoder to form multiple segmentation network architectures;
  • obtaining a trained teacher network includes:
  • the encoder of the teacher network is constructed using the backbone network and the atrous spatial pyramid pooling (ASPP) module, where the ASPP module includes four atrous convolution blocks with different dilation rates and a global average pooling block.
  • the decoder of the teacher network is constructed using an upsampling module, a 1×1 convolution block, and a 3×3 convolution block, where the decoder of the teacher network takes the low-level feature maps output by the middle layer of the backbone network and the output of the ASPP module as input.
  • it also includes:
  • the decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain the image instance segmentation result;
  • the parameters of the encoder and decoder of the teacher network are adjusted according to the image instance segmentation results to train the teacher network.
  • it also includes:
  • the backbone network and the ASPP module are used to process the images in the training set, including:
  • the backbone network is used to extract multi-layer semantic features of the images in the training set, and the ASPP module is used to sample the multi-layer semantic features in parallel with dilated convolutions at different sampling rates to obtain five sets of feature maps, which are then concatenated and input into the decoder of the teacher network.
  • the decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain an image segmentation result, including:
  • the upsampling module is used to interpolate and upsample the feature map from the ASPP module, and the 1×1 convolution block is used to reduce the channel dimension of the low-level feature map output from the intermediate layer of the backbone network;
  • the low-level feature map of channel dimension reduction and the feature map obtained by linear interpolation upsampling are concatenated and sent to a 3×3 convolution block for processing.
  • the upsampling module is then used for linear interpolation upsampling again to obtain the image instance segmentation result.
  • a 1×1 convolution block is used to perform channel dimension reduction on a low-level feature map output from an intermediate layer of the backbone network, including:
  • the controller network comprises a two-layer recursive LSTM neural network with 100 hidden units, and all hidden units are randomly initialized from a uniform distribution.
  • a controller network is used to search for multiple decoder structures and each decoder structure and a fixed encoder are used to form multiple segmentation network architectures, including:
  • the controller network is used to search the internal structure of the fifth decoder block and the sixth decoder block in a preset search space, and the connection mode between the first decoder block, the second decoder block, the third decoder block, the fourth decoder block, the fifth decoder block and the sixth decoder block.
  • the search space includes 1×1 convolution, 3×3 convolution, 3×3 separable convolution, 5×5 separable convolution, global average pooling, upsampling, 1×1 convolution module, 3×3 convolution with dilation rate 3, 3×3 convolution with dilation rate 12, separable 3×3 convolution with dilation rate 3, separable 5×5 convolution with dilation rate 6, skip connections, and zero operations that effectively invalidate the paths.
  • LKD represents the overall loss of the knowledge distillation network
  • LStudent represents the loss of the segmentation network architecture
  • LTeacher represents the teacher network loss
  • coff represents an adjustable parameter in the actual network training process.
  • coff takes a value of 0.3.
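  • For reference, the distillation objective that these symbols describe is given later in the description; written out, it is:

        L_{KD} = L_{Student} + \mathrm{coff} \cdot L_{Teacher}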
  • it also includes:
  • n is the instance of different categories
  • pixels is the pixel point
  • ytrue is the actual value of the corresponding category
  • ypred is the predicted value of the corresponding type.
  • selecting a plurality of segmentation network architectures from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determining an optimal segmentation network architecture from the plurality of segmentation network architectures includes:
  • the geometric mean is calculated using the average IoU, frequency-weighted IoU, and average pixel precision;
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm, including:
  • segmentation network architectures are selected from multiple segmentation network architectures to perform full training in the first stage, full training in the second stage, and full training in the third stage respectively.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm to perform full training in the first stage, including:
  • the several segmentation network architectures are trained for 50 epochs using the enhanced dataset, with an auxiliary unit parameter of 0.2.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the second phase, including:
  • the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.2.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the third phase, including:
  • the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.15 and the BN layer is frozen.
  • an embodiment of the present application further provides an image instance segmentation system, including:
  • An acquisition module configured to acquire a trained teacher network and a controller network
  • a search module configured to search for a plurality of decoder structures using the controller network and construct a plurality of segmentation network architectures using each decoder structure and a fixed encoder;
  • An evaluation module is configured to use the trained teacher network and each segmentation network architecture to simultaneously perform image instance segmentation forward inference, and use the loss function of the trained teacher network to guide and correct the loss function of each segmentation network architecture after each forward inference, and select a plurality of segmentation network architectures from the plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determine the optimal segmentation network architecture from the plurality of segmentation network architectures;
  • the image instance segmentation module is configured to perform image instance segmentation on the image to be segmented using an optimal segmentation network architecture.
  • an embodiment of the present application further provides a computer device, including:
  • the memory stores a computer program that can be run on the processor, and when executing the program, the processor performs the steps of any one of the above image instance segmentation methods.
  • an embodiment of the present application further provides a non-volatile readable storage medium, which stores a computer program.
  • when the computer program is executed by a processor, the steps of any one of the above image instance segmentation methods are performed.
  • the embodiment of the present application has one of the following beneficial technical effects:
  • the solution proposed in the embodiment of the present application guides and corrects the training process of the searched student network (segmentation network architecture) by using the knowledge distillation method, so that a lightweight semantic segmentation model can be quickly obtained with a small computing cost, the problem of excessive parameters in the existing image segmentation model can be improved, and more reliable image segmentation prediction results can be achieved with a faster reasoning speed. It has better adaptability in autonomous driving scenarios.
  • FIG1 is a schematic diagram of a flow chart of an image instance segmentation method provided in an embodiment of the present application.
  • FIG2 is a schematic diagram of a framework of a teacher network provided in an embodiment of the present application.
  • FIG3 is an image segmentation algorithm framework based on knowledge distillation neural network architecture search provided by an embodiment of the present application.
  • FIG4 is a schematic diagram of a controller network provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of a unit architecture provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of an image instance segmentation system provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of the structure of a non-volatile readable storage medium provided in an embodiment of the present application.
  • an embodiment of the present application provides an image instance segmentation method, as shown in FIG1 , which may include the steps of:
  • the solution proposed in the embodiment of the present application uses the knowledge distillation method to guide and correct the search student network (segmentation network architecture) training process, so that a lightweight semantic segmentation model can be quickly obtained with low computational overhead, the problem of excessive parameters in the existing image segmentation model can be improved, and more reliable image segmentation prediction results can be achieved with faster reasoning speed. It has better adaptability in autonomous driving scenarios.
  • obtaining a trained teacher network includes:
  • the encoder of the teacher network is constructed using the backbone network and the ASPP (Atrous Spatial Pyramid Pooling) module, where the ASPP module includes four atrous convolution blocks with different dilation rates and a global average pooling block;
  • the decoder of the teacher network is constructed using an upsampling module, a 1×1 convolution block, and a 3×3 convolution block, where the decoder of the teacher network takes the low-level feature maps output by the middle layer of the backbone network and the output of the ASPP module as input.
  • the DeepLabV3+ network uses ResNet101 as the backbone to extract the multi-layer semantic features in the original image.
  • the feature information is sampled in parallel by the ASPP module with different sampling rates of dilated convolution to obtain different proportions of image context information.
  • the ASPP module accepts the first part of the output of the backbone as input, and uses four dilated convolution blocks with different dilation rates (including convolution, BN, activation layer) and a global average pooling block (including pooling, convolution, BN (Batch Normalization), activation layer) to obtain a total of five groups of feature maps, which are concatenated and passed through a 1×1 convolution block (including convolution, BN, activation, dropout layer) and finally sent to the Decoder module.
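  • As a minimal, non-authoritative sketch of the encoder head just described, assuming a PyTorch-style implementation (the dilation rates, channel counts and dropout rate below are assumptions; the text only states that four atrous convolution blocks with different dilation rates, a global average pooling block and a 1×1 convolution block with dropout are used):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class ASPP(nn.Module):
            """Sketch of the ASPP head: four dilated-convolution branches plus a
            global-average-pooling branch; the five outputs are concatenated and
            fused by a 1x1 convolution block with dropout."""
            def __init__(self, in_ch, out_ch=256, rates=(1, 6, 12, 18)):  # rates assumed
                super().__init__()
                self.branches = nn.ModuleList([
                    nn.Sequential(
                        nn.Conv2d(in_ch, out_ch, 3 if r > 1 else 1,
                                  padding=r if r > 1 else 0, dilation=r, bias=False),
                        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
                    for r in rates])
                self.image_pool = nn.Sequential(      # pooling, conv, BN, activation
                    nn.AdaptiveAvgPool2d(1),
                    nn.Conv2d(in_ch, out_ch, 1, bias=False),
                    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
                self.project = nn.Sequential(         # conv, BN, activation, dropout
                    nn.Conv2d(5 * out_ch, out_ch, 1, bias=False),
                    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True), nn.Dropout(0.5))

            def forward(self, x):
                feats = [branch(x) for branch in self.branches]
                pooled = F.interpolate(self.image_pool(x), size=x.shape[-2:],
                                       mode='bilinear', align_corners=False)
                return self.project(torch.cat(feats + [pooled], dim=1))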
  • the Decoder module receives the low-level feature map from the middle layer of the backbone and the output from the ASPP module as input.
  • it also includes:
  • the decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain the image instance segmentation result;
  • the parameters of the encoder and decoder of the teacher network are adjusted according to the image instance segmentation results to train the teacher network.
  • it also includes:
  • the enhanced dataset can use a variety of data enhancement methods, for example, the image data enhancement method protected by the patent publication number CN114037637A can be used for data enhancement.
  • the original image is segmented to obtain a segmented image and the target categories of the segmented image.
  • the target categories are screened to obtain the category to be enhanced; the original image is binarized according to the category to be enhanced to obtain a binary image, and according to the connected domains of the binary image, an instance image that matches the category to be enhanced is obtained from the original image; the instance image is perspective-transformed to obtain a first instance image, and the first instance image is scaled to obtain a second instance image; the vanishing point position is obtained from the original image, the pasting position of the second instance image is determined according to the vanishing point position and the geometric size of the second instance image, and the second instance image is pasted onto the original image according to the pasting position to obtain an enhanced image of the original image.
  • the backbone network and the ASPP module are used to process the images in the training set, including:
  • the backbone network is used to extract multi-layer semantic features of the images in the training set, and the ASPP module is used to sample the multi-layer semantic features in parallel with dilated convolutions at different sampling rates to obtain five sets of feature maps, which are then concatenated and input into the decoder of the teacher network.
  • the decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain an image segmentation result, including:
  • the upsampling module is used to interpolate and upsample the feature map from the ASPP module, and the 1×1 convolution block is used to reduce the channel dimension of the low-level feature map output from the intermediate layer of the backbone network;
  • the low-level feature map of channel dimension reduction and the feature map obtained by linear interpolation upsampling are concatenated and sent to a 3×3 convolution block for processing.
  • the upsampling module is then used for linear interpolation upsampling again to obtain the image instance segmentation result.
  • a 1×1 convolution block is used to perform channel dimension reduction on a low-level feature map output from an intermediate layer of the backbone network, including:
  • the decoder module can use a 1×1 convolution to reduce the channel dimension of the low-level feature map from 256 to 48 (the reason for reducing to 48 is that too many channels would mask the importance of the feature map output by the ASPP module, and experiments have verified that 48 works best).
  • the feature map from the ASPP module is interpolated and upsampled (upsampled by 4) to obtain a feature map of the same size as the low-level feature map.
  • the channel-reduced low-level feature map and the feature map obtained by linear interpolation upsampling are concatenated (concat) and sent to a set of 3×3 convolution blocks for processing; linear interpolation upsampling is then performed again to obtain a predicted map with the same resolution as the original image.
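  • A minimal PyTorch-style sketch of this decoder path (the 256-to-48 channel reduction and the two interpolation steps follow the text; the class count, 3×3 block depth and other layer details are assumptions):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class TeacherDecoder(nn.Module):
            """Sketch of the decoder: a 1x1 conv reduces the low-level map from 256
            to 48 channels, the ASPP output is upsampled by 4 and concatenated with
            it, a 3x3 convolution block refines the result, and a final upsampling
            restores the original resolution."""
            def __init__(self, low_ch=256, aspp_ch=256, num_classes=21):  # num_classes assumed
                super().__init__()
                self.reduce = nn.Sequential(
                    nn.Conv2d(low_ch, 48, 1, bias=False),
                    nn.BatchNorm2d(48), nn.ReLU(inplace=True))
                self.fuse = nn.Sequential(
                    nn.Conv2d(aspp_ch + 48, 256, 3, padding=1, bias=False),
                    nn.BatchNorm2d(256), nn.ReLU(inplace=True),
                    nn.Conv2d(256, num_classes, 1))

            def forward(self, low_level, aspp_out, out_size):
                low = self.reduce(low_level)                       # 256 -> 48 channels
                x = F.interpolate(aspp_out, size=low.shape[-2:],   # "upsample by 4"
                                  mode='bilinear', align_corners=False)
                x = self.fuse(torch.cat([x, low], dim=1))          # concat + 3x3 convs
                return F.interpolate(x, size=out_size,             # back to input size
                                     mode='bilinear', align_corners=False)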
  • the controller network comprises a two-layer recursive LSTM neural network with 100 hidden units, and all hidden units are randomly initialized from a uniform distribution.
  • the controller has a two-layer recursive LSTM neural network with 100 hidden units, and all units are randomly initialized from a uniform distribution.
  • the embodiment of the present application uses a PPO (proximal policy optimization) strategy for optimization, with a learning rate of 0.0001.
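  • A minimal sketch of such a controller backbone, assuming PyTorch (only the two LSTM layers with 100 hidden units, the uniform initialization and the 0.0001 learning rate come from the text; the input size, initialization bounds and choice of optimizer are assumptions):

        import torch.nn as nn
        import torch.optim as optim

        # Two-layer LSTM controller with 100 hidden units, uniformly initialized.
        controller_rnn = nn.LSTM(input_size=100, hidden_size=100, num_layers=2)
        for p in controller_rnn.parameters():
            nn.init.uniform_(p, -0.1, 0.1)   # initialization bounds assumed

        # Optimizer used for the PPO update of the controller (learning rate from the text).
        controller_opt = optim.Adam(controller_rnn.parameters(), lr=1e-4)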
  • a controller network is used to search for multiple decoder structures and each decoder structure and a fixed encoder are used to form multiple segmentation network architectures, including:
  • the controller network is used to search the internal structure of the fifth decoder block and the sixth decoder block in a preset search space, and the connection mode between the first decoder block, the second decoder block, the third decoder block, the fourth decoder block, the fifth decoder block and the sixth decoder block.
  • the student network is an image segmentation network based on neural network architecture search: the target network architecture is obtained through the sampling process of neural architecture search, the structural knowledge and instance segmentation information of the teacher network are distilled, and the network architecture adopts an encoder-decoder structure.
  • because the image segmentation model requires multiple iterations to converge and computing resources are limited, it is currently difficult to perform a complete segmentation network architecture search from scratch, so the embodiment of the present application focuses the architecture search on the decoder part.
  • the entire network structure uses the weights of a pre-trained classification network to initialize the encoder, which consists of multiple downsampling operations that reduce the spatial dimension of the input; the controller network, on the other hand, generates the decoder structure, and the decoder part can access multiple encoder outputs with different spatial and channel dimensions. In order to keep the sampled architectures compact and of roughly the same size, each encoder output is passed through a 1×1 convolution with the same number of output channels.
  • the student network portion of Figure 3 shows a search layout containing 2 decoder blocks (the fifth decoder block block4 and the sixth decoder block block5) and 2 branch units.
  • Block4 and block5 in the figure obtain two sets of sampling pairs through the controller. The results of the sampling pairs are input into the two modules after element-by-element summation.
  • the two modules obtain their internal unit structures through the controller; their outputs are joined by a concat operation, passed through a 1×1 convolution (conv1×1), and input into the main classifier to calculate the loss, finally forming the prediction of the image segmentation information.
  • the auxiliary cell in the figure has the same structure as the other units and can be adjusted to directly output the ground truth or to imitate the teacher network's prediction (or a combination of the two).
  • Figure 4 shows the layout of the controller network for neural network architecture search, which can sequentially sample the decoder connection mode, including different modules, different operations, and different branch position indexes. Different modules reuse the sampled unit architecture, and apply the same unit with different weights to each module in the sampling pair, and finally add the outputs of the two units. The resulting layer will be added to the sampling pool (the next unit can sample the previous unit as input).
  • the block4 sampling range includes all preceding modules <block0 (first decoder block), block1 (second decoder block), block2 (third decoder block), block3 (fourth decoder block)>;
  • the block5 sampling range includes all preceding modules <block0, block1, block2, block3, block4 (fifth decoder block)>.
  • each unit accepts an input, and the controller first samples operation 1; then, two position indices (index) are sampled, namely input index0 and the output result index1 of sampled operation 1; finally, two corresponding operations are sampled.
  • the output of each operation is added, and in the next step, all three layers (from each operation and its summed result) and the initial two layers are sampled.
  • the number of sampling times of positions within the unit is controlled by another hyperparameter to keep the number of all possible architectures to a feasible number. All existing non-sampled summed outputs within the unit are summed and used as the unit output. In this case, sum is used because the concatenation layer operation may cause the vector size of the output of different architectures to change.
  • in the figure, 0-9 represent the sampling positions;
  • sampling operations 1-7 represent the operations performed at the corresponding positions.
  • the search space includes 1×1 convolution, 3×3 convolution, 3×3 separable convolution, 5×5 separable convolution, global average pooling, upsampling, 1×1 convolution module, 3×3 convolution with dilation rate 3, 3×3 convolution with dilation rate 12, separable 3×3 convolution with dilation rate 3, separable 5×5 convolution with dilation rate 6, skip connections, and zero operations that effectively invalidate the paths.
  • the number of sampling times of the layer pair is controlled by a hyperparameter, which is set to 3 in the experiment.
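  • Purely as an illustrative sketch of this sampling procedure (the operation names follow the stated search space, with the grouping of the pooling/upsampling/1×1 item assumed, and the three layer pairs follow the hyperparameter above; the random choices below merely stand in for the LSTM controller's learned sampling policy):

        import random

        # Candidate operations, following the search space listed in the text.
        SEARCH_SPACE = [
            "conv1x1", "conv3x3", "sep_conv3x3", "sep_conv5x5",
            "gap_upsample_conv1x1",
            "conv3x3_dil3", "conv3x3_dil12",
            "sep_conv3x3_dil3", "sep_conv5x5_dil6",
            "skip", "zero",
        ]

        def sample_decoder_cell(num_layer_pairs=3,
                                inputs=("block0", "block1", "block2", "block3")):
            """Stand-in for the LSTM controller: for each layer pair, sample two
            positions from the current sampling pool and two operations; the summed
            result joins the pool so later pairs can reuse it."""
            pool = list(inputs)
            cell = []
            for step in range(num_layer_pairs):
                idx0 = random.randrange(len(pool))
                idx1 = random.randrange(len(pool))
                op0 = random.choice(SEARCH_SPACE)
                op1 = random.choice(SEARCH_SPACE)
                cell.append((pool[idx0], op0, pool[idx1], op1))
                pool.append(f"sum{step}")   # summed output is added to the sampling pool
            return cell

        print(sample_decoder_cell())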
  • the encoder part of the network in the embodiment of the present application is MobileNet-v2, pre-trained on MS COCO, and a lightweight RefineNet decoder is used for semantic segmentation during pre-training.
  • the embodiment of the present application uses the outputs of layers 2, 3, 6, and 8 of MobileNet-v2, corresponding to block0 to block3, as the input of the decoder; a 1×1 convolutional layer is used for encoder output adaptation, with 48 output channels during search and 64 output channels during training.
  • the encoder weights are randomly initialized using the Xavier scheme.
  • the embodiment of the present application uses a controller to search for a combination of basic units to construct a neural network architecture. Based on existing semantic segmentation research, the search space is set as follows in the embodiment of the present application:
  • 1×1 convolution; 3×3 convolution; 3×3 separable convolution; 5×5 separable convolution; a global average pooling (GAP), upsampling, 1×1 convolution module; 3×3 convolution with dilation rate 3; 3×3 convolution with dilation rate 12; separable 3×3 convolution with dilation rate 3; separable 5×5 convolution with dilation rate 6; skip connection; and a zero operation that effectively invalidates the path.
  • LKD represents the overall loss of the knowledge distillation network
  • LStudent represents the loss of the segmentation network architecture
  • LTeacher represents the teacher network loss
  • coff represents an adjustable parameter in the actual network training process.
  • coff takes a value of 0.3.
  • the sampling architecture is used to perform instance segmentation forward inference simultaneously with the teacher network.
  • LKD represents the overall loss of the knowledge distillation network
  • LStudent represents the loss of the segmentation network architecture
  • LTeacher represents the teacher network loss
  • coff represents an adjustable parameter in the actual network training process.
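  • A hedged sketch of how this distillation objective could be applied in a training step (the loss combination and coff = 0.3 come from the text; the function names and the way the teacher loss is obtained are assumptions for illustration):

        import torch

        def kd_loss(student_loss: torch.Tensor, teacher_loss: torch.Tensor,
                    coff: float = 0.3) -> torch.Tensor:
            # L_KD = L_Student + coff * L_Teacher, with coff = 0.3 as in the text.
            return student_loss + coff * teacher_loss

        def kd_training_step(student, teacher, images, targets, loss_fn, coff=0.3):
            """Both networks perform forward inference on the same batch; the
            teacher's loss is combined with the sampled student architecture's loss."""
            with torch.no_grad():                 # the teacher is already trained
                teacher_out = teacher(images)
            student_out = student(images)
            return kd_loss(loss_fn(student_out, targets),
                           loss_fn(teacher_out, targets), coff)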
  • it also includes:
  • n is an instance of different categories, pixels is a pixel, ytrue is the actual value of the corresponding category, and ypred is the predicted value of the corresponding type.
  • the background class is not considered during the calculation, because a large number of pixels belong to the background, and adding the background class to the calculation will have a negative impact on the result.
  • the loss function is crucial to the accuracy of the model prediction results.
  • the embodiment of the present application uses Dice Soft Loss as the loss function because the loss function can be calculated separately for instances of different categories.
  • this loss function is commonly used in semantic segmentation tasks. It evolved from a loss based on the dice coefficient, which measures the overlap between the predicted and actual values of different categories. The dice loss is calculated for each category and then summed and averaged. The detailed expression is as follows:
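  • The expression itself appears only as an image in the original document; reconstructed from the symbol definitions below, the standard dice soft loss it refers to can be written as:

        L_{Dice} = \frac{1}{n} \sum_{\text{classes}} \left( 1 - \frac{2 \sum_{\text{pixels}} y_{\text{true}} \, y_{\text{pred}}}{\sum_{\text{pixels}} y_{\text{true}} + \sum_{\text{pixels}} y_{\text{pred}}} \right)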
  • n is an instance of different categories
  • pixels is a pixel point
  • ytrue is the actual value of the corresponding category
  • ypred is the predicted value of the corresponding type.
  • selecting a plurality of segmentation network architectures from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determining an optimal segmentation network architecture from the plurality of segmentation network architectures includes:
  • the geometric mean is calculated using the average IoU, frequency-weighted IoU, and average pixel precision;
  • the embodiment of the present application randomly divides the training set into two non-overlapping sets: an initial training set (Train DataSet0) and an initial validation set (Valid DataSet0).
  • the initial training set can be image enhanced and is used to train the sampled architecture on a given task (i.e., semantic segmentation); the initial validation set is not subjected to any image processing and is used to evaluate the trained architecture and provide a scalar (often referred to as feedback in reinforcement learning literature) to the controller.
  • the internal training process is divided into two stages. The first stage is the architecture search stage.
  • the weights of the encoder are obtained through pre-training, and its output is calculated and stored in the memory.
  • each sampling process directly imports the stored encoder output, which saves a great deal of computing time and improves efficiency.
  • in the first stage only the decoder is trained, which facilitates rapid adaptation of the decoder weights and a reasonable estimate of each sampled architecture's performance.
  • the second stage is the full training stage. However, not all sampled architectures enter this stage; a simple simulated annealing algorithm is used to decide whether a sampled architecture continues to the second stage of training.
  • the reason not all sampled architectures are trained further is that an architecture that has completed the first stage of training already indicates its future prospects after training on the current batch; terminating unpromising architectures early saves computing resources and finds the target architecture with higher accuracy more quickly.
  • the external optimization process optimizes the controller through the proximal policy optimization (PPO) method given the sampling sequence, log probability and feedback signal, and strikes a balance between the diversity of the sampling architecture and the complexity of the tuning process, so as to achieve the controller network model update and global parameter optimization.
  • the embodiment of the present application retains the running average of the feedback after the first stage to decide whether to continue training the sampled architecture.
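  • A hedged sketch of this early-termination decision (the text only states that a running average of the feedback is kept and a simple simulated annealing rule is applied; the acceptance rule and temperature below are assumptions):

        import math
        import random

        def continue_full_training(reward: float, running_avg: float,
                                   temperature: float = 0.05) -> bool:
            """Accept architectures whose reward beats the running average outright;
            otherwise accept with a probability that shrinks with the gap, in the
            spirit of simulated annealing."""
            if reward >= running_avg:
                return True
            return random.random() < math.exp((reward - running_avg) / temperature)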
  • the criterion for evaluating an architecture's future prospects, that is, the reward, uses the geometric mean of three quantities:
  • Mean intersection-over-union (mIoU), mainly used for semantic segmentation benchmarks
  • k represents the number of categories
  • i represents the true value
  • j represents the predicted value
  • Pij represents predicting i as j. The same applies to the following.
  • frequency-weighted IoU (FWIoU), which scales the IoU of each class according to the number of pixels belonging to that class;
  • mean pixel accuracy (MPA).
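  • The metric expressions themselves appear only as images in the original; using the notation defined above (k categories, Pij denoting pixels of true class i predicted as class j), the standard definitions they refer to can be reconstructed as:

        \mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{P_{ii}}{\sum_{j} P_{ij} + \sum_{j} P_{ji} - P_{ii}}

        \mathrm{FWIoU} = \frac{1}{\sum_{i} \sum_{j} P_{ij}} \sum_{i=1}^{k} \frac{\left( \sum_{j} P_{ij} \right) P_{ii}}{\sum_{j} P_{ij} + \sum_{j} P_{ji} - P_{ii}}

        \mathrm{MPA} = \frac{1}{k} \sum_{i=1}^{k} \frac{P_{ii}}{\sum_{j} P_{ij}}

        \mathrm{reward} = \left( \mathrm{mIoU} \cdot \mathrm{FWIoU} \cdot \mathrm{MPA} \right)^{1/3}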
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm, including:
  • segmentation network architectures are selected from multiple segmentation network architectures to perform full training in the first stage, full training in the second stage, and full training in the third stage respectively.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm to perform full training in the first phase, including:
  • the augmented dataset is used for training for 50 epochs, where the auxiliary unit parameter used is 0.2.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the second phase, including:
  • the enhanced dataset was used to train 50 epochs, with the auxiliary unit parameter used being 0.2.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the third phase, including:
  • the enhanced dataset was used to train 50 epochs, with the auxiliary unit parameter of 0.15 and the BN layer frozen.
  • in each epoch, the enhanced data set is fully fed into the several segmentation network architectures; that is, one epoch can be understood as the process of training on all samples once.
  • when a complete data set passes forward through the neural network once and returns once, the process is called one epoch; that is, all training samples undergo one forward propagation and one backward propagation in the neural network.
  • the solution proposed in the embodiment of this application uses the data enhancement method to enhance the image of the data set; then, the DeepLabV3+ neural network is trained on the enhanced data set to obtain the image segmentation information and use it as the teacher network; the knowledge distillation method is used to guide and correct the search student network training process, and the loss function is calculated for different categories of image segmentation data, which can effectively improve the detection accuracy of small sample data in image segmentation.
  • a lightweight semantic segmentation model can be quickly obtained with less computing overhead, and more reliable image segmentation prediction results can be achieved with faster reasoning speed, which has better adaptability in autonomous driving scenarios.
  • an embodiment of the present application further provides an image instance segmentation system 400, as shown in FIG6 , comprising:
  • An acquisition module 401 is configured to acquire a trained teacher network and a controller network
  • a search module 402 is configured to search for a plurality of decoder structures using the controller network and construct a plurality of segmentation network architectures using each decoder structure and a fixed encoder;
  • An evaluation module 403 is configured to use the trained teacher network and each segmentation network architecture to simultaneously perform image instance segmentation forward inference, and use the loss function of the trained teacher network to guide and correct the loss function of each segmentation network architecture after each forward inference, and select a plurality of segmentation network architectures from the plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determine the optimal segmentation network architecture from the plurality of segmentation network architectures;
  • the image instance segmentation module 404 is configured to perform image instance segmentation on the image to be segmented using an optimal segmentation network architecture.
  • a teacher network building module is further included, configured to:
  • the encoder of the teacher network is constructed using the backbone network and the atrous spatial pyramid pooling (ASPP) module, where the ASPP module includes four atrous convolution blocks with different dilation rates and a global average pooling block.
  • the decoder of the teacher network is constructed using an upsampling module, a 1×1 convolution block, and a 3×3 convolution block, where the decoder of the teacher network takes the low-level feature maps output by the middle layer of the backbone network and the output of the ASPP module as input.
  • a teacher network training module is further included, configured to:
  • the decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain the image instance segmentation result;
  • the parameters of the encoder and decoder of the teacher network are adjusted according to the image instance segmentation results to train the teacher network.
  • the teacher network training module is further configured to:
  • the teacher network training module is further configured to:
  • the backbone network is used to extract multi-layer semantic features of the images in the training set, and the ASPP module is used to sample the multi-layer semantic features in parallel with dilated convolutions at different sampling rates to obtain five sets of feature maps, which are then concatenated and input into the decoder of the teacher network.
  • the teacher network training module is further configured to:
  • the upsampling module is used to interpolate and upsample the feature map from the ASPP module, and the 1×1 convolution block is used to reduce the channel dimension of the low-level feature map output from the intermediate layer of the backbone network;
  • the low-level feature map of channel dimension reduction and the feature map obtained by linear interpolation upsampling are concatenated and sent to a 3×3 convolution block for processing.
  • the upsampling module is then used for linear interpolation upsampling again to obtain the image instance segmentation result.
  • the teacher network training module is further configured to:
  • the controller network comprises a two-layer recursive LSTM neural network with 100 hidden units, and all hidden units are randomly initialized from a uniform distribution.
  • the search module is further configured to:
  • the controller network is used to search the internal structure of the fifth decoder block and the sixth decoder block in a preset search space, and the connection mode between the first decoder block, the second decoder block, the third decoder block, the fourth decoder block, the fifth decoder block and the sixth decoder block.
  • the search space includes 1×1 convolution, 3×3 convolution, 3×3 separable convolution, 5×5 separable convolution, global average pooling, upsampling, 1×1 convolution module, 3×3 convolution with dilation rate 3, 3×3 convolution with dilation rate 12, separable 3×3 convolution with dilation rate 3, separable 5×5 convolution with dilation rate 6, skip connections, and zero operations that effectively invalidate the paths.
  • LKD represents the overall loss of the knowledge distillation network
  • LStudent represents the loss of the segmentation network architecture
  • LTeacher represents the teacher network loss
  • coff represents an adjustable parameter in the actual network training process.
  • coff takes a value of 0.3.
  • the evaluation module is further configured to:
  • n is the instance of different categories
  • pixels is the pixel point
  • ytrue is the actual value of the corresponding category
  • ypred is the predicted value of the corresponding type.
  • the evaluation module is further configured to:
  • the geometric mean is calculated using the average IoU, frequency-weighted IoU, and average pixel precision;
  • the evaluation module is further configured to:
  • segmentation network architectures are selected from multiple segmentation network architectures to perform full training in the first stage, full training in the second stage, and full training in the third stage respectively.
  • the evaluation module is further configured to:
  • the several segmentation network architectures are trained for 50 epochs using the enhanced dataset, with an auxiliary unit parameter of 0.2.
  • the evaluation module is further configured to:
  • the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.2.
  • the evaluation module is further configured to:
  • the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.15 and the BN layer is frozen.
  • an embodiment of the present application further provides a computer device 501, including:
  • the memory 510 stores a computer program 511 that can be run on the processor.
  • when the processor 520 executes the program, the processor 520 performs the steps of any of the above image instance segmentation methods.
  • an embodiment of the present application also provides a non-volatile readable storage medium 601, and the non-volatile readable storage medium 601 stores a computer program 610.
  • when the computer program 610 is executed by a processor, the steps of any one of the above image instance segmentation methods are performed.
  • the non-volatile readable storage medium may be, for example, a memory.
  • the non-volatile readable storage medium may be either a volatile memory or a non-volatile memory, or may include both volatile memory and non-volatile memory.
  • the program may be stored in a non-volatile readable storage medium.
  • the medium can be a read-only memory, a magnetic or optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Disclosed in the present application are an image instance segmentation method and system, a device and a storage medium. The method comprises the following steps: acquiring a trained teacher network and a controller network; searching a plurality of decoder structures by means of the controller network, and forming a plurality of segmentation network architectures by means of each of the decoder structures and a fixed encoder; simultaneously performing image instance segmentation forward reasoning by means of the trained teacher network and each of the segmentation network architectures, after each forward reasoning, using a loss function of the trained teacher network to guide and correct a loss function of each of the segmentation network architectures, selecting several segmentation network architectures from the plurality of segmentation network architectures according to a simulated annealing algorithm for full training, and determining an optimal segmentation network architecture from the several segmentation network architectures; and performing image instance segmentation on an image to be segmented by means of the optimal segmentation network architecture.

Description

Image instance segmentation method, system, device and non-volatile readable storage medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to the Chinese patent application filed with the China Patent Office on November 30, 2022, with application number 202211515764.5 and titled "A method, system, device and storage medium for image instance segmentation", the entire contents of which are incorporated into this application by reference.
Technical Field
The present application relates to the field of image processing, and in particular to an image instance segmentation method, system, device and non-volatile readable storage medium.
Background
Image semantic segmentation technology has become an important research direction in the field of computer vision and is widely used in practical application scenarios such as mobile robots, autonomous driving, drones, and medical diagnosis. At present, image segmentation technology is mainly divided into two research directions: semantic segmentation and instance segmentation. Semantic segmentation assigns a category to each pixel in the image, that is, pixel-level classification, so it is also called dense classification; instance segmentation further distinguishes different instances of the same category on the basis of semantic segmentation.
At present, image segmentation neural network models designed by experts have reached a high level of accuracy, for example Mask RCNN (Region-based Convolutional Neural Networks), DeepLab, the U-net series of algorithms and other neural networks. Among them, the DeepLab series is one of the more influential branches in the field of semantic segmentation, and DeepLabV3+ is currently among its better variants. Researchers have therefore begun to explore the automatic design of neural networks through Neural Architecture Search (NAS). At present, relevant researchers mainly focus on neural architecture search algorithms to automatically establish neural networks and quickly apply them in practice. Existing neural architecture search algorithms use reinforcement-learning and evolutionary search methods to search for architectures, evaluate the sampled network architectures through performance evaluation methods, and then obtain the best model structure by optimizing the evaluation metrics. The former is realized mainly by maximizing the reward obtained while the Neural Architecture Search framework interacts with its environment; its main representative algorithms include NASNet, MetaQNN (Quantum Neural Network), BlockQNN, etc. The latter uses general evolutionary algorithms to simulate the laws of biological inheritance and evolution to realize the NAS process; its main representative algorithms include NEAT (Evolving Neural Networks through Augmenting Topologies), DeepNEAT, CoDeepNEAT, etc.
Neural architecture search can automatically design customized neural networks for specific tasks, which has far-reaching implications. However, for tasks such as image segmentation that require dense pixel-by-pixel classification, practical applications are constrained by limited computing resources and time. That is, the traditional neural network models used in existing image instance segmentation methods have large parameter counts, which makes them impossible to use directly on edge devices. For example, in autonomous driving scenarios, current vehicle-side chips cannot carry existing high-precision neural networks, while directly using neural networks with fewer parameters leads to inaccurate recognition of dense images and difficulty in delineating the edges of different categories.
Summary of the Invention
In view of this, in order to overcome at least one aspect of the above problems, an embodiment of the present application proposes an image instance segmentation method, comprising the following steps:
acquiring a trained teacher network and a controller network;
searching for multiple decoder structures using the controller network, and forming multiple segmentation network architectures using each decoder structure and a fixed encoder;
using the trained teacher network and each segmentation network architecture to simultaneously perform image instance segmentation forward inference, using the loss function of the trained teacher network to guide and correct the loss function of each segmentation network architecture after each forward inference, and selecting several segmentation network architectures from the multiple segmentation network architectures for full training according to a simulated annealing algorithm and determining the optimal segmentation network architecture from the several segmentation network architectures;
performing image instance segmentation on the image to be segmented using the optimal segmentation network architecture.
In some embodiments, obtaining a trained teacher network includes:
constructing the encoder of the teacher network using the backbone network and the atrous spatial pyramid pooling (ASPP) module, where the ASPP module includes four atrous convolution blocks with different dilation rates and a global average pooling block;
constructing the decoder of the teacher network using an upsampling module, a 1×1 convolution block, and a 3×3 convolution block, where the decoder of the teacher network takes the low-level feature maps output by the middle layer of the backbone network and the output of the ASPP module as input.
In some embodiments, the method also includes:
constructing a first training set;
using the backbone network and the ASPP module to process the images in the training set;
using the decoder to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain the image instance segmentation result;
adjusting the parameters of the encoder and decoder of the teacher network according to the image instance segmentation results to train the teacher network.
In some embodiments, the method also includes:
performing data augmentation on the data in the first training set.
In some embodiments, using the backbone network and the ASPP module to process the images in the training set includes:
using the backbone network to extract multi-layer semantic features of the images in the training set, and using the ASPP module to sample the multi-layer semantic features in parallel with dilated convolutions at different sampling rates to obtain five sets of feature maps, which are concatenated and input into the decoder of the teacher network.
In some embodiments, using the decoder to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain an image segmentation result includes:
using the upsampling module to interpolate and upsample the feature map from the ASPP module, and using the 1×1 convolution block to reduce the channel dimension of the low-level feature map output from the intermediate layer of the backbone network;
concatenating the channel-reduced low-level feature map and the feature map obtained by linear interpolation upsampling, sending the result to a 3×3 convolution block for processing, and using the upsampling module again for linear interpolation upsampling to obtain the image instance segmentation result.
In some embodiments, using the 1×1 convolution block to reduce the channel dimension of the low-level feature map output from the intermediate layer of the backbone network includes:
reducing the channels of the low-level feature map output by the intermediate layer of the backbone network to 48.
In some embodiments, the controller network comprises a two-layer recursive LSTM neural network with 100 hidden units, and all hidden units are randomly initialized from a uniform distribution.
In some embodiments, using the controller network to search for multiple decoder structures and using each decoder structure and a fixed encoder to form multiple segmentation network architectures includes:
obtaining a preset first decoder block, second decoder block, third decoder block, and fourth decoder block;
using the controller network to search, in a preset search space, for the internal structures of a fifth decoder block and a sixth decoder block, and the connection mode between the first decoder block, the second decoder block, the third decoder block, the fourth decoder block, the fifth decoder block and the sixth decoder block.
In some embodiments, the search space includes 1×1 convolution, 3×3 convolution, 3×3 separable convolution, 5×5 separable convolution, global average pooling, upsampling, 1×1 convolution module, 3×3 convolution with dilation rate 3, 3×3 convolution with dilation rate 12, separable 3×3 convolution with dilation rate 3, separable 5×5 convolution with dilation rate 6, skip connections, and zero operations that effectively invalidate the paths.
在一些实施例中,利用已训练的教师网络和每一个分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后,使用已训练的教师网络的损失函数对每一个分割网络架构的损失函数进行指导并修正,包括利用如下公式对每一个分割网络架构的损失函数进行指导并修正:
L_KD = L_Student + coff * L_Teacher
In some embodiments, the trained teacher network and each segmentation network architecture are used to simultaneously perform image instance segmentation forward inference, and after each forward inference, the loss function of each segmentation network architecture is guided and corrected using the loss function of the trained teacher network, including guiding and correcting the loss function of each segmentation network architecture using the following formula:
L_KD = L_Student + coff * L_Teacher
其中,L_KD表示知识蒸馏网络总体损失,L_Student表示分割网络架构的损失,L_Teacher表示教师网络损失,coff表示一个在实际的网络训练过程中可调节的参数。Among them, L_KD represents the overall loss of the knowledge distillation network, L_Student represents the loss of the segmentation network architecture, L_Teacher represents the teacher network loss, and coff represents a parameter that can be adjusted during the actual network training process.
在一些实施例中,coff取值为0.3。In some embodiments, coff takes a value of 0.3.
在一些实施例中,还包括:In some embodiments, it also includes:
利用Dice损失公式计算分割网络架构的损失;其中,n为不同类别实例,pixels为像素点,y_true为对应类别的实际值,y_pred为对应类别的预测值。The loss of the segmentation network architecture is calculated using a Dice loss formula, where n indexes the different class instances, pixels indexes the pixel points, y_true is the actual value for the corresponding class, and y_pred is the predicted value for the corresponding class.
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练并从若干个分割网络架构中确定最优的分割网络架构,包括: In some embodiments, selecting a plurality of segmentation network architectures from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determining an optimal segmentation network architecture from the plurality of segmentation network architectures includes:
获取每一个分割网络架构的平均交并比、频率加权交并比以及平均像素精度;Obtain the average IoU, frequency-weighted IoU, and average pixel accuracy for each segmentation network architecture;
利用平均交并比、频率加权交并比以及平均像素精度计算几何平均值;The geometric mean is calculated using the average IoU, frequency-weighted IoU, and average pixel precision;
根据几何平均值挑选若干个分割网络架构。Several segmentation network architectures were selected based on the geometric mean.
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练,包括:In some embodiments, a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm, including:
根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构分别进行第一阶段的全量训练、第二阶段的全量训练、第三阶段的全量训练。According to the simulated annealing algorithm, several segmentation network architectures are selected from multiple segmentation network architectures to perform full training in the first stage, full training in the second stage, and full training in the third stage respectively.
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第一阶段的全量训练,包括:In some embodiments, a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm to perform full training in the first stage, including:
利用增强数据集对所述若干个分割网络架构训练50个时期epoch,其中使用的辅助单元参数为0.2。The several segmentation network architectures are trained for 50 epochs using the enhanced dataset, with an auxiliary unit parameter of 0.2.
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第二阶段的全量训练,包括:In some embodiments, a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the second phase, including:
在经过第一阶段训练完毕的模型参数基础上,利用所述增强数据集对所述若干个分割网络架构训练50个所述epoch,其中使用的辅助单元参数为0.2。Based on the model parameters trained in the first stage, the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.2.
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第三阶段的全量训练,包括:In some embodiments, a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the third phase, including:
在经过第二阶段训练完毕的模型参数基础上,利用所述增强数据集对所述若干个分割网络架构训练50个所述epoch,其中使用的辅助单元参数为0.15并冻结BN层。Based on the model parameters trained in the second stage, the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.15 and the BN layer is frozen.
基于同一发明构思,根据本申请的另一个方面,本申请的实施例还提供了一种图像实例分割系统,包括:Based on the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides an image instance segmentation system, including:
获取模块,被配置为获取已训练的教师网络和控制器网络;An acquisition module, configured to acquire a trained teacher network and a controller network;
搜索模块,被配置为利用控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构;A search module configured to search for a plurality of decoder structures using the controller network and construct a plurality of segmentation network architectures using each decoder structure and a fixed encoder;
评估模块,被配置为利用已训练的教师网络和每一个分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后使用已训练的教师网络的损失函数对每一个分割网络架构的损失函数进行指导并修正,以及根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练并从若干个分割网络架构中确定最优的分割网络架构;An evaluation module is configured to use the trained teacher network and each segmentation network architecture to simultaneously perform image instance segmentation forward inference, and use the loss function of the trained teacher network to guide and correct the loss function of each segmentation network architecture after each forward inference, and select a plurality of segmentation network architectures from the plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determine the optimal segmentation network architecture from the plurality of segmentation network architectures;
图像实例分割模块,被配置为利用最优的分割网络架构对待分割的图像进行图像实例分割。The image instance segmentation module is configured to perform image instance segmentation on the image to be segmented using an optimal segmentation network architecture.
基于同一发明构思,根据本申请的另一个方面,本申请的实施例还提供了一种计算机设备,包括:Based on the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a computer device, including:
至少一个处理器;以及at least one processor; and
存储器,存储器存储有可在处理器上运行的计算机程序,处理器执行程序时执行如上的任一种图像实例分割方法的步骤。a memory, where the memory stores a computer program that can be run on the processor, and the processor, when executing the program, performs the steps of any one of the above image instance segmentation methods.
基于同一发明构思,根据本申请的另一个方面,本申请的实施例还提供了一种非易失性可读存储介质,非易失性可读存储介质存储有计算机程序,计算机程序被处理器执行时执行如上的任一种图像实例分割方法的步骤。Based on the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a non-volatile readable storage medium, which stores a computer program. When the computer program is executed by a processor, the steps of any one of the above image instance segmentation methods are performed.
本申请实施例具有以下有益技术效果之一:本申请实施例提出的方案通过使用知识蒸馏的方法,对搜索的学生网络(分割网络架构)训练过程进行指导并修正,从而能够在较小计算开支的情况下,快速获得轻量级的语义分割模型,改善现有图像分割模型参数过大的问题,以较快的推理速度实现更为可靠的图像分割预测结果。在自动驾驶场景具有更佳适配性。The embodiment of the present application has one of the following beneficial technical effects: The solution proposed in the embodiment of the present application guides and corrects the training process of the searched student network (segmentation network architecture) by using the knowledge distillation method, so that a lightweight semantic segmentation model can be quickly obtained with a small computing cost, the problem of excessive parameters in the existing image segmentation model can be improved, and more reliable image segmentation prediction results can be achieved with a faster reasoning speed. It has better adaptability in autonomous driving scenarios.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的实施例。In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for use in the embodiments or the description of the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application. For those of ordinary skill in the art, other embodiments can be obtained from these drawings without any creative effort.
图1为本申请的实施例提供的图像实例分割方法的流程示意图;FIG1 is a schematic diagram of a flow chart of an image instance segmentation method provided in an embodiment of the present application;
图2为本申请的实施例提供的教师网络的框架示意图;FIG2 is a schematic diagram of a framework of a teacher network provided in an embodiment of the present application;
图3为本申请的实施例提供的基于知识蒸馏的神经网络架构搜索的图像分割算法框架;FIG3 is an image segmentation algorithm framework based on knowledge distillation neural network architecture search provided by an embodiment of the present application;
图4为本申请的实施例提供的控制器网络示意图;FIG4 is a schematic diagram of a controller network provided in an embodiment of the present application;
图5为本申请的实施例提供的单元架构示意图;FIG5 is a schematic diagram of a unit architecture provided in an embodiment of the present application;
图6为本申请的实施例提供的图像实例分割***的结构示意图;FIG6 is a schematic diagram of the structure of an image instance segmentation system provided in an embodiment of the present application;
图7为本申请的实施例提供的计算机设备的结构示意图;FIG7 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application;
图8为本申请的实施例提供的非易失性可读存储介质的结构示意图。FIG8 is a schematic diagram of the structure of a non-volatile readable storage medium provided in an embodiment of the present application.
具体实施方式DETAILED DESCRIPTION OF THE EMBODIMENTS
为使本申请实施例的目的、技术方案和优点更加清楚明白,以下结合可选实施例,并参照附图,对本申请实施例进行详细说明。In order to make the objectives, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application are described in detail below in combination with optional embodiments and with reference to the accompanying drawings.
需要说明的是,本申请实施例中所有使用“第一”和“第二”的表述均是为了区分两个相同名称非相同的实体或者非相同的参量,可见“第一”“第二”仅为了表述的方便,不应理解为对本申请实施例的限定,后续实施例对此不再一一说明。It should be noted that all expressions using "first" and "second" in the embodiments of the present application are for distinguishing two non-identical entities with the same name or non-identical parameters. It can be seen that "first" and "second" are only for the convenience of expression and should not be understood as limitations on the embodiments of the present application. The subsequent embodiments will not explain this one by one.
根据本申请的一个方面,本申请的实施例提出一种图像实例分割方法,如图1所示,其可以包括步骤:According to one aspect of the present application, an embodiment of the present application provides an image instance segmentation method, as shown in FIG1 , which may include the steps of:
S1,获取已训练的教师网络和控制器网络;S1, obtain the trained teacher network and controller network;
S2,利用控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构; S2, using the controller network to search for multiple decoder structures and using each decoder structure and a fixed encoder to form multiple segmentation network architectures;
S3,利用已训练的教师网络和每一个分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后使用已训练的教师网络的损失函数对每一个分割网络架构的损失函数进行指导并修正,以及根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练并从若干个分割网络架构中确定最优的分割网络架构;S3, using the trained teacher network and each segmentation network architecture to simultaneously perform image instance segmentation forward inference, and using the loss function of the trained teacher network to guide and correct the loss function of each segmentation network architecture after each forward inference, and selecting several segmentation network architectures from multiple segmentation network architectures for full training according to the simulated annealing algorithm and determining the optimal segmentation network architecture from the several segmentation network architectures;
S4,利用最优的分割网络架构对待分割的图像进行图像实例分割。S4, using the optimal segmentation network architecture to perform image instance segmentation on the image to be segmented.
本申请实施例提出的方案通过使用知识蒸馏的方法,对搜索的学生网络(分割网络架构)训练过程进行指导并修正,从而能够在较小计算开支的情况下,快速获得轻量级的语义分割模型,改善现有图像分割模型参数过大的问题,以较快的推理速度实现更为可靠的图像分割预测结果。在自动驾驶场景具有更佳适配性。The solution proposed in the embodiment of the present application uses the knowledge distillation method to guide and correct the search student network (segmentation network architecture) training process, so that a lightweight semantic segmentation model can be quickly obtained with low computational overhead, the problem of excessive parameters in the existing image segmentation model can be improved, and more reliable image segmentation prediction results can be achieved with faster reasoning speed. It has better adaptability in autonomous driving scenarios.
在一些实施例中,获取已训练的教师网络,包括:In some embodiments, obtaining a trained teacher network includes:
利用骨干网络以及ASPP(Atrous Spatial Pyramid Pooling,空洞空间金字塔池化)模块构建教师网络的编码器,其中,ASPP模块包括四种不同膨胀率的空洞卷积块和一个全局平均池化块;The encoder of the teacher network is constructed using the backbone network and the ASPP (Atrous Spatial Pyramid Pooling) module, where the ASPP module includes four atrous convolution blocks with different dilation rates and a global average pooling block;
利用上采样模块、1×1卷积块、3×3卷积块构建教师网络的解码器,其中,教师网络的解码器将骨干网络中间层输出的低级特征图和ASPP模块的输出作为输入。The decoder of the teacher network is constructed using an upsampling module, a 1×1 convolution block, and a 3×3 convolution block, where the decoder of the teacher network takes the low-level feature maps output by the middle layer of the backbone network and the output of the ASPP module as input.
可选的,如图2所示,教师网络部分,DeepLabV3+网络应用ResNet101作为backbone(骨干网络),提取原图像中的多层语义特征,经过ASPP模块对特征信息以不同采样率的空洞卷积并行采样,获取了不同比例的图像上下文信息。ASPP模块接受backbone的第一部分输出作为输入,使用了四种不同膨胀率的空洞卷积块(包括卷积、BN、激活层)和一个全局平均池化块(包括池化、卷积、BN(Batch Normalization,批标准化)、激活层)得到一共五组特征图,将其concat(拼接)起来之后,经过一个1×1卷积块(包括卷积、BN、激活、dropout(随机失活)层),最后送入Decoder(解码)模块。该Decoder模块,接收来自backbone中间层的低级特征图和来自ASPP模块的输出作为输入。Optionally, as shown in Figure 2, in the teacher network part, the DeepLabV3+ network uses ResNet101 as the backbone to extract the multi-layer semantic features in the original image. The feature information is sampled in parallel by the ASPP module with different sampling rates of dilated convolution to obtain different proportions of image context information. The ASPP module accepts the first part of the output of the backbone as input, and uses four dilated convolution blocks with different expansion rates (including convolution, BN, activation layer) and a global average pooling block (including pooling, convolution, BN (Batch Normalization), activation layer) to obtain a total of five groups of feature maps, which are concatenated and passed through a 1×1 convolution block (including convolution, BN, activation, dropout (random inactivation) layer) and finally sent to the Decoder module. The Decoder module receives the low-level feature map from the middle layer of the backbone and the output from the ASPP module as input.
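As a rough illustration of the ASPP structure described above (four atrous convolution branches plus one global-average-pooling branch, whose five groups of feature maps are concatenated and fused by a 1×1 convolution block), a minimal PyTorch sketch follows. The channel counts and the dilation rates (1, 6, 12, 18) are assumptions chosen for illustration and are not values fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Minimal ASPP sketch: four atrous branches + global average pooling,
    concatenation of the five feature groups, then a 1x1 fusion block."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch,
                          kernel_size=3 if r > 1 else 1,
                          padding=r if r > 1 else 0,
                          dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
            for r in rates])
        self.gap = nn.Sequential(                      # global average pooling block
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))
        self.project = nn.Sequential(                  # 1x1 conv block after concat
            nn.Conv2d(5 * out_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5))

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]        # four atrous branches
        pooled = F.interpolate(self.gap(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        feats.append(pooled)                                   # fifth feature group
        return self.project(torch.cat(feats, dim=1))
```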
在一些实施例中,还包括:In some embodiments, it also includes:
构建第一训练集;Construct the first training set;
利用骨干网络以及ASPP模块对训练集中的图像进行处理;Use the backbone network and ASPP module to process the images in the training set;
利用解码器对骨干网络中间层输出的低级特征图和ASPP模块的输出进行处理以得到图像实例分割结果;The decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain the image instance segmentation result;
根据图像实例分割结果调整教师网络的编码器和解码器的参数以对教师网络进行训练。The parameters of the encoder and decoder of the teacher network are adjusted according to the image instance segmentation results to train the teacher network.
在一些实施例中,还包括:In some embodiments, it also includes:
对第一训练集中的数据进行数据增强。Perform data augmentation on the data in the first training set.
可选的,增强数据集可以使用多种数据增强方法,例如可以使用公开号为CN114037637A的专利中所保护的图像数据增强方法进行数据增强,在此简单描述其步骤:对原始图像进行分割,获取分割图像与分割图像的目标类别,通过目标类别获取待增强类别;按照待增强类别对原始图像分别进行二值化处理,获取二值图像,根据二值图像的连通域,获取原始图像中与待增强类别存在匹配关系的实例图像;对实例图像进行透视处理,获取第一实例图像,对第一实例图像进行缩放,获取第二实例图像;从原始图像中获取灭点位置,根据灭点位置与第二实例图像的几何尺寸,确定第二实例图像的粘贴位置,根据粘贴位置将第二实例图像粘贴至原始图像,获取原始图像的增强图像。Optionally, the enhanced dataset can use a variety of data enhancement methods; for example, the image data enhancement method protected by the patent with publication number CN114037637A can be used, whose steps are briefly described here: segment the original image to obtain segmented images and their target categories, and obtain the categories to be enhanced from the target categories; binarize the original image separately according to the categories to be enhanced to obtain binary images, and, according to the connected domains of the binary images, obtain instance images in the original image that match the categories to be enhanced; apply perspective processing to an instance image to obtain a first instance image, and scale the first instance image to obtain a second instance image; obtain the vanishing point position from the original image, determine the pasting position of the second instance image according to the vanishing point position and the geometric size of the second instance image, and paste the second instance image onto the original image according to the pasting position to obtain an enhanced image of the original image.
在一些实施例中,利用骨干网络以及ASPP模块对训练集中的图像进行处理,包括:In some embodiments, the backbone network and the ASPP module are used to process the images in the training set, including:
利用骨干网络提取训练集中图像的多层语义特征并利用ASPP模块对多层语义特征以不同采样率的空洞卷积并行采样以得到五组特征图,并将五组特征图拼接后输入到教师网络的解码器。The backbone network is used to extract multi-layer semantic features of the images in the training set, and the ASPP module is used to sample the multi-layer semantic features in parallel with dilated convolutions at different sampling rates to obtain five sets of feature maps, which are then concatenated and input into the decoder of the teacher network.
在一些实施例中,利用解码器对骨干网络中间层输出的低级特征图和ASPP模块的输出进行处理以得到图像分割结果,包括:In some embodiments, the decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain an image segmentation result, including:
利用上采样模块对来自ASPP模块的特征图进行插值上采样并利用1×1卷积块对来自骨干网络中间层输出的低级特征图进行通道降维;The upsampling module is used to interpolate and upsample the feature map from the ASPP module, and the 1×1 convolution block is used to reduce the channel dimension of the low-level feature map output from the intermediate layer of the backbone network;
将通道降维的低级特征图和线性插值上采样得到的特征图拼接,并送入3×3卷积块进行处理,并再次利用上采样模块进行线性插值上采样以得到图像实例分割结果。The low-level feature map of channel dimension reduction and the feature map obtained by linear interpolation upsampling are concatenated and sent to a 3×3 convolution block for processing. The upsampling module is then used for linear interpolation upsampling again to obtain the image instance segmentation result.
在一些实施例中,利用1×1卷积块对来自骨干网络中间层输出的低级特征图进行通道降维,包括:In some embodiments, a 1×1 convolution block is used to perform channel dimension reduction on a low-level feature map output from an intermediate layer of the backbone network, including:
将骨干网络中间层输出的低级特征图的通道降到48。Reduce the number of channels of the low-level feature map output by the intermediate layer of the backbone network to 48.
可选的,如图2所示,解码器模块可以对低级特征图使用1×1卷积进行通道降维,其中,包括Images Pooling(图像池化)从256降到48(之所以需要降采样到48,是因为太多的通道会掩盖ASPP输出的特征图的重要性,且实验验证48最佳)。对来自ASPP的特征图进行插值上采样(Unsample By4),得到与低级特征图尺寸相同的特征图。将通道降维的低级特征图和线性插值上采样得到的特征图使用concat拼接起来,并送入一组3*3卷积块进行处理。再次进行线性插值上采样,得到与原图分辨率大小一样的预测图。Optionally, as shown in Figure 2, the decoder module can use 1×1 convolution to perform channel dimension reduction on the low-level feature map, including Images Pooling from 256 to 48 (the reason for downsampling to 48 is that too many channels will mask the importance of the feature map output by ASPP, and experiments have verified that 48 is the best). The feature map from ASPP is interpolated and upsampled (Unsample By 4) to obtain a feature map of the same size as the low-level feature map. The low-level feature map with channel dimension reduction and the feature map obtained by linear interpolation upsampling are concatenated using concat and sent to a set of 3*3 convolution blocks for processing. Linear interpolation upsampling is performed again to obtain a predicted map with the same resolution size as the original image.
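A minimal sketch of the decoder path just described (a 1×1 convolution reducing the low-level feature map to 48 channels, 4× bilinear upsampling of the ASPP output, concatenation, a 3×3 convolution block, and a final upsampling to the input resolution) is given below. The input channel counts and the number of classes are assumptions, not values fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepLabV3PlusDecoder(nn.Module):
    """Sketch of the decoder described above."""
    def __init__(self, low_ch=256, aspp_ch=256, num_classes=19):
        super().__init__()
        self.reduce = nn.Sequential(                  # 1x1 conv: low-level channels -> 48
            nn.Conv2d(low_ch, 48, 1, bias=False),
            nn.BatchNorm2d(48),
            nn.ReLU(inplace=True))
        self.refine = nn.Sequential(                  # 3x3 conv block after concatenation
            nn.Conv2d(aspp_ch + 48, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, low_level_feat, aspp_out, out_size):
        x = F.interpolate(aspp_out, size=low_level_feat.shape[2:],
                          mode='bilinear', align_corners=False)   # upsample by ~4
        x = torch.cat([x, self.reduce(low_level_feat)], dim=1)
        x = self.refine(x)
        return F.interpolate(x, size=out_size,                    # back to input resolution
                             mode='bilinear', align_corners=False)
```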
在一些实施例中,控制器网络包括100个隐藏单元的两层递归LSTM神经网络,且所有隐藏单元从均匀分布中随机初始化。In some embodiments, the controller network comprises a two-layer recursive LSTM neural network with 100 hidden units, and all hidden units are randomly initialized from a uniform distribution.
可选的,控制器为具有100个隐藏单元的两层递归LSTM神经网络,所有单元都从均匀分布中随机初始化。本申请实施例使用PPO优化策略进行优化,学习率为0.0001。Optionally, the controller is a two-layer recursive LSTM neural network with 100 hidden units, and all units are randomly initialized from a uniform distribution. The embodiment of the present application uses the PPO optimization strategy for optimization, with a learning rate of 0.0001.
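The sketch below shows one way such a controller could be realised: a two-layer LSTM with 100 hidden units whose parameters are drawn from a uniform distribution and which emits one categorical decision per step. The vocabulary size, the initialisation range and the Adam optimiser line are illustrative assumptions; the PPO update itself is not shown.

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """Sketch of the controller: a two-layer LSTM with 100 hidden units that emits
    one categorical decision (an operation or a connection index) per step."""
    def __init__(self, num_choices=11, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(num_choices, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2)
        self.head = nn.Linear(hidden, num_choices)
        for p in self.parameters():                      # uniform random initialisation
            nn.init.uniform_(p, -0.1, 0.1)

    def sample(self, steps):
        tok = torch.zeros(1, 1, dtype=torch.long)         # start token
        state, actions, log_probs = None, [], []
        for _ in range(steps):
            out, state = self.lstm(self.embed(tok), state)
            dist = torch.distributions.Categorical(logits=self.head(out[-1]))
            action = dist.sample()
            actions.append(action.item())
            log_probs.append(dist.log_prob(action))
            tok = action.unsqueeze(0)
        return actions, torch.stack(log_probs).sum()

controller = Controller()
optimizer = torch.optim.Adam(controller.parameters(), lr=1e-4)  # learning rate 0.0001
```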
在一些实施例中,利用控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构,包括:In some embodiments, a controller network is used to search for multiple decoder structures and each decoder structure and a fixed encoder are used to form multiple segmentation network architectures, including:
获取预设的第一解码器块、第二解码器块、第三解码器块以及第四解码器块;Obtaining a preset first decoder block, a second decoder block, a third decoder block, and a fourth decoder block;
利用控制器网络在预设的搜索空间中搜索第五解码器块和第六解码器块的内部结构,以及第一解码器块、第二解码器块、第三解码器块、第四解码器块、第五解码器块和第六解码器块之间的连接方式。The controller network is used to search the internal structure of the fifth decoder block and the sixth decoder block in a preset search space, and the connection mode between the first decoder block, the second decoder block, the third decoder block, the fourth decoder block, the fifth decoder block and the sixth decoder block.
可选的,学生网络部分为基于神经网络架构搜索的图像分割网络,通过神经网络架构搜索的采样过程获得目标网络架构、蒸馏教师网络中的结构知识和实例分割信息,网络架构采用编码器解码器结构。因为图像分割模型需要多次迭代才能收敛,受限于计算资源,目前从头开始执行完整的分割网络架构搜索难以实践,所以本申请实施例将架构搜索过程的注意力集中在解码器部分。整个网络结构一方面使用预先训练的分类网络中的权重初始化编码器,该分类网络由多个降低输入空间维度的下采样操作组成;另一方面,由控制器网络产生解码器结构,该解码器部分可以访问具有不同空间和频道维度的编码器的多个输出。为了保持采样架构紧凑且大小大致相同,每个编码器输出都会使用相同数量的输出通道进行一次1×1卷积。Optionally, the student network part is an image segmentation network based on neural network architecture search: the target network architecture is obtained through the sampling process of the neural architecture search, the structural knowledge and instance segmentation information in the teacher network are distilled, and the network architecture adopts an encoder-decoder structure. Because the image segmentation model requires multiple iterations to converge, it is currently difficult to perform a complete segmentation network architecture search from scratch due to limited computing resources, so the embodiment of the present application focuses the attention of the architecture search process on the decoder part. On the one hand, the entire network structure uses the weights of a pre-trained classification network to initialize the encoder, which consists of multiple downsampling operations that reduce the spatial dimension of the input; on the other hand, the controller network generates the decoder structure, and the decoder part can access multiple outputs of the encoder with different spatial and channel dimensions. In order to keep the sampled architectures compact and roughly the same size, each encoder output undergoes a 1×1 convolution with the same number of output channels.
图3学生网络部分展示了含有2个解码器块(第五解码器块block4和第六解码器块block5)和2个分支单元的搜索布局。图中的block4和block5通过控制器获得两组采样对,采样对的结果经过逐元素的求和操作后输入这两个模块,这两个模块经过控制器获得内部单元结构,其输出经过concat操作进行连接后,再经过conv1×1,输入到主分类器中,计算loss(损失)最终形成对图像的分割信息预测结果。图中的附加单元(auxiliary cell)与其他单元结构相同,可以被调节为直接输出真实背景(ground truth),或者模仿教师的网络预测(或以上两种的组合)。同时,它在训练或测试期间都不会影响主分类器的输出,只为网络的其余部分提供更好的梯度。不过,每个采样架构的反馈(reward)仍然由主分类器的输出决定。为简单起见,本申请实施例只对所有辅助输出应用分割损失。The student network portion of Figure 3 shows a search layout containing 2 decoder blocks (the fifth decoder block block4 and the sixth decoder block block5) and 2 branch units. Block4 and block5 in the figure obtain two sets of sampling pairs through the controller. The results of the sampling pairs are input into the two modules after element-by-element summation. The two modules obtain the internal unit structure through the controller, and their outputs are connected through the concat operation, and then through conv1×1, and input into the main classifier to calculate the loss and finally form the prediction result of the image segmentation information. The auxiliary cell in the figure has the same structure as other units and can be adjusted to directly output the true background (ground truth) or imitate the teacher's network prediction (or a combination of the above two). At the same time, it will not affect the output of the main classifier during training or testing, but only provide better gradients for the rest of the network. However, the feedback (reward) of each sampling architecture is still determined by the output of the main classifier. For simplicity, the embodiment of the present application only applies segmentation loss to all auxiliary outputs.
图4为神经网络架构搜索的控制器网络的布局,它可以顺序地采样出解码器连接方式,包括不同模块、不同运算操作以及不同分支位置索引。不同模块重复使用采样得到的单元架构,并将具有不同权重的相同单元应用于采样对内的每一模块,最后将两个单元的输出相加。结果层将添加到采样池中(下一个单元可以采样上一个单元作为输入)。其中,block4采样范围包括该模块前面所有模块<block0(第一解码器块),block1(第二解码器块),block2(第三解码器块),block3(第四解码器块)>,block5采样范围包括该模块前面所有模块<block0,block1,block2,block3,block4(第五解码器块)>。单元内部架构采样下面将展开描述。Figure 4 shows the layout of the controller network for neural network architecture search, which sequentially samples the decoder connection scheme, including different modules, different operations, and different branch position indexes. Different modules reuse the sampled unit architecture, the same unit with different weights is applied to each module in a sampling pair, and finally the outputs of the two units are added. The resulting layer is added to the sampling pool (the next unit can sample the previous unit as input). Here, the block4 sampling range includes all modules before it <block0 (first decoder block), block1 (second decoder block), block2 (third decoder block), block3 (fourth decoder block)>, and the block5 sampling range includes all modules before it <block0, block1, block2, block3, block4 (fifth decoder block)>. The internal unit architecture sampling is described below.
单元架构内部结果如图5所示。每个单元接受一个输入,控制器首先采样操作1;然后,采样两个位置索引(index),即输入index0和采样操作1的输出结果index1;最后采样两个相应的操作。将每个运算的输出相加,并在下一步对所有三层(来自每个运算及其相加结果)以及初始两层进行采样。单元内位置的采样次数由另一个超参数控制,以便将所有可能的架构的数量保持在一个可行的数量。对单元内所有现有的非采样求和输出求和,并将其用作单元输出。在这种情况下,使用求和(sum),因为拼接层(concatenation)运算可能会导致不同架构输出的向量尺寸变化。其中,09表示采样位置,采样操作1-7表示对应位置进行的运算。The results inside the unit architecture are shown in Figure 5. Each unit accepts an input, and the controller first samples operation 1; then, two position indices (index) are sampled, namely input index0 and the output result index1 of sampled operation 1; finally, two corresponding operations are sampled. The output of each operation is added, and in the next step, all three layers (from each operation and its summed result) and the initial two layers are sampled. The number of sampling times of positions within the unit is controlled by another hyperparameter to keep the number of all possible architectures to a feasible number. All existing non-sampled summed outputs within the unit are summed and used as the unit output. In this case, sum is used because the concatenation layer operation may cause the vector size of the output of different architectures to change. Among them, 09 represents the sampling position, and the sampling operations 1-7 represent the operations performed at the corresponding positions.
在一些实施例中,搜索空间包括1×1卷积,3×3卷积,3×3可分离卷积,5×5可分离卷积,全局平均池化、上采样、1×1卷积模块,扩张率为3的3×3卷积,扩张率为12的3×3卷积,扩张率为3的可分离3×3卷积,扩张率为6的可分离5×5卷积,跳跃连接,有效地使路径无效的零操作。 In some embodiments, the search space includes 1×1 convolution, 3×3 convolution, 3×3 separable convolution, 5×5 separable convolution, global average pooling, upsampling, 1×1 convolution module, 3×3 convolution with dilation rate 3, 3×3 convolution with dilation rate 12, separable 3×3 convolution with dilation rate 3, separable 5×5 convolution with dilation rate 6, skip connections, and zero operations that effectively invalidate the paths.
可选的,层对的采样次数由一个超参数控制,在实验中将该参数设置为3。本申请实施例网络的编码器部分是MobileNet-v2,在MS COCO上预训练,预训练时使用轻量级的RefineNet解码器进行语义分割。本申请实施例使用MobileNet-v2的2、3、6、8的四个层的输出与block0至block3相对应,作为解码器的输入;用于编码器输出自适应的1×1卷积层,在搜索期间有48个输出通道,在训练期间有64个输出通道。使用Xavier方案随机初始化编码器权重。Optionally, the number of sampling times of the layer pair is controlled by a hyperparameter, which is set to 3 in the experiment. The encoder part of the network in the embodiment of the present application is MobileNet-v2, pre-trained on MS COCO, and a lightweight RefineNet decoder is used for semantic segmentation during pre-training. The embodiment of the present application uses the outputs of the four layers 2, 3, 6, and 8 of MobileNet-v2 corresponding to block0 to block3 as the input of the decoder; a 1×1 convolutional layer for encoder output adaptation, with 48 output channels during search and 64 output channels during training. The encoder weights are randomly initialized using the Xavier scheme.
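A small sketch of the encoder-output adaptation mentioned above (one 1×1 convolution per selected MobileNet-v2 feature map, mapping every output to a common width of 48 channels during search or 64 during training) is given below; the input channel counts are assumptions that would depend on the exact MobileNet-v2 layers used.

```python
import torch.nn as nn

class EncoderAdapter(nn.Module):
    """1x1 convolutions mapping the four encoder feature maps (block0..block3)
    to a common channel width."""
    def __init__(self, in_channels=(24, 32, 96, 320), width=48):
        super().__init__()
        self.adapt = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)

    def forward(self, feats):            # feats: list of the four encoder outputs
        return [conv(f) for conv, f in zip(self.adapt, feats)]
```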
本申请实施例使用控制器搜索基本单元的组合来构建神经网络架构,基于现有语义分割研究,本申请实施例中设置搜索空间如下(列表后附有一段示意代码):The embodiment of the present application uses a controller to search for a combination of basic units to construct a neural network architecture. Based on existing semantic segmentation research, the search space is set as follows in the embodiment of the present application (an illustrative code sketch follows the list):
·1×1卷积(Cony),1×1 convolution (Cony),
·3×3卷积,3×3 convolution,
·3×3可分离卷积,3×3 separable convolution,
·5×5可分离卷积,5×5 separable convolution,
·全局平均池化、上采样、1×1卷积模块(图中缩写为GAP),Global average pooling, upsampling, 1×1 convolution module (abbreviated as GAP in the figure),
·扩张率为3的3×3卷积,3×3 convolution with dilation rate 3,
·扩张率为12的3×3卷积,3×3 convolution with dilation rate 12,
·扩张率为3的可分离3×3卷积,A separable 3×3 convolution with a dilation rate of 3,
·扩张率为6的可分离5×5卷积,A separable 5×5 convolution with a dilation rate of 6,
·跳跃连接,Skip connections,
·有效地使路径无效的零操作。A zero operation that effectively invalidates the path.
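The sketch below maps the operation names listed above to channel-preserving PyTorch modules, purely to make the search space concrete. The exact block composition (normalisation, activation) and the realisation of the global-pooling and zero operations are assumptions, not details prescribed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Zero(nn.Module):
    """Zero operation: effectively disables the path."""
    def forward(self, x):
        return torch.zeros_like(x)

class GlobalPoolConv(nn.Module):
    """Global average pooling -> 1x1 convolution -> upsample back to the input size."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 1)
    def forward(self, x):
        y = self.conv(F.adaptive_avg_pool2d(x, 1))
        return F.interpolate(y, size=x.shape[2:], mode='bilinear', align_corners=False)

def separable(ch, k, d=1):
    """Depthwise-separable convolution with kernel size k and dilation d."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, k, padding=d * (k // 2), dilation=d, groups=ch, bias=False),
        nn.Conv2d(ch, ch, 1, bias=False),
        nn.BatchNorm2d(ch),
        nn.ReLU(inplace=True))

def make_op(name, ch):
    """Map a search-space token to a module (all operations preserve the channel count)."""
    return {
        'conv1x1':     nn.Conv2d(ch, ch, 1),
        'conv3x3':     nn.Conv2d(ch, ch, 3, padding=1),
        'sep3x3':      separable(ch, 3),
        'sep5x5':      separable(ch, 5),
        'gap':         GlobalPoolConv(ch),
        'conv3x3_d3':  nn.Conv2d(ch, ch, 3, padding=3, dilation=3),
        'conv3x3_d12': nn.Conv2d(ch, ch, 3, padding=12, dilation=12),
        'sep3x3_d3':   separable(ch, 3, d=3),
        'sep5x5_d6':   separable(ch, 5, d=6),
        'skip':        nn.Identity(),
        'zero':        Zero(),
    }[name]
```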
在一些实施例中,利用已训练的教师网络和每一个分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后,使用已训练的教师网络的损失函数对每一个分割网络架构的损失函数进行指导并修正,包括利用如下公式对每一个分割网络架构的损失函数进行指导并修正:
L_KD = L_Student + coff * L_Teacher
In some embodiments, the trained teacher network and each segmentation network architecture are used to simultaneously perform image instance segmentation forward inference, and after each forward inference, the loss function of each segmentation network architecture is guided and corrected using the loss function of the trained teacher network, including guiding and correcting the loss function of each segmentation network architecture using the following formula:
L_KD = L_Student + coff * L_Teacher
其中,L_KD表示知识蒸馏网络总体损失,L_Student表示分割网络架构的损失,L_Teacher表示教师网络损失,coff表示一个在实际的网络训练过程中可调节的参数。Among them, L_KD represents the overall loss of the knowledge distillation network, L_Student represents the loss of the segmentation network architecture, L_Teacher represents the teacher network loss, and coff represents a parameter that can be adjusted during the actual network training process.
在一些实施例中,coff取值为0.3。In some embodiments, coff takes a value of 0.3.
可选的,通过神经网络架构搜索框架获得采样架构后,使用该采样架构与教师网络同时进行实例分割正向推理,在每次正向推理后,使用教师网络的损失函数对学生网络的损失函数进行指导并修正,如公式(1):
L_KD = L_Student + coff * L_Teacher  (1)
Optionally, after obtaining the sampling architecture through the neural network architecture search framework, the sampling architecture is used to perform instance segmentation forward inference simultaneously with the teacher network. After each forward inference, the loss function of the teacher network is used to guide and correct the loss function of the student network, as shown in formula (1):
L_KD = L_Student + coff * L_Teacher  (1)
(1)式中,L_KD表示知识蒸馏网络总体损失,L_Student表示分割网络架构的损失,L_Teacher表示教师网络损失,coff表示一个在实际的网络训练过程中可调节的参数。经过反复训练学生网络,学生网络可以逐渐获取教师网络对不同实例的各层次的特征图和边缘信息,实现对图像实例的像素级定位。In formula (1), L_KD represents the overall loss of the knowledge distillation network, L_Student represents the loss of the segmentation network architecture, L_Teacher represents the teacher network loss, and coff represents a parameter that can be adjusted during the actual network training process. Through repeated training of the student network, the student network gradually acquires the teacher network's feature maps and edge information at each level for different instances, achieving pixel-level localization of image instances.
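A minimal training-step sketch of formula (1) is shown below, assuming a frozen, already-trained teacher and an arbitrary segmentation criterion (for example the Dice loss defined later); names such as `criterion` and `optimizer` are placeholders introduced for illustration, not identifiers from the text.

```python
import torch

def distillation_step(student, teacher, criterion, images, labels, optimizer, coff=0.3):
    """One forward/backward pass following L_KD = L_Student + coff * L_Teacher."""
    with torch.no_grad():                              # the teacher is already trained
        teacher_loss = criterion(teacher(images), labels)
    student_loss = criterion(student(images), labels)
    loss_kd = student_loss + coff * teacher_loss       # formula (1)
    optimizer.zero_grad()
    loss_kd.backward()                                 # gradients flow only into the student
    optimizer.step()
    return loss_kd.item()
```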
在一些实施例中,还包括:In some embodiments, it also includes:
利用Dice损失公式计算分割网络架构的损失;其中,n为不同类别实例,pixels为像素点,y_true为对应类别的实际值,y_pred为对应类别的预测值。The loss of the segmentation network architecture is calculated using a Dice loss formula, where n indexes the different class instances, pixels indexes the pixel points, y_true is the actual value for the corresponding class, and y_pred is the predicted value for the corresponding class.
可选的,在计算时,并未考虑背景类,因为大量像素属于背景,加入背景类的计算后会对结果产生负面影响。在学生网络训练的过程中,需要最小化或者最大化目标函数,其中需要最小化目标的函数称为"损失函数"。损失函数的选择对于模型预测结果的精度至关重要,本申请实施例使用Dice Soft Loss作为损失函数,因为该损失函数可以针对不同类别的实例分别计算,该损失函数是语义分割任务中常用的损失函数,由基于dice系数的损失函数演变而来,表示不同类别预测值与实际值重叠的度量。对每个类别求其dice损失,再求和取平均值,详细表达式如下式:Optionally, the background class is not considered during the calculation, because a large number of pixels belong to the background, and adding the background class to the calculation would have a negative impact on the result. During student network training, the objective function needs to be minimized or maximized, and the function whose objective is to be minimized is called the "loss function". The choice of loss function is crucial to the accuracy of the model prediction results. The embodiment of the present application uses Dice Soft Loss as the loss function because it can be calculated separately for instances of different classes. This loss function is commonly used in semantic segmentation tasks and evolved from a loss function based on the dice coefficient, which measures the overlap between the predicted and actual values of different classes. The dice loss is computed for each class and then summed and averaged; the detailed expression is as follows:
L_Dice = (1/n) * Σ_n [ 1 - 2 * Σ_pixels( y_true * y_pred ) / ( Σ_pixels y_true + Σ_pixels y_pred ) ]
其中,n为不同类别实例,pixels为像素点,y_true为对应类别的实际值,y_pred为对应类别的预测值。Among them, n indexes the different class instances, pixels indexes the pixel points, y_true is the actual value for the corresponding class, and y_pred is the predicted value for the corresponding class.
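The following sketch computes a per-class soft Dice loss averaged over the classes, consistent with the description above; the tensor layout (one-hot labels with the background channel already removed) and the smoothing term eps are implementation assumptions.

```python
import torch

def dice_soft_loss(y_pred, y_true, eps=1e-6):
    """Per-class soft Dice loss averaged over the n classes.
    y_pred: (N, C, H, W) predicted probabilities; y_true: (N, C, H, W) one-hot labels."""
    dims = (0, 2, 3)                                    # sum over batch and pixels
    intersection = (y_pred * y_true).sum(dims)
    union = y_pred.sum(dims) + y_true.sum(dims)
    dice = (2 * intersection + eps) / (union + eps)     # Dice coefficient per class
    return (1 - dice).mean()                            # average over the classes
```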
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练并从若干个分割网络架构中确定最优的分割网络架构,包括:In some embodiments, selecting a plurality of segmentation network architectures from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determining an optimal segmentation network architecture from the plurality of segmentation network architectures includes:
获取每一个分割网络架构的平均交并比、频率加权交并比以及平均像素精度;Obtain the average IoU, frequency-weighted IoU, and average pixel accuracy for each segmentation network architecture;
利用平均交并比、频率加权交并比以及平均像素精度计算几何平均值;The geometric mean is calculated using the average IoU, frequency-weighted IoU, and average pixel precision;
根据几何平均值挑选若干个分割网络架构。Several segmentation network architectures were selected based on the geometric mean.
可选的,本申请实施例将训练集随机划分为两个不重叠的集合:初始训练集(Train DataSet0)和初始验证集(Valid DataSet0)。初始训练集可进行图像增强,用于在给定任务(即语义分割)上训练采样的架构;而初始验证集未经任何图像处理,用于评估训练的架构并为控制器提供标量(在强化学习文献中经常称为反馈)。搜索优化过程存在两个训练过程:采样架构的内部优化和控制器的外部优化。内部训练过程分为两个阶段,第一阶段为架构搜索阶段,在该阶段中,编码器的权重通过预训练得到,其输出经过计算后,存储在内存中,每次采样过程直接导入编码器输出,这样可以大量节约运算时间和效率,此时只有解码器在训练,方便解码器权重的快速自适应和对采样架构性能的合理估计。第二阶段为全量训练阶 段,但并非所有采样架构都可以进入该阶段,主要通过简单的模拟退火算法来决定是否继续为第二阶段训练采样架构。Optionally, the embodiment of the present application randomly divides the training set into two non-overlapping sets: an initial training set (Train DataSet0) and an initial validation set (Valid DataSet0). The initial training set can be image enhanced and is used to train the sampled architecture on a given task (i.e., semantic segmentation); the initial validation set is not subjected to any image processing and is used to evaluate the trained architecture and provide a scalar (often referred to as feedback in reinforcement learning literature) to the controller. There are two training processes in the search optimization process: internal optimization of the sampling architecture and external optimization of the controller. The internal training process is divided into two stages. The first stage is the architecture search stage. In this stage, the weights of the encoder are obtained through pre-training, and its output is calculated and stored in the memory. Each sampling process directly imports the encoder output, which can save a lot of computing time and efficiency. At this time, only the decoder is training, which facilitates the rapid adaptation of the decoder weights and the reasonable estimation of the sampling architecture performance. The second stage is the full training stage However, not all sampled architectures can enter this stage. A simple simulated annealing algorithm is used to decide whether to continue training the sampled architecture for the second stage.
之所以不训练全部的采样架构,是因为完成第一阶段训练的采样架构经过在当前batch上的训练后,已经可以预估未来的发展前景,提前终止没有前途的架构一则可以节约运算资源,二则可以更快地找到精度较高的目标架构。外部优化过程在给定采样序列、对数概率和反馈信号的情况下,通过近端策略优化(PPO)方法对控制器进行优化,在采样架构的多样性和调优过程的复杂性之间取得平衡,实现控制器网络模型更新和参数全局优化。The reason why all sampling architectures are not trained is that the sampling architectures that have completed the first stage of training can already predict future development prospects after training on the current batch. Terminating the unpromising architectures in advance can save computing resources and find the target architecture with higher accuracy more quickly. The external optimization process optimizes the controller through the proximal policy optimization (PPO) method given the sampling sequence, log probability and feedback signal, and strikes a balance between the diversity of the sampling architecture and the complexity of the tuning process, so as to achieve the controller network model update and global parameter optimization.
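As a rough illustration of the early-termination idea described above, the sketch below gates a sampled architecture on whether its reward beats the running average of past rewards, otherwise letting it through with a probability controlled by a temperature; the exact acceptance criterion is an assumption, not the one prescribed by the text.

```python
import math
import random

def continue_to_full_training(reward, running_avg_reward, temperature):
    """Simulated-annealing style gate for deciding whether a sampled architecture
    proceeds to the full-training stage."""
    if reward >= running_avg_reward:
        return True                                  # promising architectures always continue
    delta = reward - running_avg_reward              # negative when worse than average
    return random.random() < math.exp(delta / max(temperature, 1e-8))
```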
如上,本申请实施例在第一阶段之后保留反馈的运行平均值,来决定是否继续训练采样架构。在网络架构搜索过程中,评估架构未来前景优劣的标准,也就是reward,使用三个量的几何平均值:As mentioned above, the embodiment of the present application retains the running average of the feedback after the first stage to decide whether to continue training the sampled architecture. In the process of network architecture search, the criterion for evaluating the future prospects of the architecture, that is, reward, uses the geometric mean of three quantities:
平均交并比(mean intersection-over-union,mIoU),主要用于语义分割基准;
mIoU = (1/(k+1)) * Σ_{i=0..k} [ P_ii / ( Σ_{j=0..k} P_ij + Σ_{j=0..k} P_ji - P_ii ) ]
其中,k表示类别数量,i表示真实值,j表示预测值,Pij表示将i预测为j。以下相同。Among them, k represents the number of categories, i represents the true value, j represents the predicted value, and Pij represents predicting i as j. The same applies to the following.
频率加权交并比(frequency-weighted IoU,fwIoU),根据该类中存在的像素数缩放每个类IoU;
fwIoU = ( 1 / Σ_{i=0..k} Σ_{j=0..k} P_ij ) * Σ_{i=0..k} [ ( Σ_{j=0..k} P_ij ) * P_ii / ( Σ_{j=0..k} P_ij + Σ_{j=0..k} P_ji - P_ii ) ]
平均像素精度(mean-pixel accuracy,MPA),即平均每个类别的正确像素数。
MPA = (1/(k+1)) * Σ_{i=0..k} [ P_ii / Σ_{j=0..k} P_ij ]
求以上三个量的几何平均值:
reward = ( mIoU * fwIoU * MPA )^(1/3)
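A small NumPy sketch of this reward computation from a pixel-level confusion matrix follows; the epsilon terms guard against empty classes and are an implementation assumption.

```python
import numpy as np

def reward_from_confusion(cm):
    """Geometric mean of mIoU, fwIoU and MPA from a (k+1)x(k+1) confusion matrix,
    where cm[i, j] counts pixels of true class i predicted as class j."""
    cm = cm.astype(float)
    tp = np.diag(cm)
    row = cm.sum(axis=1)                          # pixels belonging to each true class
    col = cm.sum(axis=0)                          # pixels predicted as each class
    iou = tp / (row + col - tp + 1e-12)
    miou = iou.mean()                             # mean intersection-over-union
    fwiou = (row / row.sum()) @ iou               # frequency-weighted IoU
    mpa = (tp / (row + 1e-12)).mean()             # mean pixel accuracy
    return (miou * fwiou * mpa) ** (1.0 / 3.0)    # geometric mean used as the reward
```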
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练,包括:In some embodiments, a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm, including:
根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构分别进行第一阶段的全量训练、第二阶段的全量训练、第三阶段的全量训练。According to the simulated annealing algorithm, several segmentation network architectures are selected from multiple segmentation network architectures to perform full training in the first stage, full training in the second stage, and full training in the third stage respectively.
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第一阶段的全量训练,包括:In some embodiments, a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm to perform full training in the first phase, including:
利用增强数据集训练50个epoch,其中使用的辅助单元参数为0.2。The augmented dataset is used for training for 50 epochs, where the auxiliary unit parameter used is 0.2.
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第二阶段的全量训练,包括:In some embodiments, a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the second phase, including:
在经过第一阶段训练完毕的模型参数基础上,利用增强数据集训练50个epoch,其中使用的辅助单元参数为0.2。Based on the model parameters trained in the first stage, the enhanced dataset was used to train 50 epochs, with the auxiliary unit parameter used being 0.2.
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第三阶段的全量训练,包括:In some embodiments, a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the third phase, including:
在经过第二阶段训练完毕的模型参数基础上,利用增强数据集训练50个epoch,其中使用的辅助单元参数为0.15并冻结BN层。Based on the model parameters trained in the second stage, the enhanced dataset was used to train 50 epochs, with the auxiliary unit parameter of 0.15 and the BN layer frozen.
可选的,在本申请实施例中,在每个epoch中增强数据集被全部输入到若干个分割网络架构中,即一次epoch(时期)可以理解为将所有的样本训练一次的过程。详细来说,当一个完整的数据集通过了神经网络一次并且返回了一次,这个过程称为一次epoch,也就是说,所有训练样本在神经网络中都进行了一次正向传播和反向传播。Optionally, in an embodiment of the present application, in each epoch, the enhanced data set is fully input into several segmentation network architectures, that is, one epoch can be understood as the process of training all samples once. In detail, when a complete data set passes through the neural network once and returns once, this process is called an epoch, that is, all training samples undergo a forward propagation and a backward propagation in the neural network.
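To make the three-stage schedule concrete, the sketch below shows an assumed per-stage configuration (epochs, auxiliary-unit weight, whether BN is frozen) together with a helper that freezes BatchNorm layers as required in the third stage; it is illustrative only.

```python
import torch.nn as nn

# assumed stage schedule: (epochs, auxiliary-unit weight, freeze BatchNorm?)
STAGES = [(50, 0.2, False), (50, 0.2, False), (50, 0.15, True)]

def freeze_batchnorm(model):
    """Freeze all BatchNorm layers: stop updating running statistics and
    stop gradients for their affine parameters (third full-training stage)."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()
            for p in m.parameters():
                p.requires_grad = False
```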
本申请实施例提出的方案利用数据增强方法,对数据集进行图像增强;然后在增强数据集上,对DeepLabV3+神经网络进行训练,获得图像的分割信息,并将其用作教师网络;使用知识蒸馏的方法,对搜索的学生网络训练过程进行指导并修正,并针对图像分割数据的不同类别进行损失函数计算,可以有效提高小样本数据在图像分割中的检测精度。这样能够在较小计算开支的情况下,快速获得轻量级的语义分割模型,以较快的推理速度实现更为可靠的图像分割预测结果,在自动驾驶场景具有更佳适配性。The solution proposed in the embodiment of this application uses the data enhancement method to enhance the image of the data set; then, the DeepLabV3+ neural network is trained on the enhanced data set to obtain the image segmentation information and use it as the teacher network; the knowledge distillation method is used to guide and correct the search student network training process, and the loss function is calculated for different categories of image segmentation data, which can effectively improve the detection accuracy of small sample data in image segmentation. In this way, a lightweight semantic segmentation model can be quickly obtained with less computing overhead, and more reliable image segmentation prediction results can be achieved with faster reasoning speed, which has better adaptability in autonomous driving scenarios.
基于同一发明构思,根据本申请的另一个方面,本申请的实施例还提供了一种图像实例分割系统400,如图6所示,包括:Based on the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides an image instance segmentation system 400, as shown in FIG6 , comprising:
获取模块401,被配置为获取已训练的教师网络和控制器网络;An acquisition module 401 is configured to acquire a trained teacher network and a controller network;
搜索模块402,被配置为利用控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构; A search module 402 is configured to search for a plurality of decoder structures using the controller network and construct a plurality of segmentation network architectures using each decoder structure and a fixed encoder;
评估模块403,被配置为利用已训练的教师网络和每一个分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后使用已训练的教师网络的损失函数对每一个分割网络架构的损失函数进行指导并修正,以及根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练并从若干个分割网络架构中确定最优的分割网络架构;An evaluation module 403 is configured to use the trained teacher network and each segmentation network architecture to simultaneously perform image instance segmentation forward inference, and use the loss function of the trained teacher network to guide and correct the loss function of each segmentation network architecture after each forward inference, and select a plurality of segmentation network architectures from the plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determine the optimal segmentation network architecture from the plurality of segmentation network architectures;
图像实例分割模块404,被配置为利用最优的分割网络架构对待分割的图像进行图像实例分割。The image instance segmentation module 404 is configured to perform image instance segmentation on the image to be segmented using an optimal segmentation network architecture.
在一些实施例中,还包括教师网络构建模块,被配置为:In some embodiments, a teacher network building module is further included, configured to:
利用骨干网络以及空洞空间金字塔池化ASPP模块构建教师网络的编码器,其中,ASPP模块包括四种不同膨胀率的空洞卷积块和一个全局平均池化块;The encoder of the teacher network is constructed using the backbone network and the atrous spatial pyramid pooling (ASPP) module, where the ASPP module includes four atrous convolution blocks with different dilation rates and a global average pooling block.
利用上采样模块、1×1卷积块、3×3卷积块构建教师网络的解码器,其中,教师网络的解码器将骨干网络中间层输出的低级特征图和ASPP模块的输出作为输入。The decoder of the teacher network is constructed using an upsampling module, a 1×1 convolution block, and a 3×3 convolution block, where the decoder of the teacher network takes the low-level feature maps output by the middle layer of the backbone network and the output of the ASPP module as input.
在一些实施例中,还包括教师网络训练模块,被配置为:In some embodiments, a teacher network training module is further included, configured to:
构建第一训练集;Construct the first training set;
利用骨干网络以及ASPP模块对训练集中的图像进行处理;Use the backbone network and ASPP module to process the images in the training set;
利用解码器对骨干网络中间层输出的低级特征图和ASPP模块的输出进行处理以得到图像实例分割结果;The decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain the image instance segmentation result;
根据图像实例分割结果调整教师网络的编码器和解码器的参数以对教师网络进行训练。The parameters of the encoder and decoder of the teacher network are adjusted according to the image instance segmentation results to train the teacher network.
在一些实施例中,教师网络训练模块还被配置为:In some embodiments, the teacher network training module is further configured to:
对第一训练集中的数据进行数据增强。Perform data augmentation on the data in the first training set.
在一些实施例中,教师网络训练模块还被配置为:In some embodiments, the teacher network training module is further configured to:
利用骨干网络提取训练集中图像的多层语义特征并利用ASPP模块对多层语义特征以不同采样率的空洞卷积并行采样以得到五组特征图,并将五组特征图拼接后输入到教师网络的解码器。The backbone network is used to extract multi-layer semantic features of the images in the training set, and the ASPP module is used to sample the multi-layer semantic features in parallel with dilated convolutions at different sampling rates to obtain five sets of feature maps, which are then concatenated and input into the decoder of the teacher network.
在一些实施例中,教师网络训练模块还被配置为:In some embodiments, the teacher network training module is further configured to:
利用上采样模块对来自ASPP模块的特征图进行插值上采样并利用1×1卷积块对来自骨干网络中间层输出的低级特征图进行通道降维;The upsampling module is used to interpolate and upsample the feature map from the ASPP module, and the 1×1 convolution block is used to reduce the channel dimension of the low-level feature map output from the intermediate layer of the backbone network;
将通道降维的低级特征图和线性插值上采样得到的特征图拼接,并送入3×3卷积块进行处理,并再次利用上采样模块进行线性插值上采样以得到图像实例分割结果。The low-level feature map of channel dimension reduction and the feature map obtained by linear interpolation upsampling are concatenated and sent to a 3×3 convolution block for processing. The upsampling module is then used for linear interpolation upsampling again to obtain the image instance segmentation result.
在一些实施例中,教师网络训练模块还被配置为:In some embodiments, the teacher network training module is further configured to:
将骨干网络中间层输出的低级特征图的通道降到48。Reduce the number of channels of the low-level feature map output by the intermediate layer of the backbone network to 48.
在一些实施例中,控制器网络包括100个隐藏单元的两层递归LSTM神经网络,且所有隐藏单元从均匀分布中随机初始化。In some embodiments, the controller network comprises a two-layer recursive LSTM neural network with 100 hidden units, and all hidden units are randomly initialized from a uniform distribution.
在一些实施例中,搜索模块还被配置为: In some embodiments, the search module is further configured to:
获取预设的第一解码器块、第二解码器块、第三解码器块以及第四解码器块;Obtaining a preset first decoder block, a second decoder block, a third decoder block, and a fourth decoder block;
利用控制器网络在预设的搜索空间中搜索第五解码器块和第六解码器块的内部结构,以及第一解码器块、第二解码器块、第三解码器块、第四解码器块、第五解码器块和第六解码器块之间的连接方式。The controller network is used to search the internal structure of the fifth decoder block and the sixth decoder block in a preset search space, and the connection mode between the first decoder block, the second decoder block, the third decoder block, the fourth decoder block, the fifth decoder block and the sixth decoder block.
在一些实施例中,搜索空间包括1×1卷积,3×3卷积,3×3可分离卷积,5×5可分离卷积,全局平均池化、上采样、1×1卷积模块,扩张率为3的3×3卷积,扩张率为12的3×3卷积,扩张率为3的可分离3×3卷积,扩张率为6的可分离5×5卷积,跳跃连接,有效地使路径无效的零操作。In some embodiments, the search space includes 1×1 convolution, 3×3 convolution, 3×3 separable convolution, 5×5 separable convolution, global average pooling, upsampling, 1×1 convolution module, 3×3 convolution with dilation rate 3, 3×3 convolution with dilation rate 12, separable 3×3 convolution with dilation rate 3, separable 5×5 convolution with dilation rate 6, skip connections, and zero operations that effectively invalidate the paths.
在一些实施例中,评估模块还被配置为利用如下公式对每一个分割网络架构的损失函数进行指导并修正:
L_KD = L_Student + coff * L_Teacher
In some embodiments, the evaluation module is further configured to guide and correct the loss function of each segmentation network architecture using the following formula:
L_KD = L_Student + coff * L_Teacher
其中,L_KD表示知识蒸馏网络总体损失,L_Student表示分割网络架构的损失,L_Teacher表示教师网络损失,coff表示一个在实际的网络训练过程中可调节的参数。Among them, L_KD represents the overall loss of the knowledge distillation network, L_Student represents the loss of the segmentation network architecture, L_Teacher represents the teacher network loss, and coff represents a parameter that can be adjusted during the actual network training process.
在一些实施例中,coff取值为0.3。In some embodiments, coff takes a value of 0.3.
在一些实施例中,评估模块还被配置为:In some embodiments, the evaluation module is further configured to:
利用Dice损失公式计算分割网络架构的损失;其中,n为不同类别实例,pixels为像素点,y_true为对应类别的实际值,y_pred为对应类别的预测值。The loss of the segmentation network architecture is calculated using a Dice loss formula, where n indexes the different class instances, pixels indexes the pixel points, y_true is the actual value for the corresponding class, and y_pred is the predicted value for the corresponding class.
在一些实施例中,评估模块还被配置为:In some embodiments, the evaluation module is further configured to:
获取每一个分割网络架构的平均交并比、频率加权交并比以及平均像素精度;Obtain the average IoU, frequency-weighted IoU, and average pixel accuracy for each segmentation network architecture;
利用平均交并比、频率加权交并比以及平均像素精度计算几何平均值;The geometric mean is calculated using the average IoU, frequency-weighted IoU, and average pixel precision;
根据几何平均值挑选若干个分割网络架构。Several segmentation network architectures were selected based on the geometric mean.
在一些实施例中,评估模块还被配置为:In some embodiments, the evaluation module is further configured to:
根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构分别进行第一阶段的全量训练、第二阶段的全量训练、第三阶段的全量训练。According to the simulated annealing algorithm, several segmentation network architectures are selected from multiple segmentation network architectures to perform full training in the first stage, full training in the second stage, and full training in the third stage respectively.
在一些实施例中,评估模块还被配置为:In some embodiments, the evaluation module is further configured to:
利用增强数据集对所述若干个分割网络架构训练50个时期epoch,其中使用的辅助单元参数为0.2。The several segmentation network architectures are trained for 50 epochs using the enhanced dataset, with an auxiliary unit parameter of 0.2.
在一些实施例中,评估模块还被配置为:In some embodiments, the evaluation module is further configured to:
在经过第一阶段训练完毕的模型参数基础上,利用所述增强数据集对所述若干个分割网络架构训练50个所述epoch,其中使用的辅助单元参数为0.2。 Based on the model parameters trained in the first stage, the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.2.
在一些实施例中,评估模块还被配置为:In some embodiments, the evaluation module is further configured to:
在经过第二阶段训练完毕的模型参数基础上,利用所述增强数据集对所述若干个分割网络架构训练50个所述epoch,其中使用的辅助单元参数为0.15并冻结BN层。Based on the model parameters trained in the second stage, the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.15 and the BN layer is frozen.
基于同一发明构思,根据本申请的另一个方面,如图7所示,本申请的实施例还提供了一种计算机设备501,包括:Based on the same inventive concept, according to another aspect of the present application, as shown in FIG. 7 , an embodiment of the present application further provides a computer device 501, including:
至少一个处理器520;以及at least one processor 520; and
存储器510,存储器510存储有可在处理器上运行的计算机程序511,处理器520执行程序时执行如上的任一种图像实例分割方法的步骤。The memory 510 stores a computer program 511 that can be run on the processor. When the processor 520 executes the program, the processor 520 performs the steps of any of the above image instance segmentation methods.
基于同一发明构思,根据本申请的另一个方面,如图8所示,本申请的实施例还提供了一种非易失性可读存储介质601,非易失性可读存储介质601存储有计算机程序610,计算机程序610被处理器执行时执行如上的任一种图像实例分割方法的步骤。Based on the same inventive concept, according to another aspect of the present application, as shown in Figure 8, an embodiment of the present application also provides a non-volatile readable storage medium 601, and the non-volatile readable storage medium 601 stores a computer program 610. When the computer program 610 is executed by a processor, the steps of any one of the above image instance segmentation methods are performed.
最后需要说明的是,本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,可以通过计算机程序来指令相关硬件来完成,程序可存储于一非易失性可读存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。Finally, it should be noted that a person skilled in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing the relevant hardware through a computer program, and the program can be stored in a non-volatile readable storage medium. When the program is executed, it can include the processes of the embodiments of the above-mentioned methods.
此外,应该明白的是,本文的非易失性可读存储介质(例如,存储器)可以是易失性存储器或非易失性存储器,或者可以包括易失性存储器和非易失性存储器两者。Furthermore, it should be appreciated that the non-volatile readable storage medium (eg, memory) herein may be either a volatile memory or a non-volatile memory, or may include both volatile memory and non-volatile memory.
本领域技术人员还将明白的是,结合这里的公开所描述的各种示例性逻辑块、模块、电路和算法步骤可以被实现为电子硬件、计算机软件或两者的组合。为了清楚地说明硬件和软件的这种可互换性,已经就各种示意性组件、方块、模块、电路和步骤的功能对其进行了一般性的描述。这种功能是被实现为软件还是被实现为硬件取决于实际应用以及施加给整个系统的设计约束。本领域技术人员可以针对每种实际应用以各种方式来实现所描述的功能,但是这种实现决定不应被解释为导致脱离本申请实施例公开的范围。It will also be appreciated by those skilled in the art that the various exemplary logic blocks, modules, circuits and algorithm steps described in conjunction with the disclosure herein can be implemented as electronic hardware, computer software or a combination of the two. In order to clearly illustrate this interchangeability of hardware and software, the functions of various illustrative components, blocks, modules, circuits and steps have been described in general terms. Whether such functionality is implemented as software or as hardware depends on the particular application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in various ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope disclosed in the embodiments of the present application.
以上是本申请公开的示例性实施例,但是应当注意,在不背离权利要求限定的本申请实施例公开的范围的前提下,可以进行多种改变和修改。根据这里描述的公开实施例的方法权利要求的功能、步骤和/或动作不需以任何特定顺序执行。此外,尽管本申请实施例公开的元素可以以个体形式描述或要求,但除非明确限制为单数,也可以理解为多个。The above are exemplary embodiments disclosed in the present application, but it should be noted that various changes and modifications may be made without departing from the scope disclosed in the embodiments of the present application as defined in the claims. The functions, steps and/or actions of the method claims according to the disclosed embodiments described herein do not need to be performed in any particular order. In addition, although the elements disclosed in the embodiments of the present application may be described or required in individual form, they may also be understood as multiple unless explicitly limited to the singular.
应当理解的是,在本文中使用的,除非上下文清楚地支持例外情况,单数形式“一个”旨在也包括复数形式。还应当理解的是,在本文中使用的“和/或”是指包括一个或者一个以上相关联地列出的项目的任意和所有可能组合。It should be understood that, as used herein, the singular forms "a", "an" are intended to include the plural forms as well, unless the context clearly supports an exception. It should also be understood that, as used herein, "and/or" refers to any and all possible combinations including one or more of the associated listed items.
上述本申请实施例公开实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the embodiments disclosed in the above-mentioned embodiments of the present application are only for description and do not represent the advantages or disadvantages of the embodiments.
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,程序可以存储于一种非易失性可读存储介质中,上述提到的非易失性可读存储 介质可以是只读存储器,磁盘或光盘等。Those skilled in the art will appreciate that all or part of the steps of the above embodiments may be accomplished by hardware, or by a program to instruct the relevant hardware to accomplish the steps. The program may be stored in a non-volatile readable storage medium. The medium can be a read-only memory, a magnetic or optical disk, etc.
所属领域的普通技术人员应当理解:以上任何实施例的讨论仅为示例性的,并非旨在暗示本申请实施例公开的范围(包括权利要求)被限于这些例子;在本申请实施例的思路下,以上实施例或者不同实施例中的技术特征之间也可以进行组合,并存在如上的本申请实施例的不同方而的许多其它变化,为了简明它们没有在细节中提供。因此,凡在本申请实施例的精神和原则之内,所做的任何省略、修改、等同替换、改进等,均应包含在本申请实施例的保护范围之内。 Those skilled in the art should understand that the discussion of any of the above embodiments is only exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the present application (including the claims) is limited to these examples; under the idea of the embodiments of the present application, the technical features in the above embodiments or different embodiments can also be combined, and there are many other changes in different aspects of the embodiments of the present application as above, which are not provided in detail for the sake of simplicity. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the embodiments of the present application should be included in the protection scope of the embodiments of the present application.

Claims (21)

  1. A method for image instance segmentation, characterized in that it comprises the following steps:
    obtaining a trained teacher network and a controller network;
    searching for a plurality of decoder structures using the controller network and forming a plurality of segmentation network architectures from each decoder structure and a fixed encoder;
    performing image instance segmentation forward inference simultaneously with the trained teacher network and each of the segmentation network architectures, using the loss function of the trained teacher network to guide and correct the loss function of each of the segmentation network architectures after each forward inference, selecting several segmentation network architectures from the plurality of segmentation network architectures for full training according to a simulated annealing algorithm, and determining the optimal segmentation network architecture from the several segmentation network architectures;
    performing image instance segmentation on an image to be segmented using the optimal segmentation network architecture.
  2. The method according to claim 1, characterized in that obtaining the trained teacher network comprises:
    constructing an encoder of the teacher network using a backbone network and an atrous spatial pyramid pooling (ASPP) module, wherein the ASPP module comprises four atrous convolution blocks with different dilation rates and one global average pooling block;
    constructing a decoder of the teacher network using an upsampling module, a 1×1 convolution block and a 3×3 convolution block, wherein the decoder of the teacher network takes, as input, the low-level feature map output by an intermediate layer of the backbone network and the output of the ASPP module.
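As a non-authoritative illustration of the encoder described in claim 2, the following PyTorch sketch builds an ASPP head with four dilated 3×3 convolution branches and one global-average-pooling branch on top of an arbitrary backbone feature map. The dilation rates (1, 6, 12, 18) and the 256-channel width are assumptions chosen for illustration; the claim itself only fixes the number and kind of branches.

    # Hedged PyTorch sketch of the ASPP module from claim 2.  The dilation
    # rates (1, 6, 12, 18) and the 256-channel width are illustrative
    # assumptions; the claim only specifies four atrous convolution branches
    # plus one global-average-pooling branch.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ASPP(nn.Module):
        def __init__(self, in_ch, out_ch=256, rates=(1, 6, 12, 18)):
            super().__init__()
            # Four atrous (dilated) 3x3 convolution branches.
            self.branches = nn.ModuleList([
                nn.Sequential(
                    nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
                for r in rates
            ])
            # Global-average-pooling branch (image-level features).
            self.pool = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            # 1x1 projection applied after concatenating the five feature groups.
            self.project = nn.Sequential(
                nn.Conv2d(5 * out_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

        def forward(self, x):
            feats = [b(x) for b in self.branches]
            pooled = F.interpolate(self.pool(x), size=x.shape[-2:],
                                   mode='bilinear', align_corners=False)
            return self.project(torch.cat(feats + [pooled], dim=1))

Concatenating the four dilated branches with the pooled branch yields the five feature-map groups referred to in claim 5 before they are passed to the decoder.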
  3. The method according to claim 2, characterized in that it further comprises:
    constructing a first training set;
    processing images in the training set using the backbone network and the ASPP module;
    processing, by the decoder, the low-level feature map output by the intermediate layer of the backbone network and the output of the ASPP module to obtain an image instance segmentation result;
    adjusting parameters of the encoder and the decoder of the teacher network according to the image instance segmentation result so as to train the teacher network.
  4. The method according to claim 3, characterized in that it further comprises:
    performing data augmentation on the data in the first training set.
  5. The method according to claim 3, characterized in that processing the images in the training set using the backbone network and the ASPP module comprises:
    extracting multi-layer semantic features of the images in the training set using the backbone network, sampling the multi-layer semantic features in parallel with atrous convolutions at different sampling rates using the ASPP module to obtain five groups of feature maps, and concatenating the five groups of feature maps before inputting them into the decoder of the teacher network.
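Claims 3 to 5 describe a conventional supervised training loop for the teacher: backbone and ASPP features plus a mid-level feature map pass through the decoder, and the encoder and decoder parameters are updated from the segmentation result. A hedged sketch of one such training step follows; the backbone, ASPP and decoder objects are stand-in modules, and the cross-entropy objective is an assumption for illustration (the patent's own losses are the subject of claims 11 to 13).

    # Hedged sketch of one teacher-network training step (claims 3-5).
    # The backbone/aspp/decoder arguments are assumed nn.Module-like callables;
    # cross-entropy is an assumed objective used only for illustration.
    import torch
    import torch.nn as nn

    def train_teacher_step(backbone, aspp, decoder, optimizer, images, labels):
        low_feat, high_feat = backbone(images)        # mid-level + deep features
        seg_logits = decoder(low_feat, aspp(high_feat))
        loss = nn.functional.cross_entropy(seg_logits, labels)
        optimizer.zero_grad()
        loss.backward()                               # adjusts encoder and decoder parameters
        optimizer.step()
        return loss.item()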
  6. The method according to claim 3, characterized in that processing, by the decoder, the low-level feature map output by the intermediate layer of the backbone network and the output of the ASPP module to obtain an image segmentation result comprises:
    performing interpolation upsampling on the feature map from the ASPP module using the upsampling module, and performing channel dimensionality reduction on the low-level feature map output by the intermediate layer of the backbone network using the 1×1 convolution block;
    concatenating the channel-reduced low-level feature map with the feature map obtained by linear-interpolation upsampling, feeding the result into the 3×3 convolution block for processing, and performing linear-interpolation upsampling again using the upsampling module to obtain the image instance segmentation result.
  7. The method according to claim 6, characterized in that performing channel dimensionality reduction on the low-level feature map output by the intermediate layer of the backbone network using the 1×1 convolution block comprises:
    reducing the number of channels of the low-level feature map output by the intermediate layer of the backbone network to 48.
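A minimal PyTorch sketch of the decoder path described in claims 6 and 7 is given below: the ASPP output is bilinearly upsampled, the backbone's mid-level feature map is reduced to 48 channels with a 1×1 convolution, the two are concatenated and refined by a 3×3 convolution block, and a final upsampling produces the segmentation logits. The final 4× upsampling factor and the 256-channel refinement width are assumptions; only the 48-channel reduction comes from claim 7.

    # Hedged PyTorch sketch of the decoder from claims 6-7.  The 4x final
    # upsampling factor and 256-channel refinement width are illustrative
    # assumptions; the reduction to 48 channels follows claim 7.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Decoder(nn.Module):
        def __init__(self, low_ch, aspp_ch=256, num_classes=21):
            super().__init__()
            self.reduce = nn.Sequential(              # 1x1 conv: channels -> 48
                nn.Conv2d(low_ch, 48, 1, bias=False),
                nn.BatchNorm2d(48), nn.ReLU(inplace=True))
            self.refine = nn.Sequential(              # 3x3 conv block after concatenation
                nn.Conv2d(aspp_ch + 48, 256, 3, padding=1, bias=False),
                nn.BatchNorm2d(256), nn.ReLU(inplace=True),
                nn.Conv2d(256, num_classes, 1))

        def forward(self, low_feat, aspp_feat):
            # Interpolation upsampling of the ASPP output to the low-level feature size.
            x = F.interpolate(aspp_feat, size=low_feat.shape[-2:],
                              mode='bilinear', align_corners=False)
            x = torch.cat([x, self.reduce(low_feat)], dim=1)
            x = self.refine(x)
            # Second linear-interpolation upsampling back toward input resolution.
            return F.interpolate(x, scale_factor=4,
                                 mode='bilinear', align_corners=False)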
  8. The method according to claim 1, characterized in that the controller network comprises a two-layer recurrent LSTM neural network with 100 hidden units, and all hidden units are randomly initialized from a uniform distribution.
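Claim 8 fixes the controller as a two-layer recurrent LSTM with 100 hidden units whose weights are initialized from a uniform distribution. A hedged PyTorch sketch of such a controller is shown below; the token vocabulary size, the ±0.1 uniform range, and the autoregressive sampling loop are illustrative assumptions.

    # Hedged sketch of the claim-8 controller: a two-layer LSTM with 100 hidden
    # units, all weights initialized from a uniform distribution.  Vocabulary
    # size and the +-0.1 uniform range are assumptions.
    import torch
    import torch.nn as nn

    class Controller(nn.Module):
        def __init__(self, num_tokens, hidden=100, layers=2, init_range=0.1):
            super().__init__()
            self.embed = nn.Embedding(num_tokens, hidden)
            self.lstm = nn.LSTM(hidden, hidden, num_layers=layers, batch_first=True)
            self.head = nn.Linear(hidden, num_tokens)
            for p in self.parameters():               # uniform random initialization
                nn.init.uniform_(p, -init_range, init_range)

        def sample(self, steps):
            """Autoregressively sample one decoder description as a token list."""
            token = torch.zeros(1, 1, dtype=torch.long)
            state, tokens = None, []
            for _ in range(steps):
                out, state = self.lstm(self.embed(token), state)
                dist = torch.distributions.Categorical(logits=self.head(out[:, -1]))
                token = dist.sample().unsqueeze(0)
                tokens.append(token.item())
            return tokens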
  9. The method according to claim 1, characterized in that searching for a plurality of decoder structures using the controller network and forming a plurality of segmentation network architectures from each decoder structure and a fixed encoder comprises:
    obtaining a preset first decoder block, second decoder block, third decoder block and fourth decoder block;
    searching, using the controller network, in a preset search space for the internal structures of a fifth decoder block and a sixth decoder block, and for the connection mode among the first, second, third, fourth, fifth and sixth decoder blocks.
  10. The method according to claim 9, characterized in that the search space comprises: 1×1 convolution; 3×3 convolution; 3×3 separable convolution; 5×5 separable convolution; a global average pooling, upsampling and 1×1 convolution module; 3×3 convolution with a dilation rate of 3; 3×3 convolution with a dilation rate of 12; separable 3×3 convolution with a dilation rate of 3; separable 5×5 convolution with a dilation rate of 6; skip connection; and a zero operation that effectively invalidates a path.
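The candidate operations enumerated in claim 10 can be written down as a dictionary of constructors, as in the hedged PyTorch sketch below. The channel handling, the exact composition of the separable convolutions, and the reading of "global average pooling, upsampling and 1×1 convolution module" as a single compound operation are assumptions.

    # Hedged encoding of the claim-10 search space as candidate operations.
    # Channel handling and the composition of the separable convolutions are
    # illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def sep_conv(c, k, dilation=1):
        """Depthwise-separable convolution (the exact composition is assumed)."""
        pad = dilation * (k // 2)
        return nn.Sequential(
            nn.Conv2d(c, c, k, padding=pad, dilation=dilation, groups=c, bias=False),
            nn.Conv2d(c, c, 1, bias=False),
            nn.BatchNorm2d(c), nn.ReLU(inplace=True))

    class GlobalPoolUpConv(nn.Module):
        """Global average pooling -> 1x1 convolution -> upsampling back to the input size."""
        def __init__(self, c):
            super().__init__()
            self.conv = nn.Conv2d(c, c, 1)
        def forward(self, x):
            pooled = self.conv(x.mean(dim=(2, 3), keepdim=True))
            return F.interpolate(pooled, size=x.shape[-2:],
                                 mode='bilinear', align_corners=False)

    def search_space(c):
        """Candidate operations of claim 10, keyed by a readable name."""
        return {
            'conv1x1':     nn.Conv2d(c, c, 1),
            'conv3x3':     nn.Conv2d(c, c, 3, padding=1),
            'sep3x3':      sep_conv(c, 3),
            'sep5x5':      sep_conv(c, 5),
            'gap_up_1x1':  GlobalPoolUpConv(c),
            'conv3x3_d3':  nn.Conv2d(c, c, 3, padding=3, dilation=3),
            'conv3x3_d12': nn.Conv2d(c, c, 3, padding=12, dilation=12),
            'sep3x3_d3':   sep_conv(c, 3, dilation=3),
            'sep5x5_d6':   sep_conv(c, 5, dilation=6),
            'skip':        nn.Identity(),
            'zero':        lambda x: torch.zeros_like(x),  # effectively disables the path
        }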
  11. The method according to claim 1, characterized in that using the trained teacher network and each of the segmentation network architectures to simultaneously perform image instance segmentation forward inference, and, after each forward inference, using the loss function of the trained teacher network to guide and correct the loss function of each of the segmentation network architectures, comprises guiding and correcting the loss function of each of the segmentation network architectures using the following formula:
    L_KD = L_Student + coff × L_Teacher
    where L_KD denotes the overall loss of the knowledge distillation network, L_Student denotes the loss of the segmentation network architecture, L_Teacher denotes the teacher network loss, and coff denotes a parameter that is adjustable during actual network training.
  12. The method according to claim 11, characterized in that the value of coff is 0.3.
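The distillation objective in claims 11 and 12 combines the candidate (student) loss with the teacher loss through a tunable coefficient. A direct transcription into Python follows; only the 0.3 default comes from claim 12, and the example tensors are placeholders.

    # Direct transcription of the claim-11 objective: L_KD = L_Student + coff * L_Teacher.
    # The default coff = 0.3 follows claim 12; the example values are placeholders.
    import torch

    def kd_loss(student_loss: torch.Tensor, teacher_loss: torch.Tensor,
                coff: float = 0.3) -> torch.Tensor:
        """Overall knowledge-distillation loss guiding each candidate architecture."""
        return student_loss + coff * teacher_loss

    # Usage sketch with placeholder scalar losses:
    l_kd = kd_loss(torch.tensor(0.82), torch.tensor(0.41))   # -> 0.82 + 0.3 * 0.41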
  13. The method according to claim 1, characterized in that it further comprises:
    calculating the loss of the segmentation network architecture using a formula (not reproduced in this text), where n denotes the different instance categories, pixels denotes the pixel positions, y_true denotes the ground-truth value of the corresponding category, and y_pred denotes the predicted value of the corresponding category.
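The formula of claim 13 itself is not reproduced in this text. Purely as an assumption, a per-class, per-pixel overlap loss over y_true and y_pred that is consistent with the quantities the claim names is the soft Dice loss sketched below; this is not asserted to be the patented formula.

    # Assumed stand-in for the claim-13 loss (the claim's own formula is not
    # reproduced in this text): a soft Dice-style loss averaged over the n
    # instance categories, computed from per-pixel y_true and y_pred.
    import torch

    def soft_dice_loss(y_pred: torch.Tensor, y_true: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
        """y_pred, y_true: (batch, n_classes, H, W); y_pred already in [0, 1]."""
        dims = (0, 2, 3)                                 # sum over batch and pixels
        inter = (y_pred * y_true).sum(dims)
        union = y_pred.sum(dims) + y_true.sum(dims)
        dice_per_class = (2 * inter + eps) / (union + eps)
        return 1 - dice_per_class.mean()                 # average over the n categories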
  14. The method according to claim 1, characterized in that selecting several segmentation network architectures from the plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determining the optimal segmentation network architecture from the several segmentation network architectures comprises:
    obtaining the mean intersection-over-union, the frequency-weighted intersection-over-union and the mean pixel accuracy of each segmentation network architecture;
    calculating a geometric mean from the mean intersection-over-union, the frequency-weighted intersection-over-union and the mean pixel accuracy;
    selecting several segmentation network architectures according to the geometric mean.
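Claim 14 scores each candidate with the geometric mean of its mean IoU, frequency-weighted IoU and mean pixel accuracy. A small Python sketch of that selection score follows; the candidate names and metric values are placeholders.

    # Sketch of the claim-14 selection score: the geometric mean of mean IoU,
    # frequency-weighted IoU and mean pixel accuracy for one candidate.
    def selection_score(miou: float, fwiou: float, mpa: float) -> float:
        return (miou * fwiou * mpa) ** (1.0 / 3.0)

    # Placeholder metrics for three hypothetical candidates:
    candidates = {'arch_a': (0.71, 0.80, 0.88),
                  'arch_b': (0.69, 0.83, 0.90),
                  'arch_c': (0.73, 0.78, 0.86)}
    shortlist = sorted(candidates, key=lambda k: selection_score(*candidates[k]),
                       reverse=True)   # architectures ranked for full training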
  15. The method according to claim 1, characterized in that selecting several segmentation network architectures from the plurality of segmentation network architectures for full training according to a simulated annealing algorithm comprises:
    selecting several segmentation network architectures from the plurality of segmentation network architectures according to the simulated annealing algorithm, and performing on them a first stage of full training, a second stage of full training and a third stage of full training respectively.
  16. The method according to claim 15, characterized in that performing the first stage of full training on the several segmentation network architectures selected according to the simulated annealing algorithm comprises:
    training the several segmentation network architectures for 50 epochs using an augmented dataset, wherein the auxiliary unit parameter used is 0.2.
  17. The method according to claim 16, characterized in that performing the second stage of full training on the several segmentation network architectures selected according to the simulated annealing algorithm comprises:
    training the several segmentation network architectures for another 50 epochs using the augmented dataset on the basis of the model parameters obtained from the first stage of training, wherein the auxiliary unit parameter used is 0.2.
  18. The method according to claim 17, characterized in that performing the third stage of full training on the several segmentation network architectures selected according to the simulated annealing algorithm comprises:
    training the several segmentation network architectures for another 50 epochs using the augmented dataset on the basis of the model parameters obtained from the second stage of training, wherein the auxiliary unit parameter used is 0.15 and the BN layers are frozen.
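Claims 15 to 18 describe a simulated-annealing-driven shortlist followed by three full-training stages of 50 epochs each, with an auxiliary-unit parameter of 0.2, 0.2 and 0.15 and the BN layers frozen in the last stage. The sketch below writes that schedule down next to a standard Metropolis acceptance rule; the acceptance form and the temperature handling are assumptions, since the claims do not spell out the annealing details.

    # Hedged sketch of claims 15-18: a Metropolis-style acceptance rule for the
    # simulated-annealing shortlist (the exact annealing schedule is an
    # assumption) plus the three-stage full-training schedule from the claims.
    import math
    import random

    def accept(new_score: float, current_score: float, temperature: float) -> bool:
        """Always accept better candidates; accept worse ones with a temperature-scaled probability."""
        if new_score >= current_score:
            return True
        return random.random() < math.exp((new_score - current_score) / temperature)

    # Three-stage full-training schedule; each stage resumes from the previous
    # stage's parameters (claims 16-18).
    FULL_TRAINING_STAGES = [
        {"epochs": 50, "aux_weight": 0.20, "freeze_bn": False},  # stage 1
        {"epochs": 50, "aux_weight": 0.20, "freeze_bn": False},  # stage 2
        {"epochs": 50, "aux_weight": 0.15, "freeze_bn": True},   # stage 3
    ]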
  19. An image instance segmentation system, characterized in that it comprises:
    an acquisition module configured to obtain a trained teacher network and a controller network;
    a search module configured to search for a plurality of decoder structures using the controller network and to form a plurality of segmentation network architectures from each decoder structure and a fixed encoder;
    an evaluation module configured to perform image instance segmentation forward inference simultaneously with the trained teacher network and each of the segmentation network architectures, to use the loss function of the trained teacher network to guide and correct the loss function of each of the segmentation network architectures after each forward inference, to select several segmentation network architectures from the plurality of segmentation network architectures for full training according to a simulated annealing algorithm, and to determine the optimal segmentation network architecture from the several segmentation network architectures;
    an image instance segmentation module configured to perform image instance segmentation on an image to be segmented using the optimal segmentation network architecture.
  20. A computer device, comprising:
    at least one processor; and
    a memory storing a computer program executable on the processor, characterized in that the processor, when executing the program, performs the steps of the method according to any one of claims 1 to 18.
  21. A non-volatile readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, performs the steps of the method according to any one of claims 1 to 18.
PCT/CN2023/101908 2022-11-30 2023-06-21 Image instance segmentation method and system, device and nonvolatile readable storage medium WO2024113782A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211515764.5A CN115546492B (en) 2022-11-30 2022-11-30 Image instance segmentation method, system, equipment and storage medium
CN202211515764.5 2022-11-30

Publications (1)

Publication Number Publication Date
WO2024113782A1 true WO2024113782A1 (en) 2024-06-06

Family

ID=84721895

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101908 WO2024113782A1 (en) 2022-11-30 2023-06-21 Image instance segmentation method and system, device and nonvolatile readable storage medium

Country Status (2)

Country Link
CN (1) CN115546492B (en)
WO (1) WO2024113782A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546492B (en) * 2022-11-30 2023-03-10 苏州浪潮智能科技有限公司 Image instance segmentation method, system, equipment and storage medium
CN116862836A (en) * 2023-05-30 2023-10-10 北京透彻未来科技有限公司 System and computer equipment for detecting extensive organ lymph node metastasis cancer

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022056438A1 (en) * 2020-09-14 2022-03-17 Chan Zuckerberg Biohub, Inc. Genomic sequence dataset generation
CN114299380A (en) * 2021-11-16 2022-04-08 中国华能集团清洁能源技术研究院有限公司 Remote sensing image semantic segmentation model training method and device for contrast consistency learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445008A (en) * 2020-03-24 2020-07-24 暗物智能科技(广州)有限公司 Knowledge distillation-based neural network searching method and system
US20210334543A1 (en) * 2020-04-28 2021-10-28 Ajou University Industry-Academic Cooperation Foundation Method for semantic segmentation based on knowledge distillation
WO2022098203A1 (en) * 2020-11-09 2022-05-12 Samsung Electronics Co., Ltd. Method and apparatus for image segmentation
CN113409299A (en) * 2021-07-12 2021-09-17 北京邮电大学 Medical image segmentation model compression method
CN115546492A (en) * 2022-11-30 2022-12-30 苏州浪潮智能科技有限公司 Image instance segmentation method, system, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Master's Thesis", 1 June 2019, SHANXI AGRICULTURAL UNIVERSITY, CN, article LIU, HUI: "Flower Recognition Research Based on Wide Residual Network and Migration Learning", pages: 1 - 43, XP009555339, DOI: 10.27285/d.cnki.gsxnu.2019.000648 *
PARK SANGYONG, HEO YONG SEOK: "Knowledge Distillation for Semantic Segmentation Using Channel and Spatial Correlations and Adaptive Cross Entropy", SENSORS, MDPI, CH, vol. 20, no. 16, 1 January 2020 (2020-01-01), CH , pages 4616, XP093176553, ISSN: 1424-8220, DOI: 10.3390/s20164616 *

Also Published As

Publication number Publication date
CN115546492A (en) 2022-12-30
CN115546492B (en) 2023-03-10

Similar Documents

Publication Publication Date Title
WO2024113782A1 (en) Image instance segmentation method and system, device and nonvolatile readable storage medium
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN112347248A (en) Aspect-level text emotion classification method and system
CN110782008B (en) Training method, prediction method and device of deep learning model
CN114255361A (en) Neural network model training method, image processing method and device
CN114144794A (en) Electronic device and method for controlling electronic device
CN111738269B (en) Model training method, image processing device, model training apparatus, and storage medium
Zhu et al. Nasb: Neural architecture search for binary convolutional neural networks
CN112381763A (en) Surface defect detection method
CN112561028A (en) Method for training neural network model, and method and device for data processing
CN112686376A (en) Node representation method based on timing diagram neural network and incremental learning method
CN113065525A (en) Age recognition model training method, face age recognition method and related device
CN111950633A (en) Neural network training method, neural network target detection method, neural network training device, neural network target detection device and storage medium
CN116564355A (en) Multi-mode emotion recognition method, system, equipment and medium based on self-attention mechanism fusion
CN116229519A (en) Knowledge distillation-based two-dimensional human body posture estimation method
CN113313250B (en) Neural network training method and system adopting mixed precision quantization and knowledge distillation
CN113299298B (en) Residual error unit, network and target identification method, system, device and medium
CN114491289A (en) Social content depression detection method of bidirectional gated convolutional network
CN114298224A (en) Image classification method, device and computer readable storage medium
KR20200023695A (en) Learning system to reduce computation volume
CN115376195B (en) Method for training multi-scale network model and face key point detection method
CN111508024A (en) Method for estimating pose of robot based on deep learning
CN113887653B (en) Positioning method and system for tight coupling weak supervision learning based on ternary network
CN114298290A (en) Neural network coding method and coder based on self-supervision learning
CN113095328A (en) Self-training-based semantic segmentation method guided by Gini index

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23895946

Country of ref document: EP

Kind code of ref document: A1