WO2024113782A1 - 一种图像实例分割方法、***、设备以及非易失性可读存储介质 - Google Patents

一种图像实例分割方法、***、设备以及非易失性可读存储介质 Download PDF

Info

Publication number
WO2024113782A1
WO2024113782A1 PCT/CN2023/101908 CN2023101908W WO2024113782A1 WO 2024113782 A1 WO2024113782 A1 WO 2024113782A1 CN 2023101908 W CN2023101908 W CN 2023101908W WO 2024113782 A1 WO2024113782 A1 WO 2024113782A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
segmentation
decoder
segmentation network
network architectures
Prior art date
Application number
PCT/CN2023/101908
Other languages
English (en)
French (fr)
Inventor
周镇镇
张潇澜
Original Assignee
苏州元脑智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州元脑智能科技有限公司 filed Critical 苏州元脑智能科技有限公司
Publication of WO2024113782A1 publication Critical patent/WO2024113782A1/zh

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations

Definitions

  • the present application relates to the field of image processing, and in particular to an image instance segmentation method, system, device and non-volatile readable storage medium.
  • Image semantic segmentation technology has become an important research direction in the field of computer vision and is widely used in practical application scenarios such as mobile robots, autonomous driving, drones, and medical diagnosis.
  • image segmentation technology is mainly divided into two research directions: semantic segmentation and instance segmentation.
  • Semantic segmentation refers to the division of each pixel in the image into a corresponding category, that is, pixel-level classification, so it is also called dense classification; instance segmentation is to distinguish different instances of the same category based on semantic segmentation.
  • the image segmentation neural network models designed by experts have achieved a high level of accuracy, such as Mask RCNN (Region-based Convolutional Neural Networks), DeepLab, U-net series algorithms and other neural networks.
  • Mask RCNN Region-based Convolutional Neural Networks
  • DeepLab DeepLab
  • U-net series algorithms and other neural networks.
  • DeepLab series is a branch with a greater influence in the field of semantic segmentation
  • DeepLabV3+ is one of the better variants of this series. Therefore, researchers began to explore the automatic design of neural networks through Neural Architecture Search (NAS).
  • NAS Neural Architecture Search
  • relevant researchers mainly focus on neural network architecture search algorithms, automatically establish neural networks, and quickly apply them in practice.
  • the existing neural network architecture algorithm uses the search method of reinforcement learning and evolutionary algorithms to search for architectures, evaluates the sampled network architectures through performance evaluation methods, and then obtains the best model structure by optimizing the evaluation indicators.
  • the former is mainly achieved by obtaining the maximum reward in the process of interacting with the environment through the Neural Architecture Search framework.
  • the main representative algorithms include NASNet, MetaQNN (Quantum Neural Network), BlockQNN, etc.; the latter is mainly a general evolutionary algorithm to simulate the laws of biological inheritance and evolution to realize the NAS process.
  • the main representative algorithms include NEAT (Evolving Neural Networks through Augmenting Topologies), DeepNEAT, CoDeepNEAT, etc.
  • Neural network architecture search can automatically design customized neural networks for specific tasks, which has far-reaching implications.
  • tasks such as image segmentation that require dense pixel-by-pixel classification
  • the traditional neural network model used in existing image instance segmentation methods has a large number of parameters, which makes it impossible to use it directly in edge devices.
  • the current vehicle-side chips cannot carry existing high-precision neural networks, and directly using neural networks with smaller parameters will lead to inaccurate recognition of dense images and difficulty in defining the edges of different categories.
  • an embodiment of the present application proposes an image instance segmentation method, comprising the following steps:
  • controller network to search for multiple decoder structures and using each decoder structure and a fixed encoder to form multiple segmentation network architectures
  • obtaining a trained teacher network includes:
  • the encoder of the teacher network is constructed using the backbone network and the atrous spatial pyramid pooling (ASPP) module, where the ASPP module includes four atrous convolution blocks with different dilation rates and a global average pooling block.
  • ASPP atrous spatial pyramid pooling
  • the decoder of the teacher network is constructed using an upsampling module, a 1 ⁇ 1 convolution block, and a 3 ⁇ 3 convolution block, where the decoder of the teacher network takes the low-level feature maps output by the middle layer of the backbone network and the output of the ASPP module as input.
  • it also includes:
  • the decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain the image instance segmentation result;
  • the parameters of the encoder and decoder of the teacher network are adjusted according to the image instance segmentation results to train the teacher network.
  • it also includes:
  • the backbone network and the ASPP module are used to process the images in the training set, including:
  • the backbone network is used to extract multi-layer semantic features of the images in the training set, and the ASPP module is used to sample the multi-layer semantic features in parallel with dilated convolutions at different sampling rates to obtain five sets of feature maps, which are then concatenated and input into the decoder of the teacher network.
  • the decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain an image segmentation result, including:
  • the upsampling module is used to interpolate and upsample the feature map from the ASPP module, and the 1 ⁇ 1 convolution block is used to reduce the channel dimension of the low-level feature map output from the intermediate layer of the backbone network;
  • the low-level feature map of channel dimension reduction and the feature map obtained by linear interpolation upsampling are concatenated and sent to a 3 ⁇ 3 convolution block for processing.
  • the upsampling module is then used for linear interpolation upsampling again to obtain the image instance segmentation result.
  • a 1 ⁇ 1 convolution block is used to perform channel dimension reduction on a low-level feature map output from an intermediate layer of the backbone network, including:
  • the controller network comprises a two-layer recursive LSTM neural network with 100 hidden units, and all hidden units are randomly initialized from a uniform distribution.
  • a controller network is used to search for multiple decoder structures and each decoder structure and a fixed encoder are used to form multiple segmentation network architectures, including:
  • the controller network is used to search the internal structure of the fifth decoder block and the sixth decoder block in a preset search space, and the connection mode between the first decoder block, the second decoder block, the third decoder block, the fourth decoder block, the fifth decoder block and the sixth decoder block.
  • the search space includes 1 ⁇ 1 convolution, 3 ⁇ 3 convolution, 3 ⁇ 3 separable convolution, 5 ⁇ 5 separable convolution, global average pooling, upsampling, 1 ⁇ 1 convolution module, 3 ⁇ 3 convolution with dilation rate 3, 3 ⁇ 3 convolution with dilation rate 12, separable 3 ⁇ 3 convolution with dilation rate 3, separable 5 ⁇ 5 convolution with dilation rate 6, skip connections, and zero operations that effectively invalidate the paths.
  • LKD represents the overall loss of the knowledge distillation network
  • LStudent represents the loss of the segmentation network architecture
  • LTeacher represents the teacher network loss
  • coff represents an adjustable parameter in the actual network training process.
  • coff takes a value of 0.3.
  • it also includes:
  • n is the instance of different categories
  • pixels is the pixel point
  • ytrue is the actual value of the corresponding category
  • ypred is the predicted value of the corresponding type.
  • selecting a plurality of segmentation network architectures from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determining an optimal segmentation network architecture from the plurality of segmentation network architectures includes:
  • the geometric mean is calculated using the average IoU, frequency-weighted IoU, and average pixel precision;
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm, including:
  • segmentation network architectures are selected from multiple segmentation network architectures to perform full training in the first stage, full training in the second stage, and full training in the third stage respectively.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm to perform full training in the first stage, including:
  • the several segmentation network architectures are trained for 50 epochs using the enhanced dataset, with an auxiliary unit parameter of 0.2.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the second phase, including:
  • the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.2.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the second phase, including:
  • the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.15 and the BN layer is frozen.
  • an embodiment of the present application further provides an image instance segmentation system, including:
  • An acquisition module configured to acquire a trained teacher network and a controller network
  • a search module configured to search for a plurality of decoder structures using the controller network and construct a plurality of segmentation network architectures using each decoder structure and a fixed encoder;
  • An evaluation module is configured to use the trained teacher network and each segmentation network architecture to simultaneously perform image instance segmentation forward inference, and use the loss function of the trained teacher network to guide and correct the loss function of each segmentation network architecture after each forward inference, and select a plurality of segmentation network architectures from the plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determine the optimal segmentation network architecture from the plurality of segmentation network architectures;
  • the image instance segmentation module is configured to perform image instance segmentation on the image to be segmented using an optimal segmentation network architecture.
  • an embodiment of the present application further provides a computer device, including:
  • the memory stores a computer program that can be run on the processor, and the processor executes any of the above image implementations when executing the program.
  • an embodiment of the present application further provides a non-volatile readable storage medium, which stores a computer program.
  • a computer program When the computer program is executed by a processor, the steps of any one of the above image instance segmentation methods are performed.
  • the embodiment of the present application has one of the following beneficial technical effects:
  • the solution proposed in the embodiment of the present application guides and corrects the training process of the searched student network (segmentation network architecture) by using the knowledge distillation method, so that a lightweight semantic segmentation model can be quickly obtained with a small computing cost, the problem of excessive parameters in the existing image segmentation model can be improved, and more reliable image segmentation prediction results can be achieved with a faster reasoning speed. It has better adaptability in autonomous driving scenarios.
  • FIG1 is a schematic diagram of a flow chart of an image instance segmentation method provided in an embodiment of the present application.
  • FIG2 is a schematic diagram of a framework of a teacher network provided in an embodiment of the present application.
  • FIG3 is an image segmentation algorithm framework based on knowledge distillation neural network architecture search provided by an embodiment of the present application.
  • FIG4 is a schematic diagram of a controller network provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of a unit architecture provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of an image instance segmentation system provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of the structure of a non-volatile readable storage medium provided in an embodiment of the present application.
  • an embodiment of the present application provides an image instance segmentation method, as shown in FIG1 , which may include the steps of:
  • the solution proposed in the embodiment of the present application uses the knowledge distillation method to guide and correct the search student network (segmentation network architecture) training process, so that a lightweight semantic segmentation model can be quickly obtained with low computational overhead, the problem of excessive parameters in the existing image segmentation model can be improved, and more reliable image segmentation prediction results can be achieved with faster reasoning speed. It has better adaptability in autonomous driving scenarios.
  • obtaining a trained teacher network includes:
  • the encoder of the teacher network is constructed using the backbone network and the ASPP (Atrous Spatial Pyramid Pooling) module, where the ASPP module includes four atrous convolution blocks with different dilation rates and a global average pooling block;
  • the decoder of the teacher network is constructed using an upsampling module, a 1 ⁇ 1 convolution block, and a 3 ⁇ 3 convolution block, where the decoder of the teacher network takes the low-level feature maps output by the middle layer of the backbone network and the output of the ASPP module as input.
  • the DeepLabV3+ network uses ResNet101 as the backbone to extract the multi-layer semantic features in the original image.
  • the feature information is sampled in parallel by the ASPP module with different sampling rates of dilated convolution to obtain different proportions of image context information.
  • the ASPP module accepts the first part of the output of the backbone as input, and uses four dilated convolution blocks with different expansion rates (including convolution, BN, activation layer) and a global average pooling block (including pooling, convolution, BN (Batch Normalization), activation layer) to obtain a total of five groups of feature maps, which are concatenated and passed through a 1 ⁇ 1 convolution block (including convolution, BN, activation, dropout (random inactivation) layer) and finally sent to the Decoder module.
  • the Decoder module receives the low-level feature map from the middle layer of the backbone and the output from the ASPP module as input.
  • it also includes:
  • the decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain the image instance segmentation result;
  • the parameters of the encoder and decoder of the teacher network are adjusted according to the image instance segmentation results to train the teacher network.
  • it also includes:
  • the enhanced dataset can use a variety of data enhancement methods, for example, the image data enhancement method protected by the patent publication number CN114037637A can be used for data enhancement.
  • the image data enhancement method protected by the patent publication number CN114037637A can be used for data enhancement.
  • segment the original image obtain the segmented image and the segmented image.
  • the target category of the image is obtained through the target category to obtain the category to be enhanced; the original image is binarized according to the category to be enhanced to obtain a binary image, and according to the connected domain of the binary image, an instance image that matches the category to be enhanced in the original image is obtained; the instance image is perspective processed to obtain a first instance image, and the first instance image is scaled to obtain a second instance image; the vanishing point position is obtained from the original image, and the pasting position of the second instance image is determined according to the vanishing point position and the geometric size of the second instance image, and the second instance image is pasted to the original image according to the pasting position to obtain an enhanced image of the original image.
  • the backbone network and the ASPP module are used to process the images in the training set, including:
  • the backbone network is used to extract multi-layer semantic features of the images in the training set, and the ASPP module is used to sample the multi-layer semantic features in parallel with dilated convolutions at different sampling rates to obtain five sets of feature maps, which are then concatenated and input into the decoder of the teacher network.
  • the decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain an image segmentation result, including:
  • the upsampling module is used to interpolate and upsample the feature map from the ASPP module, and the 1 ⁇ 1 convolution block is used to reduce the channel dimension of the low-level feature map output from the intermediate layer of the backbone network;
  • the low-level feature map of channel dimension reduction and the feature map obtained by linear interpolation upsampling are concatenated and sent to a 3 ⁇ 3 convolution block for processing.
  • the upsampling module is then used for linear interpolation upsampling again to obtain the image instance segmentation result.
  • a 1 ⁇ 1 convolution block is used to perform channel dimension reduction on a low-level feature map output from an intermediate layer of the backbone network, including:
  • the decoder module can use 1 ⁇ 1 convolution to perform channel dimension reduction on the low-level feature map, including Images Pooling from 256 to 48 (the reason for downsampling to 48 is that too many channels will mask the importance of the feature map output by ASPP, and experiments have verified that 48 is the best).
  • the feature map from ASPP is interpolated and upsampled (Unsample By 4) to obtain a feature map of the same size as the low-level feature map.
  • the low-level feature map with channel dimension reduction and the feature map obtained by linear interpolation upsampling are concatenated using concat and sent to a set of 3*3 convolution blocks for processing. Linear interpolation upsampling is performed again to obtain a predicted map with the same resolution size as the original image.
  • the controller network comprises a two-layer recursive LSTM neural network with 100 hidden units, and all hidden units are randomly initialized from a uniform distribution.
  • the controller has a two-layer recursive LSTM neural network with 100 hidden units, and all units are randomly initialized from a uniform distribution.
  • the embodiment of the present application uses a PP0 optimization strategy for optimization, and the learning rate is 0.0001.
  • a controller network is used to search for multiple decoder structures and each decoder structure and a fixed encoder are used to form multiple segmentation network architectures, including:
  • the controller network is used to search the internal structure of the fifth decoder block and the sixth decoder block in a preset search space, and the connection mode between the first decoder block, the second decoder block, the third decoder block, the fourth decoder block, the fifth decoder block and the sixth decoder block.
  • the student network is an image segmentation network based on neural network architecture search, through the sampling process of neural network architecture search
  • the target network architecture is obtained, the structural knowledge and instance segmentation information in the distillation teacher network are obtained, and the network architecture adopts an encoder-decoder structure.
  • the image segmentation model requires multiple iterations to converge, it is currently difficult to perform a complete segmentation network architecture search from scratch due to limited computing resources, so the embodiment of the present application focuses the attention of the architecture search process on the decoder part.
  • the entire network structure uses the weights in the pre-trained classification network to initialize the encoder, which consists of multiple downsampling operations that reduce the input space dimension; on the other hand, the controller network generates a decoder structure, and the decoder part can access multiple outputs of encoders with different spatial and channel dimensions. In order to keep the sampling architecture compact and roughly the same size, each encoder output will use the same number of output channels for a 1 ⁇ 1 convolution.
  • the student network portion of Figure 3 shows a search layout containing 2 decoder blocks (the fifth decoder block block4 and the sixth decoder block block5) and 2 branch units.
  • Block4 and block5 in the figure obtain two sets of sampling pairs through the controller. The results of the sampling pairs are input into the two modules after element-by-element summation.
  • the two modules obtain the internal unit structure through the controller, and their outputs are connected through the concat operation, and then through conv1 ⁇ 1, and input into the main classifier to calculate the loss and finally form the prediction result of the image segmentation information.
  • the auxiliary cell in the figure has the same structure as other units and can be adjusted to directly output the true background (ground truth) or imitate the teacher's network prediction (or a combination of the above two).
  • Figure 4 shows the layout of the controller network for neural network architecture search, which can sequentially sample the decoder connection mode, including different modules, different operations, and different branch position indexes. Different modules reuse the sampled unit architecture, and apply the same unit with different weights to each module in the sampling pair, and finally add the outputs of the two units. The resulting layer will be added to the sampling pool (the next unit can sample the previous unit as input).
  • the block4 sampling range includes all modules before the module ⁇ block0 (first decoder block), block1 (second decoder block), block2 (third decoder block), block3 (fourth decoder block)>
  • the block5 sampling range includes all modules before the module ⁇ block0, block1, block2, block3, block4 (fifth decoder block>.
  • each unit accepts an input, and the controller first samples operation 1; then, two position indices (index) are sampled, namely input index0 and the output result index1 of sampled operation 1; finally, two corresponding operations are sampled.
  • the output of each operation is added, and in the next step, all three layers (from each operation and its summed result) and the initial two layers are sampled.
  • the number of sampling times of positions within the unit is controlled by another hyperparameter to keep the number of all possible architectures to a feasible number. All existing non-sampled summed outputs within the unit are summed and used as the unit output. In this case, sum is used because the concatenation layer operation may cause the vector size of the output of different architectures to change.
  • 09 represents the sampling position
  • the sampling operations 1-7 represent the operations performed at the corresponding positions.
  • the search space includes 1 ⁇ 1 convolution, 3 ⁇ 3 convolution, 3 ⁇ 3 separable convolution, 5 ⁇ 5 separable convolution, global average pooling, upsampling, 1 ⁇ 1 convolution module, 3 ⁇ 3 convolution with dilation rate 3, 3 ⁇ 3 convolution with dilation rate 12, separable 3 ⁇ 3 convolution with dilation rate 3, separable 5 ⁇ 5 convolution with dilation rate 6, skip connections, and zero operations that effectively invalidate the paths.
  • the number of sampling times of the layer pair is controlled by a hyperparameter, which is set to 3 in the experiment.
  • the encoder part of the network in the embodiment of the present application is MobileNet-v2, pre-trained on MS COCO, and a lightweight RefineNet decoder is used for semantic segmentation during pre-training.
  • the embodiment of the present application uses the outputs of the four layers 2, 3, 6, and 8 of MobileNet-v2 corresponding to block0 to block3 as the input of the decoder; a 1 ⁇ 1 convolutional layer for encoder output adaptation, with 48 output channels during search and 64 output channels during training.
  • the encoder weights are randomly initialized using the Xavier scheme.
  • the embodiment of the present application uses a controller to search for a combination of basic units to construct a neural network architecture. Based on existing semantic segmentation research, the search space is set as follows in the embodiment of the present application:
  • GAP Global average pooling, upsampling, 1 ⁇ 1 convolution module
  • LKD represents the overall loss of the knowledge distillation network
  • LStudent represents the loss of the segmentation network architecture
  • LTeacher represents the teacher network loss
  • coff represents an adjustable parameter in the actual network training process.
  • coff takes a value of 0.3.
  • the sampling architecture is used to perform instance segmentation forward inference simultaneously with the teacher network.
  • LKD represents the overall loss of the knowledge distillation network
  • LStudent represents the loss of the segmentation network architecture
  • LTeacher represents the teacher network loss
  • coff represents an adjustable parameter in the actual network training process.
  • it also includes:
  • n is an instance of different categories, pixels is a pixel, ytrue is the actual value of the corresponding category, and ypred is the predicted value of the corresponding type.
  • the background class is not considered during the calculation, because a large number of pixels belong to the background, and adding the background class to the calculation will have a negative impact on the result.
  • the loss function is crucial to the accuracy of the model prediction results.
  • the embodiment of the present application uses Dice Soft Loss as the loss function because the loss function can be calculated separately for instances of different categories.
  • This loss function is a commonly used loss function in semantic segmentation tasks. It evolved from a loss function based on the dice coefficient, which represents a measure of the overlap between predicted values and actual values of different categories. Calculate the dice loss for each category, and then sum and average it. The detailed expression is as follows:
  • n is an instance of different categories
  • pixels is a pixel point
  • ytrue is the actual value of the corresponding category
  • ypred is the predicted value of the corresponding type.
  • selecting a plurality of segmentation network architectures from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determining an optimal segmentation network architecture from the plurality of segmentation network architectures includes:
  • the geometric mean is calculated using the average IoU, frequency-weighted IoU, and average pixel precision;
  • the embodiment of the present application randomly divides the training set into two non-overlapping sets: an initial training set (Train DataSet0) and an initial validation set (Valid DataSet0).
  • the initial training set can be image enhanced and is used to train the sampled architecture on a given task (i.e., semantic segmentation); the initial validation set is not subjected to any image processing and is used to evaluate the trained architecture and provide a scalar (often referred to as feedback in reinforcement learning literature) to the controller.
  • the internal training process is divided into two stages. The first stage is the architecture search stage.
  • the weights of the encoder are obtained through pre-training, and its output is calculated and stored in the memory.
  • Each sampling process directly imports the encoder output, which can save a lot of computing time and efficiency.
  • the decoder is training, which facilitates the rapid adaptation of the decoder weights and the reasonable estimation of the sampling architecture performance.
  • the second stage is the full training stage However, not all sampled architectures can enter this stage. A simple simulated annealing algorithm is used to decide whether to continue training the sampled architecture for the second stage.
  • the reason why all sampling architectures are not trained is that the sampling architectures that have completed the first stage of training can already predict future development prospects after training on the current batch. Terminating the unpromising architectures in advance can save computing resources and find the target architecture with higher accuracy more quickly.
  • the external optimization process optimizes the controller through the proximal policy optimization (PPO) method given the sampling sequence, log probability and feedback signal, and strikes a balance between the diversity of the sampling architecture and the complexity of the tuning process, so as to achieve the controller network model update and global parameter optimization.
  • PPO proximal policy optimization
  • the embodiment of the present application retains the running average of the feedback after the first stage to decide whether to continue training the sampled architecture.
  • the criterion for evaluating the future prospects of the architecture that is, reward, uses the geometric mean of three quantities:
  • Mean intersection-over-union (mIoU), mainly used for semantic segmentation benchmarks
  • k represents the number of categories
  • i represents the true value
  • j represents the predicted value
  • Pij represents predicting i as j. The same applies to the following.
  • Frequency-weighted IoU which scales each class IoU according to the number of pixels present in that class
  • MPA Mean-pixel accuracy
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures for full training according to a simulated annealing algorithm, including:
  • segmentation network architectures are selected from multiple segmentation network architectures to perform full training in the first stage, full training in the second stage, and full training in the third stage respectively.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm to perform full training in the first phase, including:
  • the augmented dataset is used for training for 50 epochs, where the auxiliary unit parameter used is 0.2.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the second phase, including:
  • the enhanced dataset was used to train 50 epochs, with the auxiliary unit parameter used being 0.2.
  • a plurality of segmentation network architectures are selected from a plurality of segmentation network architectures according to a simulated annealing algorithm for full training in the second phase, including:
  • the enhanced dataset was used to train 50 epochs, with the auxiliary unit parameter of 0.15 and the BN layer frozen.
  • the enhanced data set in each epoch, is fully input into several segmentation network architectures, that is, one epoch can be understood as the process of training all samples once.
  • one epoch when a complete data set passes through the neural network once and returns once, this process is called an epoch, that is, all training samples undergo a forward propagation and a backward propagation in the neural network.
  • the solution proposed in the embodiment of this application uses the data enhancement method to enhance the image of the data set; then, the DeepLabV3+ neural network is trained on the enhanced data set to obtain the image segmentation information and use it as the teacher network; the knowledge distillation method is used to guide and correct the search student network training process, and the loss function is calculated for different categories of image segmentation data, which can effectively improve the detection accuracy of small sample data in image segmentation.
  • a lightweight semantic segmentation model can be quickly obtained with less computing overhead, and more reliable image segmentation prediction results can be achieved with faster reasoning speed, which has better adaptability in autonomous driving scenarios.
  • an embodiment of the present application further provides an image instance segmentation system 400, as shown in FIG6 , comprising:
  • An acquisition module 401 is configured to acquire a trained teacher network and a controller network
  • a search module 402 is configured to search for a plurality of decoder structures using the controller network and construct a plurality of segmentation network architectures using each decoder structure and a fixed encoder;
  • An evaluation module 403 is configured to use the trained teacher network and each segmentation network architecture to simultaneously perform image instance segmentation forward inference, and use the loss function of the trained teacher network to guide and correct the loss function of each segmentation network architecture after each forward inference, and select a plurality of segmentation network architectures from the plurality of segmentation network architectures for full training according to a simulated annealing algorithm and determine the optimal segmentation network architecture from the plurality of segmentation network architectures;
  • the image instance segmentation module 404 is configured to perform image instance segmentation on the image to be segmented using an optimal segmentation network architecture.
  • a teacher network building module is further included, configured to:
  • the encoder of the teacher network is constructed using the backbone network and the atrous spatial pyramid pooling (ASPP) module, where the ASPP module includes four atrous convolution blocks with different dilation rates and a global average pooling block.
  • ASPP atrous spatial pyramid pooling
  • the decoder of the teacher network is constructed using an upsampling module, a 1 ⁇ 1 convolution block, and a 3 ⁇ 3 convolution block, where the decoder of the teacher network takes the low-level feature maps output by the middle layer of the backbone network and the output of the ASPP module as input.
  • a teacher network training module is further included, configured to:
  • the decoder is used to process the low-level feature map output by the middle layer of the backbone network and the output of the ASPP module to obtain the image instance segmentation result;
  • the parameters of the encoder and decoder of the teacher network are adjusted according to the image instance segmentation results to train the teacher network.
  • the teacher network training module is further configured to:
  • the teacher network training module is further configured to:
  • the backbone network is used to extract multi-layer semantic features of the images in the training set, and the ASPP module is used to sample the multi-layer semantic features in parallel with dilated convolutions at different sampling rates to obtain five sets of feature maps, which are then concatenated and input into the decoder of the teacher network.
  • the teacher network training module is further configured to:
  • the upsampling module is used to interpolate and upsample the feature map from the ASPP module, and the 1 ⁇ 1 convolution block is used to reduce the channel dimension of the low-level feature map output from the intermediate layer of the backbone network;
  • the low-level feature map of channel dimension reduction and the feature map obtained by linear interpolation upsampling are concatenated and sent to a 3 ⁇ 3 convolution block for processing.
  • the upsampling module is then used for linear interpolation upsampling again to obtain the image instance segmentation result.
  • the teacher network training module is further configured to:
  • the controller network comprises a two-layer recursive LSTM neural network with 100 hidden units, and all hidden units are randomly initialized from a uniform distribution.
  • the search module is further configured to:
  • the controller network is used to search the internal structure of the fifth decoder block and the sixth decoder block in a preset search space, and the connection mode between the first decoder block, the second decoder block, the third decoder block, the fourth decoder block, the fifth decoder block and the sixth decoder block.
  • the search space includes 1 ⁇ 1 convolution, 3 ⁇ 3 convolution, 3 ⁇ 3 separable convolution, 5 ⁇ 5 separable convolution, global average pooling, upsampling, 1 ⁇ 1 convolution module, 3 ⁇ 3 convolution with dilation rate 3, 3 ⁇ 3 convolution with dilation rate 12, separable 3 ⁇ 3 convolution with dilation rate 3, separable 5 ⁇ 5 convolution with dilation rate 6, skip connections, and zero operations that effectively invalidate the paths.
  • LKD represents the overall loss of the knowledge distillation network
  • LStudent represents the loss of the segmentation network architecture
  • LTeacher represents the teacher network loss
  • coff represents an adjustable parameter in the actual network training process.
  • coff takes a value of 0.3.
  • the evaluation module is further configured to:
  • n is the instance of different categories
  • pixels is the pixel point
  • ytrue is the actual value of the corresponding category
  • ypred is the predicted value of the corresponding type.
  • the evaluation module is further configured to:
  • the geometric mean is calculated using the average IoU, frequency-weighted IoU, and average pixel precision;
  • the evaluation module is further configured to:
  • segmentation network architectures are selected from multiple segmentation network architectures to perform full training in the first stage, full training in the second stage, and full training in the third stage respectively.
  • the evaluation module is further configured to:
  • the several segmentation network architectures are trained for 50 epochs using the enhanced dataset, with an auxiliary unit parameter of 0.2.
  • the evaluation module is further configured to:
  • the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.2.
  • the evaluation module is further configured to:
  • the enhanced data set is used to train the several segmentation network architectures for 50 epochs, wherein the auxiliary unit parameter used is 0.15 and the BN layer is frozen.
  • an embodiment of the present application further provides a computer device 501, including:
  • the memory 510 stores a computer program 511 that can be run on the processor.
  • the processor 520 executes the program, the processor 520 performs the steps of any of the above image instance segmentation methods.
  • an embodiment of the present application also provides a non-volatile readable storage medium 601, and the non-volatile readable storage medium 601 stores a computer program 610.
  • the computer program 610 is executed by a processor, the steps of any one of the above image instance segmentation methods are performed.
  • non-volatile readable storage medium eg, memory
  • the non-volatile readable storage medium may be either a volatile memory or a non-volatile memory, or may include both volatile memory and non-volatile memory.
  • the program may be stored in a non-volatile readable storage medium.
  • the medium can be a read-only memory, a magnetic or optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

本申请公开了一种图像实例分割方法、***、设备以及存储介质,包括以下步骤:获取已训练的教师网络和控制器网络;利用所述控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构;利用所述已训练的教师网络和每一个所述分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后使用所述已训练的教师网络的损失函数对每一个所述分割网络架构的损失函数进行指导并修正,以及根据模拟退火算法从多个所述分割网络架构中挑选若干个分割网络架构进行全量训练并从所述若干个分割网络架构中确定最优的分割网络架构;利用所述最优的分割网络架构对待分割的图像进行图像实例分割。

Description

一种图像实例分割方法、***、设备以及非易失性可读存储介质
相关申请的交叉引用
本申请要求于2022年11月30日提交中国专利局,申请号为202211515764.5,申请名称为“一种图像实例分割方法、***、设备以及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及图像处理领域,特别涉及一种图像实例分割方法、***、设备以及非易失性可读存储介质。
背景技术
图像语义分割技术已经成为计算机视觉领域的重要研究方向,广泛应用于移动机器人、自动驾驶、无人机、医学诊断等实际应用场景。目前图像分割技术,主要分为两个研究方向:语义分割和实例分割。语义分割指的是对图像中的每个像素都划分出对应的类别,即实现像素级分类,故也称密集分类;实例分割是在语义分割的基础上区分出同一类别的不同实例。
目前专家设计的图像分割神经网络模型,已经具有较高的精度水平,例如Mask RCNN(Region-based Convolutional Neural Networks,基于区域的卷积神经网络)、DeepLab、U-net系列算法等神经网络。其中DeepLab系列是语义分割领域影响较大的一个分支,DeepLabV3+属于此系列目前较优秀的变种之一。因此,研究人员开始探索通过神经网络结构搜索(Neural Architecture Search,NAS)实现自动设计神经网络。目前相关研究人员主要聚焦神经网络架构搜索算法,自动建立神经网络,快速应用于实践。现有的神经网络架构算法采用强化学习和进化算法的搜索方法进行架构搜索,通过性能评估方法对采样得到的网络架构进行评估,再通过优化评估指标获取最佳模型结构。前者主要通过神经网络架构搜索(Neural Architecture Search)框架与环境交互的过程中获得最大的奖励实现,主要代表算法有NASNet、MetaQNN(Quantum Neural Network,量子神经网络)、BlockQNN等;后者主要是通用进化算法模拟生物遗传和进化的规律,实现NAS过程,主要代表算法有NEAT(Evolving Neural Networks through Augmenting Topologies,增强拓扑的神经进化网络)、DeepNEAT、CoDeepNEAT等。
神经网络架构搜索可以针对特定任务去自动设计定制的神经网络,具有深远的影响意义,但是对于图像分割这种需要致密的逐像素点类别划分的任务,在实际应用中存在着计算资源和时间受限的约束。即现有的图像实例分割方法使用的传统神经网络模型存在参数量较大的问题,导致在边缘设备中无法直接使用,例如在自动驾驶场景,目前车端芯片无法承载现有的高精度神经网络,直接使用参数量较小的神经网络又会导致对密集图像的识别不准、不同类别边缘部分界定困难等问题。
发明内容
有鉴于此,为了克服上述问题的至少一个方而,本申请实施例提出一种图像实例分割方法,包括以下步骤:
获取已训练的教师网络和控制器网络;
利用控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构;
利用已训练的教师网络和每一个分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后使用已训练的教师网络的损失函数对每一个分割网络架构的损失函数进行指导并修正,以及根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练并从若干个分割网络架构中确定最优的分割网络架构;
利用最优的分割网络架构对待分割的图像进行图像实例分割。
在一些实施例中,获取已训练的教师网络,包括:
利用骨干网络以及空洞空间金字塔池化ASPP模块构建教师网络的编码器,其中,ASPP模块包括四种不同膨胀率的空洞卷积块和一个全局平均池化块;
利用上采样模块、1×1卷积块、3×3卷积块构建教师网络的解码器,其中,教师网络的解码器将骨干网络中间层输出的低级特征图和ASPP模块的输出作为输入。
在一些实施例中,还包括:
构建第一训练集;
利用骨干网络以及ASPP模块对训练集中的图像进行处理;
利用解码器对骨干网络中间层输出的低级特征图和ASPP模块的输出进行处理以得到图像实例分割结果;
根据图像实例分割结果调整教师网络的编码器和解码器的参数以对教师网络进行训练。
在一些实施例中,还包括:
对第一训练集中的数据进行数据增强。
在一些实施例中,利用骨干网络以及ASPP模块对训练集中的图像进行处理,包括:
利用骨干网络提取训练集中图像的多层语义特征并利用ASPP模块对多层语义特征以不同采样率的空洞卷积并行采样以得到五组特征图,并将五组特征图拼接后输入到教师网络的解码器。
在一些实施例中,利用解码器对骨干网络中间层输出的低级特征图和ASPP模块的输出进行处理以得到图像分割结果,包括:
利用上采样模块对来自ASPP模块的特征图进行插值上采样并利用1×1卷积块对来自骨干网络中间层输出的低级特征图进行通道降维;
将通道降维的低级特征图和线性插值上采样得到的特征图拼接,并送入3×3卷积块进行处理,并再次利用上采样模块进行线性插值上采样以得到图像实例分割结果。
在一些实施例中,利用1×1卷积块对来自骨干网络中间层输出的低级特征图进行通道降维,包括:
将骨干网络中间层输出的低级特征图的通道降到48。
在一些实施例中,控制器网络包括100个隐藏单元的两层递归LSTM神经网络,且所有隐藏单元从均匀分布中随机初始化。
在一些实施例中,利用控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构,包括:
获取预设的第一解码器块、第二解码器块、第三解码器块以及第四解码器块;
利用控制器网络在预设的搜索空间中搜索第五解码器块和第六解码器块的内部结构,以及第一解码器块、第二解码器块、第三解码器块、第四解码器块、第五解码器块和第六解码器块之间的连接方式。
在一些实施例中,搜索空间包括1×1卷积,3×3卷积,3×3可分离卷积,5×5可分离卷积,全局平均池化、上采样、1×1卷积模块,扩张率为3的3×3卷积,扩张率为12的3×3卷积,扩张率为3的可分离3×3卷积,扩张率为6的可分离5×5卷积,跳跃连接,有效地使路径无效的零操作。
在一些实施例中,利用已训练的教师网络和每一个分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后,使用已训练的教师网络的损失函数对每一个分割网络架构的损失函数进行指导并修正,包括利用如下公式对每一个分割网络架构的损失函数进行指导并修正:
LKD=LStudent+coff*LTeacher
其中,其中,LKD表示知识蒸馏网络总体损失,LStudent表示分割网络架构的损失,LTeacher表示教师网络损失,coff表示一个在实际的网络训练过程中可调节的参数。
在一些实施例中,coff取值为0.3。
在一些实施例中,还包括:
利用公式计算分割网络架构的损失;其中,n为不同类别实例,pixels为像素点,ytrue为对应类别的实际值,ypred为对应类型的预测值。
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练并从若干个分割网络架构中确定最优的分割网络架构,包括:
获取每一个分割网络架构的平均交并比、频率加权交并比以及平均像素精度;
利用平均交并比、频率加权交并比以及平均像素精度计算几何平均值;
根据几何平均值挑选若干个分割网络架构。
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练,包括:
根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构分别进行第一阶段的全量训练、第二阶段的全量训练、第三阶段的全量训练。
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第一阶段的全量训练,包括:
利用增强数据集对所述若干个分割网络架构训练50个时期epoch,其中使用的辅助单元参数为0.2。
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第二阶段的全量训练,包括:
在经过第一阶段训练完毕的模型参数基础上,利用所述增强数据集对所述若干个分割网络架构训练50个所述epoch,其中使用的辅助单元参数为0.2。
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第二阶段的全量训练,包括:
在经过第二阶段训练完毕的模型参数基础上,利用所述增强数据集对所述若干个分割网络架构训练50个所述epoch,其中使用的辅助单元参数为0.15并冻结BN层。
基于同一发明构思,根据本申请的另一个方而,本申请的实施例还提供了一种图像实例分割***,包括:
获取模块,被配置为获取已训练的教师网络和控制器网络;
搜索模块,被配置为利用控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构;
评估模块,被配置为利用已训练的教师网络和每一个分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后使用已训练的教师网络的损失函数对每一个分割网络架构的损失函数进行指导并修正,以及根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练并从若干个分割网络架构中确定最优的分割网络架构;
图像实例分割模块,被配置为利用最优的分割网络架构对待分割的图像进行图像实例分割。
基于同一发明构思,根据本申请的另一个方而,本申请的实施例还提供了一种计算机设备,包括:
至少一个处理器;以及
存储器,存储器存储有可在处理器上运行的计算机程序,处理器执行程序时执行如上的任一种图像实 例分割方法的步骤。
基于同一发明构思,根据本申请的另一个方而,本申请的实施例还提供了一种非易失性可读存储介质,非易失性可读存储介质存储有计算机程序,计算机程序被处理器执行时执行如上的任一种图像实例分割方法的步骤。
本申请实施例具有以下有益技术效果之一:本申请实施例提出的方案通过使用知识蒸馏的方法,对搜索的学生网络(分割网络架构)训练过程进行指导并修正,从而能够在较小计算开支的情况下,快速获得轻量级的语义分割模型,改善现有图像分割模型参数过大的问题,以较快的推理速度实现更为可靠的图像分割预测结果。在自动驾驶场景具有更佳适配性。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下而将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下而描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的实施例。
图1为本申请的实施例提供的图像实例分割方法的流程示意图;
图2为本申请的实施例提供的教师网络的框架示意图;
图3为本申请的实施例提供的基于知识蒸馏的神经网络架构搜索的图像分割算法框架;
图4为本申请的实施例提供的控制器网络示意图;
图5为本申请的实施例提供的单元架构示意图;
图6为本申请的实施例提供的图像实例分割***的结构示意图;
图7为本申请的实施例提供的计算机设备的结构示意图;
图8为本申请的实施例提供的非易失性可读存储介质的结构示意图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚明白,以下结合可选实施例,并参照附图,对本申请实施例进行详细说明。
需要说明的是,本申请实施例中所有使用“第一”和“第二”的表述均是为了区分两个相同名称非相同的实体或者非相同的参量,可见“第一”“第二”仅为了表述的方便,不应理解为对本申请实施例的限定,后续实施例对此不再一一说明。
根据本申请的一个方而,本申请的实施例提出一种图像实例分割方法,如图1所示,其可以包括步骤:
S1,获取已训练的教师网络和控制器网络;
S2,利用控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构;
S3,利用已训练的教师网络和每一个分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后使用已训练的教师网络的损失函数对每一个分割网络架构的损失函数进行指导并修正,以及根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练并从若干个分割网络架构中确定最优的分割网络架构;
S4,利用最优的分割网络架构对待分割的图像进行图像实例分割。
本申请实施例提出的方案通过使用知识蒸馏的方法,对搜索的学生网络(分割网络架构)训练过程进行指导并修正,从而能够在较小计算开支的情况下,快速获得轻量级的语义分割模型,改善现有图像分割模型参数过大的问题,以较快的推理速度实现更为可靠的图像分割预测结果。在自动驾驶场景具有更佳适配性。
在一些实施例中,获取已训练的教师网络,包括:
利用骨干网络以及ASPP(Atrous Spatial Pyramid Pooling,空洞空间金字塔池化)模块构建教师网络的编码器,其中,ASPP模块包括四种不同膨胀率的空洞卷积块和一个全局平均池化块;
利用上采样模块、1×1卷积块、3×3卷积块构建教师网络的解码器,其中,教师网络的解码器将骨干网络中间层输出的低级特征图和ASPP模块的输出作为输入。
可选的,如图2所示,教师网络部分,DeepLabV3+网络应用ResNet101作为backbone(骨干网络),提取原图像中的多层语义特征,经过ASPP模块对特征信息以不同采样率的空洞卷积并行采样,获取了不同比例的图像上下文信息。ASPP模块接受backbone的第一部分输出作为输入,使用了四种不同膨胀率的空洞卷积块(包括卷积、BN、激活层)和一个全局平均池化块(包括池化、卷积、BN(Batch Normalization,批标准化)、激活层)得到一共五组特征图,将其concat(拼接)起来之后,经过一个1×1卷积块(包括卷积、BN、激活、dropout(随机失活)层),最后送入Decoder(解码)模块。该Decoder模块,接收来自backbone中间层的低级特征图和来自ASPP模块的输出作为输入。
在一些实施例中,还包括:
构建第一训练集;
利用骨干网络以及ASPP模块对训练集中的图像进行处理;
利用解码器对骨干网络中间层输出的低级特征图和ASPP模块的输出进行处理以得到图像实例分割结果;
根据图像实例分割结果调整教师网络的编码器和解码器的参数以对教师网络进行训练。
在一些实施例中,还包括:
对第一训练集中的数据进行数据增强。
可选的,增强数据集可以使用多种数据增强方法,例如可以使用公开号为CN114037637A的专利中所保护的图像数据增强方法进行数据增强,在此简单描述其步骤:对原始图像进行分割,获取分割图像与分割 图像的目标类别,通过目标类别获取待增强类别;按照待增强类别对原始图像分别进行二值化处理,获取二值图像,根据二值图像的连通域,获取原始图像中与待增强类别存在匹配关系的实例图像;对实例图像进行透视处理,获取第一实例图像,对第一实例图像进行缩放,获取第二实例图像;从原始图像中获取灭点位置,根据灭点位置与第二实例图像的几何尺寸,确定第二实例图像的粘贴位置,根据粘贴位置将第二实例图像粘贴至原始图像,获取原始图像的增强图像。
在一些实施例中,利用骨干网络以及ASPP模块对训练集中的图像进行处理,包括:
利用骨干网络提取训练集中图像的多层语义特征并利用ASPP模块对多层语义特征以不同采样率的空洞卷积并行采样以得到五组特征图,并将五组特征图拼接后输入到教师网络的解码器。
在一些实施例中,利用解码器对骨干网络中间层输出的低级特征图和ASPP模块的输出进行处理以得到图像分割结果,包括:
利用上采样模块对来自ASPP模块的特征图进行插值上采样并利用1×1卷积块对来自骨干网络中间层输出的低级特征图进行通道降维;
将通道降维的低级特征图和线性插值上采样得到的特征图拼接,并送入3×3卷积块进行处理,并再次利用上采样模块进行线性插值上采样以得到图像实例分割结果。
在一些实施例中,利用1×1卷积块对来自骨干网络中间层输出的低级特征图进行通道降维,包括:
将骨干网络中间层输出的低级特征图的通道降到48。
可选的,如图2所示,解码器模块可以对低级特征图使用1×1卷积进行通道降维,其中,包括Images Pooling(图像池化)从256降到48(之所以需要降采样到48,是因为太多的通道会掩盖ASPP输出的特征图的重要性,且实验验证48最佳)。对来自ASPP的特征图进行插值上采样(Unsample By4),得到与低级特征图尺寸相同的特征图。将通道降维的低级特征图和线性插值上采样得到的特征图使用concat拼接起来,并送入一组3*3卷积块进行处理。再次进行线性插值上采样,得到与原图分辨率大小一样的预测图。
在一些实施例中,控制器网络包括100个隐藏单元的两层递归LSTM神经网络,且所有隐藏单元从均匀分布中随机初始化。
可选的,控制器具有100个隐藏单元的两层递归LSTM神经网络,所有单元都从均匀分布中随机初始化。本申请实施例使用PP0优化策略进行优化,学习率为0.0001。
在一些实施例中,利用控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构,包括:
获取预设的第一解码器块、第二解码器块、第三解码器块以及第四解码器块;
利用控制器网络在预设的搜索空间中搜索第五解码器块和第六解码器块的内部结构,以及第一解码器块、第二解码器块、第三解码器块、第四解码器块、第五解码器块和第六解码器块之间的连接方式。
可选的,学生网络部分为基于神经网络架构搜索的图像分割网络,通过神经网络架构搜索的采样过程 获得目标网络架构、蒸馏教师网络中的结构知识和实例分割信息,网络架构采用编码器解码器结构。因为图像分割模型需要多次迭代才能收敛,受限于计算资源,目前从头开始执行完整的分割网络架构搜索难以实践,所以本申请实施例将架构搜索过程的注意力集中在解码器部分。整个网络结构一方而使用预先训练的分类网络中的权重初始化编码器,该分类网络由多个降低输入空间维度的下采样操作组成;另一方而,由控制器网络产生解码器结构,该解码器部分可以访问具有不同空间和频道维度的编码器的多个输出。为了保持采样架构紧凑且大小大致相同,每个编码器输出都会使用相同数量的输出通道进行一次1×1卷积。
图3学生网络部分展示了含有2个解码器块(第五解码器块block4和第六解码器块block5)和2个分支单元的搜索布局。图中的block4和block5通过控制器获得两组采样对,采样对的结果经过逐元素的求和操作后输入这两个模块,这两个模块经过控制器获得内部单元结构,其输出经过concat操作进行连接后,再经过conv1×1,输入到主分类器中,计算loss(损失)最终形成对图像的分割信息预测结果。图中的附加单元(auxiliary cell)与其他单元结构相同,可以被调节为直接输出真实背景(ground truth),或者模仿教师的网络预测(或以上两种的组合)。同时,它在训练或测试期间都不会影响主分类器的输出,只为网络的其余部分提供更好的梯度。不过,每个采样架构的反馈(reward)仍然由主分类器的输出决定。为简单起见,本申请实施例只对所有辅助输出应用分割损失。
图4为神经网络架构搜索的控制器网络的布局,它可以顺序地采样出要解码器连接方式,包括不同模块、不同运算操作以及不同分支位置索引。不同模块重复使用采样得到的单元架构,并将具有不同权重的相同单元应用于采样对内的每一模块,最后将两个单元的输出相加。结果层将添加到采样池中(下一个单元可以采样上一个单元作为输入)。其中,block4采样范围包括该模块前而所有模块<block0(第一解码器块),block1(第二解码器块),block2(第三解码器块),block3(第四解码器块)>,block5采样范围包括该模块前而所有模块<block0,block1,block2,block3,block4(第五解码器块>。单元内部架构采样下而将展开描述。
单元架构内部结果如图5所示。每个单元接受一个输入,控制器首先采样操作1;然后,采样两个位置索引(index),即输入index0和采样操作1的输出结果index1;最后采样两个相应的操作。将每个运算的输出相加,并在下一步对所有三层(来自每个运算及其相加结果)以及初始两层进行采样。单元内位置的采样次数由另一个超参数控制,以便将所有可能的架构的数量保持在一个可行的数量。对单元内所有现有的非采样求和输出求和,并将其用作单元输出。在这种情况下,使用求和(sum),因为拼接层(concatenation)运算可能会导致不同架构输出的向量尺寸变化。其中,09表示采样位置,采样操作1-7表示对应位置进行的运算。
在一些实施例中,搜索空间包括1×1卷积,3×3卷积,3×3可分离卷积,5×5可分离卷积,全局平均池化、上采样、1×1卷积模块,扩张率为3的3×3卷积,扩张率为12的3×3卷积,扩张率为3的可分离3×3卷积,扩张率为6的可分离5×5卷积,跳跃连接,有效地使路径无效的零操作。
可选的,层对的采样次数由一个超参数控制,在实验中将该参数设置为3。本申请实施例网络的编码器部分是MobileNet-v2,在MS COCO上预训练,预训练时使用轻量级的RefineNet解码器进行语义分割。本申请实施例使用MobileNet-v2的2、3、6、8的四个层的输出与block0至block3相对应,作为解码器的输入;用于编码器输出自适应的1×1卷积层,在搜索期间有48个输出通道,在训练期间有64个输出通道。使用Xavier方案随机初始化编码器权重。
本申请实施例使用控制器搜索基本单元的组合来构建神经网络架构,基于现有语义分割研究,本申请实施例中设置搜索空间如下:
·1×1卷积(Cony),
·3×3卷积,
·3×3可分离卷积,
·5×5可分离卷积,
·全局平均池化、上采样、1×1卷积模块(图中缩写为GAP),
·扩张率为3的3×3卷积,
·扩张率为12的3×3卷积,
·扩张率为3的可分离3×3卷积,
·扩张率为6的可分离5×5卷积,
·跳跃连接,
·有效地使路径无效的零操作。
在一些实施例中,利用已训练的教师网络和每一个分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后,使用已训练的教师网络的损失函数对每一个分割网络架构的损失函数进行指导并修正,包括利用如下公式对每一个分割网络架构的损失函数进行指导并修正:
LKD=LStudent+coff*LTeacher
其中,其中,LKD表示知识蒸馏网络总体损失,LStudent表示分割网络架构的损失,LTeacher表示教师网络损失,coff表示一个在实际的网络训练过程中可调节的参数。
在一些实施例中,coff取值为0.3。
可选的,通过神经网络架构搜索框架获得采样架构后,使用该采样架构与教师网络同时进行实例分割正向推理,在每次正向推理后,使用教师网络的损失函数对学生网络的损失函数进行指导并修正,如公式(1):
LKD=LStudent+coff*LTeacher
(1)式中其中,LKD表示知识蒸馏网络总体损失,LStudent表示分割网络架构的损失,LTeacher表示教师网络损失,coff表示一个在实际的网络训练过程中可调节的参数。经过反复训练学生网络,学生网络可以逐渐获取教师网络对不同实例的各层次的特征图和边缘信息,实现对图像实例的像素级定位。
在一些实施例中,还包括:
利用公式计算分割网络架构的损失;其中,n为不同类别实例,pixels为像素点,ytrue为对应类别的实际值,ypred为对应类型的预测值。
可选的,在计算时,并未考虑背景类,因为大量像素属于背景,加入背景类的计算后会对结果产生负而影响。在学生网络训练的过程中,需要最小化或者最大化目标函数,其中需要最小化目标的函数称为“损失函数”。损失函数的选择对于模型预测结果的精度至关重要,本申请实施例使用Dice Soft Loss作为损失函数,因为该损失函数可以针对不同类别的实例分别计算,该损失函数是语义分割任务中常用的损失函数,由基于dice系数的损失函数演变而来,表示不同类别预测值与实际值重叠的度量。对每个类别求其dice损失,再求和取平均值,详细表达式如式:
其中,n为不同类别实例,pixels为像素点,ytrue为对应类别的实际值,ypred为对应类型的预测值。
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练并从若干个分割网络架构中确定最优的分割网络架构,包括:
获取每一个分割网络架构的平均交并比、频率加权交并比以及平均像素精度;
利用平均交并比、频率加权交并比以及平均像素精度计算几何平均值;
根据几何平均值挑选若干个分割网络架构。
可选的,本申请实施例将训练集随机划分为两个不重叠的集合:初始训练集(Train DataSet0)和初始验证集(Valid DataSet0)。初始训练集可进行图像增强,用于在给定任务(即语义分割)上训练采样的架构;而初始验证集未经任何图像处理,用于评估训练的架构并为控制器提供标量(在强化学习文献中经常称为反馈)。搜索优化过程存在两个训练过程:采样架构的内部优化和控制器的外部优化。内部训练过程分为两个阶段,第一阶段为架构搜索阶段,在该阶段中,编码器的权重通过预训练得到,其输出经过计算后,存储在内存中,每次采样过程直接导入编码器输出,这样可以大量节约运算时间和效率,此时只有解码器在训练,方便解码器权重的快速自适应和对采样架构性能的合理估计。第二阶段为全量训练阶 段,但并非所有采样架构都可以进入该阶段,主要通过简单的模拟退火算法来决定是否继续为第二阶段训练采样架构。
之所以不训练全部的采样架构,是因为完成第一阶段训练的采样架构经过在当前batch上的训练后,已经可以预估未来的发展前景,提前终止没有前途的架构一则可以节约运算资源,二则可以更快地找到精度较高的目标架构。外部优化过程在给定采样序列、对数概率和反馈信号的情况下,通过近端策略优化(PPO)方法对控制器进行优化,在采样架构的多样性和调优过程的复杂性之间取得平衡,实现控制器网络模型更新和参数全局优化。
如上,本申请实施例在第一阶段之后保留反馈的运行平均值,来决定是否继续训练采样架构。在网络架构搜索过程中,评估架构未来前景优劣的标准,也就是reward,使用三个量的几何平均值:
平均交并比(mean intersection-over-union,mIoU),主要用于语义分割基准;
其中,k表示类别数量,i表示真实值,j表示预测值,Pij表示将i预测为j。以下相同。
频率加权交并比(frequency-weighted IoU,fwIoU),根据该类中存在的像素数缩放每个类IoU;
平均像素精度(mean-pixel accuracy,MPA),即平均每个类别的正确像素数。
求以上三个量的几何平均值:
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练,包括:
根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构分别进行第一阶段的全量训练、第二阶段的全量训练、第三阶段的全量训练。
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第一阶段的全量训练,包括:
利用增强数据集训练50个epoch,其中使用的辅助单元参数为0.2。
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第二阶段的全量训练,包括:
在经过第一阶段训练完毕的模型参数基础上,利用增强数据集训练50个epoch,其中使用的辅助单元参数为0.2。
在一些实施例中,根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行第二阶段的全量训练,包括:
在经过第二阶段训练完毕的模型参数基础上,利用增强数据集训练50个epoch,其中使用的辅助单元参数为0.15并冻结BN层。
可选的,在本申请实施例中,在每个epoch中增强数据集被全部输入到若干个分割网络架构中,即一次epoch(时期)可以理解为将所有的样本训练一次的过程。详细来说,当一个完整的数据集通过了神经网络一次并且返回了一次,这个过程称为一次epoch,也就是说,所有训练样本在神经网络中都进行了一次正向传播和反向传播。
本申请实施例提出的方案利用数据增强方法,对数据集进行图像增强;然后在增强数据集上,对DeepLabV3+神经网络进行训练,获得图像的分割信息,并将其用作教师网络;使用知识蒸馏的方法,对搜索的学生网络训练过程进行指导并修正,并针对图像分割数据的不同类别进行损失函数计算,可以有效提高小样本数据在图像分割中的检测精度。这样能够在较小计算开支的情况下,快速获得轻量级的语义分割模型,以较快的推理速度实现更为可靠的图像分割预测结果,在自动驾驶场景具有更佳适配性。
基于同一发明构思，根据本申请的另一个方面，本申请的实施例还提供了一种图像实例分割系统400，如图6所示，包括：
获取模块401,被配置为获取已训练的教师网络和控制器网络;
搜索模块402,被配置为利用控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构;
评估模块403,被配置为利用已训练的教师网络和每一个分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后使用已训练的教师网络的损失函数对每一个分割网络架构的损失函数进行指导并修正,以及根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构进行全量训练并从若干个分割网络架构中确定最优的分割网络架构;
图像实例分割模块404,被配置为利用最优的分割网络架构对待分割的图像进行图像实例分割。
在一些实施例中,还包括教师网络构建模块,被配置为:
利用骨干网络以及空洞空间金字塔池化ASPP模块构建教师网络的编码器,其中,ASPP模块包括四种不同膨胀率的空洞卷积块和一个全局平均池化块;
利用上采样模块、1×1卷积块、3×3卷积块构建教师网络的解码器,其中,教师网络的解码器将骨干网络中间层输出的低级特征图和ASPP模块的输出作为输入。
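作为示意，下面给出ASPP模块的一个极简实现。这是假设性代码，膨胀率取1、6、12、18是DeepLabV3+的常用取值，具体膨胀率以本申请实施例为准；四个空洞卷积块与一个全局平均池化块的输出拼接后即为送入解码器的五组特征图：

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """四个不同膨胀率的空洞卷积块 + 一个全局平均池化块, 输出在通道维拼接。"""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3 if r > 1 else 1,
                      padding=r if r > 1 else 0, dilation=r, bias=False)
            for r in rates])
        self.gap = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(in_ch, out_ch, 1, bias=False))
    def forward(self, x):
        feats = [b(x) for b in self.branches]
        g = F.interpolate(self.gap(x), size=x.shape[-2:], mode='bilinear',
                          align_corners=False)
        return torch.cat(feats + [g], dim=1)      # 五组特征图拼接后送入解码器
```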
在一些实施例中,还包括教师网络训练模块,被配置为:
构建第一训练集;
利用骨干网络以及ASPP模块对训练集中的图像进行处理;
利用解码器对骨干网络中间层输出的低级特征图和ASPP模块的输出进行处理以得到图像实例分割结果;
根据图像实例分割结果调整教师网络的编码器和解码器的参数以对教师网络进行训练。
在一些实施例中,教师网络训练模块还被配置为:
对第一训练集中的数据进行数据增强。
在一些实施例中,教师网络训练模块还被配置为:
利用骨干网络提取训练集中图像的多层语义特征并利用ASPP模块对多层语义特征以不同采样率的空洞卷积并行采样以得到五组特征图,并将五组特征图拼接后输入到教师网络的解码器。
在一些实施例中,教师网络训练模块还被配置为:
利用上采样模块对来自ASPP模块的特征图进行插值上采样并利用1×1卷积块对来自骨干网络中间层输出的低级特征图进行通道降维;
将通道降维的低级特征图和线性插值上采样得到的特征图拼接,并送入3×3卷积块进行处理,并再次利用上采样模块进行线性插值上采样以得到图像实例分割结果。
在一些实施例中,教师网络训练模块还被配置为:
将骨干网络中间层输出的低级特征图的通道降到48。
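下面给出教师网络解码器的一个示意性实现。这是假设性代码，其中3×3卷积块的中间通道数256为DeepLabV3+的常用取值，并非本申请实施例限定；流程与上文一致：低级特征经1×1卷积降到48通道，与上采样后的ASPP特征拼接，经3×3卷积块处理后再次线性插值上采样得到分割结果：

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherDecoder(nn.Module):
    def __init__(self, low_ch, aspp_ch, num_classes):
        super().__init__()
        self.reduce = nn.Conv2d(low_ch, 48, 1, bias=False)          # 低级特征通道降维到48
        self.fuse = nn.Sequential(
            nn.Conv2d(aspp_ch + 48, 256, 3, padding=1, bias=False),  # 拼接后的3×3卷积块
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))
    def forward(self, aspp_feat, low_feat, out_size):
        x = F.interpolate(aspp_feat, size=low_feat.shape[-2:], mode='bilinear',
                          align_corners=False)                       # 插值上采样
        x = self.fuse(torch.cat([x, self.reduce(low_feat)], dim=1))  # 拼接并融合
        return F.interpolate(x, size=out_size, mode='bilinear',
                             align_corners=False)                    # 再次上采样得到结果
```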
在一些实施例中,控制器网络包括100个隐藏单元的两层递归LSTM神经网络,且所有隐藏单元从均匀分布中随机初始化。
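下面给出该控制器网络的一个极简示意。这是假设性实现，嵌入维度、初始化范围以及候选项数num_choices均为本示例的假设，仅体现"两层、100个隐藏单元的LSTM，参数从均匀分布随机初始化，逐步输出采样决策"的结构：

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    def __init__(self, num_choices, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(num_choices, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2)   # 两层LSTM, 100个隐藏单元
        self.head = nn.Linear(hidden, num_choices)
        for p in self.parameters():
            nn.init.uniform_(p, -0.1, 0.1)                   # 从均匀分布随机初始化(范围为假设)
    def forward(self, prev_token, state=None):
        x = self.embed(prev_token).unsqueeze(0)              # [1, batch, hidden]
        out, state = self.lstm(x, state)
        return self.head(out.squeeze(0)), state              # logits用于采样下一个决策
```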
在一些实施例中,搜索模块还被配置为:
获取预设的第一解码器块、第二解码器块、第三解码器块以及第四解码器块;
利用控制器网络在预设的搜索空间中搜索第五解码器块和第六解码器块的内部结构,以及第一解码器块、第二解码器块、第三解码器块、第四解码器块、第五解码器块和第六解码器块之间的连接方式。
在一些实施例中,搜索空间包括1×1卷积,3×3卷积,3×3可分离卷积,5×5可分离卷积,全局平均池化、上采样、1×1卷积模块,扩张率为3的3×3卷积,扩张率为12的3×3卷积,扩张率为3的可分离3×3卷积,扩张率为6的可分离5×5卷积,跳跃连接,有效地使路径无效的零操作。
在一些实施例中,评估模块还被配置为利用如下公式对每一个分割网络架构的损失函数进行指导并修正:
L_KD = L_Student + coff * L_Teacher
其中，L_KD表示知识蒸馏网络总体损失，L_Student表示分割网络架构的损失，L_Teacher表示教师网络损失，coff表示一个在实际的网络训练过程中可调节的参数。
在一些实施例中,coff取值为0.3。
在一些实施例中,评估模块还被配置为:
利用Dice Soft Loss公式计算分割网络架构的损失；其中，n为不同类别实例，pixels为像素点，y_true为对应类别的实际值，y_pred为对应类别的预测值。
在一些实施例中,评估模块还被配置为:
获取每一个分割网络架构的平均交并比、频率加权交并比以及平均像素精度;
利用平均交并比、频率加权交并比以及平均像素精度计算几何平均值;
根据几何平均值挑选若干个分割网络架构。
在一些实施例中,评估模块还被配置为:
根据模拟退火算法从多个分割网络架构中挑选若干个分割网络架构分别进行第一阶段的全量训练、第二阶段的全量训练、第三阶段的全量训练。
在一些实施例中,评估模块还被配置为:
利用增强数据集对所述若干个分割网络架构训练50个时期epoch,其中使用的辅助单元参数为0.2。
在一些实施例中,评估模块还被配置为:
在经过第一阶段训练完毕的模型参数基础上,利用所述增强数据集对所述若干个分割网络架构训练50个所述epoch,其中使用的辅助单元参数为0.2。
在一些实施例中,评估模块还被配置为:
在经过第二阶段训练完毕的模型参数基础上,利用所述增强数据集对所述若干个分割网络架构训练50个所述epoch,其中使用的辅助单元参数为0.15并冻结BN层。
基于同一发明构思，根据本申请的另一个方面，如图7所示，本申请的实施例还提供了一种计算机设备501，包括：
至少一个处理器520;以及
存储器510,存储器510存储有可在处理器上运行的计算机程序511,处理器520执行程序时执行如上的任一种图像实例分割方法的步骤。
基于同一发明构思，根据本申请的另一个方面，如图8所示，本申请的实施例还提供了一种非易失性可读存储介质601，非易失性可读存储介质601存储有计算机程序610，计算机程序610被处理器执行时执行如上的任一种图像实例分割方法的步骤。
最后需要说明的是,本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,可以通过计算机程序来指令相关硬件来完成,程序可存储于一非易失性可读存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。
此外,应该明白的是,本文的非易失性可读存储介质(例如,存储器)可以是易失性存储器或非易失性存储器,或者可以包括易失性存储器和非易失性存储器两者。
本领域技术人员还将明白的是，结合这里的公开所描述的各种示例性逻辑块、模块、电路和算法步骤可以被实现为电子硬件、计算机软件或两者的组合。为了清楚地说明硬件和软件的这种可互换性，已经就各种示意性组件、方块、模块、电路和步骤的功能对其进行了一般性的描述。这种功能是被实现为软件还是被实现为硬件取决于实际应用以及施加给整个系统的设计约束。本领域技术人员可以针对每种实际应用以各种方式来实现所描述的功能，但是这种实现决定不应被解释为导致脱离本申请实施例公开的范围。
以上是本申请公开的示例性实施例,但是应当注意,在不背离权利要求限定的本申请实施例公开的范围的前提下,可以进行多种改变和修改。根据这里描述的公开实施例的方法权利要求的功能、步骤和/或动作不需以任何特定顺序执行。此外,尽管本申请实施例公开的元素可以以个体形式描述或要求,但除非明确限制为单数,也可以理解为多个。
应当理解的是,在本文中使用的,除非上下文清楚地支持例外情况,单数形式“一个”旨在也包括复数形式。还应当理解的是,在本文中使用的“和/或”是指包括一个或者一个以上相关联地列出的项目的任意和所有可能组合。
上述本申请实施例的序号仅仅用于描述，不代表实施例的优劣。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成，也可以通过程序来指令相关的硬件完成，程序可以存储于一种非易失性可读存储介质中，上述提到的非易失性可读存储介质可以是只读存储器，磁盘或光盘等。
所属领域的普通技术人员应当理解：以上任何实施例的讨论仅为示例性的，并非旨在暗示本申请实施例公开的范围(包括权利要求)被限于这些例子；在本申请实施例的思路下，以上实施例或者不同实施例中的技术特征之间也可以进行组合，并存在如上的本申请实施例的不同方面的许多其它变化，为了简明它们没有在细节中提供。因此，凡在本申请实施例的精神和原则之内，所做的任何省略、修改、等同替换、改进等，均应包含在本申请实施例的保护范围之内。

Claims (21)

  1. 一种图像实例分割方法,其特征在于,包括以下步骤:
    获取已训练的教师网络和控制器网络;
    利用所述控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构;
    利用所述已训练的教师网络和每一个所述分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后使用所述已训练的教师网络的损失函数对每一个所述分割网络架构的损失函数进行指导并修正,以及根据模拟退火算法从多个所述分割网络架构中挑选若干个分割网络架构进行全量训练并从所述若干个分割网络架构中确定最优的分割网络架构;
    利用所述最优的分割网络架构对待分割的图像进行图像实例分割。
  2. 如权利要求1所述的方法,其特征在于,获取已训练的教师网络,包括:
    利用骨干网络以及空洞空间金字塔池化ASPP模块构建教师网络的编码器,其中,所述ASPP模块包括四种不同膨胀率的空洞卷积块和一个全局平均池化块;
    利用上采样模块、1×1卷积块、3×3卷积块构建教师网络的解码器,其中,所述教师网络的解码器将所述骨干网络中间层输出的低级特征图和所述ASPP模块的输出作为输入。
  3. 如权利要求2所述的方法,其特征在于,还包括:
    构建第一训练集;
    利用骨干网络以及ASPP模块对训练集中的图像进行处理;
    利用所述解码器对所述骨干网络中间层输出的低级特征图和所述ASPP模块的输出进行处理以得到图像实例分割结果;
    根据所述图像实例分割结果调整所述教师网络的编码器和解码器的参数以对所述教师网络进行训练。
  4. 如权利要求3所述的方法,其特征在于,还包括:
    对所述第一训练集中的数据进行数据增强。
  5. 如权利要求3所述的方法,其特征在于,利用骨干网络以及ASPP模块对训练集中的图像进行处理,包括:
    利用所述骨干网络提取训练集中图像的多层语义特征并利用所述ASPP模块对所述多层语义特征以不同采样率的空洞卷积并行采样以得到五组特征图,并将所述五组特征图拼接后输入到所述教师网络的解码器。
  6. 如权利要求3所述的方法,其特征在于,利用所述解码器对所述骨干网络中间层输出的低级特征图和所述ASPP模块的输出进行处理以得到图像分割结果,包括:
    利用所述上采样模块对来自所述ASPP模块的特征图进行插值上采样并利用所述1×1卷积块对来自所述骨干网络中间层输出的低级特征图进行通道降维；
    将通道降维的低级特征图和线性插值上采样得到的特征图拼接,并送入所述3×3卷积块进行处理,并再次利用所述上采样模块进行线性插值上采样以得到所述图像实例分割结果。
  7. 如权利要求6所述的方法,其特征在于,利用所述1×1卷积块对来自所述骨干网络中间层输出的低级特征图进行通道降维,包括:
    将所述骨干网络中间层输出的低级特征图的通道降到48。
  8. 如权利要求1所述的方法,其特征在于,所述控制器网络包括100个隐藏单元的两层递归LSTM神经网络,且所有隐藏单元从均匀分布中随机初始化。
  9. 如权利要求1所述的方法,其特征在于,利用所述控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构,包括:
    获取预设的第一解码器块、第二解码器块、第三解码器块以及第四解码器块;
    利用所述控制器网络在预设的搜索空间中搜索第五解码器块和第六解码器块的内部结构,以及第一解码器块、第二解码器块、第三解码器块、第四解码器块、第五解码器块和第六解码器块之间的连接方式。
  10. 如权利要求9所述的方法,其特征在于,所述搜索空间包括1×1卷积,3×3卷积,3×3可分离卷积,5×5可分离卷积,全局平均池化、上采样、1×1卷积模块,扩张率为3的3×3卷积,扩张率为12的3×3卷积,扩张率为3的可分离3×3卷积,扩张率为6的可分离5×5卷积,跳跃连接,有效地使路径无效的零操作。
  11. 如权利要求1所述的方法,其特征在于,利用所述已训练的教师网络和每一个所述分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后,使用所述已训练的教师网络的损失函数对每一个所述分割网络架构的损失函数进行指导并修正,包括利用如下公式对每一个所述分割网络架构的损失函数进行指导并修正:
    L_KD = L_Student + coff * L_Teacher
    其中，L_KD表示知识蒸馏网络总体损失，L_Student表示分割网络架构的损失，L_Teacher表示教师网络损失，coff表示一个在实际的网络训练过程中可调节的参数。
  12. 如权利要求11所述的方法,其特征在于,所述coff取值为0.3。
  13. 如权利要求1所述的方法,其特征在于,还包括:
    利用公式计算所述分割网络架构的损失；其中，n为不同类别实例，pixels为像素点，y_true为对应类别的实际值，y_pred为对应类型的预测值。
  14. 如权利要求1所述的方法,其特征在于,根据模拟退火算法从多个所述分割网络架构中挑选若干个分割网络架构进行全量训练并从所述若干个分割网络架构中确定最优的分割网络架构,包括:
    获取每一个分割网络架构的平均交并比、频率加权交并比以及平均像素精度;
    利用平均交并比、频率加权交并比以及平均像素精度计算几何平均值;
    根据所述几何平均值挑选若干个分割网络架构。
  15. 如权利要求1所述的方法,其特征在于,根据模拟退火算法从多个所述分割网络架构中挑选若干个分割网络架构进行全量训练,包括:
    根据模拟退火算法从多个所述分割网络架构中挑选若干个分割网络架构分别进行第一阶段的全量训练、第二阶段的全量训练、第三阶段的全量训练。
  16. 如权利要求15所述的方法,其特征在于,根据模拟退火算法从多个所述分割网络架构中挑选若干个分割网络架构进行第一阶段的全量训练,包括:
    利用增强数据集对所述若干个分割网络架构训练50个时期epoch,其中,使用的辅助单元参数为0.2。
  17. 如权利要求16所述的方法,其特征在于,根据模拟退火算法从多个所述分割网络架构中挑选若干个分割网络架构进行第二阶段的全量训练,包括:
    在经过第一阶段训练完毕的模型参数基础上,利用所述增强数据集对所述若干个分割网络架构训练50个所述epoch,其中使用的辅助单元参数为0.2。
  18. 如权利要求17所述的方法，其特征在于，根据模拟退火算法从多个所述分割网络架构中挑选若干个分割网络架构进行第三阶段的全量训练，包括：
    在经过第二阶段训练完毕的模型参数基础上,利用所述增强数据集对所述若干个分割网络架构训练50个所述epoch,其中使用的辅助单元参数为0.15并冻结BN层。
  19. 一种图像实例分割系统，其特征在于，包括：
    获取模块,被配置为获取已训练的教师网络和控制器网络;
    搜索模块，被配置为利用所述控制器网络搜索多个解码器结构并利用每一个解码器结构和固定的编码器构成多个分割网络架构；
    评估模块,被配置为利用所述已训练的教师网络和每一个所述分割网络架构同时进行图像实例分割正向推理,并在每次正向推理后使用所述已训练的教师网络的损失函数对每一个所述分割网络架构的损失函数进行指导并修正,以及根据模拟退火算法从多个所述分割网络架构中挑选若干个分割网络架构进行全量训练并从所述若干个分割网络架构中确定最优的分割网络架构;
    图像实例分割模块,被配置为利用所述最优的分割网络架构对待分割的图像进行图像实例分割。
  20. 一种计算机设备,包括:
    至少一个处理器;以及
    存储器,所述存储器存储有可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述程序时执行如权利要求1-18任意一项所述的方法的步骤。
  21. 一种非易失性可读存储介质,所述非易失性可读存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时执行如权利要求1-18任意一项所述的方法的步骤。
PCT/CN2023/101908 2022-11-30 2023-06-21 一种图像实例分割方法、***、设备以及非易失性可读存储介质 WO2024113782A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211515764.5A CN115546492B (zh) 2022-11-30 2022-11-30 一种图像实例分割方法、***、设备以及存储介质
CN202211515764.5 2022-11-30

Publications (1)

Publication Number Publication Date
WO2024113782A1 true WO2024113782A1 (zh) 2024-06-06

Family

ID=84721895

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101908 WO2024113782A1 (zh) 2022-11-30 2023-06-21 一种图像实例分割方法、***、设备以及非易失性可读存储介质

Country Status (2)

Country Link
CN (1) CN115546492B (zh)
WO (1) WO2024113782A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546492B (zh) * 2022-11-30 2023-03-10 苏州浪潮智能科技有限公司 一种图像实例分割方法、***、设备以及存储介质
CN116862836A (zh) * 2023-05-30 2023-10-10 北京透彻未来科技有限公司 一种泛器官***转移癌检测***及计算机设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445008A (zh) * 2020-03-24 2020-07-24 暗物智能科技(广州)有限公司 一种基于知识蒸馏的神经网络搜索方法及***
CN113409299A (zh) * 2021-07-12 2021-09-17 北京邮电大学 一种医学图像分割模型压缩方法
US20210334543A1 (en) * 2020-04-28 2021-10-28 Ajou University Industry-Academic Cooperation Foundation Method for semantic segmentation based on knowledge distillation
WO2022098203A1 (en) * 2020-11-09 2022-05-12 Samsung Electronics Co., Ltd. Method and apparatus for image segmentation
CN115546492A (zh) * 2022-11-30 2022-12-30 苏州浪潮智能科技有限公司 一种图像实例分割方法、***、设备以及存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4211271A1 (en) * 2020-09-14 2023-07-19 CZ Biohub SF, LLC Genomic sequence dataset generation
CN114299380A (zh) * 2021-11-16 2022-04-08 中国华能集团清洁能源技术研究院有限公司 对比一致性学习的遥感图像语义分割模型训练方法及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445008A (zh) * 2020-03-24 2020-07-24 暗物智能科技(广州)有限公司 一种基于知识蒸馏的神经网络搜索方法及***
US20210334543A1 (en) * 2020-04-28 2021-10-28 Ajou University Industry-Academic Cooperation Foundation Method for semantic segmentation based on knowledge distillation
WO2022098203A1 (en) * 2020-11-09 2022-05-12 Samsung Electronics Co., Ltd. Method and apparatus for image segmentation
CN113409299A (zh) * 2021-07-12 2021-09-17 北京邮电大学 一种医学图像分割模型压缩方法
CN115546492A (zh) * 2022-11-30 2022-12-30 苏州浪潮智能科技有限公司 一种图像实例分割方法、***、设备以及存储介质

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU, HUI: "Research on Flower Recognition Based on Wide Residual Network and Transfer Learning", AGRICULTURE & TECHNOLOGY, CHINA MASTER'S THESES FULL-TEXT DATABASE, no. 07, 15 July 2020 (2020-07-15), ISSN: 1674-0246 *
PARK SANGYONG, HEO YONG SEOK: "Knowledge Distillation for Semantic Segmentation Using Channel and Spatial Correlations and Adaptive Cross Entropy", SENSORS, MDPI, CH, vol. 20, no. 16, 1 January 2020 (2020-01-01), CH , pages 4616, XP093176553, ISSN: 1424-8220, DOI: 10.3390/s20164616 *

Also Published As

Publication number Publication date
CN115546492B (zh) 2023-03-10
CN115546492A (zh) 2022-12-30

Similar Documents

Publication Publication Date Title
WO2024113782A1 (zh) 一种图像实例分割方法、***、设备以及非易失性可读存储介质
CN107945204B (zh) 一种基于生成对抗网络的像素级人像抠图方法
CN112347248A (zh) 一种方面级文本情感分类方法及***
CN114255361A (zh) 神经网络模型的训练方法、图像处理方法及装置
CN110782008A (zh) 深度学习模型的训练方法、预测方法和装置
CN114144794A (zh) 电子装置及用于控制电子装置的方法
CN112084911B (zh) 一种基于全局注意力的人脸特征点定位方法及***
Zhu et al. Nasb: Neural architecture search for binary convolutional neural networks
CN112561028A (zh) 训练神经网络模型的方法、数据处理的方法及装置
CN112686376A (zh) 一种基于时序图神经网络的节点表示方法及增量学习方法
CN116992779B (zh) 基于数字孪生模型的光伏储能***仿真方法及***
CN113065525A (zh) 年龄识别模型训练方法、人脸年龄识别方法及相关装置
CN116564355A (zh) 一种基于自注意力机制融合的多模态情感识别方法、***、设备及介质
CN116049459A (zh) 跨模态互检索的方法、装置、服务器及存储介质
CN113609904B (zh) 一种基于动态全局信息建模和孪生网络的单目标跟踪算法
KR102149355B1 (ko) 연산량을 줄이는 학습 시스템
CN113313250B (zh) 采用混合精度量化与知识蒸馏的神经网络训练方法及***
CN114491289A (zh) 一种双向门控卷积网络的社交内容抑郁检测方法
CN114298224A (zh) 图像分类方法、装置以及计算机可读存储介质
CN113299298A (zh) 残差单元及网络及目标识别方法及***及装置及介质
CN115376195B (zh) 训练多尺度网络模型的方法及人脸关键点检测方法
CN115995002A (zh) 一种网络构建方法及城市场景实时语义分割方法
CN113887653B (zh) 一种基于三元网络的紧耦合弱监督学习的定位方法及***
CN114298290A (zh) 一种基于自监督学习的神经网络编码方法及编码器
CN113095328A (zh) 一种基尼指数引导的基于自训练的语义分割方法