CN113705371B - Water visual scene segmentation method and device - Google Patents

Water visual scene segmentation method and device

Info

Publication number
CN113705371B
CN113705371B (application CN202110914168.3A)
Authority
CN
China
Prior art keywords
live
network
semantic
image
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110914168.3A
Other languages
Chinese (zh)
Other versions
CN113705371A (en
Inventor
肖长诗
陈芊芊
陈华龙
文元桥
张帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202110914168.3A
Publication of CN113705371A
Application granted
Publication of CN113705371B
Legal status: Active

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a water visual scene segmentation method, which comprises the following steps: collecting a live-action image of a scene on water, and carrying out semantic segmentation on the live-action image by adopting a pre-training semantic segmentation network to generate a semantic label of each pixel in the live-action image; dividing the live-action image by adopting a feature clustering algorithm to obtain a plurality of super-pixel areas; counting the proportion of pixels corresponding to various semantic tags in each super-pixel region, taking the semantic tag of the pixel with the largest proportion as the semantic tag of the corresponding super-pixel region, and calculating the confidence weight of the semantic tag of the corresponding super-pixel region according to the proportion; establishing a live-action training sample set according to the live-action image marked with the semantic tag and the confidence weight; training the deep convolutional neural network through a live-action training sample set to obtain a semantic segmentation network; inputting the image to be identified into a semantic segmentation network to obtain a semantic segmentation result. The method and the device can automatically generate the semantic tags of the training samples of the semantic segmentation network.

Description

Water visual scene segmentation method and device
Technical Field
The application relates to the technical field of scene understanding on water, in particular to a method and a device for segmenting a visual scene on water and a computer storage medium.
Background
Traditional image semantic segmentation methods mainly include pixel-level thresholding, segmentation based on pixel clustering, and segmentation based on graph partitioning. These methods rely chiefly on low-dimensional visual features such as colour, texture and edges: feature extraction algorithms are used to extract visual information such as object edges and textures, and regions and objects in the image are then separated according to these low-level features. Commonly used image features include histogram of oriented gradients features, SIFT features, SURF features, local binary pattern (LBP) features and Gabor features.
With the development of neural network technology, semantic segmentation networks have also been applied to image semantic segmentation. A network such as U-Net can be trained offline on public image data sets such as ImageNet to obtain an image semantic segmentation network. However, because such training sets are not specific to waterborne navigation scenes, directly applying the resulting network to on-water semantic segmentation produces large errors, so the network must be retrained to adapt its structure and weights to the new application scene. Retraining requires a labelled live-action training data set, and generating such a data set by manual annotation is inefficient and error-prone.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a water visual scene segmentation method, apparatus and computer storage medium that address the difficulty and error-proneness of labelling live-action training data for a semantic segmentation network.
The application provides a water visual scene segmentation method, which comprises the following steps:
collecting a live-action image of a scene on water, and carrying out semantic segmentation on the live-action image by adopting a pre-training semantic segmentation network to generate a semantic label of each pixel in the live-action image;
dividing the live-action image by adopting a feature clustering algorithm to obtain a plurality of super-pixel areas;
counting the proportion of pixels corresponding to various semantic tags in each super-pixel region, taking the semantic tag of the pixel with the largest proportion as the semantic tag of the corresponding super-pixel region, and calculating the confidence weight of the semantic tag of the corresponding super-pixel region according to the proportion;
establishing a live-action training sample set according to the live-action image marked with the semantic tag and the confidence weight;
training the deep convolutional neural network through the live-action training sample set to obtain a semantic segmentation network;
inputting the image to be identified into the semantic segmentation network to obtain a semantic segmentation result.
Further, the feature clustering algorithm is adopted to segment the live-action image, so that a plurality of super-pixel areas are obtained, specifically:
distinguishing different scene areas by using life cycle semantic feature points among the live-action image sequences, modeling feature time statistics of the feature points by using Gaussian distribution of different parameters to obtain a double Gaussian model, and taking the double Gaussian model as a likelihood function of the extracted feature points;
calculating likelihood functions of other pixel points except the feature points in the live-action image by adopting a clustering algorithm;
under a Bayesian framework, taking a loss value of convolutional neural network segmentation as a priori probability, and calculating classification probabilities of all pixel points in an image by combining the likelihood functions:
P_r(X_{i,j} = o | Y) ∝ P(Y | X_{i,j} = o) × P(X_{i,j} = o);
wherein P_r(X_{i,j} = o | Y) is the classification probability, P(Y | X_{i,j} = o) is the likelihood function, and P(X_{i,j} = o) is the prior probability;
and the classification probability is in direct proportion to the product of the prior probability and the likelihood function, and the semantic segmentation of the live-action image is completed according to the classification probability, so that a plurality of super-pixel areas are obtained.
Further, a clustering algorithm is adopted to calculate likelihood functions of other pixel points except the feature points in the live-action image, and the likelihood functions are specifically as follows:
calculating the distance and gray level difference between other pixel points and the feature point based on a clustering algorithm model by taking the extracted feature point as a center, and assuming that the clustering algorithm model is as follows:
P(Y_i | X_{i,j} = o) = K · exp(−ΔI_{i,j} · Δd_{i,j});
wherein P(Y_i | X_{i,j} = o) is the likelihood function of the pixel points other than the feature points, X_{i,j} ∈ {o, w}, o indicates that the pixel belongs to an obstacle, w indicates that the pixel belongs to the water surface, Y_i is the observed value, K is a scaling factor, ΔI_{i,j} is the gray-level difference between the pixel point and the feature point, and Δd_{i,j} is the distance between the pixel point and the feature point.
Further, calculating the confidence weight of the semantic label of the corresponding superpixel region according to the proportion, wherein the confidence weight is specifically as follows:
acquiring the proportion as a first weight factor;
acquiring the feature quantity of the life cycle of the super pixel area as a second weight factor;
acquiring the coverage proportion of radar echo signals in the super-pixel area as a third weight factor;
and normalizing the first weight factor, the second weight factor and the third weight factor to obtain three corresponding probabilities, and taking the difference value of the maximum probability and the second maximum probability as the confidence weight.
Further, a live-action training sample set is established according to the live-action image marked with the semantic tag and the confidence weight, and specifically comprises the following steps:
constructing a generated countermeasure network, and training the generated countermeasure network by utilizing the live-action image;
and automatically generating training samples by using the trained generating countermeasure network, and constructing the live-action training sample set.
Further, constructing a generated countermeasure network, and training the generated countermeasure network by using the live-action image specifically comprises the following steps:
constructing a generating network by adopting a U-Net-style encoder-decoder structure without inter-layer skip connections;
constructing a discrimination network by adopting a triple network structure;
inputting the live-action image as an input image into the generation network to obtain a generated image;
the Triplet network comprises three feature extraction networks, wherein an input image, a generated image and a reference image are respectively input into the three feature extraction networks and transformed into the same deep feature space, and the distance of a feature vector is used as a loss function to calculate a loss value;
training the generation of the antagonism network by back-propagating the loss values.
Further, the loss function is:
G* = arg min_G max_D (L_CGAN) + α·L_content + β·L_environment;
wherein G* denotes the optimal generator sought by training, α and β are hyper-parameters, L_CGAN denotes the loss function of the generative adversarial network, L_content is a constraint term on the input scene, L_environment is a constraint term related to the migrated features of the reference image, max_D denotes taking the maximum over the discriminator, and min_G denotes taking the minimum over the generator.
Further, before training the generating countermeasure network by using the live-action image, the method further includes:
and pre-training the identification network for generating the countermeasure network by using the manually marked sample image set.
The application also provides a water visual scene segmentation device, which comprises a processor and a memory, wherein the memory is stored with a computer program, and the water visual scene segmentation method is realized when the computer program is executed by the processor.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the above-mentioned method of water visual scene segmentation.
Beneficial effects: the application first segments the input image in parallel with two segmentation methods; the regions generated by the two methods will not coincide exactly, and the differences are largest at the boundaries between different regions. The semantic label of each super-pixel is then determined from the distribution of the semantic labels of the pixels it contains. The super-pixel segmentation map, weighted with semantic labels and confidences, is used as new training data for training the semantic segmentation network online. The application automatically generates training data for the segmentation network with high efficiency and a low error rate.
Drawings
Fig. 1 is a flowchart of a method of a first embodiment of a method for segmenting a water visual scene.
Detailed Description
The following detailed description of preferred embodiments of the application is made in connection with the accompanying drawings, which form a part hereof, and together with the description of the embodiments of the application, are used to explain the principles of the application and are not intended to limit the scope of the application.
Example 1
As shown in fig. 1, embodiment 1 of the present application provides a method for segmenting a water visual scene, which is characterized by comprising the following steps:
s1, acquiring a live-action image of a scene on water, and performing semantic segmentation on the live-action image by adopting a pre-training semantic segmentation network to generate a semantic label of each pixel in the live-action image;
s2, segmenting the live-action image by adopting a feature clustering algorithm to obtain a plurality of super-pixel areas;
s3, counting the proportion of pixels corresponding to various semantic labels in each super-pixel region, taking the semantic label of the pixel with the largest proportion as the semantic label of the corresponding super-pixel region, and calculating the confidence weight of the semantic label of the corresponding super-pixel region according to the proportion;
s4, building a live-action training sample set according to the live-action image marked with the semantic tag and the confidence weight;
s5, training the deep convolutional neural network through the live-action training sample set to obtain a semantic segmentation network;
s6, inputting the image to be identified into the semantic segmentation network to obtain a semantic segmentation result.
In order to automatically generate semantic labels for live-action images, this embodiment first segments the current input image in parallel with two segmentation methods. The first method: a pre-trained semantic segmentation network assigns a semantic label to each pixel in the image; the labels may be, for example, water surface, sky and shoreline. The second method: feature-cluster segmentation, such as scale-adaptive clustering or graph-based segmentation, generates super-pixel regions without semantic information; the clustering result can be further optimised with a Markov random field method if required. The regions produced by the two segmentation methods will not coincide exactly, and the differences are largest at the boundaries between different regions, so the semantic label of each super-pixel is determined from the semantic distribution of the pixels it contains. For example, if the majority of pixels in a super-pixel region carry the label "water surface", the super-pixel region is also labelled "water surface", and the confidence of this label is proportional to the proportion of "water surface" pixels in the region. The super-pixel segmentation map, weighted with semantic labels and confidences, is used as new training data for training the semantic segmentation network online. The effect of the confidence weight appears in the loss function of network training: super-pixels with higher confidence contribute more to the loss value.
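As a concrete illustration of this label-voting step, the following minimal Python/NumPy sketch assigns each super-pixel the majority per-pixel semantic label and a confidence equal to the proportion of pixels carrying that label. All array and function names are illustrative and not taken from the patent.

```python
import numpy as np

def label_superpixels(pixel_labels, superpixel_ids, num_classes):
    """Majority-vote a semantic label for every superpixel.

    pixel_labels   : (H, W) int array, per-pixel labels from the pre-trained
                     segmentation network (e.g. 0=water, 1=sky, 2=obstacle).
    superpixel_ids : (H, W) int array, superpixel index of every pixel from
                     the feature-clustering segmentation.
    Returns dicts {superpixel_id: label} and {superpixel_id: confidence},
    where confidence is the proportion of the winning label inside the region.
    """
    sp_label, sp_conf = {}, {}
    for sp in np.unique(superpixel_ids):
        mask = superpixel_ids == sp
        counts = np.bincount(pixel_labels[mask], minlength=num_classes)
        winner = int(np.argmax(counts))
        sp_label[sp] = winner
        sp_conf[sp] = counts[winner] / counts.sum()  # proportion of majority label
    return sp_label, sp_conf
```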
In the above live-action training data generation scheme, data quality is determined by two factors: the first is the quality of the feature-cluster segmentation; the second is the calculation of the label confidence weights. This embodiment proposes improvements for both factors, described in detail below.
Preferably, the feature clustering algorithm is adopted to segment the live-action image, so as to obtain a plurality of super-pixel areas, which specifically are:
distinguishing different scene areas by using life cycle semantic feature points among the live-action image sequences, modeling feature time statistics of the feature points by using Gaussian distribution of different parameters to obtain a double Gaussian model, and taking the double Gaussian model as a likelihood function of the extracted feature points;
calculating likelihood functions of other pixel points except the feature points in the live-action image by adopting a clustering algorithm;
under a Bayesian framework, taking a loss value of convolutional neural network segmentation as a priori probability, and calculating classification probabilities of all pixel points in an image by combining the likelihood functions:
P_r(X_{i,j} = o | Y) ∝ P(Y | X_{i,j} = o) × P(X_{i,j} = o);
wherein P_r(X_{i,j} = o | Y) is the classification probability, P(Y | X_{i,j} = o) is the likelihood function, and P(X_{i,j} = o) is the prior probability;
and the classification probability is in direct proportion to the product of the prior probability and the likelihood function, and the semantic segmentation of the live-action image is completed according to the classification probability, so that a plurality of super-pixel areas are obtained.
The double Gaussian model uses Gaussian densities of the form:
f_{μ,σ}(t) = (1 / (σ·√(2π))) · exp(−(t − μ)² / (2σ²));
wherein t is the observed lifetime of a feature point, μ is the mean of the life-cycle model, σ is the standard deviation of the life-cycle model, and f_{μ,σ}(t) is the likelihood function of the extracted feature points.
And combining the double Gaussian model and the clustering algorithm model to calculate likelihood function distribution of all pixel points in the image.
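The sketch below illustrates, in Python/NumPy, how lifecycle statistics of tracked feature points could be turned into likelihoods under two Gaussians with different parameters (one for persistent obstacle features, one for short-lived water-surface features). The split into an "obstacle" and a "water" component and all parameter values are illustrative assumptions, not values from the patent.

```python
import numpy as np

def gaussian_pdf(t, mu, sigma):
    """Gaussian density over the observed feature lifetime t."""
    return np.exp(-(t - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def feature_point_likelihoods(lifetimes, mu_obstacle=30.0, sigma_obstacle=8.0,
                              mu_water=5.0, sigma_water=3.0):
    """Double-Gaussian lifecycle model: obstacle features tend to persist
    across many frames, water-surface features are short-lived.  The means
    and standard deviations here are placeholders; in practice they would be
    fitted to tracked feature data."""
    p_obstacle = gaussian_pdf(lifetimes, mu_obstacle, sigma_obstacle)
    p_water = gaussian_pdf(lifetimes, mu_water, sigma_water)
    return p_obstacle, p_water
```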
Preferably, a clustering algorithm is adopted to calculate likelihood functions of other pixel points except the feature points in the live-action image, and the likelihood functions are specifically as follows:
and calculating the distance and gray level difference between other pixel points and the feature points based on a clustering algorithm model by taking the extracted feature points as the center, wherein the clustering algorithm model is as follows:
P(Y_i | X_{i,j} = o) = K · exp(−ΔI_{i,j} · Δd_{i,j});
wherein P(Y_i | X_{i,j} = o) is the likelihood function of the pixel points other than the feature points, X_{i,j} ∈ {o, w}, o indicates that the pixel belongs to an obstacle, w indicates that the pixel belongs to the water surface, Y_i is the observed value, K is a scaling factor, ΔI_{i,j} is the gray-level difference between the pixel point and the feature point, and Δd_{i,j} is the distance between the pixel point and the feature point.
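The following Python/NumPy sketch shows the clustering likelihood for non-feature pixels and the Bayesian combination with the CNN prior. For brevity it uses a single feature point and a simple two-class normalisation; the function names and this simplification are assumptions for illustration only.

```python
import numpy as np

def pixel_likelihood(image_gray, feat_xy, feat_gray, K=1.0):
    """Likelihood P(Y | X=o) for every pixel with respect to one extracted
    feature point: gray-level difference and spatial distance are combined
    as K * exp(-(delta_I) * (delta_d))."""
    h, w = image_gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    delta_d = np.sqrt((xs - feat_xy[0]) ** 2 + (ys - feat_xy[1]) ** 2)  # distance
    delta_i = np.abs(image_gray - feat_gray)                            # gray difference
    return K * np.exp(-delta_i * delta_d)

def posterior_obstacle(prior_o, lik_o, prior_w, lik_w):
    """Bayesian fusion: P(X=o | Y) ∝ P(Y | X=o) * P(X=o),
    normalised over the two classes o (obstacle) and w (water surface).
    The priors come from the CNN segmentation, the likelihoods from the
    feature / clustering models above."""
    num = prior_o * lik_o
    return num / (num + prior_w * lik_w + 1e-12)
```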
This embodiment also improves the initialisation of the feature-cluster segmentation algorithm. In region-growing image segmentation, the number and spatial distribution of the growing seeds are normally preset from prior knowledge, and different initial seed numbers and distributions lead to very different segmentation results. In this embodiment, the spatial clustering of features with different life cycles in the current input image is exploited to choose an appropriate number of seeds and appropriate seed positions within each region, reducing the dependence of the segmentation result on prior parameters. The specific steps are: first, perform feature extraction and tracking and accumulate a three-dimensional histogram of feature life cycles and positions in the current image; then set the number and positions of the region-growing seeds according to the number of peaks in the histogram and their positions; finally, complete the segmentation of the current image with a region-growing algorithm.
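A minimal sketch of this seed-initialisation idea follows, assuming a two-dimensional position histogram weighted by feature lifetime as a stand-in for the three-dimensional histogram described above; the bin count and peak-selection threshold are illustrative assumptions.

```python
import numpy as np

def seeds_from_lifecycle_histogram(feat_x, feat_y, feat_lifetime,
                                   img_shape, bins=16, peak_ratio=0.5):
    """Derive region-growing seeds from the spatial distribution of
    long-lived features instead of fixed prior parameters.

    feat_x, feat_y, feat_lifetime : 1-D arrays of tracked feature positions
                                    and their lifecycles (frame counts).
    Returns a list of (x, y) seed positions located at histogram peaks.
    """
    h, w = img_shape
    hist, xedges, yedges = np.histogram2d(
        feat_x, feat_y, bins=bins, range=[[0, w], [0, h]], weights=feat_lifetime)
    threshold = peak_ratio * hist.max()          # assumed peak-selection rule
    seeds = []
    for i, j in zip(*np.where(hist >= threshold)):
        cx = 0.5 * (xedges[i] + xedges[i + 1])   # bin-centre x coordinate
        cy = 0.5 * (yedges[j] + yedges[j + 1])   # bin-centre y coordinate
        seeds.append((cx, cy))
    return seeds
```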
Preferably, the confidence weight of the semantic label of the corresponding superpixel region is calculated according to the proportion, specifically:
acquiring the proportion as a first weight factor;
acquiring the feature quantity of the life cycle of the super pixel area as a second weight factor;
acquiring the coverage proportion of radar echo signals in the super-pixel area as a third weight factor;
and normalizing the first weight factor, the second weight factor and the third weight factor to obtain three corresponding probabilities, and taking the difference value of the maximum probability and the second maximum probability as the confidence weight.
In this embodiment, the confidence weight is calculated by fusing heterogeneous radar and AIS sensor information with the proportion of pixels carrying each semantic label within the super-pixel region.
Calculation of the label confidence weights of the live-action training data: in the basic generation scheme described above, the generated training data carries, besides the label of each pixel, weight information expressing the confidence of that label. The way these weights are assigned directly influences the network training effect through the loss function, so computing them reasonably is one of the key factors in training quality. This embodiment introduces three factors that affect the confidence weight: the semantic distribution characteristics inside the super-pixel regions generated by cluster segmentation, the distribution of feature life cycles inside each super-pixel region, and the distribution of back-projected radar and AIS signals inside each super-pixel region.
1. Semantic tag confidence weighting factors: the statistical proportion of each semantic pixel in the super pixel area is calculated as the probability that the super pixel belongs to a certain class of targets (water surface, sky and obstacle), and the proportion is a softmax loss value based on Convolutional Neural Network (CNN) segmentation.
2. Feature lifecycle confidence weight factor: the number of features of the life cycle of the super pixel area is counted, and the larger the number is, the larger the probability that the super pixel belongs to the obstacle area is.
3. Radar AIS signal backprojection confidence weighting factor: and (3) counting the coverage proportion of radar echo signals in the super-pixel area, wherein the probability that the super-pixel belongs to the obstacle area is larger as the proportion is larger.
And fusing the three weight influence factors for each type of scene target semantic label, and normalizing to obtain the probability of the super pixel belonging to each semantic label. The label category with the highest probability is used as the semantic label of the superpixel, and the confidence weight is determined by the difference between the maximum probability and the second highest probability.
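The fusion rule can be pictured with the short Python/NumPy sketch below: the three factors are combined per class, normalised to probabilities, and the confidence is the gap between the top two. The per-class combination rule, the class ordering and the weights w are assumptions for illustration; the patent only specifies normalisation and the "maximum minus second maximum" confidence.

```python
import numpy as np

def fuse_confidence(label_proportions, lifecycle_count, radar_coverage,
                    w=(1.0, 1.0, 1.0)):
    """Fuse the three weight factors for one superpixel.

    label_proportions : per-class pixel proportions from the semantic map.
    lifecycle_count   : number of long-lived features in the superpixel
                        (evidence for the obstacle class).
    radar_coverage    : fraction of the superpixel covered by radar echoes
                        (also evidence for the obstacle class).
    Returns (label_index, confidence) with confidence = top1 - top2 probability.
    """
    scores = np.asarray(label_proportions, dtype=float) * w[0]
    obstacle = 2                                  # assumed class order: water, sky, obstacle
    scores[obstacle] += w[1] * lifecycle_count + w[2] * radar_coverage
    probs = scores / scores.sum()                 # normalise to probabilities
    top1, top2 = np.sort(probs)[::-1][:2]
    return int(np.argmax(probs)), float(top1 - top2)
```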
Preferably, a live-action training sample set is established according to the live-action image marked with the semantic tag and the confidence weight, specifically:
constructing a generated countermeasure network, and training the generated countermeasure network by utilizing the live-action image;
and automatically generating training samples by using the trained generating countermeasure network, and constructing the live-action training sample set.
Expanding the training data set by data augmentation is a common practice in deep learning. Common data augmentation techniques fall into three categories: the first applies translation, rotation, stretching, warping, noise addition and similar transformations to existing training data, multiplying the size of the training set; the second renders virtual camera images of 3D digital scene models; the third uses a data generation network to generate specific image data from a random distribution.
This embodiment adopts and improves the third approach. The most common data generation network is the generative adversarial network (GAN). Its basic idea is: a GAN consists of a discriminator D and a generator G, both structured as CNNs. The generator G maps a random vector through a CNN to a virtual image; the virtual image, together with real images, is fed to the discriminator, which uses another CNN to judge whether its input is real or generated. During training, the generating network and the discriminating network are trained alternately: the loss function is designed so that the discriminator separates virtual data from real data as well as possible, while the generating network produces data as close to the real data as possible so as to reduce the discriminator's accuracy. Through this adversarial learning the two networks reach a Nash equilibrium and become optimal simultaneously.
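A minimal sketch of the alternating training scheme is given below in Python (PyTorch). G and D stand for any convolutional generator and discriminator with a sigmoid output, not the specific networks of this patent; the latent dimension and loss choice are assumptions.

```python
import torch
import torch.nn as nn

def gan_step(G, D, real_images, opt_g, opt_d, z_dim=128):
    """One alternating GAN update: first the discriminator, then the generator."""
    bce = nn.BCELoss()
    b = real_images.size(0)
    real_y, fake_y = torch.ones(b, 1), torch.zeros(b, 1)

    # 1) train the discriminator to separate real images from generated ones
    z = torch.randn(b, z_dim)
    fake_images = G(z).detach()
    loss_d = bce(D(real_images), real_y) + bce(D(fake_images), fake_y)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) train the generator to fool the discriminator
    z = torch.randn(b, z_dim)
    loss_g = bce(D(G(z)), real_y)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```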
Preferably, a generated countermeasure network is constructed, and the generated countermeasure network is trained by using the live-action image, specifically:
constructing a generating network by adopting a U-Net-style encoder-decoder structure without inter-layer skip connections;
constructing a discrimination network by adopting a triple network structure;
inputting the live-action image as an input image into the generation network to obtain a generated image;
the Triplet network comprises three feature extraction networks, wherein an input image, a generated image and a reference image are respectively input into the three feature extraction networks and transformed into the same deep feature space, and the distance of a feature vector is used as a loss function to calculate a loss value;
training the generation of the antagonism network by back-propagating the loss values.
This embodiment generates virtual data with an extended GAN, namely a conditional generative adversarial network (CGAN). The main idea is to use the semantic label image produced by the image segmentation network as a generation constraint, so that specific textures are generated in specific label regions: the textures of the generated virtual image stay consistent with the semantics, and on the level of intrinsic features they stay as close as possible to the original input image of the segmentation network. This embodiment focuses in particular on meteorological feature migration for natural scene images, for example generating virtual water surfaces with different wave heights from the calm water texture of a live-action image, generating virtual water-surface glare textures from water textures in shadow, adding virtual fog to a real scene, or virtually generating obstacles such as island reefs and ships. The detailed scheme is as follows:
generation network of cgan: an encoder/decoder architecture without an interlayer connection U-Net is employed. Scene meteorological feature migration is a non-linear pixel-to-pixel transformation mapping, so the final label output layer of the network needs to be modified to generate output for RGB three-dimensional virtual pixels.
Discrimination network of the CGAN: a Triplet network (an extension of the Siamese network) is adopted, in which three feature extraction networks share the same structure and parameters. The input image, the generated image and the reference image are transformed into the same deep feature space, the corresponding loss is computed from distance measures between the feature vectors, and the network is trained and optimised by back-propagating the errors. This guides the image produced by the generation network to be close to the input image in intrinsic features, close to the reference image in meteorological features, and consistent with the semantic segmentation map of the input image.
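The sketch below shows, in Python (PyTorch), how distances in a shared deep feature space could drive such a Triplet-style criterion. The equal weighting of the two distance terms and the name feat_net are illustrative assumptions; the actual combination used by the patent is given by the loss function described next.

```python
import torch
import torch.nn.functional as F

def triplet_feature_distance_loss(feat_net, input_img, generated_img, reference_img):
    """Distance-based loss in a shared deep feature space (sketch).

    feat_net maps an image batch to feature vectors; the same weights are
    applied to the input, generated and reference images, as in a Triplet
    network with shared parameters.
    """
    f_in = feat_net(input_img)
    f_gen = feat_net(generated_img)
    f_ref = feat_net(reference_img)
    # pull the generated image towards the input on intrinsic (content) features
    content = F.pairwise_distance(f_gen, f_in).mean()
    # pull it towards the reference on the migrated (meteorological) features
    environment = F.pairwise_distance(f_gen, f_ref).mean()
    return content + environment
```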
Preferably, the loss function is:
G* = arg min_G max_D (L_CGAN) + α·L_content + β·L_environment;
wherein G* denotes the optimal generator sought by training, α and β are hyper-parameters, L_CGAN denotes the loss function of the generative adversarial network, L_content is a constraint term on the input scene, L_environment is a constraint term related to the migrated features of the reference image, max_D denotes taking the maximum over the discriminator, and min_G denotes taking the minimum over the generator.
Network training loss function design: the loss function used to train the meteorological-feature-migration CGAN consists of three parts. The first part is the ordinary CGAN loss L_CGAN; the second part is the constraint L_content on the input scene, i.e. the intrinsic scene of the generated image should be as close as possible to that of the input; the third part is the constraint L_environment related to the meteorological features of the reference image of the target meteorological scene to be migrated.
The three constraint terms are calculated as follows:
L_CGAN = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))];
where D (x, y) represents the ability of the discriminator to discriminate between real and virtual scenes given the semantic tag x and G (x, z) represents the ability of the generator to generate virtual scene y from random noise z given the semantic tag x.
The meteorological environment concerned in this embodiment has regional characteristics on the visual characteristics of the scene, for example, under high storm weather, the characteristics of the water surface and the characteristics of the on-shore target or sky are greatly different, so that the drawing style characteristic statistical method based on the whole image is not applicable. According to the embodiment, semantic information of the segmented regions is introduced into network input, so that the features have the characteristic of spatial region limitation, and the network is guided to learn different meteorological features of different regions in the mode, so that the purpose of generating a virtual image of a realistic scene as training data is achieved.
The three loss functions are integrated, and the training optimization target of the generator G is as follows:
G* = arg min_G max_D (L_CGAN) + α·L_content + β·L_environment.
the network optimization training adopts a random gradient descent method SGD with momentum, and the network generalization method adopts common methods such as drop-out and L2/L1 constraint.
Preferably, before training the generating countermeasure network by using the live-action image, the method further includes:
and pre-training the identification network for generating the countermeasure network by using the manually marked sample image set.
To reduce the complexity of CGAN training, the discrimination network can be pre-trained with manually labelled data sets of different meteorological scenes. In the supervised pre-training scheme, a fully connected layer and a softmax output layer are appended to the feature extraction network for semantic classification of meteorological scenes; the meteorological semantic dictionary covers common navigation and weather scenes. An alternative scheme performs supervised learning jointly with the image semantic segmentation network on different manually labelled meteorological navigation scene data sets, with the meteorological feature extraction network partially sharing the encoder of the semantic segmentation network.
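As an illustration of the pre-training head, the Python (PyTorch) sketch below appends a fully connected layer and a softmax over meteorological scene classes to an existing feature extractor. The feature dimension and the number of weather classes are assumed values.

```python
import torch.nn as nn

class WeatherClassifierHead(nn.Module):
    """Supervised pre-training head for the discriminator's feature extractor:
    fully connected layer plus softmax over meteorological scene classes."""

    def __init__(self, feature_dim=512, num_weather_classes=6):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_weather_classes)

    def forward(self, features):
        # During training one would typically feed self.fc(features) to
        # CrossEntropyLoss (which applies log-softmax internally); the
        # explicit softmax here exposes per-class probabilities.
        return nn.functional.softmax(self.fc(features), dim=-1)
```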
When training the semantic segmentation network, this embodiment generates the training data set online and in real time without manual annotation, automatically producing high-quality, scene-adapted labelled training data. To improve network accuracy, pixel regions with high semantic confidence should be emphasised during training and the influence of pixels with low semantic confidence reduced; therefore a weight term is computed for the semantic label of each pixel so that the network loss function adjusts automatically. The generated training data are obtained by extracting image features and fusing radar and AIS data, and the automatically generated labels end up with different confidences in different regions.
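One way to realise such a confidence-adjusted loss is a per-pixel cross-entropy scaled by the generated label confidence, sketched below in Python (PyTorch). This is a minimal illustration under the assumption of dense per-pixel confidence weights, not the patent's exact loss formulation.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits, target_labels, confidence_weights):
    """Per-pixel cross-entropy scaled by the confidence of each generated label.

    logits             : (B, C, H, W) raw network outputs.
    target_labels      : (B, H, W) automatically generated semantic labels.
    confidence_weights : (B, H, W) label confidences in [0, 1].
    Pixels with higher label confidence contribute more to the loss value.
    """
    per_pixel = F.cross_entropy(logits, target_labels, reduction="none")
    return (confidence_weights * per_pixel).mean()
```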
Example 2
Embodiment 2 of the present application provides a water visual scene segmentation apparatus, including a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, the water visual scene segmentation method provided in embodiment 1 is implemented.
The device for dividing the water visual scene provided by the embodiment of the application is used for realizing the method for dividing the water visual scene, so that the device for dividing the water visual scene has the technical effects as well and is not described in detail herein.
Example 3
Embodiment 3 of the present application provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method for water visual scene segmentation provided in embodiment 1.
The computer storage medium provided by the embodiment of the application is used for realizing the method for dividing the water visual scene, so that the technical effects of the method for dividing the water visual scene are achieved, and the computer storage medium is also provided and will not be described herein.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application.

Claims (7)

1. The method for segmenting the water visual scene is characterized by comprising the following steps of:
collecting a live-action image of a scene on water, and carrying out semantic segmentation on the live-action image by adopting a pre-training semantic segmentation network to generate a semantic label of each pixel in the live-action image;
dividing the live-action image by adopting a feature clustering algorithm to obtain a plurality of super-pixel areas;
counting the proportion of pixels corresponding to various semantic tags in each super-pixel region, taking the semantic tag of the pixel with the largest proportion as the semantic tag of the corresponding super-pixel region, and calculating the confidence weight of the semantic tag of the corresponding super-pixel region according to the proportion;
establishing a live-action training sample set according to the live-action image marked with the semantic tag and the confidence weight;
training the deep convolutional neural network through the live-action training sample set to obtain a semantic segmentation network;
inputting the image to be identified into the semantic segmentation network to obtain a semantic segmentation result;
the live-action image is segmented by adopting a feature clustering algorithm to obtain a plurality of super-pixel areas, which are specifically as follows:
distinguishing different scene areas by using life cycle semantic feature points among the live-action image sequences, modeling feature time statistics of the feature points by using Gaussian distribution of different parameters to obtain a double Gaussian model, and taking the double Gaussian model as a likelihood function of the extracted feature points;
calculating likelihood functions of other pixel points except the feature points in the live-action image by adopting a clustering algorithm;
under a Bayesian framework, taking a loss value of convolutional neural network segmentation as a priori probability, and calculating classification probabilities of all pixel points in an image by combining the likelihood functions:
P_r(X_{i,j} = o | Y) ∝ P(Y | X_{i,j} = o) × P(X_{i,j} = o);
wherein P_r(X_{i,j} = o | Y) is the classification probability, P(Y | X_{i,j} = o) is the likelihood function, and P(X_{i,j} = o) is the prior probability;
the classification probability is in direct proportion to the product of the prior probability and the likelihood function, and the semantic segmentation of the live-action image is completed according to the classification probability, so that a plurality of super-pixel areas are obtained;
calculating likelihood functions of other pixel points except the feature points in the live-action image by adopting a clustering algorithm, wherein the likelihood functions are specifically as follows:
and calculating the distance and gray level difference between other pixel points and the feature points based on a clustering algorithm model by taking the extracted feature points as the center, wherein the clustering algorithm model is as follows:
P(Y_i | X_{i,j} = o) = K · exp(−ΔI_{i,j} · Δd_{i,j});
wherein P(Y_i | X_{i,j} = o) is the likelihood function of the pixel points other than the feature points, X_{i,j} ∈ {o, w}, o indicates that the pixel belongs to an obstacle, w indicates that the pixel belongs to the water surface, Y_i is the observed value, K is a scaling factor, ΔI_{i,j} is the gray-level difference between the pixel point and the feature point, and Δd_{i,j} is the distance between the pixel point and the feature point;
calculating the confidence weight of the semantic label of the corresponding superpixel region according to the proportion, wherein the confidence weight is specifically as follows:
acquiring the proportion as a first weight factor;
obtaining likelihood function distribution of all pixel points of the super pixel area as a second weight factor;
acquiring the coverage proportion of radar echo signals in the super-pixel area as a third weight factor;
and normalizing the first weight factor, the second weight factor and the third weight factor to obtain three corresponding probabilities, and taking the difference value of the maximum probability and the second maximum probability as the confidence weight.
2. The method for segmenting the water visual scene according to claim 1, wherein the real-scene training sample set is established according to the real-scene image marked with the semantic tag and the confidence weight, specifically:
constructing a generated countermeasure network, and training the generated countermeasure network by utilizing the live-action image;
and automatically generating training samples by using the trained generating countermeasure network, and constructing the live-action training sample set.
3. The method of segmentation of a visual scene on water according to claim 2, characterized in that a generated countermeasure network is constructed, which is trained with the live-action image, in particular:
constructing a generating network by adopting a U-Net-style encoder-decoder structure without inter-layer skip connections;
constructing a discrimination network by adopting a triple network structure;
inputting the live-action image as an input image into the generation network to obtain a generated image;
the Triplet network comprises three feature extraction networks, wherein an input image, a generated image and a reference image are respectively input into the three feature extraction networks and transformed into the same deep feature space, and the distance of a feature vector is used as a loss function to calculate a loss value;
training the generation of the antagonism network by back-propagating the loss values.
4. A method of segmentation of a visual scene in water according to claim 3, wherein said loss function is:
G* = arg min_G max_D (L_CGAN) + α·L_content + β·L_environment;
wherein G* denotes the optimal generator sought by training, α and β are hyper-parameters, L_CGAN denotes the loss function of the generative adversarial network, L_content is a constraint term on the input scene, L_environment is a constraint term related to the migrated features of the reference image, max_D denotes taking the maximum over the discriminator, and min_G denotes taking the minimum over the generator.
5. The method of claim 2, further comprising, prior to training the generating an countermeasure network using the live-action image:
and pre-training the identification network for generating the countermeasure network by using the manually marked sample image set.
6. A water visual scene segmentation apparatus comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements a water visual scene segmentation method according to any of claims 1-5.
7. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method of water visual scene segmentation as defined in any one of claims 1-5.
CN202110914168.3A 2021-08-10 2021-08-10 Water visual scene segmentation method and device Active CN113705371B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110914168.3A CN113705371B (en) 2021-08-10 2021-08-10 Water visual scene segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110914168.3A CN113705371B (en) 2021-08-10 2021-08-10 Water visual scene segmentation method and device

Publications (2)

Publication Number Publication Date
CN113705371A CN113705371A (en) 2021-11-26
CN113705371B true CN113705371B (en) 2023-12-01

Family

ID=78652132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110914168.3A Active CN113705371B (en) 2021-08-10 2021-08-10 Water visual scene segmentation method and device

Country Status (1)

Country Link
CN (1) CN113705371B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419321B (en) * 2022-03-30 2022-07-08 珠海市人民医院 CT image heart segmentation method and system based on artificial intelligence
CN115170481A (en) * 2022-06-20 2022-10-11 中国地质大学(武汉) Natural resource image analysis method and system based on visual saliency
CN114926463B (en) * 2022-07-20 2022-09-27 深圳市尹泰明电子有限公司 Production quality detection method suitable for chip circuit board
CN116523912B (en) * 2023-07-03 2023-09-26 四川省医学科学院·四川省人民医院 Cleanliness detection system and method based on image recognition
CN116665137B (en) * 2023-08-01 2023-10-10 聊城市彩烁农业科技有限公司 Livestock breeding wastewater treatment method based on machine vision

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764027A (en) * 2018-04-13 2018-11-06 上海大学 A kind of sea-surface target detection method calculated based on improved RBD conspicuousnesses
CN109063723A (en) * 2018-06-11 2018-12-21 清华大学 The Weakly supervised image, semantic dividing method of object common trait is excavated based on iteration
CN109740638A (en) * 2018-12-14 2019-05-10 广东水利电力职业技术学院(广东省水利电力技工学校) A kind of method and device of EM algorithm two-dimensional histogram cluster
CN109919159A (en) * 2019-01-22 2019-06-21 西安电子科技大学 A kind of semantic segmentation optimization method and device for edge image
WO2021017372A1 (en) * 2019-08-01 2021-02-04 中国科学院深圳先进技术研究院 Medical image segmentation method and system based on generative adversarial network, and electronic equipment
CN112446342A (en) * 2020-12-07 2021-03-05 北京邮电大学 Key frame recognition model training method, recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yu-Ting Chang et al., "Weakly-Supervised Semantic Segmentation via Sub-category Exploration", IEEE Xplore, pp. 8991-9000 *
Wu Lü, "Image classification and recognition based on local features and weak annotation information", China Doctoral Dissertations Full-text Database, Information Science and Technology, No. 7, pp. 1-114 *

Also Published As

Publication number Publication date
CN113705371A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN113705371B (en) Water visual scene segmentation method and device
CN110335290B (en) Twin candidate region generation network target tracking method based on attention mechanism
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
CN110472627B (en) End-to-end SAR image recognition method, device and storage medium
CN108304873B (en) Target detection method and system based on high-resolution optical satellite remote sensing image
CN109241913B (en) Ship detection method and system combining significance detection and deep learning
CN106909902B (en) Remote sensing target detection method based on improved hierarchical significant model
CN111652321A (en) Offshore ship detection method based on improved YOLOV3 algorithm
CN111445488B (en) Method for automatically identifying and dividing salt body by weak supervision learning
CN112733614B (en) Pest image detection method with similar size enhanced identification
CN109543632A (en) A kind of deep layer network pedestrian detection method based on the guidance of shallow-layer Fusion Features
CN112418330A (en) Improved SSD (solid State drive) -based high-precision detection method for small target object
CN112734764A (en) Unsupervised medical image segmentation method based on countermeasure network
CN108596195B (en) Scene recognition method based on sparse coding feature extraction
CN110853070A (en) Underwater sea cucumber image segmentation method based on significance and Grabcut
Yuan et al. Neighborloss: a loss function considering spatial correlation for semantic segmentation of remote sensing image
CN110334656A (en) Multi-source Remote Sensing Images Clean water withdraw method and device based on information source probability weight
CN112329771B (en) Deep learning-based building material sample identification method
CN113177503A (en) Arbitrary orientation target twelve parameter detection method based on YOLOV5
CN115410081A (en) Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium
CN113554679A (en) Anchor-frame-free target tracking algorithm for computer vision application
CN110472632B (en) Character segmentation method and device based on character features and computer storage medium
Qian et al. A hybrid network with structural constraints for SAR image scene classification
CN117253044B (en) Farmland remote sensing image segmentation method based on semi-supervised interactive learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant