CN110751153A - Semantic annotation method for RGB-D image of indoor scene - Google Patents


Info

Publication number
CN110751153A
Authority
CN
China
Prior art keywords
superpixel
pixel
super
feature
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910886599.6A
Other languages
Chinese (zh)
Other versions
CN110751153B (en)
Inventor
王立春
刘甜
王少帆
孔德慧
李敬华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910886599.6A priority Critical patent/CN110751153B/en
Publication of CN110751153A publication Critical patent/CN110751153A/en
Application granted granted Critical
Publication of CN110751153B publication Critical patent/CN110751153B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A semantic annotation method for indoor scene RGB-D images allows the receptive field of the indoor scene semantic annotation method not to be limited to individual superpixels, constructs a semantic feature representation of superpixel groups, and further optimizes the superpixel group features with metric learning, thereby improving the accuracy of indoor scene understanding. The semantic annotation method comprises the following steps: (1) performing superpixel segmentation on the RGB-D indoor scene image with the gPb/UCM algorithm; (2) superpixel feature extraction: performing Patch feature calculation and superpixel feature representation; (3) superpixel group feature extraction: extracting example superpixel groups and their features, and class superpixel groups and their features; (4) superpixel group feature vectorization: defining the kernel distance between Gaussian components, performing example superpixel group feature vectorization and class superpixel group feature vectorization; (5) metric learning: learning the optimization matrix L and labeling test samples based on the optimization matrix L.

Description

Semantic annotation method for RGB-D image of indoor scene
Technical Field
The invention relates to the technical field of multimedia technology and computer graphics, in particular to a semantic annotation method for an indoor scene RGB-D image.
Background
Scene understanding is a very important task in the field of computer vision. With the development of artificial intelligence in recent years, many scene understanding methods and technologies have emerged; mainstream scene understanding tasks can currently be divided into outdoor scene understanding and indoor scene understanding. Outdoor scene understanding can be applied to transportation fields such as unmanned vehicles and unmanned aerial vehicles, while indoor scene understanding includes, but is not limited to, intelligent robots and the management of indoor public places. Because outdoor scene data are simpler than indoor scenes, a great deal of research is currently based on outdoor scenes. However, people spend much more time indoors than outdoors, and if machines can understand indoor scenes, indoor scene understanding can bring more convenience to our life and work; indoor scene understanding is therefore an important research hotspot.
From a data perspective, indoor scene understanding can utilize data of multiple modalities. The original data are mostly RGB images with the three channels R, G and B. After the Kinect camera came onto the market, depth acquisition no longer depended on laser, which reduced the cost of data acquisition, so RGB-D data became widely used. With the added depth information, and benefiting from depth completion, three-dimensional point clouds and triangular meshes can be reconstructed from RGB-D data, providing richer source data for indoor scene understanding.
In practice, understanding indoor scenes is much more difficult than understanding outdoor scenes. One reason is the complexity of the indoor scene itself: objects in an indoor scene are densely distributed, occlusion between objects is severe, objects of the same class vary greatly in appearance and shape and have rich and diverse textures, different placement angles can produce very different images, and changes of the light source also present complex and varied appearances. These characteristics add particular complexity and difficulty to indoor scene understanding.
Most semantic labeling in traditional methods is based on segmented regions: superpixels are obtained by over-segmentation, features are then extracted from the superpixels, and labeling is carried out according to these features. Ren et al. over-segment the image using depth-weighted gPb/UCM and describe superpixel features with five kernels, namely depth gradient, color gradient, local binary pattern, color and global geometric features; training these features directly with an SVM achieves a class average accuracy of 71.4% on NYUv1, and 76.1% after adding context information and Markov optimization. The context information alone improves the class average accuracy by 3%, but this context only uses superpixel features obtained under over-segmentations with different thresholds, rather than features with a larger receptive field built on top of the superpixel features. Deep learning is mostly applied as end-to-end semantic labeling, with only a few methods operating on superpixels; end-to-end semantic labeling has a continuously expanding receptive field. Park et al. propose RDF-Net, which performs multi-modal fusion in a residual network and achieves a class average accuracy of 62.8% on NYUv2 and 60.1% on the SUN-RGBD dataset. Fan et al. select an RNN as the base network, extend the single-modal RNN to a multi-modal RNN, and integrate HHA depth information and color information. Both networks use upsampling to expand the receptive field; the receptive field can also be expanded with dilated convolution without losing resolution.
In general, both traditional methods and deep learning can handle many tasks in scene understanding, and deep learning has a certain advantage in accuracy on classification tasks compared with traditional methods. One reason is that deep learning has a continuously expanding receptive field, mainly realized through dilated convolution and pooling layers. Most traditional methods are region-based and depend on superpixel segmentation; they do not have a continuously expanding receptive field and lack the use of global context information.
Disclosure of Invention
In order to overcome the defects of the prior art, the technical problem to be solved by the invention is to provide a semantic annotation method for indoor scene RGB-D images in which the receptive field is not limited to individual superpixels, the semantic features of superpixel groups are optimized, and the accuracy of indoor scene understanding is improved.
The technical scheme of the invention is as follows: the semantic annotation method for an indoor scene RGB-D image comprises the following steps:
(1) performing superpixel segmentation on the RGB-D indoor scene image with the gPb/UCM algorithm;
(2) superpixel feature extraction: performing Patch feature calculation and superpixel feature representation;
(3) superpixel group feature extraction: extracting example superpixel groups and their features, and class superpixel groups and their features;
(4) superpixel group feature vectorization: defining the kernel distance between Gaussian components, performing example superpixel group feature vectorization and class superpixel group feature vectorization;
(5) metric learning: learning the optimization matrix L and labeling test samples based on the optimization matrix L.
The method performs superpixel segmentation of the RGB-D indoor scene image with the gPb/UCM algorithm; several superpixels that most likely constitute one example are called a superpixel group, and a Gaussian mixture model is used to build a semantic feature representation of the superpixel group from the superpixel features; the Gaussian components, which lie on a Riemannian manifold, are mapped into Euclidean space with a Kullback-Leibler Divergence kernel distance to obtain the feature vector representation of the superpixel group; the feature vectors are optimized with large-margin nearest-neighbor metric learning, and finally the superpixel groups are semantically annotated based on the optimized feature representation. In this way the receptive field of the indoor scene semantic annotation method is no longer limited to individual superpixels, the semantic features of the superpixel groups are optimized, and the accuracy of indoor scene understanding is improved.
Drawings
FIG. 1 is a flow chart of a semantic annotation method for an RGB-D image of an indoor scene according to the invention.
FIG. 2 is a flowchart of an embodiment of a semantic annotation method for an RGB-D image of an indoor scene according to the invention.
Detailed Description
As shown in fig. 1, the semantic annotation method for an indoor scene RGB-D image includes the following steps:
(1) performing superpixel segmentation on the RGB-D indoor scene image with the gPb/UCM algorithm;
(2) superpixel feature extraction: performing Patch feature calculation and superpixel feature representation;
(3) superpixel group feature extraction: extracting example superpixel groups and their features, and class superpixel groups and their features;
(4) superpixel group feature vectorization: defining the kernel distance between Gaussian components, performing example superpixel group feature vectorization and class superpixel group feature vectorization;
(5) metric learning: learning the optimization matrix L and labeling test samples based on the optimization matrix L.
The method performs superpixel segmentation of the RGB-D indoor scene image with the gPb/UCM algorithm; several superpixels that most likely constitute one example are called a superpixel group, and a Gaussian mixture model is used to build a semantic feature representation of the superpixel group from the superpixel features; the Gaussian components, which lie on a Riemannian manifold, are mapped into Euclidean space with a Kullback-Leibler Divergence kernel distance to obtain the feature vector representation of the superpixel group; the feature vectors are optimized with large-margin nearest-neighbor metric learning, and finally the superpixel groups are semantically annotated based on the optimized feature representation. In this way the receptive field of the indoor scene semantic annotation method is no longer limited to individual superpixels, the semantic features of the superpixel groups are optimized, and the accuracy of indoor scene understanding is improved.
Preferably, the superpixel segmentation in step (1) uses the gPb/UCM algorithm, which computes from the local and global features of the image the probability value P_b(x, y) that a pixel belongs to a boundary. The gPb/UCM algorithm is applied to the color image and to the depth image separately, yielding P_rgb(x, y), the probability value computed from the color image that a pixel belongs to a boundary, and P_depth(x, y), the probability value computed from the depth image that a pixel belongs to a boundary, which are combined according to formula (1). Different probability thresholds tr are set for the probability values obtained by formula (1) to obtain a multi-level segmentation result.
Preferably, the probability thresholds tr are 0.06 and 0.08, and pixels whose probability values are smaller than the set threshold are connected into regions according to eight-connectivity, each region being one superpixel.
Preferably, in step (2), a Patch is defined as a 16 × 16 grid; with a step of n pixels, where n = 2, the grid is slid from the upper-left corner of the color image RGB and of the depth image Depth to the right and downward, so that a dense grid finally covers the color image RGB and the depth image Depth, and four types of features are calculated for each Patch: depth gradient features, color gradient features, color features and texture features.
Preferably, the superpixel feature F_seg in step (2) is defined by formula (5) and consists of the aggregated Patch features and the superpixel geometric features:
F_seg = [F_seg^gd, F_seg^gc, F_seg^col, F_seg^tex, F_seg^geo]    (5)
F_seg^gd, F_seg^gc, F_seg^col and F_seg^tex denote the superpixel depth gradient feature, the superpixel color gradient feature, the superpixel color feature and the superpixel texture feature respectively, and are defined by formula (6):
F_seg^gd = (1/n) Σ_{i=1}^{n} F_g_d(i),  F_seg^gc = (1/n) Σ_{i=1}^{n} F_g_c(i),  F_seg^col = (1/n) Σ_{i=1}^{n} F_col(i),  F_seg^tex = (1/n) Σ_{i=1}^{n} F_tex(i)    (6)
where F_g_d(i), F_g_c(i), F_col(i), F_tex(i) denote the features of the i-th Patch whose center position falls within the superpixel seg, and n denotes the number of Patches whose center positions fall within the superpixel seg;
the superpixel geometric features F_seg^geo are defined according to formula (7) and consist of the components A_seg, P_seg, R_seg, the second-order Hu moments, D_mean, D_sq, D_var, D_miss and N_seg, which are defined as follows:
superpixel area A_seg = Σ_{s∈seg} 1, where s ranges over the pixels within the superpixel seg; the superpixel perimeter P_seg, defined by formula (8), is the number of boundary pixels of the superpixel:
P_seg = |B_seg|,  B_seg = { s ∈ seg | ∃ s' ∈ N_4(s), s' ∈ seg', seg' ≠ seg }    (8)
where M, N denote the horizontal and vertical resolution of the RGB scene image respectively; seg, seg' denote different superpixels; N_4(s) is the four-neighborhood of pixel s; B_seg is the set of boundary pixels of the superpixel seg;
the area-to-perimeter ratio R_seg of the superpixel is defined by formula (9):
R_seg = A_seg / P_seg    (9)
the second-order (2+0=2 or 0+2=2) Hu moments are computed from the x coordinate s_x of pixel s, the y coordinate s_y, and the product of the x and y coordinates respectively, and are defined by formulas (10), (11), (12):
H_xx = (1/A_seg) Σ_{s∈seg} (s_x/Width)² − (s̄_x)²    (10)
H_yy = (1/A_seg) Σ_{s∈seg} (s_y/Height)² − (s̄_y)²    (11)
H_xy = (1/A_seg) Σ_{s∈seg} (s_x/Width)(s_y/Height) − s̄_x·s̄_y    (12)
where s̄_x, s̄_y, (s̄_x)², (s̄_y)² denote the mean of the x coordinates, the mean of the y coordinates, the square of the mean of the x coordinates and the square of the mean of the y coordinates of the pixels contained in the superpixel, defined by formula (13):
s̄_x = (1/A_seg) Σ_{s∈seg} s_x / Width,   s̄_y = (1/A_seg) Σ_{s∈seg} s_y / Height    (13)
Width and Height denote the width and height of the image respectively, i.e. the calculation is based on the normalized pixel coordinate values;
D_mean, D_sq and D_var denote, respectively, the mean of the depth values s_d of the pixels s within the superpixel seg, the mean of the squared depth values, and the variance of the depth values, defined by formula (14):
D_mean = (1/A_seg) Σ_{s∈seg} s_d,  D_sq = (1/A_seg) Σ_{s∈seg} (s_d)²,  D_var = D_sq − (D_mean)²    (14)
D_miss is the proportion of pixels within the superpixel that are missing depth information, defined by formula (15):
D_miss = (1/A_seg) Σ_{s∈seg} [s_d is missing]    (15)
N_seg is the modulus of the principal normal vector of the point cloud corresponding to the superpixel, where the principal normal vector of the point cloud corresponding to the superpixel is estimated by Principal Component Analysis (PCA).
Preferably, the example superpixel group and its feature extraction performed in step (3) are as follows:
the K superpixels assumed to constitute one example on an image form an example superpixel group; the feature of the k-th superpixel is denoted F_segk, and the feature of the example superpixel group is denoted F = {F_seg1, F_seg2, …, F_segK}; a Gaussian mixture model of the form of formula (16) is built on the superpixel group feature F with the Expectation-Maximization algorithm EM, and the example superpixel group feature is represented by the set of Gaussian components G = {g_1, g_2, …, g_m}:
G(F_seg) = Σ_{i=1}^{m} ω_i · g_i(F_seg)    (16)
the Gaussian mixture model G(F_seg) is represented by the weighted sum of several Gaussian components g_i(F_seg), where F_seg is a random variable and the i-th Gaussian component g_i obeys an ordinary Gaussian distribution, g_i(F_seg) ~ N(F_seg; μ_i, Σ_i); the i-th weight ω_i is obtained by the Expectation-Maximization algorithm EM; μ_i is the expectation of the i-th Gaussian component and is a vector; Σ_i is the variance of the i-th Gaussian component and is a square matrix;
the class superpixel group and its feature extraction in step (3) are as follows:
only training samples can be used to construct class superpixel groups; given all training sample images, the set of superpixel blocks labeled as the j-th class is called a class superpixel group; the j-th class contains P superpixels, and the class feature is denoted F_j = {F_seg1, F_seg2, …, F_segP}; a Gaussian mixture model is likewise built on F_j with the EM algorithm, yielding m_j Gaussian components, and the class superpixel group feature is represented by the set H_j = {h_1^j, h_2^j, …, h_{m_j}^j}; the training samples contain N classes, and the class superpixel group features are represented by the set H_all = {H_1, H_2, …, H_N}, containing N_all = Σ_{j=1}^{N} m_j Gaussian components in total.
Preferably, the kernel distance between Gaussian components defined in step (4) is as follows:
the distance between two Gaussian components adopts the Kullback-Leibler Divergence distance, and the distance between Gaussian components g_i and g_j is computed according to formula (17):
KLD(g_i||g_j) = (1/2)[ tr(Σ_j^{-1}Σ_i) + (μ_j − μ_i)^T Σ_j^{-1} (μ_j − μ_i) − d + ln(|Σ_j|/|Σ_i|) ]    (17)
where d is the dimension of the superpixel feature; substituting formula (17) into formula (18) yields the kernel distance K(g_i, g_j) between the two Gaussian components g_i and g_j:
K(g_i, g_j) = exp{−[KLD(g_i||g_j) + KLD(g_j||g_i)]/2t²}    (18)
where t is an empirical parameter, set to 70 in the verification experiments;
the example superpixel group feature vectorization is performed as follows:
for the i-th Gaussian component g_i of the example superpixel group feature G, the kernel distances between g_i and all Gaussian components of the class superpixel group feature H_all are computed and taken as the feature vector x_i of the example superpixel group, according to formula (19):
x_i = [K(g_i, h_1), K(g_i, h_2), …, K(g_i, h_{N_all})]    (19)
the example superpixel group is represented by its set of feature vectors {x_1, x_2, …, x_m};
1) Example superpixel feature vectorization of training samples: example superpixel group features are extracted from all training sample images according to step (3), the Gaussian component features of each example superpixel group are vectorized according to formula (19), and the vectorized example superpixel group feature vectors form the training sample example feature set S_example, as in formula (20):
S_example = ∪_{t=1}^{T} X_t    (20)
where X_t is the set of vectorized Gaussian component features of the t-th example superpixel group and T is the number of examples in all training samples;
2) Example superpixel feature vectorization of test samples: all Gaussian component features of an example superpixel group are vectorized according to formula (19) to form the vector set X_test = {x_1^test, x_2^test, …, x_m^test}, where x_n^test denotes the vectorized feature of the n-th Gaussian component;
the class superpixel group feature vectorization is performed as follows:
for each Gaussian component h_k of the class superpixel group feature H_all, the kernel distances between h_k and all Gaussian components of H_all are computed and taken as the feature vector of that Gaussian component, according to formula (21):
x_k = [K(h_k, h_1), K(h_k, h_2), …, K(h_k, h_{N_all})]    (21)
all vectorized class superpixel group features form the training sample class feature set S_class, as in formula (22):
S_class = {x_1, x_2, …, x_{N_all}}    (22)
Preferably, the learning of the optimization matrix L in step (5) is as follows:
formula (23) is the objective function of the metric learning, where M is the positive semidefinite Mahalanobis matrix to be optimized; M is optimized using sample points i, whose corresponding feature vector is x_i; j⇝i indicates that sample j is an ideal neighbor of sample point i, and the feature vector of sample j is denoted x_j; l is a sample lying too close to sample i, with feature vector x_l; ξ_ijl is a non-negative slack term; μ = 0.5 balances the weight between the pulling force and the repulsive force; y_il = 1 when the labels of samples i and l are consistent, otherwise y_il = 0:
min_M Σ_{i,j⇝i} d_M(x_i, x_j) + μ Σ_{i,j⇝i} Σ_l (1 − y_il) ξ_ijl    (23)
subject to:
(1) d_M(x_i, x_l) − d_M(x_i, x_j) ≥ 1 − ξ_ijl
(2) ξ_ijl ≥ 0
(3) M ⪰ 0
where d_M(a, b) = (a − b)^T M (a − b);
solving formula (23) yields the positive semidefinite matrix M, which is decomposed as M = LL^T, L being the optimization matrix; the union of the training sample example feature set S_example and the training sample class feature set S_class is taken to obtain S_train = S_example ∪ S_class, and the optimization matrix L is learned based on S_train;
the labeling of test samples based on the optimization matrix L is as follows:
the test example superpixel group to be labeled is represented by the vector set X_test = {x_1^test, x_2^test, …, x_m^test}, and the class label of the test sample is calculated according to formula (24):
v_i = label( argmin_{x^class ∈ S_train} ‖L^T(x_i^test − x^class)‖ )    (24)
where x_i^test is a feature vector of the test example and x^class is a training sample feature vector whose class label is class; the x^class at minimum distance from x_i^test is found, its class label is denoted v_i, and v_i is the class of the test example.
The invention was tested on the NYU v1 RGB-D dataset, which contains 2284 scenes covering 13 categories in total. The dataset is partitioned into two disjoint subsets for training and testing, respectively. The training set contains 1370 scenes and the test set contains 914 scenes.
The method provided by the invention comprises the following specific steps:
1. Superpixel segmentation
The superpixel segmentation of the invention uses the gPb/UCM algorithm, which computes from the local and global features of the image the probability value P_b(x, y) that a pixel belongs to a boundary. The gPb/UCM algorithm is applied to the color image and to the depth image separately, and the results are combined according to formula (1).
In formula (1), P_rgb(x, y) is the probability value, computed from the color image, that a pixel belongs to a boundary, and P_depth(x, y) is the probability value, computed from the depth image, that a pixel belongs to a boundary; combining them yields the boundary probability P_b(x, y).
Different probability thresholds tr are set for the probability values P_b(x, y) obtained by formula (1) to obtain a multi-level segmentation result.
The probability thresholds tr set in the invention are 0.06 and 0.08; pixels whose probability values are smaller than the set threshold are connected into regions according to eight-connectivity, and each region is one superpixel.
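As an illustration of this step, the following sketch fuses the two boundary-probability maps, thresholds them and extracts 8-connected regions as superpixels. Formula (1) is not reproduced above, so the fusion is assumed here to be a plain average of the color-based and depth-based boundary probabilities, and pb_rgb, pb_depth are assumed to be precomputed gPb/UCM outputs; this is a minimal sketch, not the exact patented computation.

```python
import numpy as np
from scipy import ndimage

def superpixels_from_boundaries(pb_rgb, pb_depth, tr=0.06):
    """Split an image into superpixels from two boundary-probability maps.

    pb_rgb, pb_depth : HxW arrays in [0, 1] produced by gPb/UCM on the
    color image and on the depth image. The fusion below (plain average)
    stands in for formula (1), which is not reproduced in the text.
    """
    pb = 0.5 * (pb_rgb + pb_depth)           # assumed fusion rule
    non_boundary = pb < tr                   # pixels below the threshold
    eight_conn = np.ones((3, 3), dtype=int)  # 8-connectivity structure
    labels, num_segments = ndimage.label(non_boundary, structure=eight_conn)
    return labels, num_segments              # label 0 marks boundary pixels

# Running the same routine with tr = 0.06 and tr = 0.08 gives the
# multi-level segmentation mentioned in the text.
```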
2. Superpixel feature extraction
2.1 Patch feature calculation
A Patch is defined as a 16 × 16 grid (the grid size can be modified according to the actual data). With a step of n pixels (n = 2 in the experiments of the invention), the grid is slid from the upper-left corner of the color image (RGB) and of the depth image (Depth) to the right and downward, so that a dense grid finally covers both the color image and the depth image. Taking an image of size N × M as an example, the number of Patches finally obtained is ⌊(M − 16)/n + 1⌋ · ⌊(N − 16)/n + 1⌋.
Four types of features are calculated for each Patch: depth gradient features, color gradient features, color features and texture features.
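The dense Patch grid can be sketched as follows; the routine enumerates the top-left corners of all 16 × 16 windows with stride n and confirms the Patch count given above (the image size used in the usage line is only an example).

```python
import numpy as np

def patch_grid(height, width, patch=16, step=2):
    """Top-left corners of a dense grid of patch x patch windows."""
    ys = np.arange(0, height - patch + 1, step)
    xs = np.arange(0, width - patch + 1, step)
    corners = [(y, x) for y in ys for x in xs]
    # Number of patches: floor((H-16)/n + 1) * floor((W-16)/n + 1)
    assert len(corners) == len(ys) * len(xs)
    return corners

corners = patch_grid(480, 640)   # e.g. a 640 x 480 frame
print(len(corners))              # 233 * 313 = 72929 patches
```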
2.1.1 Depth gradient feature
A Patch in the depth image is denoted Z_d. For each Z_d the depth gradient feature F_g_d is computed, the value of its t-th component being defined by equation (1):
F_g_d(t) = Σ_{z∈Z_d} m̃_d(z) · α_t^T [ k_g(θ̃_d(z)) ⊗ k_s(z) ]    (1)
In formula (1), z ∈ Z_d denotes the relative two-dimensional coordinate position of pixel z within the depth Patch; θ̃_d(z) and m̃_d(z) denote the depth gradient direction and the gradient magnitude of pixel z respectively; k_g(θ̃_d(z)) is the vector of Gaussian-kernel responses of the gradient direction to the d_g depth gradient basis vectors, and k_s(z) is the vector of Gaussian-kernel responses of the position to the d_s position basis vectors, both sets of basis vectors being predefined values; d_g and d_s denote the numbers of depth gradient basis vectors and position basis vectors respectively; α_t is the mapping coefficient of the t-th principal component obtained by Kernel Principal Component Analysis (KPCA), and ⊗ denotes the Kronecker product; the depth gradient Gaussian kernel function and the position Gaussian kernel function each have a corresponding kernel parameter. Finally, the depth gradient feature is transformed with the EMK (Efficient Match Kernel) algorithm; the transformed feature vector is still denoted F_g_d.
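The sketch below illustrates the computation pattern of one kernel-descriptor component as described above: Gaussian kernels compare each pixel's gradient orientation and position with predefined basis vectors, the two kernel-response vectors are combined with a Kronecker product, and the KPCA coefficients of the t-th principal component project the sum to a scalar. The basis vectors, kernel bandwidths and KPCA coefficients are placeholder inputs here (in the method they are predefined or learned), and representing the orientation on the unit circle is an assumption of this sketch.

```python
import numpy as np

def gaussian_kernel_vec(x, basis, gamma):
    """Vector of Gaussian kernel values between x and each basis vector."""
    diff = basis - x                       # (n_basis, dim)
    return np.exp(-gamma * np.sum(diff ** 2, axis=1))

def depth_gradient_component(grad_mag, grad_dir, positions,
                             grad_basis, pos_basis, alpha_t,
                             gamma_g=5.0, gamma_s=3.0):
    """t-th depth-gradient kernel-descriptor component of one Patch.

    grad_mag, grad_dir : per-pixel gradient magnitude and orientation
    positions          : per-pixel relative 2-D coordinates in the Patch
    grad_basis, pos_basis : predefined basis vectors (d_g x 2, d_s x 2)
    alpha_t            : KPCA projection coefficients, length d_g * d_s
    """
    feat = 0.0
    for m, theta, z in zip(grad_mag, grad_dir, positions):
        # the orientation is compared on the unit circle (an assumption)
        k_o = gaussian_kernel_vec(np.array([np.cos(theta), np.sin(theta)]),
                                  grad_basis, gamma_g)
        k_p = gaussian_kernel_vec(np.asarray(z), pos_basis, gamma_s)
        feat += m * alpha_t @ np.kron(k_o, k_p)
    return feat
```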
2.1.2 Color gradient feature
A Patch in the color image is denoted Z_c. For each Z_c the color gradient feature F_g_c is computed, the value of its t-th component being defined by equation (2):
F_g_c(t) = Σ_{z∈Z_c} m̃_c(z) · α_t^T [ k_g(θ̃_c(z)) ⊗ k_s(z) ]    (2)
In formula (2), z ∈ Z_c denotes the relative two-dimensional coordinate position of pixel z within the color image Patch; θ̃_c(z) and m̃_c(z) denote the gradient direction and the gradient magnitude of pixel z respectively; k_g(θ̃_c(z)) is the vector of Gaussian-kernel responses of the gradient direction to the c_g color gradient basis vectors, and k_s(z) is the vector of Gaussian-kernel responses of the position to the c_s position basis vectors, both sets of basis vectors being predefined values; c_g and c_s denote the numbers of color gradient basis vectors and position basis vectors respectively; α_t is the mapping coefficient of the t-th principal component obtained by Kernel Principal Component Analysis (KPCA), and ⊗ denotes the Kronecker product; the color gradient Gaussian kernel function and the position Gaussian kernel function each have a corresponding kernel parameter. Finally, the color gradient feature is transformed with the EMK (Efficient Match Kernel) algorithm; the transformed feature vector is still denoted F_g_c.
2.1.3 Color feature
A Patch in the color image is denoted Z_c. For each Z_c the color feature F_col is computed, the value of its t-th component being defined by equation (3):
F_col(t) = Σ_{z∈Z_c} α_t^T [ k_c(r(z)) ⊗ k_s(z) ]    (3)
In formula (3), z ∈ Z_c denotes the relative two-dimensional coordinate position of pixel z within the color image Patch; r(z) is a three-dimensional vector, the RGB value of pixel z; k_c(r(z)) is the vector of Gaussian-kernel responses of r(z) to the c_c color basis vectors, and k_s(z) is the vector of Gaussian-kernel responses of the position to the c_s position basis vectors, both sets of basis vectors being predefined values; c_c and c_s denote the numbers of color basis vectors and position basis vectors respectively; α_t is the mapping coefficient of the t-th principal component obtained by Kernel Principal Component Analysis (KPCA), and ⊗ denotes the Kronecker product; the color Gaussian kernel function and the position Gaussian kernel function each have a corresponding kernel parameter. Finally, the color feature is transformed with the EMK (Efficient Match Kernel) algorithm; the transformed feature vector is still denoted F_col.
2.1.4 Texture feature
The RGB scene image is first converted into a gray-scale image, and a Patch in the gray-scale image is denoted Z_g. For each Z_g the texture feature F_tex is computed, the value of its t-th component being defined by equation (4):
F_tex(t) = Σ_{z∈Z_g} s(z) · α_t^T [ k_b(LBP(z)) ⊗ k_s(z) ]    (4)
In formula (4), z ∈ Z_g denotes the relative two-dimensional coordinate position of pixel z within the Patch; s(z) denotes the standard deviation of the pixel gray values in the 3 × 3 region centered on pixel z; LBP(z) is the Local Binary Pattern (LBP) feature of pixel z; k_b(LBP(z)) is the vector of Gaussian-kernel responses of LBP(z) to the g_b local binary pattern basis vectors, and k_s(z) is the vector of Gaussian-kernel responses of the position to the g_s position basis vectors, both sets of basis vectors being predefined values; g_b and g_s denote the numbers of local binary pattern basis vectors and position basis vectors respectively; α_t is the mapping coefficient of the t-th principal component obtained by Kernel Principal Component Analysis (KPCA), and ⊗ denotes the Kronecker product; the local binary pattern Gaussian kernel function and the position Gaussian kernel function each have a corresponding kernel parameter. Finally, the texture feature is transformed with the EMK (Efficient Match Kernel) algorithm; the transformed feature vector is still denoted F_tex.
2.2 Superpixel feature
The superpixel feature F_seg is defined by formula (5) and consists of the aggregated Patch features and the superpixel geometric features:
F_seg = [F_seg^gd, F_seg^gc, F_seg^col, F_seg^tex, F_seg^geo]    (5)
F_seg^gd, F_seg^gc, F_seg^col and F_seg^tex denote the superpixel depth gradient feature, the superpixel color gradient feature, the superpixel color feature and the superpixel texture feature respectively, and are defined by formula (6):
F_seg^gd = (1/n) Σ_{i=1}^{n} F_g_d(i),  F_seg^gc = (1/n) Σ_{i=1}^{n} F_g_c(i),  F_seg^col = (1/n) Σ_{i=1}^{n} F_col(i),  F_seg^tex = (1/n) Σ_{i=1}^{n} F_tex(i)    (6)
In formula (6), F_g_d(i), F_g_c(i), F_col(i), F_tex(i) denote the features of the i-th Patch whose center position falls within the superpixel seg, and n denotes the number of Patches whose center positions fall within the superpixel seg.
The superpixel geometric features F_seg^geo are defined according to formula (7) and consist of the components A_seg, P_seg, R_seg, the second-order Hu moments, D_mean, D_sq, D_var, D_miss and N_seg. The components are defined as follows:
Superpixel area A_seg = Σ_{s∈seg} 1, where s ranges over the pixels within the superpixel seg. The superpixel perimeter P_seg, defined by formula (8), is the number of boundary pixels of the superpixel:
P_seg = |B_seg|,  B_seg = { s ∈ seg | ∃ s' ∈ N_4(s), s' ∈ seg', seg' ≠ seg }    (8)
In formula (8), M, N denote the horizontal and vertical resolution of the RGB scene image respectively; seg, seg' denote different superpixels; N_4(s) is the four-neighborhood of pixel s; B_seg is the set of boundary pixels of the superpixel seg.
The area-to-perimeter ratio R_seg of the superpixel is defined by formula (9):
R_seg = A_seg / P_seg    (9)
The second-order (2+0=2 or 0+2=2) Hu moments are computed from the x coordinate s_x of pixel s, the y coordinate s_y, and the product of the x and y coordinates respectively, and are defined by formulas (10), (11), (12):
H_xx = (1/A_seg) Σ_{s∈seg} (s_x/Width)² − (s̄_x)²    (10)
H_yy = (1/A_seg) Σ_{s∈seg} (s_y/Height)² − (s̄_y)²    (11)
H_xy = (1/A_seg) Σ_{s∈seg} (s_x/Width)(s_y/Height) − s̄_x·s̄_y    (12)
In formulas (10)-(12), s̄_x, s̄_y, (s̄_x)², (s̄_y)² denote the mean of the x coordinates, the mean of the y coordinates, the square of the mean of the x coordinates and the square of the mean of the y coordinates of the pixels contained in the superpixel, defined by formula (13):
s̄_x = (1/A_seg) Σ_{s∈seg} s_x / Width,   s̄_y = (1/A_seg) Σ_{s∈seg} s_y / Height    (13)
Width and Height denote the width and height of the image respectively, i.e. the calculation is based on the normalized pixel coordinate values.
D_mean, D_sq and D_var denote, respectively, the mean of the depth values s_d of the pixels s within the superpixel seg, the mean of the squared depth values, and the variance of the depth values, defined by formula (14):
D_mean = (1/A_seg) Σ_{s∈seg} s_d,  D_sq = (1/A_seg) Σ_{s∈seg} (s_d)²,  D_var = D_sq − (D_mean)²    (14)
D_miss is the proportion of pixels within the superpixel that are missing depth information, defined by formula (15):
D_miss = (1/A_seg) Σ_{s∈seg} [s_d is missing]    (15)
N_seg is the modulus of the principal normal vector of the point cloud corresponding to the superpixel, where the principal normal vector of the point cloud corresponding to the superpixel is estimated by Principal Component Analysis (PCA).
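The assembly of the superpixel feature can be sketched as follows: Patch features are averaged over the Patches whose centers fall inside the superpixel (formula (6)), and several of the geometric components of formula (7) (area, perimeter, area/perimeter ratio, depth statistics and the missing-depth ratio) are computed from the superpixel mask. The Hu moments and the PCA normal length are omitted for brevity, depth value 0 is assumed to mark missing depth, and all names are illustrative.

```python
import numpy as np

def aggregate_patch_features(patch_feats, patch_centers, seg_mask):
    """Mean of the Patch features whose centers fall inside the superpixel."""
    inside = [f for f, (y, x) in zip(patch_feats, patch_centers) if seg_mask[y, x]]
    return np.mean(inside, axis=0) if inside else np.zeros_like(patch_feats[0])

def geometric_features(seg_mask, depth):
    """Area, perimeter, area/perimeter ratio and depth statistics."""
    area = int(seg_mask.sum())
    padded = np.pad(seg_mask, 1, constant_values=False)
    all_4_nbrs_inside = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                         padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = int((seg_mask & ~all_4_nbrs_inside).sum())  # boundary pixel count
    d = depth[seg_mask]
    valid = d > 0                     # assumption: depth 0 marks missing values
    d_mean = float(d[valid].mean()) if valid.any() else 0.0
    d_var = float(d[valid].var()) if valid.any() else 0.0
    d_miss = 1.0 - float(valid.mean())
    return np.array([area, perimeter, area / max(perimeter, 1),
                     d_mean, d_var, d_miss])
```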
3. Superpixel group feature extraction
Formula (16) is the general expression of a Gaussian mixture model:
G(F_seg) = Σ_{i=1}^{m} ω_i · g_i(F_seg)    (16)
The Gaussian mixture model G(F_seg) is represented by the weighted sum of several Gaussian components g_i(F_seg), where F_seg is a random variable and the i-th Gaussian component g_i obeys an ordinary Gaussian distribution, i.e. g_i(F_seg) ~ N(F_seg; μ_i, Σ_i). The i-th weight ω_i is calculated automatically by the EM algorithm; μ_i is the expectation of the i-th Gaussian component and is a vector; Σ_i is the variance of the i-th Gaussian component and is a square matrix.
3.1 Example superpixel group and its feature extraction
The K superpixels on an image that are most likely to constitute one example form an example superpixel group. The feature of the k-th superpixel is denoted F_segk, and the feature of the example superpixel group is denoted F = {F_seg1, F_seg2, …, F_segK}. A Gaussian mixture model of the form of formula (16) is built on the superpixel group feature F with the Expectation-Maximization (EM) algorithm, and the example superpixel group feature is then represented by the resulting set of Gaussian components G = {g_1, g_2, …, g_m}.
3.2 Class superpixel group and its feature extraction
Only training samples can be used to construct class superpixel groups. Given all training sample images, the set of superpixel blocks labeled as the j-th class is called a class superpixel group. The j-th class contains P superpixels, and the class feature is denoted F_j = {F_seg1, F_seg2, …, F_segP}. A Gaussian mixture model is likewise built on F_j with the EM algorithm, yielding m_j Gaussian components, and the class superpixel group feature is represented by the set H_j = {h_1^j, h_2^j, …, h_{m_j}^j}. The training samples contain N classes, so the class superpixel group features are represented by the set H_all = {H_1, H_2, …, H_N}, containing N_all = Σ_{j=1}^{N} m_j Gaussian components in total.
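A sketch of this section using the EM-based GaussianMixture estimator of scikit-learn to fit formula (16) to the features of one superpixel group; the number of components m is treated here as a hyperparameter, and the library choice is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_superpixel_group_gmm(F, m=3):
    """Fit a GMM (formula (16)) to the K x d matrix of superpixel features F.

    Returns the weights, means and covariances of the m Gaussian components,
    i.e. the set G = {g_1, ..., g_m} representing the superpixel group
    (K must be at least m for the fit to be meaningful).
    """
    gmm = GaussianMixture(n_components=m, covariance_type='full',
                          reg_covar=1e-6).fit(F)
    return gmm.weights_, gmm.means_, gmm.covariances_
```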
4. Superpixel group feature vectorization
This section vectorizes the Gaussian mixture model features: the Gaussian mixture model, i.e. the set of Gaussian components, is represented as a vector of kernel distances. First the kernel distance between Gaussian components is defined.
4.1 Kernel distance between Gaussian components
1) The distance between two Gaussian components adopts the KLD (Kullback-Leibler Divergence) distance; the distance between Gaussian components g_i and g_j is computed according to formula (17):
KLD(g_i||g_j) = (1/2)[ tr(Σ_j^{-1}Σ_i) + (μ_j − μ_i)^T Σ_j^{-1} (μ_j − μ_i) − d + ln(|Σ_j|/|Σ_i|) ]    (17)
where d is the dimension of the superpixel feature.
2) Substituting formula (17) into formula (18) yields the kernel distance K(g_i, g_j) between the two Gaussian components g_i and g_j:
K(g_i, g_j) = exp{−[KLD(g_i||g_j) + KLD(g_j||g_i)]/2t²}    (18)
In formula (18), t is an empirical parameter, set to 70 in the verification experiments of the invention.
4.2 Example superpixel group feature vectorization
For the i-th Gaussian component g_i of the example superpixel group feature G, the kernel distances between g_i and all Gaussian components of the class superpixel group feature H_all are computed and taken as the feature vector x_i of the example superpixel group, according to formula (19):
x_i = [K(g_i, h_1), K(g_i, h_2), …, K(g_i, h_{N_all})]    (19)
The example superpixel group is then represented by its set of feature vectors {x_1, x_2, …, x_m}.
1) Example superpixel feature vectorization of training samples: example superpixel group features are extracted from all training sample images according to section 3.1, the Gaussian component features of each example superpixel group are vectorized according to formula (19), and the vectorized example superpixel group feature vectors form the training sample example feature set S_example, as in formula (20):
S_example = ∪_{t=1}^{T} X_t    (20)
where X_t is the set of vectorized Gaussian component features of the t-th example superpixel group and T is the number of examples in all training samples.
2) Example superpixel feature vectorization of test samples: all Gaussian component features of an example superpixel group are vectorized according to formula (19) to form the vector set X_test = {x_1^test, x_2^test, …, x_m^test}, where x_n^test denotes the vectorized feature of the n-th Gaussian component.
4.3 Class superpixel group feature vectorization
For each Gaussian component h_k of the class superpixel group feature H_all, the kernel distances between h_k and all Gaussian components of H_all are computed and taken as the feature vector of that Gaussian component, according to formula (21):
x_k = [K(h_k, h_1), K(h_k, h_2), …, K(h_k, h_{N_all})]    (21)
All vectorized class superpixel group features form the training sample class feature set S_class, as in formula (22):
S_class = {x_1, x_2, …, x_{N_all}}    (22)
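The vectorization of this section can be sketched as follows, with each Gaussian component represented as a (mean, covariance) pair; the closed-form KL divergence used for formula (17) is the standard expression for multivariate Gaussians.

```python
import numpy as np

def kl_divergence(mu1, cov1, mu2, cov2):
    """Closed-form KL divergence between two multivariate Gaussians (formula (17))."""
    d = mu1.shape[0]
    inv2 = np.linalg.inv(cov2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(inv2 @ cov1) + diff @ inv2 @ diff - d
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))

def kernel_distance(g_i, g_j, t=70.0):
    """Symmetrized KLD kernel between two Gaussian components (formula (18))."""
    kld = kl_divergence(*g_i, *g_j) + kl_divergence(*g_j, *g_i)
    return np.exp(-kld / (2.0 * t ** 2))

def vectorize_component(g, H_all, t=70.0):
    """Feature vector of one Gaussian component: its kernel distances to every
    component of the class superpixel group feature H_all (formulas (19)/(21))."""
    return np.array([kernel_distance(g, h, t) for h in H_all])
```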
5. Metric learning
After the vector feature representation of the superpixel groups has been obtained, the features are optimized with metric learning: a pulling force reduces the feature distance between samples of the same class, and a repulsive force increases the feature distance between samples of different classes.
For a sample x_i to be optimized, the surrounding positive samples determine a local neighborhood range, i.e. a margin, that minimally encloses all positive samples x_j; these positive samples are called ideal neighbors. Within this range there may also be some negative samples x_l, called impostor neighbors. A linear transformation matrix of the feature space is learned so that the repulsive force pushes the impostor neighbors away from x_i as far as possible, and the pulling force draws the ideal neighbors as close to x_i as possible.
5.1 Learning the optimization matrix L
Formula (23) is the objective function of the metric learning. M is the positive semidefinite Mahalanobis matrix to be optimized; μ = 0.5 balances the weight between the pulling force and the repulsive force; j⇝i indicates that sample j is an ideal neighbor of sample point i; l is an impostor neighbor; ξ_ijl is a non-negative slack term; y_il = 1 when the labels of samples i and l are consistent, otherwise y_il = 0:
min_M Σ_{i,j⇝i} d_M(x_i, x_j) + μ Σ_{i,j⇝i} Σ_l (1 − y_il) ξ_ijl    (23)
subject to:
(1) d_M(x_i, x_l) − d_M(x_i, x_j) ≥ 1 − ξ_ijl
(2) ξ_ijl ≥ 0
(3) M ⪰ 0
where d_M(a, b) = (a − b)^T M (a − b).
Solving formula (23) yields the positive semidefinite matrix M, which is decomposed as M = LL^T; L is the optimization matrix.
The union of the training sample example feature set S_example and the training sample class feature set S_class is taken to obtain S_train = S_example ∪ S_class, and the optimization matrix L is learned based on S_train.
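A compact sketch of the large-margin nearest-neighbor objective of formula (23): the loss pulls every sample toward its ideal (target) neighbors and applies a hinge penalty whenever a differently labeled sample intrudes into the unit margin, and L is obtained by numerical minimization. This is a didactic implementation for small data sets (it relies on finite-difference gradients), not the solver used for the invention.

```python
import numpy as np
from scipy.optimize import minimize

def target_neighbors(X, y, k=3):
    """Indices of the k nearest same-class neighbors of every sample."""
    targets = {}
    for i in range(len(X)):
        same = np.where((y == y[i]) & (np.arange(len(X)) != i))[0]
        d = np.sum((X[same] - X[i]) ** 2, axis=1)
        targets[i] = same[np.argsort(d)[:k]]
    return targets

def lmnn_loss(L_flat, X, y, targets, mu=0.5):
    """Objective of formula (23) written in terms of L (M = L L^T)."""
    d = X.shape[1]
    Xp = X @ L_flat.reshape(d, d)   # rows are (L^T x_i)^T, so row distances equal d_M
    pull, push = 0.0, 0.0
    impostors = {c: np.where(y != c)[0] for c in np.unique(y)}
    for i, js in targets.items():
        for j in js:
            dij = np.sum((Xp[i] - Xp[j]) ** 2)
            pull += dij
            dil = np.sum((Xp[impostors[y[i]]] - Xp[i]) ** 2, axis=1)
            push += np.maximum(0.0, 1.0 + dij - dil).sum()   # hinge slack
    return pull + mu * push

def learn_L(X, y, k=3, mu=0.5):
    d = X.shape[1]
    res = minimize(lmnn_loss, np.eye(d).ravel(),
                   args=(X, y, target_neighbors(X, y, k), mu),
                   method='L-BFGS-B')
    return res.x.reshape(d, d)
```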
5.2 Labeling test samples based on the optimization matrix L
The test example superpixel group to be labeled is represented, as in section 4.2, by the vector set X_test = {x_1^test, x_2^test, …, x_m^test}. The class label of the test sample is calculated according to formula (24):
v_i = label( argmin_{x^class ∈ S_train} ‖L^T(x_i^test − x^class)‖ )    (24)
where x_i^test is a feature vector of the test example and x^class is a training sample feature vector whose class label is class; the x^class at minimum distance from x_i^test is found, its class label is denoted v_i, and v_i is the class of the test example.
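Labeling under the learned matrix L (formula (24)) then reduces to a nearest-neighbor search in the transformed space; combining the per-component labels of a test superpixel group by majority vote is an assumption of this sketch.

```python
import numpy as np

def label_test_component(x_test, train_feats, train_labels, L):
    """Formula (24): nearest training feature vector under the learned metric."""
    diffs = (train_feats - x_test) @ L        # rows are (L^T (x_c - x_test))^T
    idx = np.argmin(np.sum(diffs ** 2, axis=1))
    return train_labels[idx]

def label_test_group(X_test, train_feats, train_labels, L):
    """Label every Gaussian-component vector of a test example superpixel group;
    a majority vote over the components gives a single group label (the vote is
    an assumption, the per-component labels follow formula (24))."""
    votes = [label_test_component(x, train_feats, train_labels, L) for x in X_test]
    return max(set(votes), key=votes.count)
```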
In order to ensure that the example superpixel groups are as close as possible to true examples, the ground truth is used to delineate the example superpixel groups in the verification experiment, so as to obtain the upper limit of the model. Meanwhile, the typical SLIC superpixel segmentation algorithm is selected to divide each picture into 30 regions, and each region is treated as one superpixel group, so as to test the general performance of the algorithm. The specific process of obtaining superpixel groups from the regions and the superpixels is as follows:
Input: the set of regions obtained by segmenting an image with the SLIC algorithm is denoted R = {r_1, …, r_30}, and the i-th segmented region is denoted r_i; the set of superpixels obtained from the same image with the gPb/UCM algorithm is denoted S = {sp_1, …, sp_J}, and the j-th superpixel is denoted sp_j.
Output: a mapping table map, where map(j) = i denotes that the category label of the j-th superpixel sp_j is i.
The procedure constructs the mapping table map from these two segmentations.
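The mapping table can be built, for example, by assigning every gPb/UCM superpixel to the SLIC region with which it overlaps most; this overlap rule is an assumption, since the pseudo-code itself is not reproduced here.

```python
import numpy as np

def build_map(slic_labels, sp_labels):
    """map[j] = i: assign gPb/UCM superpixel j to the SLIC region i
    with which it shares the largest number of pixels."""
    mapping = {}
    for j in np.unique(sp_labels):
        if j == 0:                      # 0 marks boundary pixels, skip it
            continue
        regions, counts = np.unique(slic_labels[sp_labels == j],
                                    return_counts=True)
        mapping[int(j)] = int(regions[np.argmax(counts)])
    return mapping
```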
The result (GT) represents the example superpixel sets of the test samples built using the ground truth; its recognition result is the theoretical upper limit of the proposed model. The result (SLIC) represents the example superpixel sets built using SLIC; its recognition result is relative to a particular superpixel segmentation (here SLIC). The experimental results listed in Table 1 show that the accuracy of the proposed algorithm reaches 82.1% when the examples are accurate, and 52.2% when the examples are determined from the SLIC segmentation result; that is, whether the superpixel groups determined from the superpixel segmentation result are accurate examples has a great influence on the proposed model.
TABLE 1
Example superpixel group construction    Accuracy
Ground truth (GT)                        82.1%
SLIC segmentation                        52.2%
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications, equivalent variations and modifications made to the above embodiment according to the technical spirit of the present invention still belong to the protection scope of the technical solution of the present invention.

Claims (8)

1. A semantic annotation method for an indoor scene RGB-D image, characterized by comprising the following steps:
(1) performing superpixel segmentation on the RGB-D indoor scene image with the gPb/UCM algorithm;
(2) superpixel feature extraction: performing Patch feature calculation and superpixel feature representation;
(3) superpixel group feature extraction: extracting example superpixel groups and their features, and class superpixel groups and their features;
(4) superpixel group feature vectorization: defining the kernel distance between Gaussian components, performing example superpixel group feature vectorization and class superpixel group feature vectorization;
(5) metric learning: learning the optimization matrix L and labeling test samples based on the optimization matrix L.
2. The semantic annotation method for an indoor scene RGB-D image according to claim 1, characterized in that: in step (1) the superpixel segmentation uses the gPb/UCM algorithm, which computes from the local and global features of the image the probability value P_b(x, y) that a pixel belongs to a boundary; the gPb/UCM algorithm is applied to the color image and to the depth image separately, yielding P_rgb(x, y), the probability value computed from the color image that a pixel belongs to a boundary, and P_depth(x, y), the probability value computed from the depth image that a pixel belongs to a boundary, which are combined according to formula (1); different probability thresholds tr are set for the probability values obtained by formula (1) to obtain a multi-level segmentation result.
3. The semantic annotation method for an indoor scene RGB-D image according to claim 2, characterized in that: the probability thresholds tr are 0.06 and 0.08, and pixels whose probability values are smaller than the set threshold are connected into regions according to eight-connectivity, each region being one superpixel.
4. The semantic annotation method for an indoor scene RGB-D image according to claim 3, characterized in that: in step (2), a Patch is defined as a 16 × 16 grid; with a step of n pixels, where n = 2, the grid is slid from the upper-left corner of the color image RGB and of the depth image Depth to the right and downward, so that a dense grid finally covers the color image RGB and the depth image Depth, and four types of features are calculated for each Patch: depth gradient features, color gradient features, color features and texture features.
5. The semantic annotation method for an indoor scene RGB-D image according to claim 4, characterized in that: the superpixel feature F_seg in step (2) is defined by formula (5) and consists of the aggregated Patch features and the superpixel geometric features:
F_seg = [F_seg^gd, F_seg^gc, F_seg^col, F_seg^tex, F_seg^geo]    (5)
F_seg^gd, F_seg^gc, F_seg^col and F_seg^tex denote the superpixel depth gradient feature, the superpixel color gradient feature, the superpixel color feature and the superpixel texture feature respectively, and are defined by formula (6):
F_seg^gd = (1/n) Σ_{i=1}^{n} F_g_d(i),  F_seg^gc = (1/n) Σ_{i=1}^{n} F_g_c(i),  F_seg^col = (1/n) Σ_{i=1}^{n} F_col(i),  F_seg^tex = (1/n) Σ_{i=1}^{n} F_tex(i)    (6)
where F_g_d(i), F_g_c(i), F_col(i), F_tex(i) denote the features of the i-th Patch whose center position falls within the superpixel seg, and n denotes the number of Patches whose center positions fall within the superpixel seg;
the superpixel geometric features F_seg^geo are defined according to formula (7) and consist of the components A_seg, P_seg, R_seg, the second-order Hu moments, D_mean, D_sq, D_var, D_miss and N_seg, which are defined as follows:
superpixel area A_seg = Σ_{s∈seg} 1, where s ranges over the pixels within the superpixel seg; the superpixel perimeter P_seg, defined by formula (8), is the number of boundary pixels of the superpixel:
P_seg = |B_seg|,  B_seg = { s ∈ seg | ∃ s' ∈ N_4(s), s' ∈ seg', seg' ≠ seg }    (8)
where M, N denote the horizontal and vertical resolution of the RGB scene image respectively; seg, seg' denote different superpixels; N_4(s) is the four-neighborhood of pixel s; B_seg is the set of boundary pixels of the superpixel seg;
the area-to-perimeter ratio R_seg of the superpixel is defined by formula (9):
R_seg = A_seg / P_seg    (9)
the second-order (2+0=2 or 0+2=2) Hu moments are computed from the x coordinate s_x of pixel s, the y coordinate s_y, and the product of the x and y coordinates respectively, and are defined by formulas (10), (11), (12):
H_xx = (1/A_seg) Σ_{s∈seg} (s_x/Width)² − (s̄_x)²    (10)
H_yy = (1/A_seg) Σ_{s∈seg} (s_y/Height)² − (s̄_y)²    (11)
H_xy = (1/A_seg) Σ_{s∈seg} (s_x/Width)(s_y/Height) − s̄_x·s̄_y    (12)
where s̄_x, s̄_y, (s̄_x)², (s̄_y)² denote the mean of the x coordinates, the mean of the y coordinates, the square of the mean of the x coordinates and the square of the mean of the y coordinates of the pixels contained in the superpixel, defined by formula (13):
s̄_x = (1/A_seg) Σ_{s∈seg} s_x / Width,   s̄_y = (1/A_seg) Σ_{s∈seg} s_y / Height    (13)
Width and Height denote the width and height of the image respectively, i.e. the calculation is based on the normalized pixel coordinate values;
D_mean, D_sq and D_var denote, respectively, the mean of the depth values s_d of the pixels s within the superpixel seg, the mean of the squared depth values, and the variance of the depth values, defined by formula (14):
D_mean = (1/A_seg) Σ_{s∈seg} s_d,  D_sq = (1/A_seg) Σ_{s∈seg} (s_d)²,  D_var = D_sq − (D_mean)²    (14)
D_miss is the proportion of pixels within the superpixel that are missing depth information, defined by formula (15):
D_miss = (1/A_seg) Σ_{s∈seg} [s_d is missing]    (15)
N_seg is the modulus of the principal normal vector of the point cloud corresponding to the superpixel, where the principal normal vector of the point cloud corresponding to the superpixel is estimated by Principal Component Analysis (PCA).
6. The semantic annotation method for an indoor scene RGB-D image according to claim 5, characterized in that: the example superpixel group and its feature extraction performed in step (3) are as follows:
the K superpixels assumed to constitute one example on an image form an example superpixel group; the feature of the k-th superpixel is denoted F_segk, and the feature of the example superpixel group is denoted F = {F_seg1, F_seg2, …, F_segK}; a Gaussian mixture model of the form of formula (16) is built on the superpixel group feature F with the Expectation-Maximization algorithm EM, and the example superpixel group feature is represented by the set of Gaussian components G = {g_1, g_2, …, g_m}:
G(F_seg) = Σ_{i=1}^{m} ω_i · g_i(F_seg)    (16)
the Gaussian mixture model G(F_seg) is represented by the weighted sum of several Gaussian components g_i(F_seg), where F_seg is a random variable and the i-th Gaussian component g_i obeys an ordinary Gaussian distribution, g_i(F_seg) ~ N(F_seg; μ_i, Σ_i); the i-th weight ω_i is obtained by the Expectation-Maximization algorithm EM; μ_i is the expectation of the i-th Gaussian component and is a vector; Σ_i is the variance of the i-th Gaussian component and is a square matrix;
the class superpixel group and its feature extraction in step (3) are as follows:
only training samples can be used to construct class superpixel groups; given all training sample images, the set of superpixel blocks labeled as the j-th class is called a class superpixel group; the j-th class contains P superpixels, and the class feature is denoted F_j = {F_seg1, F_seg2, …, F_segP}; a Gaussian mixture model is likewise built on F_j with the EM algorithm, yielding m_j Gaussian components, and the class superpixel group feature is represented by the set H_j = {h_1^j, h_2^j, …, h_{m_j}^j}; the training samples contain N classes, and the class superpixel group features are represented by the set H_all = {H_1, H_2, …, H_N}, containing N_all = Σ_{j=1}^{N} m_j Gaussian components in total.
7. The semantic annotation method for an indoor scene RGB-D image according to claim 6, characterized in that: the kernel distance between Gaussian components defined in step (4) is as follows: the distance between two Gaussian components adopts the Kullback-Leibler Divergence distance, and the distance between Gaussian components g_i and g_j is computed according to formula (17):
KLD(g_i||g_j) = (1/2)[ tr(Σ_j^{-1}Σ_i) + (μ_j − μ_i)^T Σ_j^{-1} (μ_j − μ_i) − d + ln(|Σ_j|/|Σ_i|) ]    (17)
where d is the dimension of the superpixel feature; substituting formula (17) into formula (18) yields the kernel distance K(g_i, g_j) between the two Gaussian components g_i and g_j:
K(g_i, g_j) = exp{−[KLD(g_i||g_j) + KLD(g_j||g_i)]/2t²}    (18)
where t is an empirical parameter, set to 70 in the verification experiments;
the example superpixel group feature vectorization is performed as follows:
for the i-th Gaussian component g_i of the example superpixel group feature G, the kernel distances between g_i and all Gaussian components of the class superpixel group feature H_all are computed and taken as the feature vector x_i of the example superpixel group, according to formula (19):
x_i = [K(g_i, h_1), K(g_i, h_2), …, K(g_i, h_{N_all})]    (19)
the example superpixel group is represented by its set of feature vectors {x_1, x_2, …, x_m};
1) example superpixel feature vectorization of training samples: example superpixel group features are extracted from all training sample images according to step (3), the Gaussian component features of each example superpixel group are vectorized according to formula (19), and the vectorized example superpixel group feature vectors form the training sample example feature set S_example, formula (20):
S_example = ∪_{t=1}^{T} X_t    (20)
where X_t is the set of vectorized Gaussian component features of the t-th example superpixel group and T is the number of examples in all training samples;
2) example superpixel feature vectorization of test samples: all Gaussian component features of an example superpixel group are vectorized according to formula (19) to form the vector set X_test = {x_1^test, x_2^test, …, x_m^test}, where x_n^test denotes the vectorized feature of the n-th Gaussian component;
the class superpixel group feature vectorization is performed as follows:
for each Gaussian component h_k of the class superpixel group feature H_all, the kernel distances between h_k and all Gaussian components of H_all are computed and taken as the feature vector of that Gaussian component, according to formula (21):
x_k = [K(h_k, h_1), K(h_k, h_2), …, K(h_k, h_{N_all})]    (21)
all vectorized class superpixel group features form the training sample class feature set S_class, formula (22):
S_class = {x_1, x_2, …, x_{N_all}}    (22)
8. The semantic annotation method for an indoor scene RGB-D image according to claim 7, characterized in that: the learning of the optimization matrix L in step (5) is as follows:
formula (23) is the objective function of the metric learning, where M is the positive semidefinite Mahalanobis matrix to be optimized; M is optimized using sample points i, whose corresponding feature vector is x_i; j⇝i indicates that sample j is an ideal neighbor of sample point i, and the feature vector of sample j is denoted x_j; l is a sample lying too close to sample i, with feature vector x_l; ξ_ijl is a non-negative slack term; μ = 0.5 balances the weight between the pulling force and the repulsive force; y_il = 1 when the labels of samples i and l are consistent, otherwise y_il = 0:
min_M Σ_{i,j⇝i} d_M(x_i, x_j) + μ Σ_{i,j⇝i} Σ_l (1 − y_il) ξ_ijl    (23)
subject to:
(1) d_M(x_i, x_l) − d_M(x_i, x_j) ≥ 1 − ξ_ijl
(2) ξ_ijl ≥ 0
(3) M ⪰ 0
where d_M(a, b) = (a − b)^T M (a − b);
solving formula (23) yields the positive semidefinite matrix M, which is decomposed as M = LL^T, L being the optimization matrix; the union of the training sample example feature set S_example and the training sample class feature set S_class is taken to obtain S_train = S_example ∪ S_class, and the optimization matrix L is learned based on S_train;
the labeling of test samples based on the optimization matrix L is as follows:
the test example superpixel group to be labeled is represented by the vector set X_test = {x_1^test, x_2^test, …, x_m^test}, and the class label of the test sample is calculated according to formula (24):
v_i = label( argmin_{x^class ∈ S_train} ‖L^T(x_i^test − x^class)‖ )    (24)
where x_i^test is a feature vector of the test example and x^class is a training sample feature vector whose class label is class; the x^class at minimum distance from x_i^test is found, its class label is denoted v_i, and v_i is the class of the test example.
CN201910886599.6A 2019-09-19 2019-09-19 Semantic annotation method for indoor scene RGB-D image Active CN110751153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910886599.6A CN110751153B (en) 2019-09-19 2019-09-19 Semantic annotation method for indoor scene RGB-D image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910886599.6A CN110751153B (en) 2019-09-19 2019-09-19 Semantic annotation method for indoor scene RGB-D image

Publications (2)

Publication Number Publication Date
CN110751153A true CN110751153A (en) 2020-02-04
CN110751153B CN110751153B (en) 2023-08-01

Family

ID=69276776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910886599.6A Active CN110751153B (en) 2019-09-19 2019-09-19 Semantic annotation method for indoor scene RGB-D image

Country Status (1)

Country Link
CN (1) CN110751153B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116156517A (en) * 2023-03-16 2023-05-23 华能伊敏煤电有限责任公司 RIS deployment method under indoor scene

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014205231A1 (en) * 2013-06-19 2014-12-24 The Regents Of The University Of Michigan Deep learning framework for generic object detection
US20150030255A1 (en) * 2013-07-25 2015-01-29 Canon Kabushiki Kaisha Method and apparatus for classifying pixels in an input image and image processing system
CN107103326A (en) * 2017-04-26 2017-08-29 苏州大学 The collaboration conspicuousness detection method clustered based on super-pixel
CN107944428A (en) * 2017-12-15 2018-04-20 北京工业大学 A kind of indoor scene semanteme marking method based on super-pixel collection
CN109829449A (en) * 2019-03-08 2019-05-31 北京工业大学 A kind of RGB-D indoor scene mask method based on super-pixel space-time context

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014205231A1 (en) * 2013-06-19 2014-12-24 The Regents Of The University Of Michigan Deep learning framework for generic object detection
US20150030255A1 (en) * 2013-07-25 2015-01-29 Canon Kabushiki Kaisha Method and apparatus for classifying pixels in an input image and image processing system
CN107103326A (en) * 2017-04-26 2017-08-29 苏州大学 The collaboration conspicuousness detection method clustered based on super-pixel
CN107944428A (en) * 2017-12-15 2018-04-20 北京工业大学 A kind of indoor scene semanteme marking method based on super-pixel collection
CN109829449A (en) * 2019-03-08 2019-05-31 北京工业大学 A kind of RGB-D indoor scene mask method based on super-pixel space-time context

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KILIAN Q. WEINBERGER等: "Distance Metric Learning for Large Margin Nearest Neighbor Classification", 《THE JOURNAL OF MACHINE LEARNING RESEARCH》 *
WEN WANG等: "Discriminant analysis on Riemannian manifold of Gaussian distributions for face recognition with image sets", 《2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116156517A (en) * 2023-03-16 2023-05-23 华能伊敏煤电有限责任公司 RIS deployment method under indoor scene

Also Published As

Publication number Publication date
CN110751153B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
Sankaranarayanan et al. Learning from synthetic data: Addressing domain shift for semantic segmentation
CN109829449B (en) RGB-D indoor scene labeling method based on super-pixel space-time context
CN111259936B (en) Image semantic segmentation method and system based on single pixel annotation
CN106920243A (en) The ceramic material part method for sequence image segmentation of improved full convolutional neural networks
US9449253B2 (en) Learning painting styles for painterly rendering
CN103886619B (en) A kind of method for tracking target merging multiple dimensioned super-pixel
CN107944428B (en) Indoor scene semantic annotation method based on super-pixel set
CN112837344B (en) Target tracking method for generating twin network based on condition countermeasure
CN112862792B (en) Wheat powdery mildew spore segmentation method for small sample image dataset
CN111368759B (en) Monocular vision-based mobile robot semantic map construction system
CN109376787B (en) Manifold learning network and computer vision image set classification method based on manifold learning network
CN110827304B (en) Traditional Chinese medicine tongue image positioning method and system based on deep convolution network and level set method
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
CN108537168A (en) Human facial expression recognition method based on transfer learning technology
CN111311702B (en) Image generation and identification module and method based on BlockGAN
Han et al. Weakly-supervised learning of category-specific 3D object shapes
CN103714556A (en) Moving target tracking method based on pyramid appearance model
CN107657276B (en) Weak supervision semantic segmentation method based on searching semantic class clusters
CN103593639A (en) Lip detection and tracking method and device
CN110135435B (en) Saliency detection method and device based on breadth learning system
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN112329830B (en) Passive positioning track data identification method and system based on convolutional neural network and transfer learning
CN110751153B (en) Semantic annotation method for indoor scene RGB-D image
CN116109656A (en) Interactive image segmentation method based on unsupervised learning
Li et al. Few-shot meta-learning on point cloud for semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant