CN112784902A - Two-mode clustering method with missing data - Google Patents

Two-mode clustering method with missing data

Info

Publication number
CN112784902A
CN112784902A (application CN202110095029.2A)
Authority
CN
China
Prior art keywords
modal
data
modality
encoder
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110095029.2A
Other languages
Chinese (zh)
Other versions
CN112784902B (en)
Inventor
彭玺
林义杰
杨谋星
李云帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202110095029.2A
Publication of CN112784902A
Application granted
Publication of CN112784902B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a two-modal clustering method with missing data. Based on autoencoders, the method learns a modality-specific representation of each modality's data through an intra-modal reconstruction loss, learns modality-consistent representations through a cross-modal contrastive learning loss, and, through a cross-modal dual prediction loss, recovers the information of missing modalities while discarding cross-modal inconsistency, thereby further improving consistency. Data recovery and consistency learning are handled in a unified framework, yielding a better clustering effect.

Description

Two-mode clustering method with missing data
Technical Field
The invention relates to the field of big data analysis, in particular to a two-modal clustering method with missing data.
Background
At present, multi-modal data clustering technology is widely applied in various fields. In commodity recommendation, massive product images are combined with their text attributes to learn semantic feature representations of the images, improving how well recommendations match user needs; in multi-turn dialogue with intelligent customer service, multi-modal clustering of vision and language enables automatic responses with text, pictures, or video. The success of these multi-modal techniques mainly benefits from consistency learning on multi-modal data, that is, exploring and exploiting the inherent correlations and invariance of data across different modalities. However, consistency learning presupposes the completeness of the multi-modal data: every sample must cover all modalities, with no missing modality data. In practice, due to the complexity of data acquisition environments, modality missingness is common. For example, in an online conference, some video frames may lose the visual or auditory signal due to sensor damage; in medical diagnosis, patients often undergo only some of the possible examinations rather than all of them, and how to diagnose the etiology from partial examination information is the essential problem of multi-modal clustering with missing data. With current technology, clustering real multi-modal data requires completing the data in advance to guarantee the completeness of the objects to be clustered. Existing completion methods mainly exploit the similarity among samples rather than recovering the missing data samples themselves, such as matrix-factorization-based doubly aligned incomplete multi-modal clustering (DAIMC), partial multi-modal clustering (PVC), and incomplete multi-modal visual data grouping (IMG).
Incomplete multi-modal data clustering methods can be roughly divided into two categories. One is based on shallow models; for example, the DAIMC method proposed by Menglei Hu et al. models the high-order correlation among modalities through low-rank matrix factorization and, combined with relevant prior information, effectively exploits the consistent information among modalities to achieve multi-modal subspace learning. The other is based on deep learning; for example, the DM2C method proposed by Yangbangyan Jiang et al. first obtains a modality-specific representation of each modality with an autoencoder, then uses cycle-consistent generative adversarial networks (CycleGANs) to generate the missing modality data from the complete modality data, and concatenates the modality-specific representations of each modality to obtain a common representation.
Moreover, almost all existing methods treat data recovery and consistency learning as two separate problems or steps, lacking a unified theoretical understanding; examples include deep mixed-modal clustering (DM2C) and adversarial incomplete multi-modal clustering (AIMC), both based on generative adversarial networks. Therefore, under conditions of missing modality data, research on a clustering technique that unifies data completion and consistency learning has high application prospects and practical value.
Disclosure of Invention
Aiming at the above defects in the prior art, the two-modal clustering method with missing data provided by the invention solves the problem that data recovery and consistency learning are not handled in a unified way in the prior art.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
A two-modal clustering method with missing data is provided, comprising the following steps:
S1, feeding the two modality data of each sample in which both modalities are present into the corresponding autoencoders to obtain the corresponding hidden representations;
S2, obtaining the corresponding cross-modal contrastive learning loss and intra-modal reconstruction loss from the hidden representations corresponding to the two modality data;
S3, back-propagating through the current autoencoders according to the cross-modal contrastive learning loss and the intra-modal reconstruction loss to update the parameters and weights of the current autoencoders;
S4, judging whether the number of back-propagation iterations reaches a threshold; if so, proceeding to step S5, otherwise returning to step S1;
S5, obtaining the corresponding cross-modal contrastive learning loss, intra-modal reconstruction loss, and cross-modal dual prediction loss from the current latest hidden representations corresponding to the two modality data;
S6, back-propagating through the current autoencoders according to the current latest cross-modal contrastive learning loss, cross-modal dual prediction loss, and intra-modal reconstruction loss to update the parameters and weights of the current autoencoders;
S7, judging whether the current autoencoders have converged; if so, proceeding to step S8, otherwise returning to step S5;
S8, feeding the set of samples in which both modalities are present, the samples with only the first modality, and the samples with only the second modality, as a two-modal data set with missing data, into the current latest autoencoders to obtain the hidden representations corresponding to the two-modal data set with missing data;
S9, obtaining, based on the dual mapping, the representation of the missing modality corresponding to the hidden representation of the sample set with only the first modality and the representation of the missing modality corresponding to the hidden representation of the sample set with only the second modality in the two-modal data set;
S10, concatenating the different modality representations corresponding to each sample as its common representation, and clustering the common representations to complete the two-modal clustering with missing data.
Further, the autoencoder in step S1 includes an encoder and a decoder. The encoder includes, connected in sequence, a first fully-connected layer, a first batch normalization layer, a first activation function, a second fully-connected layer, a second batch normalization layer, a second activation function, a third fully-connected layer, a third batch normalization layer, a third activation function, a fourth fully-connected layer, and a fourth activation function. The input dimension of the first fully-connected layer is the dimension of the input modality data; the output dimensions of the first, second, and third fully-connected layers are all 1024; the first, second, and third activation functions are all ReLU; the output dimension of the fourth fully-connected layer is 128, and the fourth activation function is Softmax.
The decoder includes, connected in sequence, a fifth fully-connected layer, a fourth batch normalization layer, a fifth activation function, a sixth fully-connected layer, a fifth batch normalization layer, a sixth activation function, a seventh fully-connected layer, a sixth batch normalization layer, a seventh activation function, an eighth fully-connected layer, a seventh batch normalization layer, and an eighth activation function. The input dimension of the fifth fully-connected layer is 128; the output dimensions of the fifth, sixth, and seventh fully-connected layers are all 1024; the fifth, sixth, seventh, and eighth activation functions are all ReLU; and the output dimension of the eighth fully-connected layer is the dimension of the input modality data.
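For illustration only, a minimal PyTorch sketch of one such autoencoder is given below; the layer sizes follow the description above, while the class and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encoder f: three FC+BatchNorm+ReLU blocks, then FC(1024 -> 128) with Softmax,
    so the hidden representation can be read as a soft cluster assignment."""
    def __init__(self, input_dim: int, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, latent_dim), nn.Softmax(dim=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Decoder(nn.Module):
    """Decoder g: mirrors the encoder and maps the 128-d hidden representation
    back to the dimension of the input modality data."""
    def __init__(self, output_dim: int, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, output_dim), nn.BatchNorm1d(output_dim), nn.ReLU(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)
```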
Further, the specific method for obtaining the corresponding cross-modal contrastive learning loss from the hidden representations corresponding to the two modality data in step S2 is: according to the formula

$$\ell_{cl} = -\frac{1}{m}\sum_{t=1}^{m}\left[ I\left(z_t^{(1)}, z_t^{(2)}\right) + \alpha\left( H\left(z_t^{(1)}\right) + H\left(z_t^{(2)}\right) \right) \right]$$

obtaining the cross-modal contrastive learning loss $\ell_{cl}$; where m is the total number of samples in which both modalities are present, t denotes the t-th sample, $I(\cdot,\cdot)$ denotes mutual information, $z_t^{(1)}$ is the hidden representation corresponding to the first modality data of the t-th sample in which both modalities are present, $z_t^{(2)}$ is the hidden representation corresponding to the second modality data of the t-th sample in which both modalities are present, $H(\cdot)$ denotes information entropy, and α is the balance parameter of the entropy.
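Because the encoder ends in a Softmax, each hidden representation is a distribution over D = 128 over-clusters, so the mutual information and entropy terms can be evaluated from an estimated joint assignment matrix (this discrete form is made precise in the detailed description below). A minimal sketch, assuming paired representations z1 and z2 of shape (m, D); the symmetrization step is an assumption borrowed from IIC-style objectives, not stated in the original text:

```python
import torch

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                     alpha: float = 10.0, eps: float = 1e-8) -> torch.Tensor:
    """Cross-modal contrastive loss: negative mutual information minus
    alpha-weighted entropies, computed from the joint assignment matrix P.
    alpha = 10 follows the value fixed in the detailed description."""
    m, D = z1.shape
    P = (z1.T @ z2) / m                    # D x D joint distribution over cluster pairs
    P = ((P + P.T) / 2).clamp_min(eps)     # symmetrize (assumption) and guard the log
    P = P / P.sum()
    Pd = P.sum(dim=1)                      # marginal P(z = d), row sums
    Pd_prime = P.sum(dim=0)                # marginal P(z' = d'), column sums
    mi = (P * (P.log() - Pd.log()[:, None] - Pd_prime.log()[None, :])).sum()
    h1 = -(Pd * Pd.log()).sum()
    h2 = -(Pd_prime * Pd_prime.log()).sum()
    return -(mi + alpha * (h1 + h2))
```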
Further, the specific method for obtaining the corresponding intra-modal reconstruction loss from the hidden representations corresponding to the two modality data in step S2 is: according to the formula

$$\ell_{rec} = \frac{1}{m}\sum_{t=1}^{m}\sum_{v=1}^{2}\left\| x_t^{(v)} - g^{(v)}\!\left(f^{(v)}\!\left(x_t^{(v)}\right)\right) \right\|_2^2$$

obtaining the intra-modal reconstruction loss $\ell_{rec}$; where m is the total number of samples in which both modalities are present, t denotes the t-th sample, $x_t^{(v)}$ denotes the v-th modality data of the t-th sample, $f^{(v)}(\cdot)$ and $g^{(v)}(\cdot)$ denote the encoder and decoder currently corresponding to the v-th modality data, respectively, and $\|\cdot\|_2$ is the norm.
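A corresponding sketch, reusing the Encoder/Decoder modules sketched above and taking mean squared error for the squared norm:

```python
import torch.nn.functional as F

def reconstruction_loss(xs, encoders, decoders):
    """Intra-modal reconstruction loss summed over both modalities.
    xs: list of two tensors of shape (m, d_v) holding the complete samples."""
    loss = 0.0
    for x, f, g in zip(xs, encoders, decoders):
        loss = loss + F.mse_loss(g(f(x)), x)  # || x - g(f(x)) ||^2, averaged over the batch
    return loss
```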
Further, the specific method of step S3 is:
taking the result of $\ell_{cl} + 0.1\,\ell_{rec}$ as the current loss, back-propagating through the current autoencoders, and updating the parameters and weights of the current autoencoders; where $\ell_{cl}$ is the cross-modal contrastive learning loss and $\ell_{rec}$ is the intra-modal reconstruction loss.
Further, in step S5, the specific method for obtaining the corresponding cross-modal dual prediction loss from the current latest hidden representations corresponding to the two modality data is: according to the formula

$$\ell_{pre} = \left\| G^{(1)}\!\left(Z^{1}\right) - Z^{2} \right\|_2^2 + \left\| G^{(2)}\!\left(Z^{2}\right) - Z^{1} \right\|_2^2$$

obtaining the cross-modal dual prediction loss $\ell_{pre}$; where $Z^1$ is the set of hidden representations corresponding to all first modality data of the samples in which both modalities are present, $Z^2$ is the set of hidden representations corresponding to all second modality data of the samples in which both modalities are present, $G^{(1)}(Z^1)$ maps $Z^1$, $G^{(2)}(Z^2)$ maps $Z^2$, $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ form a dual mapping, and $\|\cdot\|_2$ is the norm.
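A minimal sketch of this loss; G1 and G2 stand for the dual mapping networks whose 6-layer structure is given in the detailed description:

```python
def dual_prediction_loss(z1, z2, G1, G2):
    """Cross-modal dual prediction: each modality's representation
    must predict the other's through the dual maps."""
    return F.mse_loss(G1(z1), z2) + F.mse_loss(G2(z2), z1)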
Further, the specific method of step S6 is:
taking the result of $\ell_{cl} + 0.1\,\ell_{pre} + 0.1\,\ell_{rec}$ as the current loss, back-propagating through the current autoencoders, and updating the parameters and weights of the current autoencoders; where $\ell_{cl}$ is the cross-modal contrastive learning loss, $\ell_{pre}$ is the cross-modal dual prediction loss, and $\ell_{rec}$ is the intra-modal reconstruction loss.
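The two training phases (steps S3-S4 warm up without the dual prediction term; steps S5-S7 add it) could be organized as below, reusing the loss functions sketched above. The optimizer choice, learning rate, and convergence test are illustrative assumptions; the 100-epoch warm-up threshold is taken from the detailed description:

```python
import itertools
import torch

def train(x1, x2, enc, dec, G1, G2,
          warmup_epochs: int = 100, max_epochs: int = 500, tol: float = 1e-6):
    """enc, dec: lists holding the two modality-specific encoders/decoders."""
    params = itertools.chain(enc[0].parameters(), enc[1].parameters(),
                             dec[0].parameters(), dec[1].parameters(),
                             G1.parameters(), G2.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)      # optimizer choice is an assumption
    prev = float("inf")
    for epoch in range(max_epochs):
        z1, z2 = enc[0](x1), enc[1](x2)          # step S1: hidden representations
        loss = contrastive_loss(z1, z2) + 0.1 * reconstruction_loss([x1, x2], enc, dec)
        if epoch >= warmup_epochs:               # steps S5-S6: add the dual prediction term
            loss = loss + 0.1 * dual_prediction_loss(z1, z2, G1, G2)
        opt.zero_grad()
        loss.backward()                          # steps S3/S6: back-propagation
        opt.step()
        if epoch >= warmup_epochs and abs(prev - loss.item()) < tol:
            break                                # step S7: crude convergence test
        prev = loss.item()
    return enc, dec, G1, G2
```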
Further, the specific method of step S8 is: according to the formulas

$$Z^{1} = f^{(1)}\!\left(X_{1}\right),\qquad Z^{2} = f^{(2)}\!\left(X_{2}\right)$$
$$Z^{(1)} = f^{(1)}\!\left(X^{(1)}\right),\qquad Z^{(2)} = f^{(2)}\!\left(X^{(2)}\right)$$

obtaining the hidden representations corresponding to the two-modal data set with missing data, including the hidden representation $Z^{1}$ corresponding to the sample set $X_{1}$ of first modality data of the samples in which both modalities are present, the hidden representation $Z^{2}$ corresponding to the sample set $X_{2}$ of second modality data of the samples in which both modalities are present, the hidden representation $Z^{(1)}$ corresponding to the sample set $X^{(1)}$ in which only the first modality is present, and the hidden representation $Z^{(2)}$ corresponding to the sample set $X^{(2)}$ in which only the second modality is present; where $f^{(1)}(\cdot)$ denotes the encoder of the latest autoencoder corresponding to the first modality data and $f^{(2)}(\cdot)$ denotes the encoder of the latest autoencoder corresponding to the second modality data.
Further, the specific method of step S9 is: according to the formulas

$$\hat{Z}^{(2)} = G^{(1)}\!\left(Z^{(1)}\right),\qquad \hat{Z}^{(1)} = G^{(2)}\!\left(Z^{(2)}\right)$$

obtaining, respectively, the representation $\hat{Z}^{(2)}$ of the missing modality corresponding to the hidden representation $Z^{(1)}$ of the sample set in which only the first modality is present, and the representation $\hat{Z}^{(1)}$ of the missing modality corresponding to the hidden representation $Z^{(2)}$ of the sample set in which only the second modality is present; where $G^{(1)}(\cdot)$ denotes the mapping corresponding to the first modality, $G^{(2)}(\cdot)$ denotes the mapping corresponding to the second modality, and $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ form a dual mapping.
Further, the specific method of concatenating the different modality representations corresponding to each sample and using the result as its common representation in step S10 is:
taking $\left[Z^{1}; Z^{2}\right]$ as the common representation of the samples in which both modalities are present; taking $\left[Z^{(1)}; \hat{Z}^{(2)}\right]$ as the common representation of the samples in which only the first modality is present; and taking $\left[\hat{Z}^{(1)}; Z^{(2)}\right]$ as the common representation of the samples in which only the second modality is present.
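Steps S8-S10 then amount to encoding every subset, imputing the missing side through the dual maps, concatenating, and running an off-the-shelf clusterer. A sketch under those assumptions (the detailed description names k-means explicitly; scikit-learn's implementation is one choice):

```python
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def cluster_with_missing(x1_full, x2_full, x1_only, x2_only, enc, G1, G2, n_clusters):
    """Modules are assumed to be trained and switched to eval mode."""
    # Step S8: hidden representations for complete and modality-missing samples
    z1, z2 = enc[0](x1_full), enc[1](x2_full)
    z1_only, z2_only = enc[0](x1_only), enc[1](x2_only)
    # Step S9: predict each missing modality's representation via the dual mapping
    z2_hat = G1(z1_only)   # representation of the missing second modality
    z1_hat = G2(z2_only)   # representation of the missing first modality
    # Step S10: concatenate into common representations and cluster
    common = torch.cat([
        torch.cat([z1, z2], dim=1),
        torch.cat([z1_only, z2_hat], dim=1),
        torch.cat([z1_hat, z2_only], dim=1),
    ], dim=0)
    return KMeans(n_clusters=n_clusters).fit_predict(common.cpu().numpy())
```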
The invention has the following beneficial effects: based on autoencoders, the method learns a modality-specific representation of each modality's data through the intra-modal reconstruction loss, learns modality-consistent representations through the cross-modal contrastive learning loss, and, through the cross-modal dual prediction loss, recovers the information of missing modalities while discarding cross-modal inconsistency, thereby further improving consistency; data recovery and consistency learning are handled in a unified framework, yielding a better clustering effect.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a block diagram of a model of the present invention;
FIG. 3 is a comparison of accuracy as the missing rate varies from 0 to 0.8 in example 1;
FIG. 4 is a comparison of normalized mutual information as the missing rate varies from 0 to 0.8 in example 1;
FIG. 5 is a comparison of the adjusted Rand index as the missing rate varies from 0 to 0.8 in example 1;
FIG. 6 is a comparison of accuracy as the missing rate varies from 0 to 0.8 in example 2;
FIG. 7 is a comparison of normalized mutual information as the missing rate varies from 0 to 0.8 in example 2;
FIG. 8 is a comparison of the adjusted Rand index as the missing rate varies from 0 to 0.8 in example 2.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art, but it should be understood that the invention is not limited to the scope of the embodiments. To those skilled in the art, various changes within the spirit and scope of the invention as defined by the appended claims are apparent, and all inventions making use of the inventive concept are protected.
As shown in FIG. 1 and FIG. 2, the two-modal clustering method with missing data comprises the following steps:
S1, feeding the two modality data of each sample in which both modalities are present into the corresponding autoencoders to obtain the corresponding hidden representations;
S2, obtaining the corresponding cross-modal contrastive learning loss and intra-modal reconstruction loss from the hidden representations corresponding to the two modality data;
S3, back-propagating through the current autoencoders according to the cross-modal contrastive learning loss and the intra-modal reconstruction loss to update the parameters and weights of the current autoencoders;
S4, judging whether the number of back-propagation iterations reaches a threshold; if so, proceeding to step S5, otherwise returning to step S1; the threshold is 100;
S5, obtaining the corresponding cross-modal contrastive learning loss, intra-modal reconstruction loss, and cross-modal dual prediction loss from the current latest hidden representations corresponding to the two modality data;
S6, back-propagating through the current autoencoders according to the current latest cross-modal contrastive learning loss, cross-modal dual prediction loss, and intra-modal reconstruction loss to update the parameters and weights of the current autoencoders;
S7, judging whether the current autoencoders have converged; if so, proceeding to step S8, otherwise returning to step S5;
S8, feeding the set of samples in which both modalities are present, the samples with only the first modality, and the samples with only the second modality, as a two-modal data set with missing data, into the current latest autoencoders to obtain the hidden representations corresponding to the two-modal data set with missing data;
S9, obtaining, based on the dual mapping, the representation of the missing modality corresponding to the hidden representation of the sample set with only the first modality and the representation of the missing modality corresponding to the hidden representation of the sample set with only the second modality in the two-modal data set;
S10, concatenating the different modality representations corresponding to each sample as its common representation, and clustering the common representations to complete the two-modal clustering with missing data.
The autoencoder in step S1 includes an encoder and a decoder. The encoder includes, connected in sequence, a first fully-connected layer, a first batch normalization layer, a first activation function, a second fully-connected layer, a second batch normalization layer, a second activation function, a third fully-connected layer, a third batch normalization layer, a third activation function, a fourth fully-connected layer, and a fourth activation function. The input dimension of the first fully-connected layer is the dimension of the input modality data; the output dimensions of the first, second, and third fully-connected layers are all 1024; the first, second, and third activation functions are all ReLU; the output dimension of the fourth fully-connected layer is 128, and the fourth activation function is Softmax.
The decoder includes, connected in sequence, a fifth fully-connected layer, a fourth batch normalization layer, a fifth activation function, a sixth fully-connected layer, a fifth batch normalization layer, a sixth activation function, a seventh fully-connected layer, a sixth batch normalization layer, a seventh activation function, an eighth fully-connected layer, a seventh batch normalization layer, and an eighth activation function. The input dimension of the fifth fully-connected layer is 128; the output dimensions of the fifth, sixth, and seventh fully-connected layers are all 1024; the fifth, sixth, seventh, and eighth activation functions are all ReLU; and the output dimension of the eighth fully-connected layer is the dimension of the input modality data.
The specific method for obtaining the corresponding cross-modal contrastive learning loss from the hidden representations corresponding to the two modality data in step S2 is: according to the formula

$$\ell_{cl} = -\frac{1}{m}\sum_{t=1}^{m}\left[ I\left(z_t^{(1)}, z_t^{(2)}\right) + \alpha\left( H\left(z_t^{(1)}\right) + H\left(z_t^{(2)}\right) \right) \right]$$

obtaining the cross-modal contrastive learning loss $\ell_{cl}$; where m is the total number of samples in which both modalities are present, t denotes the t-th sample, $I(\cdot,\cdot)$ denotes mutual information, and $z_t^{(v)}$ denotes the hidden representation corresponding to the v-th modality data of the t-th sample, v ∈ {1, 2}; that is, $z_t^{(1)}$ is the hidden representation corresponding to the first modality data of the t-th sample in which both modalities are present, and $z_t^{(2)}$ is the hidden representation corresponding to the second modality data of the t-th sample in which both modalities are present; $H(\cdot)$ denotes information entropy, and α is the balance parameter of the entropy.
The specific method for obtaining the corresponding intra-modal reconstruction loss from the hidden representations corresponding to the two modality data in step S2 is: according to the formula

$$\ell_{rec} = \frac{1}{m}\sum_{t=1}^{m}\sum_{v=1}^{2}\left\| x_t^{(v)} - g^{(v)}\!\left(f^{(v)}\!\left(x_t^{(v)}\right)\right) \right\|_2^2$$

obtaining the intra-modal reconstruction loss $\ell_{rec}$; where m is the total number of samples in which both modalities are present, t denotes the t-th sample, $x_t^{(v)}$ denotes the v-th modality data of the t-th sample, $f^{(v)}(\cdot)$ and $g^{(v)}(\cdot)$ denote the encoder and decoder currently corresponding to the v-th modality data, respectively, and $\|\cdot\|_2$ is the norm.
The specific method of step S3 is: taking the result of $\ell_{cl} + 0.1\,\ell_{rec}$ as the current loss, back-propagating through the current autoencoders, and updating the parameters and weights of the current autoencoders; where $\ell_{cl}$ is the cross-modal contrastive learning loss and $\ell_{rec}$ is the intra-modal reconstruction loss.
In step S5, the specific method for obtaining the corresponding cross-modal dual prediction loss from the current latest hidden representations corresponding to the two modality data is: according to the formula

$$\ell_{pre} = \left\| G^{(1)}\!\left(Z^{1}\right) - Z^{2} \right\|_2^2 + \left\| G^{(2)}\!\left(Z^{2}\right) - Z^{1} \right\|_2^2$$

obtaining the cross-modal dual prediction loss $\ell_{pre}$; where $Z^1$ is the set of hidden representations corresponding to all first modality data of the samples in which both modalities are present, $Z^2$ is the set of hidden representations corresponding to all second modality data of the samples in which both modalities are present, $G^{(1)}(Z^1)$ maps $Z^1$, $G^{(2)}(Z^2)$ maps $Z^2$, $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ form a dual mapping, and $\|\cdot\|_2$ is the norm.
The specific method of step S6 is: taking the result of $\ell_{cl} + 0.1\,\ell_{pre} + 0.1\,\ell_{rec}$ as the current loss, back-propagating through the current autoencoders, and updating the parameters and weights of the current autoencoders; where $\ell_{cl}$ is the cross-modal contrastive learning loss, $\ell_{pre}$ is the cross-modal dual prediction loss, and $\ell_{rec}$ is the intra-modal reconstruction loss.
The specific method of step S8 is: according to the formulas

$$Z^{1} = f^{(1)}\!\left(X_{1}\right),\qquad Z^{2} = f^{(2)}\!\left(X_{2}\right)$$
$$Z^{(1)} = f^{(1)}\!\left(X^{(1)}\right),\qquad Z^{(2)} = f^{(2)}\!\left(X^{(2)}\right)$$

obtaining the hidden representations corresponding to the two-modal data set with missing data, including the hidden representation $Z^{1}$ corresponding to the sample set $X_{1}$ of first modality data of the samples in which both modalities are present, the hidden representation $Z^{2}$ corresponding to the sample set $X_{2}$ of second modality data of the samples in which both modalities are present, the hidden representation $Z^{(1)}$ corresponding to the sample set $X^{(1)}$ in which only the first modality is present, and the hidden representation $Z^{(2)}$ corresponding to the sample set $X^{(2)}$ in which only the second modality is present; where $f^{(1)}(\cdot)$ denotes the encoder of the latest autoencoder corresponding to the first modality data and $f^{(2)}(\cdot)$ denotes the encoder of the latest autoencoder corresponding to the second modality data.
The specific method of step S9 is: according to the formulas

$$\hat{Z}^{(2)} = G^{(1)}\!\left(Z^{(1)}\right),\qquad \hat{Z}^{(1)} = G^{(2)}\!\left(Z^{(2)}\right)$$

obtaining, respectively, the representation $\hat{Z}^{(2)}$ of the missing modality corresponding to the hidden representation $Z^{(1)}$ of the sample set in which only the first modality is present, and the representation $\hat{Z}^{(1)}$ of the missing modality corresponding to the hidden representation $Z^{(2)}$ of the sample set in which only the second modality is present; where $G^{(1)}(\cdot)$ denotes the mapping corresponding to the first modality, $G^{(2)}(\cdot)$ denotes the mapping corresponding to the second modality, and $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ form a dual mapping.
In step S10, the specific method of concatenating the different modality representations corresponding to each sample and using the result as its common representation is: taking $\left[Z^{1}; Z^{2}\right]$ as the common representation of the samples in which both modalities are present; taking $\left[Z^{(1)}; \hat{Z}^{(2)}\right]$ as the common representation of the samples in which only the first modality is present; and taking $\left[\hat{Z}^{(1)}; Z^{(2)}\right]$ as the common representation of the samples in which only the second modality is present.
In the specific implementation, the entropy is regularized and the parameter α is fixed at 10. The design of the cross-modal contrastive learning loss has two advantages: on the one hand, from information theory, the information entropy is the average amount of information conveyed by an event, so a larger entropy corresponds to a more informative representation; on the other hand, maximizing $H(z_t^{(1)})$ and $H(z_t^{(2)})$ avoids the trivial solution of assigning all samples to the same cluster. To compute $I(z_t^{(1)}, z_t^{(2)})$, the joint probability distribution p(z, z') of the variables z and z' is defined first. Since a Softmax activation function is stacked on the last layer of the encoder, $z_t^{(1)}$ and $z_t^{(2)}$ can be regarded as over-clustering probabilities; that is, $z_t^{(1)}$ and $z_t^{(2)}$ can be understood as the distributions of the two discrete cluster-assignment variables z and z' over D classes, D being the dimension of $z_t^{(1)}$ and $z_t^{(2)}$. The joint probability p(z, z') is thus defined as the matrix $P \in \mathbb{R}^{D \times D}$:

$$P = \frac{1}{m}\sum_{t=1}^{m} z_t^{(1)} \left(z_t^{(2)}\right)^{\top}$$

Let $P_d$ and $P_{d'}$ denote the marginal probability distributions P(z = d) and P(z' = d'), which can be obtained by summing the d-th row and the d'-th column of the joint probability distribution matrix P, respectively. For discrete variables, the cross-modal contrastive learning loss function can therefore be redefined as:

$$\ell_{cl} = -\sum_{d=1}^{D}\sum_{d'=1}^{D} P_{dd'} \log \frac{P_{dd'}}{P_d\, P_{d'}} + \alpha\sum_{d=1}^{D} P_d \log P_d + \alpha\sum_{d'=1}^{D} P_{d'} \log P_{d'}$$

where $P_{dd'}$ is the element in row d, column d' of P.
To infer the missing modality, the invention proposes a dual prediction mechanism. Specifically, in a latent space parameterized by a neural network, the representation $Z_i$ of a given modality can be predicted from $Z_j$ by minimizing the conditional entropy $H(Z_i \mid Z_j)$, where i = 1, j = 2 or i = 2, j = 1; that is, $Z_i$ is fully determined by $Z_j$ if and only if $H(Z_i \mid Z_j) = 0$. A common approach to optimizing this objective is to introduce a variational distribution $Q(Z_i \mid Z_j)$ and maximize the lower bound $\mathbb{E}_{P(Z_i, Z_j)}\left[\log Q\left(Z_i \mid Z_j\right)\right]$ of $-H\left(Z_i \mid Z_j\right)$. The variational distribution Q may be of any type, such as a Gaussian, categorical, or Laplace distribution. In particular, the method may take Q to be the Gaussian $\mathcal{N}\left(Z_i \mid G^{(j)}(Z_j), \sigma I\right)$, where σI is the variance matrix. Omitting the constants of the Gaussian, maximizing the lower bound is equivalent to minimizing $\left\| G^{(j)}(Z_j) - Z_i \right\|_2^2$. For given bimodal data, the cross-modal dual prediction loss is then obtained as

$$\ell_{pre} = \left\| G^{(1)}\!\left(Z^{1}\right) - Z^{2} \right\|_2^2 + \left\| G^{(2)}\!\left(Z^{2}\right) - Z^{1} \right\|_2^2$$

It is noted that, without the intra-modal reconstruction loss, the above dual prediction loss alone may lead to a trivial solution in which $Z^1$ and $Z^2$ collapse to one and the same constant. After the model converges, the missing-modality representation corresponding to the hidden representation of the sample set with only the first modality can easily be predicted through the dual mapping.
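The lower bound used above follows from the non-negativity of the KL divergence; spelled out, with P the true conditional and Q the variational surrogate:

```latex
-H(Z_i \mid Z_j)
  = \mathbb{E}_{P(Z_i, Z_j)}\bigl[\log P(Z_i \mid Z_j)\bigr]
  = \mathbb{E}_{P(Z_i, Z_j)}\bigl[\log Q(Z_i \mid Z_j)\bigr]
    + \mathbb{E}_{P(Z_j)}\Bigl[ D_{\mathrm{KL}}\bigl(P(\cdot \mid Z_j) \,\|\, Q(\cdot \mid Z_j)\bigr) \Bigr]
  \;\ge\; \mathbb{E}_{P(Z_i, Z_j)}\bigl[\log Q(Z_i \mid Z_j)\bigr],
\qquad
\log \mathcal{N}\bigl(Z_i \mid G^{(j)}(Z_j), \sigma I\bigr)
  = -\tfrac{1}{2\sigma}\bigl\| Z_i - G^{(j)}(Z_j) \bigr\|_2^2 + \text{const}.
```

Substituting the Gaussian choice of Q into the bound and dropping the constant yields exactly the squared-error form of $\ell_{pre}$ above.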
When the whole model has been trained to convergence on the modality-complete data, the entire data set is fed directly into the network, which performs missing-modality completion and infers the corresponding representations. The representations of the different modalities are then concatenated to obtain a common representation, which is clustered with a traditional clustering method such as k-means, completing the two-modal clustering with missing data. Since the method applies to any pair of modalities, it can be directly generalized to multi-modal clustering.
The mapping models $G^{(1)}$ and $G^{(2)}$ adopt the same network structure, which has 6 layers:
First layer: a fully-connected layer with input 128 and output 128, followed by a batch normalization layer (BatchNorm1d); the activation function is ReLU.
Second layer: a fully-connected layer with input 128 and output 256, followed by a batch normalization layer (BatchNorm1d); the activation function is ReLU.
Third layer: a fully-connected layer with input 256 and output 128, followed by a batch normalization layer (BatchNorm1d); the activation function is ReLU.
Fourth layer: a fully-connected layer with input 128 and output 256, followed by a batch normalization layer (BatchNorm1d); the activation function is ReLU.
Fifth layer: a fully-connected layer with input 256 and output 128, followed by a batch normalization layer (BatchNorm1d); the activation function is ReLU.
Sixth layer: a fully-connected layer with input 128 and output 128; the activation function is Softmax.
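A sketch of this mapping network; both dual maps instantiate the same class, and the names are illustrative:

```python
import torch
import torch.nn as nn

class DualMap(nn.Module):
    """Six-layer MLP that predicts one modality's hidden representation
    from the other's, mirroring the 6-layer structure described above."""
    def __init__(self, dim: int = 128, hidden: int = 256):
        super().__init__()
        layers = []
        # Layers 1-5: FC + BatchNorm1d + ReLU with the stated in/out sizes
        for d_in, d_out in [(dim, dim), (dim, hidden), (hidden, dim),
                            (dim, hidden), (hidden, dim)]:
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU()]
        # Layer 6: FC(128 -> 128) with Softmax activation
        layers += [nn.Linear(dim, dim), nn.Softmax(dim=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

G1, G2 = DualMap(), DualMap()
```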
In one embodiment of the invention, the Caltech101-20 data set is used, which contains 2386 pictures from 20 object classes; 2 extracted image features (HOG and GIST) are used as the 2 modalities. The experimental data category information and sample number distribution are shown in Table 1.
Table 1: experimental data Classification information and sample number distribution
[Table 1 is reproduced as an image in the original publication.]
The experiments were performed at different missing rates, defined as η = (n − m)/n, where n is the size of the data set and m is the number of modality-complete samples. To verify the superiority of this scheme, we compared this scheme (COMPLETER) with 10 other multi-modal clustering techniques, namely partial multi-view clustering (PVC), incomplete multi-modal visual data grouping (IMG), the unified embedding alignment framework (UEAF), doubly aligned incomplete multi-view clustering (DAIMC), spectral-perturbation incomplete multi-view clustering (PIC), efficient regularized incomplete multi-view clustering (EERIMVC), deep canonical correlation analysis (DCCA), the deep canonically correlated autoencoder (DCCAE), binary multi-view clustering (BMVC), and autoencoder-in-autoencoder networks (AE2-Nets).
The test results at a deletion rate of 0.5 are shown in table 2.
Table 2: test results when the deletion rate η is 0.5
[Table 2 is reproduced as an image in the original publication.]
The test results at a deletion rate of 0 are shown in table 3.
Table 3: test results when the deletion rate η is 0
[Table 3 is reproduced as an image in the original publication.]
As can be seen from Tables 2 and 3, compared with other clustering methods, the method achieves a large improvement on the two indices of normalized mutual information and adjusted Rand index, which means that in practical applications the object picture data can be clustered correctly without consuming a large amount of human resources on picture classification.
To further explore the effectiveness of our method, we varied the missing rate η from 0 to 0.8 on Caltech101-20 with an interval of 0.1, as shown in FIGS. 3, 4, and 5. From the results in FIGS. 3-5 it can be observed that: i) COMPLETER (the present method) is significantly better than all comparison methods at all missing-rate settings; ii) as the missing rate increases, the performance of the comparison methods drops by a much greater amount than that of our method. For example, at η = 0, COMPLETER and PIC achieve NMI values of 0.6806 and 0.6793, respectively, while COMPLETER becomes significantly better than PIC as the missing rate increases.
In another embodiment of the invention, the Scene-15 data set is used, which contains 4485 pictures from 15 scene categories; 2 extracted image features (PHOG and GIST) are used as the 2 modalities. The experimental data category information and sample number distribution are shown in Table 4.
Table 4: experimental data Classification information and sample number distribution
Office 215 | Kitchen 210 | Living room 289 | Bedroom 216 | Store 315
Industrial 311 | Tall building 356 | Inside city 308 | Street 292 | Highway 260
Coast 360 | Open country 410 | Mountain 374 | Forest 328 | Suburb 241
The results of the experiment when the deletion rate η was 0.5 are shown in table 5.
Table 5: experimental results when the deletion rate η is 0.5
[Table 5 is reproduced as an image in the original publication.]
The results of the experiment when the deletion rate η is 0 are shown in table 6.
Table 6: experimental results when the deletion rate η is 0
[Table 6 is reproduced as an image in the original publication.]
As can be seen from Tables 5 and 6, compared with other clustering methods, the method achieves a large improvement on the two indices of accuracy and normalized mutual information, which means that in practical applications the object picture data can be clustered correctly without consuming a large amount of human resources on picture classification. Meanwhile, the method achieves the best results in both the missing and non-missing settings.
As shown in FIGS. 6, 7, and 8, to further investigate the effectiveness of the method, experiments were performed varying the missing rate η from 0 to 0.8 at intervals of 0.1. From the results in FIGS. 6-8, it can be observed that COMPLETER (the present method) outperforms all comparison methods at almost all missing-rate settings.
In summary, based on autoencoders, the invention learns a modality-specific representation of each modality's data through the intra-modal reconstruction loss, learns modality-consistent representations through the cross-modal contrastive learning loss, and, through the cross-modal dual prediction loss, recovers the information of missing modalities while discarding cross-modal inconsistency, thereby further improving consistency; data recovery and consistency learning are handled in a unified framework, achieving a better clustering effect.

Claims (10)

1. A two-modal clustering method with missing data, characterized by comprising the following steps:
S1, feeding the two modality data of each sample in which both modalities are present into the corresponding autoencoders to obtain the corresponding hidden representations;
S2, obtaining the corresponding cross-modal contrastive learning loss and intra-modal reconstruction loss from the hidden representations corresponding to the two modality data;
S3, back-propagating through the current autoencoders according to the cross-modal contrastive learning loss and the intra-modal reconstruction loss to update the parameters and weights of the current autoencoders;
S4, judging whether the number of back-propagation iterations reaches a threshold; if so, proceeding to step S5, otherwise returning to step S1;
S5, obtaining the corresponding cross-modal contrastive learning loss, intra-modal reconstruction loss, and cross-modal dual prediction loss from the current latest hidden representations corresponding to the two modality data;
S6, back-propagating through the current autoencoders according to the current latest cross-modal contrastive learning loss, cross-modal dual prediction loss, and intra-modal reconstruction loss to update the parameters and weights of the current autoencoders;
S7, judging whether the current autoencoders have converged; if so, proceeding to step S8, otherwise returning to step S5;
S8, feeding the set of samples in which both modalities are present, the samples with only the first modality, and the samples with only the second modality, as a two-modal data set with missing data, into the current latest autoencoders to obtain the hidden representations corresponding to the two-modal data set with missing data;
S9, obtaining, based on the dual mapping, the representation of the missing modality corresponding to the hidden representation of the sample set with only the first modality and the representation of the missing modality corresponding to the hidden representation of the sample set with only the second modality in the two-modal data set;
S10, concatenating the different modality representations corresponding to each sample as its common representation, and clustering the common representations to complete the two-modal clustering with missing data.
2. The two-modal clustering method with missing data according to claim 1, wherein the autoencoder in step S1 comprises an encoder and a decoder, the encoder comprising, connected in sequence, a first fully-connected layer, a first batch normalization layer, a first activation function, a second fully-connected layer, a second batch normalization layer, a second activation function, a third fully-connected layer, a third batch normalization layer, a third activation function, a fourth fully-connected layer, and a fourth activation function; the input dimension of the first fully-connected layer is the dimension of the input modality data; the output dimensions of the first, second, and third fully-connected layers are all 1024; the first, second, and third activation functions are all ReLU; the output dimension of the fourth fully-connected layer is 128; and the fourth activation function is Softmax;
the decoder comprises, connected in sequence, a fifth fully-connected layer, a fourth batch normalization layer, a fifth activation function, a sixth fully-connected layer, a fifth batch normalization layer, a sixth activation function, a seventh fully-connected layer, a sixth batch normalization layer, a seventh activation function, an eighth fully-connected layer, a seventh batch normalization layer, and an eighth activation function; the input dimension of the fifth fully-connected layer is 128; the output dimensions of the fifth, sixth, and seventh fully-connected layers are all 1024; the fifth, sixth, seventh, and eighth activation functions are all ReLU; and the output dimension of the eighth fully-connected layer is the dimension of the input modality data.
3. The two-modal clustering method with missing data according to claim 1, wherein the specific method for obtaining the corresponding cross-modal contrastive learning loss from the hidden representations corresponding to the two modality data in step S2 is: according to the formula

$$\ell_{cl} = -\frac{1}{m}\sum_{t=1}^{m}\left[ I\left(z_t^{(1)}, z_t^{(2)}\right) + \alpha\left( H\left(z_t^{(1)}\right) + H\left(z_t^{(2)}\right) \right) \right]$$

obtaining the cross-modal contrastive learning loss $\ell_{cl}$; wherein m is the total number of samples in which both modalities are present; t denotes the t-th sample; $I(\cdot,\cdot)$ denotes mutual information; $z_t^{(1)}$ is the hidden representation corresponding to the first modality data of the t-th sample in which both modalities are present; $z_t^{(2)}$ is the hidden representation corresponding to the second modality data of the t-th sample in which both modalities are present; $H(\cdot)$ denotes information entropy; and α is the balance parameter of the entropy.
4. The two-modal clustering method with missing data according to claim 1, wherein the specific method for obtaining the corresponding intra-modal reconstruction loss from the hidden representations corresponding to the two modality data in step S2 is: according to the formula

$$\ell_{rec} = \frac{1}{m}\sum_{t=1}^{m}\sum_{v=1}^{2}\left\| x_t^{(v)} - g^{(v)}\!\left(f^{(v)}\!\left(x_t^{(v)}\right)\right) \right\|_2^2$$

obtaining the intra-modal reconstruction loss $\ell_{rec}$; wherein m is the total number of samples in which both modalities are present; t denotes the t-th sample; $x_t^{(v)}$ denotes the v-th modality data of the t-th sample; $f^{(v)}(\cdot)$ and $g^{(v)}(\cdot)$ denote the encoder and decoder currently corresponding to the v-th modality data, respectively; and $\|\cdot\|_2$ is the norm.
5. The two-modal clustering method with missing data according to claim 1, wherein the specific method of step S3 is:
taking the result of $\ell_{cl} + 0.1\,\ell_{rec}$ as the current loss, back-propagating through the current autoencoders, and updating the parameters and weights of the current autoencoders; wherein $\ell_{cl}$ is the cross-modal contrastive learning loss and $\ell_{rec}$ is the intra-modal reconstruction loss.
6. The two-modal clustering method with missing data according to claim 1, wherein the specific method for obtaining the corresponding cross-modal dual prediction loss from the current latest hidden representations corresponding to the two modality data in step S5 is: according to the formula

$$\ell_{pre} = \left\| G^{(1)}\!\left(Z^{1}\right) - Z^{2} \right\|_2^2 + \left\| G^{(2)}\!\left(Z^{2}\right) - Z^{1} \right\|_2^2$$

obtaining the cross-modal dual prediction loss $\ell_{pre}$; wherein $Z^1$ is the set of hidden representations corresponding to all first modality data of the samples in which both modalities are present; $Z^2$ is the set of hidden representations corresponding to all second modality data of the samples in which both modalities are present; $G^{(1)}(Z^1)$ maps $Z^1$, $G^{(2)}(Z^2)$ maps $Z^2$, and $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ form a dual mapping; and $\|\cdot\|_2$ is the norm.
7. The two-modal clustering method with missing data according to claim 1, wherein the specific method of step S6 is:
taking the result of $\ell_{cl} + 0.1\,\ell_{pre} + 0.1\,\ell_{rec}$ as the current loss, back-propagating through the current autoencoders, and updating the parameters and weights of the current autoencoders; wherein $\ell_{cl}$ is the cross-modal contrastive learning loss, $\ell_{pre}$ is the cross-modal dual prediction loss, and $\ell_{rec}$ is the intra-modal reconstruction loss.
8. The two-modal clustering method with missing data according to claim 1, wherein the specific method of step S8 is: according to the formulas

$$Z^{1} = f^{(1)}\!\left(X_{1}\right),\qquad Z^{2} = f^{(2)}\!\left(X_{2}\right),\qquad Z^{(1)} = f^{(1)}\!\left(X^{(1)}\right),\qquad Z^{(2)} = f^{(2)}\!\left(X^{(2)}\right)$$

obtaining the hidden representations corresponding to the two-modal data set with missing data, including the hidden representation $Z^{1}$ corresponding to the sample set $X_{1}$ of first modality data of the samples in which both modalities are present, the hidden representation $Z^{2}$ corresponding to the sample set $X_{2}$ of second modality data of the samples in which both modalities are present, the hidden representation $Z^{(1)}$ corresponding to the sample set $X^{(1)}$ in which only the first modality is present, and the hidden representation $Z^{(2)}$ corresponding to the sample set $X^{(2)}$ in which only the second modality is present; wherein $f^{(1)}(\cdot)$ denotes the encoder of the latest autoencoder corresponding to the first modality data and $f^{(2)}(\cdot)$ denotes the encoder of the latest autoencoder corresponding to the second modality data.
9. The two-modal clustering method with missing data according to claim 8, wherein the specific method of step S9 is: according to the formulas

$$\hat{Z}^{(2)} = G^{(1)}\!\left(Z^{(1)}\right),\qquad \hat{Z}^{(1)} = G^{(2)}\!\left(Z^{(2)}\right)$$

obtaining, respectively, the representation $\hat{Z}^{(2)}$ of the missing modality corresponding to the hidden representation $Z^{(1)}$ of the sample set in which only the first modality is present, and the representation $\hat{Z}^{(1)}$ of the missing modality corresponding to the hidden representation $Z^{(2)}$ of the sample set in which only the second modality is present; wherein $G^{(1)}(\cdot)$ denotes the mapping corresponding to the first modality, $G^{(2)}(\cdot)$ denotes the mapping corresponding to the second modality, and $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ form a dual mapping.
10. The two-modal clustering method with missing data according to claim 9, wherein the specific method of concatenating the different modality representations corresponding to each sample and using the result as its common representation in step S10 is:
taking $\left[Z^{1}; Z^{2}\right]$ as the common representation of the samples in which both modalities are present; taking $\left[Z^{(1)}; \hat{Z}^{(2)}\right]$ as the common representation of the samples in which only the first modality is present; and taking $\left[\hat{Z}^{(1)}; Z^{(2)}\right]$ as the common representation of the samples in which only the second modality is present.
CN202110095029.2A 2021-01-25 2021-01-25 Image classification method with missing data in mode Active CN112784902B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110095029.2A CN112784902B (en) 2021-01-25 2021-01-25 Image classification method with missing data in mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110095029.2A CN112784902B (en) 2021-01-25 2021-01-25 Image classification method with missing data in mode

Publications (2)

Publication Number Publication Date
CN112784902A true CN112784902A (en) 2021-05-11
CN112784902B CN112784902B (en) 2023-06-30

Family

ID=75758853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110095029.2A Active CN112784902B (en) 2021-01-25 2021-01-25 Image classification method with missing data in mode

Country Status (1)

Country Link
CN (1) CN112784902B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657272A (en) * 2021-08-17 2021-11-16 山东建筑大学 Micro-video classification method and system based on missing data completion
CN114742132A (en) * 2022-03-17 2022-07-12 湖南工商大学 Deep multi-view clustering method, system and equipment based on common difference learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8255739B1 (en) * 2008-06-30 2012-08-28 American Megatrends, Inc. Achieving data consistency in a node failover with a degraded RAID array
CN106202281A (en) * 2016-06-28 2016-12-07 广东工业大学 A kind of multi-modal data represents learning method and system
WO2017122785A1 (en) * 2016-01-15 2017-07-20 Preferred Networks, Inc. Systems and methods for multimodal generative machine learning
WO2018232378A1 (en) * 2017-06-16 2018-12-20 Markable, Inc. Image processing system
CN112001437A (en) * 2020-08-19 2020-11-27 四川大学 Modal non-complete alignment-oriented data clustering method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8255739B1 (en) * 2008-06-30 2012-08-28 American Megatrends, Inc. Achieving data consistency in a node failover with a degraded RAID array
WO2017122785A1 (en) * 2016-01-15 2017-07-20 Preferred Networks, Inc. Systems and methods for multimodal generative machine learning
CN106202281A (en) * 2016-06-28 2016-12-07 广东工业大学 A kind of multi-modal data represents learning method and system
WO2018232378A1 (en) * 2017-06-16 2018-12-20 Markable, Inc. Image processing system
CN112001437A (en) * 2020-08-19 2020-11-27 四川大学 Modal non-complete alignment-oriented data clustering method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YIJIE LIN et al.: "COMPLETER: Incomplete Multi-view Clustering via Contrastive Prediction", 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11169-11178 *
敬明旻: "Multi-modal feature adaptive clustering method based on deep neural networks", Computer Applications and Software, no. 10, pages 262-269 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657272A (en) * 2021-08-17 2021-11-16 山东建筑大学 Micro-video classification method and system based on missing data completion
CN114742132A (en) * 2022-03-17 2022-07-12 湖南工商大学 Deep multi-view clustering method, system and equipment based on common difference learning

Also Published As

Publication number Publication date
CN112784902B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
US20210012198A1 (en) Method for training deep neural network and apparatus
CN111310707B (en) Bone-based graph annotation meaning network action recognition method and system
Shao et al. Multiple incomplete views clustering via weighted nonnegative matrix factorization with regularization
CN111639544B (en) Expression recognition method based on multi-branch cross-connection convolutional neural network
WO2021218471A1 (en) Neural network for image processing and related device
CN113468227B (en) Information recommendation method, system, equipment and storage medium based on graph neural network
WO2022042043A1 (en) Machine learning model training method and apparatus, and electronic device
Sun et al. Global-local label correlation for partial multi-label learning
CN112016601B (en) Network model construction method based on knowledge graph enhanced small sample visual classification
CN111860193B (en) Text-based pedestrian retrieval self-supervision visual representation learning system and method
CN112784902A (en) Two-mode clustering method with missing data
CN110210540B (en) Cross-social media user identity recognition method and system based on attention mechanism
WO2020253180A1 (en) Smart home decision support system and decision support method
CN110110724A (en) The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
CN112786160A (en) Multi-image input multi-label gastroscope image classification method based on graph neural network
Liang et al. ClusterFomer: Clustering As A Universal Visual Learner
CN113378938B (en) Edge transform graph neural network-based small sample image classification method and system
CN114359656A (en) Melanoma image identification method based on self-supervision contrast learning and storage device
CN112668543B (en) Isolated word sign language recognition method based on hand model perception
CN112084913B (en) End-to-end human body detection and attribute identification method
CN107729821B (en) Video summarization method based on one-dimensional sequence learning
CN116450827A (en) Event template induction method and system based on large-scale language model
CN111177492A (en) Cross-modal information retrieval method based on multi-view symmetric nonnegative matrix factorization
CN113378934B (en) Small sample image classification method and system based on semantic perception map neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant