CN110717599A - Dissociation characterization learning method and device integrating multiple modes - Google Patents

Dissociation characterization learning method and device integrating multiple modes

Info

Publication number
CN110717599A
CN110717599A
Authority
CN
China
Prior art keywords
representation
dissociation
user
commodity
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910934657.8A
Other languages
Chinese (zh)
Other versions
CN110717599B (en)
Inventor
朱文武 (Zhu Wenwu)
王鑫 (Wang Xin)
马坚鑫 (Ma Jianxin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201910934657.8A
Publication of CN110717599A
Application granted
Publication of CN110717599B
Active (current legal status)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/02 - Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 - Market modelling; Market analysis; Collecting market data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/06 - Buying, selling or leasing transactions
    • G06Q30/0601 - Electronic shopping [e-shopping]
    • G06Q30/0631 - Item recommendations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dissociation characterization learning method and device fusing multiple modalities, wherein the method comprises the following steps: acquiring image information and text information of a target to obtain an image non-dissociated characterization and a text non-dissociated characterization; obtaining an image characterization and a text characterization according to the image non-dissociated characterization and the text non-dissociated characterization; obtaining a commodity characterization and a user preference characterization which depict the user's consumption preference according to the image characterization and the text characterization; and predicting the user's degree of interest in the commodity according to the commodity characterization and the user preference characterization. The method can fuse data of three different modalities, namely images, text and relations, and obtain a more accurate and effective dissociated characterization by exploiting the complementarity among the modalities.

Description

Dissociation characterization learning method and device integrating multiple modes
Technical Field
The invention relates to the technical field of characterization learning, and in particular to a dissociation characterization learning method and device fusing multiple modalities.
Background
Characterization learning is a core task of deep learning: it dispenses with complex feature engineering by automatically mapping samples to points in a continuous vector space (called the characterizations of the samples), which greatly facilitates subsequent tasks such as retrieval and prediction. Dissociation (that is, disentanglement) is widely considered a property that a good characterization should possess, because the robustness and interpretability it brings can narrow the gap between a training environment with limited data and a complex online environment. Dissociation aims to separate the various factors behind the formation of a sample (such as the price, color and material considered when purchasing a garment) and to encode each factor in a different part of the characterization.
However, current dissociation characterization learning techniques mainly focus on simple single-modality data such as images, and cannot be effectively applied to the complex multi-modality data found in real application scenarios, for example the image and text modalities of commodities in an e-commerce scenario together with a third modality consisting of the relationship data between users and commodities. This problem remains to be solved.
Disclosure of Invention
The present application is based on the inventors' recognition and discovery of the following problems:
in the related art, (1) multi-modal deep learning techniques provide a framework for fusing bimodal characterizations, but the characterizations are not dissociated, and the fusion framework provided by such work cannot process relational data; (2) visual basic-concept learning techniques based on a constrained variational framework process only the single modality of images; (3) text disentanglement characterization techniques applied to biomedical abstracts handle only the single modality of text, require the data to follow a relatively fixed format such as that of medical documents, and are therefore unsuitable for freely phrased short texts such as commodity titles; (4) disentangled graph convolution network techniques are only suitable for homogeneous networks such as social networks, and do not apply when the relationship network contains heterogeneous information such as non-homogeneous nodes (commodities, users, etc.) and non-homogeneous edges (clicks, friendships, etc.).
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a dissociation characterization learning method fusing multiple modalities, which can fuse data of three different modalities, namely images, text and relations, and obtain a more accurate and effective dissociated characterization by exploiting the complementarity between the modalities.
Another object of the present invention is to provide a dissociation characterization learning apparatus fusing multiple modalities.
In order to achieve the above object, an embodiment of one aspect of the present invention provides a dissociation characterization learning method fusing multiple modalities, including the following steps: acquiring image information and text information of a target to obtain an image non-dissociated characterization and a text non-dissociated characterization; obtaining an image characterization and a text characterization according to the image non-dissociated characterization and the text non-dissociated characterization; obtaining a commodity characterization and a user preference characterization which depict the user's consumption preference according to the image characterization and the text characterization; and predicting the user's degree of interest in the commodity according to the commodity characterization and the user preference characterization.
According to the dissociation characterization learning method fusing multiple modalities of the embodiment of the present invention, information spanning three modalities can be fused through characterization learning so that the modalities complement one another, and a dissociated characterization can be obtained through deep learning, which improves the interpretability of the system; data of three different modalities, namely images, text and relations, can therefore be fused, and a more accurate and effective dissociated characterization can be obtained by exploiting the complementarity among the modalities.
In addition, the dissociation characterization learning method fusing multiple modalities according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the obtaining an image representation and a text representation according to the image non-dissociated representation and the text non-dissociated representation includes: performing F test on any two different components of the image non-dissociation characterization and the text non-dissociation characterization, and punishing an F test index to force the different components to correspond to different aspects of information; and fusing the two independent representations into a unified representation, and restoring the input image and the text from the unified representation respectively to obtain the image representation and the text representation.
Further, in an embodiment of the present invention, the obtaining a commodity characterization and a user preference characterization that depict the user's consumption preference includes: taking the commodity characterization as input, two dissociation graph convolution networks infer, in sequence, the commodity characterization and the user preference characterization depicting the user's consumption preference from the commodity-user relationship and the user-user relationship.
Further, in an embodiment of the present invention, the predicting the user's degree of interest in the commodity according to the commodity characterization and the user preference characterization includes: on the basis of the dissociated commodity characterization and user preference characterization, obtaining the degree of interest by measuring how well the commodity and the user match in each aspect.
In order to achieve the above object, an embodiment of another aspect of the present invention provides a dissociation characterization learning apparatus fusing multiple modalities, including: a dissociation encoder used for acquiring image information and text information of a target to obtain an image non-dissociated characterization and a text non-dissociated characterization, and for obtaining an image characterization and a text characterization according to the image non-dissociated characterization and the text non-dissociated characterization; a dissociation graph convolution encoder used for obtaining a commodity characterization and a user preference characterization which depict the user's consumption preference according to the image characterization and the text characterization; and a predictor used for predicting the user's degree of interest in the commodity according to the commodity characterization and the user preference characterization.
According to the dissociation characterization learning apparatus fusing multiple modalities of the embodiment of the present invention, information spanning three modalities can be fused through characterization learning so that the modalities complement one another, and a dissociated characterization can be obtained through deep learning, which improves the interpretability of the system; data of three different modalities, namely images, text and relations, can therefore be fused, and a more accurate and effective dissociated characterization can be obtained by exploiting the complementarity among the modalities.
In addition, the dissociation characterization learning apparatus fusing multiple modalities according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the dissociation encoder is further configured to perform F-test on any two different components of the image non-dissociation characterization and the text non-dissociation characterization, and penalize an F-test index to force the different components to correspond to different aspects of information; and two independent representations are fused into a unified representation, and the input image and the text are restored from the unified representation respectively to obtain the image representation and the text representation.
Further, in an embodiment of the present invention, the dissociation graph convolution encoder is further configured to take the commodity characterization as input, and its two dissociation graph convolution networks are configured to infer, in sequence, the commodity characterization and the user preference characterization depicting the user's consumption preference from the commodity-user relationship and the user-user relationship.
Further, in an embodiment of the present invention, the predictor is further configured to obtain the user's degree of interest in the commodity by measuring how well the commodity and the user match in each aspect, on the basis of the dissociated commodity characterization and user preference characterization.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a dissociation characterization learning method fusing multiple modalities according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a dissociation encoder according to an embodiment of the invention;
FIG. 3 is a schematic structural diagram of a dissociation graph convolution encoder that mines user preferences from a user's click history according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a dissociation graph convolution encoder that mines user preferences from a user's friend relationships according to an embodiment of the present invention;
FIG. 5 is a block diagram of a predictor according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a dissociation characterization learning apparatus fusing multiple modalities according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative, are intended to explain the present invention, and are not to be construed as limiting the invention.
At present, dissociation characterization learning has the following shortcomings in multi-modality real-world scenarios:
(1) The dissociation characterization techniques suited to different modalities differ greatly, so a single technique cannot simply be reused to learn dissociated characterizations uniformly. Image data are continuous in value and regular in structure, whereas text and relation data are high-dimensional, sparse, discrete in value and irregular; image and text data can be processed independently sample by sample, while the strong coupling introduced by relational data requires an algorithm to process multiple samples jointly.
(2) When multiple modalities are fused into a unified characterization, the information of each single modality must not be lost during fusion. Although retaining all modalities is theoretically beneficial, the complexity of neural networks makes the algorithm highly likely to fall into a local optimum; in particular, an unconstrained neural network easily learns to ignore a modality, for example attending only to images and never to text.
To address these problems of dissociation characterization learning in multi-modality real-world scenarios, (1) the embodiments of the present invention introduce the most advanced conventional characterization learning technique for each modality while designing a unified constraint mechanism that makes the characterization exhibit the dissociation property; and (2) during modality fusion, the embodiments of the present invention synchronously perform a series of subtasks, namely reconstructing the image and the text and predicting the relations, which can only be performed well if all modalities are fully retained.
Considering the importance of multiple modalities in real application scenarios, the embodiments of the present invention develop a dissociation characterization learning technique spanning the three modalities of images, text and relations, and obtain dissociated characterizations that cover information more comprehensively and are more interpretable, thereby effectively solving the technical problems of how to fuse and exploit the three modalities of images, text and relations (including user behavior and the like) for deep characterization learning, and of ensuring that the learned characterization after fusion is dissociated.
The method and the device for learning the dissociation characterization fusing multiple modalities according to the embodiments of the present invention are described below with reference to the drawings, and first, the method for learning the dissociation characterization fusing multiple modalities according to the embodiments of the present invention will be described with reference to the drawings.
Fig. 1 is a flow chart of a method for learning a dissociated representation that fuses multiple modalities according to an embodiment of the present invention.
As shown in fig. 1, the method for learning the dissociation characterization fused with multiple modalities includes the following steps:
in step S101, image information and text information of the target are acquired to obtain an image non-dissociation characteristic and a text non-dissociation characteristic.
It can be understood that, as shown in fig. 2, after the image and the text of one commodity are input, the two encoders for processing the image and the text respectively output the non-dissociated representations of the image and the text.
In step S102, an image characterization and a text characterization are obtained according to the image non-dissociated characterization and the text non-dissociated characterization.
In one embodiment of the present invention, obtaining the image characterization and the text characterization from the image non-dissociated characterization and the text non-dissociated characterization includes: performing an F-test on any two different components of the image non-dissociated characterization and the text non-dissociated characterization, and penalizing the F-test index to force different components to correspond to different aspects of information; and fusing the two independent characterizations into a unified characterization, and reconstructing the input image and text respectively from the unified characterization to obtain the image characterization and the text characterization.
Specifically, as shown in fig. 2, in order to dissociate the characterization, the embodiment of the present invention performs an F-test on any two different components of the characterization (measuring the similarity of the distributions of the two components) and then penalizes the F-test index to force different components to correspond to different aspects of information (e.g., one component corresponds to color and another to material); the two independent characterizations are then fused into a unified characterization by a neural network module; finally, two decoders (used only during training, not during prediction) respectively reconstruct the input image and text from the unified characterization, ensuring that the unified characterization still retains all the information of the two modalities.
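As an illustration only, the following PyTorch-style sketch shows one possible implementation of this step; the module names, layer choices, dimensions and the exp(-|log F|) surrogate for the F-test index are assumptions made for this sketch and are not taken from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DissociationEncoder(nn.Module):
    # Hypothetical sketch of Fig. 2: encode image and text, penalize an
    # F-test-style statistic between components, fuse both modalities, and
    # reconstruct the inputs from the unified characterization.
    def __init__(self, img_dim, txt_dim, n_components=4, comp_dim=16):
        super().__init__()
        d = n_components * comp_dim
        self.n_components, self.comp_dim = n_components, comp_dim
        self.img_enc = nn.Linear(img_dim, d)   # stand-in for a CNN image encoder
        self.txt_enc = nn.Linear(txt_dim, d)   # stand-in for a text encoder
        self.fuse = nn.Linear(2 * d, d)        # fuses the two modalities
        self.img_dec = nn.Linear(d, img_dim)   # decoders used only during training:
        self.txt_dec = nn.Linear(d, txt_dim)   # the unified characterization must reconstruct both inputs

    def f_penalty(self, z):
        # Split z into components and compare every pair of components through a
        # variance ratio (an F-like statistic); exp(-|log F|) equals 1 when two
        # components have identical spread and decays as they differ, so
        # minimizing it pushes the component distributions apart.
        comps = z.view(z.size(0), self.n_components, self.comp_dim)
        var = comps.permute(1, 0, 2).reshape(self.n_components, -1).var(dim=1) + 1e-8
        penalty = z.new_zeros(())
        for i in range(self.n_components):
            for j in range(i + 1, self.n_components):
                penalty = penalty + torch.exp(-torch.abs(torch.log(var[i] / var[j])))
        return penalty

    def forward(self, img, txt):
        z_img, z_txt = self.img_enc(img), self.txt_enc(txt)
        z = self.fuse(torch.cat([z_img, z_txt], dim=-1))   # unified characterization
        recon = F.mse_loss(self.img_dec(z), img) + F.mse_loss(self.txt_dec(z), txt)
        penalty = self.f_penalty(z_img) + self.f_penalty(z_txt) + self.f_penalty(z)
        return z, recon, penalty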
In step S103, a commodity characterization and a user preference characterization depicting the user's consumption preference are obtained according to the image characterization and the text characterization.
Further, in an embodiment of the present invention, as shown in fig. 3 and fig. 4, deriving the commodity characterization and the user preference characterization depicting the user's consumption preference includes: taking the commodity characterization (obtained by fusing the image characterization and the text characterization) as input, two dissociation graph convolution networks infer, in sequence, the commodity characterization and the user preference characterization from the commodity-user relationship and the user-user relationship.
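For illustration, the following sketch shows how a single layer of such a dissociation graph convolution network might aggregate a user's clicked commodities aspect by aspect; the same routine could then be applied over the user-user (friend) relationship. The soft-assignment routing scheme and all names here are assumptions for this sketch, not the patented formulas.

import torch
import torch.nn.functional as F

def dissociated_aggregate(node_z, neighbor_z, n_components):
    # node_z:     (K*d,) current characterization of one user (or commodity)
    # neighbor_z: (M, K*d) characterizations of its neighbors in the relation graph,
    #             e.g. the commodities the user clicked, or the user's friends
    K = n_components
    u = node_z.view(K, -1)                          # one chunk per latent aspect
    x = neighbor_z.view(neighbor_z.size(0), K, -1)  # (M, K, d)
    # Softly assign each neighbor to the aspect it matches best, so that each
    # aspect of the preference characterization aggregates only the neighbors
    # that are relevant to it.
    logits = (F.normalize(x, dim=-1) * F.normalize(u, dim=-1)).sum(dim=-1)  # (M, K)
    weights = torch.softmax(logits, dim=1)
    aggregated = (weights.unsqueeze(-1) * x).sum(dim=0)                     # (K, d)
    return F.normalize(u + aggregated, dim=-1).reshape(-1)

# Example: infer a user preference characterization from 5 clicked commodities,
# using 4 latent aspects of 16 dimensions each.
user = torch.randn(4 * 16)
clicked = torch.randn(5, 4 * 16)
preference = dissociated_aggregate(user, clicked, n_components=4)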
In step S104, the user's degree of interest in the commodity is predicted according to the commodity characterization and the user preference characterization.
Further, in an embodiment of the present invention, predicting the user's degree of interest in the commodity according to the commodity characterization and the user preference characterization comprises: on the basis of the dissociated commodity characterization and user preference characterization, obtaining the degree of interest by measuring how well the commodity and the user match in each aspect.
It can be appreciated that, based on the dissociated commodity characterization and user preference characterization, the predictor predicts whether the user will be interested in the commodity by measuring how well the commodity and the user match in each aspect, as shown in fig. 5.
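A minimal sketch of such a predictor follows; scoring each aspect with an inner product and passing the summed matching degrees through a sigmoid is an assumption made for illustration rather than the formula claimed by the patent.

import torch

def interest_score(commodity_z, preference_z, n_components):
    # Compare the commodity and the user aspect by aspect and turn the summed
    # per-aspect matching degrees into a probability of interest.
    item = commodity_z.view(n_components, -1)    # (K, d) per-aspect commodity chunks
    user = preference_z.view(n_components, -1)   # (K, d) per-aspect preference chunks
    per_aspect = (item * user).sum(dim=1)        # matching degree of each aspect
    return torch.sigmoid(per_aspect.sum()), per_aspect

# per_aspect also reveals which aspect (e.g. price, color, material) contributes
# most to the match, which is what makes the prediction interpretable.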
The dissociation characterization learning method fusing multiple modalities is further described below by taking an e-commerce recommendation system as an example; specifically:
(1) a dissociated characterization of each commodity is learned from its image and text information and from user behavior (each component of the characterization describes a different aspect of the commodity, such as price, color or material), together with a dissociated characterization of each user's preferences (preference regarding price, etc.);
(2) based on the dissociated characterizations, the degree to which the user and the commodity match in each aspect is calculated, and commodities are recommended accordingly;
(3) the dissociated characterization allows the system to highlight the aspect of a commodity that best suits the user, and to focus on a particular aspect of the user's preferences when placing an advertisement (e.g., emphasizing price or emphasizing material).
In summary, (1) the embodiments of the present invention obtain a characterization learning technique that fuses the three modalities of images, text and relations (including user behavior and the like) while keeping the characterization dissociated. (2) After obtaining the non-dissociated characterizations using the deep encoder best suited to each modality, the embodiments of the present invention develop a unified way to dissociate the characterization, the core of which is a penalty function based on the F-test. (3) To ensure that the unified characterization obtained by fusing all modalities loses no information from any original single modality, the characterization is required to be able to reconstruct the images, the text and the relations.
According to the dissociation characterization learning method fusing multiple modalities of the embodiment of the present invention, information spanning three modalities can be fused through characterization learning so that the modalities complement one another, and a dissociated characterization can be obtained through deep learning, which improves the interpretability of the system; data of three different modalities, namely images, text and relations, can therefore be fused, and a more accurate and effective dissociated characterization can be obtained by exploiting the complementarity among the modalities.
Next, a proposed dissociation characterization learning apparatus fusing multiple modalities according to an embodiment of the present invention is described with reference to the drawings.
Fig. 6 is a schematic structural diagram of a dissociative characterization learning apparatus incorporating multiple modalities according to an embodiment of the present invention.
As shown in fig. 6, the dissociation characterization learning apparatus 10 that fuses multiple modalities includes: a dissociation encoder 100, a dissociation map convolution encoder 200 and a predictor 300.
The dissociation encoder 100 is configured to acquire image information and text information of a target to obtain an image non-dissociated characterization and a text non-dissociated characterization, and to obtain an image characterization and a text characterization according to them; the dissociation graph convolution encoder 200 is configured to obtain a commodity characterization and a user preference characterization depicting the user's consumption preference according to the image characterization and the text characterization; and the predictor 300 is configured to predict the user's degree of interest in the commodity according to the commodity characterization and the user preference characterization. The apparatus 10 of the embodiment of the present invention can fuse data of three different modalities, namely images, text and relations, and obtain a more accurate and effective dissociated characterization by exploiting the complementarity among the modalities.
Specifically, the dissociation encoder 100 is used to fuse the two modalities of a commodity, i.e., its image and its text, as shown in fig. 2; the dissociation graph convolution encoder 200 comprises a dissociation graph convolution encoder that mines user preferences from the user's click history, as shown in fig. 3, and a dissociation graph convolution encoder that mines user preferences from the user's friend relationships, as shown in fig. 4; and the predictor 300, shown in fig. 5, is used to predict click-through rates from the dissociated commodity and user characterizations.
Further, in an embodiment of the present invention, the dissociation encoder 100 is further configured to perform an F-test on any two different components of the image non-dissociated characterization and the text non-dissociated characterization, and to penalize the F-test index to force different components to correspond to different aspects of information; and to fuse the two independent characterizations into a unified characterization and reconstruct the input image and text respectively from the unified characterization to obtain the image characterization and the text characterization.
Further, in an embodiment of the present invention, the dissociation graph convolution encoder 200 is further configured to take the commodity characterization as input, and its two dissociation graph convolution networks are configured to infer, in sequence, the commodity characterization and the user preference characterization depicting the user's consumption preference from the commodity-user relationship and the user-user relationship.
Further, in an embodiment of the present invention, the predictor 300 is further configured to obtain the user's degree of interest in the commodity by measuring how well the commodity and the user match in each aspect, on the basis of the dissociated commodity characterization and user preference characterization.
It should be noted that all modules are trained jointly and simultaneously (although each module may be pre-trained separately to accelerate convergence during joint training). In addition, the foregoing explanation of the embodiment of the dissociation characterization learning method fusing multiple modalities also applies to the dissociation characterization learning apparatus fusing multiple modalities, and is not repeated here.
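As a hypothetical illustration of this joint training (the loss weights and the binary cross-entropy click objective are assumptions of this sketch, not terms taken from the patent), the overall objective could be assembled as follows and back-propagated through all modules in a single optimizer step:

import torch
import torch.nn.functional as F

def joint_loss(recon_loss, f_penalty, click_logits, click_labels,
               lambda_f=0.1, lambda_click=1.0):
    # Hypothetical joint objective: image/text reconstruction, the F-test-style
    # dissociation penalty, and click prediction on the relation modality.
    click_loss = F.binary_cross_entropy_with_logits(click_logits, click_labels)
    return recon_loss + lambda_f * f_penalty + lambda_click * click_loss

# Example with dummy values:
loss = joint_loss(torch.tensor(0.8), torch.tensor(2.3),
                  click_logits=torch.randn(32),
                  click_labels=torch.randint(0, 2, (32,)).float())
print(float(loss))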
According to the dissociation characterization learning apparatus fusing multiple modalities of the embodiment of the present invention, information spanning three modalities can be fused through characterization learning so that the modalities complement one another, and a dissociated characterization can be obtained through deep learning, which improves the interpretability of the system; data of three different modalities, namely images, text and relations, can therefore be fused, and a more accurate and effective dissociated characterization can be obtained by exploiting the complementarity among the modalities.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or in indirect contact through an intermediate medium. Moreover, a first feature being "on", "over" or "above" a second feature may mean that the first feature is directly or obliquely above the second feature, or may simply mean that the first feature is at a higher level than the second feature. A first feature being "under", "below" or "beneath" a second feature may mean that the first feature is directly or obliquely below the second feature, or may simply mean that the first feature is at a lower level than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (8)

1. A dissociation characterization learning method fusing multiple modalities, characterized by comprising the following steps:
acquiring image information and text information of a target to obtain an image non-dissociated characterization and a text non-dissociated characterization;
obtaining an image characterization and a text characterization according to the image non-dissociated characterization and the text non-dissociated characterization;
obtaining a commodity characterization and a user preference characterization which depict the user's consumption preference according to the image characterization and the text characterization;
and predicting the user's degree of interest in the commodity according to the commodity characterization and the user preference characterization.
2. The method of claim 1, wherein the obtaining an image characterization and a text characterization according to the image non-dissociated characterization and the text non-dissociated characterization comprises:
performing an F-test on any two different components of the image non-dissociated characterization and the text non-dissociated characterization, and penalizing the F-test index to force different components to correspond to different aspects of information;
and fusing the two independent characterizations into a unified characterization, and reconstructing the input image and text respectively from the unified characterization to obtain the image characterization and the text characterization.
3. The method of claim 1, wherein the obtaining a commodity characterization and a user preference characterization which depict the user's consumption preference comprises:
taking the commodity characterization as input, and inferring, by two dissociation graph convolution networks in sequence, the commodity characterization and the user preference characterization which depict the user's consumption preference from the commodity-user relationship and the user-user relationship.
4. The method of claim 1, wherein the predicting the user's degree of interest in the commodity according to the commodity characterization and the user preference characterization comprises:
on the basis of the dissociated commodity characterization and user preference characterization, obtaining the degree of interest by measuring how well the commodity and the user match in each aspect.
5. A dissociation characterization learning apparatus fusing multiple modalities, characterized by comprising:
a dissociation encoder used for acquiring image information and text information of a target to obtain an image non-dissociated characterization and a text non-dissociated characterization, and for obtaining an image characterization and a text characterization according to the image non-dissociated characterization and the text non-dissociated characterization;
a dissociation graph convolution encoder used for obtaining a commodity characterization and a user preference characterization which depict the user's consumption preference according to the image characterization and the text characterization;
and a predictor used for predicting the user's degree of interest in the commodity according to the commodity characterization and the user preference characterization.
6. The apparatus of claim 5, wherein the dissociation encoder is further configured to perform an F-test on any two different components of the image non-dissociated characterization and the text non-dissociated characterization, and to penalize the F-test index to force different components to correspond to different aspects of information; and to fuse the two independent characterizations into a unified characterization and reconstruct the input image and text respectively from the unified characterization to obtain the image characterization and the text characterization.
7. The apparatus of claim 6, wherein the dissociation graph convolution encoder is further configured to take the commodity characterization as input, and its two dissociation graph convolution networks are configured to infer, in sequence, the commodity characterization and the user preference characterization which depict the user's consumption preference from the commodity-user relationship and the user-user relationship.
8. The apparatus of claim 6, wherein the predictor is further configured to obtain the user's degree of interest in the commodity by measuring how well the commodity and the user match in each aspect, on the basis of the dissociated commodity characterization and user preference characterization.
CN201910934657.8A 2019-09-29 2019-09-29 Dissociation characterization learning method and device integrating multiple modes Active CN110717599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910934657.8A CN110717599B (en) 2019-09-29 2019-09-29 Dissociation characterization learning method and device integrating multiple modes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910934657.8A CN110717599B (en) 2019-09-29 2019-09-29 Dissociation characterization learning method and device integrating multiple modes

Publications (2)

Publication Number Publication Date
CN110717599A (en) 2020-01-21
CN110717599B CN110717599B (en) 2022-05-17

Family

ID=69211165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910934657.8A Active CN110717599B (en) 2019-09-29 2019-09-29 Dissociation characterization learning method and device integrating multiple modes

Country Status (1)

Country Link
CN (1) CN110717599B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353862A (en) * 2020-03-30 2020-06-30 贝壳技术有限公司 Commodity recommendation method and device, electronic equipment and storage medium
CN112069717A (en) * 2020-08-19 2020-12-11 五邑大学 Magnetic storm prediction method and device based on multi-mode representation learning and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899253A (en) * 2015-05-13 2015-09-09 复旦大学 Cross-modality image-label relevance learning method facing social image
US20160314512A1 (en) * 2013-12-02 2016-10-27 A9.Com, Inc. Visual search in a controlled shopping environment
CN109446347A (en) * 2018-10-29 2019-03-08 山东师范大学 A kind of multi-modal Hash search method of fast discrete and system having supervision

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314512A1 (en) * 2013-12-02 2016-10-27 A9.Com, Inc. Visual search in a controlled shopping environment
CN104899253A (en) * 2015-05-13 2015-09-09 复旦大学 Cross-modality image-label relevance learning method facing social image
CN109446347A (en) * 2018-10-29 2019-03-08 山东师范大学 A kind of multi-modal Hash search method of fast discrete and system having supervision

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
IRINA HIGGINS et al.: "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework", International Conference on Learning Representations (ICLR) 2017 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353862A (en) * 2020-03-30 2020-06-30 贝壳技术有限公司 Commodity recommendation method and device, electronic equipment and storage medium
CN111353862B (en) * 2020-03-30 2024-03-26 贝壳技术有限公司 Commodity recommendation method and device, electronic equipment and storage medium
CN112069717A (en) * 2020-08-19 2020-12-11 五邑大学 Magnetic storm prediction method and device based on multi-mode representation learning and storage medium

Also Published As

Publication number Publication date
CN110717599B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN112000819B (en) Multimedia resource recommendation method and device, electronic equipment and storage medium
US11521221B2 (en) Predictive modeling with entity representations computed from neural network models simultaneously trained on multiple tasks
Jayasinghe et al. Data centric trust evaluation and prediction framework for IOT
CN109299327A (en) Video recommendation method, device, equipment and storage medium
CN109582876B (en) Tourist industry user portrait construction method and device and computer equipment
Himabindu et al. Conformal matrix factorization based recommender system
CN111061946A (en) Scenario content recommendation method and device, electronic equipment and storage medium
CN109165974A (en) A kind of commercial product recommending model training method, device, equipment and storage medium
Cho et al. Collaborative filtering using dual information sources
CN112100387A (en) Training method and device of neural network system for text classification
CN107545301B (en) Page display method and device
CN110717599B (en) Dissociation characterization learning method and device integrating multiple modes
Singh et al. An improved similarity calculation method for collaborative filtering-based recommendation, considering neighbor’s liking and disliking of categorical attributes of items
CN111949887A (en) Item recommendation method and device and computer-readable storage medium
CN112100513A (en) Knowledge graph-based recommendation method, device, equipment and computer readable medium
CN113256007A (en) Multi-mode-oriented new product sales forecasting method and device
CN115238188A (en) Object recommendation method and system and object recommendation model system
CN113032676A (en) Recommendation method and system based on micro-feedback
CN114880709B (en) E-commerce data protection method and server applying artificial intelligence
CN116541608A (en) House source recommendation method and device, electronic equipment and storage medium
Drif et al. Context-awareness in ensemble recommender system framework
Al-Taie Explanations in recommender systems: overview and research approaches
CN115564532A (en) Training method and device of sequence recommendation model
CN114691981A (en) Session recommendation method, system, device and storage medium
Sicilia et al. Empirical assessment of a collaborative filtering algorithm based on OWA operators

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant