CN116910341A - Label prediction method and device and electronic equipment - Google Patents


Info

Publication number
CN116910341A
Authority
CN
China
Prior art keywords
user data
data
sample
module
user
Prior art date
Legal status
Pending
Application number
CN202211585181.XA
Other languages
Chinese (zh)
Inventor
陈星宇
徐红蕾
郭叶
黄志勇
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Communications Ltd Research Institute filed Critical China Mobile Communications Group Co Ltd
Priority to CN202211585181.XA
Publication of CN116910341A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a label prediction method, a label prediction device and electronic equipment, relates to the technical field of data processing, and aims to solve the problem of poor confidence of prediction results in the prior art. The method comprises the following steps: obtaining user data to be predicted, wherein the user data to be predicted belongs to first-class user data; forming a plurality of prediction sample pairs from the user data to be predicted and each second user data in pre-acquired second-type user data, respectively; inputting the plurality of prediction sample pairs into a data fusion model respectively to obtain the similarity of each prediction sample pair, wherein the data fusion model is obtained by training a pre-constructed initial data fusion model with each sample pair in a training sample set, the training sample set comprises a plurality of sample pairs, and each sample pair comprises two different types of user data; and determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data. The embodiment of the application can improve the confidence of the label prediction result.

Description

Label prediction method and device and electronic equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a tag prediction method, a tag prediction device, and an electronic device.
Background
The user label is an abstraction and description of user attributes, behaviors, interests and other features, and is one of the core factors for constructing user portraits. User labels can help research and development personnel fully discover the differences in goals, behaviors and views among different user groups and gain insight into user needs, thereby further serving technical fields such as precision marketing, personalized product design and user experience optimization.
In the prior art, user label prediction is usually based on a single data source. However, the data features in a single data source are limited and cannot fully reflect the comprehensive attributes and behaviors of a user, so the confidence of the label prediction result is poor.
Disclosure of Invention
The embodiment of the application provides a label prediction method, a label prediction device and electronic equipment, which are used for solving the problem of poor confidence of a prediction result of the existing user label prediction scheme.
In a first aspect, an embodiment of the present application provides a tag prediction method, including:
obtaining user data to be predicted, wherein the user data to be predicted belongs to first-class user data;
respectively forming a plurality of prediction sample pairs by the user data to be predicted and each second user data in the second type of user data acquired in advance;
respectively inputting the plurality of prediction sample pairs into a data fusion model to obtain the similarity of each prediction sample pair output by the data fusion model, wherein the data fusion model is obtained by training a pre-built initial data fusion model with each sample pair in a pre-acquired training sample set, the training sample set comprises a plurality of sample pairs, each sample pair comprises two different types of user data, each sample pair is marked with a true value, and the true value is used for indicating whether the sample pair corresponds to the same user;
and determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data.
Optionally, the determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data includes:
taking the similarity of each prediction sample pair as the weight of the label value of the second user data in the corresponding prediction sample pair, and weighting the label value of each second user data to obtain a weighted label value;
and determining the weighted tag value as the tag value of the user data to be predicted.
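For illustration, the weighted label value described above can be computed as in the following Python sketch. Whether the similarity weights are normalised is not specified here, and the function and variable names are assumptions introduced only for this example.

```python
# Illustrative sketch only: similarity-weighted label prediction for the user data
# to be predicted. Normalising by the summed similarities is an assumption; the
# text above only states that the similarities are used as weights for the labels.
import numpy as np

def weighted_label(similarities, label_values):
    """similarities[i]: similarity output for the i-th prediction sample pair;
    label_values[i]: label value of the second user data in that pair."""
    s = np.asarray(similarities, dtype=float)
    y = np.asarray(label_values, dtype=float)
    return float((s * y).sum() / s.sum())

# Three prediction sample pairs built from the same user data to be predicted.
print(weighted_label([0.9, 0.2, 0.6], [1.0, 0.0, 1.0]))  # about 0.88
```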
Optionally, the initial data fusion model includes a first data processing module, a second data processing module, a first cross-modal processing module, a second cross-modal processing module, and a feature fusion module;
the respectively inputting the plurality of prediction sample pairs into a data fusion model to obtain the similarity of each prediction sample pair output by the data fusion model comprises the following steps:
performing feature extraction processing on one user data in the input sample pair through the first data processing module, and performing feature extraction processing on the other user data in the input sample pair through the second data processing module;
performing feature extraction and fusion processing on the output features of the first data processing module and the output features of the second data processing module based on an attention mechanism through the first cross-mode processing module;
performing feature extraction and fusion processing on the output features of the second data processing module and the output features of the first data processing module based on an attention mechanism through the second cross-mode processing module;
and fusing the output characteristics of the first cross-mode processing module and the output characteristics of the second cross-mode processing module through the characteristic fusion module, and outputting the similarity of two user data in the input sample pair.
Optionally, the first data processing module is a questionnaire encoder, and the questionnaire encoder comprises an encoding module, a vector normalization module, a pooling module and a splicing module; one user data in the input sample pair is questionnaire investigation data;
the feature extraction processing is performed on the user data in the input sample pair by the first data processing module, including:
the questionnaire investigation data is encoded through the encoding module, the output vector of the encoding module is normalized through the vector normalization module, the output vector of the encoding module is pooled through the pooling module, and the output vector of the vector normalization module and the output vector of the pooling module are spliced through the splicing module.
Optionally, the first cross-modality processing module includes a first self-attention module, a first mutual-attention module, and a first feed-forward network; the feature extraction and fusion processing of the output features of the first data processing module and the output features of the second data processing module based on an attention mechanism by the first cross-modal processing module comprises:
Performing data internal feature extraction processing on the output features of the first data processing module through the first self-attention module, performing inter-data feature extraction processing on the output features of the first self-attention module and the output features of the second data processing module through the first mutual-attention module, and performing connection processing on the output features of the first mutual-attention module through the first feedforward network;
and/or the second cross-modality processing module includes a second self-attention module, a second mutual-attention module, and a second feed-forward network; the performing, by the second cross-modal processing module, feature extraction and fusion processing on the output features of the second data processing module and the output features of the first data processing module based on an attention mechanism includes:
and carrying out data internal feature extraction processing on the output features of the second data processing module through the second self-attention module, carrying out data inter-feature extraction processing on the output features of the second self-attention module and the output features of the first data processing module through the second mutual-attention module, and carrying out connection processing on the output features of the second mutual-attention module through the second feedforward network.
Optionally, the training sample set includes a plurality of first-type sample pairs, where the first-type sample pairs include a first user data and a second user data, the first user data belongs to the first-type user data, the second user data belongs to the second-type user data, and the first-type user data and the second-type user data are data from different sources; the first data processing module is used for processing second user data in the input first type sample pair, and the second data processing module is used for processing the first user data in the input first type sample pair;
before the respectively inputting the plurality of prediction sample pairs into the data fusion model, the method further comprises:
obtaining M first user data and N second user data, wherein N and M are integers greater than 1;
determining L target first user data similar to target second user data according to first association features between the first user data and the second user data, wherein the target first user data are any one of the M first user data, the target second user data are second user data in the N second user data, and L is a positive integer;
Respectively forming L first-class sample pairs by the target second user data and the L target first user data;
and determining the true value of each first-class sample pair in the L first-class sample pairs according to whether that sample pair corresponds to the same user.
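As an illustration of how such first-class sample pairs might be assembled, the following Python sketch selects, for one target second user data item, the L most similar first user data items by a first association feature. The cosine similarity measure and all names are assumptions made for this example and are not fixed by the present application.

```python
# Illustrative, assumption-laden sketch: choose the L first user data items whose
# association feature is closest to the target second user data, then pair them.
import numpy as np

def build_first_type_pairs(first_feats, second_feat, second_idx, L):
    """first_feats: (M, d) association features of M first user data;
    second_feat: (d,) association feature of the target second user data."""
    a = first_feats / np.linalg.norm(first_feats, axis=1, keepdims=True)
    b = second_feat / np.linalg.norm(second_feat)
    sims = a @ b                              # cosine similarity to the target
    top = np.argsort(-sims)[:L]               # indices of the L closest items
    # Each pair keeps the indices; true values (same user or not) are labelled
    # afterwards from whatever ground truth is available.
    return [(int(i), second_idx) for i in top]

pairs = build_first_type_pairs(np.random.rand(20, 8), np.random.rand(8), 0, L=3)
print(pairs)
```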
Optionally, the first type of user data is user business behavior data, the second type of user data is questionnaire investigation data, the training sample set further comprises a plurality of second type of sample pairs, and the second type of sample pairs comprise first user data and third user data; the second data processing module is further used for processing third user data in the input second class sample pair;
before the respectively inputting the plurality of prediction sample pairs into the data fusion model, the method further comprises:
performing sample expansion according to the first type of sample pairs, generating third user data corresponding to first user data in the first type of sample pairs, forming the second type of sample pairs by each third user data and the corresponding first user data, and determining true values of the second type of sample pairs;
and training the initial data fusion model by utilizing each first-type sample pair and each second-type sample pair in the training sample set.
Optionally, the training the initial data fusion model using each first-type sample pair and each second-type sample pair in the training sample set includes:
inputting each first type sample pair in the training sample set into the initial data fusion model to obtain a first model predicted value output by the initial data fusion model, wherein the first model predicted value is used for representing the similarity of first user data and second user data in the input first type sample pair;
inputting each second sample pair in the training sample set into a shared data fusion model to obtain a second model predicted value output by the shared data fusion model, wherein the second model predicted value is used for representing the similarity of first user data and third user data in the input second sample pair, the shared data fusion model comprises two third cross-mode processing modules, and the third cross-mode processing modules and the second cross-mode processing modules have the same structure and share weights;
determining a first penalty value based on the first model predictive value and a true value of the first class of sample pairs;
determining a second penalty value based on the second model predictive value and a true value of the second class of sample pairs;
Determining a weighted loss value according to the first loss value and the second loss value;
and adjusting structural parameters of the initial data fusion model based on the weighted loss value.
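A minimal sketch of this weighted loss follows, assuming a fixed mixing coefficient alpha and binary cross-entropy for both terms; the text above only states that the two loss values are combined by weighting, so these choices are illustrative.

```python
# Hedged sketch of the weighted loss used to adjust the model parameters.
import torch
import torch.nn.functional as F

def weighted_loss(pred_first, truth_first, pred_second, truth_second, alpha=0.5):
    # First loss: first-type sample pairs through the initial data fusion model.
    loss1 = F.binary_cross_entropy(pred_first, truth_first)
    # Second loss: second-type sample pairs through the shared data fusion model.
    loss2 = F.binary_cross_entropy(pred_second, truth_second)
    return loss1 + alpha * loss2   # weighted sum drives the parameter update

p1, t1 = torch.sigmoid(torch.randn(16)), torch.randint(0, 2, (16,)).float()
p2, t2 = torch.sigmoid(torch.randn(16)), torch.randint(0, 2, (16,)).float()
print(weighted_loss(p1, t1, p2, t2))
```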
Optionally, the performing sample expansion according to the first type of sample pair, generating third user data corresponding to the first user data in the first type of sample pair, forming each third user data and the corresponding first user data into the second type of sample pair, and determining the true value of the second type of sample pair includes:
respectively counting feature average values of the positive sample pairs and the negative sample pairs in the plurality of first-type sample pairs, and determining a difference value set; wherein the true value of a positive sample pair is a first value, the true value of a negative sample pair is a second value, the first value represents that the sample pair corresponds to the same user, and the second value represents that the sample pair does not correspond to the same user; and the difference value set comprises the difference value between the first user data in each positive-class sample pair and the feature average value of the negative-class sample pairs, and the difference value between the first user data in each negative-class sample pair and the feature average value of the positive-class sample pairs;
extracting third user data from a first Gaussian distribution and a second Gaussian distribution respectively, forming first sample pairs from the third user data extracted from the first Gaussian distribution and the corresponding first user data and determining the true value of the first sample pairs as the first value, and forming second sample pairs from the third user data extracted from the second Gaussian distribution and the corresponding first user data and determining the true value of the second sample pairs as the second value; wherein the second-type sample pairs comprise the first sample pairs and the second sample pairs, the first Gaussian distribution is constructed according to the feature mean value and covariance of the P first user data that correspond to the P minimum difference values in the difference value set among the first user data in the first-type sample pairs, the second Gaussian distribution is constructed according to the feature mean value and covariance of the Q first user data that correspond to the Q maximum difference values in the difference value set among the first user data in the first-type sample pairs, and P and Q are positive integers.
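A hedged Python sketch of this Gaussian-based sample expansion is given below. How the difference values are measured and exactly which feature space the two Gaussians are fitted in are only partly specified above, so the Euclidean distances, the jitter term and all names are illustrative assumptions.

```python
# Illustrative sketch: fit two Gaussians from "easy" and "hard" first user data
# (smallest / largest difference values) and sample synthetic third user data.
import numpy as np

def build_gaussian(feats):
    """Fit a multivariate Gaussian to the selected first-user-data features."""
    mean = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])  # keep PSD
    return mean, cov

def expand_samples(pos_feats, neg_feats, P=5, Q=5, rng=np.random.default_rng(0)):
    # Difference set: distance of each positive-pair item to the negative mean,
    # and of each negative-pair item to the positive mean (one scalar per item).
    d_pos = np.linalg.norm(pos_feats - neg_feats.mean(axis=0), axis=1)
    d_neg = np.linalg.norm(neg_feats - pos_feats.mean(axis=0), axis=1)
    diffs = np.concatenate([d_pos, d_neg])
    feats = np.vstack([pos_feats, neg_feats])
    # First Gaussian from the P smallest differences, second from the Q largest.
    g1 = build_gaussian(feats[np.argsort(diffs)[:P]])
    g2 = build_gaussian(feats[np.argsort(diffs)[-Q:]])
    third_pos = rng.multivariate_normal(*g1, size=P)   # paired as positive pairs
    third_neg = rng.multivariate_normal(*g2, size=Q)   # paired as negative pairs
    return third_pos, third_neg

tp, tn = expand_samples(np.random.rand(30, 8), np.random.rand(30, 8))
print(tp.shape, tn.shape)
```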
In a second aspect, an embodiment of the present application further provides a label prediction apparatus, including:
the first acquisition module is used for acquiring user data to be predicted, wherein the user data to be predicted belongs to first-class user data;
the first processing module is used for respectively forming a plurality of prediction sample pairs by the user data to be predicted and each second user data in the second type of user data acquired in advance;
the model prediction module is used for respectively inputting the plurality of prediction sample pairs into the data fusion model to obtain the similarity of each prediction sample pair output by the data fusion model, wherein the data fusion model is obtained by training a pre-built initial data fusion model with each sample pair in a pre-acquired training sample set, the training sample set comprises a plurality of sample pairs, each sample pair comprises two different types of user data, each sample pair is marked with a true value, and the true value is used for indicating whether the sample pair corresponds to the same user;
and the first determining module is used for determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data.
Optionally, the first determining module includes:
the weighting processing unit is used for weighting the label value of each second user data by taking the similarity of each prediction sample pair as the weight of the label value of the second user data in the corresponding prediction sample pair to obtain a weighted label value;
and the first determining unit is used for determining the weighted tag value as the tag value of the user data to be predicted.
Optionally, the initial data fusion model includes a first data processing module, a second data processing module, a first cross-modal processing module, a second cross-modal processing module, and a feature fusion module;
wherein the model prediction module is used for:
performing feature extraction processing on one user data in the input sample pair through the first data processing module, and performing feature extraction processing on the other user data in the input sample pair through the second data processing module;
performing feature extraction and fusion processing on the output features of the first data processing module and the output features of the second data processing module based on an attention mechanism through the first cross-mode processing module;
Performing feature extraction and fusion processing on the output features of the second data processing module and the output features of the first data processing module based on an attention mechanism through the second cross-mode processing module;
and fusing the output characteristics of the first cross-mode processing module and the output characteristics of the second cross-mode processing module through the characteristic fusion module, and outputting the similarity of two user data in the input sample pair.
Optionally, the first data processing module is a questionnaire encoder, and the questionnaire encoder comprises an encoding module, a vector normalization module, a pooling module and a splicing module; one user data in the input sample pair is questionnaire investigation data;
the first data processing module is used for:
the questionnaire investigation data is encoded through the encoding module, the output vector of the encoding module is normalized through the vector normalization module, the output vector of the encoding module is pooled through the pooling module, and the output vector of the vector normalization module and the output vector of the pooling module are spliced through the splicing module.
Optionally, the first cross-modality processing module includes a first self-attention module, a first mutual-attention module, and a first feed-forward network; the first cross-modal processing module is configured to:
performing data internal feature extraction processing on the output features of the first data processing module through the first self-attention module, performing inter-data feature extraction processing on the output features of the first self-attention module and the output features of the second data processing module through the first mutual-attention module, and performing connection processing on the output features of the first mutual-attention module through the first feedforward network;
and/or the second cross-modality processing module includes a second self-attention module, a second mutual-attention module, and a second feed-forward network; the second cross-modal processing module is configured to:
and carrying out data internal feature extraction processing on the output features of the second data processing module through the second self-attention module, carrying out data inter-feature extraction processing on the output features of the second self-attention module and the output features of the first data processing module through the second mutual-attention module, and carrying out connection processing on the output features of the second mutual-attention module through the second feedforward network.
Optionally, the training sample set includes a plurality of first-type sample pairs, where the first-type sample pairs include a first user data and a second user data, the first user data belongs to the first-type user data, the second user data belongs to the second-type user data, and the first-type user data and the second-type user data are data from different sources; the first data processing module is used for processing second user data in the input first type sample pair, and the second data processing module is used for processing the first user data in the input first type sample pair;
the tag prediction apparatus further includes:
the second acquisition module is used for acquiring M pieces of first user data and N pieces of second user data, wherein N and M are integers larger than 1;
the second determining module is configured to determine, according to a first association characteristic between the first type of user data and the second type of user data, L target first user data similar to target second user data, where the target first user data is any one of the M first user data, the target second user data is second user data in the N second user data, and L is a positive integer;
The second processing module is used for respectively forming L first sample pairs by the target second user data and the L target first user data;
and the third determining module is used for determining the true value of each first-type sample pair in the L first-type sample pairs according to whether that sample pair corresponds to the same user.
Optionally, the first type of user data is user business behavior data, the second type of user data is questionnaire investigation data, the training sample set further comprises a plurality of second type of sample pairs, and the second type of sample pairs comprise first user data and third user data; the second data processing module is further used for processing third user data in the input second class sample pair;
the tag prediction apparatus further includes:
the sample expansion module is used for carrying out sample expansion according to the first type of sample pairs, generating third user data corresponding to first user data in the first type of sample pairs, forming the second type of sample pairs by each third user data and the corresponding first user data, and determining true values of the second type of sample pairs;
and the training module is used for training the initial data fusion model by utilizing each first-type sample pair and each second-type sample pair in the training sample set.
Optionally, the training module includes:
the first processing unit is used for inputting each first type of sample pair in the training sample set into the initial data fusion model to obtain a first model predicted value output by the initial data fusion model, wherein the first model predicted value is used for representing the similarity of first user data and second user data in the input first type of sample pair;
the second processing unit is used for inputting each second type of sample pair in the training sample set into a shared data fusion model to obtain a second model predicted value output by the shared data fusion model, wherein the second model predicted value is used for representing the similarity of first user data and third user data in the input second type of sample pair, the shared data fusion model comprises two third cross-mode processing modules, and the third cross-mode processing modules and the second cross-mode processing modules have the same structure and share weights;
a second determining unit, configured to determine a first loss value based on the first model predicted value and a true value of the first sample pair;
A third determining unit configured to determine a second loss value based on the second model predicted value and a true value of the second sample pair;
a fourth determining unit configured to determine a weighted loss value according to the first loss value and the second loss value;
and the adjusting unit is used for adjusting the structural parameters of the initial data fusion model based on the weighted loss value.
Optionally, the sample expansion module includes:
a fifth determining unit, configured to respectively count feature average values of the positive sample pairs and the negative sample pairs in the plurality of first-type sample pairs and determine a difference value set; wherein the true value of a positive sample pair is a first value, the true value of a negative sample pair is a second value, the first value represents that the sample pair corresponds to the same user, and the second value represents that the sample pair does not correspond to the same user; and the difference value set comprises the difference value between the first user data in each positive-class sample pair and the feature average value of the negative-class sample pairs, and the difference value between the first user data in each negative-class sample pair and the feature average value of the positive-class sample pairs;
a sixth determining unit, configured to extract third user data from a first Gaussian distribution and a second Gaussian distribution respectively, construct first sample pairs from the third user data extracted from the first Gaussian distribution and the corresponding first user data and determine the true value of the first sample pairs as the first value, and construct second sample pairs from the third user data extracted from the second Gaussian distribution and the corresponding first user data and determine the true value of the second sample pairs as the second value; wherein the second-type sample pairs comprise the first sample pairs and the second sample pairs, the first Gaussian distribution is constructed according to the feature mean value and covariance of the P first user data that correspond to the P minimum difference values in the difference value set among the first user data in the first-type sample pairs, the second Gaussian distribution is constructed according to the feature mean value and covariance of the Q first user data that correspond to the Q maximum difference values in the difference value set among the first user data in the first-type sample pairs, and P and Q are positive integers.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the tag prediction method as described in the first aspect when the computer program is executed.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the tag prediction method according to the first aspect.
In the embodiment of the application, user data to be predicted is obtained, and the user data to be predicted belongs to first-class user data; respectively forming a plurality of prediction sample pairs by the user data to be predicted and each second user data in the second type of user data acquired in advance; the method comprises the steps of respectively inputting a plurality of prediction sample pairs into a data fusion model to obtain the similarity of each prediction sample pair output by the data fusion model, wherein the data fusion model is obtained by training a pre-built initial data fusion model through each sample pair in a pre-obtained training sample set, the training sample set comprises a plurality of sample pairs, each sample pair comprises two different types of user data, and each sample pair is marked with a true value, and the true value is used for indicating whether the sample pairs correspond to the same user; and determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data. In this way, the user data to be predicted and the other type of user data form the prediction sample pair, and the user label is predicted by utilizing the data fusion model fused with the characteristics of the multiple types of user data, so that the comprehensive attribute and behavior of the user can be fully mined, and the confidence of the result of predicting the user label is ensured to be higher.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is a flowchart of a tag prediction method provided by an embodiment of the present application;
FIG. 2 is a flow chart of data fusion provided by an embodiment of the present application;
FIG. 3a is a schematic diagram of a data fusion model according to an embodiment of the present application;
FIG. 3b is a schematic diagram of a questionnaire encoder according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a training data fusion model in a few sample scenario provided by an embodiment of the present application;
FIG. 5 is a flow chart of data fusion model training and user tag prediction provided by an embodiment of the present application;
FIG. 6 is a block diagram of a tag prediction apparatus according to an embodiment of the present application;
fig. 7 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In order to make the embodiments of the present application clearer, the related technical knowledge involved in the present application is first described below:
the tag is the abstraction and description of the user attribute, behavior, interest and other characteristics, is adjective with difference characteristics generated after big data analysis and mining are carried out on the user data, and is one of the core factors for forming the user portrait. The user labels can help us to understand users more accurately and comprehensively, fully discover differences of targets, behaviors and views among different user groups, and provide insight into demands of the users, so that the users can be served in the technical fields of accurate marketing, personalized product design, user experience optimization and the like.
Currently, a large amount of multi-source data has been accumulated in the field of user data research, including questionnaire data, interview data, user business behavior data, and the like. User business behavior data covers a wide range of users and has rich data types, but it lacks flexible qualitative labels and can hardly support diversified analysis services. Questionnaire surveys, interviews and the like, although covering fewer users (and therefore called small data, to distinguish them from user big data), can yield flexible and diverse qualitative user labels because the questionnaire questions can be custom designed.
Some existing technical solutions related to label prediction are based on a single data source: for example, graph-structured data is processed with a graph neural network and a Poisson graph network, and a supervised model is constructed and trained to predict user labels. Other solutions involve multiple data sources whose user entities are in one-to-one correspondence; entity matching between the data sources is performed directly through strong association features such as user IDs, a certain amount of sample data with true values is obtained on that basis, and a supervised model is built for training and label prediction.
However, the above methods have the following problems. First, the data features in a single data source are limited and cannot fully reflect the comprehensive attributes and behaviors of a user, so the confidence of the label prediction result is poor. Second, in terms of multi-source data fusion, because the statistical calibers of data from different sources are inconsistent in practical applications, the user entities of different data sources may not correspond one to one; for example, the user entities of a small data source may map to only part of the user entities of a large data source. Meanwhile, owing to factors such as data islands caused by the lack of a common identity authentication, it is difficult to directly associate the entities through strong association features such as user IDs, which further means that not enough sample data with true values can be provided for supervised training of a user label prediction model.
Based on the above technical problems, the application provides a method for realizing user label prediction by fusing large and small data with a deep learning model. Aiming at the difficulty that user entities cannot be matched one by one when fusing large and small data, the method models the user matching problem as a binary classification problem and uses a neural network model based on a dual-tower architecture to train the large-and-small data fusion model. After training, for any two pieces of user data the model can output the similarity of the two input user data; this similarity can also be used as a weight to extend the user labels in the small data source to the users of the large data source, thereby realizing user label prediction for the large data source. The application realizes soft matching of user entities through the similarity of user data, which avoids the problem that user entities cannot be fully matched when hard matching is performed directly with user IDs. In addition, to address problems such as the small number of true-value samples caused by data islands during the training of the large-and-small data fusion model, the application adopts the idea of contrastive learning in the training process and, by means of data augmentation and by adding a contrastive loss term to the model loss function, realizes a semi-supervised training architecture that can learn from a small number of samples, thereby improving the accuracy of the model.
The large-and-small data fusion involved in the application mainly refers to combining multi-source data sources with obvious differences in the number and type of user entities, namely questionnaire survey data and user business behavior big data, and extending the user's big-data label system by means of an algorithmic model, so as to fully mine the potential value of the multi-source data.
The application aims to fuse large and small data through a deep learning model, integrate the respective advantages of different data sources, break data islands, realize accurate prediction of user labels, enrich the user label system, help the industry improve its service capability, and optimize profit models. The method provided by the application can be used for fusion between various large data sources and small data sources. The application scenario is described below with questionnaire survey data as the representative of small data sources and with user big data (B domain, O domain, S domain, and the like), namely user business behavior data, as the representative of large data sources.
Referring to fig. 1, fig. 1 is a flowchart of a label prediction method provided by an embodiment of the present application, as shown in fig. 1, including the following steps:
step 101, obtaining user data to be predicted, wherein the user data to be predicted belongs to first-class user data.
The user data to be predicted may be user data for which a user label is to be predicted, and may be first-class user data, that is, large data source user data, for example, business behavior data of a user obtained from an operator.
In contrast, the second type of user data in the embodiment of the present application may refer to small data source user data, such as questionnaire investigation data about a certain user in terms of communication service usage.
Optionally, the first type of user data is user business behavior data.
Because of the limitation of the data acquisition mode, the user labels in the large data source are abundant but not flexible enough, and the qualitative labels required in the fields of user research and marketing are lacking. In the small data source, the questions and options of the questionnaire can be designed as needed, so the acquired user labels are more flexible and accurate; however, limited by factors such as cost and user participation, the user coverage of a questionnaire survey is not wide enough and is far lower than that of the large data source. Therefore, by fusing the large and small data sources, the application maps the user labels in the small data source to massive large data source users, which can integrate the advantages of the two data sources and realize accurate prediction of user labels.
Step 102, forming a plurality of prediction sample pairs by the user data to be predicted and each second user data in the second type of user data acquired in advance.
The second type of user data acquired in advance may refer to a set of second user data with known user labels, such as questionnaire survey data of a plurality of users obtained in advance, from which the qualitative labels of these users can be determined. The pre-acquired second-type user data may also refer to the second user data in each first-type sample pair of the training sample set.
In this step, the user data to be predicted may be paired with each second user data in the second type of user data acquired in advance, respectively, to form a plurality of prediction sample pairs, so as to perform similarity detection and label prediction processing on each prediction sample pair subsequently.
Step 103, the plurality of prediction sample pairs are respectively input into a data fusion model to obtain the similarity of each prediction sample pair output by the data fusion model, wherein the data fusion model is obtained by training a pre-built initial data fusion model by utilizing each sample pair in a pre-acquired training sample set, the training sample set comprises a plurality of sample pairs, each sample pair comprises two different types of user data, and each sample pair is marked with a true value, and the true value is used for indicating whether the sample pairs correspond to the same user.
In the step, each prediction sample pair of the plurality of prediction sample pairs can be respectively input into a trained data fusion model, similarity calculation processing is carried out on each prediction sample pair through the data fusion model, and then the similarity of each prediction sample pair output by the data fusion model is obtained.
The data fusion model can be obtained by training a pre-built initial data fusion model with each sample pair in a pre-acquired training sample set. In the embodiment of the application, considering that the input of the data fusion model to be trained is a sample pair formed from user data of multiple data sources and that the training target is to predict whether the user data of the two different data sources in the sample pair match the same user, the training target can be modeled as a classification problem and a deep learning neural network model with a dual-tower architecture is adopted, that is, the initial data fusion model can be a deep learning neural network model with a dual-tower architecture.
The training sample set may include a plurality of sample pairs. Each sample pair consists of two user data belonging to different categories, which may come from data sources of different sizes; for example, one of the user data may belong to a small data source, such as collected questionnaire or interview data of a user, and the other may belong to a large data source, such as user business behavior data obtained from an operator.
The true value of each sample pair in the training sample set may be labeled according to whether its two user data belong to the same user entity. Specifically, if the two user data in a sample pair are determined to belong to the same user entity, that is, they are data of different categories for the same user, the true value of the sample pair may be labeled as 1, indicating a positive sample pair; if the two user data in a sample pair are determined not to belong to the same user entity, that is, they are data of different categories for different users, the true value of the sample pair may be labeled as 0, indicating a negative sample pair.
The training sample set may be obtained by collecting a large amount of the two types of user data in advance, forming positive sample pairs from two different types of user data that belong to the same user, and forming negative sample pairs from two different types of user data that do not belong to the same user, so that a training sample set containing a certain number of positive and negative sample pairs is obtained. The training sample set is used to train the required data fusion model; through the model, two different types of user data to be identified are fused and their similarity is predicted, that is, the probability that the two user data belong to the same user, and the label of one user data is then predicted based on this similarity and the label of the other user data.
Optionally, the initial data fusion model includes a first data processing module, a second data processing module, a first cross-modal processing module, a second cross-modal processing module, and a feature fusion module;
wherein, the step 103 includes:
performing feature extraction processing on one user data in the input sample pair through the first data processing module, and performing feature extraction processing on the other user data in the input sample pair through the second data processing module;
performing feature extraction and fusion processing on the output features of the first data processing module and the output features of the second data processing module based on an attention mechanism through the first cross-mode processing module;
performing feature extraction and fusion processing on the output features of the second data processing module and the output features of the first data processing module based on an attention mechanism through the second cross-mode processing module;
and fusing the output characteristics of the first cross-mode processing module and the output characteristics of the second cross-mode processing module through the characteristic fusion module, and outputting the similarity of two user data in the input sample pair.
In an embodiment, a framework diagram of the data fusion model to be trained, i.e. the initial data fusion model, may be as shown in fig. 3a. The framework includes two branches with approximately the same structure: one branch includes the first data processing module and the first cross-modal processing module and is mainly used to process one user data in an input sample pair, while the other branch includes the second data processing module and the second cross-modal processing module and is mainly used to process the other user data in the input sample pair. The initial data fusion model further includes a feature fusion module, which fuses the outputs of the two branches and outputs a model prediction value representing the similarity between the two user data in the input sample pair.
The first data processing module and the second data processing module are mainly used for carrying out primary feature extraction processing on input user data, mapping the input user data into feature vectors, and the specific structure can be correspondingly designed according to the features of the user data to be processed; the structures of the first cross-mode processing module and the second cross-mode processing module can be basically the same, and an attention mechanism is adopted to mine the internal characteristics of the respective user data and the implicit characteristics between the two user data; and the feature fusion module fuses the output features of the first cross-mode processing module and the output features of the second cross-mode processing module to obtain fusion features of the two input user data, and outputs the similarity of the two user data in the input sample pair.
Therefore, through the initial data fusion model with the structure, the finally trained data fusion model can be well used for fusing and calculating the similarity of the user data of two types of data sources with different sizes.
In addition, after the initial data fusion model to be trained is built, each sample in the obtained training sample set can be utilized to carry out iterative training on the initial data fusion model until the data fusion model with the accuracy meeting the requirement is obtained.
Specifically, in each training process, one sample pair in the training sample set is input into the initial data fusion model, wherein one user data is input into the first data processing module for processing, and the other user data is input into the second data processing module for processing; the output characteristics of the first data processing module are fed into the first cross-modal processing module, the output characteristics of the second data processing module are fed into the second cross-modal processing module, the output characteristics of the second data processing module can be fed into the first cross-modal processing module, the output characteristics of the first data processing module can be fed into the second cross-modal processing module, and in particular, as shown in fig. 3a, the output characteristics of the second data processing module can be processed by a multi-layer perceptron (Multilayer Perceptron, MLP) and then fed into the first cross-modal processing module, and the output characteristics of the first data processing module can be processed by an MLP and then fed into the second cross-modal processing module; the output characteristics of the first cross-modal processing module and the output characteristics of the second cross-modal processing module are input into the characteristic fusion module for processing, the characteristic fusion module can be a multi-layer perceptron MLP, and the output of the MLP is the model predicted value.
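The data flow described above can be illustrated with the following minimal PyTorch sketch. The layer sizes, the use of nn.MultiheadAttention as the cross-modal exchange, the mean pooling and the sigmoid output are assumptions made only for this illustration, not the structures fixed by the application.

```python
# Minimal dual-tower sketch: two towers, MLP-routed cross-attention, fusion MLP.
import torch
import torch.nn as nn

class DataFusionSketch(nn.Module):
    def __init__(self, q_dim=32, b_dim=16, d=64, heads=4):
        super().__init__()
        self.tower_q = nn.Linear(q_dim, d)   # stands in for the questionnaire encoder
        self.tower_b = nn.Linear(b_dim, d)   # stands in for the behaviour-data preprocessor
        self.mlp_q2b = nn.Linear(d, d)       # routes each tower's output to the other side
        self.mlp_b2q = nn.Linear(d, d)
        self.cross_q = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_b = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fusion = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, questionnaire, behaviour):
        q = self.tower_q(questionnaire)      # (batch, seq, d)
        b = self.tower_b(behaviour)
        # each cross-modal module attends to the other tower's MLP-mapped output
        q_fused, _ = self.cross_q(q, self.mlp_b2q(b), self.mlp_b2q(b))
        b_fused, _ = self.cross_b(b, self.mlp_q2b(q), self.mlp_q2b(q))
        pooled = torch.cat([q_fused.mean(dim=1), b_fused.mean(dim=1)], dim=-1)
        return torch.sigmoid(self.fusion(pooled)).squeeze(-1)  # pair similarity in (0, 1)

model = DataFusionSketch()
sim = model(torch.randn(8, 10, 32), torch.randn(8, 1, 16))
print(sim.shape)  # torch.Size([8])
```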
Then, the model loss value can be determined based on the model predicted value and the true value of the input sample pair, the model structural parameters are adjusted based on the model loss value, namely the model weight is updated, and the model training is completed through the repeated iterative training process, so that a trained data fusion model is obtained.
It should be noted that, the present application may divide the training sample set into a training set and a testing set, for example, 70% of sample pairs in the training sample set are used as the training set, another 30% of sample pairs are used as the testing set, train and test the data fusion model, calculate the difference between the model predicted value and the true value by using the cross entropy loss function, update the model weight, and complete the model training through repeated iteration. The finally obtained data fusion model can be used for judging the similarity between the user data of different subsequent data sources.
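A hedged sketch of this training procedure is given below for a model with the interface of the sketch above (two inputs, one similarity per pair). The optimizer choice, learning rate and number of epochs are assumptions.

```python
# Sketch: 70/30 split, cross-entropy between similarity and true value, iteration.
import torch
import torch.nn as nn

def train_fusion_model(model, pairs_a, pairs_b, truths, epochs=10, lr=1e-3):
    n = truths.shape[0]
    split = int(0.7 * n)                      # 70% of sample pairs train, 30% test
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()                        # cross-entropy against the binary true value
    for _ in range(epochs):
        model.train()
        opt.zero_grad()
        loss = bce(model(pairs_a[:split], pairs_b[:split]), truths[:split])
        loss.backward()
        opt.step()                            # update the model weights
    model.eval()
    with torch.no_grad():
        test_loss = bce(model(pairs_a[split:], pairs_b[split:]), truths[split:])
    return test_loss.item()

# Hypothetical usage with the DataFusionSketch from the previous sketch:
# train_fusion_model(DataFusionSketch(),
#                    torch.randn(100, 10, 32), torch.randn(100, 1, 16),
#                    torch.randint(0, 2, (100,)).float())
```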
Optionally, the first data processing module is a questionnaire encoder, and the questionnaire encoder comprises an encoding module, a vector normalization module, a pooling module and a splicing module; one user data in the input sample pair is questionnaire investigation data;
the feature extraction processing is performed on the user data in the input sample pair by the first data processing module, including:
The questionnaire investigation data is encoded through the encoding module, the output vector of the encoding module is normalized through the vector normalization module, the output vector of the encoding module is pooled through the pooling module, and the output vector of the vector normalization module and the output vector of the pooling module are spliced through the splicing module.
In one embodiment, the first data processing module may be designed as a questionnaire encoder (encoder) for encoding questionnaire investigation data, i.e. one of the two user data input into the data fusion model may be questionnaire investigation data of a user.
The user scale of questionnaire survey data is small, so it may be referred to as a small data source, and since questionnaire survey data is of text type, it may first be encoded with a questionnaire encoder. The specific structure of the questionnaire encoder may be as shown in fig. 3b and includes an encoding module, a vector normalization module, a pooling module and a splicing module, where the encoding module may be a pre-trained model based on BERT and the pooling module may be an average pooling module.
Thus, firstly, a pretrained model based on Bert can be used for coding questions, options and user answers of a questionnaire; in consideration of the fact that the questionnaire comprises a plurality of questions, each question comprises a plurality of options, the number of the options contained in different questions may be different, and in addition, a user may select a plurality of options at the same time, so that in order to normalize the output vector length of the questionnaire encoder, a vector normalization module and an average pooling module can be used for processing a plurality of options and user response data under the same question, and then the response conditions of a user under all questions are spliced into a two-dimensional tensor through a splicing module to serve as the output of the questionnaire encoder.
Thus, through the implementation mode, the user questionnaire investigation data can be well coded, and the characteristics of the user questionnaire investigation data are extracted and obtained, so that the further analysis and the processing can be conveniently carried out.
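To make the flow concrete, the following sketch mirrors the encode, normalize, pool and splice steps for questions with differing numbers of options. A random embedding stands in for the BERT-based pre-trained model so the example stays self-contained, and the exact way normalization and pooling are combined before splicing is an interpretation of the description above.

```python
# Sketch of the questionnaire encoder flow: encode -> normalise -> pool -> splice.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionnaireEncoderSketch(nn.Module):
    def __init__(self, vocab=1000, d=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)   # stand-in for the BERT-based encoder

    def forward(self, questions):
        """questions: list of (n_options_i, tokens) LongTensors, one per question;
        option counts may differ between questions, as in the description."""
        per_question = []
        for opts in questions:
            vecs = self.embed(opts).mean(dim=1)   # encode each option/answer text
            vecs = F.normalize(vecs, dim=-1)      # vector normalization module
            pooled = vecs.mean(dim=0)             # average pooling over the options
            per_question.append(pooled)
        # splice the per-question vectors into a two-dimensional tensor
        return torch.stack(per_question, dim=0)

enc = QuestionnaireEncoderSketch()
out = enc([torch.randint(0, 1000, (4, 6)), torch.randint(0, 1000, (3, 6))])
print(out.shape)  # torch.Size([2, 32])
```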
Optionally, the second data processing module is configured to process user service behavior data.
That is, in one embodiment, the second data processing module may be designed as a preprocessing module to preprocess the user service behavior data, that is, the other of the two user data input into the data fusion model may be the user service behavior data, belonging to the large data source user data.
In contrast to small data sources, the user data in large data sources is basically structured data, so in this embodiment the user data may be processed directly with a preprocessing module. The processing operations performed in the preprocessing module may include one or more of missing value handling, outlier handling, category-type feature mapping, and the like. Missing value handling may refer to filling a missing value with the mean or the mode (the most frequently occurring value); outlier handling may refer to filling or replacing an excessively large or excessively small value with the mean or the mode; and category-type feature mapping may refer to converting a text-type feature into a numerical value.
Thus, by preprocessing the user business behavior data, the user business behavior data can be processed into characteristic data suitable for subsequent analysis and processing.
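An illustrative pandas sketch of these preprocessing operations follows; the column names, the mean/mode choice per column type and the percentile-based outlier bounds are assumptions, since the description leaves these details open.

```python
# Sketch: mean/mode filling, clipping of extreme values, category-to-code mapping.
import pandas as pd

def preprocess_behaviour(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())          # missing-value handling
            lo, hi = df[col].quantile([0.01, 0.99])
            df[col] = df[col].clip(lo, hi)                    # outlier handling
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])  # mode for text columns
            df[col] = df[col].astype("category").cat.codes    # category-type feature mapping
    return df

raw = pd.DataFrame({"monthly_fee": [30.0, None, 2000.0], "plan": ["A", "B", None]})
print(preprocess_behaviour(raw))
```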
Optionally, the first cross-modality processing module includes a first self-attention module, a first mutual-attention module, and a first feed-forward network; the feature extraction and fusion processing of the output features of the first data processing module and the output features of the second data processing module based on an attention mechanism by the first cross-modal processing module comprises:
Performing data internal feature extraction processing on the output features of the first data processing module through the first self-attention module, performing inter-data feature extraction processing on the output features of the first self-attention module and the output features of the second data processing module through the first mutual-attention module, and performing connection processing on the output features of the first mutual-attention module through the first feedforward network;
and/or the second cross-modality processing module includes a second self-attention module, a second mutual-attention module, and a second feed-forward network; the performing, by the second cross-modal processing module, feature extraction and fusion processing on the output features of the second data processing module and the output features of the first data processing module based on an attention mechanism includes:
and carrying out data internal feature extraction processing on the output features of the second data processing module through the second self-attention module, carrying out data inter-feature extraction processing on the output features of the second self-attention module and the output features of the first data processing module through the second mutual-attention module, and carrying out connection processing on the output features of the second mutual-attention module through the second feedforward network.
In one embodiment, the two cross-modal processing modules in the initial data fusion model may each include a self-attention module, a mutual-attention module and a feedforward network, where the input of the self-attention module is the output of the corresponding data processing module, the input of the mutual-attention module includes the output of the self-attention module and the coding vector of the opposite-side data source after passing through a multi-layer perceptron (MLP), and the feedforward network may be a fully-connected neural network, such as an MLP, that connects the output of the mutual-attention module to the input of the feature fusion module.
Thus, after the first data processing module and the second data processing module complete the processing of the input user data, their outputs can each be flattened into vector form and then fed into the corresponding cross-modal processing module for further processing. Through the attention mechanism, the first and second cross-modal processing modules can fully mine the hidden features within and between the different data sources and thereby carry out deep fusion of the multi-source data. After being processed by the cross-modal processing modules, the two user data coding vectors are input into a multi-layer perceptron MLP for further fusion, and the output of this MLP is the predicted value of the model.
In the embodiment, the deep fusion of multi-source data can be realized by fully mining hidden features inside different data sources and among different data sources through the self-attention module and the mutual-attention module in the cross-mode processing module, so that the credibility of a model prediction result is ensured.
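The following is a minimal PyTorch-style sketch of one cross-modal processing module as described above: a self-attention module over a data source's own features, a mutual-attention module that also attends to the opposite-side coding vector passed through an MLP, and a feed-forward network. The dimensions, head counts and layer structure are assumptions for illustration rather than the original design.

import torch
import torch.nn as nn

class CrossModalModule(nn.Module):
    """Sketch: self-attention within one data source, mutual attention with
    the other data source, then a feed-forward connection."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.other_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, own, other):
        # own, other: (batch, seq_len, dim) coding vectors of the two data sources
        h, _ = self.self_attn(own, own, own)       # intra-data feature extraction
        o = self.other_mlp(other)                  # opposite-side coding vector through an MLP
        h, _ = self.cross_attn(h, o, o)            # inter-data feature extraction
        return self.ffn(h)                         # feed-forward connection to the fusion module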
Optionally, the training sample set includes a plurality of first-type sample pairs, where the first-type sample pairs include a first user data and a second user data, the first user data belongs to the first-type user data, the second user data belongs to the second-type user data, and the first-type user data and the second-type user data are data from different sources; the first data processing module is used for processing second user data in the input first type sample pair, and the second data processing module is used for processing the first user data in the input first type sample pair;
prior to the step 103, the method further comprises:
obtaining M first user data and N second user data, wherein N and M are integers greater than 1;
determining L target first user data similar to target second user data according to first association features between the first type of user data and the second type of user data, wherein each target first user data is one of the M first user data, the target second user data is any one of the N second user data, and L is a positive integer;
Respectively forming L first-class sample pairs by the target second user data and the L target first user data;
and determining the true value of each of the L first-class sample pairs according to whether the first-class sample pair corresponds to the same user.
In one embodiment, the training sample set may include a plurality of first-class sample pairs, each formed by a first user data and a second user data, where the first user data belongs to the first type of user data, the second user data belongs to the second type of user data, and the first type of user data and the second type of user data come from different sources. For example, the first type of user data may be large-data-source user data and the second type of user data may be small-data-source user data; more specifically, the first type of user data may be the user business behavior data of an operator, which may include user big data such as B-domain data (the data domain of the Business Support System), O-domain data (the data domain of the Operation Support System) and S-domain data (the data domain of the Management Support System), and the second type of user data may be questionnaire investigation data.
The questionnaire investigation data may include the questions, options and user response actions of a questionnaire, with fields as follows: a questionnaire question may be "Which online applications do you use most frequently in daily life?", the corresponding options may include "A. Friends", "B. Shopping", "C. Long video", "D. Short video", and so on, and the user's response action is the selection made among these options, such as selecting "A" and "C". The user business behavior data of the operator includes data of the B domain, O domain, S domain, etc., with fields such as Average Revenue Per User (ARPU), average internet traffic per user (Dataflow of Usage, DOU), average monthly call duration per user (Minutes of Usage, MOU), mobile phone brand, place of residence, and the like.
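Purely as an illustration of the two kinds of user data described above, one sample pair might be represented as follows; the field names and values are hypothetical examples built from the fields mentioned in the text, not data from the application.

# hypothetical illustration of one sample pair (questionnaire data, business behavior data)
sample_pair = (
    {   # small-data-source user data: questionnaire investigation data
        "question": "Which online applications do you use most frequently in daily life?",
        "options": ["A. Friends", "B. Shopping", "C. Long video", "D. Short video"],
        "answers": ["A", "C"],
    },
    {   # large-data-source user data: operator business behavior data (B/O/S domains)
        "arpu": 58.0,        # Average Revenue Per User
        "dou": 12.5,         # average internet traffic per user
        "mou": 320,          # average monthly call duration (minutes)
        "phone_brand": "BrandX",
        "residence": "CityY",
    },
)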
Correspondingly, in the case that the training sample pair includes the first type sample pair, the first data processing module is configured to process the second user data in the input first type sample pair, and the second data processing module is configured to process the first user data in the input first type sample pair, for example, the first data processing module in the data fusion model may be designed to be a module for processing the user data in the large data source, such as the user business behavior data, and the second data processing module in the data fusion model may be designed to be a module for processing the user data in the small data source, such as the questionnaire investigation data.
In this embodiment, a plurality of first user data and a plurality of second user data may be acquired, where the first user data belongs to the first type of user data, and the second user data belongs to the second type of user data, for example, questionnaire data of N users and business behavior data of M users may be acquired.
Then, according to the first association features between the first type of user data and the second type of user data, a plurality of similar first user data are determined for each of the N second user data; that is, the first association features of each first user data are matched against the first association features of the N second user data, and when the matching degree between a first user data and a certain second user data is greater than a certain value, the two are considered similar. The first association feature may be a weakly associated feature, such as one or more of age, gender, city, mobile phone brand, APP usage behavior, and the like. Strongly associated features commonly used in the prior art, such as the user ID and the mobile phone number, are difficult to use directly for large-scale matching of users across multiple data sources, owing to limitations such as differing acquisition modes of the data sources and privacy protection.
Thus, for each similar first user data and second user data, a sample pair may be formed, i.e. a first type of sample pair, and a corresponding true value may be determined based on whether the user data in each first type of sample pair corresponds to the same user entity.
In this embodiment, it is considered that the training samples of the data fusion model to be trained are composed of pairwise pairs of user data from different sources, mostly pairs of large-data-source user data (e.g., the first type of user data) and small-data-source user data (e.g., the second type of user data). If every target user data in the small data source were paired with every user data in the large data source as training data, the training sample size would be excessively large and, worse, would contain a large number of meaningless simple samples, because most user data in the large data source differ obviously from the target user data in the small data source, and such samples contribute little to the training of the model. Therefore, in this embodiment, the architecture shown in FIG. 2 may be employed to train the large-and-small data fusion model.
Specifically, in order to eliminate a large number of meaningless sample pairs, a recall module may first be used to pre-screen the user data: a plurality of similar user data in the first type of user data that match the target user data in the second type of user data are filtered out through weakly associated features that can be matched between the two data sources, and the recalled similar user data are paired with the target user data to form first-class sample pairs for training the large-and-small data fusion model.
For example, denote the user data in the second type of user data, i.e., the small-data-source user data, as q_1, q_2, …, q_m, and the user data in the first type of user data, i.e., the large-data-source user data, as d_1, d_2, …, d_n. Then, for each target user data q_i (1 ≤ i ≤ m) in the second type of user data, several user data similar to it can be recalled from the first type of user data based on the weakly associated features; that is, the recalled user data set of the target user data q_i is {d_{i_1}, d_{i_2}, …, d_{i_{n_i}}}.
Each user data in the recalled user data set {d_{i_1}, d_{i_2}, …, d_{i_{n_i}}} returned by the recall module may then be paired with the target user data q_i to constitute a training sample pair (q_i, d_{i_j}), where 1 ≤ j ≤ n_i. If q_i and d_{i_j} belong to a truly matched user, i.e., correspond to the same user entity, the true value of the sample pair (q_i, d_{i_j}) is 1; otherwise the true value of the sample pair (q_i, d_{i_j}) is 0.
In the embodiment, the recall module, namely the use of similar matching between two types of user data, reduces the scale of the user data, can greatly reduce the training time of the data fusion model, optimizes the constitution of training sample pairs, and can improve the training precision of the data fusion model to a certain extent.
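A minimal sketch of the recall module's pre-screening, assuming the weakly associated features (age, gender, city, phone brand, etc.) have already been encoded as numeric vectors; the cosine-similarity measure and the top-k cutoff are assumptions for illustration.

import numpy as np

def recall_similar_users(q_feats, d_feats, top_k=50):
    """For each small-data-source user q_i, recall the top_k most similar
    large-data-source users d_j by matching weakly associated features."""
    # q_feats: (m, f) weak-association features of small-data-source users
    # d_feats: (n, f) weak-association features of large-data-source users
    q = q_feats / np.linalg.norm(q_feats, axis=1, keepdims=True)
    d = d_feats / np.linalg.norm(d_feats, axis=1, keepdims=True)
    sims = q @ d.T                                    # (m, n) similarity matrix
    return np.argsort(-sims, axis=1)[:, :top_k]       # recalled indices per q_i

def build_sample_pairs(recalled, matched_user):
    """Pair each q_i with its recalled d_j; the true value is 1 only when the
    pair corresponds to the same user entity, otherwise 0."""
    pairs = []
    for i, d_indices in enumerate(recalled):
        for j in d_indices:
            label = 1 if matched_user.get(i) == int(j) else 0
            pairs.append((i, int(j), label))
    return pairs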
Optionally, the first type of user data is user business behavior data, the second type of user data is questionnaire investigation data, the training sample set further comprises a plurality of second type of sample pairs, and the second type of sample pairs comprise first user data and third user data; the second data processing module is further used for processing third user data in the input second class sample pair;
prior to the step 103, the method further comprises:
performing sample expansion according to the first type of sample pairs, generating third user data corresponding to first user data in the first type of sample pairs, forming the second type of sample pairs by each third user data and the corresponding first user data, and determining true values of the second type of sample pairs;
And training the initial data fusion model by utilizing each first type sample and each second type sample in the training sample set.
In one embodiment, the first type of user data is user business behavior data, i.e., belongs to the large data source, and the second type of user data is questionnaire investigation data, i.e., belongs to the small data source. Owing to factors such as data islands and privacy protection, the truly matched user data in the large data source corresponding to all the user data in the small data source cannot be known in advance, which limits the number of truth-valued training samples available for the large-and-small data fusion model; too few training samples affect the accuracy and robustness of the model to a certain extent. Therefore, for training the large-and-small data fusion model in scenarios where truth-valued training samples are insufficient, this embodiment provides a semi-supervised learning method for few-sample scenarios based on the idea of contrastive learning.
In order to compensate for the negative effect of having few truth-valued training samples, the large-data-source user data in the first-class sample pairs, namely the first user data, may be expanded by data augmentation based on the idea of contrastive learning, so that a plurality of third user data corresponding to each first user data are generated, and each first user data and its corresponding third user data are then respectively formed into second-class sample pairs with true values.
In this way, a sufficient number of training sample pairs are obtained through sample expansion, and when the model is trained, the initial data fusion model is trained by utilizing each first type sample and each second type sample pair in the training sample set, so that the training precision of the model is ensured.
Optionally, the performing sample expansion according to the first type of sample pair, generating third user data corresponding to the first user data in the first type of sample pair, forming each third user data and the corresponding first user data into the second type of sample pair, and determining the true value of the second type of sample pair includes:
respectively counting characteristic average values of positive sample pairs and negative sample pairs in the plurality of first sample pairs, and determining a difference value set; the true value of the positive sample pair is a first value, the true value of the negative sample pair is a second value, the first value represents that the sample pair corresponds to the same user, and the second value represents that the sample pair does not correspond to the same user; the difference set comprises the difference value of the first user data in each positive class sample pair and the characteristic average value of the negative class sample pair, and the difference value of the first user data in each negative class sample pair and the characteristic average value of the positive class sample pair;
Extracting third user data from a first gaussian distribution and a second gaussian distribution respectively, forming a first pair of samples from the third user data extracted from the first gaussian distribution and corresponding first user data, determining a true value of the first pair of samples as the first value, and forming a second pair of samples from the third user data extracted from the second gaussian distribution and corresponding first user data, determining a true value of the second pair of samples as the second value; the second class of sample pairs comprises the first class of sample pairs and the second class of sample pairs, the first Gaussian distribution is obtained by constructing according to characteristic mean values and covariance of P first user data corresponding to P minimum difference values in the difference value set and first user data in the first class of sample pairs, the second Gaussian distribution is obtained by constructing according to characteristic mean values and covariance of Q first user data corresponding to Q maximum difference values in the difference value set and first user data in the first class of sample pairs, and P and Q are positive integers.
In one embodiment, the following data augmentation scheme may be employed:
firstly, counting the characteristic mean and covariance of the user on the positive and negative classes of all sample pairs with true values (i.e. a plurality of acquired first sample pairs), namely respectively counting the characteristic mean and covariance of the user data in the positive class sample pairs and the characteristic mean and covariance of the user data in the negative class sample pairs;
Then, calculating the difference value of the characteristic mean value of all samples in the positive and negative categories, namely, calculating the difference value of the characteristic mean value of the user data in the negative category for each first user data in the positive category sample pair, and calculating the difference value of the characteristic mean value of the user data in the positive category sample pair for each first user data in the negative category sample pair;
then selecting the nearest P samples and the farthest Q samples, namely selecting P minimum differences from the calculated differences, wherein the corresponding P first user data are the nearest P samples, selecting the Q maximum differences from the calculated differences, and the corresponding Q first user data are the farthest Q samples;
finally, the user feature mean and covariance of the P nearest samples together with the current sample (namely the first user data in the first-class sample pair) are calculated, a Gaussian distribution is constructed from this mean and covariance, and a plurality of samples are drawn from the Gaussian distribution; the classification of the sample pairs formed by these newly generated samples and the current sample is positive, i.e., the true value is 1. Similarly, the Q farthest samples can be processed in the same way, and the classification of the sample pairs formed by the resulting newly generated samples and the current sample is negative, i.e., the true value is 0.
Thus, with this embodiment, the second type of sample pairs from which the truth values are derived can be extended to compensate for the less negative effects of the truth training samples.
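A minimal numpy sketch of the data augmentation described above: compute the per-class feature means, build the difference set, select the P nearest and Q farthest first user data, construct Gaussian distributions from their means and covariances together with the current sample, and draw new samples to form positive and negative second-class pairs. The distance measure and sample counts are assumptions for illustration.

import numpy as np

def augment_samples(x_cur, pos_feats, neg_feats, P=5, Q=5, num_new=10):
    """x_cur: first user data of the current first-class sample pair.
    pos_feats / neg_feats: first user data of positive / negative first-class pairs."""
    pos_mean, neg_mean = pos_feats.mean(axis=0), neg_feats.mean(axis=0)
    # difference set: positives vs the negative-class mean, negatives vs the positive-class mean
    all_feats = np.vstack([pos_feats, neg_feats])
    diffs = np.concatenate([
        np.linalg.norm(pos_feats - neg_mean, axis=1),
        np.linalg.norm(neg_feats - pos_mean, axis=1),
    ])
    nearest = all_feats[np.argsort(diffs)[:P]]        # P first user data with the smallest differences
    farthest = all_feats[np.argsort(diffs)[-Q:]]      # Q first user data with the largest differences

    def draw(group_members):
        group = np.vstack([group_members, x_cur[None, :]])
        mean = group.mean(axis=0)
        cov = np.cov(group, rowvar=False)
        return np.random.multivariate_normal(mean, cov, size=num_new)

    pos_new = draw(nearest)    # pairs (x_cur, sample) labelled with true value 1
    neg_new = draw(farthest)   # pairs (x_cur, sample) labelled with true value 0
    return pos_new, neg_new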
Optionally, the training the initial data fusion model using each first type sample and each second type sample in the training sample set includes:
inputting each first type sample pair in the training sample set into the initial data fusion model to obtain a first model predicted value output by the initial data fusion model, wherein the first model predicted value is used for representing the similarity of first user data and second user data in the input first type sample pair;
inputting each second sample pair in the training sample set into a shared data fusion model to obtain a second model predicted value output by the shared data fusion model, wherein the second model predicted value is used for representing the similarity of first user data and third user data in the input second sample pair, the shared data fusion model comprises two third cross-mode processing modules, and the third cross-mode processing modules and the second cross-mode processing modules have the same structure and share weights;
Determining a first penalty value based on the first model predictive value and a true value of the first class of sample pairs;
determining a second penalty value based on the second model predictive value and a true value of the second class of sample pairs;
determining a weighted loss value according to the first loss value and the second loss value;
and adjusting structural parameters of the initial data fusion model based on the weighted loss value.
In one embodiment, the architecture of the large-and-small data fusion model in the few-sample scenario may be as shown in fig. 4, where the DNN in fig. 4 is the cross-modal processing module in fig. 3a and includes the three sub-modules of a self-attention module, a mutual-attention module and a feed-forward network. The encoded vectors of the questionnaire investigation data and the user business behavior data are respectively processed by the cross-modal processing modules.
In this embodiment, the questionnaire investigation data and the user business behavior data in the first type sample pair in the training sample set may be respectively input into the questionnaire encoder and the preprocessing module in the initial data fusion model for processing, and after being processed by the respective cross-mode processing modules, the first model prediction value, that is, the similarity between the questionnaire investigation data and the user business behavior data in the input first type sample pair is output.
For the second-class sample pairs in the training sample set, the user business behavior data and the corresponding expanded user business behavior data in each second-class sample pair may, after being processed by the preprocessing module, be respectively fed into two third cross-modal processing modules that have the same structure as the second cross-modal processing module in the initial data fusion model and share weights with it; after this processing, a second model predicted value, namely the similarity of the two user business behavior data in the input second-class sample pair, is output.
That is, these artificially constructed second-class sample pairs may be fed into two deep neural networks (Deep Neural Networks, DNN) that are identical in structure to, and share weights with, the DNN that processes the large-data-source user vectors.
In order to reduce the influence of the artificially constructed sample pairs (i.e., the second-class sample pairs) on the original sample pairs (i.e., the first-class sample pairs), this embodiment divides model training into two tasks, namely a main task and an auxiliary task. The main task is still the processing and comparison of the original sample pairs; this part is trained in a supervised manner, and its loss (LOSS) function can be as follows:
where q_i is the coding vector of the small-data-source user data (e.g., questionnaire investigation data) after being processed by the DNN, d_i is the coding vector of the large-data-source user data (e.g., user business behavior data) after being processed by the DNN, τ is a hyperparameter of the model, c(q_i, d_j) denotes the inner product of the two vectors divided by their modulus lengths, and N is the number of sample pairs. This loss function L_1 measures the similarity between the small-data-source user data q_i and the current large-data-source user data d_i, as well as the gap between this similarity and the similarity of q_i to all large-data-source user data.
The auxiliary task is to learn a second sample pair consisting of large data source user data and a new sample obtained through data augmentation. Similarly, the loss function of the auxiliary task is as follows:
where d_i' is the new sample obtained from d_i through data augmentation.
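Consistent with the descriptions of the main-task and auxiliary-task losses above, the two losses may take the following contrastive (InfoNCE-style) form; this is an assumed reconstruction rather than the exact original formulation:

L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(c(q_i, d_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(c(q_i, d_j)/\tau\right)}

L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(c(d_i, d_i')/\tau\right)}{\sum_{j=1}^{N}\exp\left(c(d_i, d_j')/\tau\right)}

where c(·,·) is the similarity function described above.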
The loss function of the overall model may be as follows:
L = L_1 + α·L_2
where α is a weight parameter ranging from -1 to 1, and its specific value can be determined according to actual requirements.
In this embodiment, the model loss value may be calculated based on the calculation formula of the loss function, and the structural parameters of the initial data fusion model may be adjusted based on the model loss value, such as adjusting the DNN model weight in fig. 4.
Therefore, according to the embodiment, the data fusion model can be respectively subjected to supervised training and self-supervised training based on the original sample pair and the expanded sample pair, so that the training effect of the model can be further ensured, and the model training precision is improved.
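A minimal sketch of one training step combining the main-task and auxiliary-task losses with the weight α, assuming PyTorch; `model.main_task_loss` and `model.auxiliary_task_loss` are hypothetical placeholders for the contrastive losses described above, not functions defined in the application.

import torch

def train_step(model, optimizer, first_class_batch, second_class_batch, alpha=0.5):
    """first_class_batch: (questionnaire data, business data, true values)
    second_class_batch: (business data, augmented business data, true values)"""
    optimizer.zero_grad()
    loss1 = model.main_task_loss(*first_class_batch)          # supervised, original sample pairs
    loss2 = model.auxiliary_task_loss(*second_class_batch)    # augmented pairs, shared-weight DNNs
    loss = loss1 + alpha * loss2                               # L = L_1 + α·L_2
    loss.backward()
    optimizer.step()                                           # adjusts the DNN weights as in fig. 4
    return loss.item()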
Step 104, determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data.
The label of the user data to be predicted may be determined based on the similarity of each prediction sample pair and the label of each second user data in several ways: the label of the second user data in the prediction sample pair with the highest similarity may be taken as the label of the user data to be predicted; or the labels of the second user data in several prediction sample pairs with the highest similarities may be combined to determine an integrated label; or the labels of the second user data in all prediction sample pairs may be weighted according to the similarities to determine a target label, and so on.
The tag may refer to any tag word that can reflect characteristics of a user attribute (such as age, gender, occupation, region, etc.), behavior, or interests, etc.
Optionally, the step 104 includes:
Taking the similarity of each prediction sample pair as the weight of the label value of the second user data in the corresponding prediction sample pair, and weighting the label value of each second user data to obtain a weighted label value;
and determining the weighted tag value as the tag value of the user data to be predicted.
In one embodiment, the similarity of each prediction sample pair may be used as the weight of the label value of the second user data in the corresponding prediction sample pair, the label values of the second user data are weighted to obtain a weighted label value, and a corresponding user label is determined from the weighted label value as the label of the user data to be predicted. Here, different text labels may be converted into different label values so that they can participate in the weighted calculation of the label value.
Thus, according to the embodiment, the label result obtained by prediction based on the similarity between the user data to be predicted and each second user data can be ensured to have high confidence.
That is, in this embodiment, after the training of the data fusion model is completed according to the embodiment shown in fig. 1, a user data d_s in the large data source may be imported into the label prediction module, the similarities {f(d_s, q_1), f(d_s, q_2), …, f(d_s, q_m)} between d_s and all the user data {q_1, q_2, …, q_m} in the small data source are determined by the data fusion model, and these similarities are then used as weights in a weighted calculation together with the user labels {k_1, k_2, …, k_m} in the small data source to obtain the label prediction result K of the large-data-source user data d_s.
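A minimal sketch of this weighted label prediction: the similarities f(d_s, q_j) output by the data fusion model are used as weights on the label values k_j of the small-data-source users; the normalization of the weights is an assumption for illustration.

import numpy as np

def predict_label(similarities, label_values):
    """similarities: [f(d_s, q_1), ..., f(d_s, q_m)] output by the data fusion model.
    label_values: [k_1, ..., k_m], numeric label values of the small-data-source users."""
    w = np.asarray(similarities, dtype=float)
    k = np.asarray(label_values, dtype=float)
    w = w / w.sum()                  # normalize the similarity weights (assumed)
    return float(np.dot(w, k))       # weighted label value K for the large-data-source user d_s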
According to the label prediction method, user data to be predicted is obtained, and the user data to be predicted belongs to first-class user data; respectively forming a plurality of prediction sample pairs by the user data to be predicted and each second user data in the second type of user data acquired in advance; the method comprises the steps of respectively inputting a plurality of prediction sample pairs into a data fusion model to obtain the similarity of each prediction sample pair output by the data fusion model, wherein the data fusion model is obtained by training a pre-built initial data fusion model through each sample pair in a pre-obtained training sample set, the training sample set comprises a plurality of sample pairs, each sample pair comprises two different types of user data, and each sample pair is marked with a true value, and the true value is used for indicating whether the sample pairs correspond to the same user; and determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data. In this way, the user data to be predicted and the other type of user data form the prediction sample pair, and the user label is predicted by utilizing the data fusion model fused with the characteristics of the multiple types of user data, so that the comprehensive attribute and behavior of the user can be fully mined, and the confidence of the result of predicting the user label is ensured to be higher.
Taking the fusion of small-data-source user data and large-data-source user data as an example, the overall processing flow of the present application can be as shown in fig. 5. Because the number of users in the small data source is far smaller than that in the large data source, the present application may take the user data {q_1, …, q_m} in the small data source as target user data and use the recall module to recall several similar user data from the large data source by matching certain weakly associated features; the recalled user data set of the target user data q_i is {d_{i_1}, d_{i_2}, …, d_{i_{n_i}}}. These are then used as sample pairs to train the large-and-small data fusion model, which adopts a neural network model based on a double-tower architecture as its model architecture, and few-sample semi-supervised learning is performed using the idea of contrastive learning. After the large-and-small data fusion model is trained, a user data d_s in the large data source is imported into the label prediction module, the similarities between d_s and all the user data {q_1, …, q_m} in the small data source are determined by the large-and-small data fusion model, and these similarities are used as weights to map the flexible and rich user labels in the small data source onto the large-data-source user data, thereby realizing label prediction for the user data in the large data source.
The application provides a method for realizing user label prediction by fusing different user data based on a deep learning model. Compared with existing technical solutions that perform user label prediction based on a single data source, the application can fuse multiple data sources, such as questionnaire investigation data and user business behavior data, through a deep-learning-based data fusion model, so that the advantages of both data sources can be combined, namely the massive users of the large data source and the flexible, rich labels of the small data source; user label prediction performed on this basis has higher result confidence, and at the same time implicit patterns and rules inside the data can be discovered during fusion, creating new value. Secondly, compared with existing technical solutions that directly use strongly associated features such as the user ID for hard matching of multi-source data, the application adopts a neural network fusion model based on a double-tower architecture and realizes soft matching of multi-source users according to the user similarity score predicted by the fusion model; it can therefore be used in large-and-small data fusion scenarios where the user entities of the two data sources cannot be put into one-to-one correspondence, and its application range and form are more flexible. In addition, in order to cope with few-sample scenarios in which truth-valued samples are insufficient owing to data-island factors in the multi-source data fusion process, the application also provides a semi-supervised training procedure based on contrastive learning, which realizes sample expansion so that a data fusion model with higher precision can be obtained from a small number of truth-valued samples.
The embodiment of the application also provides a label prediction device. Referring to fig. 6, fig. 6 is a block diagram of a tag prediction apparatus according to an embodiment of the present application. Because the principle of solving the problem of the label prediction device is similar to that of the label prediction method in the embodiment of the present application, the implementation of the label prediction device can refer to the implementation of the method, and the repetition is omitted.
As shown in fig. 6, the tag prediction apparatus 600 includes:
a first obtaining module 601, configured to obtain user data to be predicted, where the user data to be predicted belongs to a first type of user data;
a first processing module 602, configured to form a plurality of prediction sample pairs from the user data to be predicted and each second user data in the second type of user data acquired in advance, respectively;
the model prediction module 603 is configured to respectively input the plurality of prediction sample pairs into a data fusion model to obtain the similarity of each prediction sample pair output by the data fusion model, where the data fusion model is obtained by training a pre-constructed initial data fusion model with each sample pair in a pre-acquired training sample set, the training sample set includes a plurality of sample pairs, each sample pair includes two different types of user data, and each sample pair is labeled with a true value, where the true value is used to indicate whether the sample pair corresponds to the same user;
A first determining module 604, configured to determine a label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data.
Optionally, the first determining module 604 includes:
the weighting processing unit is used for weighting the label value of each second user data by taking the similarity of each prediction sample pair as the weight of the label value of the second user data in the corresponding prediction sample pair to obtain a weighted label value;
and the first determining unit is used for determining the weighted tag value as the tag value of the user data to be predicted.
Optionally, the initial data fusion model includes a first data processing module, a second data processing module, a first cross-modal processing module, a second cross-modal processing module, and a feature fusion module;
wherein the model prediction module is used for:
performing feature extraction processing on one user data in the input sample pair through the first data processing module, and performing feature extraction processing on the other user data in the input sample pair through the second data processing module;
performing feature extraction and fusion processing on the output features of the first data processing module and the output features of the second data processing module based on an attention mechanism through the first cross-mode processing module;
Performing feature extraction and fusion processing on the output features of the second data processing module and the output features of the first data processing module based on an attention mechanism through the second cross-mode processing module;
and fusing the output characteristics of the first cross-mode processing module and the output characteristics of the second cross-mode processing module through the characteristic fusion module, and outputting the similarity of two user data in the input sample pair.
Optionally, the first data processing module is a questionnaire encoder, and the questionnaire encoder comprises an encoding module, a vector normalization module, a pooling module and a splicing module; one user data in the input sample pair is questionnaire investigation data;
the first data processing module is used for:
encoding the questionnaire investigation data through the encoding module, normalizing the output vector of the encoding module through the vector normalization module, and pooling the output vector of the encoding module through the pooling module; and the splicing module is used for splicing the output vector of the vector normalization module and the output vector of the pooling module.
Optionally, the first cross-modal processing module includes a first self-attention module, a first mutual-attention module, and a first feed-forward network; the first cross-modal processing module is configured to:
performing data internal feature extraction processing on the output features of the first data processing module through the first self-attention module, performing inter-data feature extraction processing on the output features of the first self-attention module and the output features of the second data processing module through the first mutual-attention module, and performing connection processing on the output features of the first mutual-attention module through the first feedforward network;
and/or the second cross-modality processing module includes a second self-attention module, a second mutual-attention module, and a second feed-forward network; the second cross-modal processing module is configured to:
and carrying out data internal feature extraction processing on the output features of the second data processing module through the second self-attention module, carrying out data inter-feature extraction processing on the output features of the second self-attention module and the output features of the first data processing module through the second mutual-attention module, and carrying out connection processing on the output features of the second mutual-attention module through the second feedforward network.
Optionally, the training sample set includes a plurality of first-type sample pairs, where the first-type sample pairs include a first user data and a second user data, the first user data belongs to the first-type user data, the second user data belongs to the second-type user data, and the first-type user data and the second-type user data are data from different sources; the first data processing module is used for processing second user data in the input first type sample pair, and the second data processing module is used for processing the first user data in the input first type sample pair;
the tag prediction apparatus 600 further includes:
the second acquisition module is used for acquiring M pieces of first user data and N pieces of second user data, wherein N and M are integers larger than 1;
the second determining module is configured to determine, according to a first association characteristic between the first type of user data and the second type of user data, L target first user data similar to target second user data, where the target first user data is any one of the M first user data, the target second user data is second user data in the N second user data, and L is a positive integer;
The second processing module is used for respectively forming L first sample pairs by the target second user data and the L target first user data;
and the third determining module is used for determining the true value of each first type sample pair in the L first type samples according to whether each first type sample pair in the L first type samples corresponds to the same user.
Optionally, the first type of user data is user business behavior data, the second type of user data is questionnaire investigation data, the training sample set further comprises a plurality of second type of sample pairs, and the second type of sample pairs comprise first user data and third user data; the second data processing module is further used for processing third user data in the input second class sample pair;
the tag prediction apparatus 600 further includes:
the sample expansion module is used for carrying out sample expansion according to the first type of sample pairs, generating third user data corresponding to first user data in the first type of sample pairs, forming the second type of sample pairs by each third user data and the corresponding first user data, and determining true values of the second type of sample pairs;
And the training module is used for training the initial data fusion model by utilizing each first type sample and each second type sample in the training sample set.
Optionally, the training module includes:
the first processing unit is used for inputting each first type of sample pair in the training sample set into the initial data fusion model to obtain a first model predicted value output by the initial data fusion model, wherein the first model predicted value is used for representing the similarity of first user data and second user data in the input first type of sample pair;
the second processing unit is used for inputting each second type of sample pair in the training sample set into a shared data fusion model to obtain a second model predicted value output by the shared data fusion model, wherein the second model predicted value is used for representing the similarity of first user data and third user data in the input second type of sample pair, the shared data fusion model comprises two third cross-mode processing modules, and the third cross-mode processing modules and the second cross-mode processing modules have the same structure and share weights;
a second determining unit, configured to determine a first loss value based on the first model predicted value and a true value of the first sample pair;
A third determining unit configured to determine a second loss value based on the second model predicted value and a true value of the second sample pair;
a fourth determining unit configured to determine a weighted loss value according to the first loss value and the second loss value;
and the adjusting unit is used for adjusting the structural parameters of the initial data fusion model based on the weighted loss value.
Optionally, the sample expansion module includes:
a fifth determining unit, configured to respectively count feature averages of positive and negative sample pairs in the plurality of first sample pairs, and determine a difference set; the true value of the positive sample pair is a first value, the true value of the negative sample pair is a second value, the first value represents that the sample pair corresponds to the same user, and the second value represents that the sample pair does not correspond to the same user; the difference set comprises the difference value of the first user data in each positive class sample pair and the characteristic average value of the negative class sample pair, and the difference value of the first user data in each negative class sample pair and the characteristic average value of the positive class sample pair;
a sixth determining unit configured to extract third user data from a first gaussian distribution and a second gaussian distribution, respectively, construct a first pair of samples from the third user data extracted from the first gaussian distribution and the corresponding first user data, and determine a true value of the first pair of samples as the first value, and construct a second pair of samples from the third user data extracted from the second gaussian distribution and the corresponding first user data, and determine a true value of the second pair of samples as the second value; the second class of sample pairs comprises the first class of sample pairs and the second class of sample pairs, the first Gaussian distribution is obtained by constructing according to characteristic mean values and covariance of P first user data corresponding to P minimum difference values in the difference value set and first user data in the first class of sample pairs, the second Gaussian distribution is obtained by constructing according to characteristic mean values and covariance of Q first user data corresponding to Q maximum difference values in the difference value set and first user data in the first class of sample pairs, and P and Q are positive integers.
The label predicting device 600 provided in the embodiment of the present application may execute the method embodiment shown in fig. 1, and its implementation principle and technical effects are similar, and this embodiment will not be described herein again.
The label predicting device 600 of the embodiment of the application obtains user data to be predicted, wherein the user data to be predicted belongs to first-class user data; respectively forming a plurality of prediction sample pairs by the user data to be predicted and each second user data in the second type of user data acquired in advance; the method comprises the steps of respectively inputting a plurality of prediction sample pairs into a data fusion model to obtain the similarity of each prediction sample pair output by the data fusion model, wherein the data fusion model is obtained by training a pre-built initial data fusion model through each sample pair in a pre-obtained training sample set, the training sample set comprises a plurality of sample pairs, each sample pair comprises two different types of user data, and each sample pair is marked with a true value, and the true value is used for indicating whether the sample pairs correspond to the same user; and determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data. In this way, the user data to be predicted and the other type of user data form the prediction sample pair, and the user label is predicted by utilizing the data fusion model fused with the characteristics of the multiple types of user data, so that the comprehensive attribute and behavior of the user can be fully mined, and the confidence of the result of predicting the user label is ensured to be higher.
The embodiment of the application also provides electronic equipment. Because the principle of solving the problem of the electronic device is similar to that of the label prediction method in the embodiment of the application, the implementation of the electronic device can be referred to the implementation of the method, and the repetition is omitted. As shown in fig. 7, the electronic device of the embodiment of the present application includes a processor 700 and a memory 720.
The processor 700 is configured to read the program in the memory 720, and execute the following procedures:
obtaining user data to be predicted, wherein the user data to be predicted belongs to first-class user data;
respectively forming a plurality of prediction sample pairs by the user data to be predicted and each second user data in the second type of user data acquired in advance;
the method comprises the steps of respectively inputting a plurality of prediction sample pairs into a data fusion model to obtain the similarity of each prediction sample pair output by the data fusion model, wherein the data fusion model is obtained by training a pre-built initial data fusion model through each sample pair in a pre-obtained training sample set, the training sample set comprises a plurality of sample pairs, each sample pair comprises two different types of user data, and each sample pair is marked with a true value, and the true value is used for indicating whether the sample pairs correspond to the same user;
And determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data.
Wherein in fig. 7, a bus architecture may comprise any number of interconnected buses and bridges, and in particular one or more processors represented by processor 700 and various circuits of memory represented by memory 720, linked together. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators, power management circuits, etc., which are well known in the art and, therefore, will not be described further herein. The bus interface provides an interface. The processor 700 is responsible for managing the bus architecture and general processing, and the memory 720 may store data used by the processor 700 in performing operations.
Optionally, the processor 700 is further configured to read the program in the memory 720, and perform the following steps:
taking the similarity of each prediction sample pair as the weight of the label value of the second user data in the corresponding prediction sample pair, and weighting the label value of each second user data to obtain a weighted label value;
and determining the weighted tag value as the tag value of the user data to be predicted.
Optionally, the initial data fusion model includes a first data processing module, a second data processing module, a first cross-modal processing module, a second cross-modal processing module, and a feature fusion module;
the processor 700 is further configured to read the program in the memory 720, and perform the following steps:
performing feature extraction processing on one user data in the input sample pair through the first data processing module, and performing feature extraction processing on the other user data in the input sample pair through the second data processing module;
performing feature extraction and fusion processing on the output features of the first data processing module and the output features of the second data processing module based on an attention mechanism through the first cross-mode processing module;
performing feature extraction and fusion processing on the output features of the second data processing module and the output features of the first data processing module based on an attention mechanism through the second cross-mode processing module;
and fusing the output characteristics of the first cross-mode processing module and the output characteristics of the second cross-mode processing module through the characteristic fusion module, and outputting the similarity of two user data in the input sample pair.
Optionally, the first data processing module is a questionnaire encoder, and the questionnaire encoder comprises an encoding module, a vector normalization module, a pooling module and a splicing module; one user data in the input sample pair is questionnaire investigation data;
the processor 700 is further configured to read the program in the memory 720, and perform the following steps:
the questionnaire investigation data is encoded through the encoding module, the output vector of the encoding module is normalized through the vector normalization module, the output vector of the encoding module is pooled through the pooling module, and the output vector of the vector normalization module and the output vector of the pooling module are spliced through the splicing module.
Optionally, the first cross-modality processing module includes a first self-attention module, a first mutual-attention module, and a first feed-forward network; the processor 700 is further configured to read the program in the memory 720, and perform the following steps:
performing data internal feature extraction processing on the output features of the first data processing module through the first self-attention module, performing inter-data feature extraction processing on the output features of the first self-attention module and the output features of the second data processing module through the first mutual-attention module, and performing connection processing on the output features of the first mutual-attention module through the first feedforward network;
And/or the second cross-modality processing module includes a second self-attention module, a second mutual-attention module, and a second feed-forward network; the processor 700 is further configured to read the program in the memory 720, and perform the following steps:
and carrying out data internal feature extraction processing on the output features of the second data processing module through the second self-attention module, carrying out data inter-feature extraction processing on the output features of the second self-attention module and the output features of the first data processing module through the second mutual-attention module, and carrying out connection processing on the output features of the second mutual-attention module through the second feedforward network.
Optionally, the training sample set includes a plurality of first-type sample pairs, where the first-type sample pairs include a first user data and a second user data, the first user data belongs to the first-type user data, the second user data belongs to the second-type user data, and the first-type user data and the second-type user data are data from different sources; the first data processing module is used for processing second user data in the input first type sample pair, and the second data processing module is used for processing the first user data in the input first type sample pair;
The processor 700 is further configured to read the program in the memory 720, and perform the following steps:
obtaining M first user data and N second user data, wherein N and M are integers greater than 1;
determining L target first user data similar to target second user data according to first association features between the first user data and the second user data, wherein the target first user data are any one of the M first user data, the target second user data are second user data in the N second user data, and L is a positive integer;
respectively forming L first-class sample pairs by the target second user data and the L target first user data;
and determining the true value of each first-class sample pair in the L first-class samples according to whether each first-class sample pair in the L first-class sample pairs corresponds to the same user.
Optionally, the first type of user data is user business behavior data, the second type of user data is questionnaire investigation data, the training sample set further comprises a plurality of second type of sample pairs, and the second type of sample pairs comprise first user data and third user data; the second data processing module is further used for processing third user data in the input second class sample pair;
The processor 700 is further configured to read the program in the memory 720, and perform the following steps:
performing sample expansion according to the first type of sample pairs, generating third user data corresponding to first user data in the first type of sample pairs, forming the second type of sample pairs by each third user data and the corresponding first user data, and determining true values of the second type of sample pairs;
and training the initial data fusion model by utilizing each first type sample and each second type sample in the training sample set.
Optionally, the processor 700 is further configured to read the program in the memory 720, and perform the following steps:
inputting each first type sample pair in the training sample set into the initial data fusion model to obtain a first model predicted value output by the initial data fusion model, wherein the first model predicted value is used for representing the similarity of first user data and second user data in the input first type sample pair;
inputting each second sample pair in the training sample set into a shared data fusion model to obtain a second model predicted value output by the shared data fusion model, wherein the second model predicted value is used for representing the similarity of first user data and third user data in the input second sample pair, the shared data fusion model comprises two third cross-mode processing modules, and the third cross-mode processing modules and the second cross-mode processing modules have the same structure and share weights;
Determining a first penalty value based on the first model predictive value and a true value of the first class of sample pairs;
determining a second penalty value based on the second model predictive value and a true value of the second class of sample pairs;
determining a weighted loss value according to the first loss value and the second loss value;
and adjusting structural parameters of the initial data fusion model based on the weighted loss value.
Optionally, the processor 700 is further configured to read the program in the memory 720, and perform the following steps:
respectively counting characteristic average values of positive sample pairs and negative sample pairs in the plurality of first sample pairs, and determining a difference value set; the true value of the positive sample pair is a first value, the true value of the negative sample pair is a second value, the first value represents that the sample pair corresponds to the same user, and the second value represents that the sample pair does not correspond to the same user; the difference set comprises the difference value of the first user data in each positive class sample pair and the characteristic average value of the negative class sample pair, and the difference value of the first user data in each negative class sample pair and the characteristic average value of the positive class sample pair;
extracting third user data from a first gaussian distribution and a second gaussian distribution respectively, forming a first pair of samples from the third user data extracted from the first gaussian distribution and corresponding first user data, determining a true value of the first pair of samples as the first value, and forming a second pair of samples from the third user data extracted from the second gaussian distribution and corresponding first user data, determining a true value of the second pair of samples as the second value; the second class of sample pairs comprises the first class of sample pairs and the second class of sample pairs, the first Gaussian distribution is obtained by constructing according to characteristic mean values and covariance of P first user data corresponding to P minimum difference values in the difference value set and first user data in the first class of sample pairs, the second Gaussian distribution is obtained by constructing according to characteristic mean values and covariance of Q first user data corresponding to Q maximum difference values in the difference value set and first user data in the first class of sample pairs, and P and Q are positive integers.
The electronic device provided in the embodiment of the present application can execute the method embodiment shown in fig. 1; its implementation principle and technical effects are similar and are not repeated here.
Furthermore, a computer readable storage medium of an embodiment of the present application is used for storing a computer program, where the computer program can be executed by a processor to implement the steps of the method embodiment shown in fig. 1.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may be physically included separately, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
While the foregoing is directed to the preferred embodiments of the present application, those skilled in the art will appreciate that various modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations shall also fall within the scope of protection of the present application.

Claims (12)

1. A label prediction method, comprising:
obtaining user data to be predicted, wherein the user data to be predicted belongs to first-class user data;
respectively forming a plurality of prediction sample pairs by the user data to be predicted and each second user data in the second type of user data acquired in advance;
inputting the plurality of prediction sample pairs into a data fusion model respectively to obtain the similarity of each prediction sample pair output by the data fusion model, wherein the data fusion model is obtained by training a pre-built initial data fusion model with each sample pair in a pre-obtained training sample set, the training sample set comprises a plurality of sample pairs, each sample pair comprises two different types of user data, each sample pair is marked with a true value, and the true value is used for indicating whether the sample pair corresponds to the same user;
and determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data.
2. The method of claim 1, wherein the determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data comprises:
taking the similarity of each prediction sample pair as the weight of the label value of the second user data in the corresponding prediction sample pair, and weighting the label value of each second user data to obtain a weighted label value;
and determining the weighted label value as the label value of the user data to be predicted.
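For illustration, the weighting described in this claim can be realized as a similarity-weighted combination of the second-user-data label values. Whether the weighted sum is normalized by the total similarity is not stated in the claim; the `normalize` flag below is an assumption.

```python
def weighted_label(similarities, label_values, normalize=True):
    """Weight each second-user-data label value by its pair similarity.

    similarities -- similarity of each prediction sample pair
    label_values -- label value of the second user data in the corresponding pair
    """
    weighted = sum(s * v for s, v in zip(similarities, label_values))
    if normalize:
        total = sum(similarities)
        weighted = weighted / total if total else 0.0
    return weighted  # used as the label value of the user data to be predicted

# Example with three prediction sample pairs:
print(weighted_label([0.9, 0.2, 0.6], [1.0, 0.0, 1.0]))  # about 0.88
```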
3. The method of claim 1, wherein the initial data fusion model comprises a first data processing module, a second data processing module, a first cross-modal processing module, a second cross-modal processing module, and a feature fusion module;
the inputting the plurality of prediction sample pairs into the data fusion model respectively to obtain the similarity of each prediction sample pair output by the data fusion model comprises the following steps:
performing feature extraction processing on one user data in the input sample pair through the first data processing module, and performing feature extraction processing on the other user data in the input sample pair through the second data processing module;
performing feature extraction and fusion processing on the output features of the first data processing module and the output features of the second data processing module based on an attention mechanism through the first cross-modal processing module;
performing feature extraction and fusion processing on the output features of the second data processing module and the output features of the first data processing module based on an attention mechanism through the second cross-modal processing module;
and fusing the output features of the first cross-modal processing module and the output features of the second cross-modal processing module through the feature fusion module, and outputting the similarity of the two user data in the input sample pair.
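A structural sketch of such a data fusion model, assuming PyTorch, sequence-shaped features, mean pooling before fusion, and a sigmoid output; the concrete encoders and cross-modal modules are passed in because the claim does not fix their internals.

```python
import torch
import torch.nn as nn

class DataFusionModel(nn.Module):
    """Two data processing modules, two cross-modal processing modules, one feature fusion module."""

    def __init__(self, first_encoder, second_encoder, cross_a, cross_b, dim=256):
        super().__init__()
        self.first_encoder = first_encoder    # first data processing module
        self.second_encoder = second_encoder  # second data processing module
        self.cross_a = cross_a                # first cross-modal processing module
        self.cross_b = cross_b                # second cross-modal processing module
        self.fusion = nn.Sequential(          # feature fusion module (assumed form)
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, data_a, data_b):
        feat_a = self.first_encoder(data_a)     # feature extraction on one user data
        feat_b = self.second_encoder(data_b)    # feature extraction on the other user data
        fused_a = self.cross_a(feat_a, feat_b)  # extraction and fusion: A attends to B
        fused_b = self.cross_b(feat_b, feat_a)  # extraction and fusion: B attends to A
        pooled = torch.cat([fused_a.mean(dim=1), fused_b.mean(dim=1)], dim=-1)
        return self.fusion(pooled).squeeze(-1)  # similarity of the two user data in the pair
```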
4. The method of claim 3, wherein the first data processing module is a questionnaire encoder comprising an encoding module, a vector normalization module, a pooling module, and a splicing module; one user data in the input sample pair is questionnaire data;
the performing feature extraction processing on the one user data in the input sample pair through the first data processing module comprises:
encoding the questionnaire data through the encoding module, normalizing the output vector of the encoding module through the vector normalization module, pooling the output vector of the encoding module through the pooling module, and splicing the output vector of the vector normalization module and the output vector of the pooling module through the splicing module.
5. The method of claim 3, wherein the first cross-modal processing module comprises a first self-attention module, a first mutual-attention module, and a first feed-forward network; the performing feature extraction and fusion processing on the output features of the first data processing module and the output features of the second data processing module based on an attention mechanism through the first cross-modal processing module comprises:
performing intra-data feature extraction processing on the output features of the first data processing module through the first self-attention module, performing inter-data feature extraction processing on the output features of the first self-attention module and the output features of the second data processing module through the first mutual-attention module, and performing connection processing on the output features of the first mutual-attention module through the first feed-forward network;
and/or, the second cross-modal processing module comprises a second self-attention module, a second mutual-attention module, and a second feed-forward network; the performing feature extraction and fusion processing on the output features of the second data processing module and the output features of the first data processing module based on an attention mechanism through the second cross-modal processing module comprises:
performing intra-data feature extraction processing on the output features of the second data processing module through the second self-attention module, performing inter-data feature extraction processing on the output features of the second self-attention module and the output features of the first data processing module through the second mutual-attention module, and performing connection processing on the output features of the second mutual-attention module through the second feed-forward network.
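A minimal sketch of one such cross-modal processing module; the head count, width, and the absence of residual connections and layer normalization are assumptions, with `nn.MultiheadAttention` standing in for the self-attention and mutual-attention modules.

```python
import torch.nn as nn

class CrossModalModule(nn.Module):
    """Self-attention, mutual (cross) attention, and a feed-forward network."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # self-attention module
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # mutual-attention module
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))                      # feed-forward network

    def forward(self, own_feats, other_feats):
        # intra-data feature extraction on this module's own modality
        x, _ = self.self_attn(own_feats, own_feats, own_feats)
        # inter-data feature extraction: queries from this modality, keys/values from the other
        x, _ = self.cross_attn(x, other_feats, other_feats)
        return self.ffn(x)  # connection processing through the feed-forward network
```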
6. The method according to any one of claims 3 to 5, wherein the training sample set comprises a plurality of first-class sample pairs, each first-class sample pair comprising one first user data and one second user data, the first user data belonging to a first type of user data, the second user data belonging to a second type of user data, and the first type of user data and the second type of user data being data from different sources; the first data processing module is used for processing the second user data in the input first-class sample pair, and the second data processing module is used for processing the first user data in the input first-class sample pair;
before the inputting the plurality of prediction sample pairs into the data fusion model respectively, the method further comprises:
obtaining M first user data and N second user data, wherein N and M are integers greater than 1;
determining L target first user data similar to target second user data according to first association features between the first user data and the second user data, wherein the L target first user data are first user data among the M first user data, the target second user data is any one of the N second user data, and L is a positive integer;
forming L first-class sample pairs from the target second user data and the L target first user data respectively;
and determining the true value of each of the L first-class sample pairs according to whether each of the L first-class sample pairs corresponds to the same user.
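The claim leaves the first association feature open; the sketch below assumes it can be reduced to a numeric score returned by a caller-supplied function, and that a hypothetical `user_id` field is available solely for assigning the true values.

```python
def build_first_class_pairs(first_data, second_data, association_score, L):
    """Form L first-class sample pairs per second user data and assign true values.

    association_score(f, s) -- assumed callable returning the first association
                               feature between one first user data and one second user data
    """
    pairs = []
    for s in second_data:  # each target second user data
        ranked = sorted(first_data, key=lambda f: association_score(f, s), reverse=True)
        for f in ranked[:L]:  # the L target first user data most similar to s
            # true value: first value (1) if the pair corresponds to the same user, else second value (0)
            true_value = 1 if f["user_id"] == s["user_id"] else 0
            pairs.append((f, s, true_value))
    return pairs
```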
7. The method of claim 6, wherein the first type of user data is user business behavior data and the second type of user data is questionnaire data, the training sample set further comprises a plurality of second-class sample pairs, each second-class sample pair comprising one first user data and one third user data; the second data processing module is further used for processing the third user data in the input second-class sample pair;
before the inputting the plurality of prediction sample pairs into the data fusion model respectively, the method further comprises:
performing sample expansion according to the first-class sample pairs to generate third user data corresponding to the first user data in the first-class sample pairs, forming second-class sample pairs from each third user data and the corresponding first user data, and determining the true values of the second-class sample pairs;
and training the initial data fusion model by utilizing each first-class sample pair and each second-class sample pair in the training sample set.
8. The method of claim 7, wherein the training the initial data fusion model by utilizing each first-class sample pair and each second-class sample pair in the training sample set comprises:
inputting each first-class sample pair in the training sample set into the initial data fusion model to obtain a first model predicted value output by the initial data fusion model, wherein the first model predicted value is used for representing the similarity of the first user data and the second user data in the input first-class sample pair;
inputting each second-class sample pair in the training sample set into a shared data fusion model to obtain a second model predicted value output by the shared data fusion model, wherein the second model predicted value is used for representing the similarity of the first user data and the third user data in the input second-class sample pair, the shared data fusion model comprises two third cross-modal processing modules, and the third cross-modal processing modules have the same structure as, and share weights with, the second cross-modal processing module;
determining a first loss value based on the first model predicted value and the true values of the first-class sample pairs;
determining a second loss value based on the second model predicted value and the true values of the second-class sample pairs;
determining a weighted loss value according to the first loss value and the second loss value;
and adjusting structural parameters of the initial data fusion model based on the weighted loss value.
9. The method of claim 7, wherein the performing sample expansion according to the first-class sample pairs to generate third user data corresponding to the first user data in the first-class sample pairs, forming second-class sample pairs from each third user data and the corresponding first user data, and determining the true values of the second-class sample pairs comprises:
respectively counting the feature mean values of the positive-class sample pairs and the negative-class sample pairs among the plurality of first-class sample pairs, and determining a difference value set; wherein the true value of a positive-class sample pair is a first value, the true value of a negative-class sample pair is a second value, the first value represents that the sample pair corresponds to the same user, and the second value represents that the sample pair does not correspond to the same user; the difference value set comprises the difference value between the first user data in each positive-class sample pair and the feature mean value of the negative-class sample pairs, and the difference value between the first user data in each negative-class sample pair and the feature mean value of the positive-class sample pairs;
extracting third user data from a first Gaussian distribution and a second Gaussian distribution respectively, forming first sample pairs from the third user data extracted from the first Gaussian distribution and the corresponding first user data and determining the true values of the first sample pairs as the first value, and forming second sample pairs from the third user data extracted from the second Gaussian distribution and the corresponding first user data and determining the true values of the second sample pairs as the second value; wherein the second-class sample pairs comprise the first sample pairs and the second sample pairs; the first Gaussian distribution is constructed according to the feature mean value and covariance of the P first user data, among the first user data in the first-class sample pairs, that correspond to the P smallest difference values in the difference value set; the second Gaussian distribution is constructed according to the feature mean value and covariance of the Q first user data, among the first user data in the first-class sample pairs, that correspond to the Q largest difference values in the difference value set; and P and Q are positive integers.
10. A label prediction apparatus, comprising:
the first acquisition module is used for acquiring user data to be predicted, wherein the user data to be predicted belongs to first-class user data;
the first processing module is used for respectively forming a plurality of prediction sample pairs by the user data to be predicted and each second user data in the second type of user data acquired in advance;
the model prediction module is used for respectively inputting the plurality of prediction sample pairs into the data fusion model to obtain the similarity of each prediction sample pair output by the data fusion model, wherein the data fusion model is obtained by training a pre-built initial data fusion model by utilizing each sample pair in a pre-acquired training sample set, the training sample set comprises a plurality of sample pairs, each sample pair comprises two different types of user data, each sample pair is marked with a true value, and the true value is used for indicating whether the sample pair corresponds to the same user;
and the first determining module is used for determining the label of the user data to be predicted based on the similarity of each prediction sample pair and the label of each second user data.
11. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor; characterized in that the processor is configured to read the program in the memory to implement the steps in the label prediction method according to any one of claims 1 to 9.
12. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps in the label prediction method according to any one of claims 1 to 9.
CN202211585181.XA 2022-12-09 2022-12-09 Label prediction method and device and electronic equipment Pending CN116910341A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211585181.XA CN116910341A (en) 2022-12-09 2022-12-09 Label prediction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211585181.XA CN116910341A (en) 2022-12-09 2022-12-09 Label prediction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN116910341A true CN116910341A (en) 2023-10-20

Family

ID=88353635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211585181.XA Pending CN116910341A (en) 2022-12-09 2022-12-09 Label prediction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116910341A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118013201A (en) * 2024-03-07 2024-05-10 暨南大学 Flow anomaly detection method and system based on improved BERT fusion contrast learning

Similar Documents

Publication Publication Date Title
CN113822494B (en) Risk prediction method, device, equipment and storage medium
US12033083B2 (en) System and method for machine learning architecture for partially-observed multimodal data
EP3985578A1 (en) Method and system for automatically training machine learning model
CN110221965B (en) Test case generation method, test case generation device, test case testing method, test case testing device, test equipment and test system
CN111582409B (en) Training method of image tag classification network, image tag classification method and device
CN111695415A (en) Construction method and identification method of image identification model and related equipment
CN110598070B (en) Application type identification method and device, server and storage medium
CN111581966A (en) Context feature fusion aspect level emotion classification method and device
Esquivel et al. Spatio-temporal prediction of Baltimore crime events using CLSTM neural networks
CN113761250A (en) Model training method, merchant classification method and device
CN113822315A (en) Attribute graph processing method and device, electronic equipment and readable storage medium
CN113779225B (en) Training method of entity link model, entity link method and device
Mameli et al. Social media analytics system for action inspection on social networks
CN116910341A (en) Label prediction method and device and electronic equipment
CN114445121A (en) Advertisement click rate prediction model construction and advertisement click rate prediction method
CN115114462A (en) Model training method and device, multimedia recommendation method and device and storage medium
CN115129863A (en) Intention recognition method, device, equipment, storage medium and computer program product
CN114429178A (en) Method and device for generating remarkable label and storage medium
CN114612246A (en) Object set identification method and device, computer equipment and storage medium
CN115496175A (en) Newly-built edge node access evaluation method and device, terminal equipment and product
CN113010647A (en) Corpus processing model training method and device, storage medium and electronic equipment
CN116258574B (en) Mixed effect logistic regression-based default rate prediction method and system
CN116595978B (en) Object category identification method, device, storage medium and computer equipment
Yang et al. Predicting Passenger’s Public Transportation Travel Route Using Smart Card Data
CN113535847A (en) Method and device for classifying block chain addresses

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination