CN114090770B - Multi-stage unsupervised domain adaptive causal relationship identification method - Google Patents


Info

Publication number
CN114090770B
Authority
CN
China
Prior art keywords
training
data
stage
model
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111216820.0A
Other languages
Chinese (zh)
Other versions
CN114090770A (en)
Inventor
Li Jianjun
Zhou Yunfan
Yu Jie
Lu Qi
Li Shengyan
Li Xinfu
Tian Wanyong
Zhao Lulu
Hui Guobao
Tang Zheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
CETC 20 Research Institute
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202111216820.0A
Publication of CN114090770A
Application granted
Publication of CN114090770B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques with fixed number of clusters, e.g. K-means clustering
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-stage unsupervised domain adaptive causal relationship identification method. The method comprises the following steps: (1) data set partitioning; (2) pre-training with adaptive contrastive learning; (3) adversarial learning combined with knowledge distillation; (4) multi-stage data filtering to obtain a seed set; (5) single-stage data filtering to obtain a pseudo-label set; (6) obtaining subclass prototypes with k-means clustering; (7) introducing consistency through feature-level data augmentation; and (8) self-training with the filtered pseudo labels. The method acquires rich causal-relation knowledge from the source domain through adaptive contrastive learning, transfers that knowledge to the target domain through knowledge distillation and adversarial learning, obtains multi-level subclass prototypes and pseudo labels for unlabeled samples through data filtering, performs feature-level data augmentation with the prototypes to introduce a consistency loss, and finally self-trains on the target domain with the pseudo labels.

Description

Multi-stage unsupervised domain adaptive causal relationship identification method
Technical Field
The invention belongs to the field of natural language processing and relates to an unsupervised domain adaptive causal relationship identification method built on adaptive contrastive learning, adversarial transfer learning combined with knowledge distillation, and a self-training strategy based on feature-level data augmentation and a consistency loss.
Background
Causal relationship identification is the basis for reasoning and decision making. Projecting the causal relationships expressed in natural language into feature space lets a machine better understand its surroundings and provides key clues for downstream tasks such as logical reasoning and question answering. Traditional causal relationship identification methods lack background knowledge, struggle to understand context, and have little logical reasoning capability, so they can no longer meet the demand for intelligent reasoning and decision making in an era of information explosion. Causal relationship identification requires a model with background knowledge and the ability to understand context. Existing supervised methods depend heavily on the quantity and quality of annotated training data, yet in practice unlabeled data vastly outnumbers labeled data, especially in specialized domains where annotation is difficult and the data scale is large. Improving recognition and generalization under limited labeled data has therefore attracted wide attention. The unsupervised domain adaptive causal relationship identification task aims to learn from labeled source-domain data and transfer that knowledge to an unlabeled target domain, so as to identify causal relations well on target-domain data. Research on this problem is still scarce, and existing methods suffer from the following drawbacks: (a) generic unsupervised domain adaptation methods that simply pull source-domain and target-domain features together can cause catastrophic forgetting and blur the classification boundary, i.e., the model loses its classification ability during transfer; (b) existing methods cannot actively learn target-domain knowledge, which sharply limits their improvement and degrades recognition when the source-domain knowledge or data scale is insufficient; (c) existing pseudo-label methods easily introduce noise, which greatly harms the model's recognition ability.
Research shows that causal relations are richly diverse: in feature space they can be subdivided into finer subclass prototypes, and exploiting this diversity effectively improves the model's recognition and generalization. By definition, causality can indeed be subdivided, e.g., temporal causality, conditional causality, and so on, so these subclass prototypes carry physical meaning. Using this fact to augment data at the feature level, and further introducing consistency and self-training, lets the model actively learn target-domain knowledge and sharpen the classification boundary.
Disclosure of Invention
To address the drawbacks of existing methods, the invention provides a multi-stage fused unsupervised domain adaptive causal relationship identification method. The method comprises a pre-training stage, an adversarial stage, and a consistency-adjustment stage: the pre-training stage learns the knowledge in the source-domain data, the adversarial stage transfers the source-domain knowledge to the target domain, and the consistency-adjustment stage actively learns knowledge specific to the target-domain data by combining data filtering, data augmentation, and consistency.
To achieve the above purpose, the invention is realized through the following technical scheme:
Step 1, data set division: randomly divide the source-domain data three times, then pass each of the three resulting source-domain data sets through the pre-training stage and the adversarial stage to obtain three target-domain models;
Step 2, adaptive contrastive learning: the purpose of the adaptive contrastive learning in the pre-training stage is to obtain sufficient inter-class spacing while keeping a proper intra-class spacing that preserves intra-class diversity;
Step 3, adversarial learning combined with knowledge distillation: add knowledge distillation to generic adversarial transfer learning so that the model retains its classification ability on source-domain data and catastrophic forgetting is avoided;
Step 4, multi-stage data filtering: use the three target-domain models obtained in the first three steps to produce a relatively clean target-domain seed set through pseudo-label assignment, a voting mechanism, confidence screening, and uncertainty screening;
Step 5, single-stage data filtering: after each training round of the consistency-adjustment stage, use a confidence threshold to screen samples outside the seed set whose confidence exceeds the threshold, and assign them pseudo labels;
Step 6, obtaining subclass prototypes: for the data sets obtained by multi-stage and single-stage data filtering, obtain subclass prototypes of the causal and non-causal classes with a k-means clustering method;
Step 7, feature-level data augmentation and consistency loss: use the obtained prototypes to augment the features of an input sample at the feature level, where the new feature vector and the original feature vector must share the same causal category;
Step 8, self-training: train the model with the pseudo labels obtained in steps 4 and 5, weighted by a mapping of each pseudo label's confidence, until the model converges.
Further, each source-domain data set obtained by the division in step 1 consists of a 60% training set, a 20% test set, and a 20% validation set, and sufficient randomness must be ensured across the three divisions.
Further, step 2 is implemented as follows:
2-1. Tokenize the input natural language text and project it into 768-dimensional features with a BERT encoder;
2-2. Store the generated feature vectors, cluster them to obtain the class centers of the causal and non-causal classes, and compute the average distance from all samples to their class centers;
2-3. Use the average distance as the hyperparameter of the contrastive loss and train the model until it converges.
Further, the specific steps of the adversarial learning combined with knowledge distillation in step 3 are:
3-1. Encode the source-domain data and the target-domain data into feature vectors with the model trained in 2-3;
3-2. Use a discriminator to judge whether a feature vector from step 3-1 comes from the target domain, and compute the adversarial loss;
3-3. Compute the similarity between target-domain feature vectors and source-domain features as the distillation loss;
3-4. Train the model with the adversarial loss of 3-2 and the distillation loss of 3-3 until it converges.
Further, the specific steps of the multi-stage data filtering in step 4 are:
4-1. The three models trained on the different source-domain data sets assign pseudo labels to the target-domain data set; a sample passes the voting mechanism only when all three models assign it the same pseudo label;
4-2. Obtain each sample's confidence from the Softmax output and keep the 20% of samples with the highest confidence for the next screening;
4-3. With the dropout layers enabled, the model repeatedly computes the probability that the same sample expresses a causal relation; all outputs are stored, the uncertainty is computed as their mean squared deviation, and the samples with uncertainty below 0.01 form the seed set.
Further, the feature-level data augmentation and consistency loss in step 7 are implemented as follows:
7-1. From the subclass prototypes obtained in step 6, select the prototype closest to the input feature in Euclidean distance;
7-2. Compute the similarity between that prototype and the input feature;
7-3. Using the similarity from 7-2 as a weight, superpose the prototype onto the input feature, and obtain the augmented feature vector through a fully connected mapping layer;
7-4. Compute the corresponding consistency loss from the consistency between the augmented feature vector and the input feature vector.
Further, the self-training in step 8 is implemented as follows:
8-1. Obtain the current model's confidence on the current sample from the Softmax layer;
8-2. Using the pseudo labels obtained by multi-stage and single-stage data filtering as targets, compute the cross-entropy loss weighted by the confidence from 8-1;
8-3. Train the model with the cross-entropy loss and the consistency loss of 7-4 until it converges.
Tests show that the invention has the following beneficial effects:
1. Adaptive contrastive learning effectively enlarges the inter-class spacing on both the source and target domains, yielding a clearer classification boundary while preserving the diversity of the texts.
2. Adversarial learning combined with knowledge distillation effectively avoids catastrophic forgetting, reducing the damage the adversarial process does to the classification boundary.
3. Exploiting the fact that both the causal and non-causal classes contain multiple subclasses, feature-level data augmentation makes full use of the diversity of natural language and introduces consistency. Consistency further sharpens the classification boundary by filling blank regions of the feature space.
4. The pseudo-label-based self-training strategy actively learns target-domain knowledge, and combining single-stage and multi-stage filtering greatly reduces noise.
Drawings
FIG. 1 is a flow chart of an unsupervised domain adaptive causal relationship identification method of the present invention.
FIG. 2 is a flow chart of the multi-stage data filtering of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings.
The method is used for unsupervised domain adaptive causal relationship identification. FIG. 1 is a flow chart of the unsupervised domain adaptive causal relationship identification method of the present invention. Referring to FIG. 1, the method provided by the present invention specifically comprises:
step 1: the source domain data set is partitioned. Three random divisions are carried out on the source domain data set to obtain three groups of source domain data sets, and the original data set is divided into a training set, a testing set and a verification set by using the probability of 60. Repeating the following steps 2 and 3 on the three sets of source domain data sets, three sets of models of the target domain can be obtained.
Step 2: pre-training based on comparative learning is performed. The method comprises the steps of training a source domain model by using a source domain data set, encouraging samples in different classes to have enough characteristic distance to obtain clearer classification boundaries in the training process, and encouraging the samples in the same class to keep proper characteristic distance to keep diversity of natural language texts, wherein the inner distance is the average distance from the sample in the class to the clustering center of the sample in the class. The pre-training based on the self-adaptive contrast learning comprises the steps of firstly dividing an input natural language text into tokens, and projecting the text into 768-dimensional characteristic vectors by utilizing a bert encoder model in a source domain model; storing the generated characteristic vectors, clustering to obtain class centers of causal relation and non-causal relation, and calculating the average distance from all samples to the class centers; and finally, calculating the contrast loss by taking the average distance as the hyper-parameter of the contrast loss, calculating the supervised cross entropy loss by taking the source domain label as the target, and training the source domain model by using the contrast loss and the cross entropy loss until the source domain model converges.
Step 3: adversarial learning combined with knowledge distillation. The model from step 2 serves as the feature extraction network and plays against a discriminator network that judges whether an input feature vector comes from the target domain; the adversarial loss is computed from this game. The similarity between target-domain feature vectors and source-domain features is computed as the distillation loss. To prevent the model from losing its ability to identify causal relations during the adversarial process and to limit the damage to its classification boundary, the knowledge distillation loss is added on top of the adversarial learning so that the model's classification ability on source-domain data is retained. The model is trained with the adversarial loss and the distillation loss until convergence.
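A sketch of the two losses of this stage follows (PyTorch; the discriminator architecture, cosine similarity as the "similarity", and pairing the adapting encoder with a frozen copy of the pre-trained model for distillation are all assumptions where the translation is ambiguous):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

discriminator = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def adversarial_and_distillation_losses(src_feat, tgt_feat, src_feat_frozen):
    """src_feat / tgt_feat: features from the adapting encoder;
    src_feat_frozen: source features from a frozen copy of the
    pre-trained model, used as the distillation reference."""
    # Adversarial loss: the discriminator separates target features (1)
    # from source features (0); the encoder is trained to fool it.
    logits = discriminator(torch.cat([src_feat, tgt_feat])).squeeze(-1)
    domain = torch.cat([torch.zeros(len(src_feat)), torch.ones(len(tgt_feat))])
    adv_loss = F.binary_cross_entropy_with_logits(logits, domain)
    # Distillation loss: keep the adapting encoder's features similar to
    # the frozen pre-trained ones, preserving source-domain classification
    # ability and preventing catastrophic forgetting.
    distill_loss = 1 - F.cosine_similarity(src_feat, src_feat_frozen.detach()).mean()
    return adv_loss, distill_loss
```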
and 4, step 4: and performing multi-stage data filtering. FIG. 2 is a flow chart of the multi-stage data filtering of the present invention. As shown in fig. 2, the multi-stage data filtering performs pseudo label allocation, voting mechanism screening, confidence screening, and uncertainty screening in sequence, so as to obtain a set of relatively clean target domain seed sets. Specifically, the three sets of target domain models obtained in steps 1 to 3 respectively allocate pseudo labels to target domain data, and only when the three models allocate the same pseudo label to the same sample, the sample is screened through a voting mechanism. Then, the 20% sample with the highest confidence coefficient is selected from the selected samples for uncertainty screening. The uncertainty screening is to start a dropout layer in the testing process, repeat the testing for 10 times, calculate the mean square deviation of 10 times of output, remove the sample with the uncertainty higher than 0.01 if the mean square deviation is larger, and obtain the final seed set, wherein each seed set corresponds to a fixed pseudo label;
and 5: single-stage data filtering is performed. The pseudo label is assigned by using a confidence threshold value for other target domain data not included in the seed set, and the pseudo label is different from the fixed pseudo label of the seed set, and needs to be reassigned in each training. To prevent the pseudo-tags from introducing too much noise, additional confidence weights need to be introduced in the training process.
Step 6: obtain the subclass prototypes. Subclass prototypes of the causal and non-causal classes are obtained with a k-means clustering method from the seed set of step 4 and the pseudo-label set of step 5; the prototypes derived from the pseudo-label set are continuously updated as that set is updated.
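A sketch with scikit-learn's KMeans (the number of subclasses per class is left open by the patent, so k here is an assumption):

```python
import torch
from sklearn.cluster import KMeans

def subclass_prototypes(feats, labels, k=4):
    """Cluster each class (0 = non-causal, 1 = causal) into k subclass
    prototypes; returns a (2k, 768) tensor of prototype centers."""
    protos = []
    for c in (0, 1):
        class_feats = feats[labels == c].cpu().numpy()
        km = KMeans(n_clusters=k, n_init=10).fit(class_feats)
        protos.append(torch.tensor(km.cluster_centers_, dtype=feats.dtype))
    return torch.cat(protos)
```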
Step 7: feature-level data augmentation. The obtained subclass prototypes augment the features of an input sample at the feature level: the similarity between the input feature and its nearest subclass prototype is computed in Softmax fashion, and the prototype center is weighted onto the input feature by that similarity. Since the augmented feature and the original input feature should share the same causal category, this consistency introduces a consistency loss used to further train the model.
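The sketch below shows one reading of the augmentation and the consistency term (softmax over negative distances as the "Softmax mode", and a KL divergence between the two predictive distributions as the consistency loss, are both assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

proj = nn.Linear(768, 768)  # the fully connected mapping layer

def feature_level_augment(feat, prototypes):
    """Superpose each sample's nearest subclass prototype onto its feature,
    weighted by a softmax-derived similarity."""
    d = torch.cdist(feat, prototypes)                   # Euclidean distances
    nearest = prototypes[d.argmin(-1)]                  # closest prototype
    sim = torch.softmax(-d, dim=-1).max(-1).values.unsqueeze(-1)
    return proj(feat + sim * nearest)

def consistency_loss(classifier, feat, aug_feat):
    """The augmented feature must keep the original causal category, so the
    two predictive distributions are pulled together."""
    p = classifier(feat).softmax(-1).detach()
    log_q = classifier(aug_feat).log_softmax(-1)
    return F.kl_div(log_q, p, reduction="batchmean")
```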
Step 8: self-training with consistency and pseudo labels. The target-domain data is self-trained using the pseudo labels from steps 4 and 5 together with the additional confidence weights. Self-training comprises: obtaining from the Softmax layer the current model's confidence on the current sample as the additional weight; computing the cross-entropy loss against the pseudo labels from the multi-stage and single-stage filtering, weighted by that confidence; and training the model with the cross-entropy loss and the aforementioned consistency loss until convergence.
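Putting the two terms of this stage together (the trade-off weight lam is an assumption; the patent does not state one):

```python
import torch
import torch.nn.functional as F

def self_training_loss(classifier, feats, aug_feats, pseudo_labels, conf_w, lam=1.0):
    """Confidence-weighted cross entropy on the pseudo labels plus the
    consistency term between original and augmented features."""
    # Pseudo-label term: each sample's loss is scaled by the confidence of
    # its pseudo label, damping the influence of noisy labels.
    ce = F.cross_entropy(classifier(feats), pseudo_labels, reduction="none")
    pseudo_term = (conf_w * ce).mean()
    # Consistency term: the augmented feature must predict the same class.
    p = classifier(feats).softmax(-1).detach()
    log_q = classifier(aug_feats).log_softmax(-1)
    cons_term = F.kl_div(log_q, p, reduction="batchmean")
    return pseudo_term + lam * cons_term
```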
Experimental results show that the F1 score of the unsupervised domain adaptation method improves by 5.5% on average on sentence-level causal relationship identification and by 9.5% on average on event-level causal relationship identification. F1, the harmonic mean of precision P and recall R (F1 = 2PR / (P + R)), measures causal relationship identification ability in the invention.

Claims (5)

1. A multi-stage unsupervised domain adaptive causal relationship identification method, comprising the steps of:
step 1, data set division: randomly dividing the source-domain data three times, then obtaining three target-domain models from the three resulting source-domain data sets through steps 2 and 3;
step 2, adaptive contrastive learning: the purpose of the adaptive contrastive learning is to obtain sufficient inter-class spacing while maintaining a proper intra-class spacing that preserves intra-class diversity;
2-1. tokenizing the input natural language text and projecting it into 768-dimensional features with a BERT encoder;
2-2. storing the generated feature vectors, clustering them to obtain the class centers of the causal and non-causal classes, and computing the average distance from all samples to the class centers;
2-3. training the model with the average distance as the hyperparameter of the contrastive loss until the model converges;
step 3, adversarial learning combined with knowledge distillation:
3-1. encoding the source-domain data and the target-domain data into feature vectors with the model trained in 2-3;
3-2. using a discriminator to judge whether a feature vector obtained in step 3-1 comes from the target domain, and computing the adversarial loss;
3-3. computing the similarity between target-domain feature vectors and source-domain features as the distillation loss;
3-4. training the model with the adversarial loss of 3-2 and the distillation loss of 3-3 until the model converges;
step 4, multi-stage data filtering: using the three target-domain models obtained in the first three steps to produce a relatively clean target-domain seed set through pseudo-label assignment, a voting mechanism, confidence screening, and uncertainty screening;
step 5, single-stage data filtering: using a confidence threshold to screen samples outside the seed set whose confidence exceeds the threshold, and assigning them pseudo labels;
step 6, obtaining subclass prototypes: for the data sets obtained by the multi-stage and single-stage data filtering, obtaining subclass prototypes of the causal and non-causal classes with a k-means clustering method;
step 7, feature-level data augmentation and consistency loss: using the obtained prototypes to augment the features of an input sample at the feature level, wherein the new feature vector and the original feature vector share the same causal category;
step 8, self-training: training the model with the pseudo labels obtained in steps 4 and 5, weighted by a mapping of each pseudo label's confidence, until the model converges.
2. The multi-stage unsupervised domain adaptive causal relationship identification method according to claim 1, wherein each source-domain data set obtained by the division in step 1 consists of a 60% training set, a 20% test set, and a 20% validation set, and sufficient randomness is ensured across the three divisions.
3. The multi-stage unsupervised domain adaptive causal relationship identification method according to claim 2, wherein the specific steps of the multi-stage data filtering in step 4 are:
4-1. the three models trained on the different source-domain data sets assign pseudo labels to the target-domain data set, and a sample passes the voting mechanism only when the pseudo labels assigned by the three models agree;
4-2. the confidence of each sample is obtained from the Softmax output, and the 20% of samples with the highest confidence are kept for the next screening;
4-3. with the dropout layers enabled, the model repeatedly computes the probability that the same sample expresses a causal relation, all outputs are stored, the uncertainty is computed as their mean squared deviation, and the samples with uncertainty below 0.01 are selected to form the seed set.
4. The multi-stage unsupervised domain adaptive causal relationship identification method according to claim 3, wherein the steps of the feature-level data augmentation and consistency loss in step 7 are:
7-1. selecting, from the subclass prototypes obtained in step 6, the prototype closest to the input feature in Euclidean distance;
7-2. computing the similarity between the prototype and the input feature;
7-3. superposing the prototype onto the input feature with the similarity from 7-2 as the weight, and obtaining the augmented feature vector through a fully connected mapping layer;
7-4. computing the corresponding consistency loss from the consistency between the augmented feature vector and the input feature vector.
5. The multi-stage unsupervised domain adaptive causal relationship identification method according to claim 4, wherein the specific steps of the self-training in step 8 are:
8-1. obtaining the current model's confidence on the current sample from the Softmax layer;
8-2. computing the cross-entropy loss with the pseudo labels obtained by the multi-stage and single-stage data filtering as targets, weighted by the confidence from 8-1;
8-3. training the model with the cross-entropy loss and the consistency loss of 7-4 until the model converges.
CN202111216820.0A 2021-10-19 2021-10-19 Multi-stage unsupervised domain adaptive causal relationship identification method Active CN114090770B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111216820.0A CN114090770B (en) 2021-10-19 2021-10-19 Multi-stage unsupervised domain adaptive causal relationship identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111216820.0A CN114090770B (en) 2021-10-19 2021-10-19 Multi-stage unsupervised domain adaptive causal relationship identification method

Publications (2)

Publication Number Publication Date
CN114090770A (en) 2022-02-25
CN114090770B (en) 2022-11-04

Family

ID=80297220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111216820.0A Active CN114090770B (en) 2021-10-19 2021-10-19 Multi-stage unsupervised domain adaptive causal relationship identification method

Country Status (1)

Country Link
CN (1) CN114090770B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080749B (en) * 2022-08-16 2022-11-08 之江实验室 Weak supervision text classification method, system and device based on self-supervision training
CN116341650B (en) * 2023-03-23 2023-12-26 哈尔滨市科佳通用机电股份有限公司 Noise self-training-based railway wagon bolt loss detection method
CN117688472B (en) * 2023-12-13 2024-05-24 华东师范大学 Unsupervised domain adaptive multivariate time sequence classification method based on causal structure

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414368A (en) * 2019-07-04 2019-11-05 华中科技大学 A kind of unsupervised pedestrian recognition methods again of knowledge based distillation
CN112949786A (en) * 2021-05-17 2021-06-11 腾讯科技(深圳)有限公司 Data classification identification method, device, equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Knowledge Transfer in Object Recognition; Xu Lan; ResearchGate; 2020-10-24; full text *
Research on Text Generation Technology for Structured Data; Chen Yuyu; China Masters' Theses Full-text Database, Information Science and Technology; 2021-01-15; full text *

Also Published As

Publication number Publication date
CN114090770A (en) 2022-02-25

Similar Documents

Publication Publication Date Title
CN114090770B (en) Multi-stage unsupervised domain adaptive causal relationship identification method
Kong et al. Audio set classification with attention model: A probabilistic perspective
Belouadah et al. Scail: Classifier weights scaling for class incremental learning
US8626682B2 (en) Automatic data cleaning for machine learning classifiers
CN113657561B (en) Semi-supervised night image classification method based on multi-task decoupling learning
CN109325516B (en) Image classification-oriented ensemble learning method and device
CN103544506A (en) Method and device for classifying images on basis of convolutional neural network
CN113469186B (en) Cross-domain migration image segmentation method based on small number of point labels
CN113569895A (en) Image processing model training method, processing method, device, equipment and medium
US20230162215A1 (en) Methods and apparatus to perform multi-level hierarchical demographic classification
CN108921342A (en) A kind of logistics customer churn prediction method, medium and system
CN114722805B (en) Little sample emotion classification method based on size instructor knowledge distillation
CN110866113A (en) Text classification method based on sparse self-attention mechanism fine-tuning Bert model
CN108364073A (en) A kind of Multi-label learning method
CN115439685A (en) Small sample image data set dividing method and computer readable storage medium
CN113222035A (en) Multi-class imbalance fault classification method based on reinforcement learning and knowledge distillation
KR20200082490A (en) Method for selecting machine learning training data and apparatus therefor
CN116561614A (en) Small sample data processing system based on meta learning
CN114511023A (en) Classification model training method and classification method
CN114239732A (en) Ordered classification label determination method and device, electronic equipment and storage medium
de Lima et al. Deep semi‐supervised classification based in deep clustering and cross‐entropy
Aoki et al. Adaptive synapse arrangement in cortical learning algorithm
CN113435190B (en) Chapter relation extraction method integrating multilevel information extraction and noise reduction
CN113222018B (en) Image classification method
CN114116969A (en) Corpus screening method based on multiple loss fusion text classification model results

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Zhao Lulu, Tang Zheng, Zhou Yunfan, Yu Jie, Lu Qi, Li Shengyan, Li Xinfu, Tian Wanyong, Li Jianjun, Hui Guobao

Inventor before: Li Jianjun, Tang Zheng, Zhou Yunfan, Yu Jie, Lu Qi, Li Shengyan, Li Xinfu, Tian Wanyong, Zhao Lulu, Hui Guobao

TR01 Transfer of patent right

Effective date of registration: 2024-06-27
Address after: No. 1 Baisha Road, Yanta District, Xi'an City, Shaanxi Province
Patentee after: The 20th Research Institute of China Electronics Technology Group Corporation
Patentee after: HANGZHOU DIANZI University
Country or region after: China
Address before: 310018 No. 2 Street, Xiasha Higher Education Zone, Hangzhou, Zhejiang
Patentee before: HANGZHOU DIANZI University
Country or region before: China