CN113377884A - Event corpus purification method based on multi-agent reinforcement learning - Google Patents
- Publication number
- CN113377884A CN113377884A CN202110773927.9A CN202110773927A CN113377884A CN 113377884 A CN113377884 A CN 113377884A CN 202110773927 A CN202110773927 A CN 202110773927A CN 113377884 A CN113377884 A CN 113377884A
- Authority
- CN
- China
- Prior art keywords
- data
- training
- agent
- reinforcement learning
- model
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to an event corpus purification method based on multi-agent reinforcement learning. Before model training begins, the environment and the agents are initialized and reset, and the corresponding training parameters are set. The agents produce the data required for training by executing corresponding purification optimization actions in the environment; the data are sampled and stored in a data buffer for subsequent training. When the amount of data in the buffer reaches a set value, the real networks of all agents are trained and updated using that data. After the real networks have been updated, the target networks of all agents are updated by a method of intermittent parameter replication. These steps are repeated until the number of training iterations reaches a preset value. By purifying and optimizing the labeled data, the method alleviates the problem of data label noise in the training of the sequence labeling joint extraction model and improves the performance of the event entity-relationship joint extraction task.
Description
Technical Field
The invention relates to the field of multi-agent reinforcement learning methods, in particular to an event corpus purification method based on multi-agent reinforcement learning.
Background
Reinforcement learning (RL) is a machine learning method. According to the number of agents involved, it can be divided into single-agent reinforcement learning and multi-agent reinforcement learning (MARL); MARL has wider application scenarios and is a key tool for solving many real-world problems. According to the task relationships among the agents, multi-agent reinforcement learning can be divided into fully cooperative tasks, fully competitive tasks, and mixed tasks; here we consider only fully cooperative tasks.
In multi-agent reinforcement learning training under a fully cooperative task, the agents aim to maximize a joint reward: each agent selects actions according to its own policy, executes them in the environment to obtain the corresponding reward and feedback, and uses that feedback to update its policy. These steps repeat until the joint reward converges to its maximum, at which point each agent has reached the optimal policy for the current environment.
At present, the MADDPG (Multi-Agent Deep Deterministic Policy Gradient) algorithm is one of the leading reinforcement learning methods for multi-agent environments. It addresses the difficulty of applying traditional value-based algorithms (such as DQN) in continuous action spaces, improves the training efficiency of the traditional policy-based algorithm (DPG) by incorporating deep learning, and further improves training through an experience replay pool and the "centralized training, distributed execution" mechanism.
However, MADDPG still suffers from poor exploration of the joint solution space and from sub-optimality: in a multi-agent reinforcement learning environment, the size of the joint policy space grows exponentially with the number of agents, so the completeness of each agent's exploration of the policy space decreases during training, the training result tends to converge to a globally sub-optimal solution, and a better training effect cannot be achieved.
Entity relation extraction refers to simultaneously detecting entity mentions in unstructured text and identifying the semantic relationships between them. Traditional entity relation extraction methods handle this task serially, i.e., entities are extracted first and their relationships are identified afterwards. The serial approach is simple, and the two subtasks are independent and flexible, each corresponding to a sub-model, but the correlation between the two subtasks is ignored.
Joint entity-relation extraction combines entity recognition and relation extraction in a single model. It can effectively integrate entity information and relation information and achieves better results than serial entity relation extraction, but since entities and relations still have to be extracted separately in essence, the model generates additional redundant information.
To solve the problem that joint entity-relation extraction models generate additional redundant information, research has proposed converting the joint extraction task into a tagging problem: by designing tags that carry relation information, entities and their relations are extracted directly by a sequence labeling model, without identifying entities and relations separately.
The sequence labeling joint extraction model is an efficient joint event extraction model, but its training requires a large amount of high-quality labeled data. Distant supervision can effectively automate data labeling, but it rests on the assumption that if two entities have a relationship in a given corpus, every sentence containing both entities expresses that relationship. As a result, the large labeled data sets produced by distant supervision suffer from label noise, which can adversely affect the joint extraction model.
Disclosure of Invention
Aiming at the technical problems, the invention provides an event corpus purification method based on multi-agent reinforcement learning.
In order to solve the problems in the prior art, the invention provides an event corpus purification method based on multi-agent reinforcement learning, which comprises the following steps:
before the model training begins, the environment and the intelligent agent need to be initialized and reset, and corresponding training parameters are set;
the intelligent agent forms a series of data required by training by executing corresponding purification optimization actions in the environment, samples the data and stores the data in a data cache region for subsequent training;
when the data quantity in the data cache region reaches a set value, training and updating the real networks of all the agents by using the data;
after the real networks have been updated, updating the target networks of all the agents by a method of intermittent parameter replication;
and repeating the steps until the training times reach the preset training times.
Preferably, before the model training is started, the environment and the agent need to be initialized and reset, and setting the corresponding training parameters specifically includes: and performing data preprocessing on the event corpus, and inputting the corpus as an environment parameter of the multi-agent reinforcement learning model.
Preferably, the agent forms a series of data required for training by performing corresponding refining optimization actions in the environment, samples the data and stores the data in a data buffer, so as to be used for subsequent training, specifically including:
the multi-agent reinforcement learning model generates an action set of the agent group according to the input environment parameters;
the intelligent agent group executes the action set, selects corresponding event knowledge from the corpus and forms an event knowledge set;
mapping the event knowledge set into a word vector, and inputting the word vector into a sequence labeling joint model;
and the sequence labeling joint model labels the input word vectors and compares the result with the test set, so as to verify the event purification effect of the current multi-agent reinforcement learning model and output an evaluation index.
Preferably, when the number of data in the data buffer reaches a set value, starting to train and update the real networks of all the agents using the data specifically includes:
and converting the evaluation index into a reward value according to a preset reward function, and feeding the reward value back to the training of the multi-agent reinforcement learning model so as to optimize the model.
Preferably, after the updating of the real networks is completed, the updating of the target networks of all the agents by the method of intermittent parameter replication further includes:
and extracting network parameters of each layer of each agent as parameter vectors, subtracting the parameter vectors of each layer one by one to obtain pairwise parameter vector differences among the multiple agents, multiplying the parameter vector differences by differentiation factors, and feeding back the multiplied parameter vector differences to the updated agents to finish the final updating of the agents.
Compared with the prior art, the event corpus purification method based on multi-agent reinforcement learning has the following beneficial effects:
1. based on the multi-agent reinforcement learning environment, the invention studies and improves the degree of exploration of the joint policy space, improves the training effect of the multi-agent reinforcement learning algorithm, and optimizes the multi-agent reinforcement learning model;
2. the invention extracts the parameters of each agent's sub-networks to form parameter vectors representing the agents' policies; by maximizing the differences between these parameter vectors, the repetitiveness of policy exploration among agents is reduced and the degree of exploration of the joint policy solution space is increased;
3. based on the optimized multi-agent reinforcement learning model, the invention purifies and optimizes the labeled data, thereby solving the problem of data label noise in the training of the sequence labeling joint extraction model and improving the performance of the event entity-relationship joint extraction task.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart of an event corpus purification method based on multi-agent reinforcement learning according to an embodiment of the present invention.
Fig. 2 is a training flowchart of an event corpus purification method based on multi-agent reinforcement learning according to an embodiment of the present invention.
Fig. 3 is a flowchart of a data sampling portion of a multi-agent reinforcement learning-based event corpus purification method according to an embodiment of the present invention.
Fig. 4 is a network updating flow chart of the multi-agent reinforcement learning-based event corpus purification method according to the embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a sequence annotation model of an event corpus purification method based on multi-agent reinforcement learning according to an embodiment of the present invention.
FIG. 6 is another flowchart of an event corpus refining method based on multi-agent reinforcement learning according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in FIG. 1, the present invention provides a method for purifying an event corpus based on multi-agent reinforcement learning, which comprises
S1, before model training begins, the environment and the agents need to be initialized and reset, and corresponding training parameters are set;
S2, the agents form a series of data needed for training by executing corresponding purification optimization actions in the environment, and the data are sampled and stored in a data buffer for subsequent training;
S3, when the amount of data in the data buffer reaches a set value, the real networks of all the agents are trained and updated using the data;
S4, after the real networks have been updated, the target networks of all the agents are updated by a method of intermittent parameter replication;
S5, the above steps are repeated until the number of training iterations reaches the preset number.
The invention provides an event corpus purification method based on multi-agent reinforcement learning, which mainly comprises the following steps: the method comprises four parts of training environment and parameter initialization, data sampling, real network training and target network updating.
The invention mainly consists of two models, namely: the method comprises a multi-agent reinforcement learning model based on a neural network parameter vector difference maximization strategy search method and a sequence labeling combined model based on a Bi-LSTM-CRF structure.
In the invention, the sequence labeling module is used as an effect verification and reward feedback part of a corpus purification model, and the selected model structure is shown in figure 5 and mainly comprises two layers, namely: a Bi-LSTM layer and a CRF layer.
The Bi-LSTM model performs excellently in sequence labeling tasks: it can effectively exploit long-range context information and has a neural network's capacity to fit nonlinear data. However, because its optimization target is to find the most probable label at each time step and then assemble those labels into a sequence, the output label sequence is often inconsistent.
The CRF model complements the strengths and weaknesses of the Bi-LSTM model to a certain extent. Its advantage is that it scans the whole input text through feature templates and therefore gives more consideration to linear weighted combinations of local features across the text; its optimization target is the sequence with the highest overall probability rather than the most probable label at each position. Its disadvantages are, first, that selecting feature templates requires some prior knowledge of the training corpus: the features that strongly influence labeling must be derived from statistics of the corpus, too many features cause overfitting while too few cause underfitting, and combining features is a difficult task; second, the CRF model is limited during training by the window size specified by the feature template, making long-range context information hard to take into account.
Given the complementary strengths and weaknesses of the two models, the combined Bi-LSTM-CRF model is selected: a linear CRF layer is added on top of the hidden layer of the traditional Bi-LSTM model. It serves as the sequence labeling module of the invention to verify the training effect of the corpus purification model, and the training result is fed back into the training of the corpus purification model to optimize it.
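The difference between the two optimization targets — the most probable label at each position versus the most probable label sequence — can be sketched with a minimal Viterbi decoder over emission and transition scores. The tag set (B, I, O) and all scores below are invented for illustration, not taken from the invention:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence given per-position emission
    scores (T x K) and pairwise transition scores (K x K)."""
    T, K = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag at t=0
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        # cand[i, j]: best score of a path ending with tag i at t-1, tag j at t
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]         # best final tag, then follow backpointers
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Tags: 0 = B, 1 = I, 2 = O. The transition row for O strongly penalises
# an O -> I jump, so sequence-level decoding avoids inconsistent output.
emissions = np.array([[2.0, 1.0, 0.0],
                      [0.0, 1.0, 1.1],
                      [0.0, 2.0, 1.9]])
transitions = np.array([[0.0,  1.0, 0.0],
                        [0.0,  1.0, 0.0],
                        [0.0, -5.0, 0.0]])
# Per-position argmax gives [0, 2, 1] (B, O, I) -- an inconsistent sequence;
# Viterbi returns [0, 1, 1] (B, I, I), the consistent globally-best sequence.
print(viterbi_decode(emissions, transitions))
```

This is the behaviour the CRF layer contributes on top of the Bi-LSTM emissions: the decoded labels are chosen jointly rather than independently per position.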
As shown in fig. 5, before the model training starts, the environment and the agent need to be initialized and reset, and the setting of the corresponding training parameters specifically includes: and performing data preprocessing on the event corpus, and inputting the corpus as an environment parameter of the multi-agent reinforcement learning model.
As shown in fig. 6, the agent forms a series of data required for training by performing corresponding refinement optimization actions in the environment, samples the data and stores the data in a data buffer, so as to be used for subsequent training, specifically including:
the multi-agent reinforcement learning model generates an action set of the agent group according to the input environment parameters;
the intelligent agent group executes the action set, selects corresponding event knowledge from the corpus and forms an event knowledge set;
mapping the event knowledge set into a word vector, and inputting the word vector into a sequence labeling joint model;
and the sequence labeling joint model labels the input word vectors and compares the result with the test set, so as to verify the event purification effect of the current multi-agent reinforcement learning model and output an evaluation index.
As shown in fig. 6, when the number of data in the data buffer reaches a set value, the starting of training and updating the real network of all the agents using the data specifically includes:
and converting the evaluation index into a reward value according to a preset reward function, and feeding the reward value back to the training of the multi-agent reinforcement learning model so as to optimize the model.
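The patent does not spell out the reward function. A minimal sketch, assuming the reward is the scaled improvement of the labeling F1 score between sampling rounds (the function names and the scale factor are hypothetical):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, a common evaluation index."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def reward_from_eval(curr_f1, prev_f1, scale=10.0):
    """Hypothetical reward: the joint reward grows when the purified corpus
    improves the sequence labeling model's F1 over the previous round."""
    return scale * (curr_f1 - prev_f1)

# A purification action set that raises F1 from 0.70 to 0.74 earns a
# positive joint reward; one that lowers F1 is penalized.
r = reward_from_eval(0.74, 0.70)
```

Any monotone mapping from the evaluation index to a scalar reward would fit the description; the delta form above simply rewards improvement rather than absolute quality.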
Data sampling and agent network updates are detailed as follows:
as shown in fig. 2, the detailed steps of data sampling are as follows:
step 1-1: initializing the sampling process parameters: the maximum data storage capacity max-episode-length, and the counter of sampled and stored data t = 1;
step 1-2: acquiring the state X of the current environment, where X is a vector formed by a series of environment parameters;
step 1-3: each agent Agent i takes the environment state X as input and generates an action Ai through its real Actor network; the actions selected by all agents form the action group A = (A1, A2, …, An);
step 1-4: all agents perform their respective actions in the current environment, namely: in the environment state X, the action group A = (A1, A2, …, An) is executed, a new environment state X′ is obtained, and the joint reward value R is obtained at the same time;
step 1-5: the complete data tuple (X, A, R, X′) is obtained and stored in the data cache pool D;
step 1-6: the current environment state is updated: X′ → X;
step 1-7: the above steps are executed until the amount of data replaced in the data cache pool D reaches the maximum data storage capacity, namely: when t > max-episode-length, data sampling ends and learning begins.
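Steps 1-1 to 1-7 can be sketched as a generic sampling loop. The environment and agent interfaces below are illustrative stand-ins for the event corpus environment and the purification agents, not the invention's concrete implementation:

```python
import random

class ReplayBuffer:
    """The data cache pool D; capacity plays the role of max-episode-length."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
    def add(self, transition):
        if len(self.data) >= self.capacity:
            self.data.pop(0)                         # overwrite the oldest tuple
        self.data.append(transition)
    def sample(self, minibatch_size):
        return random.sample(self.data, minibatch_size)
    def __len__(self):
        return len(self.data)

def sample_phase(env, agents, buffer, max_steps):
    """Steps 1-2 to 1-7: collect (X, A, R, X') tuples into the buffer."""
    x = env.reset()                                  # step 1-2: current state X
    for _ in range(max_steps):
        a = tuple(agent.act(x) for agent in agents)  # step 1-3: action group A
        x_next, r = env.step(a)                      # step 1-4: X' and joint reward R
        buffer.add((x, a, r, x_next))                # step 1-5: store in D
        x = x_next                                   # step 1-6: X' -> X
    return buffer                                    # step 1-7: sampling done

# Toy environment and agents, purely for demonstration:
class ToyEnv:
    def reset(self):
        return 0
    def step(self, actions):
        return sum(actions), 1.0                     # new state X', joint reward R

class ToyAgent:
    def act(self, state):
        return state + 1

buf = sample_phase(ToyEnv(), [ToyAgent(), ToyAgent()],
                   ReplayBuffer(capacity=5), max_steps=8)
```

With capacity 5 and 8 sampled steps, the buffer retains only the 5 most recent tuples, matching the overwrite-when-full behaviour of a bounded cache pool.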
As shown in fig. 3, the detailed steps of the agent network update are as follows:
the following operations are performed for each pair of all Agent agents i:
step 2-1: randomly sampling a minibatch of data tuples (X, A, R, X′) from the data cache pool D, where the minibatch size can be set freely;
step 2-2: calculating a target Q value according to the randomly sampled data tuples;
step 2-3: updating the real Critic network of Agent i by minimizing a loss function, which is calculated with the actual Q value and the target Q value as factors;
step 2-4: updating a real Actor network of the Agent i in a gradient descending manner, and calculating a strategy gradient of the model network;
step 2-5: parameter vectors of an Actor network and a Critic network of the Agent i are respectively extracted and recorded as: mi and Ni;
step 2-6: and (3) making a difference between the parameter vector of Agent i and the parameter vector of Agent (i-1), and recording as: Sub-Mi and Sub-Ni;
step 2-7: multiplying the Sub-Mi and the Sub-Ni by a differentiation factor beta, and feeding back and updating the original network respectively;
step 2-8: the above steps are repeated until all agents have finished updating their real networks;
step 2-9: and updating the target networks of all the agents in a soft updating mode, namely: the parameters of the real network are periodically copied into the target network.
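Steps 2-5 to 2-7 — the parameter-vector differentiation that distinguishes this method from plain MADDPG — can be sketched by treating each agent's network parameters as one flattened vector. The differentiation factor β = 0.5 below is an illustrative value, not one specified by the invention:

```python
import numpy as np

def differentiate_agents(param_vectors, beta):
    """Steps 2-5 to 2-7: for each agent i, subtract agent (i-1)'s parameter
    vector from agent i's (Sub-Mi), scale by the differentiation factor beta,
    and feed the result back, pushing the agents' policies apart."""
    updated = [param_vectors[0].copy()]                 # first agent: no predecessor here
    for i in range(1, len(param_vectors)):
        sub = param_vectors[i] - param_vectors[i - 1]   # step 2-6: Sub-Mi
        updated.append(param_vectors[i] + beta * sub)   # step 2-7: feed back
    return updated

vecs = [np.array([1.0, 2.0]), np.array([1.1, 2.2])]
new_vecs = differentiate_agents(vecs, beta=0.5)
# Agent 1 moves from [1.1, 2.2] to [1.15, 2.3]: its distance from agent 0
# increases, enlarging the difference between the two policies.
```

In the actual method this is applied separately to the Actor vectors Mi and the Critic vectors Ni; the sketch shows one set of vectors for brevity.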
The target Q value is:

y = r + γ·Q′(X′, a′1, a′2, …, a′n)

wherein X is the environment state characterizing parameter, ai is an action, Q is the Q-value computation function whose parameters are X and the actions ai, r is the reward value R, and γ is the decay factor;
the loss function is:
s is the total number of agents in the environment, yjIs the agent target Q value, QuIs the actual Q value of the agent;
the strategy gradient is:
mu is the agent policy, and sigma is the policy network input parameter;
The parameters of the real network are periodically copied into the target network using the following formula:

θ′i ← τ·θi + (1−τ)·θ′i

where θ denotes the network parameters and τ the parameter replication coefficient used when the network is updated.
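The soft update formula translates directly into code; τ = 0.1 below is an illustrative value (in practice τ is a small constant so that the target network tracks the real network slowly):

```python
import numpy as np

def soft_update(target_params, real_params, tau):
    """theta'_i <- tau * theta_i + (1 - tau) * theta'_i, applied to every
    parameter array of the network."""
    return [tau * theta + (1.0 - tau) * theta_t
            for theta_t, theta in zip(target_params, real_params)]

target = [np.zeros(3)]
real = [np.ones(3)]
target = soft_update(target, real, tau=0.1)
# every target entry moves 10% of the way toward the real network: 0.0 -> 0.1
```

Repeated application makes the target parameters converge exponentially toward the real parameters, which stabilizes the target Q value used in the loss above.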
The sequence labeling joint extraction model is an efficient joint extraction model for event entity relationships, but its training requires a large amount of high-quality labeled data. Although distant supervision can effectively increase the amount of labeled data, the labeled data sets it generates suffer from label noise, which adversely affects the model. To address this problem, the invention purifies and optimizes the labeled data based on an improved multi-agent reinforcement learning model, thereby solving the problem of data label noise in the training of the sequence labeling joint extraction model and improving the performance of the event entity-relationship joint extraction task.
The embodiment of the invention provides an event corpus purification method based on multi-agent reinforcement learning. On top of the original MADDPG training process, after an agent's policy is updated, the network parameters of each layer of each agent are extracted as parameter vectors; the parameter vectors of corresponding layers are then subtracted one by one to obtain the pairwise parameter vector differences among the agents, and these differences are multiplied by a differentiation factor and fed back to the updated agent, completing the agent's final update. This method of maximizing neural-network parameter vector differences expands the agents' exploration of the joint policy space during training, so the training result moves closer to the globally optimal solution.
Multi-agent reinforcement learning (MARL) is a key tool for solving many real-world problems, but reinforcement learning algorithms in a multi-agent environment face a typical problem: as the number of agents increases, the joint policy solution space grows exponentially, leading to poor policy space exploration and policy sub-optimality that the algorithm can hardly avoid. By studying policy space exploration methods, the invention optimizes the agents' exploration efficiency over the joint policy solution space and increases its coverage, so that coverage approaches that of the full policy solution space and the current optimal policy moves closer to the globally optimal solution.
Each agent in the group explores the policy solution space independently, and a random exploration process cannot avoid repeated coverage of the policy solution space, which reduces exploration efficiency to a certain degree. The invention proposes a policy exploration method that maximizes the differences between neural-network parameter vectors: the parameter vectors constituting each agent's neural network are extracted and, combined with the group's exploration of the policy solution space, repeated exploration is avoided to a certain extent by maximizing the differences between the agents' parameter vectors. This improves the coverage of the joint policy solution space, bringing it closer to full coverage, and thus improves the training effect over the original algorithm and optimizes the model.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include such modifications and variations.
Claims (5)
1. An event corpus purification method based on multi-agent reinforcement learning, characterized by comprising:
Before the model training begins, the environment and the intelligent agent need to be initialized and reset, and corresponding training parameters are set;
the intelligent agent forms a series of data required by training by executing corresponding purification optimization actions in the environment, samples the data and stores the data in a data cache region for subsequent training;
when the data quantity in the data cache region reaches a set value, training and updating the real networks of all the agents by using the data;
after the real networks have been updated, updating the target networks of all the agents by a method of intermittent parameter replication;
and repeating the steps until the training times reach the preset training times.
2. The method for refining the multi-agent reinforcement learning-based event corpus as claimed in claim 1, wherein the initialization reset of the environment and the agents is required before the model training starts, and the setting of the corresponding training parameters specifically comprises: and performing data preprocessing on the event corpus, and inputting the corpus as an environment parameter of the multi-agent reinforcement learning model.
3. The multi-agent reinforcement learning-based event corpus refining method as claimed in claim 1, wherein the agent forms a series of data required for training by performing corresponding refining optimization actions in the environment, samples the data and stores the data in the data buffer for subsequent training specifically comprises:
the multi-agent reinforcement learning model generates an action set of the agent group according to the input environment parameters;
the intelligent agent group executes the action set, selects corresponding event knowledge from the corpus and forms an event knowledge set;
mapping the event knowledge set into a word vector, and inputting the word vector into a sequence labeling joint model;
and the sequence labeling joint model labels the input word vectors and compares the result with the test set, so as to verify the event purification effect of the current multi-agent reinforcement learning model and output an evaluation index.
4. The method for refining the multi-agent reinforcement learning-based event corpus as claimed in claim 1, wherein said starting to train and update the real networks of all agents using the data when the number of data in the data buffer reaches a set value specifically comprises:
and converting the evaluation index into a reward value according to a preset reward function, and feeding the reward value back to the training of the multi-agent reinforcement learning model so as to optimize the model.
5. The method for purifying an event corpus based on multi-agent reinforcement learning according to claim 1, wherein after the real networks have been updated, updating the target networks of all the agents by the method of intermittent parameter replication further comprises:
and extracting network parameters of each layer of each agent as parameter vectors, subtracting the parameter vectors of each layer one by one to obtain pairwise parameter vector differences among the multiple agents, multiplying the parameter vector differences by differentiation factors, and feeding back the multiplied parameter vector differences to the updated agents to finish the final updating of the agents.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110773927.9A CN113377884B (en) | 2021-07-08 | 2021-07-08 | Event corpus purification method based on multi-agent reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113377884A true CN113377884A (en) | 2021-09-10 |
CN113377884B CN113377884B (en) | 2023-06-27 |
Family
ID=77581381
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110773927.9A Active CN113377884B (en) | 2021-07-08 | 2021-07-08 | Event corpus purification method based on multi-agent reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113377884B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114897168A (en) * | 2022-06-20 | 2022-08-12 | 支付宝(杭州)信息技术有限公司 | Fusion training method and system of wind control model based on knowledge representation learning |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109947954A (en) * | 2018-07-09 | 2019-06-28 | 北京邮电大学 | Multi-task collaborative recognition method and system |
CN109978176A (en) * | 2019-03-05 | 2019-07-05 | 华南理工大学 | Multi-agent cooperative learning method based on dynamic state sensing |
CN110008332A (en) * | 2019-02-13 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Method and device for extracting core words through reinforcement learning |
CN110110086A (en) * | 2019-05-13 | 2019-08-09 | 湖南星汉数智科技有限公司 | Chinese semantic role labeling method, apparatus, computer device and computer-readable storage medium |
CN110807069A (en) * | 2019-10-23 | 2020-02-18 | 华侨大学 | Entity relationship joint extraction model construction method based on reinforcement learning algorithm |
JP2020046792A (en) * | 2018-09-18 | 2020-03-26 | Zホールディングス株式会社 | Information processing device, information processing method and program |
CN110990590A (en) * | 2019-12-20 | 2020-04-10 | 北京大学 | Dynamic financial knowledge graph construction method based on reinforcement learning and transfer learning |
CN111160035A (en) * | 2019-12-31 | 2020-05-15 | 北京明朝万达科技股份有限公司 | Text corpus processing method and device |
CN111312354A (en) * | 2020-02-10 | 2020-06-19 | 东华大学 | Breast medical record entity identification and annotation enhancement system based on multi-agent reinforcement learning |
CN111382575A (en) * | 2020-03-19 | 2020-07-07 | 电子科技大学 | Event extraction method based on joint labeling and entity semantic information |
CN112487811A (en) * | 2020-10-21 | 2021-03-12 | 上海旻浦科技有限公司 | Cascading information extraction system and method based on reinforcement learning |
CN112541339A (en) * | 2020-08-20 | 2021-03-23 | 同济大学 | Knowledge extraction method based on random forest and sequence labeling model |
CN112801290A (en) * | 2021-02-26 | 2021-05-14 | 中国人民解放军陆军工程大学 | Multi-agent deep reinforcement learning method, system and application |
Non-Patent Citations (1)
Title |
---|
FANG Baofu et al.: "Emotion-based heterogeneous multi-agent reinforcement learning under sparse rewards", Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》), vol. 34, no. 03, pp. 223-231 *
Also Published As
Publication number | Publication date |
---|---|
CN113377884B (en) | 2023-06-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lin et al. | An efficient deep reinforcement learning model for urban traffic control | |
CN113469186B (en) | Cross-domain migration image segmentation method based on small number of point labels | |
CN113657561A (en) | Semi-supervised night image classification method based on multi-task decoupling learning | |
CN113139651A (en) | Training method and device of label proportion learning model based on self-supervision learning | |
CN109558898B (en) | Multi-choice learning method with high confidence based on deep neural network | |
CN116089883B (en) | Training method for improving classification degree of new and old categories in existing category increment learning | |
CN114463605A (en) | Continuous learning image classification method and device based on deep learning | |
CN115293919A (en) | Graph neural network prediction method and system oriented to social network distribution generalization | |
CN113742488A (en) | Embedded knowledge graph completion method and device based on multitask learning | |
JP6230987B2 (en) | Language model creation device, language model creation method, program, and recording medium | |
US20240127087A1 (en) | Machine learning knowledge management based on lifelong boosting in presence of less data | |
CN113377884A (en) | Event corpus purification method based on multi-agent reinforcement learning | |
CN114463596A (en) | Small sample image identification method, device and equipment of hypergraph neural network | |
CN113095229A (en) | Unsupervised domain self-adaptive pedestrian re-identification system and method | |
CN108829846A (en) | Business recommendation platform data clustering optimization system and method based on user characteristics | |
US11875250B1 (en) | Deep neural networks with semantically weighted loss functions | |
CN110705889A (en) | Enterprise screening method, device, equipment and storage medium | |
CN113590748B (en) | Emotion classification continuous learning method based on iterative network combination and storage medium | |
KR20240034804A (en) | Evaluating output sequences using an autoregressive language model neural network | |
JP6993250B2 (en) | Content feature extractor, method, and program | |
CN110334395A (en) | Satellite momentum wheel fault diagnosis method and system based on JADE-initialized EM algorithm | |
CN110443344B (en) | Momentum wheel fault diagnosis method and device based on K2ABC algorithm | |
Lyu et al. | Elastic Multi-Gradient Descent for Parallel Continual Learning | |
CN111753995B (en) | Local interpretable method based on gradient lifting tree | |
US20220405599A1 (en) | Automated design of architectures of artificial neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||