CN113377884B - Event corpus purification method based on multi-agent reinforcement learning - Google Patents

Event corpus purification method based on multi-agent reinforcement learning

Info

Publication number
CN113377884B
CN113377884B
Authority
CN
China
Prior art keywords
training
data
agent
model
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110773927.9A
Other languages
Chinese (zh)
Other versions
CN113377884A (en)
Inventor
后敬甲
王悦
白璐
崔丽欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central University of Finance and Economics
Original Assignee
Central University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central University of Finance and Economics
Priority to CN202110773927.9A
Publication of CN113377884A
Application granted
Publication of CN113377884B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an event corpus purification method based on multi-agent reinforcement learning. Before model training starts, the environment and the agents are initialized and reset, and the corresponding training parameters are set; the agents perform purification and optimization actions in the environment to produce the data required for training, and this data is sampled and stored in a data buffer for subsequent training; when the amount of data in the buffer reaches a set value, the data is used to start training and updating the real networks of all agents; after the real networks have been updated, the target networks of all agents are updated by copying parameters at intervals; these steps are repeated until the preset number of training iterations is reached. By purifying and optimizing the labeled data, the method alleviates the label-noise problem that the sequence-labeling joint extraction model encounters during training and improves the performance of the event entity-relation joint extraction task.

Description

Event corpus purification method based on multi-agent reinforcement learning
Technical Field
The invention relates to the field of multi-agent reinforcement learning methods, and in particular to an event corpus purification method based on multi-agent reinforcement learning.
Background
Reinforcement learning is a machine learning method that, according to the number of agents, can be divided into single-agent reinforcement learning and multi-agent reinforcement learning (MARL). MARL has wider application scenarios and is a key tool for solving many real-world problems. Depending on the relationships among the agents' tasks, MARL settings can be divided into fully cooperative tasks, fully competitive tasks, and mixed tasks; only fully cooperative tasks are considered here.
In MARL training under a fully cooperative task, the agents aim to maximize the joint reward: each agent selects actions according to its own policy, executes them in the environment, receives the corresponding reward and feedback, and updates its policy. These steps are repeated until the joint reward converges to its maximum, at which point each agent has reached the optimal policy for the current environment.
At present, the MADDPG (Multi-Agent Deep Deterministic Policy Gradient) algorithm is one of the more advanced reinforcement learning methods for multi-agent environments. It addresses the difficulty of applying traditional value-based algorithms (such as DQN) in continuous environments, incorporates deep learning to improve the training efficiency of the traditional policy-gradient algorithm (DPG), and introduces an experience replay pool and a 'centralized training' mechanism to further improve the training effect.
However, MADDPG still suffers from poor exploration of the joint solution space and from suboptimality: in a multi-agent reinforcement learning environment, as the number of agents increases, the size of the joint policy space grows exponentially. This reduces how completely the agents' policy space is explored during training, so the training result tends to converge to a globally suboptimal solution and a better training effect cannot be achieved.
Entity-relation extraction refers to detecting entity mentions in unstructured text and recognizing the semantic relations between them at the same time. Conventional entity-relation extraction handles the task in a serial manner: the entities are extracted first, and their relations are identified afterwards. The serial approach is simple, and the two subtasks are independent and flexible, each handled by its own sub-model, but the correlation between the two subtasks is ignored.
Joint entity-relation extraction uses a single model to combine entity recognition and relation extraction, so entity information and relation information can be integrated effectively, achieving better results than the serial extraction method; however, entities and relations still have to be extracted separately, which causes the model to produce extra redundant information.
To solve the problem of joint extraction models producing extra redundant information, research has proposed converting the joint extraction task into a labeling task: by defining tags that carry relation information, entities and their relations can be extracted directly with a sequence labeling model, so that entities and relations no longer need to be identified separately.
The sequence-labeling joint extraction model is an efficient joint event extraction model, but its training requires a large amount of high-quality labeled data. Distant supervision can label data automatically and effectively, but it relies on the assumption that if two entities hold a relation in a given corpus, then every sentence containing both entities expresses that relation. As a result, a large part of the labeled data set generated by distant supervision suffers from label noise, which adversely affects the joint extraction model.
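To make the source of this noise concrete, the following minimal Python sketch (purely illustrative; the sentences, the knowledge-base triple, and the function name are invented for the example) applies the distant-supervision rule literally and shows how a sentence that merely mentions both entities receives a wrong relation label:

```python
from typing import List, Tuple

def distant_label(sentences: List[str],
                  kb_triples: List[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Apply the distant-supervision rule: label every sentence that contains
    both entities of a knowledge-base triple with that triple's relation."""
    labeled = []
    for sent in sentences:
        for head, relation, tail in kb_triples:
            if head in sent and tail in sent:
                # The rule fires even when the sentence does not actually express
                # the relation -- this is exactly where label noise comes from.
                labeled.append((sent, relation))
    return labeled

if __name__ == "__main__":
    kb = [("Apple", "founded_by", "Steve Jobs")]
    sents = [
        "Steve Jobs founded Apple in 1976.",                      # correctly labeled
        "Steve Jobs was photographed holding an Apple product.",  # noisy label
    ]
    print(distant_label(sents, kb))
```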
Disclosure of Invention
In view of the above technical problems, the invention provides an event corpus purification method based on multi-agent reinforcement learning.
In order to solve the problems in the prior art, the invention provides a multi-agent reinforcement learning-based event corpus purification method, which comprises the following steps:
before model training starts, the environment and the agents are initialized and reset, and the corresponding training parameters are set;
the agents perform purification and optimization actions in the environment to produce the data required for training, and this data is sampled and stored in a data buffer for subsequent training;
when the amount of data in the data buffer reaches a set value, the data is used to start training and updating the real networks of all agents;
after the real networks have been updated, the target networks of all agents are updated by copying parameters at intervals;
the above steps are repeated until the preset number of training iterations is reached.
Preferably, initializing and resetting the environment and the agents before model training starts and setting the corresponding training parameters specifically includes: preprocessing the data of the event corpus and feeding the corpus into the multi-agent reinforcement learning model as its environment parameters.
Preferably, the step in which the agents perform purification and optimization actions in the environment to produce the data required for training, and this data is sampled and stored in a data buffer for subsequent training, specifically includes:
the multi-agent reinforcement learning model generates an action set for the agent group from the input environment parameters;
the agent group executes the action set, selecting the corresponding event knowledge from the corpus to form an event knowledge set;
the event knowledge set is mapped into word vectors, which are input into the sequence labeling joint model;
the sequence labeling joint model labels the input word vectors, compares the result against a test set to verify the event purification effect of the current multi-agent reinforcement learning model, and outputs evaluation indices.
Preferably, the step of using the data to train and update the real networks of all agents when the amount of data in the data buffer reaches the set value specifically includes:
converting the evaluation index into a reward value according to a preset reward function, and feeding the reward back into the training of the multi-agent reinforcement learning model to optimize it.
Preferably, after the real networks have been updated, updating the target networks of all agents by copying parameters at intervals further includes:
extracting the network parameters of each layer of each agent as parameter vectors, subtracting the per-layer parameter vectors one by one to obtain the pairwise parameter-vector differences among the agents, and multiplying each difference by a differentiation factor before feeding it back to the updated agent, thereby completing the agent's final update.
Compared with the prior art, the event corpus purification method based on multi-agent reinforcement learning has the following beneficial effects:
1. In the multi-agent reinforcement learning environment, improving the degree of exploration of the joint policy space improves the training effect of the multi-agent reinforcement learning algorithm and optimizes the multi-agent reinforcement learning model;
2. The invention extracts the parameters of each agent's sub-networks into parameter vectors that represent the agent's policy; by maximizing the differences among these parameter vectors, it reduces redundant policy exploration among the agents and increases the degree of exploration of the joint policy solution space;
3. Based on the optimized multi-agent reinforcement learning model, the labeled data is purified and optimized, alleviating the label-noise problem that the sequence-labeling joint extraction model encounters during training and improving the performance of the event entity-relation joint extraction task.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
fig. 1 is a flowchart of an event corpus purification method based on multi-agent reinforcement learning according to an embodiment of the present invention.
Fig. 2 is a training flowchart of an event corpus purification method based on multi-agent reinforcement learning according to an embodiment of the present invention.
Fig. 3 is a flow chart of a data sampling part of an event corpus purifying method based on multi-agent reinforcement learning according to an embodiment of the present invention.
Fig. 4 is a network updating flowchart of an event corpus purifying method based on multi-agent reinforcement learning according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a sequence labeling model of an event corpus purification method based on multi-agent reinforcement learning according to an embodiment of the present invention.
Fig. 6 is another flowchart of an event corpus purifying method based on multi-agent reinforcement learning according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in FIG. 1, the invention provides an event corpus purification method based on multi-agent reinforcement learning, which comprises the following steps:
S1. before model training starts, the environment and the agents are initialized and reset, and the corresponding training parameters are set;
S2. the agents perform purification and optimization actions in the environment to produce the data required for training, and this data is sampled and stored in a data buffer for subsequent training;
S3. when the amount of data in the data buffer reaches a set value, the data is used to train and update the real networks of all agents;
S4. after the real networks have been updated, the target networks of all agents are updated by copying parameters at intervals;
S5. the above steps are repeated until the preset number of training iterations is reached.
The event corpus purification method based on multi-agent reinforcement learning provided by the invention mainly comprises the following stages: training environment and parameter initialization, data sampling, real network training, and target network updating.
The invention mainly consists of two models: a multi-agent reinforcement learning model based on a policy exploration method that maximizes neural-network parameter-vector differences, and a sequence labeling joint model based on a Bi-LSTM-CRF structure.
In the invention, the sequence labeling module serves as the effect-verification and reward-feedback part of the corpus purification model. The selected model structure is shown in fig. 5 and mainly comprises two layers: a Bi-LSTM layer and a CRF layer.
The Bi-LSTM model performs very well on sequence labeling tasks: it can effectively combine and exploit long-range context information while retaining a neural network's ability to fit nonlinear data. However, because its optimization objective is to find the most probable tag at each time step and then assemble those tags into a sequence, the model's output tag sequence is often internally inconsistent.
To a certain extent, the CRF model complements the strengths and weaknesses of the Bi-LSTM model. Its advantage is that it can scan the whole input text through feature templates, so it gives more consideration to linear weighted combinations of local features over the whole text, and its optimization objective is the sequence with the highest overall probability rather than the most probable tag at each position. Its disadvantages are, first, that selecting features for the training corpus requires some prior knowledge: the features that strongly influence labeling must be identified from statistics of the relevant information in the corpus, too many features cause the model to overfit, too few cause it to underfit, and deciding how to combine features is difficult work; second, during training the CRF model is limited by the window size specified by the feature template, so long-range context information is hard to take into account.
Given the characteristics of the two models, the Bi-LSTM-CRF model combining them is selected: a linear-chain CRF layer is added on top of the hidden layer of the traditional Bi-LSTM model. This combined model is used as the sequence labeling module of the invention to verify the training effect of the corpus purification model, and the training result is fed back into the training of the corpus purification model to optimize it.
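The following PyTorch sketch shows the general shape of such a Bi-LSTM-CRF tagger (a minimal illustration, not the patent's implementation; it assumes the third-party pytorch-crf package, and the vocabulary size, tag set and dimensions are placeholders):

```python
# Minimal Bi-LSTM-CRF sketch (illustrative only).
# Assumes `pip install pytorch-crf`; all sizes are placeholders.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size: int, num_tags: int,
                 embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                            bidirectional=True)          # Bi-LSTM layer
        self.emission = nn.Linear(hidden_dim, num_tags)  # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)       # linear-chain CRF layer

    def loss(self, tokens, tags, mask):
        feats = self.emission(self.lstm(self.embed(tokens))[0])
        return -self.crf(feats, tags, mask=mask)         # negative log-likelihood

    def decode(self, tokens, mask):
        feats = self.emission(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(feats, mask=mask)         # best tag sequence per sentence
```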
As shown in fig. 5, initializing and resetting the environment and the agents before model training starts and setting the corresponding training parameters specifically includes: preprocessing the data of the event corpus and feeding the corpus into the multi-agent reinforcement learning model as its environment parameters.
As shown in fig. 6, the step in which the agents perform purification and optimization actions in the environment to produce the data required for training, and this data is sampled and stored in a data buffer for subsequent training, specifically includes:
the multi-agent reinforcement learning model generates an action set for the agent group from the input environment parameters;
the agent group executes the action set, selecting the corresponding event knowledge from the corpus to form an event knowledge set;
the event knowledge set is mapped into word vectors, which are input into the sequence labeling joint model;
the sequence labeling joint model labels the input word vectors, compares the result against a test set to verify the event purification effect of the current multi-agent reinforcement learning model, and outputs evaluation indices.
As shown in fig. 6, the step of using the data to train and update the real networks of all agents when the amount of data in the data buffer reaches the set value specifically includes:
converting the evaluation index into a reward value according to a preset reward function, and feeding the reward back into the training of the multi-agent reinforcement learning model to optimize it.
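The patent only states that a preset reward function maps the evaluation index to a reward; one simple choice, shown below as an assumption rather than the patent's actual function, is to reward the agent group with the scaled change in the tagging model's F1 score on the held-out test set:

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1 computed from precision and recall."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def reward_from_eval(curr_f1: float, prev_f1: float, scale: float = 10.0) -> float:
    """Hypothetical joint reward shared by all agents: scaled improvement of the F1 index."""
    return scale * (curr_f1 - prev_f1)
```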
Data sampling and agent network updating are described in detail as follows:
as shown in fig. 2, the detailed steps of data sampling are as follows:
step 1-1: initializing sampling process parameters: maximum data storage amount max-ep-length, sampled and stored data amount t=1;
step 1-2: acquiring a state X of a current environment, wherein X is a vector formed by a series of environment parameters;
step 1-3: each Agent i takes An environmental state X as input, generates An action Ai through the operation of a real Actor network in the Agent i, and all actions selected by the agents form An action group A (A1, A2, …, an);
step 1-4: all agents perform their respective actions in the current environment, namely: in the environment state X, executing An action group A (A1, A2, …, an) to obtain a new environment state as X', and simultaneously obtaining a combined rewards value R;
step 1-5: obtaining a complete data tuple (X, A, R, X') and storing the complete data tuple in the data cache pool D;
step 1-6: updating the current environmental state: x' - > X;
step 1-7: the steps are executed until the data exchange amount in the data cache pool D reaches the maximum data storage amount, namely: and when t > max-epi-length, ending data sampling and starting learning.
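A minimal sketch of this sampling loop (illustrative only; the environment interface env.state()/env.step() and the agents' act() method are assumptions, not taken from the patent):

```python
from collections import deque

def sample_episode(env, agents, buffer: deque, max_epi_length: int):
    """Fill the data cache pool D following steps 1-1 to 1-7 (interfaces are assumed)."""
    t = 1                                           # Step 1-1: initialize the counter
    x = env.state()                                 # Step 1-2: current environment state X
    while t <= max_epi_length:                      # Step 1-7: stop at the maximum storage amount
        a = [agent.act(x) for agent in agents]      # Step 1-3: actions from the real Actor networks
        x_next, r = env.step(a)                     # Step 1-4: new state X' and joint reward R
        buffer.append((x, a, r, x_next))            # Step 1-5: store the tuple (X, A, R, X') in D
        x = x_next                                  # Step 1-6: X' -> X
        t += 1
    return buffer

# Usage sketch:
# buffer = deque(maxlen=100_000)
# sample_episode(env, agents, buffer, max_epi_length=1024)
```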
As shown in fig. 3, the detailed steps of the agent network update are as follows (a code sketch of the differentiation and soft-update steps is given after the list):
For each Agent i among all agents, the following operations are performed:
Step 2-1: randomly sample a minibatch of data tuples (X, A, R, X') from the data cache pool D, where the minibatch size can be set independently;
Step 2-2: calculate the target Q value from the randomly sampled data tuples;
Step 2-3: update the real Critic network of Agent i by minimizing the loss function, which is computed from the actual Q value and the target Q value;
Step 2-4: update the real Actor network of Agent i by gradient descent, computing the policy gradient of the model network;
Step 2-5: extract the parameter vectors of Agent i's Actor network and Critic network, denoted Mi and Ni respectively;
Step 2-6: compute the differences between the parameter vectors of Agent i and Agent (i-1), denoted Sub-Mi and Sub-Ni;
Step 2-7: multiply Sub-Mi and Sub-Ni by the differentiation factor β and feed the results back to update the corresponding original networks;
Step 2-8: repeat the above steps until all agents have finished updating their real networks;
Step 2-9: update the target networks of all agents by soft update, i.e. the parameters of the real networks are copied into the target networks at intervals.
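The differentiation and soft-update steps (2-5 to 2-9) can be sketched as follows in PyTorch (an illustration under the assumption that each agent object exposes actor and critic networks plus target counterparts; this is not the patent's code):

```python
import torch

@torch.no_grad()
def differentiate(agents, beta: float = 0.01):
    """Steps 2-5 to 2-7: feed the pairwise parameter-vector difference back into each agent."""
    for i in range(1, len(agents)):
        for net_name in ("actor", "critic"):                 # parameter vectors Mi and Ni
            cur = getattr(agents[i], net_name)
            prev = getattr(agents[i - 1], net_name)          # assumes identical architectures
            for p_cur, p_prev in zip(cur.parameters(), prev.parameters()):
                p_cur += beta * (p_cur - p_prev)             # (Sub-Mi / Sub-Ni) scaled by beta

@torch.no_grad()
def soft_update(target_net, real_net, tau: float = 0.01):
    """Step 2-9: theta' <- tau * theta + (1 - tau) * theta'."""
    for p_t, p_r in zip(target_net.parameters(), real_net.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p_r)
```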
The target Q value is:

y_j = r + γ · Q′_i(X′, a′_1, …, a′_n)

where X is the environment-state characterization parameter, a_i is an action, Q is the Q-value function whose parameters are the state and the actions a_i, r is the reward value R, γ is the decay factor, and the primed quantities are computed with the target networks;

The loss function is:

L(θ_i) = (1/S) · Σ_j ( y_j − Q_i(X_j, a_1^j, …, a_n^j) )²

where S is the total number of agents in the environment, y_j is the agent's target Q value, and Q_i is the agent's actual Q value;

The policy gradient is:

∇_{θ_i} J ≈ (1/S) · Σ_j ∇_{θ_i} μ_i(o_i^j) · ∇_{a_i} Q_i(X_j, a_1^j, …, a_i, …, a_n^j) |_{a_i = μ_i(o_i^j)}

where μ is the agent policy and o_i is the input parameter of the policy network;

Copying the parameters of the real networks into the target networks at intervals uses the following formula:

θ′_i ← τ · θ_i + (1 − τ) · θ′_i

where θ refers to the network parameters and τ refers to the parameter-copy coefficient used when the networks are updated.
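As an illustration of how the target Q value, the loss function and the policy gradient fit together in one update step, the following PyTorch sketch (not the patent's code; the agent objects, their networks and their optimizers are assumed) updates Agent i's Critic and Actor on one sampled minibatch:

```python
import torch
import torch.nn.functional as F

def update_agent_i(i, batch, agents, gamma: float = 0.95):
    # batch holds tensors sampled from the cache pool D:
    # x, x_next: global state; acts: list of per-agent action tensors; r: joint reward
    x, acts, r, x_next = batch
    ag = agents[i]

    # Target Q value: y = r + gamma * Q'_i(X', a'_1, ..., a'_n), actions from the target Actors
    with torch.no_grad():
        next_acts = [other.target_actor(x_next) for other in agents]
        y = r + gamma * ag.target_critic(x_next, torch.cat(next_acts, dim=-1))

    # Loss function: squared difference between the actual and target Q values
    q = ag.critic(x, torch.cat(acts, dim=-1))
    critic_loss = F.mse_loss(q, y)
    ag.critic_opt.zero_grad(); critic_loss.backward(); ag.critic_opt.step()

    # Policy gradient: ascend Q_i with agent i's own action produced by its current Actor
    acts_pg = [a.detach() for a in acts]
    acts_pg[i] = ag.actor(x)
    actor_loss = -ag.critic(x, torch.cat(acts_pg, dim=-1)).mean()
    ag.actor_opt.zero_grad(); actor_loss.backward(); ag.actor_opt.step()
```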
The sequence-labeling joint extraction model is an efficient joint model for event entity-relation extraction, but its training requires a large amount of high-quality labeled data. Distant supervision can effectively increase the amount of labeled data, but the labeled data set it generates suffers from label noise, which adversely affects the model. To address this problem, the method purifies and optimizes the labeled data on the basis of the improved multi-agent reinforcement learning model, thereby alleviating the label-noise problem that the sequence-labeling joint extraction model encounters during training and improving the performance of the event entity-relation joint extraction task.
The embodiment of the invention provides an event corpus purification method based on multi-agent reinforcement learning. In the multi-agent reinforcement learning environment, each agent consists of a multi-layer neural network whose layer parameters are the parameters that generate the agent's current policy. On top of the original MADDPG training procedure, after an agent's policy has been updated, the network parameters of each of its layers are extracted as parameter vectors; the per-layer parameter vectors are then subtracted one by one to obtain the pairwise parameter-vector differences among the agents, and each difference is multiplied by a differentiation factor and fed back to the updated agent, completing the agent's final update. By maximizing the neural-network parameter-vector differences in this way, the agents' exploration of the joint policy space during training is enlarged, so the training result moves closer to the globally optimal solution.
Multi-agent reinforcement learning (MARL) is a key tool for solving many real-world problems, but reinforcement learning algorithms in multi-agent environments face a typical problem: as the number of agents increases, the joint policy solution space grows exponentially, making poor exploration of the policy space and suboptimal policies hard to avoid. By studying policy-space exploration methods, the invention improves the agents' exploration efficiency over the joint policy solution space and increases its degree of exploration, so the explored region tends further toward covering the full policy solution space and the current optimal policy moves closer to the global optimum.
The agents in a group explore the policy solution space independently, and a random exploration process cannot avoid covering parts of that space repeatedly, which reduces exploration efficiency to some extent. The invention proposes a policy exploration method that maximizes neural-network parameter-vector differences: the parameter vectors making up each agent's neural network are extracted, the agent group's exploration of the policy solution space is considered jointly, and repeated exploration of the policy solution space is avoided to some extent by maximizing the differences among the agents' parameter vectors. This increases the degree of exploration of the joint policy solution space, so that it tends further toward covering the full policy solution space, improving the training effect relative to the original algorithm and improving the model.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (3)

1. An event corpus purification method based on multi-agent reinforcement learning, characterized by comprising the following steps:
before model training starts, the environment and the agents are initialized and reset, and the corresponding training parameters are set;
the multi-agent reinforcement learning model generates an action set for the agent group from the input environment parameters;
the agent group executes the action set, selecting the corresponding event knowledge from the corpus to form an event knowledge set;
the event knowledge set is mapped into word vectors, which are input into a sequence labeling joint model with a Bi-LSTM-CRF structure;
the sequence labeling joint model with the Bi-LSTM-CRF structure labels the input word vectors, compares the result against a test set to verify the event purification effect of the current multi-agent reinforcement learning model, and outputs an evaluation index;
when the amount of data in the data buffer reaches a set value, the data is used to start training and updating the real networks of all agents;
after the real networks have been updated, the target networks of all agents are updated by copying parameters at intervals;
the network parameters of each layer of each agent are extracted as parameter vectors, the per-layer parameter vectors are subtracted one by one to obtain the pairwise parameter-vector differences among the agents, and each difference is multiplied by a differentiation factor and fed back to the updated agent, completing the agent's final update;
the above steps are repeated until the preset number of training iterations is reached.
2. The event corpus purification method based on multi-agent reinforcement learning according to claim 1, wherein initializing and resetting the environment and the agents before model training starts and setting the corresponding training parameters specifically comprises: preprocessing the data of the event corpus and feeding the corpus into the multi-agent reinforcement learning model as its environment parameters.
3. The event corpus purification method based on multi-agent reinforcement learning according to claim 1, wherein the step of using the data to train and update the real networks of all agents when the amount of data in the data buffer reaches the set value specifically comprises:
converting the evaluation index into a reward value according to a preset reward function, and feeding the reward back into the training of the multi-agent reinforcement learning model to optimize it.
CN202110773927.9A 2021-07-08 2021-07-08 Event corpus purification method based on multi-agent reinforcement learning Active CN113377884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110773927.9A CN113377884B (en) 2021-07-08 2021-07-08 Event corpus purification method based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110773927.9A CN113377884B (en) 2021-07-08 2021-07-08 Event corpus purification method based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN113377884A CN113377884A (en) 2021-09-10
CN113377884B true CN113377884B (en) 2023-06-27

Family

ID=77581381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110773927.9A Active CN113377884B (en) 2021-07-08 2021-07-08 Event corpus purification method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN113377884B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114897168A (en) * 2022-06-20 2022-08-12 支付宝(杭州)信息技术有限公司 Fusion training method and system of wind control model based on knowledge representation learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160035A (en) * 2019-12-31 2020-05-15 北京明朝万达科技股份有限公司 Text corpus processing method and device
CN112487811A (en) * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Cascading information extraction system and method based on reinforcement learning

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804715A (en) * 2018-07-09 2018-11-13 北京邮电大学 Merge multitask coordinated recognition methods and the system of audiovisual perception
JP7043373B2 (en) * 2018-09-18 2022-03-29 ヤフー株式会社 Information processing equipment, information processing methods, and programs
CN110008332B (en) * 2019-02-13 2020-11-10 创新先进技术有限公司 Method and device for extracting main words through reinforcement learning
CN109978176B (en) * 2019-03-05 2021-01-19 华南理工大学 Multi-agent cooperative learning method based on state dynamic perception
CN110110086A (en) * 2019-05-13 2019-08-09 湖南星汉数智科技有限公司 A kind of Chinese Semantic Role Labeling method, apparatus, computer installation and computer readable storage medium
CN110807069B (en) * 2019-10-23 2022-06-07 华侨大学 Entity relationship joint extraction model construction method based on reinforcement learning algorithm
CN110990590A (en) * 2019-12-20 2020-04-10 北京大学 Dynamic financial knowledge map construction method based on reinforcement learning and transfer learning
CN111312354B (en) * 2020-02-10 2023-10-24 东华大学 Mammary gland medical record entity identification marking enhancement system based on multi-agent reinforcement learning
CN111382575A (en) * 2020-03-19 2020-07-07 电子科技大学 Event extraction method based on joint labeling and entity semantic information
CN112541339A (en) * 2020-08-20 2021-03-23 同济大学 Knowledge extraction method based on random forest and sequence labeling model
CN112801290B (en) * 2021-02-26 2021-11-05 中国人民解放军陆军工程大学 Multi-agent deep reinforcement learning method, system and application

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160035A (en) * 2019-12-31 2020-05-15 北京明朝万达科技股份有限公司 Text corpus processing method and device
CN112487811A (en) * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Cascading information extraction system and method based on reinforcement learning

Also Published As

Publication number Publication date
CN113377884A (en) 2021-09-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant