CN114780677A

CN114780677A - Chinese event extraction method based on feature fusion

Info

Publication number: CN114780677A
Application number: CN202210354653.4A
Authority: CN
Inventors: 柯欣飞; 姬红兵; 张文博
Original assignee: Shaanxi Fangcun Jihui Intelligent Technology Co ltd; Xidian University
Current assignee: Shaanxi Fangcun Jihui Intelligent Technology Co ltd; Xidian University
Priority date: 2022-04-06
Filing date: 2022-04-06
Publication date: 2022-07-22
Anticipated expiration: 2042-04-06
Also published as: CN114780677B

Abstract

The invention discloses a Chinese event extraction method based on feature fusion, comprising the following steps: 1) constructing a Chinese event extraction network BERT-FF; 2) constructing a training data set; 3) downloading a pre-training parameter file and optimizing it with a contrastive learning method; 4) loading the optimized pre-training parameter file into the character-level feature extraction network by transfer learning; 5) training on the training data set to obtain a trained Chinese event extraction network BERT-FF; 6) crawling texts describing events from the open web, inputting them as a test data set into the trained BERT-FF network for event extraction, and outputting structured event information, i.e., the event extraction result. The method enhances the semantic representation capability of the model through feature fusion, improves Chinese event extraction performance, and can be used in fields such as news public opinion analysis, information processing and financial risk assessment.

Description

Chinese event extraction method based on feature fusion

Technical Field

The invention belongs to the technical field of artificial intelligence, relates to natural language processing, and particularly relates to a Chinese event extraction method based on feature fusion, which can be used for public opinion analysis and information processing.

Background

The main objective of the event extraction task is to extract structured event information from unstructured text, reducing the difficulty of acquiring and processing information for users. This typically includes determining the event type contained in the text, identifying each event argument, and determining the role it plays in the event. The technology can be applied in fields such as news public opinion analysis, information processing and financial risk assessment. Research on event extraction systems has been under way for more than a decade and has produced many effective results, but most of it targets English corpora. Research on Chinese event extraction is relatively immature: high-quality data sets are lacking, and model performance is lower than for English event extraction.

Traditional event extraction methods convert event detection and event argument extraction into classification problems by acquiring lexical, syntactic and semantic feature information about event trigger words and event arguments; their core is feature extraction and classifier construction. Although lexical, syntactic, semantic and other features can all serve as classifier input, constructing higher-level linguistic features such as part-of-speech tags and syntactic dependency parses requires expert knowledge of linguistics and related fields, which limits the adaptability and generality of the classifier.

With the development of neural networks, deep learning has achieved remarkable results in both event detection and event argument extraction. In a neural network, the input layer can take a simple representation of the raw data, and each layer learns to convert the output of the shallower layer into more abstract and complex features that are passed to the deeper layer, until the output features of the deepest layer are used for classification. Compared with traditional methods, deep learning greatly reduces the difficulty of feature engineering and achieves higher detection and classification precision. In October 2018, Google AI released the pre-trained language model BERT, based on the Transformer encoder structure, which stimulated research on applying pre-trained language models to the event extraction task. Before that, the mainstream event detection approach was to detect an event trigger word in the text and judge the event type from it; with the development of event extraction models based on pre-trained networks, whose semantic feature representation capability is strong, event detection based on the full text has become mainstream.

Disclosure of Invention

In order to overcome the above drawbacks of the prior art, the present invention provides a Chinese event extraction method based on feature fusion, so as to improve the precision and recall of Chinese event extraction.

In order to achieve the purpose, the invention adopts the technical scheme that:

the Chinese event extraction method based on feature fusion comprises the following steps:

step 1, constructing a Chinese event extraction network BERT-FF

The Chinese event extraction network BERT-FF comprises a character-level feature extraction network, a word-level feature extraction network, a feature fusion network and a back-end classification network;

the character-level feature extraction network is based on the BERT pre-trained language model and is used for extracting character-level features of an input text; the word-level feature extraction network is used for extracting word-level features of the input text; the feature fusion network fuses the extracted character-level features and word-level features through an attention mechanism, so as to enhance the semantic representation capability of the model and obtain fused feature vectors; the back-end classification network inputs the fused feature vectors into the event detection back-end network and the event argument extraction back-end network respectively to obtain the final event extraction result;

step 2, constructing a training data set

The training data set consists of texts which are crawled from an open network and describe events and annotation files which are in one-to-one correspondence with the texts;

step 3, training the Chinese event extraction network BERT-FF

step 4, crawling texts describing events from the open web, inputting them as a test data set into the trained Chinese event extraction network BERT-FF for event extraction, outputting structured event information as the event extraction result, and calculating the precision and recall of event extraction.

Compared with the prior art, the invention has the beneficial effects that:

firstly, the invention uses a transfer learning method to load the pre-training parameter file of BERT into the character-level feature extraction network constructed in step 1.1, so that this network can rapidly learn the character-level semantic features of the text, which accelerates the overall convergence of the model and reduces the time complexity of event extraction.

Secondly, the invention optimizes the pre-training parameters of BERT with a contrastive learning method, which alleviates the anisotropy problem of the BERT semantic feature vector space, and then uses the optimized model in place of the original model as the feature extraction network, thereby improving the event detection performance of the model.

Thirdly, in view of the characteristics of Chinese text data, the invention uses a feature fusion method based on an attention mechanism to fuse the extracted character-level feature information with the word-level feature information, thereby enhancing the contextual semantic representation capability of the model and improving its precision and recall in event detection and event argument extraction.

Drawings

FIG. 1 is a schematic structural diagram of a Chinese event extraction network BERT-FF.

FIG. 2 is a schematic diagram of the structure of the BERT pre-training model.

FIG. 3 is a schematic diagram of a multi-head attention mechanism.

FIG. 4 is an exemplary diagram of event extraction.

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the drawings and examples.

The invention relates to a Chinese event extraction method based on feature fusion, which comprises the following implementation steps.

Step 1, referring to fig. 1, a Chinese event extraction network BERT-FF is constructed.

The Chinese event extraction network BERT-FF comprises a character-level feature extraction network, a word-level feature extraction network, a feature fusion network and a back-end classification network.

The character-level feature extraction network is based on the BERT pre-trained language model and is used for extracting character-level features of the input text; the word-level feature extraction network is used for extracting word-level features of the input text; the feature fusion network fuses the extracted character-level features and word-level features through an attention mechanism, so as to enhance the semantic representation capability of the model and obtain fused feature vectors; the back-end classification network inputs the fused feature vectors into the event detection back-end network and the event argument extraction back-end network respectively to obtain the final event extraction result.

The key steps are as follows:

1.1) building a character level feature extraction network of a Chinese event extraction network BERT-FF:

referring to FIG. 2, the structural relationship is: input layer → word embedding layer → position coding → N cascaded semantic encoders → output layer.

The specific parameters and implementation modes of the modules are as follows:

the input of the input layer is the token sequence obtained after tokenizing the text. To avoid GPU out-of-memory (OOM) errors, the maximum token sequence length during training is set to 128, and longer sequences are truncated. To process the input text in batches, all token sequences in a batch (Batch) must have the same length; shorter sequences are padded (Padding) to the length of the longest sequence in the batch. The resulting token sequence length is denoted sequence length₁.
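The truncation and padding rule described above can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation; the padding id `PAD_ID = 0` and the function name `make_batch` are assumptions introduced here.

```python
from typing import List

MAX_LEN = 128  # maximum token sequence length stated in the text
PAD_ID = 0     # assumed padding token id (hypothetical, not given in the source)


def make_batch(token_ids: List[List[int]],
               max_len: int = MAX_LEN,
               pad_id: int = PAD_ID) -> List[List[int]]:
    """Truncate each sequence to max_len, then pad to the longest sequence in the batch."""
    truncated = [seq[:max_len] for seq in token_ids]
    batch_len = max(len(seq) for seq in truncated)
    return [seq + [pad_id] * (batch_len - len(seq)) for seq in truncated]


# Three sequences of length 3, 2 and 200; the last one is truncated to 128.
batch = make_batch([[5, 6, 7], [8, 9], list(range(1, 201))])
print([len(seq) for seq in batch])  # every sequence now has the same length
```

Padding to the longest sequence in the batch, rather than always to 128, keeps short batches small, which is consistent with the memory concern stated above.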

The embedding dimension (embedding size) of the word embedding layer is 768, which meets the input dimension requirement of BERT; that is, the word vector of each token is a 768-dimensional column vector.

The position coding method adopts Sinusoid position coding, as shown in formula (1) and formula (2):

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right) \tag{1}$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right) \tag{2}$$

where pos is the position of the current token in the sequence, with value range [0, sequence length₁); i indexes the dimension pairs of the word vector, with value range [0, embedding size/2), so that the dimension of the position code matches that of the word vector; and d is the dimension of the word vector. Formulas (1) and (2) are a sine and cosine pair giving the position code at the even and odd dimension indices of the word vector respectively, thereby generating different periodic variations. As i increases, the frequency of the periodic variation becomes lower and lower, eventually yielding a unique texture containing position information at each position. This texture is added to the word vector as a position code, so that the model can learn the dependency between positions and the sequential nature of natural language.
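For illustration, the Sinusoid position coding can be computed as below. The sketch assumes the standard sinusoidal form, in which dimension 2i uses sin(pos / 10000^(2i/d)) and dimension 2i+1 uses the matching cosine; the base 10000 and the function name are assumptions, since the source does not reproduce formulas (1) and (2) in text form.

```python
import math
from typing import List


def sinusoid_position_encoding(seq_len: int, d: int) -> List[List[float]]:
    """Return a seq_len x d table; row pos is the position code of token pos."""
    pe = [[0.0] * d for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d // 2):  # i ranges over [0, d/2), one sine/cosine pair each
            angle = pos / (10000 ** (2 * i / d))
            pe[pos][2 * i] = math.sin(angle)      # even dimension index
            pe[pos][2 * i + 1] = math.cos(angle)  # odd dimension index
    return pe


# Sizes taken from the text: sequence length 128, word vector dimension 768.
pe = sinusoid_position_encoding(seq_len=128, d=768)
print(len(pe), len(pe[0]))
```

Because the wavelength grows with i, low dimensions change quickly from one position to the next while high dimensions change slowly, which is what gives every position a distinct code.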

The main body of the character-level feature extraction network is N cascaded semantic encoders, where N is 12. Each semantic encoder consists of two parts: a multi-head self-attention (Multi-Head Self-Attention) module containing a residual network and a feed-forward (Feed Forward) module containing a residual network.

The multi-head self-attention module containing the residual network is formed by connecting a multi-head attention module and a residual module; for the multi-head attention module, refer to FIG. 3. The inputs input₁ and input₂ of the multi-head attention module are both word vectors combined with position codes. The three inputs of each Attention Head are a query (Query) vector sequence Q, a key (Key) vector sequence K and a value (Value) vector sequence V, and the number h of Attention Heads is 12. Q is obtained from input₁ through a fully connected layer, and K and V are obtained from input₂ through fully connected layers. In practice, the dimension d of the input word vector is usually large, and multiple groups of attention weights need to be calculated; to avoid an excessive number of network parameters, the mapping matrices are generally chosen to perform dimension reduction. That is, in each Attention Head the original d-dimensional word vector is projected to d/h dimensions, so each fully connected layer mapping matrix has dimensions 768 × 64. Each Attention Head generates one set of Q, K and V, and each set is input to a scaled dot-product attention module to compute a context feature vector. Finally, the resulting groups of context feature vectors are concatenated and input into a fully connected layer, whose mapping matrix has dimensions 768 × 768, to obtain the output of the multi-head attention module. The residual module adds the input and output of the preceding module and performs layer normalization.
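The scaled dot-product attention computed inside each Attention Head can be sketched in plain Python as follows. This is a single-head illustration on small lists under the usual definition softmax(QKᵀ/√d_k)·V; the function names are introduced here and are not from the source.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """For each query, weight the value vectors by softmax(q . k / sqrt(d_k))."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Per-head dimension from the text: d = 768 split over h = 12 heads.
d, h = 768, 12
head_dim = d // h
print(head_dim)

# A query that scores both keys equally averages the two value vectors.
context = scaled_dot_product_attention([[0.0, 0.0]],
                                       [[1.0, 0.0], [0.0, 1.0]],
                                       [[1.0, 0.0], [0.0, 1.0]])
print(context)
```

Dividing the scores by √d_k keeps their magnitude independent of the head dimension, so the softmax does not saturate as d_k grows.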

The feed-forward module containing the residual network is formed by connecting two fully connected layers, a GeLU activation function and a residual module, with the GeLU activation function located between the two fully connected layers. The first fully connected layer mapping matrix has dimensions 768 × 2048, and the second has dimensions 2048 × 768. The residual module adds the input and output of the preceding module and performs layer normalization. The GeLU activation function is expressed as:

$$\mathrm{GeLU}(x) = \frac{x}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right] \tag{3}$$

where x represents the output of the preceding module and erf(·) represents the Gaussian error function.
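As a small worked example, the erf-based GeLU can be evaluated with the Python standard library. The sketch assumes the exact (non-approximated) form 0.5·x·(1 + erf(x/√2)), which is the usual definition in terms of the Gaussian error function.

```python
import math


def gelu(x: float) -> float:
    """Exact GeLU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# GeLU is close to the identity for large positive x and close to zero
# for large negative x, with a smooth transition around the origin.
for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(value, round(gelu(value), 4))
```

Unlike ReLU, the function is smooth at zero and lets small negative inputs pass with a reduced weight.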

1.2) building a word level feature extraction network of a Chinese event extraction network BERT-FF:

the structural relationship is as follows: input layer → word embedding layer → position coding → fully connected layer → output layer.

The specific parameters and implementation of each module are as follows:

the input of the input layer is the token sequence obtained after word segmentation of the text. To avoid GPU out-of-memory (OOM) errors, the maximum token sequence length during training is set to 128, and longer sequences are truncated. To process the input text in batches, all token sequences in a batch (Batch) must have the same length; shorter sequences are padded (Padding) to the length of the longest sequence in the batch. The resulting token sequence length is denoted sequence length₂.

The embedding dimension (embedding size) of the word embedding layer is 128; that is, the word vector of each token is a 128-dimensional column vector.

The position coding method adopts Sinusoid position coding.

The fully connected layer mapping matrix has dimensions 128 × 768, so that the dimension of the word-level feature vector equals that of the character-level feature vector, which facilitates subsequent calculation.

1.3) constructing a feature fusion network of a Chinese event extraction network BERT-FF:

the structural relationship is as follows: input layer → multi-headed attention module → output layer.

The specific parameters and implementation of each module are as follows:

the input of the input layer consists of two parts, the character-level feature vector input₁ and the word-level feature vector input₂; input₁ has dimensions sequence length₁ × 768, and input₂ has dimensions sequence length₂ × 768.

The three inputs of each Attention Head of the multi-head attention module are a query vector sequence Q, a key vector sequence K and a value vector sequence V. Experimental parameter tuning showed that model performance is best when the number of Attention Heads is set to 24. Q is obtained from input₁ through a fully connected layer, and K and V are obtained from input₂ through fully connected layers; each of these mapping matrices has dimensions 768 × 32. Each Attention Head generates one set of Q, K and V, and each set is input to a scaled dot-product attention module to compute a context feature vector. Finally, the resulting groups of context feature vectors are concatenated and input into a fully connected layer, whose mapping matrix has dimensions 768 × 768, to obtain the output of the multi-head attention module.
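The fusion step is a multi-head cross-attention in which input₁ supplies the queries and input₂ supplies the keys and values, so the output keeps the length of input₁. The sketch below illustrates only this data flow, on tiny dimensions and with random weights standing in for the learned mapping matrices; the function `fuse` and all sizes in the usage lines are assumptions for illustration.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    # Random stand-in for a learned mapping matrix (illustration only).
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def fuse(input1: Matrix, input2: Matrix, n_heads: int, rng: random.Random) -> Matrix:
    """Cross-attention: Q from input1, K and V from input2."""
    d = len(input1[0])
    d_head = d // n_heads
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rand_matrix(d, d_head, rng) for _ in range(3))
        Q, K, V = matmul(input1, Wq), matmul(input2, Wk), matmul(input2, Wv)
        K_t = [list(col) for col in zip(*K)]
        scores = matmul(Q, K_t)  # shape: len(input1) x len(input2)
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        heads.append(matmul(weights, V))  # shape: len(input1) x d_head
    # Concatenate the heads position by position, then apply the output mapping.
    concat = [sum((head[t] for head in heads), []) for t in range(len(input1))]
    return matmul(concat, rand_matrix(d, d, rng))


rng = random.Random(0)
input1 = rand_matrix(5, 8, rng)  # 5 positions, toy dimension 8 instead of 768
input2 = rand_matrix(3, 8, rng)  # 3 positions
fused = fuse(input1, input2, n_heads=2, rng=rng)
print(len(fused), len(fused[0]))
print(768 // 24)  # per-head dimension for the configuration in the text
```

Because the queries come from input₁, the two sequences may have different lengths: each position of input₁ attends over all positions of input₂.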

1.4) building a back-end classification network of the Chinese event extraction network BERT-FF:

the back-end classification network of the Chinese event extraction network BERT-FF is divided into two parts, namely an event detection back-end network and an event argument extraction back-end network.

The structural relationship of the event detection back-end network is as follows in sequence: input layer → fully connected layer → multi-label classifier → output layer.

The specific parameters and implementation modes of the modules are as follows:

the input of the input layer is the feature vector of the [CLS] token in the fused feature vector, with dimensions 1 × 768.

The fully connected layer mapping matrix has dimensions 768 × n_events, where n_events is the total number of event types.

The multi-label classifier consists of n_events Sigmoid functions, and its final output is the probability that each event type is contained in the current input text. If a probability is greater than 0.5, the corresponding event type is considered to be contained in the text; otherwise it is considered not to be contained.
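The thresholding rule of the multi-label classifier can be sketched as follows. The logits and the third event type name are made up for illustration; only the per-type Sigmoid and the 0.5 threshold come from the text.

```python
import math
from typing import Dict, List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def detect_events(logits: List[float], event_types: List[str],
                  threshold: float = 0.5) -> Dict[str, float]:
    """Keep every event type whose independent probability exceeds the threshold."""
    probs = [sigmoid(z) for z in logits]
    return {name: p for name, p in zip(event_types, probs) if p > threshold}


# The first two type names appear in the document's example; the third is hypothetical.
event_types = ["disaster/accident-explosion", "life-death", "finance-listing"]
detected = detect_events([2.0, 0.3, -1.5], event_types)
print(sorted(detected))
```

Each type is scored independently, so one text can contain zero, one or several event types, which a single softmax over types could not express.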

The structure relationship of the event argument extraction back-end network is as follows: input layer → fully connected layer → conditional random field → output layer.

The specific parameters and implementation of each module are as follows:

the input of the input layer is the fused feature vector, with dimensions sequence length₁ × 768.

The fully connected layer mapping matrix has dimensions 768 × n_labels, where n_labels is the total number of labels used to annotate event argument sequences; the labeling scheme is BIO labeling.
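To illustrate BIO labeling, the sketch below turns a tagged token sequence back into argument spans. The tag names (built from argument roles such as "number_of_dead") and the English tokens are assumptions for readability; the source states only that BIO labeling is used.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (role, text) spans from B-/I-/O tags; a stray I- tag is treated as O."""
    spans: List[Tuple[str, str]] = []
    role: Optional[str] = None
    buffer: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if role is not None:
                spans.append((role, " ".join(buffer)))
            role, buffer = tag[2:], [token]
        elif tag.startswith("I-") and role == tag[2:]:
            buffer.append(token)
        else:
            if role is not None:
                spans.append((role, " ".join(buffer)))
            role, buffer = None, []
    if role is not None:
        spans.append((role, " ".join(buffer)))
    return spans


tokens = ["at", "least", "4", "civilians", "died"]
tags = ["B-number_of_dead", "I-number_of_dead", "I-number_of_dead", "B-dead", "O"]
print(decode_bio(tokens, tags))
```

Under this scheme each argument role contributes a B- and an I- label, plus one shared O label, so n_labels would be 2 × (number of roles) + 1.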

The conditional random field is a linear-chain conditional random field, which finally outputs the sequence labeling result of the event arguments. A conditional random field (CRF) is a discriminative probabilistic model: a probabilistic undirected graph model of a random variable Y given a random variable X. The special conditional random field defined on a linear chain, called the linear-chain conditional random field (Linear Chain CRF), is defined as follows:

Let $X = (X_1, X_2, \ldots, X_n)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$ both be sequences of random variables represented by linear chains. If, given X, the conditional probability distribution $P(Y \mid X)$ of Y satisfies the Markov property

$$P(Y_i \mid X, Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_n) = P(Y_i \mid X, Y_{i-1}, Y_{i+1}), \quad i = 1, 2, \ldots, n \tag{4}$$

then $P(Y \mid X)$ is called a linear-chain conditional random field.
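The source does not state how the label sequence is decoded from the linear-chain conditional random field; Viterbi decoding is the usual choice and is assumed here. In the sketch, `emissions` stands for the per-token label scores from the fully connected layer and `transitions` for the learned label-to-label scores; all numbers are made up.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label path; transitions[i][j] scores label i -> j."""
    n_labels = len(transitions)
    score = list(emissions[0])
    backptr: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(backptr):  # walk the back-pointers from the last token
        best = ptr[best]
        path.append(best)
    return path[::-1]


# Labels: 0 = O, 1 = B, 2 = I. The O -> I transition is heavily penalized,
# which is how a CRF can rule out invalid BIO sequences.
transitions = [[0.5, 0.0, -1e4],
               [0.0, -1.0, 1.0],
               [0.0, -1.0, 0.5]]
emissions = [[0.1, 2.0, 0.0],
             [0.0, 0.0, 1.5],
             [1.0, 0.0, 0.9]]
print(viterbi(emissions, transitions))
```

Because only adjacent labels interact, as formula (4) states, the best path can be found in time linear in the sequence length.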

Step 2, constructing a training data set. The training data set consists of texts crawled from the open web that describe events and annotation files in one-to-one correspondence with the texts.

In the embodiment of the invention, the construction process is as follows:

2.1) crawling at least 5000 texts describing the events from the open network.

2.2) manually labeling the event type, the event argument and the argument role contained in each text, and generating annotation files corresponding to the crawled texts one by one. The event type refers to an event type contained in the text; event arguments refer to elements in the text that are related to an event, typically named entities; argument roles refer to the role an event argument plays in an event.

Take as an example the news text "A security official said on the 25th that a landmine explosion occurred that day in Mooha, a port city in the southwest of the country, killing at least 4 civilians." According to a predefined event schema, the event extraction system should recognize that the text contains two events, "disaster/accident-explosion" and "life-death". For "disaster/accident-explosion" it should extract the "time" as "that day", the "location" as "Mooha, a port city in the southwest of the country", and the "number of dead" as "at least 4"; for "life-death" it should extract the "time" as "that day", the "location" as "Mooha, a port city in the southwest of the country", and the "dead" as "civilians".

In the above example, "disaster/accident-explosion" and "life-death" are event types; "that day", "Mooha, a port city in the southwest of the country", "at least 4" and "civilians" are event arguments; "time", "location", "number of dead" and "dead" are the corresponding argument roles.

2.3) composing the crawled text and the annotation files into a training data set.

Step 3, training the event extraction network BERT-FF.

3.1) configuring the operating environment of the event extraction network BERT-FF, including software such as CUDA 11.0, cuDNN 8.0.4, Python 3.7 and PyTorch 1.7.0.

3.2) downloading the pre-training parameter file (Chinese_BERT_wwm.bin) of the pre-trained language model BERT to a local hard disk.

3.3) optimizing the pre-training parameters of BERT with a contrastive learning method to obtain an optimized pre-training parameter file.

3.4) loading the optimized pre-training parameter file into the character-level feature extraction network by transfer learning to obtain the loaded character-level feature extraction network. Transfer learning here means that model parameters with strong semantic representation capability, obtained in related pre-training tasks, are loaded into the character-level feature extraction network, which increases the convergence speed of model training, saves training time, and reduces the risks of under-fitting and over-fitting.

3.5) training the BERT-FF network containing the loaded character-level feature extraction network on the training data set to obtain the trained Chinese event extraction network BERT-FF, where the training is implemented as follows:

The number of training epochs (Epoch) is set to 50, so that the network weights are sufficiently iterated and converge toward the optimum. After each epoch, the network is evaluated on the test samples, and the best result and the corresponding network weights are saved, using the F1 score of event argument extraction as the criterion. The batch size (Batch Size) during training is set to 24 to make full use of the GPU's computing resources. The network optimizer is the adaptive moment estimation (Adam) algorithm. A layered learning rate (Learning Rate) is used: the learning rate of the character-level feature extraction network is set to 0.00002, the learning rate of the conditional random field is set to 0.002, and the learning rate of the other parts of the model is set to 0.0002. The character-level feature extraction network is based on a pre-trained language model and needs only fine-tuning to converge, so its learning rate is set small, whereas the other parts of the model must be trained from scratch. The conditional random field, based on a probabilistic graphical model, is harder to converge, so its learning rate is set larger.
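The layered learning rate can be expressed as parameter groups, one per learning rate. The sketch below groups parameters by name prefix; the prefixes "bert_encoder." and "crf." and the example parameter names are hypothetical, and only the three learning rates come from the text.

```python
from typing import Dict, List

LR_ENCODER = 2e-5  # BERT-based feature extraction network (fine-tuning)
LR_CRF = 2e-3      # conditional random field
LR_OTHER = 2e-4    # all remaining parts of the model


def build_param_groups(param_names: List[str]) -> List[Dict]:
    """Assign each parameter to a group by its (hypothetical) name prefix."""
    groups: Dict[str, List[str]] = {"encoder": [], "crf": [], "other": []}
    for name in param_names:
        if name.startswith("bert_encoder."):
            groups["encoder"].append(name)
        elif name.startswith("crf."):
            groups["crf"].append(name)
        else:
            groups["other"].append(name)
    return [
        {"params": groups["encoder"], "lr": LR_ENCODER},
        {"params": groups["crf"], "lr": LR_CRF},
        {"params": groups["other"], "lr": LR_OTHER},
    ]


param_groups = build_param_groups(
    ["bert_encoder.layer0.weight", "crf.transitions", "fusion.q_proj.weight"]
)
for group in param_groups:
    print(group["lr"], group["params"])
```

In PyTorch, a list of dictionaries of this shape, holding the parameter tensors themselves rather than their names, is the per-parameter-options form accepted by optimizers such as `torch.optim.Adam`.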

Step 4, crawling texts describing events from the open web, inputting them as a test data set into the trained Chinese event extraction network BERT-FF for event extraction, outputting structured event information, i.e., the event extraction result, and calculating the precision and recall of event extraction. Illustratively, to ensure a valid evaluation, at least 500 texts are crawled.

Taking as an example the input news text "A security official said on the 25th that a landmine explosion occurred that day in Mooha, a port city in the southwest of the country, killing at least 4 civilians.", the final event extraction result can be organized in a structured form of key-value pairs; see FIG. 4. This structured form can be conveniently stored in a JSON file or a database, so that users can obtain the event information or display it visually.
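One possible key-value layout for such a result is sketched below. The field names ("event_type", "arguments") and the nesting are assumptions made for illustration and are not taken from FIG. 4; the argument values paraphrase the example text.

```python
import json

# Hypothetical layout: one record per text, one entry per detected event.
event_record = {
    "events": [
        {
            "event_type": "disaster/accident-explosion",
            "arguments": {
                "time": "that day",
                "location": "Mooha, a port city in the southwest of the country",
                "number of dead": "at least 4",
            },
        },
        {
            "event_type": "life-death",
            "arguments": {
                "time": "that day",
                "location": "Mooha, a port city in the southwest of the country",
                "dead": "civilians",
            },
        },
    ]
}

# ensure_ascii=False keeps Chinese text readable when real data is stored.
serialized = json.dumps(event_record, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
print(serialized)
```

Keying the arguments by role makes it straightforward to look up, for example, the location of every explosion event in a stored collection.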

The effect of the present invention will be further described with reference to simulation experiments.

1. Conditions of the experiment

The hardware platform of the simulation experiment is: an Intel(R) Xeon(R) Silver 4310 CPU with a main frequency of 2.1 GHz, 32 GB of memory, and a single NVIDIA GeForce RTX 3080 Ti GPU; the software platform is Ubuntu 18.04.

2. Analysis of Experimental Contents and results

The simulation experiment of the invention adopts the method of the invention to train the constructed Chinese event extraction network BERT-FF on the constructed event extraction training data set. And performing event extraction on the input test text by using the trained Chinese event extraction network BERT-FF, outputting structured event information to obtain an event extraction result, and calculating the accuracy and the recall rate of the event extraction.

Table 1 shows the performance of event detection on a test set after training a constructed data set using the method of the present invention.

TABLE 1 comparison of event detection Performance for different methods

Model	Precision	Recall	F1
LSTM-CRF	0.8315	0.7266	0.7755
BERT (baseline)	0.8863	0.9270	0.9062
BERT-FF	0.9356	0.9203	0.9279

Table 1 shows the event detection performance of the method of the invention on the test set, with two other methods as control groups. LSTM-CRF denotes a classical event extraction method based on LSTM-CRF sequence labeling, BERT denotes an event extraction method that does not use the feature fusion method of the invention, and BERT-FF denotes the method of the invention. Precision, Recall and F1 denote precision, recall and F1 score respectively.

Compared with BERT, BERT-FF with the feature fusion method of the invention achieves a precision in event detection that is 5.56% higher in relative terms, while its recall drops by less than 1%. The resulting F1 score of BERT-FF is 2.39% higher than that of BERT, again in relative terms. Overall, the event detection performance of BERT-FF is a considerable improvement over BERT.
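The figures quoted for Table 1 can be checked with a few lines of arithmetic. The sketch recomputes each F1 score from the tabulated precision and recall and shows that the quoted gains are relative changes over the BERT baseline; the helper names are introduced here.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, base: float) -> float:
    """Relative change of new over base, in percent."""
    return (new - base) / base * 100


# (precision, recall) pairs copied from Table 1.
table1 = {
    "LSTM-CRF": (0.8315, 0.7266),
    "BERT": (0.8863, 0.9270),
    "BERT-FF": (0.9356, 0.9203),
}
for model, (p, r) in table1.items():
    print(model, round(f1(p, r), 4))

print(round(relative_gain(0.9356, 0.8863), 2))  # precision gain of BERT-FF over BERT
print(round(relative_gain(0.9279, 0.9062), 2))  # F1 gain of BERT-FF over BERT
```

In absolute terms the precision difference is about 4.9 percentage points, so the quoted 5.56% is meaningful only as a relative improvement.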

Table 2 shows the performance of event argument extraction on the test set after training the constructed data set by the method of the present invention.

TABLE 2 comparison of event argument extraction Performance for different methods

Model	Precision	Recall	F1
LSTM-CRF	0.6826	0.4989	0.5765
BERT (baseline)	0.7446	0.7465	0.7456
BERT-FF	0.7570	0.7801	0.7684

Table 2 shows the performance of the method of the present invention in extracting event arguments on the test set, while the other two methods were used as control groups to perform the experiment.

Compared with BERT, BERT-FF with the proposed feature fusion method improves on every index: precision improves by 1.67%, recall by 4.50%, and the resulting F1 score by 3.04%. The event argument extraction performance of BERT-FF is thus clearly better than that of BERT.

In conclusion, adding the feature fusion method improves both the event detection performance and the event argument extraction performance of the model. This shows that fusing character-level and word-level features does help optimize the semantic representation capability of the model, and it demonstrates the effectiveness of the proposed feature fusion method on the event extraction task.

The above is only one specific example of the present invention, provided to help those skilled in the art understand it. The invention is not limited to the scope of this specific example; to those skilled in the art, various changes are possible as long as they fall within the spirit and scope of the invention as defined by the appended claims, and all inventions making use of the inventive concept are protected.

Claims

1. The Chinese event extraction method based on feature fusion is characterized by comprising the following steps of:

step 1, constructing a Chinese event extraction network BERT-FF

the character-level feature extraction network is based on the BERT pre-trained language model and is used for extracting character-level features of an input text; the word-level feature extraction network is used for extracting word-level features of the input text; the feature fusion network fuses the extracted character-level features and word-level features through an attention mechanism, so as to enhance the semantic representation capability of the model and obtain fused feature vectors; the back-end classification network inputs the fused feature vectors into an event detection back-end network and an event argument extraction back-end network respectively to obtain the final event extraction result;

step 2, constructing a training data set

step 3, training the Chinese event extraction network BERT-FF

2. The method for extracting Chinese events based on feature fusion according to claim 1, wherein the structural relationship of the character-level feature extraction network is, in sequence: input layer → word embedding layer → position coding → N cascaded semantic encoders → output layer;

the input of the input layer is the token sequence obtained after tokenizing the text; the maximum token sequence length during training is set to 128, and longer sequences are truncated; the token sequences in each batch are kept equal in length, and if they are not, shorter sequences are padded to the longest token sequence length in the batch; the token sequence length is denoted sequence length₁;

the embedding dimension (embedding size) of the word embedding layer is 768, that is, the word vector of each token is a 768-dimensional column vector;

the position coding method adopts Sinusoid position coding, as shown in formula (1) and formula (2):

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right) \tag{1}$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right) \tag{2}$$

wherein pos is the position of the current token in the sequence, with value range [0, sequence length₁); i indexes the dimension pairs of the word vector, with value range [0, embedding size/2), so that the dimension of the position code matches that of the word vector; and d is the dimension of the word vector; formula (1) and formula (2) give the position code at the even and odd dimension indices of the word vector respectively, thereby generating different periodic variations; as i increases, the frequency of the periodic variation becomes lower and lower, finally yielding a unique texture containing position information at each position, which is added to the word vector as a position code so that the model can learn the dependency between positions and the sequential nature of natural language;

the N cascaded semantic encoders form the main body of the character-level feature extraction network, and each semantic encoder consists of two parts: a multi-head self-attention module containing a residual network and a feed-forward (forward propagation) module containing a residual network.
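The input handling and position coding of claim 2 can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the function names, the `[PAD]` padding token, and the base constant 10000 (the conventional choice for sinusoidal position coding) are assumptions.

```python
import math

MAX_LEN = 128  # maximum token sequence length during training (claim 2)


def pad_batch(batch, pad_token="[PAD]", max_len=MAX_LEN):
    """Truncate each token sequence to max_len, then pad to the batch's longest."""
    truncated = [tokens[:max_len] for tokens in batch]
    seq_len = max(len(tokens) for tokens in truncated)
    return [tokens + [pad_token] * (seq_len - len(tokens)) for tokens in truncated]


def sinusoid_position_codes(seq_len, d):
    """Return a seq_len x d table: sine on even dimensions, cosine on odd ones."""
    table = [[0.0] * d for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d // 2):
            # The base 10000 is an assumed constant, not stated in the claims.
            angle = pos / (10000 ** (2 * i / d))
            table[pos][2 * i] = math.sin(angle)
            table[pos][2 * i + 1] = math.cos(angle)
    return table


# Example: two sequences of unequal length are padded to the longer one,
# and a 768-dimensional position code is produced for every position.
batch = pad_batch([["地", "震"], ["公", "司", "上", "市"]])
codes = sinusoid_position_codes(len(batch[0]), 768)
```

Each row of `codes` would be added element-wise to the 768-dimensional word vector at the same position.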

3. The Chinese event extraction method based on feature fusion according to claim 2, wherein the multi-head self-attention module containing a residual network is formed by connecting a multi-head attention module and a residual module; the inputs input₁ and input₂ of the multi-head attention module are both word vectors combined with the position codes; the three inputs of each attention head are a query vector sequence Q, a key vector sequence K and a value vector sequence V, and the number of attention heads is 12; Q is obtained from input₁ through a fully connected layer, and K and V are obtained from input₂ through fully connected layers, the mapping matrix of each of these fully connected layers having dimension 768 × 64; each attention head generates one group of Q, K and V, and each group is input to a scaled dot-product attention module to compute a context feature vector; finally, the resulting groups of context feature vectors are concatenated and input to a fully connected layer, whose mapping matrix has dimension 768 × 768, to obtain the output of the multi-head attention module; the residual module adds the input and the output of the preceding module and performs layer normalization;

the feed-forward (forward propagation) module containing a residual network is formed by connecting two fully connected layers, a GeLU activation function and a residual module, wherein the GeLU activation function is located between the two fully connected layers, the mapping matrix of the first fully connected layer has dimension 768 × 2048, the mapping matrix of the second fully connected layer has dimension 2048 × 768, and the residual module adds the input and the output of the preceding module and performs layer normalization; the GeLU activation function is expressed as:

$$\mathrm{GeLU}(x) = \frac{x}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

where x represents the output of the preceding module and erf(·) represents the Gaussian error function.
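As a rough illustration of the feed-forward module in claim 3, the Python sketch below applies two fully connected layers with GeLU in between, then the residual addition and layer normalization. The bias terms, the `eps` value, and the absence of a learned scale and shift in the layer normalization are simplifying assumptions; the claim specifies only the mapping matrix dimensions (768 × 2048 and 2048 × 768).

```python
import math


def gelu(x):
    """Exact GeLU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(vec, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(vec, weight, bias):
    """Compute vec @ weight + bias, with weight stored as in_dim rows of out_dim values."""
    return [sum(vec[k] * weight[k][j] for k in range(len(vec))) + bias[j]
            for j in range(len(bias))]


def feed_forward(vec, w1, b1, w2, b2):
    """Fully connected -> GeLU -> fully connected, then residual add and layer norm."""
    hidden = [gelu(h) for h in linear(vec, w1, b1)]
    out = linear(hidden, w2, b2)
    return layer_norm([x + y for x, y in zip(vec, out)])
```

With the dimensions of claim 3, `w1` would have 768 rows of 2048 values and `w2` would have 2048 rows of 768 values, so the output keeps the 768-dimensional shape required by the residual addition.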

4. The Chinese event extraction method based on feature fusion according to claim 2, wherein the word-level feature extraction network has the following structure, in sequence: input layer → word embedding layer → position coding → fully connected layer → output layer;

the input of the input layer is the token sequence obtained by word segmentation of the text; during training the maximum token sequence length is set to 128, and longer sequences are truncated; the token sequences within a batch are kept at equal length, and if their lengths differ they are padded to the length of the longest token sequence in the batch, which is denoted sequence length₂;

the embedding dimension (embedding size) of the word embedding layer is 128, that is, the word vector of each token is a 128-dimensional column vector;

the position coding uses sinusoidal position coding;

the mapping matrix of the fully connected layer has dimension 128 × 768.

5. The Chinese event extraction method based on feature fusion according to claim 4, wherein the feature fusion network has the following structure, in sequence: input layer → multi-head attention module → output layer;

the input of the input layer consists of two parts, namely the character-level feature vector input₁ and the word-level feature vector input₂; input₁ has dimension sequence length₁ × 768, and input₂ has dimension sequence length₂ × 768;

the three inputs of each attention head of the multi-head attention module are a query vector sequence Q, a key vector sequence K and a value vector sequence V, and the number of attention heads is 24; Q is obtained from input₁ through a fully connected layer, and K and V are obtained from input₂ through fully connected layers, the mapping matrix of each of these fully connected layers having dimension 768 × 32; each attention head generates one group of Q, K and V, and each group is input to a scaled dot-product attention module to compute a context feature vector; finally, the resulting groups of context feature vectors are concatenated and input to a fully connected layer, whose mapping matrix has dimension 768 × 768, to obtain the output of the multi-head attention module.
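The multi-head attention used for feature fusion can be sketched in plain Python as below; the same structure, with 12 heads of width 64, describes the attention module of claim 3. This is a shape-level sketch under stated assumptions: the function names are hypothetical, and randomly initialized matrices stand in for the trained fully connected layers.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_head)) V for one attention head."""
    d_head = len(q[0])
    scores = [[sum(x * y for x, y in zip(qi, kj)) / math.sqrt(d_head) for kj in k]
              for qi in q]
    return matmul([softmax(row) for row in scores], v)


def random_matrix(rows, cols, rng):
    """Stand-in for a trained mapping matrix (assumption: small Gaussian values)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def fuse(input1, input2, n_heads, rng):
    """Cross-attention: queries from input1, keys and values from input2."""
    d_model = len(input1[0])
    d_head = d_model // n_heads  # 768 // 24 = 32 with the sizes in claim 5
    heads = []
    for _ in range(n_heads):
        q = matmul(input1, random_matrix(d_model, d_head, rng))
        k = matmul(input2, random_matrix(d_model, d_head, rng))
        v = matmul(input2, random_matrix(d_model, d_head, rng))
        heads.append(scaled_dot_product_attention(q, k, v))
    # Concatenate the per-head context vectors along the feature axis.
    concat = [sum((head[t] for head in heads), []) for t in range(len(input1))]
    return matmul(concat, random_matrix(d_model, d_model, rng))
```

Because the queries come from input₁, the fused output has one row per position of input₁, which matches the sequence length₁ × 768 fusion feature vector consumed by the argument extraction back end in claim 6.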

6. The feature fusion-based Chinese event extraction method according to claim 5, wherein the back-end classification network is divided into two parts, namely an event detection back-end network and an event argument extraction back-end network;

the structural relationship of the event detection back-end network is as follows: input layer → fully connected layer → multi-label classifier → output layer;

the input of the input layer is the feature vector of the [CLS] token in the fusion feature vector, with dimension 1 × 768;

the mapping matrix of the fully connected layer has dimension 768 × n_events, where n_events is the total number of event types;

the multi-label classifier consists of n_events Sigmoid functions, and its final output is the probability of each event type being present in the current input text; if the probability of an event type is greater than 0.5, the text is considered to contain that event type, and otherwise it is considered not to contain it;

the structural relationship of the event argument extraction back-end network is as follows in sequence: input layer → fully connected layer → conditional random field → output layer;

the input of the input layer is the fusion feature vector, with dimension sequence length₁ × 768;

the mapping matrix of the fully connected layer has dimension 768 × n_labels, where n_labels is the total number of sequence labels for event arguments, and the labelling scheme is the BIO scheme;

the conditional random field is a linear-chain conditional random field, which finally outputs the sequence labelling result for the event arguments.
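The two back-end decisions of claim 6 can be illustrated with the Python sketch below: thresholding per-type Sigmoid outputs at 0.5 for event detection, and Viterbi decoding as one standard way to obtain the best label sequence from a linear-chain conditional random field. The function names, the event type names and the BIO labels in the example are hypothetical, and start and end transition scores are omitted for brevity.

```python
import math


def detect_events(logits, event_types, threshold=0.5):
    """Apply one sigmoid per event type; keep types whose probability exceeds the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [name for name, p in zip(event_types, probs) if p > threshold]


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label index sequence of a linear-chain CRF.

    emissions[t][j]   : score of label j at position t (fully connected layer output)
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step_score, step_back = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            step_back.append(best_i)
            step_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = step_score
        backpointers.append(step_back)
    # Trace the best path backwards from the best final label.
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for step_back in reversed(backpointers):
        best = step_back[best]
        path.append(best)
    return path[::-1]


# Hypothetical BIO label set; the transition O -> I-Time is made very unlikely.
labels = ["O", "B-Time", "I-Time"]
transitions = [[0.0, 0.0, -1e4], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
emissions = [[0.1, 2.0, 0.0], [0.0, 0.0, 1.5], [2.0, 0.0, 0.5]]
tags = [labels[j] for j in viterbi_decode(emissions, transitions)]
```

The transition scores are what let the conditional random field rule out label sequences that are invalid under the BIO scheme, such as an I label directly after O.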

7. The Chinese event extraction method based on feature fusion according to claim 1, wherein in step 2, at least 5000 texts describing events are crawled from the public internet, the event types, event arguments and argument roles contained in each text are manually annotated, and annotation files in one-to-one correspondence with the crawled texts are generated.

8. The method for extracting chinese events based on feature fusion according to claim 1, wherein the step 3 comprises:

step 3a, configuring the environment of the Chinese event extraction network BERT-FF;

step 3b, downloading the pre-training parameter file (Chinese_BERT_wwm.bin) of the pre-trained language model BERT;

step 3c, optimizing the pre-training parameters of BERT by using a contrastive learning method to obtain an optimized pre-training parameter file;

step 3d, loading the optimized pre-training parameter file into the character-level feature extraction network by using a transfer learning method to obtain a loaded character-level feature extraction network;

step 3e, training the Chinese event extraction network BERT-FF, with the loaded character-level feature extraction network, on the training data set to obtain the trained Chinese event extraction network BERT-FF.

9. The Chinese event extraction method based on feature fusion according to claim 8, wherein in step 3d, model parameters with strong semantic representation capability, obtained on a related pre-training task, are loaded into the character-level feature extraction network, so as to speed up convergence during model training, save training time, and reduce the risks of under-fitting and over-fitting.

10. The Chinese event extraction method based on feature fusion according to claim 8, wherein in step 3e, the number of training iterations is set to 50 rounds; after each round of training, testing is performed on the test samples, and the best test result and the corresponding network weights are saved, using the F1 score of event argument extraction as the criterion; the batch size during training is set to 24; the network optimizer is the adaptive moment estimation (Adam) algorithm; and the learning rate is layered: the learning rate of the character-level feature extraction network is set to 0.00002, the learning rate of the conditional random field is set to 0.002, and the learning rate of the other parts of the model is set to 0.0002.
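The layered learning rate and the F1-based model selection of claim 10 can be sketched as follows. The parameter-name prefixes (`encoder`, `crf`) and function names are hypothetical; the resulting list of `{"params": ..., "lr": ...}` dictionaries mirrors the per-parameter-group form that optimizers such as PyTorch's Adam accept, with names standing in for the actual parameter tensors.

```python
# Learning rates from claim 10; the name prefixes below are hypothetical.
LEARNING_RATES = {
    "encoder": 2e-5,  # feature extraction network loaded with pre-trained parameters
    "crf": 2e-3,      # conditional random field
}
DEFAULT_LR = 2e-4     # every other part of the model


def build_param_groups(param_names):
    """Group parameter names by the learning rate assigned to their top-level module."""
    groups = {}
    for name in param_names:
        prefix = name.split(".", 1)[0]
        lr = LEARNING_RATES.get(prefix, DEFAULT_LR)
        groups.setdefault(lr, []).append(name)
    return [{"lr": lr, "params": names} for lr, names in groups.items()]


def f1_score(n_correct, n_predicted, n_gold):
    """F1 of event argument extraction from raw counts."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_round(f1_per_round):
    """Return the 1-based round whose F1 is highest (earliest round on ties)."""
    best = max(range(len(f1_per_round)), key=lambda r: f1_per_round[r])
    return best + 1
```

In a training loop of 50 rounds, the weights would be saved whenever the current round's F1 exceeds the best value seen so far, which is equivalent to keeping the round returned by `best_round`.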