CN113312914B - Security event entity identification method based on pre-training model - Google Patents

Security event entity identification method based on pre-training model

Info

Publication number
CN113312914B
CN113312914B (application CN202110482621.8A)
Authority
CN
China
Prior art keywords
model
training
entity
security event
word
Prior art date
Legal status
Active
Application number
CN202110482621.8A
Other languages
Chinese (zh)
Other versions
CN113312914A (en)
Inventor
黑新宏
董林靖
朱磊
姬文江
刘雁孝
Current Assignee
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202110482621.8A
Publication of CN113312914A
Application granted
Publication of CN113312914B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention addresses Chinese named entity recognition in the public safety field and proposes an improved named entity recognition model, PreTrain100K+RoBERTa+-BiLSTM-CRF. During task-specific secondary pre-training of the original RoBERTa model, a whole-word Mask mechanism driven by a public safety domain dictionary is added, giving the PreTrain100K+RoBERTa+ model stronger Chinese language modelling capability. The generated pre-training language model and the extended domain entity dictionary are then input into a BiLSTM-CRF model for entity recognition training. The security event entity identification method based on the pre-training model raises the informatization level of public security event instances, supports better knowledge representation, preserves the semantic information in the text corpus, and enables construction of a knowledge graph for the public safety event field. The graph can be used for quick retrieval of accident cases, analysis of accident association paths, statistical analysis and the like, thereby improving China's public event management and strengthening the construction of the public safety emergency management system.

Description

Security event entity identification method based on pre-training model
Technical Field
The invention belongs to the technical field of artificial intelligence and natural language processing, and relates to a security event entity identification method based on a pre-training model.
Background
With the rapid economic development of China, safety incidents in cities are increasing. Public safety incidents seriously threaten the lives and property of the parties involved and of rescue workers, and also affect the national economy and people's daily lives. Public safety emergency management therefore needs to be strengthened. However, the knowledge and information contained in public opinion about public safety events currently cannot be effectively extracted and reused, and so cannot provide sufficient support and early warning for public safety event management.
In recent years, artificial intelligence has been an important direction of industrial development, with natural language processing as a key research area; its results have been applied in industries such as medicine, law and finance, greatly raising the level of intelligence in those fields. There is, however, a large amount of case text information in the public safety event field, and within existing natural language processing research, work on public safety events is still at an early stage. The invention analyzes the textual characteristics of public safety event instances with reference to established Chinese natural language processing methods. It focuses on extracting entities, relations and related information from the Chinese public emergency corpus, aiming to informatize public safety event instances, better represent knowledge, preserve the semantic information of the text, and construct a knowledge graph for the public safety event field. The graph can be used for quick retrieval of accident cases, analysis of accident association paths, statistical analysis and the like, thereby improving China's public event management and strengthening the construction of public safety emergency management systems.
Disclosure of Invention
The invention aims to provide a security event entity identification method based on a pre-training model, which solves the problem that information in the security event field cannot currently be extracted effectively.
According to the technical scheme adopted by the invention, in the security event entity identification method based on the pre-training model, a whole-word Mask mechanism of a public safety domain dictionary is first added during task secondary pre-training optimization of the original RoBERTa model, so that the PreTrain100K+RoBERTa+ model has better Chinese language modelling capability; the generated pre-training language model and the extended domain entity dictionary are then input into a BiLSTM-CRF model for entity recognition training; finally, the deep learning model is deployed as a server to test the entity recognition model, the test data set is input into the model to output the entity types of the test data, and the quality of the result is judged by evaluation indexes.
The method specifically comprises the following steps:
Step 1, obtain the CEC data set and its description file directly from GitHub;
Step 2, the CEC data set contains 332 public safety event instances and is annotated in XML, which includes the six most important data labels: Event, Denoter, Time, Location, Participant and Object; Event is used to describe events; Denoter, Time, Location, Participant and Object are used to describe the trigger words and elements of an event; the annotated entities are extracted with the Python language according to the XML labels, so as to construct a security event entity dictionary.
Step 3, named entity recognition research is carried out on the five labels Denoter, Time, Location, Participant and Object in the data text of step 2, abbreviated respectively as DEN, TIME, LOC, PAR and OBJ and representing behaviors, times, locations, participants and objects.
Step 4, for the originally unannotated CEC data set of step 1, the documents are divided by security event instance; 30 instances are randomly selected from the 332 as a validation set; the 332 instances are then divided in a 7:3 ratio into a training set of 232 instances and a test set of 100 instances for the experiments.
Step 5, BIO labeling is performed on the training and validation sets divided in step 4 to construct a security event data set for the named entity recognition task; the data file has two columns, with one character and its corresponding label per line;
Step 6, a domain pre-training data set is constructed: 100K unlabeled news-domain corpora are obtained from the Internet, the data are cleaned, redundant symbols and information are deleted, and the format of the pre-training data is unified.
Step 7, a Chinese pre-training language model is constructed: the news-domain pre-training data set obtained in step 6 is input into the PreTrain100K+RoBERTa+ pre-training model provided by the invention to generate a Chinese news-domain pre-training language model;
Step 8, an entity recognition model is constructed, taking the pre-training language model and the dynamic word vectors generated in step 7 as its input;
Step 9, the trained entity recognition model is deployed as a server to test the model: the test data set is input into the model, entity class labels of the test data are output, and automatic recognition of named entities in public security event text is finally realized.
In step 2, a security event entity dictionary is constructed, and the dictionary is fused into a pre-training model, so that the effect of a downstream named entity recognition model is improved.
According to the method, secondary domain pre-training is performed on RoBERTa with unlabeled news-domain data: a language model is trained on the large-scale unlabeled corpus in a self-supervised manner, and the resulting language model is connected to the downstream task model for fine-tuning.
The specific process of step 7 is as follows:
Step 7.1, a whole-word Mask mechanism is adopted: if some characters of a complete word are masked, the remaining characters of the same word are masked as well, which better matches Chinese grammar and lets the model learn Chinese expression patterns more effectively.
Step 7.2, the CEC security event entity dictionary constructed in step 2 is introduced into the word segmentation function of the RoBERTa model, so that the Mask mechanism preserves the complete semantics of public security event text entities during prediction.
Step 7.3, the 100K news-domain pre-training data and the security event entity dictionary are input into the model, the number of training iterations is set to 100000, and the security event domain pre-training model PreTrain100K+RoBERTa+ is obtained.
The specific process of step 8 is as follows:
Step 8.1, the constructed CEC entity training set is input into the PreTrain100K+RoBERTa+ model after secondary domain training; the pre-training model reads the entity training set line by line and outputs word vectors of single characters;
Step 8.2, the PreTrain100K+RoBERTa+ model converts each word in the entity training set into a one-dimensional vector, obtains the segment vector and position vector of the sentence, takes these as input to the deep learning model, and finally outputs a text feature vector fused with the semantic information of the whole text;
Step 8.3, the text feature vector is input into the BiLSTM-CRF model to generate the PreTrain100K+RoBERTa+-BiLSTM-CRF entity recognition model.
In step 8.3, entity prediction labeling is carried out on the rail transit specification corpus; the specific steps are as follows:
Step 8.3.1, taking an industrial area poisoning event as an example security event instance, the training set is vectorized with the pre-training model that has news-domain semantic capability; each word in the industrial area poisoning event is trained into a 768-dimensional vector, an initialization vector of each word is obtained, and the result is used as input to the deep learning model;
Step 8.3.2, the BiLSTM-CRF algorithm in deep learning is used: the bidirectional LSTM considers both past and future features through a forward input sequence and a reverse input sequence, predicting word semantics in context. For example, after the character 工 is input, BiLSTM predicts the probability that the next character is 业, and after 业 is input it predicts the probability of the following character 区, forming "工业区" (industrial area); this is the forward pass. When the sequence is input in reverse, the characters likely to occur before each position are predicted; the outputs of the two directions are then combined as the final result and input to the next layer;
Step 8.3.3, the feature matrix obtained in step 8.3.2 is used as input to the CRF, which performs sequence labeling by adding its feature functions to the feature matrix produced by BiLSTM, generating the entity recognition model; this model can identify entities in the security event field.
The beneficial effects of the invention are as follows:
The invention studies the entity recognition method for the public safety field, taking the Chinese emergency corpus as an example. The proposed method performs secondary domain pre-training on the RoBERTa model, training on an unlabeled public news-domain data set; through self-supervised learning over large-scale data, a task-relevant pre-training model is obtained that extracts the semantic representation of a word in a specific context, giving the model the ability to recognize named entities. The trained language model and the dynamic word vectors it outputs are used as input to the downstream named entity recognition task for fine-tuning, and the network is adapted to domain-specific data. The named entity recognition task uses a BiLSTM model to obtain contextual abstract features of public security instance text, combined with a conditional random field (CRF) for sequence decoding and entity category labeling, finally realizing automatic recognition of named entities in public security instance text. The security event entity identification method based on the pre-training model raises the informatization level of public security event instances, supports better knowledge representation, preserves the semantic information in the text corpus, and enables construction of a knowledge graph for the public security event field. The graph can be used for quick retrieval of accident cases, analysis of accident association paths, statistical analysis and the like, thereby improving China's public event management and strengthening the construction of the public safety emergency management system.
Drawings
FIG. 1 is a general framework diagram of a security event entity identification method based on a pre-training model of the present invention;
FIG. 2 is a general flow chart of a security event entity identification method based on a pre-training model of the present invention;
FIG. 3 is a schematic diagram of the RoBERTa pre-training model structure in the security event entity identification method based on a pre-training model of the present invention;
FIG. 4 is a schematic diagram of a structure of a model BiLSTM in the security event entity identification method based on a pre-training model according to the present invention;
FIG. 5 is a schematic diagram of a CRF model structure in the security event entity identification method based on a pre-training model of the present invention;
FIG. 6 is a schematic diagram of the PreTrain100K+RoBERTa+ pre-training language model structure in the security event entity identification method based on a pre-training model of the present invention;
FIG. 7 is a schematic flow chart of the PreTrain100K+RoBERTa+-BiLSTM-CRF entity recognition model in the security event entity identification method based on a pre-training model of the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention aims to provide a security event entity identification method based on a pre-training model; the overall framework is shown in FIG. 1. An improved named entity recognition model, PreTrain100K+RoBERTa+-BiLSTM-CRF, is proposed. During secondary domain pre-training optimization of the original RoBERTa model, a whole-word Mask mechanism of a public safety domain dictionary is added, so that the PreTrain100K+RoBERTa+ model has better Chinese language modelling capability. The generated pre-training language model and the extended domain entity dictionary are then input into the BiLSTM-CRF model for entity recognition training. The named entity recognition task uses the BiLSTM model to obtain contextual abstract features of public security instance text, with the model structure shown in FIG. 4; sequence decoding and entity category labeling are performed in combination with a conditional random field (CRF), with the model structure shown in FIG. 5. The deep learning model is then deployed as a server to test the entity recognition model: the test data set is input, the entity types of the test data are output, automatic recognition of named entities in public safety event text is finally realized, and the quality of the result is judged by evaluation indexes.
The invention relates to a security event entity identification method based on a pre-training model, which specifically comprises the following steps of:
Step 1, the experimental corpus of the invention is the Chinese Emergency Corpus (CEC) constructed by the Semantic Intelligence Laboratory of Shanghai University. The data set and its description file can be obtained directly from GitHub.
Step 2, the CEC data set contains 332 public security event instances and uses XML as the annotation format, which includes the six most important data labels: Event, Denoter, Time, Location, Participant and Object. Event is used to describe events; Denoter, Time, Location, Participant and Object are used to describe the trigger words and elements of an event. The annotated entities are extracted with the Python language according to the different XML labels to construct a security event entity dictionary.
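The exact CEC XML schema is not reproduced in this text, so the extraction of step 2 can only be sketched. A minimal Python sketch, assuming the six labels appear as nested XML elements (the real corpus may differ in nesting and attributes):

```python
import xml.etree.ElementTree as ET

# Toy fragment in the spirit of the CEC annotation scheme; the actual
# schema may differ, so treat this as an illustrative assumption.
SAMPLE = """<Body>
  <Event>
    <Denoter>中毒</Denoter>
    <Time>昨日</Time>
    <Location>工业区</Location>
    <Participant>工人</Participant>
    <Object>有毒气体</Object>
  </Event>
</Body>"""

LABEL_MAP = {"Denoter": "DEN", "Time": "TIME", "Location": "LOC",
             "Participant": "PAR", "Object": "OBJ"}

def build_entity_dictionary(xml_text):
    """Walk the annotated XML and map each entity string to its tag."""
    root = ET.fromstring(xml_text)
    dictionary = {}
    for elem in root.iter():
        if elem.tag in LABEL_MAP and elem.text:
            dictionary[elem.text.strip()] = LABEL_MAP[elem.tag]
    return dictionary

print(build_entity_dictionary(SAMPLE))
```

A pass like this over all 332 instances would yield the security event entity dictionary used later by the whole-word Mask mechanism.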
Step 3, the invention carries out named entity recognition research on the five labels Denoter, Time, Location, Participant and Object in the data text, abbreviated respectively as DEN, TIME, LOC, PAR and OBJ and representing behaviors, times, locations, participants and objects, as detailed in Table 1:
TABLE 1 Labels to be predicted
Label | Abbreviation | Meaning
Denoter | DEN | behavior (event trigger word)
Time | TIME | time
Location | LOC | location
Participant | PAR | participant
Object | OBJ | object
Step 4, the documents are divided by the number of security event instances; 30 instances are randomly selected from the 332 as the validation set; the whole data set is then divided in a 7:3 ratio into a training set of 232 instances and a test set of 100 instances for the experiments.
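The split of step 4 can be sketched in a few lines of Python; the random seed and the way instances are indexed here are illustrative assumptions, not the authors' actual procedure:

```python
import random

def split_cec(n_instances=332, n_val=30, train_ratio=0.7, seed=42):
    """Randomly pick a validation set, then split all instances 7:3 into
    train/test, mirroring the counts reported in the patent
    (232 train / 100 test). The seed is an illustrative choice."""
    ids = list(range(n_instances))
    rng = random.Random(seed)
    val = rng.sample(ids, n_val)
    shuffled = ids[:]
    rng.shuffle(shuffled)
    cut = int(n_instances * train_ratio)  # 332 * 0.7 -> 232 (floor)
    return shuffled[:cut], shuffled[cut:], val

train, test, val = split_cec()
print(len(train), len(test), len(val))  # 232 100 30
```

Note that a 7:3 split of 332 instances gives exactly the 232/100 counts stated in the text.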
Step 5, BIO labeling is performed on the training and validation sets divided in step 4 to construct a security event data set for the named entity recognition task; the data file has two columns, with one character and its corresponding label per line.
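The BIO annotation of step 5, one character and its label per line, can be illustrated with a small dictionary-driven tagger; this greedy longest-match sketch is an assumption, since the authors' actual annotation tooling is not described:

```python
def bio_tag(sentence, entities):
    """Character-level BIO tagging: entities maps surface string -> label.
    Greedy, longest-match-first; overlapping spans keep the first match."""
    tags = ["O"] * len(sentence)
    for ent, label in sorted(entities.items(), key=lambda kv: -len(kv[0])):
        start = sentence.find(ent)
        while start != -1:
            if all(t == "O" for t in tags[start:start + len(ent)]):
                tags[start] = "B-" + label
                for i in range(start + 1, start + len(ent)):
                    tags[i] = "I-" + label
            start = sentence.find(ent, start + 1)
    return list(zip(sentence, tags))

rows = bio_tag("工业区发生中毒", {"工业区": "LOC", "中毒": "DEN"})
for ch, tag in rows:
    print(ch, tag)  # one character and its label per line, as in step 5
```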
Step 6, the domain pre-training data set is constructed. 100K unlabeled news-domain corpora are obtained from the Internet, the data are cleaned, redundant symbols and information in the corpora are deleted, and the pre-training data are normalized. The format is as follows:
{"text":""}
{"text":""}
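A minimal sketch of the cleaning and formatting pass of step 6, writing one {"text": ...} JSON object per line as shown above; the concrete cleaning rules (which symbols count as redundant) are assumptions, since the patent does not specify them:

```python
import json
import re

def clean(text):
    """Illustrative cleaning pass: drop control characters, collapse
    whitespace, and collapse runs of repeated terminal punctuation."""
    text = re.sub(r"[\u0000-\u001f]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"([！？。])\1+", r"\1", text)
    return text

def to_jsonl(corpus):
    """Emit one {"text": ...} object per line, the format shown above."""
    return "\n".join(json.dumps({"text": clean(t)}, ensure_ascii=False)
                     for t in corpus if clean(t))

print(to_jsonl(["  某市工业区发生中毒事件！！\n", "相关部门已介入。"]))
```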
Step 7, the Chinese pre-training language model is constructed; the model structure is shown in FIG. 3. The news-domain pre-training data set obtained in step 6 is input into the PreTrain100K+RoBERTa+ pre-training model provided by the invention to generate a Chinese news-domain pre-training language model.
Step 7.1, the invention adopts a whole-word Mask mechanism: if some characters of a complete word are masked, the remaining characters of the same word are masked as well. This better matches Chinese grammar habits, so the model learns Chinese expression patterns more effectively; the specific scheme is shown in Table 2.
TABLE 2 Whole word Mask
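Table 2 itself is not reproduced in this text, but the whole-word Mask idea can be illustrated: once a dictionary-segmented word is selected for masking, all of its characters are masked together rather than in isolation. A toy sketch (the masking rate and seed are illustrative choices):

```python
import random

MASK = "[MASK]"

def whole_word_mask(words, mask_prob=0.3, seed=0):
    """Whole-word masking over a pre-segmented sentence: when a word is
    chosen, every character in it becomes [MASK] together."""
    rng = random.Random(seed)
    out = []
    for w in words:
        if rng.random() < mask_prob:
            out.extend([MASK] * len(w))  # mask all characters of the word
        else:
            out.extend(list(w))
    return out

# "工业区 / 发生 / 中毒" segmented with a domain dictionary keeps
# "工业区" as one unit, so its three characters are masked together.
print(whole_word_mask(["工业区", "发生", "中毒"], mask_prob=0.5, seed=1))
```

This is why the domain entity dictionary of step 7.2 matters: it is what keeps multi-character entities like "工业区" intact as masking units.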
Step 7.2, the CEC security event entity dictionary constructed in step 2 is introduced into the word segmentation function of the RoBERTa model, so that the complete semantics of public security event text entities are preserved when the Mask mechanism predicts; the model structure is shown in FIG. 6.
Step 7.3, the 100K news-domain pre-training data and the security event entity dictionary are input into the model, the number of training iterations is set to 100000, and the security event domain pre-training model PreTrain100K+RoBERTa+ is obtained. The pseudo code is as follows:
And 8, constructing an entity recognition model, and taking the pre-training language model and the dynamic word vector generated in the step 7 as the input of the entity recognition model.
Step 8.1, the constructed CEC entity training set is input into the PreTrain100K+RoBERTa+ model after secondary domain training; the pre-training model reads the entity training set line by line and outputs word vectors of single characters;
Step 8.2, the PreTrain100K+RoBERTa+ model converts each word in the entity training set into a one-dimensional vector, obtains the segment vector and position vector of the sentence, takes these as input to the deep learning model, and finally outputs a text feature vector fused with the semantic information of the whole text;
Step 8.3, the text feature vector is input into the BiLSTM-CRF model to generate the PreTrain100K+RoBERTa+-BiLSTM-CRF entity recognition model; the model structure is shown in FIG. 7. The pseudo code is as follows:
Entity prediction labeling is carried out on the rail transit standard corpus, and the specific steps are as follows:
Step 8.3.1, taking an industrial area poisoning event as an example security event instance, the training set is vectorized with the pre-training model that has news-domain semantic capability; each word in the industrial area poisoning event is trained into a 768-dimensional vector, an initialization vector of each word is obtained, and the result is then used as input to the deep learning model.
Step 8.3.2, the BiLSTM-CRF algorithm in deep learning is used: the bidirectional LSTM considers both past and future features through a forward input sequence and a reverse input sequence, predicting word semantics in context. For example, after the character 工 is input, BiLSTM predicts the probability that the next character is 业, and after 业 is input it predicts the probability of the following character 区, forming "工业区" (industrial area); this is the forward pass. When the sequence is input in reverse, the characters likely to occur before each position are predicted; the outputs of the two directions are then combined as the final result and input to the next layer.
Step 8.3.3, the feature matrix obtained in step 8.3.2 is used as input to the CRF, which performs sequence labeling by adding its feature functions to the feature matrix produced by BiLSTM, generating the entity recognition model; this model can identify entities in the security event field.
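The CRF decoding of step 8.3.3 can be illustrated with a minimal Viterbi search over toy emission and transition scores; the numbers here are hand-picked for the sketch, whereas a trained model learns both score tables:

```python
def viterbi(emissions, transitions, tags):
    """Find the highest-scoring tag path: emissions[t][tag] plays the role
    of the BiLSTM output, transitions[(prev, cur)] the CRF transition
    score; unlisted transitions are penalized."""
    best = {t: (emissions[0].get(t, 0.0), [t]) for t in tags}
    for emit in emissions[1:]:
        nxt = {}
        for cur in tags:
            score, path = max(
                (best[prev][0] + transitions.get((prev, cur), -1.0),
                 best[prev][1])
                for prev in tags)
            nxt[cur] = (score + emit.get(cur, 0.0), path + [cur])
        best = nxt
    return max(best.values(), key=lambda sp: sp[0])[1]

TAGS = ["B-LOC", "I-LOC", "O"]
TRANS = {("B-LOC", "I-LOC"): 1.0, ("I-LOC", "I-LOC"): 1.0,
         ("I-LOC", "O"): 0.5, ("O", "B-LOC"): 0.5, ("O", "O"): 0.5}
# Emission scores for the three characters of "工业区"; the transition
# scores discourage an I-LOC that does not follow B-LOC/I-LOC, which is
# how the CRF layer repairs noisy per-character emissions.
ems = [{"B-LOC": 2.0, "O": 1.0}, {"I-LOC": 1.5, "O": 1.4}, {"I-LOC": 2.0}]
print(viterbi(ems, TRANS, TAGS))
```

This is the structural constraint the CRF layer adds on top of the BiLSTM features: the best path is scored jointly over the whole sequence rather than per character.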
And 9, setting a trained entity identification model as a server test model effect, inputting a test data set into the model, outputting entity class labels of the test data, and finally realizing automatic identification of named entities in the public safety event text.
Step 10, the numbers of texts automatically recognized correctly and incorrectly by the entity recognition model are counted, and precision, accuracy, recall and the F1 value are used as the indexes for evaluating the named entity recognition model.
The evaluation standard for named entity recognition mainly judges whether the boundary of an entity is correct and whether the entity type is labeled correctly. During prediction, an entity is judged correctly predicted only when both its boundary and its type exactly match the predefined entity type. The evaluation indexes for named entity recognition are precision, accuracy, recall and the F1 value. The specific formulas are as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
wherein TP (True Positive) is the number of samples predicted positive that are actually positive; TN (True Negative) is the number predicted negative that are actually negative; FP (False Positive) is the number predicted positive that are actually negative; FN (False Negative) is the number predicted negative that are actually positive.
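A short sketch computing the four evaluation indexes from the confusion counts defined above (the sample counts are illustrative, not results from the patent):

```python
def prf(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion counts. In strict
    NER evaluation an entity counts as TP only when both its boundary and
    its type are exactly right."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

p, r, f1, acc = prf(tp=80, fp=20, fn=20, tn=80)
print(p, r, f1, acc)
```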

Claims (6)

1. A security event entity identification method based on a pre-training model, characterized in that: first, during task secondary pre-training optimization of the original RoBERTa model, a whole-word Mask mechanism of a public security domain dictionary is added, so that the PreTrain100K+RoBERTa+ model has better Chinese language modelling capability; then the generated pre-training language model and the extended domain entity dictionary are input into a BiLSTM-CRF model for entity recognition training; finally, the deep learning model is deployed as a server to test the entity recognition model, and a test data set is input into the model to output the entity types of the test data;
The method specifically comprises the following steps:
Step1, directly acquiring a CEC data set and an explanation file from github;
Step 2, the CEC data set contains 332 public safety event instances and uses the XML language as the annotation format, which includes the six most important data labels: Event, Denoter, Time, Location, Participant and Object; Event is used to describe events; Denoter, Time, Location, Participant and Object are used to describe the trigger words and elements of an event; the annotated entities are extracted with the Python language according to the different XML labels to construct a security event entity dictionary;
Step 3, named entity recognition research is carried out on the five labels Denoter, Time, Location, Participant and Object in the data text of step 2, abbreviated respectively as DEN, TIME, LOC, PAR and OBJ and representing behaviors, times, locations, participants and objects;
Step 4, for the originally unannotated CEC data set of step 1, the documents are divided by security event instance, and 30 instances are randomly selected from the 332 as a validation set; the 332 instances are then divided in a 7:3 ratio into a training set of 232 instances and a test set of 100 instances for the experiments;
Step 5, BIO labeling is performed on the training and validation sets divided in step 4 to construct a security event data set for the named entity recognition task, wherein the data file has two columns, with one character and its corresponding label per line;
Step 6, a domain pre-training data set is constructed: 100K unlabeled news-domain corpora are obtained from the Internet, the data are cleaned, redundant symbols and information in the corpora are deleted, and the format of the pre-training data is unified;
Step 7, a Chinese pre-training language model is constructed: the news-domain pre-training data set obtained in step 6 is input into the PreTrain100K+RoBERTa+ pre-training model to generate a Chinese news-domain pre-training language model;
Step 8, constructing an entity recognition model, and taking the pre-training language model and the dynamic word vector generated in the step 7 as the input of the entity recognition model;
Step 9, the trained entity recognition model is deployed as a server to test the model; the test data set is input into the model, entity class labels of the test data are output, and automatic recognition of named entities in public safety event text is finally realized.
2. The security event entity recognition method based on the pre-training model according to claim 1, wherein in the step 2, a security event entity dictionary is constructed, and the dictionary is fused into the pre-training model to improve the effect of the downstream named entity recognition model.
3. The security event entity identification method based on a pre-training model according to claim 1, wherein step 6 uses unlabeled news-domain data to perform secondary domain pre-training on RoBERTa: a language model is trained on the large-scale unlabeled corpus in a self-supervised manner, and the obtained language model is connected to the downstream task model for fine-tuning.
4. The security event entity identification method based on the pre-training model according to claim 1, wherein the specific process of step 7 is as follows:
Step 7.1, adopting a whole-word Mask mechanism: if some sub-words of a complete word are masked, the remaining sub-words of the same word are masked as well; this better matches Chinese grammatical habit, so the model learns Chinese expression patterns more effectively;
Step 7.2, introducing the CEC security event entity dictionary constructed in step 2 into the word segmentation function of the RoBERTa model, so that when the Mask mechanism predicts dictionary entries, the complete semantics of public security event text entities are preserved;
Step 7.3, inputting the 100K news-domain pre-training data and the security event entity dictionary into the model, setting the number of training iterations to 100,000, and obtaining the security-event-domain pre-training model PreTrain100K+RoBERTa+.
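Steps 7.1 and 7.2 can be sketched together as follows (a minimal pure-Python illustration with a tiny toy dictionary; the helper names are not from the patent): the text is first segmented by greedy longest match against the entity dictionary, and masking is then applied per unit, so an entity such as 中毒事件 is masked or kept as a whole.

```python
import random

def segment(text, entity_dict, max_len=8):
    """Greedy longest-match segmentation: spans found in the security-event
    entity dictionary stay as single units; other characters stand alone."""
    units, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + L] in entity_dict:
                units.append(text[i:i + L])
                i += L
                break
        else:
            units.append(text[i])
            i += 1
    return units

def whole_unit_mask(units, mask_prob=0.15, rng=None):
    """Whole-word masking: if a unit is selected, all of its characters
    are replaced by [MASK] together, never individually."""
    rng = rng or random.Random(0)
    tokens = []
    for u in units:
        if rng.random() < mask_prob:
            tokens.extend(["[MASK]"] * len(u))
        else:
            tokens.extend(list(u))
    return tokens
```

For example, with the dictionary {"工业区", "中毒事件"}, the sentence 工业区发生中毒事件 segments into ["工业区", "发", "生", "中毒事件"], and masking then operates on those whole spans.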
5. The security event entity identification method based on the pre-training model according to claim 1, wherein the specific process of step 8 is as follows:
Step 8.1, inputting the constructed CEC entity training set into the PreTrain100K+RoBERTa+ model that has undergone secondary domain training; the pre-training model reads the entity training set line by line and outputs the word vector of each single character;
Step 8.2, the PreTrain100K+RoBERTa+ model converts each word in the entity training set into a one-dimensional vector and obtains the segment vector and position vector of the sentence; these are taken as the input of the deep learning model, which finally outputs a text feature vector fused with the semantic information of the whole text;
Step 8.3, inputting the text feature vector into the BiLSTM-CRF model to generate the PreTrain100K+RoBERTa+-BiLSTM-CRF entity recognition model.
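The input representation in step 8.2 (word vector plus segment vector plus position vector) can be illustrated with toy two-dimensional embeddings — a sketch only; the real model uses 768-dimensional learned embeddings:

```python
def input_vectors(tokens, token_emb, seg_emb, pos_emb, segment_id=0):
    """BERT-style input representation: for each character, sum its token
    embedding, the sentence's segment embedding, and the position embedding."""
    out = []
    for pos, tok in enumerate(tokens):
        t, s, p = token_emb[tok], seg_emb[segment_id], pos_emb[pos]
        out.append([a + b + c for a, b, c in zip(t, s, p)])
    return out
```

The resulting per-character vectors are what the downstream BiLSTM-CRF layer consumes.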
6. The security event entity identification method based on the pre-training model according to claim 5, wherein in step 8.3 entity prediction labeling is performed on the rail transit specification corpus, with the following specific steps:
Step 8.3.1, taking an industrial area poisoning event as the security event example, the training set is vectorized using the pre-training model that carries news-domain semantic capability; each word in "industrial area poisoning event" is trained to obtain a 768-dimensional vector, giving the initialization vector of each word, and the result is used as the input of the deep learning model;
Step 8.3.2, using the BiLSTM-CRF algorithm in deep learning: the bidirectional LSTM considers both past and future features through a forward input sequence and a reverse input sequence, predicting the meaning of each character in context. In the forward pass, after the character 工 is input, BiLSTM predicts the probability that the next character is 业, and after 业 it predicts the probability of the next character 区 (together forming "industrial area"); when the sequence is input in reverse, the model predicts the probability of the characters that may precede 中毒事件 ("poisoning event"). The outputs of the two directions are then combined as the final result and input to the next layer;
Step 8.3.3, taking the feature matrix obtained in step 8.3.2 as the input of the CRF; the CRF performs sequence labeling by combining its feature functions with the feature matrix obtained from BiLSTM, generating the entity identification model, which can identify entities in the security event field.
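The CRF layer in step 8.3.3 selects the globally best tag sequence from the BiLSTM's per-character scores. A minimal Viterbi decoder sketches this idea (pure Python; the tag set and scores in the usage example are toy values, not from the patent):

```python
def viterbi_decode(emissions, transitions, tags):
    """CRF decoding: return the highest-scoring tag path given per-token
    emission scores (e.g. from a BiLSTM) and tag-to-tag transition scores."""
    score = {t: emissions[0][t] for t in tags}  # best score of a path ending in tag t
    back = []                                   # back-pointers per position
    for i in range(1, len(emissions)):
        new_score, ptr = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: score[p] + transitions[(p, cur)])
            new_score[cur] = score[prev] + transitions[(prev, cur)] + emissions[i][cur]
            ptr[cur] = prev
        score = new_score
        back.append(ptr)
    last = max(tags, key=lambda t: score[t])
    path = [last]
    for ptr in reversed(back):                  # walk back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]
```

The transition scores are what let the CRF rule out label sequences that are locally plausible but globally invalid, such as an I tag following an O tag.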
CN202110482621.8A 2021-04-30 2021-04-30 Security event entity identification method based on pre-training model Active CN113312914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110482621.8A CN113312914B (en) 2021-04-30 2021-04-30 Security event entity identification method based on pre-training model


Publications (2)

Publication Number Publication Date
CN113312914A CN113312914A (en) 2021-08-27
CN113312914B true CN113312914B (en) 2024-06-14

Family

ID=77371586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110482621.8A Active CN113312914B (en) 2021-04-30 2021-04-30 Security event entity identification method based on pre-training model

Country Status (1)

Country Link
CN (1) CN113312914B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961705A (en) * 2021-10-29 2022-01-21 聚好看科技股份有限公司 Text classification method and server
CN114065760B (en) * 2022-01-14 2022-06-10 中南大学 Legal text class case retrieval method and system based on pre-training language model
CN115482665B (en) * 2022-09-13 2023-09-15 重庆邮电大学 Knowledge and data collaborative driving multi-granularity traffic accident prediction method and device
CN116756328B (en) * 2023-08-23 2023-11-07 北京宝隆泓瑞科技有限公司 Gas pipeline accident text recognition method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626056A (en) * 2020-04-11 2020-09-04 中国人民解放军战略支援部队信息工程大学 Chinese named entity identification method and device based on RoBERTA-BiGRU-LAN model
CN111967266A (en) * 2020-09-09 2020-11-20 中国人民解放军国防科技大学 Chinese named entity recognition model and construction method and application thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11218500B2 (en) * 2019-07-31 2022-01-04 Secureworks Corp. Methods and systems for automated parsing and identification of textual data
CN111460820B (en) * 2020-03-06 2022-06-17 中国科学院信息工程研究所 Network space security domain named entity recognition method and device based on pre-training model BERT



Similar Documents

Publication Publication Date Title
CN113312914B (en) Security event entity identification method based on pre-training model
CN107992597B (en) Text structuring method for power grid fault case
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN112115721B (en) Named entity recognition method and device
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
Carbonell et al. Joint recognition of handwritten text and named entities with a neural end-to-end model
CN113779358B (en) Event detection method and system
CN116151256A (en) Small sample named entity recognition method based on multitasking and prompt learning
CN113204967B (en) Resume named entity identification method and system
CN114417839A (en) Entity relation joint extraction method based on global pointer network
CN114298035A (en) Text recognition desensitization method and system thereof
Li et al. A method for resume information extraction using bert-bilstm-crf
CN115080750B (en) Weak supervision text classification method, system and device based on fusion prompt sequence
CN113919366A (en) Semantic matching method and device for power transformer knowledge question answering
CN114492460B (en) Event causal relationship extraction method based on derivative prompt learning
CN115203406A (en) RoBERTA model-based long text information ground detection method
Wosiak Automated extraction of information from Polish resume documents in the IT recruitment process
Jin et al. Fintech key-phrase: a new Chinese financial high-tech dataset accelerating expression-level information retrieval
CN111178080A (en) Named entity identification method and system based on structured information
CN114564950A (en) Electric Chinese named entity recognition method combining word sequence
Wu et al. One improved model of named entity recognition by combining BERT and BiLSTM-CNN for domain of Chinese railway construction
CN115757775B (en) Text inclusion-based trigger word-free text event detection method and system
CN116187304A (en) Automatic text error correction algorithm and system based on improved BERT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant