CN113312914B - Security event entity identification method based on pre-training model - Google Patents

Security event entity identification method based on pre-training model

Info

Publication number
CN113312914B
CN113312914B (application CN202110482621.8A)
Authority
CN
China
Prior art keywords
model
training
entity
security event
word
Prior art date
Legal status
Active
Application number
CN202110482621.8A
Other languages
Chinese (zh)
Other versions
CN113312914A (en)
Inventor
黑新宏
董林靖
朱磊
姬文江
刘雁孝
Current Assignee
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202110482621.8A
Publication of CN113312914A
Application granted
Publication of CN113312914B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention addresses Chinese named entity recognition in the public safety field and proposes an improved named entity recognition model, PreTrain100K+RoBERTa+-BiLSTM-CRF. During task-specific secondary pre-training of the original RoBERTa model, a whole-word Mask mechanism driven by a public safety domain dictionary is added, giving the PreTrain100K+RoBERTa+ model stronger Chinese language modelling capability. The generated pre-training language model and the extended domain entity dictionary are then input into a BiLSTM-CRF model for entity recognition training. The security event entity identification method based on the pre-training model raises the informatization level of public security event instances, supports better knowledge representation, preserves the semantic information in the text corpus, and enables construction of a knowledge graph for the public safety event field. The graph can be used for quick retrieval of accident cases, analysis of accident association paths, statistical analysis and the like, thereby improving China's public event management and strengthening the construction of the public safety emergency management system.

Description

Security event entity identification method based on pre-training model
Technical Field
The invention belongs to the technical field of artificial intelligence and natural language processing, and relates to a security event entity identification method based on a pre-training model.
Background
With the rapid economic development of China, safety incidents in cities are increasing. Public safety incidents seriously threaten the lives and property of the parties involved and of rescue workers, and also affect the national economy and people's daily lives. Public safety emergency management therefore needs to be strengthened. However, the knowledge and information contained in public opinion about public safety events currently cannot be effectively extracted and reused, and so cannot provide sufficient support and early warning for public safety event management.
In recent years, artificial intelligence has been an important direction of industrial development, with natural language processing as a key research area; its results have been applied in industries such as medicine, law and finance, greatly raising the level of intelligence in those fields. There is, however, a large amount of case text information in the public safety event field, and within existing natural language processing research, work on public safety events is still at an early stage. The invention analyzes the textual characteristics of public safety event instances with reference to established Chinese natural language processing methods. It focuses on extracting entities, relations and related information from the Chinese public emergency corpus, aiming to informatize public safety event instances, better represent knowledge, preserve the semantic information of the text, and construct a knowledge graph for the public safety event field. The graph can be used for quick retrieval of accident cases, analysis of accident association paths, statistical analysis and the like, thereby improving China's public event management and strengthening the construction of public safety emergency management systems.
Disclosure of Invention
The invention aims to provide a security event entity identification method based on a pre-training model, which solves the problem that information in the security event field cannot currently be extracted effectively.
According to the technical scheme adopted by the invention, in the security event entity identification method based on the pre-training model, a whole-word Mask mechanism of a public safety domain dictionary is first added during task secondary pre-training optimization of the original RoBERTa model, so that the PreTrain100K+RoBERTa+ model has better Chinese language modelling capability; the generated pre-training language model and the extended domain entity dictionary are then input into a BiLSTM-CRF model for entity recognition training; finally, the deep learning model is deployed as a server to test the entity recognition model, the test data set is input into the model to output the entity types of the test data, and the quality of the result is judged by evaluation indexes.
The method specifically comprises the following steps:
Step 1, obtain the CEC data set and its description file directly from GitHub;
Step 2, the CEC data set contains 332 public safety event instances and is annotated in XML, which includes the six most important data labels: Event, Denoter, Time, Location, Participant and Object; Event is used to describe events; Denoter, Time, Location, Participant and Object are used to describe the trigger words and elements of an event; the annotated entities are extracted with the Python language according to the XML labels, so as to construct a security event entity dictionary.
Step 3, named entity recognition research is carried out on the five labels Denoter, Time, Location, Participant and Object in the data text of step 2, abbreviated respectively as DEN, TIME, LOC, PAR and OBJ and representing behaviors, times, locations, participants and objects.
Step 4, for the originally unannotated CEC data set of step 1, the documents are divided by security event instance; 30 instances are randomly selected from the 332 as a validation set; the 332 instances are then divided in a 7:3 ratio into a training set of 232 instances and a test set of 100 instances for the experiments.
Step 5, BIO labeling is performed on the training and validation sets divided in step 4 to construct a security event data set for the named entity recognition task; the data file has two columns, with one character and its corresponding label per line;
Step 6, a domain pre-training data set is constructed: 100K unlabeled news-domain corpora are obtained from the Internet, the data are cleaned, redundant symbols and information are deleted, and the format of the pre-training data is unified.
Step 7, a Chinese pre-training language model is constructed: the news-domain pre-training data set obtained in step 6 is input into the PreTrain100K+RoBERTa+ pre-training model provided by the invention to generate a Chinese news-domain pre-training language model;
Step 8, an entity recognition model is constructed, taking the pre-training language model and the dynamic word vectors generated in step 7 as its input;
Step 9, the trained entity recognition model is deployed as a server to test the model: the test data set is input into the model, entity class labels of the test data are output, and automatic recognition of named entities in public security event text is finally realized.
In step 2, a security event entity dictionary is constructed, and the dictionary is fused into a pre-training model, so that the effect of a downstream named entity recognition model is improved.
According to the method, secondary domain pre-training is performed on RoBERTa with unlabeled news-domain data: a language model is trained on the large-scale unlabeled corpus in a self-supervised manner, and the resulting language model is connected to the downstream task model for fine-tuning.
The specific process of step 7 is as follows:
Step 7.1, a whole-word Mask mechanism is adopted: if some characters of a complete word are masked, the remaining characters of the same word are masked as well, which better matches Chinese grammar and lets the model learn Chinese expression patterns more effectively.
Step 7.2, the CEC security event entity dictionary constructed in step 2 is introduced into the word segmentation function of the RoBERTa model, so that the Mask mechanism preserves the complete semantics of public security event text entities during prediction.
Step 7.3, the 100K news-domain pre-training data and the security event entity dictionary are input into the model, the number of training iterations is set to 100000, and the security event domain pre-training model PreTrain100K+RoBERTa+ is obtained.
The specific process of step 8 is as follows:
Step 8.1, the constructed CEC entity training set is input into the PreTrain100K+RoBERTa+ model after secondary domain training; the pre-training model reads the entity training set line by line and outputs word vectors of single characters;
Step 8.2, the PreTrain100K+RoBERTa+ model converts each word in the entity training set into a one-dimensional vector, obtains the segment vector and position vector of the sentence, takes these as input to the deep learning model, and finally outputs a text feature vector fused with the semantic information of the whole text;
Step 8.3, the text feature vector is input into the BiLSTM-CRF model to generate the PreTrain100K+RoBERTa+-BiLSTM-CRF entity recognition model.
In step 8.3, entity prediction labeling is carried out on the rail transit specification corpus; the specific steps are as follows:
Step 8.3.1, taking an industrial area poisoning event as an example security event instance, the training set is vectorized with the pre-training model that has news-domain semantic capability; each word in the industrial area poisoning event is trained into a 768-dimensional vector, an initialization vector of each word is obtained, and the result is used as input to the deep learning model;
Step 8.3.2, the BiLSTM-CRF algorithm in deep learning is used: the bidirectional LSTM considers both past and future features through a forward input sequence and a reverse input sequence, predicting word semantics in context. For example, after the character 工 is input, BiLSTM predicts the probability that the next character is 业, and after 业 is input it predicts the probability of the following character 区, forming "工业区" (industrial area); this is the forward pass. When the sequence is input in reverse, the characters likely to occur before each position are predicted; the outputs of the two directions are then combined as the final result and input to the next layer;
Step 8.3.3, the feature matrix obtained in step 8.3.2 is used as input to the CRF, which performs sequence labeling by adding its feature functions to the feature matrix produced by BiLSTM, generating the entity recognition model; this model can identify entities in the security event field.
The beneficial effects of the invention are as follows:
The invention studies the entity recognition method for the public safety field, taking the Chinese emergency corpus as an example. The proposed method performs secondary domain pre-training on the RoBERTa model, training on an unlabeled public news-domain data set; through self-supervised learning over large-scale data, a task-relevant pre-training model is obtained that extracts the semantic representation of a word in a specific context, giving the model the ability to recognize named entities. The trained language model and the dynamic word vectors it outputs are used as input to the downstream named entity recognition task for fine-tuning, and the network is adapted to domain-specific data. The named entity recognition task uses a BiLSTM model to obtain contextual abstract features of public security instance text, combined with a conditional random field (CRF) for sequence decoding and entity category labeling, finally realizing automatic recognition of named entities in public security instance text. The security event entity identification method based on the pre-training model raises the informatization level of public security event instances, supports better knowledge representation, preserves the semantic information in the text corpus, and enables construction of a knowledge graph for the public security event field. The graph can be used for quick retrieval of accident cases, analysis of accident association paths, statistical analysis and the like, thereby improving China's public event management and strengthening the construction of the public safety emergency management system.
Drawings
FIG. 1 is a general framework diagram of a security event entity identification method based on a pre-training model of the present invention;
FIG. 2 is a general flow chart of a security event entity identification method based on a pre-training model of the present invention;
FIG. 3 is a schematic diagram of the RoBERTa pre-training model structure in the security event entity identification method based on a pre-training model of the present invention;
FIG. 4 is a schematic diagram of a structure of a model BiLSTM in the security event entity identification method based on a pre-training model according to the present invention;
FIG. 5 is a schematic diagram of a CRF model structure in the security event entity identification method based on a pre-training model of the present invention;
FIG. 6 is a schematic diagram of the PreTrain100K+RoBERTa+ pre-training language model structure in the security event entity identification method based on a pre-training model of the present invention;
FIG. 7 is a schematic flow chart of the PreTrain100K+RoBERTa+-BiLSTM-CRF entity recognition model in the security event entity identification method based on a pre-training model of the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention aims to provide a security event entity identification method based on a pre-training model; the overall framework is shown in FIG. 1. An improved named entity recognition model, PreTrain100K+RoBERTa+-BiLSTM-CRF, is proposed. During secondary domain pre-training optimization of the original RoBERTa model, a whole-word Mask mechanism of a public safety domain dictionary is added, so that the PreTrain100K+RoBERTa+ model has better Chinese language modelling capability. The generated pre-training language model and the extended domain entity dictionary are then input into the BiLSTM-CRF model for entity recognition training. The named entity recognition task uses the BiLSTM model to obtain contextual abstract features of public security instance text, with the model structure shown in FIG. 4; sequence decoding and entity category labeling are performed in combination with a conditional random field (CRF), with the model structure shown in FIG. 5. The deep learning model is then deployed as a server to test the entity recognition model: the test data set is input, the entity types of the test data are output, automatic recognition of named entities in public safety event text is finally realized, and the quality of the result is judged by evaluation indexes.
The invention relates to a security event entity identification method based on a pre-training model, which specifically comprises the following steps of:
Step 1, the experimental corpus of the invention is the Chinese Emergency Corpus (CEC) constructed by the Semantic Intelligence Laboratory of Shanghai University. The data set and its description file can be obtained directly from GitHub.
Step 2, the CEC data set contains 332 public security event instances and uses XML as the annotation format, which includes the six most important data labels: Event, Denoter, Time, Location, Participant and Object. Event is used to describe events; Denoter, Time, Location, Participant and Object are used to describe the trigger words and elements of an event. The annotated entities are extracted with the Python language according to the different XML labels to construct a security event entity dictionary.
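The exact CEC XML schema is not reproduced in this text, so the extraction of step 2 can only be sketched. A minimal Python sketch, assuming the six labels appear as nested XML elements (the real corpus may differ in nesting and attributes):

```python
import xml.etree.ElementTree as ET

# Toy fragment in the spirit of the CEC annotation scheme; the actual
# schema may differ, so treat this as an illustrative assumption.
SAMPLE = """<Body>
  <Event>
    <Denoter>中毒</Denoter>
    <Time>昨日</Time>
    <Location>工业区</Location>
    <Participant>工人</Participant>
    <Object>有毒气体</Object>
  </Event>
</Body>"""

LABEL_MAP = {"Denoter": "DEN", "Time": "TIME", "Location": "LOC",
             "Participant": "PAR", "Object": "OBJ"}

def build_entity_dictionary(xml_text):
    """Walk the annotated XML and map each entity string to its tag."""
    root = ET.fromstring(xml_text)
    dictionary = {}
    for elem in root.iter():
        if elem.tag in LABEL_MAP and elem.text:
            dictionary[elem.text.strip()] = LABEL_MAP[elem.tag]
    return dictionary

print(build_entity_dictionary(SAMPLE))
```

A pass like this over all 332 instances would yield the security event entity dictionary used later by the whole-word Mask mechanism.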
Step 3, the invention carries out named entity recognition research on the five labels Denoter, Time, Location, Participant and Object in the data text, abbreviated respectively as DEN, TIME, LOC, PAR and OBJ and representing behaviors, times, locations, participants and objects, as detailed in Table 1:
TABLE 1 Labels to be predicted
Label | Abbreviation | Meaning
Denoter | DEN | behavior (event trigger word)
Time | TIME | time
Location | LOC | location
Participant | PAR | participant
Object | OBJ | object
Step 4, the documents are divided by the number of security event instances; 30 instances are randomly selected from the 332 as the validation set; the whole data set is then divided in a 7:3 ratio into a training set of 232 instances and a test set of 100 instances for the experiments.
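The split of step 4 can be sketched in a few lines of Python; the random seed and the way instances are indexed here are illustrative assumptions, not the authors' actual procedure:

```python
import random

def split_cec(n_instances=332, n_val=30, train_ratio=0.7, seed=42):
    """Randomly pick a validation set, then split all instances 7:3 into
    train/test, mirroring the counts reported in the patent
    (232 train / 100 test). The seed is an illustrative choice."""
    ids = list(range(n_instances))
    rng = random.Random(seed)
    val = rng.sample(ids, n_val)
    shuffled = ids[:]
    rng.shuffle(shuffled)
    cut = int(n_instances * train_ratio)  # 332 * 0.7 -> 232 (floor)
    return shuffled[:cut], shuffled[cut:], val

train, test, val = split_cec()
print(len(train), len(test), len(val))  # 232 100 30
```

Note that a 7:3 split of 332 instances gives exactly the 232/100 counts stated in the text.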
Step 5, BIO labeling is performed on the training and validation sets divided in step 4 to construct a security event data set for the named entity recognition task; the data file has two columns, with one character and its corresponding label per line.
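The BIO annotation of step 5, one character and its label per line, can be illustrated with a small dictionary-driven tagger; this greedy longest-match sketch is an assumption, since the authors' actual annotation tooling is not described:

```python
def bio_tag(sentence, entities):
    """Character-level BIO tagging: entities maps surface string -> label.
    Greedy, longest-match-first; overlapping spans keep the first match."""
    tags = ["O"] * len(sentence)
    for ent, label in sorted(entities.items(), key=lambda kv: -len(kv[0])):
        start = sentence.find(ent)
        while start != -1:
            if all(t == "O" for t in tags[start:start + len(ent)]):
                tags[start] = "B-" + label
                for i in range(start + 1, start + len(ent)):
                    tags[i] = "I-" + label
            start = sentence.find(ent, start + 1)
    return list(zip(sentence, tags))

rows = bio_tag("工业区发生中毒", {"工业区": "LOC", "中毒": "DEN"})
for ch, tag in rows:
    print(ch, tag)  # one character and its label per line, as in step 5
```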
Step 6, the domain pre-training data set is constructed. 100K unlabeled news-domain corpora are obtained from the Internet, the data are cleaned, redundant symbols and information in the corpora are deleted, and the pre-training data are normalized. The format is as follows:
{"text":""}
{"text":""}
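A minimal sketch of the cleaning and formatting pass of step 6, writing one {"text": ...} JSON object per line as shown above; the concrete cleaning rules (which symbols count as redundant) are assumptions, since the patent does not specify them:

```python
import json
import re

def clean(text):
    """Illustrative cleaning pass: drop control characters, collapse
    whitespace, and collapse runs of repeated terminal punctuation."""
    text = re.sub(r"[\u0000-\u001f]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"([！？。])\1+", r"\1", text)
    return text

def to_jsonl(corpus):
    """Emit one {"text": ...} object per line, the format shown above."""
    return "\n".join(json.dumps({"text": clean(t)}, ensure_ascii=False)
                     for t in corpus if clean(t))

print(to_jsonl(["  某市工业区发生中毒事件！！\n", "相关部门已介入。"]))
```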
Step 7, the Chinese pre-training language model is constructed; the model structure is shown in FIG. 3. The news-domain pre-training data set obtained in step 6 is input into the PreTrain100K+RoBERTa+ pre-training model provided by the invention to generate a Chinese news-domain pre-training language model.
Step 7.1, the invention adopts a whole-word Mask mechanism: if some characters of a complete word are masked, the remaining characters of the same word are masked as well. This better matches Chinese grammar habits, so the model learns Chinese expression patterns more effectively; the specific scheme is shown in Table 2.
TABLE 2 Whole word Mask
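Table 2 itself is not reproduced in this text, but the whole-word Mask idea can be illustrated: once a dictionary-segmented word is selected for masking, all of its characters are masked together rather than in isolation. A toy sketch (the masking rate and seed are illustrative choices):

```python
import random

MASK = "[MASK]"

def whole_word_mask(words, mask_prob=0.3, seed=0):
    """Whole-word masking over a pre-segmented sentence: when a word is
    chosen, every character in it becomes [MASK] together."""
    rng = random.Random(seed)
    out = []
    for w in words:
        if rng.random() < mask_prob:
            out.extend([MASK] * len(w))  # mask all characters of the word
        else:
            out.extend(list(w))
    return out

# "工业区 / 发生 / 中毒" segmented with a domain dictionary keeps
# "工业区" as one unit, so its three characters are masked together.
print(whole_word_mask(["工业区", "发生", "中毒"], mask_prob=0.5, seed=1))
```

This is why the domain entity dictionary of step 7.2 matters: it is what keeps multi-character entities like "工业区" intact as masking units.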
Step 7.2, the CEC security event entity dictionary constructed in step 2 is introduced into the word segmentation function of the RoBERTa model, so that the complete semantics of public security event text entities are preserved when the Mask mechanism predicts; the model structure is shown in FIG. 6.
Step 7.3, the 100K news-domain pre-training data and the security event entity dictionary are input into the model, the number of training iterations is set to 100000, and the security event domain pre-training model PreTrain100K+RoBERTa+ is obtained. The pseudo code is as follows:
And 8, constructing an entity recognition model, and taking the pre-training language model and the dynamic word vector generated in the step 7 as the input of the entity recognition model.
Step 8.1, the constructed CEC entity training set is input into the PreTrain100K+RoBERTa+ model after secondary domain training; the pre-training model reads the entity training set line by line and outputs word vectors of single characters;
Step 8.2, the PreTrain100K+RoBERTa+ model converts each word in the entity training set into a one-dimensional vector, obtains the segment vector and position vector of the sentence, takes these as input to the deep learning model, and finally outputs a text feature vector fused with the semantic information of the whole text;
Step 8.3, the text feature vector is input into the BiLSTM-CRF model to generate the PreTrain100K+RoBERTa+-BiLSTM-CRF entity recognition model; the model structure is shown in FIG. 7. The pseudo code is as follows:
Entity prediction labeling is carried out on the rail transit standard corpus, and the specific steps are as follows:
Step 8.3.1, taking an industrial area poisoning event as an example security event instance, the training set is vectorized with the pre-training model that has news-domain semantic capability; each word in the industrial area poisoning event is trained into a 768-dimensional vector, an initialization vector of each word is obtained, and the result is then used as input to the deep learning model.
Step 8.3.2, the BiLSTM-CRF algorithm in deep learning is used: the bidirectional LSTM considers both past and future features through a forward input sequence and a reverse input sequence, predicting word semantics in context. For example, after the character 工 is input, BiLSTM predicts the probability that the next character is 业, and after 业 is input it predicts the probability of the following character 区, forming "工业区" (industrial area); this is the forward pass. When the sequence is input in reverse, the characters likely to occur before each position are predicted; the outputs of the two directions are then combined as the final result and input to the next layer.
Step 8.3.3, the feature matrix obtained in step 8.3.2 is used as input to the CRF, which performs sequence labeling by adding its feature functions to the feature matrix produced by BiLSTM, generating the entity recognition model; this model can identify entities in the security event field.
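The CRF decoding of step 8.3.3 can be illustrated with a minimal Viterbi search over toy emission and transition scores; the numbers here are hand-picked for the sketch, whereas a trained model learns both score tables:

```python
def viterbi(emissions, transitions, tags):
    """Find the highest-scoring tag path: emissions[t][tag] plays the role
    of the BiLSTM output, transitions[(prev, cur)] the CRF transition
    score; unlisted transitions are penalized."""
    best = {t: (emissions[0].get(t, 0.0), [t]) for t in tags}
    for emit in emissions[1:]:
        nxt = {}
        for cur in tags:
            score, path = max(
                (best[prev][0] + transitions.get((prev, cur), -1.0),
                 best[prev][1])
                for prev in tags)
            nxt[cur] = (score + emit.get(cur, 0.0), path + [cur])
        best = nxt
    return max(best.values(), key=lambda sp: sp[0])[1]

TAGS = ["B-LOC", "I-LOC", "O"]
TRANS = {("B-LOC", "I-LOC"): 1.0, ("I-LOC", "I-LOC"): 1.0,
         ("I-LOC", "O"): 0.5, ("O", "B-LOC"): 0.5, ("O", "O"): 0.5}
# Emission scores for the three characters of "工业区"; the transition
# scores discourage an I-LOC that does not follow B-LOC/I-LOC, which is
# how the CRF layer repairs noisy per-character emissions.
ems = [{"B-LOC": 2.0, "O": 1.0}, {"I-LOC": 1.5, "O": 1.4}, {"I-LOC": 2.0}]
print(viterbi(ems, TRANS, TAGS))
```

This is the structural constraint the CRF layer adds on top of the BiLSTM features: the best path is scored jointly over the whole sequence rather than per character.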
And 9, setting a trained entity identification model as a server test model effect, inputting a test data set into the model, outputting entity class labels of the test data, and finally realizing automatic identification of named entities in the public safety event text.
Step 10, the numbers of texts automatically recognized correctly and incorrectly by the entity recognition model are counted, and precision, accuracy, recall and the F1 value are used as the indexes for evaluating the named entity recognition model.
The evaluation standard for named entity recognition mainly judges whether the boundary of an entity is correct and whether the entity type is labeled correctly. During prediction, an entity is judged correctly predicted only when both its boundary and its type exactly match the predefined entity type. The evaluation indexes for named entity recognition are precision, accuracy, recall and the F1 value. The specific formulas are as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
wherein TP (True Positive) is the number of samples predicted positive that are actually positive; TN (True Negative) is the number predicted negative that are actually negative; FP (False Positive) is the number predicted positive that are actually negative; FN (False Negative) is the number predicted negative that are actually positive.
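A short sketch computing the four evaluation indexes from the confusion counts defined above (the sample counts are illustrative, not results from the patent):

```python
def prf(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion counts. In strict
    NER evaluation an entity counts as TP only when both its boundary and
    its type are exactly right."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

p, r, f1, acc = prf(tp=80, fp=20, fn=20, tn=80)
print(p, r, f1, acc)
```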

Claims (6)

1. A security event entity identification method based on a pre-training model, characterized in that: first, during task secondary pre-training optimization of the original RoBERTa model, a whole-word Mask mechanism of a public security domain dictionary is added, so that the PreTrain100K+RoBERTa+ model has better Chinese language modelling capability; then the generated pre-training language model and the extended domain entity dictionary are input into a BiLSTM-CRF model for entity recognition training; finally, the deep learning model is deployed as a server to test the entity recognition model, and a test data set is input into the model to output the entity types of the test data;
The method specifically comprises the following steps:
Step1, directly acquiring a CEC data set and an explanation file from github;
Step 2, the CEC data set contains 332 public safety event instances and uses the XML language as the annotation format, which includes the six most important data labels: Event, Denoter, Time, Location, Participant and Object; Event is used to describe events; Denoter, Time, Location, Participant and Object are used to describe the trigger words and elements of an event; the annotated entities are extracted with the Python language according to the different XML labels to construct a security event entity dictionary;
Step 3, named entity recognition research is carried out on the five labels Denoter, Time, Location, Participant and Object in the data text of step 2, abbreviated respectively as DEN, TIME, LOC, PAR and OBJ and representing behaviors, times, locations, participants and objects;
Step 4, for the originally unannotated CEC data set of step 1, the documents are divided by security event instance, and 30 instances are randomly selected from the 332 as a validation set; the 332 instances are then divided in a 7:3 ratio into a training set of 232 instances and a test set of 100 instances for the experiments;
Step 5, BIO labeling is performed on the training and validation sets divided in step 4 to construct a security event data set for the named entity recognition task, wherein the data file has two columns, with one character and its corresponding label per line;
Step 6, a domain pre-training data set is constructed: 100K unlabeled news-domain corpora are obtained from the Internet, the data are cleaned, redundant symbols and information in the corpora are deleted, and the format of the pre-training data is unified;
Step 7, a Chinese pre-training language model is constructed: the news-domain pre-training data set obtained in step 6 is input into the PreTrain100K+RoBERTa+ pre-training model to generate a Chinese news-domain pre-training language model;
Step 8, constructing an entity recognition model, and taking the pre-training language model and the dynamic word vector generated in the step 7 as the input of the entity recognition model;
Step 9, the trained entity recognition model is deployed as a server to test the model; the test data set is input into the model, entity class labels of the test data are output, and automatic recognition of named entities in public safety event text is finally realized.
2. The security event entity recognition method based on the pre-training model according to claim 1, wherein in the step 2, a security event entity dictionary is constructed, and the dictionary is fused into the pre-training model to improve the effect of the downstream named entity recognition model.
3. The security event entity identification method based on a pre-training model according to claim 1, wherein step 6 uses unlabeled news-domain data to perform secondary domain pre-training on RoBERTa: a language model is trained on the large-scale unlabeled corpus in a self-supervised manner, and the obtained language model is connected to the downstream task model for fine-tuning.
4. The security event entity identification method based on the pre-training model according to claim 1, wherein the specific process of step 7 is as follows:
Step 7.1, adopting a whole-word Mask mechanism: if some sub-words of a complete word are masked, the remaining sub-words of the same word are masked as well; this better matches Chinese grammatical habit, so the model learns Chinese expression patterns more effectively;
Step 7.2, introducing the CEC security event entity dictionary constructed in step 2 into the word segmentation function of the RoBERTa model, so that when the Mask mechanism predicts dictionary entries, the complete semantics of public security event text entities are preserved;
Step 7.3, inputting the 100K news-domain pre-training data and the security event entity dictionary into the model, setting the number of training iterations to 100,000, and obtaining the security-event-domain pre-training model PreTrain100K+RoBERTa+.
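Steps 7.1 and 7.2 can be sketched together as follows (a minimal pure-Python illustration with a tiny toy dictionary; the helper names are not from the patent): the text is first segmented by greedy longest match against the entity dictionary, and masking is then applied per unit, so an entity such as 中毒事件 is masked or kept as a whole.

```python
import random

def segment(text, entity_dict, max_len=8):
    """Greedy longest-match segmentation: spans found in the security-event
    entity dictionary stay as single units; other characters stand alone."""
    units, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + L] in entity_dict:
                units.append(text[i:i + L])
                i += L
                break
        else:
            units.append(text[i])
            i += 1
    return units

def whole_unit_mask(units, mask_prob=0.15, rng=None):
    """Whole-word masking: if a unit is selected, all of its characters
    are replaced by [MASK] together, never individually."""
    rng = rng or random.Random(0)
    tokens = []
    for u in units:
        if rng.random() < mask_prob:
            tokens.extend(["[MASK]"] * len(u))
        else:
            tokens.extend(list(u))
    return tokens
```

For example, with the dictionary {"工业区", "中毒事件"}, the sentence 工业区发生中毒事件 segments into ["工业区", "发", "生", "中毒事件"], and masking then operates on those whole spans.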
5. The security event entity identification method based on the pre-training model according to claim 1, wherein the specific process of step 8 is as follows:
Step 8.1, inputting the constructed CEC entity training set into the PreTrain100K+RoBERTa+ model that has undergone secondary domain training; the pre-training model reads the entity training set line by line and outputs the word vector of each single character;
Step 8.2, the PreTrain100K+RoBERTa+ model converts each word in the entity training set into a one-dimensional vector and obtains the segment vector and position vector of the sentence; these are taken as the input of the deep learning model, which finally outputs a text feature vector fused with the semantic information of the whole text;
Step 8.3, inputting the text feature vector into the BiLSTM-CRF model to generate the PreTrain100K+RoBERTa+-BiLSTM-CRF entity recognition model.
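The input representation in step 8.2 (word vector plus segment vector plus position vector) can be illustrated with toy two-dimensional embeddings — a sketch only; the real model uses 768-dimensional learned embeddings:

```python
def input_vectors(tokens, token_emb, seg_emb, pos_emb, segment_id=0):
    """BERT-style input representation: for each character, sum its token
    embedding, the sentence's segment embedding, and the position embedding."""
    out = []
    for pos, tok in enumerate(tokens):
        t, s, p = token_emb[tok], seg_emb[segment_id], pos_emb[pos]
        out.append([a + b + c for a, b, c in zip(t, s, p)])
    return out
```

The resulting per-character vectors are what the downstream BiLSTM-CRF layer consumes.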
6. The security event entity identification method based on the pre-training model according to claim 5, wherein in step 8.3 entity prediction labeling is performed on the rail transit specification corpus, with the following specific steps:
Step 8.3.1, taking an industrial area poisoning event as the security event example, the training set is vectorized using the pre-training model that carries news-domain semantic capability; each word in "industrial area poisoning event" is trained to obtain a 768-dimensional vector, giving the initialization vector of each word, and the result is used as the input of the deep learning model;
Step 8.3.2, using the BiLSTM-CRF algorithm in deep learning: the bidirectional LSTM considers both past and future features through a forward input sequence and a reverse input sequence, predicting the meaning of each character in context. In the forward pass, after the character 工 is input, BiLSTM predicts the probability that the next character is 业, and after 业 it predicts the probability of the next character 区 (together forming "industrial area"); when the sequence is input in reverse, the model predicts the probability of the characters that may precede 中毒事件 ("poisoning event"). The outputs of the two directions are then combined as the final result and input to the next layer;
Step 8.3.3, taking the feature matrix obtained in step 8.3.2 as the input of the CRF; the CRF performs sequence labeling by combining its feature functions with the feature matrix obtained from BiLSTM, generating the entity identification model, which can identify entities in the security event field.
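The CRF layer in step 8.3.3 selects the globally best tag sequence from the BiLSTM's per-character scores. A minimal Viterbi decoder sketches this idea (pure Python; the tag set and scores in the usage example are toy values, not from the patent):

```python
def viterbi_decode(emissions, transitions, tags):
    """CRF decoding: return the highest-scoring tag path given per-token
    emission scores (e.g. from a BiLSTM) and tag-to-tag transition scores."""
    score = {t: emissions[0][t] for t in tags}  # best score of a path ending in tag t
    back = []                                   # back-pointers per position
    for i in range(1, len(emissions)):
        new_score, ptr = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: score[p] + transitions[(p, cur)])
            new_score[cur] = score[prev] + transitions[(prev, cur)] + emissions[i][cur]
            ptr[cur] = prev
        score = new_score
        back.append(ptr)
    last = max(tags, key=lambda t: score[t])
    path = [last]
    for ptr in reversed(back):                  # walk back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]
```

The transition scores are what let the CRF rule out label sequences that are locally plausible but globally invalid, such as an I tag following an O tag.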
CN202110482621.8A 2021-04-30 2021-04-30 Security event entity identification method based on pre-training model Active CN113312914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110482621.8A CN113312914B (en) 2021-04-30 2021-04-30 Security event entity identification method based on pre-training model


Publications (2)

Publication Number Publication Date
CN113312914A CN113312914A (en) 2021-08-27
CN113312914B true CN113312914B (en) 2024-06-14

Family

ID=77371586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110482621.8A Active CN113312914B (en) 2021-04-30 2021-04-30 Security event entity identification method based on pre-training model

Country Status (1)

Country Link
CN (1) CN113312914B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961705A (en) * 2021-10-29 2022-01-21 聚好看科技股份有限公司 Text classification method and server
CN114065760B (en) * 2022-01-14 2022-06-10 中南大学 Legal text class case retrieval method and system based on pre-training language model
CN115482665B (en) * 2022-09-13 2023-09-15 重庆邮电大学 Knowledge and data collaborative driving multi-granularity traffic accident prediction method and device
CN116756328B (en) * 2023-08-23 2023-11-07 北京宝隆泓瑞科技有限公司 Gas pipeline accident text recognition method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626056A (en) * 2020-04-11 2020-09-04 中国人民解放军战略支援部队信息工程大学 Chinese named entity identification method and device based on RoBERTA-BiGRU-LAN model
CN111967266A (en) * 2020-09-09 2020-11-20 中国人民解放军国防科技大学 Chinese named entity recognition model and construction method and application thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11218500B2 (en) * 2019-07-31 2022-01-04 Secureworks Corp. Methods and systems for automated parsing and identification of textual data
CN111460820B (en) * 2020-03-06 2022-06-17 中国科学院信息工程研究所 Network space security domain named entity recognition method and device based on pre-training model BERT



Similar Documents

Publication Publication Date Title
CN113312914B (en) Security event entity identification method based on pre-training model
CN107992597B (en) Text structuring method for power grid fault case
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN112115721B (en) Named entity recognition method and device
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
Carbonell et al. Joint recognition of handwritten text and named entities with a neural end-to-end model
CN113779358B (en) Event detection method and system
CN116151256A (en) Small sample named entity recognition method based on multitasking and prompt learning
CN113204967B (en) Resume named entity identification method and system
CN114417839A (en) Entity relation joint extraction method based on global pointer network
CN114298035A (en) Text recognition desensitization method and system thereof
Li et al. A method for resume information extraction using bert-bilstm-crf
CN115080750B (en) Weak supervision text classification method, system and device based on fusion prompt sequence
CN113919366A (en) Semantic matching method and device for power transformer knowledge question answering
CN114492460B (en) Event causal relationship extraction method based on derivative prompt learning
CN115203406A (en) RoBERTA model-based long text information ground detection method
Wosiak Automated extraction of information from Polish resume documents in the IT recruitment process
Jin et al. Fintech key-phrase: a new Chinese financial high-tech dataset accelerating expression-level information retrieval
CN111178080A (en) Named entity identification method and system based on structured information
CN114564950A (en) Electric Chinese named entity recognition method combining word sequence
Wu et al. One improved model of named entity recognition by combining BERT and BiLSTM-CNN for domain of Chinese railway construction
CN115757775B (en) Text inclusion-based trigger word-free text event detection method and system
CN116187304A (en) Automatic text error correction algorithm and system based on improved BERT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant