CN117332785B - Method for jointly extracting entities and relations from network security threat intelligence - Google Patents


Info

Publication number
CN117332785B
CN117332785B (application CN202311302393.7A)
Authority
CN
China
Prior art keywords
entity
vector
tag
label
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311302393.7A
Other languages
Chinese (zh)
Other versions
CN117332785A (en)
Inventor
Han Xiaohui (韩晓晖)
Lyu Haiqing (吕海青)
Zuo Wenbo (左文波)
Cui Hui (崔慧)
Liu Guangqi (刘广起)
Liu Yang (刘洋)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology, Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Qilu University of Technology
Priority to CN202311302393.7A priority Critical patent/CN117332785B/en
Publication of CN117332785A publication Critical patent/CN117332785A/en
Application granted granted Critical
Publication of CN117332785B publication Critical patent/CN117332785B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/216 Parsing using statistical methods
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/35 Clustering; Classification
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A method for jointly extracting entities and relations from network security threat intelligence, in the technical field of network security. The method performs joint entity-relation extraction with a multi-task joint-learning model architecture, which effectively reduces the error-propagation problem of non-joint extraction; task-specific features for the different tasks are extracted from a unified vector representation produced by a shared vector encoder, which reduces the noise introduced by useless features and improves the speed of entity-relation decoding.

Description

Method for jointly extracting entities and relations from network security threat intelligence
Technical Field
The invention relates to the technical field of network security, and in particular to a method for jointly extracting entities and relations from network security threat intelligence.
Background
With the rapid development of the internet, network security problems have become increasingly prominent and new security threats continuously emerge. Cyber threat intelligence has become an important basis for the research and analysis work of network security specialists. However, because intelligence comes from numerous sources, takes many forms, and is generally unstructured, traditional manual analysis is inefficient and error-prone. The invention therefore aims to provide an automatic and efficient method and system for analyzing network security threat intelligence.
With the wide adoption of the internet and the rapid development of information technology, network security problems are becoming more serious. Network security threats, including cyber attacks, malware, and data leakage, continue to emerge, bringing significant losses and risks to individuals, businesses, and government agencies. Network security professionals and security teams must collect, analyze, and understand threat intelligence promptly and effectively in order to protect the security of network systems and users.
At present, two families of methods are mainly used to extract threat-intelligence entities and relations: pipeline-based methods and joint-extraction methods. Pipeline-based methods treat entity recognition and relation extraction as two independent subtasks, completed sequentially by two independent models. They are simple to implement, but because the two steps run independently and share little information, an entity-recognition error propagates into the relation-extraction result; this is the error-propagation problem. Joint-extraction methods instead treat entity recognition and relation extraction as one joint task and model the association between entities and relations, optimizing both simultaneously to obtain more accurate results; by considering the interaction between entities and relations, they better capture the contextual information in the text and improve extraction accuracy. However, most joint models first identify the entities and then classify the relation of every extracted entity pair, so their time cost grows quadratically with the number of entities. Because of model complexity and training difficulty, joint-extraction methods also typically require more computational resources and data.
To overcome these problems, artificial-intelligence techniques such as natural language processing (NLP) and machine learning are widely used in the analysis of cyber security threat intelligence. Automated entity recognition and relation extraction make the analysis process more efficient, so that potential threats are discovered in time. Deep learning models enable systems to better understand and analyze unstructured text data, improving the precision of threat-intelligence analysis. Moreover, visualization techniques can intuitively display the associations between entities, helping security specialists better understand the nature and trends of network security threats.
However, an efficient, fast, comprehensive, and intelligent method and system for jointly extracting entities and relations from network security threat intelligence is still lacking.
Disclosure of Invention
To overcome these shortcomings of the prior art, the invention combines several artificial-intelligence techniques to provide an efficient and accurate method for extracting entities and relations from network security threat intelligence.
The technical scheme adopted for overcoming the technical problems is as follows:
a method for jointly extracting entities and relationships from cyber security threat intelligence, comprising the steps of:
(a) Collect n network security threat intelligence articles to obtain the intelligence set D = {d_1, d_2, ..., d_i, ..., d_n}, where d_i, i ∈ {1, ..., n}, is the i-th article.
(b) Apply data preprocessing to the i-th article d_i to obtain the sentence set S_i = {s_1, s_2, ..., s_j, ..., s_l}, where s_j, j ∈ {1, ..., l}, is the j-th sentence and l is the number of sentences in S_i.
(c) Tokenize the j-th sentence s_j into the character sequence X_j = {w_1, w_2, ..., w_t, ..., w_N}, where w_t, t ∈ {1, ..., N}, is the t-th character and N is the number of characters in s_j.
(d) Obtain the embedded representation e_t of the t-th character; the vector embedding sequence of the j-th sentence s_j is e = {e_1, e_2, ..., e_t, ..., e_N}.
(e) Input the vector embedding sequence e into an attention module to obtain the entity-recognition task vector matrix e^ner, the relation-classification task vector matrix e^re, and the shared vector matrix e^share of the entity-recognition and relation-classification tasks, where e_t^ner, e_t^re, and e_t^share are respectively the entity-recognition, relation-classification, and shared vector embeddings corresponding to the t-th character.
(f) From the entity-recognition task vector matrix e^ner and the shared vector matrix e^share, obtain the candidate head-entity tag sequence E = {entity_1, entity_2, ..., entity_i, ..., entity_m}, where entity_i, i ∈ {1, ..., m}, is the tag of the i-th head entity and m is the number of head entities.
(g) From the relation-classification task vector matrix e^re and the shared vector matrix e^share, obtain the set entity^CHE of all head-entity vectors, where entity_i^CHE, i ∈ {1, ..., m}, is the pooled vector embedding of the i-th head entity.
(h) From entity^CHE, compute the sequence O = {o_1, o_2, ..., o_j, ..., o_m} of head-entity/relation correlation representations, where o_j, j ∈ {1, ..., m}, is the correlation representation vector of the j-th head entity.
(i) From the sequence O, obtain the triple sequence of each head entity entity_j.
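Steps (a) through (i) can be read as a pipeline. The following minimal Python sketch shows only the data flow; every function body is an illustrative stand-in (toy sentence splitting, character tokenization, fake one-dimensional embeddings), and the model stages of steps (e)-(i) are marked by a comment rather than implemented, since they require the trained models described below:

```python
# Skeletal sketch of the claimed pipeline, steps (a)-(i).
# All bodies are toy stand-ins, not the patent's actual models.
def preprocess(article):                 # step (b): split article into sentences
    return [s.strip() for s in article.split(".") if s.strip()]

def tokenize(sentence):                  # step (c): character-level segmentation
    return list(sentence)

def embed(chars):                        # step (d): toy 1-dim embeddings
    return [[float(len(c))] for c in chars]

def joint_extract(article):
    results = []
    for sentence in preprocess(article):
        chars = tokenize(sentence)       # step (c)
        e = embed(chars)                 # step (d)
        # steps (e)-(i): attention encoding, head-entity tagging, pooling,
        # head-entity/relation correlation, tail-entity and relation decoding
        results.append((sentence, len(e)))
    return results

parsed = joint_extract("APT41 used ShadowPad. It targets banks.")
```

The entity names in the sample input are illustrative only; the real pipeline would emit `<head entity, relation, tail entity>` triples instead of the placeholder tuples returned here.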
Further, in step (a) the n cyber security threat intelligence articles are collected from public vulnerability reports and/or social media and/or security news.
Further, the data preprocessing operation of step (b) is as follows:
(b-1) Remove punctuation marks, mathematical operators, brackets, quotation marks, and special symbols such as @, #, $, %, ^, and & from the i-th cyber security threat intelligence article d_i to complete the noise-removal operation;
(b-2) convert the noise-free article d_i into the sentence set S_i using Python's split() function.
Further, in step (c) the j-th sentence s_j is tokenized using the tokenizer of the Transformers toolkit.
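Steps (b-1) and (b-2) can be sketched as follows. This is a minimal, hedged illustration: the regex symbol set and period-based splitting are assumptions standing in for the patent's exact noise list and split() usage:

```python
import re

# Hypothetical sketch of steps (b-1)/(b-2): strip noise symbols from an
# article, then split the cleaned text into a sentence list.
NOISE = re.compile(r"[()\[\]{}\"'@#$%^&*]")

def preprocess(article: str) -> list[str]:
    cleaned = NOISE.sub("", article)                 # (b-1) noise removal
    # (b-2) sentence conversion via Python's split()
    return [s.strip() for s in cleaned.split(".") if s.strip()]

sentences = preprocess('The "Lazarus" group (APT) used malware. It targets banks.')
```

A real implementation would then hand each sentence to the Transformers tokenizer for step (c).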
Further, step (d) comprises the steps of:
(d-1) Input the t-th character w_t into a BERT pretrained model and output the feature vector b_t ∈ R^{1×d_w}, where R is the real space and d_w is the dimension;
(d-2) input the t-th character w_t into a GloVe pretrained model and output the feature vector g_t ∈ R^{1×d_g}, where d_g is the dimension;
(d-3) embed the t-th character w_t with the spaCy part-of-speech tagging tool to obtain the part-of-speech vector p_t ∈ R^{1×d_p}, where d_p is the dimension;
(d-4) splice the feature vectors b_t and g_t and the part-of-speech vector p_t to obtain the embedded representation e_t ∈ R^{1×(d_w+d_g+d_p)} of the t-th character; the vector embedding sequence of the j-th sentence s_j is e = {e_1, e_2, ..., e_t, ..., e_N}.
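The splicing of step (d-4) is plain vector concatenation. A minimal sketch with toy dimensions (d_w = 4, d_g = 3, d_p = 2; real vectors would come from the BERT, GloVe, and spaCy models named above):

```python
# Sketch of step (d-4): concatenate the BERT, GloVe, and part-of-speech
# feature vectors of one character into its embedded representation e_t.
def concat_embedding(bert_vec, glove_vec, pos_vec):
    # list concatenation plays the role of the splicing operation
    return bert_vec + glove_vec + pos_vec

e_t = concat_embedding([0.1] * 4, [0.2] * 3, [0.3] * 2)  # d_w=4, d_g=3, d_p=2
```

The resulting vector has dimension d_w + d_g + d_p, matching the shape stated in (d-4).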
Further, step (e) comprises the steps of:
(e-1) The attention module consists of an entity-recognition attention module, a shared-task attention module, and a relation-classification attention module. Using the torch.randn() function of the torch toolkit, initialize the parameter matrices W_Q^ner, W_K^ner, W_V^ner ∈ R^{(d_w+d_g+d_p)×d_att} for the entity-recognition attention module, W_Q^share, W_K^share, W_V^share for the shared-task attention module, and W_Q^re, W_K^re, W_V^re for the relation-classification attention module, where d_att is the attention dimension.
(e-2) For the t-th character, compute the vectors of the entity-recognition attention module as Q_t^ner = e_t W_Q^ner, K_t^ner = e_t W_K^ner, and V_t^ner = e_t W_V^ner; stacking them over all characters gives the matrices Q^ner, K^ner, and V^ner. Compute the attention score of the t-th Q vector and the j-th K vector as α_tj^ner = softmax(Q_t^ner (K_j^ner)^T / sqrt(d_att)), where T denotes the transpose; all scores form the matrix α^ner. The entity-recognition vector embedding of the t-th character is then e_t^ner = Σ_j α_tj^ner V_j^ner, and the entity-recognition task vector matrix is e^ner.
(e-3) In the same way, compute Q_t^share = e_t W_Q^share, K_t^share = e_t W_K^share, and V_t^share = e_t W_V^share for the shared-task attention module, the attention scores α_tj^share = softmax(Q_t^share (K_j^share)^T / sqrt(d_att)) forming the matrix α^share, and the shared vector embedding e_t^share = Σ_j α_tj^share V_j^share of the entity-recognition and relation-classification tasks for the t-th character; the shared vector matrix is e^share.
(e-4) Likewise, compute Q_t^re = e_t W_Q^re, K_t^re = e_t W_K^re, and V_t^re = e_t W_V^re for the relation-classification attention module, the attention scores α_tj^re = softmax(Q_t^re (K_j^re)^T / sqrt(d_att)) forming the matrix α^re, and the relation-classification vector embedding e_t^re = Σ_j α_tj^re V_j^re of the t-th character; the relation-classification task vector matrix is e^re.
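All three attention modules of step (e) follow the same scaled dot-product pattern. A pure-Python sketch with fixed toy weights (the patent initializes its parameter matrices with torch.randn(); the values and dimensions here are illustrative only):

```python
import math

# One attention pass, as used by each module in step (e): Q, K, V are linear
# maps of the character embeddings, scores are scaled dot products, and each
# output row is an attention-weighted sum of the V rows.
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [v / s for v in exps]

def self_attention(e, w_q, w_k, w_v):
    q, k, v = matmul(e, w_q), matmul(e, w_k), matmul(e, w_v)
    d_att = len(w_q[0])
    scores = [[sum(qi * ki for qi, ki in zip(qr, kr)) / math.sqrt(d_att) for kr in k]
              for qr in q]
    alpha = [softmax(row) for row in scores]          # attention score matrix
    return [[sum(a * vr[j] for a, vr in zip(arow, v)) for j in range(len(v[0]))]
            for arow in alpha]

e = [[1.0, 0.0], [0.0, 1.0]]   # two characters, embedding dim 2 (toy values)
w = [[0.5, 0.0], [0.0, 0.5]]   # one toy weight matrix reused for Q, K, V
out = self_attention(e, w, w, w)
```

Running the same input through three differently parameterized copies of `self_attention` would yield the e^ner, e^share, and e^re matrices of steps (e-2) to (e-4).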
Further, step (f) includes the steps of:
(f-1) Splice the entity-recognition vector embedding e_t^ner of the t-th character with the shared embedding e_t^share of the entity-recognition and relation-classification tasks to obtain the vector c_t; the resulting vector matrix is e^CHE.
(f-2) Input the vector matrix e^CHE into a bidirectional long short-term memory network (BiLSTM) to obtain the vector matrix O^CHE, where o_t^CHE ∈ R^{1×d_h} is the BiLSTM output for the t-th character w_t and d_h is the dimension.
(f-3) Input O^CHE into a single-layer linear network and compute the vector matrix P^CHE = O^CHE W^CHE + b^CHE, where W^CHE ∈ R^{d_h×L_ner} is the parameter matrix of the linear network, b^CHE is its bias term, and L_ner is the number of entity tags. The entity tags are B-campaign, I-campaign, B-identity, I-identity, B-tool, I-tool, B-malware, I-malware, B-actor, I-actor, B-vulnerability, I-vulnerability, and O, following the BIO scheme: B-campaign marks the first character of a campaign (cyber-attack activity) entity in the text and I-campaign marks its other characters; B-identity and I-identity likewise mark organization-or-person entities; B-tool and I-tool mark tool entities; B-malware and I-malware mark malware entities; B-actor and I-actor mark threat-actor entities; B-vulnerability and I-vulnerability mark vulnerability entities; and O marks non-entity characters. B-campaign, B-identity, B-tool, B-malware, B-actor, and B-vulnerability constitute the B tag set; I-campaign, I-identity, I-tool, I-malware, I-actor, and I-vulnerability constitute the I tag set.
(f-4) Input the vector matrix P^CHE into a conditional random field (CRF) to obtain, for each character, the probabilities of the different entity tags, where p_t is the probability sequence over the entity-type tags for the t-th character w_t.
(f-5) Use the torch.argmax() function of the pytorch toolkit to select the entity tag l_t with the highest probability in p_t for each character, yielding the tag sequence of the sentence. Traverse this tag sequence and mark every span that begins with a tag from the B tag set and continues with tags from the I tag set as a head entity, obtaining the head-entity tag sequence E = {entity_1, entity_2, ..., entity_i, ..., entity_m}, where entity_i, i ∈ {1, ..., m}, is the tag of the i-th head entity and m is the number of head entities.
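The span-marking rule of step (f-5) can be sketched as a small BIO decoder: a B-* tag opens a span, matching I-* tags extend it, and anything else closes it. The tag names follow the patent's scheme; the decoder itself is an illustrative implementation, not the patent's code:

```python
# Sketch of step (f-5): collect head-entity spans from a predicted BIO
# tag sequence. Returns (start index, end index, entity type) tuples.
def decode_entities(tags):
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes last span
        if tag.startswith("B-") or tag == "O" or (etype and tag != "I-" + etype):
            if start is not None:
                entities.append((start, i - 1, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

spans = decode_entities(["B-actor", "I-actor", "O",
                         "B-malware", "I-malware", "I-malware"])
```

Stray I-* tags without a preceding B-* are silently ignored here; the patent does not specify how such CRF outputs are handled, so that choice is an assumption.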
Further, step (g) includes the steps of:
(g-1) Splice the relation-classification vector embedding e_t^re of the t-th character with the shared embedding e_t^share of the entity-recognition and relation-classification tasks to obtain a vector v_t^TRE; the resulting vector matrix is e^TRE.
(g-2) For the tag entity_i of the i-th head entity, fill the entity positions with 1 and the non-entity positions with 0 to obtain its mask vector L_i ∈ R^{1×N}; the set of mask vectors of the entities in the j-th sentence s_j is L = {L_1, L_2, ..., L_i, ..., L_m}, L ∈ R^{m×N}.
(g-3) Use the mask L_i to select the character embeddings of the i-th head entity and obtain its vector embedding entity_i^CHE; input entity_i^CHE into a max-pooling layer and output the pooled i-th head-entity vector embedding entity_i^CHE′.
Further, step (h) comprises the steps of:
(h-1) Use the torch.randn() function of the torch toolkit to initialize the parameter matrices W^K and W^Q for the correlation computation.
(h-2) Multiply the vector v_t^TRE (the t-th row of e^TRE) by the parameter matrix W^K to obtain the key vector K_t^TRE; the key vectors form the matrix K^TRE. Multiply the pooled j-th head-entity embedding entity_j^CHE′ by the parameter matrix W^Q to obtain the query vector Q_j^CHE; the query vectors form the matrix Q^CHE.
(h-3) Compute the correlation score S_jt between the query vector of each head entity and the key vector of the t-th character of each sentence as S_jt = V^T tanh(Q_j^CHE + K_t^TRE), where V is a parameter matrix.
(h-4) Normalize the correlation score S_jt with a softmax function to obtain the normalized correlation score α_jt, whose value range is [0, 1].
(h-5) Compute the context representation h_jt = Σ_t α_jt v_t^TRE of the query vector of the j-th head entity with respect to the vectors of the characters of the sentence; the set of context representations of all head entities is h = {h_1t, h_2t, ..., h_jt, ..., h_mt}.
(h-6) Compute the gate g_j ∈ [0, 1] of the context representation h_jt as g_j = σ(W_2 (W_1 [h_jt ; entity_j^CHE′] + b_1) + b_2), where σ(·) is the sigmoid function, W_1 and W_2 are parameter matrices, [;] denotes the splicing operation, and b_1 and b_2 are bias terms.
(h-7) Compute the filtered vector u_j = g_j · tanh(W_3 h_jt + b_3), where W_3 is a parameter matrix and b_3 is a bias term.
(h-8) Splice the pooled head-entity embedding entity_j^CHE′ with the filtered vector u_j to obtain the j-th head-entity/relation correlation representation vector o_j; all head-entity/relation correlation representations form the sequence O.
Further, step (i) comprises the steps of:
(i-1) Input the sequence O of all head-entity/relation correlation representations into a bidirectional long short-term memory network (BiLSTM) to obtain the vector sequence O^re, where o_j^re is the BiLSTM output for the j-th correlation representation vector o_j.
(i-2) Input o_j^re into a single-layer linear network and compute the vector matrix P_j^re = o_j^re W^re + b^re, where W^re is the parameter matrix of the linear network and b^re is its bias term.
(i-3) Input the vector matrix P^re into a conditional random field (CRF) to obtain, for the j-th head entity entity_j, the probabilities of the different entity tags of each character of the corresponding sentence, where p_jt is the probability sequence over the entity-type tags for the t-th character w_t; the set of probability values corresponding to all head entities is P.
(i-4) Use the torch.argmax() function of the pytorch toolkit to select, for the j-th head entity, the entity tag l_j ∈ R^N with the highest probability for each character; the candidate tags are again B-campaign, I-campaign, B-identity, I-identity, B-tool, I-tool, B-malware, I-malware, B-actor, I-actor, B-vulnerability, I-vulnerability, and O, and all tail-entity tag sequences form l.
(i-5) Treat the highest-probability tag sequence l_j as a tail-entity tag sequence: mark every span that begins with a tag from the B tag set and continues with tags from the I tag set as a tail entity, obtaining the tail-entity tag sequence E′ = {entity_1′, entity_2′, ..., entity_i′, ..., entity_n′}, where entity_i′, i ∈ {1, ..., n}, is the tag of the i-th tail entity and n is the number of tail entities.
(i-6) For the tag entity_i′ of the i-th tail entity, fill the entity positions with 1 and the non-entity positions with 0 to obtain its mask vector L_i′ ∈ R^{1×N}; the set of mask vectors of the tail entities of the j-th head entity entity_j is L_j = {L_1′, L_2′, ..., L_i′, ..., L_n′}, L_j ∈ R^{n×N}.
(i-7) Use the mask L_i′ to obtain the i-th tail-entity vector embedding entity_i^tail; all tail-entity vector embeddings form the sequence entity^tail. Input entity_i^tail into a max-pooling layer and output the pooled i-th tail-entity vector embedding entity_i^tail′; all pooled vectors form entity^tail′.
(i-8) Splice the pooled i-th tail-entity embedding entity_i^tail′ with the pooled head-entity embedding entity_j^CHE′ to obtain a vector, input it into a single-layer linear network, and compute the vector matrix P_ij = [entity_j^CHE′ ; entity_i^tail′] W^re′ + b^re′, where W^re′ is the parameter matrix of the linear network, b^re′ is its bias term, and L_re is the number of relation tags. The relation tags are use, targets, and other, assigned to tag pairs as follows (each rule applies to both the B- and the I- versions of the tags):
- use: (malware, campaign), (actor, campaign), (actor, tool), (actor, malware);
- targets: (actor, vulnerability), (malware, vulnerability), (actor, identity), (campaign, identity), (malware, identity);
- other: (campaign, vulnerability), (tool, vulnerability), (tool, campaign), (identity, campaign), (tool, identity), (identity, vulnerability).
(i-9) Normalize the vector matrix P_ij with a softmax function, then use the torch.argmax() function of the pytorch toolkit to obtain the relation tag rel_ij corresponding to the most probable entity tag l_j, giving the triple <entity_j, rel_ij, entity_i′>. Extraction thus yields, for the j-th head entity entity_j, the full triple sequence {<entity_j, rel_1j, entity_1′>, <entity_j, rel_2j, entity_2′>, ..., <entity_j, rel_ij, entity_i′>, ..., <entity_j, rel_nj, entity_n′>}.
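The final assembly of step (i-9) amounts to pairing each head entity with its decoded tail entities and looking up the relation for the entity-type pair. A hedged sketch follows; the lookup table covers only a few of the label-pair rules listed in (i-8), the default-to-other fallback is an assumption, and the entity names in the example are invented illustrations:

```python
# Sketch of step (i-9): build <head, relation, tail> triples from a head
# entity and its decoded tail entities via an entity-type-pair lookup.
RELATION_TABLE = {
    ("malware", "campaign"): "use",
    ("actor", "tool"): "use",
    ("actor", "vulnerability"): "targets",
    ("tool", "campaign"): "other",
}

def build_triples(head, head_type, tails):
    # tails: list of (tail_text, tail_type); unlisted pairs default to "other"
    triples = []
    for tail, tail_type in tails:
        rel = RELATION_TABLE.get((tail_type, head_type),
                                 RELATION_TABLE.get((head_type, tail_type), "other"))
        triples.append((head, rel, tail))
    return triples

triples = build_triples("ExampleCampaign", "campaign",
                        [("ExampleMalware", "malware"), ("ExampleTool", "tool")])
```

A full implementation would instead take the relation label from the softmax over P_ij; the table here only mirrors the tag-pair definitions of (i-8).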
The beneficial effects of the invention are as follows: the multi-task joint-learning model architecture performs joint extraction of entity relations, which effectively reduces the error-propagation problem of non-joint extraction; task-specific features for the different tasks are extracted from a unified vector representation encoded by the same vector encoder, which reduces the noise influence of useless features and improves the speed of entity-relation decoding.
Drawings
FIG. 1 is a diagram of a model structure of a network security threat intelligence joint extraction entity and relationship according to the present invention.
Detailed Description
The invention is further described with reference to fig. 1.
A method for jointly extracting entities and relationships from cyber security threat intelligence, comprising the steps of:
(a) Collecting n cyber security threat intelligence articles to obtain a cyber security threat intelligence set D, D = {d_1, d_2, ..., d_i, ..., d_n}, where d_i is the i-th cyber security threat intelligence article, i ∈ {1, ..., n}.
(b) Performing a data preprocessing operation on the i-th article d_i to obtain a sentence set S_i, S_i = {s_1, s_2, ..., s_j, ..., s_l}, where s_j is the j-th sentence in S_i, j ∈ {1, ..., l}, and l is the number of sentences in S_i.
(c) Segmenting the j-th sentence s_j to obtain a character sequence X_j, X_j = {w_1, w_2, ..., w_t, ..., w_N}, where w_t is the t-th character, t ∈ {1, ..., N}, and N is the number of characters in s_j.
(d) Obtaining the embedded representation e_t of the t-th character; the vector embedding sequence of s_j is e, e = {e_1, e_2, ..., e_t, ..., e_N}.
(e) Inputting the vector embedding sequence e into an attention module to extract entity-recognition task vectors, relationship-classification task vectors and vectors shared by the two tasks, obtaining the entity-recognition task vector matrix e_ner, the relationship-classification task vector matrix e_re and the shared vector matrix e_share of the entity-recognition and relationship-classification tasks, where e_t^ner, e_t^re and e_t^share are, respectively, the entity-recognition, relationship-classification and shared vector embeddings corresponding to the t-th character.
(f) Obtaining a candidate head-entity tag sequence E from e_ner and e_share, E = {entity_1, entity_2, ..., entity_i, ..., entity_m}, where entity_i is the label of the i-th entity, i ∈ {1, ..., m}, and m is the number of entities.
(g) Obtaining the set of all head entity vectors entity_CHE from e_re and e_share, where entity_i^CHE is the pooled i-th head entity vector embedding, i ∈ {1, ..., m}.
(h) Computing from entity_CHE the sequence O of all head-entity-and-relationship correlation representations, O = {o_1, o_2, ..., o_j, ..., o_m}, where o_j is the correlation representation vector of the j-th head entity and the relationships, j ∈ {1, ..., m}.
(i) Obtaining the triple sequence of the j-th head entity entity_j from the correlation representation sequence O.
Entity recognition and relation extraction are performed in parallel using an end-to-end neural network model architecture based on multi-task learning. First, the input text is embedded into a unified vector representation with an NLP pre-training model, and an attention module extracts the feature vectors specific to entity recognition and to relation extraction. The feature vector of the entity recognition task is input into a head entity module to identify head entities; the feature vector of the relation extraction task, together with the representation vectors of the identified head entities, is input into a tail-entity-and-relation decoding module to decode tail entities and relations. The method makes full use of a single vector encoder to extract task-specific features, which reduces the noise of useless features and improves feature utilization; the end-to-end multi-task architecture also speeds up entity-relation decoding, so that entity information can be extracted efficiently and accurately from diverse cyber security threat intelligence data and the association relationships between entities can be established.
Table 1: Entity recognition results of different models on the collected cyber security threat intelligence dataset D

Model          Precision  Accuracy  Recall  F1-score
SpERT          79.2%      70.3%     74.6%   76.8%
Multi-turn QA  85.0%      81.3%     82.6%   83.7%
MTL            89.3%      86.5%     89.6%   89.4%
DYGIE          78.2%      76.5%     79.6%   78.9%
PERA           88.1%      86.1%     89.2%   86.6%
ours           90.5%      91.2%     89.8%   90.1%
According to the experimental results in Table 1, the method for jointly extracting entities and relationships from cyber security threat intelligence provided by the invention reaches an entity recognition accuracy of 91.2%, a precision of 90.5%, an F1-score of 90.1% and a recall of 89.8%. Compared with the other baseline methods, the accuracy is greatly improved and the entity recognition effect is good.
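The precision, recall and F1-score reported in the tables are computed in the standard way from true-positive, false-positive and false-negative counts; a minimal sketch (the function name prf1 is illustrative):

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, recall and F1-score from prediction counts;
    F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```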
Table 2: Relationship classification results of different models on the collected cyber security threat intelligence dataset D

Model          Precision  Accuracy  Recall  F1-score
SpERT          74.7%      73.6%     71.5%   72.8%
Multi-turn QA  69.2%      67.4%     68.2%   68.9%
MTL            77.73%     67.1%     68.4%   72.63%
DYGIE          72.2%      69.5%     71.6%   71.2%
PERA           76.1%      74.1%     75.0%   75.5%
ours           77.6%      75.4%     78.3%   77.9%
According to the experimental results in Table 2, the method for jointly extracting entities and relationships from cyber security threat intelligence reaches a relationship classification accuracy of 75.4%, a precision of 77.6%, an F1-score of 77.9% and a recall of 78.3%. Compared with the other baseline methods, the accuracy is greatly improved and the relationship classification effect is good.
In one embodiment of the invention, n cyber security threat intelligence articles are collected from public vulnerability reports and/or social media and/or security news in step (a).
In one embodiment of the present invention, the method of the data preprocessing operation in step (b) is:
(b-1) removing punctuation marks, mathematical operation symbols, brackets, quotation marks and the symbols @, #, $, %, ^ and & from the i-th cyber security threat intelligence article d_i to complete the noise-removal operation;
(b-2) using the split() function of the python program with the period as delimiter, converting the noise-removed i-th cyber security threat intelligence article d_i into the sentence set S_i.
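Steps (b-1)/(b-2) can be sketched as below. The exact noise character set and the choice of replacing symbols with spaces (rather than deleting them outright) are assumptions; the function name preprocess is illustrative.

```python
import re

def preprocess(article: str) -> list[str]:
    """(b-1) strip noise symbols, then (b-2) split on periods into sentences."""
    cleaned = re.sub(r"[@#$%^&*+=<>()\[\]{}'\"]", " ", article)  # noise removal
    return [s.strip() for s in cleaned.split(".") if s.strip()]   # sentence split
```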
In one embodiment of the present invention, in step (c) the j-th sentence s_j is segmented using the tokenizer function in the Transformers toolkit.
In one embodiment of the invention, step (d) comprises the steps of:
(d-1) Inputting the t-th character w_t into a Bert pre-training model and outputting the feature vector x_t^w ∈ R^{1×d_w}, where R is the real space and d_w is a dimension.
(d-2) Inputting the t-th character w_t into a Glove pre-training model and outputting the feature vector x_t^g ∈ R^{1×d_g}, where d_g is a dimension.
(d-3) Embedding the t-th character w_t with the spaCy part-of-speech tagging tool to obtain the part-of-speech vector x_t^p ∈ R^{1×d_p}, where d_p is a dimension.
(d-4) Splicing the feature vector x_t^w, the feature vector x_t^g and the part-of-speech vector x_t^p to obtain the embedded representation e_t of the t-th character, e_t = x_t^w ⊕ x_t^g ⊕ x_t^p; the vector embedding sequence of the j-th sentence s_j is e.
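Step (d) reduces to a simple concatenation. In the sketch below, the dimensions d_w = 768, d_g = 300 and d_p = 50 are typical values for BERT, GloVe and a POS embedding, not values specified by the patent, and placeholder vectors stand in for the pre-trained model outputs.

```python
import numpy as np

def embed_character(bert_vec, glove_vec, pos_vec):
    """(d-4): splice the BERT, GloVe and part-of-speech vectors into the
    embedded representation e_t, with dimension d_w + d_g + d_p."""
    return np.concatenate([bert_vec, glove_vec, pos_vec])

# Placeholder vectors standing in for the Bert / Glove / spaCy outputs.
e_t = embed_character(np.zeros(768), np.zeros(300), np.zeros(50))
```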
In one embodiment of the invention, step (e) comprises the steps of:
(e-1) The attention module consists of an entity-recognition attention module, a shared-task attention module and a relationship-classification attention module. The torch.randn() function of the torch toolkit is used to initialize the parameter matrices W_q^ner, W_k^ner and W_v^ner of the entity-recognition attention module, the parameter matrices W_q^share, W_k^share and W_v^share of the shared-task attention module, and the parameter matrices W_q^re, W_k^re and W_v^re of the relationship-classification attention module, where d_att is a dimension.
(e-2) The Q vector of the t-th position of the entity-recognition attention module is computed as Q_t^ner = e_t (W_q^ner)^T, where T denotes the transpose; the K vector as K_t^ner = e_t (W_k^ner)^T; and the V vector as V_t^ner = e_t (W_v^ner)^T. The Q, K and V matrices of the entity-recognition attention module are Q^ner, K^ner and V^ner. The attention score of the t-th Q vector and the j-th K vector is α_tj^ner = softmax(Q_t^ner (K_j^ner)^T / √d_att); all attention scores form the matrix α^ner ∈ R^{N×N}. The entity-recognition vector embedding corresponding to the t-th character is e_t^ner = Σ_j α_tj^ner V_j^ner, and the entity-recognition task vector matrix is e_ner.
(e-3) In the same way, Q_t^share = e_t (W_q^share)^T, K_t^share = e_t (W_k^share)^T and V_t^share = e_t (W_v^share)^T are computed for the shared-task attention module; the attention score of the t-th Q vector and the j-th K vector is α_tj^share = softmax(Q_t^share (K_j^share)^T / √d_att), with α^share ∈ R^{N×N}; the shared vector embedding of the entity-recognition and relationship-classification tasks corresponding to the t-th character is e_t^share = Σ_j α_tj^share V_j^share, and the shared vector matrix is e_share.
(e-4) Likewise, Q_t^re = e_t (W_q^re)^T, K_t^re = e_t (W_k^re)^T and V_t^re = e_t (W_v^re)^T are computed for the relationship-classification attention module; the attention score of the t-th Q vector and the j-th K vector is α_tj^re = softmax(Q_t^re (K_j^re)^T / √d_att), with α^re ∈ R^{N×N}; the relationship-classification vector embedding corresponding to the t-th character is e_t^re = Σ_j α_tj^re V_j^re, and the relationship-classification task vector matrix is e_re.
Through the above calculation, the entity-recognition task vector matrix e_ner, the relationship-classification task vector matrix e_re and the shared vector matrix e_share of the entity-recognition and relationship-classification tasks are obtained, corresponding respectively to the recognition-task, relationship-classification-task and shared-task vectors.
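A minimal numpy sketch of one attention branch of step (e); the same function is applied three times with separate parameter matrices to produce e_ner, e_share and e_re. The shapes (N = 4 characters, d = 8) and random initialization are illustrative, not the patent's settings.

```python
import numpy as np

def attention_branch(e, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one branch: Q = e Wq, K = e Wk,
    V = e Wv, alpha = softmax(Q K^T / sqrt(d_att)), output = alpha V."""
    Q, K, V = e @ Wq, e @ Wk, e @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)      # row-wise softmax
    return alpha @ V

rng = np.random.default_rng(0)
e = rng.normal(size=(4, 8))                          # N = 4 characters, d = 8
e_ner, e_share, e_re = (attention_branch(e, *(rng.normal(size=(8, 8)) for _ in range(3)))
                        for _ in range(3))           # three branches, separate weights
```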
In one embodiment of the invention, step (f) comprises the steps of:
(f-1) Splicing the entity-recognition vector embedding e_t^ner corresponding to the t-th character with the shared vector embedding e_t^share of the entity-recognition and relationship-classification tasks corresponding to the t-th character to obtain the vector e_t^CHE; the vector matrix is e_CHE.
(f-2) Inputting the vector matrix e_CHE into a bidirectional long short-term memory neural network BiLSTM to obtain the vector matrix O_CHE, where o_t^CHE ∈ R^{1×d_h} is the output of the BiLSTM for the t-th character w_t and d_h is a dimension.
(f-3) Inputting the output sequence O_CHE into a single-layer linear network and computing the vector matrix P_CHE by P_CHE = O_CHE W_CHE + b_CHE, where W_CHE is the parameter matrix of the single-layer linear network, b_CHE is its bias term, and L_ner is the number of entity labels. The entity labels are B-Campaign, I-Campaign, B-identity, I-identity, B-tool, I-tool, B-malware, I-malware, B-actor, I-actor, B-vulnerability, I-vulnerability and O. B-Campaign denotes the label of the first character of a cyber-attack-activity (campaign) type entity in the text, and I-Campaign the label of the other characters of such an entity; B-identity denotes the first character of an organization or person type entity, and I-identity its other characters; B-tool denotes the first character of a tool type entity, and I-tool its other characters; B-malware denotes the first character of a malware type entity, and I-malware its other characters; B-actor denotes the first character of a threat-actor type entity, and I-actor its other characters; B-vulnerability denotes the first character of a vulnerability type entity, and I-vulnerability its other characters; O is the label of a non-entity character. B-Campaign, B-identity, B-tool, B-malware, B-actor and B-vulnerability constitute the B label set; I-Campaign, I-identity, I-tool, I-malware, I-actor and I-vulnerability constitute the I label set.
(f-4) Inputting the vector matrix P_CHE into a conditional random field CRF to perform head-entity decoding, obtaining for each character the probabilities of the different entity labels, where p_t is the probability sequence of the t-th character w_t over the labels of the different entity types.
(f-5) Using the torch.argmax() function in the pytorch toolkit, computing the entity label l_t with the highest probability in the probability sequence p_t of the t-th character w_t, obtaining the label sequence of the sentence. The label sequence of the sentence is traversed, and a label subsequence beginning with a label from the B label set and continuing with labels from the I label set is marked as a head entity, yielding the candidate head-entity tag sequence E, E = {entity_1, entity_2, ..., entity_i, ..., entity_m}, where entity_i is the label of the i-th head entity, i ∈ {1, ..., m}, and m is the number of head entities.
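The span-marking rule of step (f-5) — a B-* label opens an entity and consecutive I-* labels of the same type extend it — can be sketched as follows; the (start, end, type) return format is illustrative.

```python
def extract_entities(tags):
    """Scan a BIO label sequence and return (start, end, type) spans:
    a B-<type> tag opens an entity span, and consecutive I-<type> tags
    of the same type extend it (step f-5)."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes a trailing span
        inside = start is not None
        if inside and not (tag.startswith("I-") and tag[2:] == etype):
            entities.append((start, i, etype))      # close the open span
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]               # open a new span
    return entities
```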
In one embodiment of the invention, step (g) comprises the steps of:
(g-1) Splicing the relationship-classification vector embedding e_t^re corresponding to the t-th character with the shared vector embedding e_t^share of the entity-recognition and relationship-classification tasks corresponding to the t-th character to obtain the vector e_t^TRE; the vector matrix is e_TRE.
(g-2) In the label entity_i of the i-th head entity, filling entity positions with 1 and non-entity positions with 0 to obtain the mask vector L_i ∈ R^{1×N} of the label entity_i of the i-th head entity; the set of mask vectors of the entities in the j-th sentence s_j is L, L = {L_1, L_2, ..., L_i, ..., L_m}, L ∈ R^{m×N}.
(g-3) Computing the i-th head entity vector embedding by multiplying the mask vector L_i with the vector matrix e_TRE, inputting the i-th head entity vector embedding into a max pooling layer to unify the first dimension of the entity, and outputting the pooled i-th head entity vector embedding entity_i^CHE.
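Steps (g-2)/(g-3) — mask multiplication followed by pooling — can be sketched as below; this is a simplification in which the pooling layer is reduced to a max over the character dimension.

```python
import numpy as np

def pool_entity(e_tre, mask):
    """(g-2)/(g-3): zero out non-entity positions with the 0/1 mask vector,
    then max-pool over the character dimension to get one fixed-size
    entity vector regardless of entity length."""
    masked = e_tre * np.asarray(mask, dtype=float)[:, None]
    return masked.max(axis=0)

# Three characters with 2-dim embeddings; the entity covers positions 1 and 2.
vec = pool_entity(np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]), [0, 1, 1])
```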
In one embodiment of the invention, step (h) comprises the steps of:
(h-1) Using the torch.randn() function of the torch toolkit, initializing the parameter matrices W_k and W_q for the correlation calculation.
(h-2) Multiplying the vector e_t^TRE with the parameter matrix W_k to obtain the key vector k_t; the key vector matrix is K_TRE. Multiplying the pooled j-th head entity vector embedding entity_j^CHE with the parameter matrix W_q to obtain the query vector q_j; the query vector matrix is Q_CHE.
(h-3) Computing the correlation score S_jt of the query vector corresponding to each head entity and the key vector corresponding to the t-th character of each sentence as S_jt = V tanh(q_j + k_t), where V is a parameter matrix.
(h-4) Normalizing the correlation score S_jt with a softmax function to obtain the normalized correlation score α_jt, with α_jt in the range [0, 1].
(h-5) Computing from the normalized score the context representation h_jt of the query vector corresponding to the j-th head entity and the vector corresponding to the t-th character of each sentence, h_jt = α_jt k_t; the set of context representations of all head entities and the vectors corresponding to the t-th character of each sentence is h, h = {h_1t, h_2t, ..., h_jt, ..., h_mt}.
(h-6) Computing the gate g_j of the context representation h_jt as g_j = σ(W_1 h_jt ⊕ W_2 entity_j^CHE + b_1 + b_2), with g_j ∈ [0, 1], where σ(·) is the sigmoid function, W_1 and W_2 are parameter matrices, ⊕ is the splice operation, and b_1 and b_2 are bias terms.
(h-7) Computing the filtered vector u_j = g_j · tanh(W_3 h_jt + b_3), where W_3 is a parameter matrix and b_3 is a bias term.
(h-8) Splicing the pooled j-th head entity vector embedding with the filtered vector u_j to obtain the j-th head-entity-and-relationship correlation representation vector o_j; all head-entity-and-relationship correlation representations form the sequence O. The correlation calculation of entities and relationships uses the vector matrix e_TRE and the head entity vector set entity_CHE; this step corresponds to the entity-relationship correlation calculation module.
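The gating of steps (h-6)/(h-7) can be sketched as below. The exact combination of W_1, W_2, the splice and the bias terms is not fully recoverable from the formula images, so this is one plausible reading, with the function name gated_filter illustrative.

```python
import numpy as np

def gated_filter(h_jt, e_head, W1, W2, W3, b1, b2, b3):
    """g_j = sigmoid(W1 h_jt + b1 + W2 e_head + b2) gates a tanh transform
    of the context vector: u_j = g_j * tanh(W3 h_jt + b3)  (steps h-6 / h-7).
    The gate suppresses context features that are irrelevant to the head entity."""
    g = 1.0 / (1.0 + np.exp(-(W1 @ h_jt + b1 + W2 @ e_head + b2)))  # sigmoid gate
    return g * np.tanh(W3 @ h_jt + b3)                              # filtered vector
```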
In one embodiment of the invention, step (i) comprises the steps of:
(i-1) Inputting all head-entity-and-relationship correlation representation sequences O into a bidirectional long short-term memory neural network BiLSTM for context learning of the feature sequence; processing by the BiLSTM gives the vector sequence O_re, where o_j^re is the output of the BiLSTM for the j-th head-entity-and-relationship correlation representation vector o_j.
(i-2) Inputting the j-th head-entity-and-relationship correlation representation into a single-layer linear network and computing the vector matrix P_j^re = o_j^re W_re + b_re, where W_re is the parameter matrix of the single-layer linear network and b_re is its bias term.
(i-3) Inputting the vector matrix P_j^re into a conditional random field CRF to obtain, for the j-th head entity entity_j, the probabilities of the different entity labels in the corresponding sentence, where p_t^j is the probability sequence of the t-th character w_t of the sentence corresponding to the j-th head entity entity_j over the labels of the different entity types; the set of probability values corresponding to all head entities is P.
(i-4) Using the torch.argmax() function in the pytorch toolkit, computing the entity label l_j ∈ R^N with the highest probability among the probabilities of the different entity labels in the sentence corresponding to the j-th head entity entity_j. The entity labels of the maximum-probability label l_j are B-Campaign, I-Campaign, B-identity, I-identity, B-tool, I-tool, B-malware, I-malware, B-actor, I-actor, B-vulnerability, I-vulnerability and O; all tail-entity label sequences form l.
(i-5) Defining the highest-probability entity label l_j as a tail-entity label sequence; a label subsequence beginning with a label from the B label set and continuing with labels from the I label set is marked as a tail entity, giving the tail-entity label sequence E′, E′ = {entity_1′, entity_2′, ..., entity_i′, ..., entity_n′}, where entity_i′ is the label of the i-th tail entity, i ∈ {1, ..., n}, and n is the number of tail entities.
(i-6) In the label entity_i′ of the i-th tail entity, filling entity positions with 1 and non-entity positions with 0 to obtain the mask vector L_i′ ∈ R^{1×N} of the label entity_i′ of the i-th tail entity; the mask vector set of the tail entities corresponding to the j-th head entity entity_j is L_j, L_j = {L_1′, L_2′, ..., L_i′, ..., L_n′}, L_j ∈ R^{n×N}.
(i-7) Computing the i-th tail entity vector embedding by multiplying the mask vector L_i′ with the vector matrix e_TRE; all tail entity vector embedding sequences form entity_tail. Inputting the i-th tail entity vector embedding into the max pooling layer and outputting the pooled i-th tail entity vector embedding; all pooled vector sequences form entity_tail′.
(i-8) The pooled i-th tail entity vector embedding is spliced with the j-th head-entity-and-relationship correlation vector o_j to obtain the vector x_ij; x_ij is input into a single-layer linear network, and the vector matrix y_ij^re is computed as y_ij^re = x_ij W_re′ + b_re′, where W_re′ is the parameter matrix of the single-layer linear network, b_re′ is its bias term, and L_re is the number of relationship labels. The relationship labels are use, targets and other; the labels use and targets represent relationships between entities of the predefined types, and the label other represents that no relationship exists between the entities. Specifically, the relationship use is defined for the label pairs (B-malware, B-Campaign), (I-malware, I-Campaign), (B-actor, B-Campaign), (I-actor, I-Campaign), (B-actor, B-tool), (I-actor, I-tool), (B-actor, B-malware) and (I-actor, I-malware); the relationship targets is defined for the label pairs (B-actor, B-vulnerability), (I-actor, I-vulnerability), (B-malware, B-vulnerability), (I-malware, I-vulnerability), (B-actor, B-identity), (I-actor, I-identity), (B-malware, B-identity) and (I-malware, I-identity); the relationship other is defined for the label pairs (B-Campaign, B-vulnerability), (I-Campaign, I-vulnerability), (B-tool, B-vulnerability), (I-tool, I-vulnerability), (B-tool, B-Campaign), (I-tool, I-Campaign), (B-identity, B-Campaign), (I-identity, I-Campaign), (B-tool, B-identity), (I-tool, I-identity), (B-identity, B-vulnerability) and (I-identity, I-vulnerability).
(i-9) The vector matrix y_ij^re is normalized with a softmax function to obtain the normalized vector, and the relationship label rel_ij corresponding to the most probable entity label l_j is obtained using the torch.argmax() function in the pytorch toolkit, giving the triple <entity_j, rel_ij, entity_i′>. Extraction yields all triple sequences corresponding to the j-th head entity entity_j:
{<entity_j, rel_1j, entity_1′>, <entity_j, rel_2j, entity_2′>, ..., <entity_j, rel_ij, entity_i′>, ..., <entity_j, rel_nj, entity_n′>}.
Finally, it should be noted that the foregoing description is only a preferred embodiment of the present invention and that the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of the technical features. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A method for jointly extracting entities and relationships from cyber security threat intelligence, comprising the steps of:
(a) collecting n cyber security threat intelligence articles to obtain a cyber security threat intelligence set D, D = {d_1, d_2, ..., d_i, ..., d_n}, where d_i is the i-th cyber security threat intelligence article, i ∈ {1, ..., n};
(b) performing a data preprocessing operation on the i-th article d_i to obtain a sentence set S_i, S_i = {s_1, s_2, ..., s_j, ..., s_l}, where s_j is the j-th sentence in S_i, j ∈ {1, ..., l}, and l is the number of sentences in S_i;
(c) segmenting the j-th sentence s_j to obtain a character sequence X_j, X_j = {w_1, w_2, ..., w_t, ..., w_N}, where w_t is the t-th character, t ∈ {1, ..., N}, and N is the number of characters in s_j;
(d) obtaining the embedded representation e_t of the t-th character; the vector embedding sequence of s_j is e, e = {e_1, e_2, ..., e_t, ..., e_N};
(e) inputting the vector embedding sequence e into an attention module to obtain the entity-recognition task vector matrix e_ner, the relationship-classification task vector matrix e_re and the shared vector matrix e_share of the entity-recognition and relationship-classification tasks, where e_t^ner is the entity-recognition vector embedding corresponding to the t-th character, e_t^re is the relationship-classification vector embedding corresponding to the t-th character, and e_t^share is the shared vector embedding of the entity-recognition and relationship-classification tasks corresponding to the t-th character;
(f) obtaining a candidate head-entity tag sequence E from e_ner and e_share, E = {entity_1, entity_2, ..., entity_i, ..., entity_m}, where entity_i is the label of the i-th entity, i ∈ {1, ..., m}, and m is the number of entities;
(g) obtaining the set of all head entity vectors entity_CHE from e_re and e_share, where entity_i^CHE is the pooled i-th head entity vector embedding, i ∈ {1, ..., m};
(h) computing from entity_CHE the sequence O of all head-entity-and-relationship correlation representations, O = {o_1, o_2, ..., o_j, ..., o_m}, where o_j is the correlation representation vector of the j-th head entity and the relationships, j ∈ {1, ..., m};
(i) obtaining the triple sequence of the j-th head entity entity_j from the correlation representation sequence O.
2. The method for jointly extracting entities and relationships from cyber-security threat intelligence of claim 1, wherein: in step (a), n cyber security threat intelligence articles are collected from public vulnerability reports and/or social media and/or security news.
3. The method of claim 1, wherein the data preprocessing operation in step (b) comprises:
(b-1) removing punctuation marks, mathematical operation symbols, brackets, quotation marks and the symbols @, #, $, %, ^ and & from the i-th cyber security threat intelligence article d_i to perform the noise-removal operation;
(b-2) converting the noise-removed i-th cyber security threat intelligence article d_i into the sentence set S_i using the split() function of the python program with the period as delimiter.
4. The method for jointly extracting entities and relationships from cyber-security threat intelligence of claim 1, wherein: in step (c) the j-th sentence s_j is segmented using the tokenizer function in the Transformers toolkit.
5. The method of claim 1, wherein step (d) comprises the steps of:
(d-1) inputting the t-th character w_t into a Bert pre-training model and outputting the feature vector x_t^w ∈ R^{1×d_w}, where R is the real space and d_w is a dimension;
(d-2) inputting the t-th character w_t into a Glove pre-training model and outputting the feature vector x_t^g ∈ R^{1×d_g}, where d_g is a dimension;
(d-3) embedding the t-th character w_t with the spaCy part-of-speech tagging tool to obtain the part-of-speech vector x_t^p ∈ R^{1×d_p}, where d_p is a dimension;
(d-4) splicing the feature vector x_t^w, the feature vector x_t^g and the part-of-speech vector x_t^p to obtain the embedded representation e_t of the t-th character, e_t = x_t^w ⊕ x_t^g ⊕ x_t^p; the vector embedding sequence of the j-th sentence s_j is e.
6. The method for jointly extracting entities and relationships from cyber-security threat intelligence of claim 5, wherein step (e) comprises the steps of:
(e-1) the attention module is composed of an entity-recognition attention module, a shared-task attention module and a relationship-classification attention module; the torch.randn() function of the torch toolkit is used to initialize the parameter matrices W_q^ner, W_k^ner and W_v^ner of the entity-recognition attention module, the parameter matrices W_q^share, W_k^share and W_v^share of the shared-task attention module, and the parameter matrices W_q^re, W_k^re and W_v^re of the relationship-classification attention module, where d_att is a dimension;
(e-2) the Q vector of the t-th position of the entity-recognition attention module is computed as Q_t^ner = e_t (W_q^ner)^T, where T denotes the transpose; the K vector as K_t^ner = e_t (W_k^ner)^T; and the V vector as V_t^ner = e_t (W_v^ner)^T; the Q, K and V matrices of the entity-recognition attention module are Q^ner, K^ner and V^ner; the attention score of the t-th Q vector and the j-th K vector is α_tj^ner = softmax(Q_t^ner (K_j^ner)^T / √d_att), with α^ner ∈ R^{N×N}; the entity-recognition vector embedding corresponding to the t-th character is e_t^ner = Σ_j α_tj^ner V_j^ner, and the entity-recognition task vector matrix is e_ner;
(e-3) in the same way, Q_t^share = e_t (W_q^share)^T, K_t^share = e_t (W_k^share)^T and V_t^share = e_t (W_v^share)^T are computed for the shared-task attention module; the attention score of the t-th Q vector and the j-th K vector is α_tj^share = softmax(Q_t^share (K_j^share)^T / √d_att), with α^share ∈ R^{N×N}; the shared vector embedding of the entity-recognition and relationship-classification tasks corresponding to the t-th character is e_t^share = Σ_j α_tj^share V_j^share, and the shared vector matrix is e_share;
(e-4) likewise, Q_t^re = e_t (W_q^re)^T, K_t^re = e_t (W_k^re)^T and V_t^re = e_t (W_v^re)^T are computed for the relationship-classification attention module; the attention score of the t-th Q vector and the j-th K vector is α_tj^re = softmax(Q_t^re (K_j^re)^T / √d_att), with α^re ∈ R^{N×N}; the relationship-classification vector embedding corresponding to the t-th character is e_t^re = Σ_j α_tj^re V_j^re, and the relationship-classification task vector matrix is e_re.
7. The method for jointly extracting entities and relationships from cyber-security threat intelligence of claim 6, wherein step (f) comprises the steps of:
(f-1) The entity recognition vector embedding e_t^ner corresponding to the t-th character and the shared vector embedding e_t^share of the entity recognition and relation classification tasks corresponding to the t-th character are spliced to obtain the vector e_t^CHE = [e_t^ner; e_t^share], whose vector matrix is e^CHE;
(f-2) The vector matrix e^CHE is input into a bidirectional long short-term memory network (BiLSTM) to obtain the vector matrix O^CHE = {O_1^CHE, ..., O_N^CHE}, where O_t^CHE is the BiLSTM output for the t-th character w_t, O_t^CHE ∈ R^{d_h}, and d_h is the hidden dimension;
(f-3) The output vector matrix O^CHE is input into a single-layer linear network, and the vector matrix P^CHE is calculated through the formula P^CHE = O^CHE · W^CHE + b^CHE, where L_ner is the number of entity tags, W^CHE is the parameter matrix of the single-layer linear network, W^CHE ∈ R^{d_h×L_ner}, and b^CHE is the bias term of the single-layer linear network. The entity tags are B-campaign, I-campaign, B-identity, I-identity, B-tool, I-tool, B-malware, I-malware, B-actor, I-actor, B-vulnerability, I-vulnerability, and O. B-campaign denotes the tag of the first character of a cyber-attack-campaign entity in the text, and I-campaign the tag of the characters of such an entity other than the first; B-identity denotes the tag of the first character of an organization-or-person entity, and I-identity the tag of its other characters; B-tool denotes the tag of the first character of a tool entity, and I-tool the tag of its other characters; B-malware denotes the tag of the first character of a malware entity, and I-malware the tag of its other characters; B-actor denotes the tag of the first character of a threat-actor entity, and I-actor the tag of its other characters; B-vulnerability denotes the tag of the first character of a vulnerability entity, and I-vulnerability the tag of its other characters; O is the tag of non-entity characters. B-campaign, B-identity, B-tool, B-malware, B-actor, and B-vulnerability constitute the B tag set; I-campaign, I-identity, I-tool, I-malware, I-actor, and I-vulnerability constitute the I tag set;
(f-4) The vector matrix P^CHE is input into a conditional random field (CRF) to obtain the probability of each character corresponding to the different entity tags, where P_t^CHE is the probability sequence of the t-th character w_t over the different entity-type tags;
(f-5) The entity tag l_t with the highest probability in the probability sequence P_t^CHE of the t-th character w_t is calculated using the torch.argmax() function in the pytorch toolkit, giving the tag sequence l = {l_1, l_2, ..., l_N} of the sentence. The tag sequence of the sentence is traversed, using tags in the B tag set as the start of a head entity and tags in the I tag set as its continuation, to obtain the head-entity tag sequence E, E = {entity_1, entity_2, ..., entity_i, ..., entity_m}, where entity_i is the tag of the i-th head entity, i ∈ {1, ..., m}, and m is the number of head entities.
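The traversal in step (f-5), which collects B-started, I-continued spans from the decoded tag sequence, can be sketched as follows. This is an illustrative reading of the step, not the patent's exact procedure; the function name and the (type, positions) output format are assumptions:

```python
def extract_entities(tags):
    """Scan a BIO tag sequence: a B-* tag opens a span, matching I-* tags
    of the same type extend it, and anything else closes it.
    Returns a list of (entity_type, character_positions) pairs."""
    entities = []
    etype, span = None, []
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if span:
                entities.append((etype, span))
            etype, span = tag[2:], [i]
        elif tag.startswith("I-") and span and tag[2:] == etype:
            span.append(i)
        else:
            if span:
                entities.append((etype, span))
            etype, span = None, []
    if span:  # close a span that runs to the end of the sentence
        entities.append((etype, span))
    return entities
```

For example, the tag sequence B-malware, I-malware, O, B-actor, O yields one malware span covering positions 0-1 and one actor span at position 3.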
8. The method for jointly extracting entities and relationships from cyber-security threat intelligence of claim 7, wherein step (g) comprises the steps of:
(g-1) The relation classification vector embedding e_t^re corresponding to the t-th character and the shared vector embedding e_t^share of the entity recognition and relation classification tasks corresponding to the t-th character are spliced to obtain the vector e_t^TRE = [e_t^re; e_t^share], whose vector matrix is e^TRE;
(g-2) The entity positions in the tag entity_i of the i-th head entity are filled with 1 and the non-entity positions with 0 to obtain the mask vector L_i of the tag entity_i, L_i ∈ R^{1×N}; the set of mask vectors of the entities in the j-th sentence s_j is L, L = {L_1, L_2, ..., L_i, ..., L_m}, L ∈ R^{m×N};
(g-3) The i-th head-entity vector embedding entity_i^head is calculated by using the mask L_i to select the character vectors of e^TRE at the entity positions; entity_i^head is then input into the max-pooling layer, which outputs the pooled i-th head-entity vector embedding entity_i^head′.
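Steps (g-2) and (g-3) amount to building a 0/1 mask over the sentence's characters and max-pooling the vectors at the masked positions into a single entity embedding. A minimal pure-Python sketch, with the function name and list-based representation as illustrative assumptions:

```python
def entity_embedding(mask, vectors):
    """Masked max-pooling of per-character vectors into one entity vector.

    mask: length-N list of 0/1 flags (1 at the entity's character positions).
    vectors: N x d list of per-character embedding vectors.
    Returns the dimension-wise maximum over the selected positions."""
    selected = [v for m, v in zip(mask, vectors) if m == 1]
    d = len(vectors[0])
    return [max(v[i] for v in selected) for i in range(d)]
```

For a two-character entity at positions 1 and 2, the result keeps the largest value of each dimension across those two character vectors.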
9. The method for jointly extracting entities and relationships from cyber-security threat intelligence of claim 8, wherein step (h) comprises the steps of:
(h-1) The parameter matrices W_K^TRE and W_Q^TRE used for the correlation calculation are initialized using the torch.randn() function of the pytorch toolkit;
(h-2) The vector e_t^TRE is multiplied by the parameter matrix W_K^TRE to obtain the key vector K_t^TRE, and the key vector matrix is K^TRE; the pooled j-th head-entity vector embedding entity_j^head′ is multiplied by the parameter matrix W_Q^TRE to obtain the query vector Q_j, and the query vector matrix is Q^CHE;
(h-3) The correlation score S_jt between the query vector corresponding to each head entity and the key vector corresponding to the t-th character of each sentence is calculated through the formula S_jt = V · tanh(Q_j + K_t^TRE), where V is a parameter matrix;
(h-4) The correlation score S_jt is normalized using a softmax function to obtain the normalized correlation score α_jt, whose value range is [0, 1];
(h-5) The contextual representation h_jt of the query vector corresponding to the j-th head entity and the vector corresponding to the t-th character of each sentence is calculated as h_jt = α_jt · K_t^TRE; the set of contextual representations of all head entities and the vectors corresponding to the t-th character of each sentence is h = {h_1t, h_2t, ..., h_jt, ..., h_mt};
(h-6) The gate g_j of the contextual representation h_jt is calculated through the formula g_j = σ(W_2 · (W_1 · [h_jt; Q_j] + b_1) + b_2), g_j ∈ [0, 1], where σ(·) is the sigmoid function, W_1 and W_2 are parameter matrices, [·;·] denotes the splice operation, and b_1 and b_2 are bias terms;
(h-7) The filtered vector u_j is calculated through the formula u_j = g_j · tanh(W_3 · h_jt + b_3), where W_3 is a parameter matrix and b_3 is a bias term;
(h-8) The query vector Q_j corresponding to the j-th head entity and the filtered vector u_j are spliced to obtain the correlation representation vector o_j of the j-th head entity and the relation, o_j = [Q_j; u_j]; the correlation representation sequence of all head entities and relations is O = {o_1, o_2, ..., o_j, ..., o_m}.
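Steps (h-6) and (h-7) apply a sigmoid gate to a tanh transform of the context representation, so that g_j ∈ [0, 1] scales how much of the context survives filtering. The sketch below collapses the patent's matrices and splice to scalar weights to show only the gating arithmetic; all names and the scalar form are illustrative assumptions:

```python
import math

def gated_filter(h_jt, w1, w2, w3, b1, b2, b3):
    """Scalar stand-in for steps (h-6)-(h-7): a sigmoid gate g_j in [0, 1]
    scales a tanh transform of the context representation h_jt, producing
    the filtered value u_j."""
    g = 1.0 / (1.0 + math.exp(-(w2 * (w1 * h_jt + b1) + b2)))  # gate g_j
    return g * math.tanh(w3 * h_jt + b3)                        # filtered u_j
```

Because tanh is bounded in (-1, 1) and the gate in (0, 1), the filtered output is always bounded in (-1, 1) regardless of the input scale.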
10. The method for jointly extracting entities and relationships from cyber-security threat intelligence of claim 9, wherein step (i) comprises the steps of:
(i-1) The correlation representation sequence O of all head entities and relations is input into a bidirectional long short-term memory network (BiLSTM) to obtain the vector sequence O^re, where o_j^re is the BiLSTM output of the correlation representation vector o_j of the j-th head entity and the relation;
(i-2) The BiLSTM output o_j^re of the j-th head-entity correlation representation is input into a single-layer linear network, and the vector matrix P_j^re is calculated through the formula P_j^re = o_j^re · W^re + b^re, where W^re is the parameter matrix of the single-layer linear network and b^re is its bias term;
(i-3) The vector matrix P_j^re is input into the conditional random field CRF to obtain the probabilities of the different entity tags in the sentence corresponding to the j-th head entity entity_j, where P_jt^re is the probability sequence of the t-th character w_t of the sentence corresponding to the j-th head entity over the different entity-type tags, and the set of probability values corresponding to all head entities is P^re;
(i-4) The entity tag sequence l_j with the highest probability among the probabilities of the different entity tags in the sentence corresponding to the j-th head entity entity_j is calculated using the torch.argmax() function in the pytorch toolkit, l_j ∈ R^N; the entity tags of the maximum-probability sequence l_j are B-campaign, I-campaign, B-identity, I-identity, B-tool, I-tool, B-malware, I-malware, B-actor, I-actor, B-vulnerability, I-vulnerability, and O, and the tag sequences of all tail entities are l = {l_1, ..., l_m};
(i-5) The highest-probability entity tag sequence l_j is defined as the tail-entity tag sequence; using tags in the B tag set as the start and tags in the I tag set as the continuation, the spans are marked as tail-entity tags, giving the tail-entity tag sequence E′, E′ = {entity_1′, entity_2′, ..., entity_i′, ..., entity_n′}, where entity_i′ is the tag of the i-th tail entity, i ∈ {1, ..., n}, and n is the number of tail entities;
(i-6) The entity positions in the tag entity_i′ of the i-th tail entity are filled with 1 and the non-entity positions with 0 to obtain the mask vector L_i′ of the tag entity_i′, L_i′ ∈ R^{1×N}; the set of mask vectors of the tail entities corresponding to the j-th head entity entity_j is L_j, L_j = {L_1′, L_2′, ..., L_i′, ..., L_n′}, L_j ∈ R^{n×N};
(i-7) The i-th tail-entity vector embedding entity_i^tail is calculated by using the mask L_i′ to select the character vectors at the tail-entity positions, and the sequence of all tail-entity vector embeddings is entity^tail; the i-th tail-entity vector embedding entity_i^tail is input into the max-pooling layer, which outputs the pooled i-th tail-entity vector embedding entity_i^tail′, and the sequence of all pooled vectors is entity^tail′;
(i-8) The pooled i-th tail-entity vector embedding entity_i^tail′ and the pooled j-th head-entity vector embedding entity_j^head′ are spliced to obtain the vector [entity_i^tail′; entity_j^head′], which is input into a single-layer linear network; the vector matrix P_ij^re is calculated through the formula P_ij^re = [entity_i^tail′; entity_j^head′] · W^re′ + b^re′, where W^re′ is the parameter matrix of the single-layer linear network, b^re′ is its bias term, and L_re is the number of relation tags, L_re = 3. The relation tags are use, targets, and other. The relation between tag B-malware and tag B-campaign is defined as use, as is the relation between tag I-malware and tag I-campaign; the relation between tag B-actor and tag B-campaign is defined as use, as is the relation between tag I-actor and tag I-campaign; the relation between tag B-actor and tag B-tool is defined as use, as is the relation between tag I-actor and tag I-tool; the relation between tag B-actor and tag B-malware is defined as use, as is the relation between tag I-actor and tag I-malware. The relation between tag B-actor and tag B-vulnerability is defined as targets, as is the relation between tag I-actor and tag I-vulnerability; the relation between tag B-malware and tag B-vulnerability is defined as targets, as is the relation between tag I-malware and tag I-vulnerability; the relation between tag B-actor and tag B-identity is defined as targets, as is the relation between tag I-actor and tag I-identity; the relation between tag B-malware and tag B-identity is defined as targets, as is the relation between tag I-malware and tag I-identity. The relation between tag B-campaign and tag B-vulnerability is defined as other, as is the relation between tag I-campaign and tag I-vulnerability; the relation between tag B-tool and tag B-vulnerability is defined as other, as is the relation between tag I-tool and tag I-vulnerability; the relation between tag B-tool and tag B-campaign is defined as other, as is the relation between tag I-tool and tag I-campaign; the relation between tag B-identity and tag B-campaign is defined as other, as is the relation between tag I-identity and tag I-campaign; the relation between tag B-tool and tag B-identity is defined as other, as is the relation between tag I-tool and tag I-identity;
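The relation definitions enumerated in step (i-8) reduce to a lookup table over pairs of entity types. The sketch below is a hypothetical rendering of that table, with the garbled tag spellings normalized to campaign/identity/tool/malware/actor/vulnerability; the set contents mirror only the pairs the claim lists, and the function name is an assumption:

```python
# Pairs of (head_type, tail_type) the claim assigns the "use" relation:
USE = {("malware", "campaign"), ("actor", "campaign"),
       ("actor", "tool"), ("actor", "malware")}

# Pairs the claim assigns the "targets" relation:
TARGETS = {("actor", "vulnerability"), ("malware", "vulnerability"),
           ("actor", "identity"), ("malware", "identity")}

def relation_for(head_type, tail_type):
    """Look up the relation tag for a head/tail entity-type pair;
    any pair not listed as use or targets falls back to other."""
    pair = (head_type, tail_type)
    if pair in USE:
        return "use"
    if pair in TARGETS:
        return "targets"
    return "other"
```

Encoding the table once over base entity types avoids repeating every rule for both the B- and I- variants of each tag.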
(i-9) The vector matrix P_ij^re is normalized using a softmax function to obtain the normalized vector; the relation tag rel_ij corresponding to the highest-probability entity tag l_j is obtained using the torch.argmax() function in the pytorch toolkit, giving the triple <entity_j, rel_ij, entity_i′>; all triple sequences corresponding to the j-th head entity entity_j are extracted as {<entity_j, rel_1j, entity_1′>, <entity_j, rel_2j, entity_2′>, ..., <entity_j, rel_ij, entity_i′>, ..., <entity_j, rel_nj, entity_n′>}.
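Step (i-9)'s softmax-then-argmax decoding can be sketched as follows. Since softmax is monotone, taking the argmax over the raw linear-layer scores selects the same relation label as taking it over the normalized probabilities; the function name, score layout, and entity strings are illustrative assumptions:

```python
def decode_relations(scores, head, tails, labels=("use", "targets", "other")):
    """Turn per-tail relation scores into triples for one head entity.

    scores: one row of unnormalized scores over the relation labels per
    tail entity. softmax is monotone, so argmax over the raw scores picks
    the same label as argmax over the softmax-normalized probabilities."""
    triples = []
    for tail, row in zip(tails, scores):
        rel = labels[max(range(len(row)), key=row.__getitem__)]
        triples.append((head, rel, tail))
    return triples
```

With two tail entities whose score rows peak on the first and second labels respectively, the head entity yields one use triple and one targets triple.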
CN202311302393.7A 2023-10-10 2023-10-10 Method for extracting entity and relation from network security threat information combination Active CN117332785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311302393.7A CN117332785B (en) 2023-10-10 2023-10-10 Method for extracting entity and relation from network security threat information combination

Publications (2)

Publication Number Publication Date
CN117332785A (en) 2024-01-02
CN117332785B (en) 2024-03-01

Family

ID=89274987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311302393.7A Active CN117332785B (en) 2023-10-10 2023-10-10 Method for extracting entity and relation from network security threat information combination

Country Status (1)

Country Link
CN (1) CN117332785B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368528A (en) * 2020-03-09 2020-07-03 西南交通大学 Entity relation joint extraction method for medical texts
CN111950297A (en) * 2020-08-26 2020-11-17 桂林电子科技大学 Abnormal event oriented relation extraction method
CN113806559A (en) * 2021-09-24 2021-12-17 东南大学 Knowledge graph embedding method based on relationship path and double-layer attention
CN114330322A (en) * 2022-01-05 2022-04-12 北京邮电大学 Threat information extraction method based on deep learning
CN115759092A (en) * 2022-10-13 2023-03-07 中国民航大学 Network threat information named entity identification method based on ALBERT
CN115860152A (en) * 2023-02-20 2023-03-28 南京星耀智能科技有限公司 Cross-modal joint learning method oriented to character military knowledge discovery

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Chaoran; Qiu Hangping; Sun Yi; Wang Zhongwei. A survey of machine reading comprehension based on pre-trained models. Computer Engineering and Applications, 2020, (11): 22-30. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant