
CN116522331A - Network threat information extraction method, device, storage medium and apparatus

Info

Publication number: CN116522331A
Application number: CN202210069037.4A
Authority: CN
Inventors: 唐杰; 吴龙平; 莫建平; 余凯
Original assignee: 360 Digital Security Technology Group Co Ltd
Current assignee: 360 Digital Security Technology Group Co Ltd
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2023-08-01
Also published as: WO2023138047A1

Abstract

The invention relates to the technical field of the Internet and discloses a network threat information extraction method, device, storage medium and apparatus. The method comprises the following steps: performing natural language processing on unstructured network threat information to obtain an attack purpose and an attack means; performing attack means prediction on the attack purpose through a preset machine learning model to obtain an unknown attack means; and generating structured network threat information according to the attack purpose, the attack means and the unknown attack means. Because the invention automatically identifies and extracts the attacker's attack purpose and attack means from the unstructured network threat information based on natural language processing and a preset machine learning model, the analysis process of the network threat information can be simplified and the security defense capability can be improved.

Description

Network threat information extraction method, device, storage medium and apparatus

Technical Field

The present invention relates to the field of internet technologies, and in particular, to a method, an apparatus, a storage medium, and a device for extracting network threat information.

Background

With the explosive growth of cyber attacks, extracting and sharing intelligence on the techniques and attack implementation processes used by attackers (tactics, techniques, and procedures, TTPs), as described in threat analysis reports, is crucial to cyber security construction. However, because cyber threat reports lack a standard structured description language and there is no automated technique for extracting and analyzing their technical intelligence, analyzing complex, unstructured threat analysis reports is time-consuming and laborious.

The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.

Disclosure of Invention

The main object of the invention is to provide a network threat information extraction method, device, storage medium and apparatus, so as to solve the technical problem in the prior art that, because cyber threat reports lack a standard structured description language and there is no automated technique for extracting and analyzing their technical intelligence, analyzing complex, unstructured threat analysis reports is time-consuming and laborious.

In order to achieve the above object, the present invention provides a network threat information extraction method, which includes the following steps:

performing natural language processing on unstructured network threat information to obtain an attack purpose and an attack means;

performing attack means prediction on the attack purpose through a preset machine learning model to obtain an unknown attack means;

and generating structured network threat information according to the attack purpose, the attack means and the unknown attack means.

Optionally, the step of performing natural language processing on unstructured cyber threat information to obtain an attack purpose and an attack means includes:

Text preprocessing is carried out on unstructured network threat information to obtain simplified network threat information;

deep sentence breaking is carried out on the simplified network threat information to obtain threat sentences;

carrying out semantic dependency analysis on the threat statement to obtain a standard threat statement;

carrying out vocabulary marking on the standard threat sentences to obtain a threat corpus;

carrying out synonym expansion on the threat corpus to obtain a target corpus;

and determining an attack purpose and an attack means according to the target corpus.

Optionally, the step of performing semantic dependency analysis on the threat statement to obtain a standard threat statement includes:

carrying out semantic dependency analysis on the threat statement to obtain the dependency relationship among all the words in the threat statement;

and carrying out standardized processing on the threat statement according to the dependency relationship to obtain a standard threat statement.

Optionally, the step of performing standardization processing on the threat statement according to the dependency relationship to obtain a standard threat statement includes:

acquiring part-of-speech information of each word in the threat statement;

and carrying out standardized processing on the threat statement according to the part-of-speech information and the dependency relationship to obtain a standard threat statement.

Optionally, the step of performing synonym expansion on the threat corpus to obtain a target corpus includes:

acquiring the occurrence frequency of each keyword in the threat corpus, and determining threat keywords according to the occurrence frequency;

and carrying out synonym expansion on the threat keywords based on a preset dictionary to obtain a target corpus.

Optionally, the step of performing deep sentence breaking on the simplified cyber threat information to obtain a threat sentence includes:

acquiring sentence ending symbols, parallel relation conjunctions and progressive relation conjunctions in the simplified network threat information;

and carrying out deep sentence breaking on the simplified network threat information according to the sentence ending symbol, the parallel relation conjunctions and the progressive relation conjunctions to obtain threat sentences.

Optionally, the step of lexically marking the standard threat statement to obtain a threat corpus includes:

obtaining necessary scores of all parts in the standard threat statement;

and simplifying the standard threat statement according to the necessary score to obtain a threat corpus.

Optionally, the step of performing text preprocessing on unstructured cyber threat information to obtain simplified cyber threat information further includes:

Acquiring random information in unstructured network threat information;

and simplifying the random information to obtain simplified network threat information.

Optionally, before the step of predicting the attack means for the attack purpose by a preset machine learning model to obtain the unknown attack means, the method further includes:

constructing a training set corpus according to the target corpus;

and training the initial machine learning model according to the training set corpus to obtain a preset machine learning model.

Optionally, the step of constructing a training set corpus according to the target corpus includes:

acquiring the statement number of synonymous threat statements in the target corpus;

selecting threat sentence samples from the target corpus according to the sentence quantity;

and constructing a training set corpus based on the threat statement samples.

Optionally, the step of selecting a threat sentence sample from the target corpus according to the sentence number includes:

sequencing threat sentences in the target corpus according to the sentence quantity;

receiving a semantic tag input by a user based on the target corpus;

and selecting a threat statement sample from the target corpus according to the sorting result and the semantic tag.
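The sample selection described above (count synonymous threat statements, sort by that count, then select samples using user-supplied semantic tags) can be sketched as follows. This is a minimal illustration only: the function name `select_samples`, the pair-based corpus layout, the `top_k` cutoff, and the example data are all assumptions made for this sketch and are not specified by the source.

```python
from collections import Counter


def select_samples(target_corpus, semantic_tags, top_k=2):
    """Pick the most frequently restated threat statements that a user has tagged.

    target_corpus: list of (canonical_statement, variant_text) pairs; variants
        that share a canonical statement are treated as synonymous (assumption).
    semantic_tags: dict mapping canonical_statement -> user-supplied tag.
    """
    # Count how many synonymous variants each canonical statement has.
    counts = Counter(canonical for canonical, _ in target_corpus)
    # Sort by count (descending); ties are broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    # Keep only statements the user has tagged, up to top_k samples.
    tagged = [s for s in ranked if s in semantic_tags]
    return [(s, semantic_tags[s], counts[s]) for s in tagged[:top_k]]


# Hypothetical corpus and tags, for illustration only.
corpus = [
    ("modify registry", "the trojan modifies the registry"),
    ("modify registry", "registry keys are changed by the trojan"),
    ("modify registry", "the sample edits the registry"),
    ("release dll", "the program releases a malicious DLL file"),
    ("release dll", "a malicious DLL file is dropped"),
    ("send keylog", "keylogs are sent to an email address"),
]
tags = {"modify registry": "persistence", "send keylog": "exfiltration"}
samples = select_samples(corpus, tags)
```

Here the untagged statement is skipped even though it ranks second by count, which reflects one possible reading of selecting "according to the sorting result and the semantic tag".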

Optionally, the step of predicting the attack means for the attack purpose by a preset machine learning model to obtain an unknown attack means includes:

acquiring multi-platform network threat information;

and performing attack means prediction on the attack purpose through a preset machine learning model based on the multi-platform network threat information to obtain an unknown attack means.

In addition, in order to achieve the above object, the present invention also proposes a cyber threat information extraction apparatus including a memory, a processor, and a cyber threat information extraction program stored on the memory and executable on the processor, the cyber threat information extraction program being configured to implement the cyber threat information extraction method as described above.

In addition, in order to achieve the above object, the present invention also proposes a storage medium having stored thereon a cyber threat information extraction program which, when executed by a processor, implements the cyber threat information extraction method as described above.

In addition, in order to achieve the above object, the present invention also provides a cyber threat information extraction apparatus, which comprises a language processing module, a means prediction module and an information generation module;

The language processing module is used for carrying out natural language processing on unstructured network threat information to obtain an attack purpose and an attack means;

the means prediction module is used for predicting the attack means of the attack purpose through a preset machine learning model to obtain an unknown attack means;

the information generation module is used for generating structured network threat information according to the attack purpose, the attack means and the unknown attack means.
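The three-module structure described above can be pictured as a simple composition in which each module's output feeds the next. The sketch below is an assumption-laden illustration: the class name `ExtractionApparatus`, the callable signatures, and the stub modules are invented for this example and do not come from the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ExtractionApparatus:
    # Each field is a callable standing in for one module (hypothetical signatures).
    language_processing: Callable[[str], Tuple[str, List[str]]]
    means_prediction: Callable[[str], List[str]]
    information_generation: Callable[[str, List[str], List[str]], dict]

    def run(self, unstructured_text: str) -> dict:
        # Language processing module: text -> (attack purpose, attack means).
        purpose, means = self.language_processing(unstructured_text)
        # Means prediction module: attack purpose -> unknown attack means.
        unknown = self.means_prediction(purpose)
        # Information generation module: combine all three into structured output.
        return self.information_generation(purpose, means, unknown)


# Stub modules with fixed outputs, for illustration only.
apparatus = ExtractionApparatus(
    language_processing=lambda text: ("persistence", ["modify registry"]),
    means_prediction=lambda purpose: ["auto-start at boot"],
    information_generation=lambda p, m, u: {
        "purpose": p,
        "means": m,
        "unknown_means": u,
    },
)
result = apparatus.run("example report text")
```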

Optionally, the language processing module is further configured to perform text preprocessing on unstructured network threat information to obtain simplified network threat information;

the language processing module is also used for carrying out deep sentence breaking on the simplified network threat information to obtain threat sentences;

the language processing module is also used for carrying out semantic dependency analysis on the threat statement to obtain a standard threat statement;

the language processing module is also used for carrying out vocabulary marking on the standard threat sentences to obtain a threat corpus;

the language processing module is also used for carrying out synonym expansion on the threat corpus to obtain a target corpus;

the language processing module is also used for determining an attack purpose and an attack means according to the target corpus.

Optionally, the language processing module is further configured to perform semantic dependency analysis on the threat statement to obtain a dependency relationship between each vocabulary in the threat statement;

the language processing module is further used for carrying out standardized processing on the threat statement according to the dependency relationship to obtain a standard threat statement.

Optionally, the language processing module is further configured to obtain part-of-speech information of each vocabulary in the threat statement;

the language processing module is further used for carrying out standardized processing on the threat statement according to the part-of-speech information and the dependency relationship to obtain a standard threat statement.

Optionally, the language processing module is further configured to obtain an occurrence frequency of each keyword in the threat corpus, and determine a threat keyword according to the occurrence frequency;

the language processing module is further used for carrying out synonym expansion on the threat keywords based on a preset dictionary to obtain a target corpus.

Optionally, the language processing module is further configured to obtain a sentence ending symbol, a parallel relationship conjunctive and a progressive relationship conjunctive in the simplified network threat information;

the language processing module is further used for performing deep sentence breaking on the simplified network threat information according to the sentence ending symbol, the parallel relation conjunctions and the progressive relation conjunctions to obtain threat sentences.

In the invention, natural language processing is performed on unstructured network threat information to obtain an attack purpose and an attack means; attack means prediction is performed on the attack purpose through a preset machine learning model to obtain an unknown attack means; and structured network threat information is generated according to the attack purpose, the attack means and the unknown attack means. Because the invention automatically identifies and extracts the attacker's attack purpose and attack means from the unstructured network threat information based on natural language processing and a preset machine learning model, the analysis process of the network threat information can be simplified and the security defense capability can be improved.

Drawings

FIG. 1 is a schematic diagram of a network threat information extraction apparatus of a hardware operating environment according to an embodiment of the invention;

FIG. 2 is a flowchart of a first embodiment of a network threat information extraction method of the invention;

FIG. 3 is a flowchart of a second embodiment of a network threat information extraction method of the invention;

FIG. 4 is a flowchart of a third embodiment of a network threat information extraction method of the invention;

FIG. 5 is a schematic diagram of semantic dependency analysis according to an embodiment of the network threat information extraction method of the present invention;

FIG. 6 is a flowchart of a fourth embodiment of a network threat information extraction method of the invention;

FIG. 7 is a block diagram illustrating a first embodiment of a cyber threat information extraction apparatus according to the invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Referring to fig. 1, fig. 1 is a schematic diagram of a network threat information extraction apparatus of a hardware running environment according to an embodiment of the present invention.

As shown in FIG. 1, the cyber threat information extraction apparatus may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display, and may optionally also include a standard wired interface and a wireless interface; in the present invention, the wired interface of the user interface 1003 may be a USB interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a non-volatile memory (NVM), such as a disk memory. The memory 1005 may optionally also be a storage device separate from the processor 1001 described above.

Those skilled in the art will appreciate that the structure shown in fig. 1 does not constitute a limitation of the cyber threat information extraction apparatus, and may include more or fewer components than shown, or may combine certain components, or may be arranged in a different arrangement of components.

As shown in fig. 1, an operating system, a network communication module, a user interface module, and a network threat information extraction program may be included in a memory 1005, which is considered to be a type of computer storage medium.

In the network threat information extraction apparatus shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server, and performing data communication with the background server; the user interface 1003 is mainly used for connecting user equipment; the cyber threat information extraction apparatus invokes a cyber threat information extraction program stored in the memory 1005 through the processor 1001, and executes the cyber threat information extraction method provided by the embodiment of the present invention.

Based on the hardware structure, the embodiment of the network threat information extraction method is provided.

Referring to fig. 2, fig. 2 is a flowchart of a first embodiment of a network threat information extraction method according to the present invention, and the first embodiment of the network threat information extraction method according to the present invention is provided.

In a first embodiment, the cyber threat information extraction method includes the steps of:

Step S10: performing natural language processing on unstructured network threat information to obtain an attack purpose and an attack means.

It should be understood that the execution body of the method of the present embodiment may be a network threat information extraction apparatus having functions of data processing, network communication, and program running, for example, a server, etc., or other electronic apparatuses capable of implementing the same or similar functions, which is not limited in this embodiment.

It will be appreciated that, with the explosive growth of cyber attacks, extracting and sharing intelligence on the techniques and attack implementation processes used by attackers (TTPs), as described in threat analysis reports, is crucial to cyber security construction. However, because cyber threat reports lack a standard structured description language and there is no automated technique for extracting and analyzing their technical intelligence, analyzing complex, unstructured threat analysis reports is time-consuming and laborious.

The existing TRAM project, implemented by the American company MITRE based on machine learning (ML), and the TTPDrill project, implemented by the University of North Carolina in the United States based on information retrieval (IR), can process threat analysis reports in a relatively simple way. However, because both process English-language reports, they cannot be applied to Chinese threat analysis reports, whose description styles are complex and variable. In addition, the output accuracy and false alarm rate of both projects are not ideal; they are research-oriented and have little practical use.

Therefore, in order to overcome the above-mentioned drawbacks, in this embodiment, the attack purpose and attack means of an attacker in unstructured cyber threat information are automatically identified and extracted based on natural language processing and a preset machine learning model, so that the analysis process of cyber threat information can be simplified, and the security defensive capability can be further improved.

It should be noted that, the natural language processing may be at least one of text preprocessing, text deep sentence breaking, sentence semantic dependency analysis, vocabulary marking and vocabulary synonym expansion, which is not limited in this embodiment.

It should be noted that the attack purpose may be a technique adopted by an attacker in the unstructured cyber threat information; for example, the technique may be keeping a virus running persistently on a computer.

The attack means may be the attack implementation process of an attacker in the unstructured network threat information; for example, the attack implementation process may be modifying the registry or configuring automatic startup at boot.

Step S20: performing attack means prediction on the attack purpose through a preset machine learning model to obtain an unknown attack means.

It should be noted that the preset machine learning model may be configured in advance; in this embodiment and the other embodiments, a Bag of Words (BOW) model is taken as an example for explanation.

The bag-of-words model puts all the words into one bag without regard to grammar or word order; that is, each word is treated as independent.
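To make the bag-of-words idea concrete, the following minimal sketch turns sentences into word-count vectors. The helper names `build_vocabulary` and `bow_vector` and the whitespace tokenization are assumptions for illustration; the source does not describe how its model is implemented, and Chinese text would need a word segmenter instead of `split()`.

```python
from collections import Counter


def build_vocabulary(documents):
    # Map each distinct word to a fixed vector index (sorted for determinism).
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def bow_vector(document, vocabulary):
    # Count words and place each count at that word's index.
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # out-of-vocabulary words are ignored
            vector[vocabulary[word]] = count
    return vector


docs = ["trojan modifies registry", "registry modifies trojan"]
vocab = build_vocabulary(docs)
v1 = bow_vector(docs[0], vocab)
v2 = bow_vector(docs[1], vocab)
```

The two example sentences contain the same words in a different order, so `v1` and `v2` are identical; this is exactly the sense in which word order is disregarded.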

Step S30: generating structured network threat information according to the attack purpose, the attack means and the unknown attack means.

It should be understood that the generation of the structured network threat information according to the attack purpose, the attack means and the unknown attack means may be to aggregate the attack purpose, the attack means and the unknown attack means to obtain the structured network threat information.
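The aggregation step can be sketched as building one structured record from the three inputs. Everything named here is an assumption for illustration: the `StructuredThreatInfo` fields, the `aggregate` function, the JSON serialization, and the choice to drop predicted means that were already observed are not specified by the source.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class StructuredThreatInfo:
    attack_purpose: str
    attack_means: List[str]
    unknown_attack_means: List[str] = field(default_factory=list)


def aggregate(purpose, means, unknown_means):
    # Deduplicate observed means while preserving their order.
    observed = list(dict.fromkeys(means))
    # Keep only predicted means that were not already observed (design assumption).
    predicted = [m for m in dict.fromkeys(unknown_means) if m not in observed]
    return StructuredThreatInfo(purpose, observed, predicted)


# Hypothetical inputs, for illustration only.
record = aggregate(
    "keep the virus running on the computer",
    ["modify registry", "modify registry"],
    ["auto-start at boot", "modify registry"],
)
as_json = json.dumps(asdict(record), ensure_ascii=False, indent=2)
```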

In a first embodiment, natural language processing is performed on unstructured network threat information to obtain an attack purpose and an attack means, attack means prediction is performed on the attack purpose through a preset machine learning model to obtain an unknown attack means, and structured network threat information is generated according to the attack purpose, the attack means and the unknown attack means; because the embodiment automatically identifies and extracts the attack purpose and attack means of the attacker in the unstructured network threat information based on natural language processing and a preset machine learning model, the analysis process of the network threat information can be simplified, and the security defense capability can be improved.

Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the network threat information extraction method according to the present invention, and based on the first embodiment shown in fig. 2, the second embodiment of the network threat information extraction method according to the present invention is provided.

In a second embodiment, the step S10 includes:

Step S101: performing text preprocessing on the unstructured network threat information to obtain simplified network threat information.

It should be understood that, in order to improve the processing effect of natural language processing, in this embodiment, text preprocessing may be performed on unstructured network threat information, deep sentence breaking may be performed, semantic dependency analysis may be performed, vocabulary marking may be performed, and synonym expansion may be performed, so as to obtain the attack purpose and attack means of an attacker in the unstructured network threat information.

It can be understood that performing text preprocessing on the unstructured network threat information to obtain simplified network threat information may consist of acquiring random information in the unstructured network threat information and simplifying the random information to obtain the simplified network threat information.

Step S102: performing deep sentence breaking on the simplified network threat information to obtain threat sentences.

It should be understood that, in order to simplify the network threat information and ensure that each sentence to be analyzed independently expresses one TTP (tactics, techniques, and procedures), deep sentence breaking may be performed on the simplified network threat information in this embodiment to obtain threat sentences.

It can be understood that performing deep sentence breaking on the simplified network threat information to obtain threat sentences may consist of acquiring the sentence ending symbols, parallel relation conjunctions and progressive relation conjunctions in the simplified network threat information, and breaking the simplified network threat information at those symbols and conjunctions to obtain the threat sentences.

Step S103: performing semantic dependency analysis on the threat statement to obtain a standard threat statement.

It should be understood that, in order to standardize and unify the complex and changeable description modes, in this embodiment, semantic dependency analysis may also be performed on the threat statement to obtain the dependency relationship between each vocabulary in the threat statement, and standardized processing may be performed on the threat statement according to the dependency relationship to obtain a standard threat statement.

It can be understood that performing semantic dependency analysis on the threat statement to obtain a standard threat statement may consist of performing semantic dependency analysis on the threat statement to obtain the dependency relationship between the words in the threat statement, and standardizing the threat statement according to the dependency relationship to obtain the standard threat statement.

Step S104: performing vocabulary marking on the standard threat statement to obtain a threat corpus.

It should be understood that, in order to converge the threat corpus, in this embodiment, the necessary score of each part in the standard threat sentence may be obtained first, and then the standard threat sentence may be simplified according to the necessary score to obtain the threat corpus.

It can be understood that performing vocabulary marking on the standard threat statement to obtain a threat corpus may consist of obtaining the necessity score of each part of the standard threat statement and simplifying the standard threat statement according to the necessity scores to obtain the threat corpus.
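A simple way to picture score-based simplification is to keep only the parts of a statement whose necessity score reaches a threshold. The sketch below assumes the scores are already available; how they are computed, the threshold value, and the example scores are all hypothetical and not given in the source.

```python
def simplify_statement(parts, threshold=0.5):
    """Keep only the parts whose necessity score meets the threshold.

    parts: list of (text, necessity_score) pairs in sentence order.
    """
    return " ".join(text for text, score in parts if score >= threshold)


# Hypothetical scores, for illustration only.
scored_parts = [
    ("the trojan", 0.9),
    ("quietly", 0.1),
    ("modifies", 1.0),
    ("the registry", 0.9),
    ("in most cases", 0.2),
]
simplified_statement = simplify_statement(scored_parts)
```

Dropping low-score modifiers makes differently worded reports collapse toward the same short statement, which is one way the corpus could "converge" as described above.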

Step S105: performing synonym expansion on the threat corpus to obtain a target corpus.

It should be appreciated that, in this embodiment, synonym expansion may be performed on the high-frequency keywords in the threat corpus in order to increase the recall rate of the subsequent model predictions.

It can be understood that performing synonym expansion on the threat corpus to obtain a target corpus may consist of obtaining the occurrence frequency of each keyword in the threat corpus, determining the threat keywords according to the occurrence frequency, and performing synonym expansion on the threat keywords based on a preset dictionary to obtain the target corpus.
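The frequency-then-dictionary expansion can be sketched as follows. The dictionary contents, the `min_frequency` cutoff, the function name `expand_corpus`, and the whitespace tokenization are assumptions for this illustration; the source states only that high-frequency keywords are expanded using a preset dictionary.

```python
from collections import Counter

# Hypothetical preset dictionary mapping a keyword to its synonyms.
PRESET_DICTIONARY = {
    "release": ["drop", "write"],
    "modify": ["change", "edit"],
}


def expand_corpus(threat_corpus, dictionary, min_frequency=2):
    # Count how often each word occurs across the whole corpus.
    frequency = Counter(word for s in threat_corpus for word in s.split())
    # Threat keywords: frequent words that the dictionary can expand.
    keywords = {
        w for w, c in frequency.items() if c >= min_frequency and w in dictionary
    }
    expanded = list(threat_corpus)
    for sentence in threat_corpus:
        words = sentence.split()
        for i, word in enumerate(words):
            if word in keywords:
                # Add one variant per synonym, replacing only this occurrence.
                for synonym in dictionary[word]:
                    variant = " ".join(words[:i] + [synonym] + words[i + 1:])
                    if variant not in expanded:
                        expanded.append(variant)
    return expanded


threat_corpus = [
    "trojan release dll",
    "program release exe",
    "trojan modify registry",
]
target_corpus = expand_corpus(threat_corpus, PRESET_DICTIONARY)
```

In this example only "release" is both frequent and present in the dictionary, so "modify" is left unexpanded.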

Step S106: determining an attack purpose and an attack means according to the target corpus.

In a second embodiment, text preprocessing is performed on unstructured cyber threat information to obtain simplified cyber threat information, deep sentence breaking is performed on the simplified cyber threat information to obtain threat sentences, semantic dependency analysis is performed on the threat sentences to obtain standard threat sentences, vocabulary marking is performed on the standard threat sentences to obtain a threat corpus, synonym expansion is performed on the threat corpus to obtain a target corpus, and attack purposes and attack means are determined according to the target corpus; the text preprocessing is firstly carried out on unstructured network threat information, then deep sentence breaking is carried out, semantic dependency analysis is carried out, vocabulary marking is carried out, and synonym expansion is carried out, so that the attack purpose and attack means of an attacker in the unstructured network threat information are obtained, and the processing effect of natural language processing can be improved.

In a second embodiment, the step S20 includes:

Step S201: acquiring multi-platform network threat information.

It should be understood that, in order to obtain a multi-dimensional unknown attack means, in this embodiment, multi-platform network threat information may be obtained first, and then attack means prediction is performed on the attack purpose through a preset machine learning model based on the multi-platform network threat information, so as to obtain the unknown attack means.

It should be noted that the multi-platform cyber threat information may be cyber threat information obtained by detecting multiple security platforms.

Step S202: performing attack means prediction on the attack purpose through a preset machine learning model based on the multi-platform network threat information to obtain an unknown attack means.

In the second embodiment, multi-platform network threat information is acquired, and attack means prediction is performed on the attack purpose through a preset machine learning model based on the multi-platform network threat information to obtain an unknown attack means. Because this embodiment predicts the attack means based on multi-platform network threat information, multi-dimensional unknown attack means can be obtained.
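One conceivable reading of this prediction step is a similarity lookup: compare the attack purpose with purposes recorded on several platforms, and report the means associated with similar purposes that were not already extracted. The sketch below swaps in a plain bag-of-words cosine similarity for whatever model the invention actually uses; the record layout, the `min_similarity` threshold, the platform names, and the example data (including "create scheduled task") are all hypothetical.

```python
import math
from collections import Counter


def bow(text):
    # Bag-of-words representation: word -> count.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two word-count mappings.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def predict_unknown_means(purpose, known_means, platform_records, min_similarity=0.5):
    query = bow(purpose)
    candidates = []
    # Collect means recorded for sufficiently similar purposes on any platform.
    for records in platform_records.values():
        for recorded_purpose, recorded_means in records:
            if cosine(query, bow(recorded_purpose)) >= min_similarity:
                candidates.extend(recorded_means)
    # Keep each candidate once, and only if it was not already extracted.
    seen = set(known_means)
    unknown = []
    for means in candidates:
        if means not in seen:
            seen.add(means)
            unknown.append(means)
    return unknown


# Hypothetical multi-platform records: platform -> [(purpose, [means, ...])].
platform_records = {
    "platform_a": [
        ("keep virus running on computer", ["modify registry", "auto-start at boot"]),
    ],
    "platform_b": [
        ("keep virus running on computer", ["create scheduled task"]),
        ("steal user credentials", ["send keylog by email"]),
    ],
}
unknown_means = predict_unknown_means(
    "keep virus running on computer", ["modify registry"], platform_records
)
```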

Referring to fig. 4, fig. 4 is a schematic flow chart of a third embodiment of the network threat information extraction method according to the present invention, and based on the second embodiment shown in fig. 3, the third embodiment of the network threat information extraction method according to the present invention is proposed.

In a third embodiment, the step S101 includes:

Step S1011: acquiring random information in the unstructured network threat information.

It should be understood that, in order to reduce input randomness and increase information processing speed, in this embodiment, random information in unstructured network threat information may be acquired first, and then simplified processing may be performed on the random information to obtain simplified network threat information.

Step S1012: simplifying the random information to obtain simplified network threat information.

In a specific implementation, the unstructured network threat information is, for example, "a Trojan horse program which, after running, releases the normal TP program TPHelper.exe and the malicious TPHelperBase.dll under the %TEMP% directory to form DLL hijacking". After the random information in the unstructured network threat information is simplified, the simplified network threat information is "a Trojan horse program which, after running, releases a normal TP program EXE file and a malicious DLL file under a specific directory to form DLL hijacking".

In the third embodiment, random information in the unstructured network threat information is acquired, and the random information is simplified to obtain simplified network threat information. Because this embodiment first acquires the random information in the unstructured network threat information and then simplifies it to obtain the simplified network threat information, input randomness can be reduced and information processing speed can be improved.
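The simplification in this example could be approximated with pattern-based replacement of "random" details such as concrete file names and environment-variable paths. The rules below are assumptions written to reproduce the example above; the source does not say how random information is detected, and a real rule set would need to cover far more cases.

```python
import re

# Hypothetical replacement rules: (pattern, generic placeholder).
SIMPLIFICATION_RULES = [
    # Environment-variable directories such as "the %TEMP% directory".
    (re.compile(r"the\s+%\w+%\s+directory"), "a specific directory"),
    # Concrete executable and library file names.
    (re.compile(r"\b[\w.-]+\.exe\b", re.IGNORECASE), "EXE file"),
    (re.compile(r"\b[\w.-]+\.dll\b", re.IGNORECASE), "DLL file"),
]


def simplify_random_information(text):
    # Apply each rule in order, replacing specific details with generic terms.
    for pattern, placeholder in SIMPLIFICATION_RULES:
        text = pattern.sub(placeholder, text)
    return text


raw_report = (
    "After running, the Trojan releases the normal TP program TPHelper.exe "
    "and the malicious TPHelperBase.dll under the %TEMP% directory, "
    "forming DLL hijacking."
)
simplified_report = simplify_random_information(raw_report)
```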

In a third embodiment, the step S102 includes:

Step S1021: acquiring sentence ending symbols, parallel relation conjunctions and progressive relation conjunctions in the simplified network threat information.

It should be noted that the sentence ending symbols may include "?", "…", quotation marks and the like; the parallel relation conjunctions may include words meaning "and", "as well as", "with" and the like; and the progressive relation conjunctions may include words meaning "not only", "but also", "let alone" and the like.

Step S1022: and carrying out deep sentence breaking on the simplified network threat information according to the sentence ending symbol, the parallel relation conjunctions and the progressive relation conjunctions to obtain threat sentences.

It can be understood that deep sentence breaking is performed on the simplified network threat information according to the sentence ending symbol, the parallel relation conjunctions and the progressive relation conjunctions, and the threat sentence obtaining may be to obtain the positions of the sentence ending symbol, the parallel relation conjunctions and the progressive relation conjunctions in the simplified network threat information, and deep sentence breaking is performed on the simplified network threat information according to the positions, so as to obtain the threat sentence.
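Position-based splitting of this kind can be sketched with a regular expression that matches either a run of sentence ending symbols or a conjunction. The English symbol and conjunction lists below are stand-ins chosen for readability; the source targets Chinese reports, and the lists, the function name `deep_sentence_break`, and the example text are assumptions.

```python
import re

# Illustrative English stand-ins for the three kinds of break markers.
SENTENCE_ENDINGS = ["?", "!", ".", "…"]
PARALLEL_CONJUNCTIONS = ["and", "as well as"]
PROGRESSIVE_CONJUNCTIONS = ["not only", "but also", "moreover"]


def deep_sentence_break(text):
    endings = "".join(re.escape(symbol) for symbol in SENTENCE_ENDINGS)
    # Try longer conjunctions first so multi-word phrases match as a whole.
    conjunctions = sorted(
        PARALLEL_CONJUNCTIONS + PROGRESSIVE_CONJUNCTIONS, key=len, reverse=True
    )
    conjunction_pattern = "|".join(
        r"\b" + re.escape(c) + r"\b" for c in conjunctions
    )
    # Break at runs of ending symbols, or at a conjunction with optional commas.
    breaker = re.compile(
        rf"[{endings}]+|,?\s*(?:{conjunction_pattern}),?\s+", re.IGNORECASE
    )
    pieces = breaker.split(text)
    # Drop empty pieces and trim stray spaces or commas.
    return [p.strip(" ,") for p in pieces if p.strip(" ,")]


report = (
    "The trojan modifies the registry and releases a DLL file. "
    "Moreover, it sends the keylog by email!"
)
threat_sentences = deep_sentence_break(report)
```

Note that this naive splitter would also break inside file names containing a period, which is one reason it helps to run it on text that has already been simplified.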

In a third embodiment, it is disclosed to obtain sentence ending symbols, parallel relation conjunctions and progressive relation conjunctions in the simplified network threat information, and deep sentence breaking is performed on the simplified network threat information according to the sentence ending symbols, the parallel relation conjunctions and the progressive relation conjunctions to obtain threat sentences; because the embodiment carries out deep sentence breaking on the simplified network threat information to obtain threat sentences, the network threat information can be simplified, and each sentence to be analyzed is ensured to independently express a technical and tactical technology and an implementation process.
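The position-based breaking described above can be sketched in Python as follows. The delimiter lists are English stand-ins chosen for this example (the embodiment does not give a closed list), and `deep_sentence_break` is a hypothetical name.

```python
import re

# Assumed delimiter sets, for illustration only.
SENTENCE_END = [".", "?", "!", ";"]
PARALLEL_CONJ = ["and", "as well as"]
PROGRESSIVE_CONJ = ["not only", "but also", "moreover"]

_end_class = "[" + "".join(re.escape(s) for s in SENTENCE_END) + "]+"
_conj_alt = "|".join(
    re.escape(c)
    for c in sorted(PARALLEL_CONJ + PROGRESSIVE_CONJ, key=len, reverse=True)
)
DELIMITER = re.compile(rf"{_end_class}|\b(?:{_conj_alt})\b", re.IGNORECASE)


def deep_sentence_break(text):
    """Locate every delimiter, then cut the text at those positions."""
    cut_points = [(m.start(), m.end()) for m in DELIMITER.finditer(text)]
    sentences, start = [], 0
    for begin, end in cut_points:
        piece = text[start:begin].strip(" ,")
        if piece:
            sentences.append(piece)
        start = end
    tail = text[start:].strip(" ,")
    if tail:
        sentences.append(tail)
    return sentences


threat_sentences = deep_sentence_break(
    "The Trojan collects domain account names and records keystrokes. "
    "It also sends the key log to an email address."
)
```

Cutting at conjunctions as well as at sentence ending symbols is what makes the breaking "deep": a compound sentence describing two behaviours yields two threat sentences.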

In the third embodiment, step S103 includes:

Step S1031: performing semantic dependency analysis on the threat statement to obtain the dependency relationships among the words in the threat statement.

It should be noted that the dependency relationship may be a dependency relationship between parent and child words.

Step S1032: standardizing the threat statement according to the dependency relationships to obtain a standard threat statement.

It can be understood that, after the threat statement is standardized according to the dependency relationships, the tools, paths and methods, spatial positions, implementation ranges, achieved effects and the like used by the attacker in the statement can be output in a standardized form.

In a specific implementation, for example, differently worded expressions of the same meaning, such as active and passive constructions, may be standardized and unified.

Further, in order to improve the effect of the normalization process, the step S1032 includes:

acquiring part-of-speech information of each word in the threat statement;

For ease of understanding, the description refers to fig. 5, but the present solution is not limited thereto. Fig. 5 is a schematic diagram of semantic dependency analysis, in which the threat statement is "a key log obtained by a Trojan horse is sent to a configurable email address". In the figure, ROOT represents the root node, which is the core node of the whole sentence; mDEPD represents an auxiliary word; FEAT represents a modifier; PAT represents the object of the subject's operation (the object changes); rPAT represents the object of the subject's operation (the object changes) in a passive statement; CONT represents the object of the subject's operation (the object does not change significantly); rCONT represents the object of the subject's operation (the object does not change significantly) in a passive statement; mRELA represents a conjunction or preposition, such as "but"; AGT represents the subject (agent) word; LOC represents a spatial location; and mPUNC represents a punctuation mark.

The third embodiment discloses performing semantic dependency analysis on the threat statement to obtain the dependency relationships among the words in the threat statement, and standardizing the threat statement according to the dependency relationships to obtain a standard threat statement. Because this embodiment standardizes the threat statement on the basis of its semantic dependencies, complex and changeable descriptions of the same behaviour can be standardized and unified.
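One way to picture the standardization step is to read the role labels off an already parsed statement and emit the words in a fixed role order, so that active and passive wordings converge. The Python sketch below assumes hand-written parses using the labels of fig. 5; the `Token` structure, the `ROLE_ORDER` list and the `normalize` function are hypothetical and are not the parser or the rule set of the embodiment.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    head: int   # index of the parent word, -1 for the ROOT node
    label: str  # semantic role label, e.g. AGT, PAT, LOC


# Fixed output order of the roles attached to the predicate
# (an assumption of this sketch).
ROLE_ORDER = ["AGT", "ROOT", "PAT", "CONT", "LOC"]


def normalize(tokens):
    """Emit agent, predicate, object and location in a fixed order."""
    root = next(i for i, t in enumerate(tokens) if t.label == "ROOT")
    slots = {"ROOT": [tokens[root].word]}
    for i, tok in enumerate(tokens):
        if tok.head != root or tok.label not in ROLE_ORDER:
            continue
        # Keep modifiers (FEAT) that precede and depend on this word.
        modifiers = [t.word for t in tokens[:i] if t.head == i and t.label == "FEAT"]
        slots.setdefault(tok.label, []).extend(modifiers + [tok.word])
    return " ".join(w for role in ROLE_ORDER for w in slots.get(role, []))


# "the key log is sent by the Trojan to a configurable email address"
passive = [
    Token("key log", 2, "PAT"), Token("is", 2, "mDEPD"), Token("sent", -1, "ROOT"),
    Token("by", 4, "mRELA"), Token("Trojan", 2, "AGT"), Token("to", 7, "mRELA"),
    Token("configurable", 7, "FEAT"), Token("email address", 2, "LOC"),
]
# "the Trojan sent the key log to a configurable email address"
active = [
    Token("Trojan", 1, "AGT"), Token("sent", -1, "ROOT"), Token("key log", 1, "PAT"),
    Token("to", 5, "mRELA"), Token("configurable", 5, "FEAT"),
    Token("email address", 1, "LOC"),
]
```

Both parses normalize to the same word sequence, which is the kind of unification of description modes the embodiment aims at.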

In the third embodiment, step S104 includes:

Step S1041: obtaining the necessary score of each part of the standard threat statement.

It should be noted that the necessary score measures how necessary each word of the threat statement is to the statement.

Step S1042: simplifying the standard threat statement according to the necessary scores to obtain a threat corpus.

In a specific implementation, for example, the standard threat statement is "Trojan horse sends the obtained keyboard log to a configurable email address". After the standard threat statement is simplified according to the necessary score of each part, the threat corpus entry "send the keyboard log email address" is obtained.

The third embodiment discloses obtaining the necessary score of each part of the standard threat statement and simplifying the standard threat statement according to the necessary scores to obtain a threat corpus. Because the standard threat statement is reduced to its necessary parts, the threat corpus can be made to converge.
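A minimal sketch of score-based simplification is shown below. The embodiment does not say how the necessary scores are computed, so the scores, the threshold of 0.5 and the function name `simplify_by_score` are all invented for illustration.

```python
def simplify_by_score(scored_parts, threshold=0.5):
    """Keep only the parts whose necessary score reaches the threshold."""
    return " ".join(part for part, score in scored_parts if score >= threshold)


# Hypothetical necessary scores for each part of the standard threat statement.
standard_statement = [
    ("Trojan horse", 0.3),
    ("sends", 0.9),
    ("the", 0.1),
    ("obtained", 0.2),
    ("keyboard log", 0.8),
    ("to", 0.1),
    ("a", 0.1),
    ("configurable", 0.2),
    ("email address", 0.7),
]
corpus_entry = simplify_by_score(standard_statement)
```

Raising the threshold keeps fewer parts and makes statements about the same behaviour converge more strongly, at the cost of discarding detail.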

In the third embodiment, step S105 includes:

Step S1051: obtaining the occurrence frequency of each keyword in the threat corpus, and determining threat keywords according to the occurrence frequencies.

It may be appreciated that determining threat keywords according to the occurrence frequencies may consist of ranking the keywords by occurrence frequency and determining the threat keywords from the ranking result.

Step S1052: performing synonym expansion on the threat keywords based on a preset dictionary to obtain a target corpus.

It should be noted that the preset dictionary may be set in advance and may store the synonyms corresponding to each keyword.

In particular implementations, for example, "Trojan gathering domain account names" in the threat corpus may be extended to "Trojan gathering domain user names", "Trojan gathering domain user accounts", and "Trojan harvesting domain user login names", among others.

The third embodiment discloses obtaining the occurrence frequency of each keyword in the threat corpus, determining threat keywords according to the occurrence frequencies, and performing synonym expansion on the threat keywords based on a preset dictionary to obtain a target corpus. Because this embodiment performs synonym expansion on the high-frequency keywords in the threat corpus, the recall of the subsequent model prediction can be improved.
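Steps S1051 and S1052 can be sketched in Python as follows. The corpus, the dictionary contents and the function names are assumptions for this example; in particular, counting whitespace-separated words is a simplification of whatever keyword segmentation the embodiment uses.

```python
from collections import Counter


def top_keywords(corpus, n):
    """Rank words by occurrence frequency and keep the n most frequent."""
    counts = Counter(word for sentence in corpus for word in sentence.split())
    return [word for word, _ in counts.most_common(n)]


def expand_synonyms(sentence, threat_keywords, preset_dictionary):
    """Return the sentence plus one variant per synonym of each threat keyword."""
    words = sentence.split()
    expanded = [sentence]
    for i, word in enumerate(words):
        if word in threat_keywords:
            for synonym in preset_dictionary.get(word, []):
                expanded.append(" ".join(words[:i] + [synonym] + words[i + 1:]))
    return expanded


threat_corpus = [
    "Trojan collects account names",
    "Trojan steals account passwords",
    "Trojan sends account logs",
    "Trojan deletes itself",
]
# Hypothetical preset dictionary mapping a keyword to its synonyms.
preset_dictionary = {"account": ["user", "login"]}

keywords = top_keywords(threat_corpus, 2)
target_corpus = expand_synonyms(threat_corpus[0], keywords, preset_dictionary)
```

Restricting the expansion to high-frequency keywords keeps the target corpus from growing with variants of words that rarely matter.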

Referring to fig. 6, fig. 6 is a flowchart of a fourth embodiment of the network threat information extraction method of the present invention, which is proposed on the basis of the second embodiment shown in fig. 3.

In the fourth embodiment, before step S201, the method further includes:

Step S110: constructing a training set corpus according to the target corpus.

It should be appreciated that, in order to improve the accuracy of the preset machine learning model, in this embodiment an initial machine learning model may first be trained to obtain the preset machine learning model.

It may be appreciated that constructing the training set corpus from the target corpus may consist of randomly selecting training samples from the target corpus.

Further, in order to cluster the training set corpus, step S110 includes:

constructing a training set corpus based on the threat statement samples.

It should be appreciated that synonymous threat statements may be statements having the same semantics.

It can be understood that selecting the threat statement samples from the target corpus according to the statement number may consist of sorting the synonymous threat statements by statement number in descending order and taking a preset number of the top-ranked synonymous threat statements as the threat statement samples.

Further, in order to improve reliability of the threat sentence samples, the selecting threat sentence samples from the target corpus according to the sentence number includes:

receiving a semantic tag input by a user based on the target corpus;

It should be appreciated that, to improve the reliability of the threat statement samples, each threat statement in the target corpus may also be labeled by the user entering a semantic tag.

It may be appreciated that selecting the threat statement samples from the target corpus according to the sorting result and the semantic tag may consist of taking, as the threat statement samples, a preset number of the top-ranked synonymous threat statements whose semantic tag is a preset tag. The preset tag may be set in advance, which is not limited in this embodiment.
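The selection of threat statement samples can be sketched as below. This example reads "statement number" as the number of statements in each group of synonymous threat statements, which is an interpretation rather than something the text states outright; the group identifiers, the tag values and the function name `select_samples` are hypothetical.

```python
def select_samples(groups, semantic_tags, preset_tag, preset_number):
    """Pick statements from the largest groups that carry the preset tag.

    groups: mapping from a group id to its list of synonymous threat statements.
    semantic_tags: mapping from a group id to the tag entered by the user.
    """
    # Sort groups by the number of statements they contain, largest first.
    ranked = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    tagged = [stmts for gid, stmts in ranked if semantic_tags.get(gid) == preset_tag]
    return [s for stmts in tagged[:preset_number] for s in stmts]


groups = {
    "g1": ["collect account names", "collect user names", "collect login names"],
    "g2": ["delete itself"],
    "g3": ["send key log", "send keyboard log"],
}
semantic_tags = {"g1": "valid", "g2": "valid", "g3": "noise"}

samples = select_samples(groups, semantic_tags, preset_tag="valid", preset_number=1)
```

Here the group `g3` is larger than `g2` but is skipped because its user-entered tag is not the preset tag.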

Step S120: training the initial machine learning model according to the training set corpus to obtain a preset machine learning model.

It should be understood that training the initial machine learning model according to the training set corpus may consist of inputting each threat statement sample in the training set corpus into the initial machine learning model and adjusting the model according to its output, so that the trained model is obtained as the preset machine learning model.

The fourth embodiment discloses constructing a training set corpus according to the target corpus and training an initial machine learning model according to the training set corpus to obtain a preset machine learning model. Because the initial machine learning model is trained in advance, the accuracy of the preset machine learning model is improved.
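The embodiment does not name a model type. Purely to illustrate the loop of "input a sample, compare the output, adjust the model", the sketch below trains a multi-class perceptron over bag-of-words features; the perceptron, the attack-means labels and the sample statements are all assumptions of this example.

```python
from collections import defaultdict


def predict(weights, words):
    """Score every label by summing its word weights and return the best one."""
    scores = {
        label: sum(w.get(x, 0.0) for x in words) for label, w in weights.items()
    }
    return max(scores, key=scores.get)


def train(samples, labels, epochs=5):
    """Adjust the weights whenever the model output differs from the label."""
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for text, gold in samples:
            words = text.split()
            guess = predict(weights, words)
            if guess != gold:
                for x in words:
                    weights[gold][x] += 1.0
                    weights[guess][x] -= 1.0
    return weights


# Hypothetical training set corpus: (threat statement sample, attack means).
training_set = [
    ("record keyboard log", "keylogging"),
    ("send log email address", "exfiltration"),
    ("capture keyboard input", "keylogging"),
    ("upload file server", "exfiltration"),
]
model = train(training_set, labels=["keylogging", "exfiltration"])
```

After training, `predict(model, "capture keyboard log".split())` returns a label for a statement that was not in the training set; a production system would presumably use a richer model and far more samples.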

In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium is stored with a network threat information extraction program, and the network threat information extraction program realizes the network threat information extraction method when being executed by a processor.

In addition, referring to fig. 7, an embodiment of the present invention further provides a cyber threat information extraction apparatus, where the cyber threat information extraction apparatus includes: a language processing module 10, a means predicting module 20, and an information generating module 30;

the language processing module 10 is configured to perform natural language processing on unstructured network threat information, so as to obtain an attack purpose and an attack means.

The means predicting module 20 is configured to predict the attack means for the attack purpose by using a preset machine learning model, so as to obtain an unknown attack means.

The information generating module 30 is configured to generate structured network threat information according to the attack purpose, the attack means, and the unknown attack means.

In this embodiment, natural language processing is performed on unstructured network threat information to obtain an attack purpose and an attack means, attack means prediction is performed on the attack purpose through a preset machine learning model to obtain an unknown attack means, and structured network threat information is generated according to the attack purpose, the attack means and the unknown attack means. Because this embodiment automatically identifies and extracts the attacker's attack purpose and attack means from the unstructured network threat information based on natural language processing and a preset machine learning model, the analysis of network threat information can be simplified and the security defense capability can be improved.
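The cooperation of the three modules can be sketched as a small Python class. The class name, the dictionary layout of the structured output and the two stub functions are assumptions for illustration; in particular, treating a predicted means as "unknown" when it was not already extracted from the text is one plausible reading, not a definition given in the description.

```python
class ThreatInfoExtractor:
    """Wires a language processing module to a means prediction module."""

    def __init__(self, language_processing, means_prediction):
        self.language_processing = language_processing  # report -> (purpose, means)
        self.means_prediction = means_prediction        # purpose -> predicted means

    def extract(self, report):
        """Play the role of the information generation module."""
        purpose, means = self.language_processing(report)
        predicted = self.means_prediction(purpose)
        unknown = [m for m in predicted if m not in means]
        return {
            "attack_purpose": purpose,
            "attack_means": means,
            "unknown_attack_means": unknown,
        }


# Stub modules standing in for the real NLP pipeline and trained model.
def stub_language_processing(report):
    return "credential theft", ["keylogging"]


def stub_means_prediction(purpose):
    return ["keylogging", "credential dumping"]


extractor = ThreatInfoExtractor(stub_language_processing, stub_means_prediction)
structured = extractor.extract("Trojan sends the key log to an email address")
```

Passing the two modules in as callables keeps the sketch independent of any particular parser or model.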

Other embodiments or specific implementation manners of the network threat information extraction apparatus of the present invention may refer to the above method embodiments, and are not described herein.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not indicate the relative merits of the embodiments.

From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by means of hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. read-only memory (ROM)/random access memory (RAM), magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods according to the embodiments of the present invention.

The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

The invention discloses A1, a network threat information extraction method, which comprises the following steps:

A2, the network threat information extraction method as described in A1, wherein the step of performing natural language processing on unstructured network threat information to obtain an attack purpose and an attack means comprises the following steps:

carrying out synonym expansion on the threat corpus to obtain a target corpus;

A3, the network threat information extraction method as described in A2, wherein the step of performing semantic dependency analysis on the threat statement to obtain the dependency relationship between each vocabulary in the threat statement comprises the following steps:

A4, the network threat information extraction method according to A3, wherein the step of carrying out standardized processing on the threat statement according to the dependency relationship to obtain a standard threat statement comprises the following steps:

acquiring part-of-speech information of each word in the threat statement;

A5, the network threat information extraction method as described in A2, wherein the step of performing synonym expansion on the threat corpus to obtain a target corpus comprises the following steps:

A6, the network threat information extraction method as described in A2, wherein the step of performing deep sentence breaking on the simplified network threat information to obtain threat sentences comprises the following steps:

A7, the network threat information extraction method as described in A2, wherein the step of performing vocabulary marking on the standard threat sentences to obtain a threat corpus comprises the following steps:

obtaining necessary scores of all parts in the standard threat statement;

A8, the network threat information extraction method as described in A2, wherein the step of performing text preprocessing on unstructured network threat information to obtain simplified network threat information further comprises:

acquiring random information in unstructured network threat information;

A9, the network threat information extraction method according to A2, wherein before the step of predicting the attack means for the attack purpose by a preset machine learning model and obtaining the unknown attack means, further comprises:

constructing a training set corpus according to the target corpus;

A10, the network threat information extraction method as described in A9, wherein the step of constructing a training set corpus according to the target corpus comprises the following steps:

and constructing a training set corpus based on the threat statement samples.

A11, the network threat information extraction method according to A10, wherein the step of selecting threat sentence samples from the target corpus according to the sentence number comprises the following steps:

receiving a semantic tag input by a user based on the target corpus;

A12, the network threat information extraction method according to any one of A1 to A11, wherein the attack means prediction is performed on the attack purpose through a preset machine learning model, and the step of obtaining an unknown attack means comprises the following steps:

acquiring multi-platform network threat information;

The invention also discloses B13, a network threat information extraction device, the network threat information extraction device includes: the system comprises a memory, a processor and a network threat information extraction program stored on the memory and capable of running on the processor, wherein the network threat information extraction program is executed by the processor to realize the network threat information extraction method.

The invention also discloses C14, a storage medium, the storage medium stores a network threat information extraction program, and the network threat information extraction program realizes the network threat information extraction method when being executed by a processor.

The invention also discloses D15, a network threat information extraction device, wherein the network threat information extraction device comprises: a language processing module, a means prediction module and an information generation module;

D16, the cyber threat information extraction device according to D15, wherein the language processing module is further configured to perform text preprocessing on unstructured cyber threat information to obtain simplified cyber threat information;

D17, the cyber threat information extraction device as described in D16, where the language processing module is further configured to perform semantic dependency analysis on the threat statement, so as to obtain a dependency relationship between each vocabulary in the threat statement;

The network threat information extraction device as described in the D17, wherein the language processing module is further configured to obtain part-of-speech information of each vocabulary in the threat sentence;

The network threat information extraction device as described in the D19, the language processing module is further configured to obtain occurrence frequencies of keywords in the threat corpus, and determine threat keywords according to the occurrence frequencies;

D20, the cyber threat information extraction apparatus as described in D16, wherein the language processing module is further configured to obtain a sentence ending symbol, a parallel relation conjunctive, and a progressive relation conjunctive in the simplified cyber threat information;

Claims

1. A network threat information extraction method, characterized by comprising the following steps:

2. The cyber threat information extraction method of claim 1, wherein the step of performing natural language processing on unstructured cyber threat information to obtain an attack purpose and an attack means comprises:

carrying out synonym expansion on the threat corpus to obtain a target corpus;

3. The cyber threat information extraction method of claim 2, wherein the step of performing semantic dependency analysis on the threat statement to obtain a dependency relationship between words in the threat statement includes:

4. The cyber threat information extraction method of claim 3, wherein the step of normalizing the threat statement according to the dependency relationship to obtain a standard threat statement comprises:

acquiring part-of-speech information of each word in the threat statement;

5. The cyber threat information extraction method of claim 2, wherein the step of performing synonym expansion on the threat corpus to obtain a target corpus comprises:

6. The cyber threat information extraction method of claim 2, wherein the step of performing deep sentence breaking on the simplified cyber threat information to obtain a threat sentence includes:

7. The cyber threat information extraction method of claim 2, wherein the step of lexically tagging the standard threat sentences to obtain a threat corpus comprises:

obtaining necessary scores of all parts in the standard threat statement;

8. A cyber threat information extraction apparatus, the cyber threat information extraction apparatus comprising: a memory, a processor, and a cyber threat information extraction program stored on the memory and executable on the processor, which when executed by the processor, implements the cyber threat information extraction method of any of claims 1 to 7.

9. A storage medium having stored thereon a cyber threat information extraction program which when executed by a processor implements the cyber threat information extraction method of any of claims 1 to 7.

10. A cyber threat information extraction apparatus, comprising: a language processing module, a means prediction module and an information generation module;