CN107391684B - Method and system for generating threat information

Info

Publication number: CN107391684B
Application number: CN201710606939.6A
Authority: CN
Inventors: 梁玉; 赵振洋; 古亮; 蒋振超
Original assignee: Sangfor Technologies Co Ltd
Current assignee: Shenzhen Shenxinfu Information Security Co ltd
Priority date: 2017-07-24
Filing date: 2017-07-24
Publication date: 2020-12-11
Anticipated expiration: 2037-07-24
Also published as: CN107391684A

Abstract

The application discloses a method for generating threat intelligence, comprising: acquiring a source article according to a preset period, and preprocessing the source article to generate a text source article; performing topic analysis on the text source article, and screening out the text source article related to a security topic; screening out security sentences related to IoC from the text source article related to the security topic, and judging whether the security sentences are IoC entries; and, if the security sentence is an IoC entry, standardizing the security sentence to generate standard IoC data. The method automatically extracts and generates structured threat intelligence, removes the manual analysis and collation steps, makes the collation of threat intelligence more orderly and faster, and is of value for security research and security improvement. The application also discloses a system for generating threat intelligence, which has the same beneficial effects.

Description

Method and system for generating threat information

Technical Field

The invention relates to the field of network security, in particular to a method and a system for generating threat intelligence.

Background

With the rapid development of science and technology, ensuring network security has become a precondition for the development of internet technology. Many security researchers and security vendors share technical details with their peers by publishing technical research articles, security reports and the like through internet media. Some of these publications contain information of very high value for security technology, which can help practitioners make significant advances in network security.

Because technical research articles and security reports usually exist as unstructured data, extracting the valuable information in them currently depends mainly on experts with a security background, who read and analyze the articles and then manually refine and summarize threat intelligence that security devices or software can process. Automated analysis methods and tools that extract threat intelligence from this open internet data more comprehensively and quickly are lacking.

Therefore, how to automatically extract and generate structured threat intelligence is a technical problem that needs to be solved by those skilled in the art at present.

Disclosure of Invention

The application aims to provide a method and a system for generating threat intelligence, which can automatically extract and generate structured threat intelligence.

In order to solve the above technical problem, the present application provides a method and a system for generating threat intelligence, where the method includes:

acquiring a source article according to a preset period, and preprocessing the source article to generate a text source article;

performing topic analysis on the text source article, and screening out the text source article related to a security topic;

screening out security sentences related to IoC from the text source article related to the security topic, and judging whether the security sentences are IoC entries;

and if the security sentence is an IoC entry, standardizing the security sentence to generate standard IoC data.

Optionally, the obtaining the source article according to the preset period, and preprocessing the source article to generate the text source article includes:

acquiring the source article according to a preset period by using a web crawler;

and preprocessing the source article through a natural language processing technology and an image processing technology to generate the text source article.

Optionally, performing topic analysis on the text source article, and screening out the text source article related to a security topic includes:

performing topic analysis on the text source article through a document topic model and algorithm in the natural language processing technology, and screening out the text source article related to a security topic.

Optionally, the screening out security sentences related to IoC from the text source article related to the security topic, and judging whether the security sentences are IoC entries includes:

screening out the security sentences related to IoC from the text source article related to the security topic, and performing lexical analysis and term recognition on the security sentences to obtain recognition results;

constructing a dependency graph according to the recognition results, searching for a shortest path on the dependency graph, and generating a dependency relationship;

judging whether the dependency relationship is an IoC relationship through a machine-learning-based algorithm;

if the dependency relationship is the IoC relationship, defining the security sentence as an IoC entry.

Optionally, generating the standard IoC data includes:

generating the IoC data of the OpenIoC standard.

The present application further provides a system for threat intelligence generation, the system comprising:

a preprocessing module, configured to acquire a source article according to a preset period and preprocess the source article to generate a text source article;

an article screening module, configured to perform topic analysis on the text source article and screen out the text source article related to a security topic;

a sentence screening module, configured to screen out security sentences related to IoC from the text source article related to the security topic, and judge whether the security sentences are IoC entries;

and a standardization module, configured to standardize the security sentence to generate standard IoC data when the security sentence is an IoC entry.

Optionally, the preprocessing module includes:

the acquisition unit is used for acquiring the source article according to a preset period by using a web crawler;

and the text conversion unit is used for preprocessing the source article through a natural language processing technology and an image processing technology to generate the text source article.

Optionally, the article screening module is specifically a module that performs topic analysis on the text source article through a document topic model and algorithm in the natural language processing technology to screen out the text source article related to a security topic.

Optionally, the sentence screening module includes:

a recognition unit, configured to screen out the security sentences related to IoC from the text source article related to the security topic, and perform lexical analysis and term recognition on the security sentences to obtain recognition results;

a dependency construction unit, configured to construct a dependency graph according to the recognition results, search for a shortest path on the dependency graph, and generate a dependency relationship;

a dependency judgment unit, configured to judge whether the dependency relationship is an IoC relationship through a machine-learning-based algorithm;

a defining unit, configured to define the security sentence as an IoC entry when the dependency relationship is the IoC relationship.

Optionally, the standardization module includes:

a standard generating unit, configured to generate the IoC data of the OpenIoC standard.

The invention provides a method for generating threat intelligence, which comprises: acquiring a source article according to a preset period, and preprocessing the source article to generate a text source article; performing topic analysis on the text source article, and screening out the text source article related to a security topic; screening out security sentences related to IoC from the text source article related to the security topic, and judging whether the security sentences are IoC entries; and, if the security sentence is an IoC entry, standardizing the security sentence to generate standard IoC data.

The method obtains source articles that exist in unstructured form on the open Internet. Because part of the content of a source article may exist as pictures and tables, all of its content is converted into text form, namely a text source article, to make the content easier to recognize. Topic screening is then performed on the text source articles to select those related to security. Because not all content in a security-related text source article is itself related to security, the sentences of the selected articles are screened to pick out the sentences related to IoC. Not every IoC-related sentence satisfies an IoC relationship, and only sentences with an IoC relationship can be turned into standard IoC data, so the IoC-related sentences are screened again to select the IoC entries that have an IoC relationship. After the IoC entries are obtained, IoC data are generated according to a given standard. The method automatically extracts and generates structured threat intelligence, removes the manual analysis and collation steps, makes the collation of threat intelligence more orderly and faster, and is of value for security research and security improvement. The application also provides a system for generating threat intelligence, which has the same beneficial effects; these are not repeated here.

Drawings

In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.

Fig. 1 is a flowchart of a method for threat intelligence generation according to an embodiment of the present application;

Fig. 2 is a flowchart of another method for threat intelligence generation according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a system for threat intelligence generation according to the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to Fig. 1, Fig. 1 is a flowchart of a method for generating threat intelligence according to an embodiment of the present application.

The specific steps may include:

Step S101: acquiring a source article according to a preset period, and preprocessing the source article to generate a text source article;

This step is executed by the system, and its purpose is the initial acquisition of source articles to be processed from the open network environment. The source articles may come from selected security vendor technical blogs, blogs of security practitioners, technical forums, the microblog or Twitter accounts of specific users, and the like. The range of sources is not limited here and may be any place where technical articles in the relevant field are regularly published.

After the range of sources is selected, the period for acquiring source articles may be set according to the average update frequency and quality of the source articles; the period is not specifically limited here. It may also happen that a technical forum stops publishing new articles for some reason, in which case continuing to fetch from that forum costs time and effort without any return. Therefore, in addition to acquiring source articles periodically within a preset network range, the update frequency and article quality of each source website can be evaluated periodically, and websites that update slowly or publish low-quality articles can be eliminated, which improves the quality of the acquired source articles. The update frequency and quality of all source websites can also be uploaded to the cloud for data sharing, improving the overall efficiency and quality of the whole threat intelligence generation environment.
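The relationship between update frequency, quality and the acquisition period can be sketched as follows. This is a minimal illustration only: the `Source` record, the quality score in [0, 1], the inverse-proportional formula and all thresholds are assumptions introduced here, not values given by the application.

```python
from dataclasses import dataclass


@dataclass
class Source:
    url: str
    avg_updates_per_week: float  # observed average update frequency
    quality: float               # hypothetical quality score in [0, 1]


def polling_period_hours(src, base_hours=24.0, min_hours=1.0, max_hours=168.0):
    """Poll more often when a source updates frequently and scores well.

    The formula is an assumed heuristic: the period shrinks in proportion
    to (update frequency * quality) and is clamped to [min_hours, max_hours].
    """
    if src.avg_updates_per_week <= 0 or src.quality <= 0:
        return max_hours
    period = base_hours / (src.avg_updates_per_week * src.quality)
    return max(min_hours, min(max_hours, period))


def prune_sources(sources, min_updates_per_week=0.25, min_quality=0.3):
    """Drop sources that update too slowly or whose quality is too low."""
    return [
        s for s in sources
        if s.avg_updates_per_week >= min_updates_per_week
        and s.quality >= min_quality
    ]
```

A periodic re-evaluation job could call `prune_sources` on the current list and then recompute `polling_period_hours` for each remaining source.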

It can be understood that acquiring unstructured source articles is only a first step, still far from security topic analysis and IoC entry judgment. Because the source articles come from a wide range of sources, they may arrive in various formats and may contain pictures, tables and the like, on which topic analysis is difficult and unreliable. The information in the acquired source articles therefore needs to be converted into a uniform format. In principle the source articles could be converted into any of several formats, such as text, codes or voiceprints; this embodiment converts them into text.

Many methods exist for converting a non-text source article into a text source article, and a person skilled in the art can select or create a conversion method according to the actual form of the source article. It may happen that a picture in the source article has such low resolution that the conversion software cannot recognize its content, so recognition fails. In that case the unrecognized picture can be sent to an expert with a security background for manual analysis and collation. If the expert cannot recognize the content either, the picture is discarded, and the remaining content of the article is still recognized by the software.

It will be appreciated that converting non-text content in the source article into text is only part of the preprocessing, which also includes other processing steps such as numbering the source articles, classifying them by source, and converting the language. In summary, the text source article generated in this step is a text source article in a language that the method and system support.
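As a rough illustration of the preprocessing described above, the following sketch flattens an HTML source article, including its table cells, into plain text and attaches a number and a source classification. It uses only the Python standard library. The function name `to_text_source_article` and the record layout are assumptions made for this example, and picture recognition (OCR) is deliberately left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script/style elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Table cells arrive here as ordinary text nodes, so tables are
        # flattened into the running text.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def to_text_source_article(article_id, source_kind, html):
    """Return a hypothetical 'text source article' record."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": article_id,            # numbering of the source article
        "source_kind": source_kind,  # classification by source
        "text": " ".join(parser.parts),
    }
```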

Step S102: performing topic analysis on the text source article, and screening out the text source article related to a security topic;

The criterion for acquiring source articles in step S101 is whether the predetermined source websites have published new articles. Since these websites are mostly security-related, the articles they publish are mostly security-related as well. However, such websites do not publish security-related articles exclusively and also carry articles unrelated to security, so it cannot be assumed that every source article acquired in step S101 relates to a security topic. Step S101 can be understood as the stage of obtaining raw material, which must be screened several times in the subsequent steps before the final product is obtained.

The purpose of this step is to select the text source articles related to a security topic from those obtained in step S101 and to eliminate those unrelated to security. Many screening methods are possible, for example keyword frequency or the semantics expressed by the article, and those skilled in the art can select or design a method according to the actual situation. A commonly used approach is screening through document topic models and algorithms from natural language processing.
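The keyword-frequency variant mentioned above can be sketched in a few lines. The term list and the threshold below are purely illustrative assumptions; a topic model would replace `security_score` in a fuller implementation.

```python
import re

# Hypothetical seed vocabulary; a real deployment would curate or learn it.
SECURITY_TERMS = {
    "malware", "exploit", "vulnerability", "ransomware", "phishing",
    "backdoor", "botnet", "trojan", "payload", "c2",
}


def security_score(text):
    """Fraction of tokens in the text that are security terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in SECURITY_TERMS)
    return hits / len(tokens)


def screen_articles(texts, threshold=0.05):
    """Keep only the texts whose score reaches the (assumed) threshold."""
    return [t for t in texts if security_score(t) >= threshold]
```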

It may also happen that one article is published by several websites, so that several copies are obtained in step S101. Their titles may differ slightly, but their contents are substantially the same. A step may therefore be added after step S101 in which text source articles with very similar contents are merged into a single article before the screening in step S102. An article published by many websites may be especially important and may help considerably in improving security capability, so it can be highlighted; after the IoC data are collated, readers of the IoC data are then prompted to read that text source article with priority.
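One possible way to merge near-identical articles and flag widely republished ones is Jaccard similarity over word shingles, sketched below. The application does not name a similarity measure; the shingle size, the 0.8 threshold and the `highlight` flag are assumptions made for this example.

```python
def shingles(text, k=3):
    """Set of k-word shingles of the text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def merge_near_duplicates(articles, threshold=0.8):
    """Group (url, text) pairs whose texts are nearly identical.

    Each group keeps the first text seen and every URL that carried it.
    Groups published by more than one site are marked for highlighting.
    """
    groups = []
    for url, text in articles:
        sh = shingles(text)
        for group in groups:
            if jaccard(sh, group["shingles"]) >= threshold:
                group["urls"].append(url)
                break
        else:
            groups.append({"text": text, "shingles": sh, "urls": [url]})
    for group in groups:
        group["highlight"] = len(group["urls"]) > 1
    return groups
```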

Step S103: screening out security sentences related to IoC from the text source article related to the security topic, and judging whether the security sentences are IoC entries;

For the same reason that, in step S102, not every source article acquired in step S101 can be assumed to relate to a security topic, not all content of a security-topic text source article relates to security. The content of the security-topic text source articles therefore needs to be screened again.

Generally, the most basic information unit in an article is the sentence, and a sentence related to security is considered to contain security information. In this step, therefore, IoC-related content is selected sentence by sentence. IoC stands for Indicator of Compromise. In the field of information security, an IoC is a specific form of threat intelligence that can be used for sharing and exchanging threat intelligence. Specifically, an IoC is a feature or phenomenon that indicates a system has been compromised by an attacker, infected with a virus, or the like. In this embodiment, the security-related sentences are thus the IoC-related sentences.

The method for screening IoC-related security sentences from the security-topic text source article is basically the same as the method used in step S102 to screen text source articles. The specific screening method is not limited here, as long as the sentences related to the security topic can be selected from the whole article.
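A sentence-by-sentence screen can be sketched with regular expressions that look for typical indicator values. The pattern set below (IPv4 address, MD5, SHA-256) and the naive sentence splitter are assumptions chosen for brevity; they only produce candidate security sentences, which the later steps still have to confirm as IoC entries.

```python
import re

# Illustrative patterns only; real indicator types are far more varied.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_ioc_sentences(text):
    """Return (sentence, {kind: [values]}) for sentences with IoC-like tokens."""
    results = []
    for sentence in split_sentences(text):
        found = {}
        for kind, pattern in IOC_PATTERNS.items():
            values = pattern.findall(sentence)
            if values:
                found[kind] = values
        if found:
            results.append((sentence, found))
    return results
```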

Step S104: if the security sentence is an IoC entry, standardizing the security sentence to generate standard IoC data;

Not all of the security sentences screened out in step S103 can be turned into IoC data according to a given standard. A security sentence may merely contain certain keywords while its meaning or function is unrelated to security. Only sentences that satisfy an IoC relationship can generate standardized IoC data, so the security sentences still need to be filtered.

In this step, the security sentences are not screened by keywords, as the source articles or text source articles may be; instead, the meaning that the sentence expresses must be judged. The screening may involve several steps, such as lexical analysis, IoC-related term recognition, IoC relationship generation and verification, and generation of the verified IoC term relationships. The method of screening out IoC entries is not limited here, as long as IoC entries, that is, security sentences having an IoC relationship, can be screened out.

IoC data can follow any of several standards: the entries may be converted into IoC data of the OpenIoC standard or of the STIX standard, and the choice is not specifically limited here.

Referring now to Fig. 2, Fig. 2 is a flowchart of another method for threat intelligence generation according to an embodiment of the present application. On the basis of the above embodiment, this embodiment further specifies the steps of screening the sentences related to the security topic, identifying IoC entries, and so on. The other steps are substantially the same as in the other embodiments; for the parts in common, refer to the related parts of the other embodiments, which are not repeated here.

The specific steps may include:

Step S201: acquiring the source article according to a preset period by using a web crawler;

A web crawler is a program or script that automatically captures information on a network according to certain rules. There are many types of web crawler; any of them may be used in this step as long as it can obtain the source articles, and the specific form of the web crawler is not limited.

It can be understood that the period in this step is set manually. The period can be shortened when the relevant threat intelligence is urgently needed and lengthened when the need is low; that is, the period for acquiring source articles is not fixed and can be set according to the actual situation.

Step S202: preprocessing the source article through natural language processing technology and image processing technology to generate the text source article;

Natural language processing is an important branch of computer science and artificial intelligence that studies theories and methods for effective communication between people and computers in natural language. Natural language processing makes the repeated extraction and screening in the later steps convenient. Image processing is the analysis of images by computer to achieve a required result; in this step, image processing is used to convert the pictures and tables in a source article into text, so that the result is a text source article in a language that the method and system support.

Step S203: performing topic analysis on the text source article through a document topic model and algorithm in the natural language processing technology, and screening out the text source article related to a security topic;

There are many methods for screening the text source articles related to a security topic, such as LDA (Latent Dirichlet Allocation), and the screening method is not specifically limited here.

Step S204: screening out the security sentences related to IoC from the text source article related to the security topic, and performing lexical analysis and term recognition on the security sentences to obtain recognition results;

There are many methods for screening out IoC-related sentences, and the method is not limited here as long as it can select sentences with IoC-related content. The selected sentences may be IoC-related sentences or, more broadly, security-related sentences.

Step S205: constructing a dependency graph according to the recognition results, searching for a shortest path on the dependency graph, and generating a dependency relationship;

The words of a security sentence stand in certain grammatical relationships to one another. A dependency graph representing these relationships is constructed by a suitable algorithm, and the shortest path on it is found, so that the grammatical relationships in the sentence are converted into a dependency relationship on the dependency graph.
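The shortest-path search on a dependency graph can be illustrated with a breadth-first search over an undirected graph of labeled edges. In this sketch the dependency edges are written by hand for one example sentence; in practice they would come from a dependency parser, which is outside the scope of the example. The edge labels and function names are assumptions.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency list from (head, dependent, label) triples."""
    graph = {}
    for head, dep, label in edges:
        graph.setdefault(head, []).append((dep, label))
        graph.setdefault(dep, []).append((head, label))
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns [(from, label, to), ...] or None."""
    if start not in graph or goal not in graph:
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt, label in graph[node]:
            if nxt not in parent:
                parent[nxt] = (node, label)
                queue.append(nxt)
    if goal not in parent:
        return None
    path = []
    node = goal
    while parent[node] is not None:
        prev, label = parent[node]
        path.append((prev, label, node))
        node = prev
    path.reverse()
    return path


# Hand-written parse of "The malware connects to the IP address 203.0.113.7".
EXAMPLE_EDGES = [
    ("connects", "malware", "nsubj"),
    ("connects", "address", "obl"),
    ("address", "IP", "compound"),
    ("address", "203.0.113.7", "appos"),
    ("malware", "The", "det"),
]
```

Calling `shortest_path(build_graph(EXAMPLE_EDGES), "malware", "203.0.113.7")` yields the chain of labeled edges linking the context term to the indicator value, which can then serve as the dependency relationship examined in the next step.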

Step S206: judging whether the dependency relationship is an IoC relationship through a machine-learning-based algorithm;

A standard IoC entry exhibits a certain IoC relationship. If the constructed dependency relationship conforms to the IoC relationship, the security sentence is judged to be convertible into standard IoC data; if it does not conform, the security sentence is discarded.
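The application does not specify which machine learning algorithm performs this judgment. As a stand-in, the sketch below trains a simple perceptron on the edge labels of dependency paths. The feature design, the toy training samples and their labels are all assumptions made for illustration, not data or methods from the application.

```python
def features(path_labels):
    """Bag-of-labels features for one dependency path, plus a bias term."""
    feats = {"bias": 1.0}
    for label in path_labels:
        feats["lab=" + label] = 1.0
    return feats


def score(weights, feats):
    """Dot product of the weight vector and the feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def train_perceptron(samples, epochs=10):
    """samples: list of (path_labels, y) with y = +1 (IoC relation) or -1."""
    weights = {}
    for _ in range(epochs):
        for path_labels, y in samples:
            feats = features(path_labels)
            if y * score(weights, feats) <= 0:  # misclassified: update
                for name, value in feats.items():
                    weights[name] = weights.get(name, 0.0) + y * value
    return weights


def is_ioc_relation(weights, path_labels):
    """True if the path is classified as an IoC relationship."""
    return score(weights, features(path_labels)) > 0


# Toy, hand-labeled paths (hypothetical).
TOY_SAMPLES = [
    (["nsubj", "obl", "appos"], +1),
    (["nsubj", "advcl", "conj"], -1),
    (["obl", "appos"], +1),
    (["advcl", "conj", "nmod"], -1),
]
```

A sentence whose path is accepted by `is_ioc_relation` would be kept as an IoC entry, and one that is rejected would be discarded, mirroring the rule stated above.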

Step S207: if the dependency relationship is the IoC relationship, defining the security sentence as an IoC entry.

Step S208: generating the IoC data of the OpenIoC standard.
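The standardization step can be illustrated by serializing IoC entries into an OpenIoC-style XML document. The element names, attribute names and search terms below are assumptions modeled on the publicly described OpenIoC 1.0 layout and have not been validated against the schema; the namespace declaration is omitted for brevity, and `TERM_MAP` is a hypothetical mapping.

```python
import uuid
import xml.etree.ElementTree as ET

# Assumed mapping: IoC kind -> (context document, search term, content type).
TERM_MAP = {
    "md5": ("FileItem", "FileItem/Md5sum", "md5"),
    "ipv4": ("PortItem", "PortItem/remoteIP", "IP"),
}


def to_openioc_style(description, entries):
    """Serialize (kind, value) pairs as one OR-combined indicator."""
    root = ET.Element("ioc", {"id": str(uuid.uuid4())})
    ET.SubElement(root, "short_description").text = description
    definition = ET.SubElement(root, "definition")
    indicator = ET.SubElement(
        definition, "Indicator", {"operator": "OR", "id": str(uuid.uuid4())}
    )
    for kind, value in entries:
        document, search, content_type = TERM_MAP[kind]
        item = ET.SubElement(
            indicator, "IndicatorItem",
            {"id": str(uuid.uuid4()), "condition": "is"},
        )
        ET.SubElement(
            item, "Context",
            {"document": document, "search": search, "type": "mir"},
        )
        ET.SubElement(item, "Content", {"type": content_type}).text = value
    return ET.tostring(root, encoding="unicode")
```

Output in another standard, such as STIX, would replace only this serialization function and leave the earlier steps unchanged.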

Since the embodiment of the system part corresponds to the embodiment of the method part, the embodiment of the system part is described with reference to the embodiment of the method part, and is not repeated here.

Referring to Fig. 3, Fig. 3 is a schematic structural diagram of a system for threat intelligence generation according to the present application.

The system may include:

a preprocessing module 100, configured to acquire a source article according to a preset period and preprocess the source article to generate a text source article;

an article screening module 200, configured to perform topic analysis on the text source article and screen out the text source article related to a security topic;

a sentence screening module 300, configured to screen out security sentences related to IoC from the text source article related to the security topic, and determine whether the security sentences are IoC entries;

a standardization module 400, configured to standardize the security sentence to generate standard IoC data when the security sentence is an IoC entry.
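The way the four modules hand data to one another can be sketched as a small pipeline function. Each module is passed in as a plain callable, so the sketch says nothing about how any module is implemented; the function and parameter names are assumptions chosen to mirror modules 100 to 400.

```python
def run_pipeline(raw_articles, preprocess, is_security_article,
                 extract_ioc_entries, standardize):
    """Chain the four modules over a batch of raw source articles.

    preprocess          -> role of module 100 (raw article to text)
    is_security_article -> role of module 200 (topic screening)
    extract_ioc_entries -> role of module 300 (sentence screening)
    standardize         -> role of module 400 (standard IoC data)
    """
    results = []
    for raw in raw_articles:
        text = preprocess(raw)
        if not is_security_article(text):
            continue  # article unrelated to a security topic
        for entry in extract_ioc_entries(text):
            results.append(standardize(entry))
    return results
```

Running the function once per preset period over the newly acquired source articles would give the periodic behavior described for the preprocessing module.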

In another embodiment of the system for generating threat intelligence provided by the present application, the preprocessing module 100 includes the acquisition unit and the text conversion unit described above.

Further, the article screening module 200 is specifically a module that performs topic analysis on the text source article through a document topic model and algorithm in the natural language processing technology to screen out the text source article related to a security topic.

Further, the sentence screening module 300 includes the recognition unit, the dependency construction unit, the dependency judgment unit and the defining unit described above.

Further, the standardization module 400 includes the standard generating unit described above.

The method and system for generating threat intelligence provided by the present application are described in detail above. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A method of threat intelligence generation, the method comprising:

acquiring a source article according to a preset period, and preprocessing the source article to generate a text source article; the preset period is set according to the average updating frequency and the quality of the source article;

screening out security sentences related to IoC from the text source article related to the security topic, and performing lexical analysis and term recognition on the security sentences to obtain recognition results;

if the dependency relationship is the IoC relationship, defining the security sentence as an IoC entry;

2. The method of claim 1, wherein the obtaining of the source article according to the preset period and the preprocessing of the source article to generate the text source article comprise:

3. The method of claim 2, wherein performing topic analysis on the text source article and screening out the text source article related to a security topic comprises:

4. The method of claim 1, wherein generating the standard IoC data comprises:

generating the IoC data of the OpenIoC standard.

5. A system for threat intelligence generation, the system comprising:

a preprocessing module, configured to acquire a source article according to a preset period and preprocess the source article to generate a text source article; the preset period is set according to the average update frequency and the quality of the source article;

a standardization module, configured to standardize the security sentence to generate standard IoC data when the security sentence is an IoC entry;

wherein the sentence screening module comprises:

6. The system of claim 5, wherein the preprocessing module comprises:

7. The system of claim 6, wherein the article screening module is specifically a module for screening the text source articles related to the security topic by performing topic analysis on the text source articles through a document topic correlation model and algorithm in the natural language processing technology.

8. The system of claim 5, wherein the normalization module comprises: