
CN113961969A - Security threat collaborative modeling method and system

Info

Publication number: CN113961969A
Application number: CN202111575617.2A
Authority: CN
Inventors: 胡文友; 曲武; 胡永亮
Original assignee: Jinjing Yunhua Shenyang Technology Co ltd; Beijing Jinjingyunhua Technology Co ltd
Current assignee: Jinjing Yunhua Shenyang Technology Co ltd; Beijing Jinjingyunhua Technology Co ltd
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2022-01-21
Anticipated expiration: 2041-12-22
Also published as: CN113961969B

Abstract

The invention belongs to the technical field of security threat identification and relates in particular to a security threat collaborative modeling method and system. The method comprises the following steps: data sharing; data fusion; data feature extraction; modeling; and auditing. The invention quantifies and settles data-sharing behavior in the form of contribution scores, which encourages data transactions and data sharing. Entity IDs are exchanged in desensitized form, so joint modeling and threat IOC matching remain possible while data security is ensured. The method supports both ordinary local machine learning and federated learning: each participant can choose its degree of data openness as needed, obtain external data resources within its own budget, complete AI modeling independently or cooperatively, and strengthen its security intelligence.

Description

Security threat collaborative modeling method and system

Technical Field

The invention belongs to the technical field of security threat identification, and particularly relates to a security threat collaborative modeling method and system.

Background

AI technology plays an increasingly important role in network security threat detection. The behavior traces that network entities such as domain names, IP addresses, URLs and malicious programs leave in cyberspace are analyzed and mined by AI programs to form rich behavior features, which provide data support for supervised assessment and unsupervised exploration. Judging maliciousness from behavior features is an important task in this field. When a security protection system discovers an unknown network entity, it must be able to assess whether the entity is malicious before making further decisions. The entity types that need such assessment include domain names, IP addresses, URLs, malicious programs and the like. The more comprehensively the extracted features reflect an entity's behavior, the more likely they are to reveal its intent. Supervised learning tasks additionally require high-quality training labels. However, both extracting behavior features and collecting training labels are difficult, and one important reason is that behavior traces tend to be scattered. This scattering appears at the following levels:

(1) Organizational boundaries. Behavior data and label data are held by different organizations or departments and are not shared among them. This stems partly from the organizations' security and privacy requirements and partly from the lack of a common interest to drive sharing.

(2) Device gaps. Different security devices within an enterprise each observe network entities from their own perspective, producing observation records like those of the blind men and the elephant, and these records are not effectively integrated.

(3) Entity fragmentation. The infrastructure of an attack group comprises hosts, domain names, malicious programs and other resource types. In an attack event these resources operate together as a group and need to be viewed as a whole. If the defender fails to associate related resources, important information is lost.

(4) Human-machine separation. Training data is the feed of a machine learning model, and such models require a large number of labeled samples. For network security models, the labels come from the assessments of security experts. These data sets are scarce and valuable, and they are the bottleneck in putting security AI into practice. First, security experts are scarce. Second, the results of their analysis often cannot be connected to an AI scenario, that is, processed into a form that a machine learning model can use. A security expert would need to provide both the basis for a judgment and the judgment itself, which serve as the training features and the training label respectively. However, because a security expert's primary job is not AI-oriented, the analysis often remains in the expert's own mind or is scattered across disorganized reports, where it is hard for other people to understand, let alone for a machine to recognize. The security industry should therefore actively extract and integrate the output of individual experts into machine-readable judgment bases and judgment results, forming training data sets suitable for AI scenarios.

Therefore, the defender needs to connect the data sources provided by different participants to supply sufficient data fuel for AI models. Federated learning and threat intelligence sharing are ways of addressing the data silos caused by organizational boundaries. XDR (extended detection and response) is a mechanism for bridging device gaps. Solving entity fragmentation requires graph analysis. Integrating these approaches requires a collaboration mechanism that fuses the security data produced by different organizations and different sensors.

AI label data is the crystallized knowledge of models or experts. Threat intelligence, especially operational machine-readable threat intelligence, is a form of AI label data. The most difficult step in putting AI into practice is collecting and managing label data.

AI feature data, typically obtained by active probing or passive observation, represents objective events. For example, a sandbox may record the behavior sequence of a sample, a gateway may count the traffic of a particular IP, and an IDS may raise alerts on access to an IOC. The logs generated by these devices can be abstracted into behavior features, and data preprocessing turns them into a security data set that serves as input for AI model training and prediction. However, an enterprise that shares such data directly with outside parties may expose details of its own network, which poses a data security risk.

The security industry broadly agrees on the importance of intelligence sharing but has not yet found a generally accepted sharing mechanism. Security teams and security enterprises need to establish an intelligence community, and such a community needs a practical incentive mechanism to promote intelligence sharing. Parties that provide valuable intelligence should be rewarded both in reputation and materially. This means that the flow of intelligence must be audited, the value of intelligence verified, and each party's sharing quantified. The same approach also applies to scenarios in which enterprises exchange intelligence with regulatory bodies.

Disclosure of Invention

In order to solve the technical problem, the invention provides a security threat collaborative modeling method and system.

The invention is implemented as follows. A security threat collaborative modeling method is provided, comprising the following steps:

1) data sharing, wherein network entity data is partially (not completely) shared among the data sources provided by a plurality of participants, and network entity data that requires desensitization is desensitized before sharing;

2) data fusion, namely performing attribute fusion, relationship fusion, behavior fusion and label fusion on the data shared in the step 1) to form attribute fusion data, relationship fusion data, behavior fusion data and label fusion data;

3) data feature extraction, namely respectively extracting data features of the attribute fusion data, the relation fusion data and the behavior fusion data obtained in the step 2);

4) modeling, namely selecting a type of modeling method as needed, selectively loading the data features and the label fusion data, carrying out the specific machine learning training process to generate a trained model, and outputting the trained model;

5) auditing, which runs on a blockchain or collaboration platform shared by the participants, namely recording the data flows in steps 1), 2), 3) and 4), calculating contribution scores for the participants according to set rules, and rewarding the participants according to their contribution scores.

Preferably, in step 1), the shared data includes the following types:

entity attribute data, association relationship data among entities, entity behavior record data, and entity assessment label data.

Further preferably, in step 1), different network entities are identified by different identifiers, network entity data that requires desensitization is identified by desensitized identifiers, and a rainbow table mapping the identifiers of such data to their desensitized identifiers is established.

Further preferably, in step 1), the triggering modes of data sharing include active sharing and help-response sharing; the modes of data sharing include community release and point-to-point release; and the data sharing scenarios include voluntary sharing and sharing under statutory obligation.

Further preferably, in step 2):

attribute fusion means fusing, according to a predefined fusion strategy, the attributes that different participants hold for the same network entity; specifically, different attribute information recorded for the same network entity at different data sources is combined so that the sources complement each other, and identical attribute information is de-duplicated and corrected;

relationship fusion means fusing the association relationships between pairs of network entities to form a graph-structured network entity relationship library; specifically, different relationship information recorded for a pair of network entities at different data sources is combined so that the sources complement each other, and identical relationship information is de-duplicated and corrected;

behavior fusion means fusing the behavior information recorded for the same network entity at different data sources, integrating multi-source, scattered behavior information into a more comprehensive and complete observation record for each network entity; specifically, the different behavior information of the same network entity is arranged in chronological order, and identical behavior information from different participants is de-duplicated and corrected;

label fusion means that different participants each provide assessment labels for the same entity; after receiving the assessment labels sent by other participants, each reader applies its local trust policy and integrates the information from all participants, assigning higher confidence to a label given by several parties and keeping the differing labels given by different parties as complements, thereby obtaining a confidence for each label.
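As an illustration of label fusion, the sketch below combines assessment labels from several participants into a per-label confidence. The source does not specify how confidence is computed; the noisy-OR rule used here, the function name `fuse_labels`, the participant IDs and the trust weights are all assumptions made for this example.

```python
from collections import defaultdict


def fuse_labels(reports, trust):
    """Fuse assessment labels for one entity.

    reports: list of (participant_id, label) pairs received for the entity.
    trust:   local trust weight in [0, 1] per participant (hypothetical policy).
    Returns {label: confidence}; agreement between parties raises confidence,
    and differing labels are kept side by side.
    """
    disbelief = defaultdict(lambda: 1.0)
    seen = set()
    for participant, label in reports:
        if (participant, label) in seen:  # de-duplicate repeated reports
            continue
        seen.add((participant, label))
        # Noisy-OR (an assumed rule): each supporter independently reduces disbelief.
        disbelief[label] *= 1.0 - trust.get(participant, 0.0)
    return {label: 1.0 - d for label, d in disbelief.items()}


trust = {"org_a": 0.6, "org_b": 0.5, "org_c": 0.3}
reports = [
    ("org_a", "malicious"),
    ("org_b", "malicious"),
    ("org_c", "phishing"),
    ("org_a", "malicious"),  # duplicate report, ignored
]
fused = fuse_labels(reports, trust)
print(fused)
```

With these weights, the label given by two parties ends up with a higher confidence than either party's individual trust weight, while the label given by a single low-trust party is retained with a low confidence.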

Further preferably, in step 3):

the data features of the attribute fusion data include: IP geolocation, domain name registration time, and file modification time;

the data features of the relationship fusion data include: the in-degree and out-degree of graph nodes, the number of IPs associated with a domain name, the number of NS servers associated with a domain name, and the in-degree and out-degree of a domain name node restricted to IP-type neighbor nodes;

the data features of the behavior fusion data include counting features and sensitive behavior features, wherein the counting features include the number of lateral communications, the number of outbound connections, and the number of file accesses, and the sensitive behavior features include modifying startup items, contacting overseas hosts, and accessing the registry.
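A minimal sketch of how counting features and sensitive behavior features might be derived from a fused behavior record. The event schema (the `type` and `overseas` fields) and the function name `behavior_features` are hypothetical; the source does not define a log format.

```python
from collections import Counter

# Hypothetical fused behavior events for one host, already de-duplicated.
events = [
    {"time": 3, "type": "outbound_connection", "overseas": True},
    {"time": 1, "type": "lateral_communication"},
    {"time": 2, "type": "file_access"},
    {"time": 4, "type": "modify_startup_item"},
    {"time": 5, "type": "outbound_connection", "overseas": False},
    {"time": 6, "type": "registry_access"},
]


def behavior_features(events):
    """Turn a list of behavior events into counting and Boolean features."""
    counts = Counter(e["type"] for e in events)
    return {
        # Counting features
        "lateral_communication_count": counts["lateral_communication"],
        "outbound_connection_count": counts["outbound_connection"],
        "file_access_count": counts["file_access"],
        # Sensitive behavior features, encoded as Booleans
        "modified_startup_item": counts["modify_startup_item"] > 0,
        "contacted_overseas": any(e.get("overseas", False) for e in events),
        "accessed_registry": counts["registry_access"] > 0,
    }


features = behavior_features(events)
print(features)
```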

Further preferably, in step 4), the modeling methods include local training, federated learning, intelligence aggregation and ensemble learning, specifically:

local training, wherein any participant runs a machine learning training task as needed on the data features and label fusion data it holds, obtaining an AI model;

federated learning, wherein a plurality of participants agree to train an AI model jointly without sharing their data completely;

intelligence aggregation, wherein a participant applies a customized trust policy to the network entities labeled as malicious and takes the desensitized network entities carrying malicious labels as custom IOC indicators; combined with the confidence threshold customized by each participant, the IOC indicators form a decision model that judges whether input network entity data matches a known IOC indicator whose confidence exceeds the threshold, yielding an IOC model;

ensemble learning, wherein local training, federated learning and intelligence aggregation are run under different parameter settings to generate a plurality of threat assessment models, and the preliminary assessment results of these models are combined by voting to produce a final assessment result of higher credibility; using local training, federated learning or intelligence aggregation alone is a special case of ensemble learning.
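The voting step of ensemble learning can be sketched as a simple majority vote. The three model functions below are hypothetical stand-ins for a locally trained model, a federated model and an IOC model, and the feature names and tie-breaking rule are assumptions of this example, not details from the source.

```python
from collections import Counter


def vote(verdicts):
    """Majority vote over the preliminary verdicts of several models.

    Returns (winning_verdict, share_of_votes). Ties go to the verdict that
    appears first in the input (an arbitrary choice made for this sketch).
    """
    counts = Counter(verdicts)
    winner, n = counts.most_common(1)[0]
    return winner, n / len(verdicts)


# Hypothetical stand-ins for the three kinds of models.
def local_model(features):      # locally trained AI model
    return "malicious" if features["outbound_connection_count"] > 10 else "benign"


def federated_model(features):  # jointly trained AI model
    return "malicious" if features["contacted_overseas"] else "benign"


def ioc_model(features):        # IOC matching model
    return "malicious" if features["ioc_hit"] else "benign"


sample = {"outbound_connection_count": 3, "contacted_overseas": True, "ioc_hit": True}
verdicts = [model(sample) for model in (local_model, federated_model, ioc_model)]
final, support = vote(verdicts)
print(final, support)
```

Passing a single verdict to `vote` returns that verdict with full support, which mirrors the statement that using one method alone is a special case of ensemble learning.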

More preferably, in step 5), the rule for calculating the contribution score is set as:

attribute data is shared and read by other participants, the basic contribution score for a single attribute being denoted Sa;

relationship data is shared and read by other participants, the basic contribution score for a single relationship being denoted Sr;

behavior data is shared and read by other participants, the basic contribution score for a single behavior event being denoted Sb;

label data is shared and read by other participants, the basic contribution score for a single label being denoted Sl;

transparent data is shared and used for machine learning, the contribution score for a single item of transparent data being denoted Sf;

network entity data is shared directionally at the request of another participant, the contribution score for a single item of network entity data being denoted St;

after a reader reads data, the corresponding contribution score is deducted from the reader; the reader may optionally rate the data it has read; if the reader's rating differs from the data's average rating by no more than a set margin, the reader earns a set contribution score; when the data's average rating is above a set value, the participant who shared the data receives a score reward, and when the average rating is below a set value, a set score is deducted from the participant who shared the data;

the participants initially have a certain initial score by default.

The invention also provides a security threat collaborative modeling system, which comprises the following modules:

the data sharing unit, configured to share network entity data partially (not completely) among a plurality of participants and to desensitize, before sharing, the network entity data that requires desensitization;

the data fusion unit is used for performing attribute fusion, relationship fusion, behavior fusion and label fusion on the shared data to form attribute fusion data, relationship fusion data, behavior fusion data and label fusion data;

the data feature extraction unit is used for respectively extracting data features of the attribute fusion data, the relationship fusion data and the behavior fusion data;

the modeling unit, which comprises a data loading unit, a model training unit and a model output unit; a type of modeling method is selected as needed, the data loading unit selectively loads the data features and the label fusion data, the model training unit carries out the specific machine learning training process to generate a trained model, and the model output unit outputs the trained model;

and the auditing unit, which runs on a blockchain or collaboration platform shared by all the participants and is configured to record the data flows in the data sharing unit, the data fusion unit, the data feature extraction unit and the modeling unit, to calculate contribution scores for the participants according to set rules, and to reward the participants according to their contribution scores.

Compared with the prior art, the invention has the advantages that:

1. a unified data circulation framework is provided for industry sharing and supervision sharing of threat intelligence, quantification and settlement are carried out on sharing behaviors of data in a contribution score mode, data transaction and data sharing are encouraged, and consensus is achieved on contributions of all parties through a block chain or a cooperation platform technology;

2. entity IDs are exchanged in a desensitization mode, and the joint modeling capability and the threat IOC matching capability can be still supported on the premise of ensuring data security;

3. the method supports both ordinary local machine learning and federated learning; each participant can choose its degree of data openness as needed, obtain external data resources within its own budget, complete AI modeling independently or cooperatively, and strengthen its security intelligence.

Drawings

FIG. 1 is a flow chart of a method provided by the present invention;

FIG. 2 is a modeling flow diagram;

FIG. 3 is a block diagram of an apparatus according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

To address the problems in the prior art, the invention focuses on the fusion of AI feature data and AI label data and builds collaborative machine learning capability on that fusion mechanism. The invention advocates desensitizing the data to a certain degree and restoring desensitized data only in the few cases where it is necessary. Specifically:

Referring to FIG. 1, the invention provides a security threat collaborative modeling method comprising the following steps:

In step 1), the participants may be different organizations, departments, persons or devices. They do not share data completely; each data item is regarded as being provided by a sharer to a reader.

In step 1), the shared data includes the following types:

entity attribute data, association relationship data among entities, entity behavior record data, and entity assessment label data. Specifically:

Entity attribute data describes a network entity and is typically used to construct AI feature data. For example, the attributes of an IP entity include its geolocation, operator, owning organization and the like. The attributes of a domain name entity include the domain name text itself, the registration date, the first-observed date, the last-observed date and the like. The attributes of a file sample include its file format, size, modification time and the like.

Association relationship data among entities includes, for example, the resolution relationship between a domain name and an IP, the execution relationship between a malicious sample and a domain name, and the subordination relationship between a domain name and a URL. The relationships between a domain name and its registrar and registration mailbox also fall into this category.

Entity behavior record data includes, for example, firewall logs, IDS logs and DNS logs.

Entity assessment label data represents each participant's assessment of how malicious (black or gray) a network entity is.

In step 1), to identify entities uniquely, different network entities are identified by different identifiers, network entity data that requires desensitization is identified by desensitized identifiers, and a rainbow table mapping the identifiers of such data to their desensitized identifiers is established. Specifically:

Identifier: an identifier is the unique identification of a network entity in its conventional representation. For example, the identifier of an IPv4 address is its dotted-decimal notation, the identifier of an IPv6 address is its colon-separated groups of four hexadecimal digits, the identifier of a port service is the combination of a specific IP and a port number, and the identifier of a domain name or URL is its text. A file sample is identified by a combination of several of its hash values, for example (MD5, SHA256, SHA512). In short, an identifier is a textual expression of a network entity.

Desensitized identifier: a file sample needs no desensitization, so its desensitized identifier is defined as its identifier itself. The desensitized identifier of any other network entity is a combination of several hash values of its identifier, for example (MD5, SHA256, SHA512). Each participant maintains a mapping between identifiers and desensitized identifiers, namely a rainbow table. Because of the properties of hash values, after data sharing a reader generally cannot obtain the original identifier directly, but it can still attempt a collision using its own data; only for an identifier that collides successfully can the reader recover the original identifier from its rainbow table. If the collision fails and the reader still wants the original identifier, it can submit a point-to-point request to the data provider and offer a reward. The sharer can approve the request and send the answer to the reader point-to-point, earning contribution scores in return; this is called making the entity identifier transparent.
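The desensitization and collision mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `desensitize` and `RainbowTable` are hypothetical, and UTF-8 encoding with hexadecimal digests is a choice made for the example. The hash combination (MD5, SHA256, SHA512) follows the example given in the text.

```python
import hashlib


def desensitize(identifier: str) -> tuple:
    """Return the (MD5, SHA256, SHA512) hex digests of a text identifier."""
    data = identifier.encode("utf-8")  # assumption: identifiers are UTF-8 text
    return (
        hashlib.md5(data).hexdigest(),
        hashlib.sha256(data).hexdigest(),
        hashlib.sha512(data).hexdigest(),
    )


class RainbowTable:
    """Hypothetical per-participant mapping from hashed form back to the original."""

    def __init__(self):
        self._table = {}

    def add(self, identifier: str) -> tuple:
        token = desensitize(identifier)
        self._table[token] = identifier
        return token

    def collide(self, token):
        """Return the original identifier if this participant already knows it, else None."""
        return self._table.get(token)


# The sharer publishes only the hashed form of a domain name.
shared_token = desensitize("malicious.example")

# A reader that has observed the same domain recovers it by collision.
reader = RainbowTable()
reader.add("malicious.example")
reader.add("192.0.2.10")
print(reader.collide(shared_token))                   # entity the reader already knows
print(reader.collide(desensitize("other.example")))   # entity unknown to the reader
```

A reader that has never observed an entity gets `None` back and would have to fall back on the point-to-point request described in the text.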

In addition to the selective desensitization of network entity identifiers, the specific values of network entity attributes, relationships, behaviors and labels may also be desensitized. A participant can simply announce that it holds a piece of information without disclosing the specific value. When other participants receive the announcement, they treat the missing value as a special value and record its source, namely the ID of the participant that actually holds the value. Such a value is called opaque data. A reader can submit a request to the data sharer to obtain the real value of opaque data. If the sharer approves the request, the data is said to be made transparent. In addition, opaque data may be used for federated learning modeling.

In step 1), the triggering modes of data sharing include active sharing and help-response sharing; the modes of data sharing include community release and point-to-point release; and the data sharing scenarios include voluntary sharing and sharing under statutory obligation.

In the active sharing mode, the participants establish subscription relationships and take the roles of publisher and subscriber. Each publisher discloses, in batches and on its own initiative, part of the data it holds; each subscriber chooses whether to accept a batch as needed, and only subscribers that accept a batch have the right to read it. In help-response sharing, a subscriber is interested in a certain network entity but lacks knowledge of it, so it announces the entity's desensitized identifier or original identifier and asks other participants for intelligence assistance. Other participants provide relevant data from the information they hold, and the subscriber that asked for help decides whether to accept the data.

In a specific regulatory environment, the method supports a statutory-obligation sharing mechanism. A regulatory body may take part in data sharing as a special participant. Information that regulations require the regulatory body to disclose must be provided by the regulatory body to all other participants through active sharing. For information that regulations require to be reported to the regulatory body, other participants must include the regulatory body as a subscriber in active sharing. For information that must be supplied in response to a regulatory inquiry, the information holder must respond properly to the regulatory body's help request in accordance with the relevant regulations. Where no legal obligation exists, data sharing among participants is voluntary industry sharing.

in step 2), each participant voluntarily shares the batch of network entity attribute data that it observes.

In step 2):

In the data fusion process, data sources need to be recorded, so that data can be audited and traced conveniently, and support is provided for problem troubleshooting and value transaction.
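A minimal sketch of attribute fusion that records the source of every value, so that fused data stays auditable and traceable. The "first source wins, conflicts are kept for later correction" policy, the function name `fuse_attributes` and the sample records are assumptions of this example; the source leaves the fusion strategy to be predefined by the participants.

```python
def fuse_attributes(records):
    """Merge attribute records about one entity from several sources.

    records: list of (source_id, {attribute: value}) in priority order
             (a hypothetical 'first source wins' fusion policy).
    Returns {attribute: {"value": ..., "sources": [...]}} so that every
    value can be traced back to the participants that supplied it.
    """
    fused = {}
    for source, attrs in records:
        for name, value in attrs.items():
            if name not in fused:
                # Complement: a new attribute from this source.
                fused[name] = {"value": value, "sources": [source]}
            elif fused[name]["value"] == value:
                # De-duplicate: same value, just remember the extra source.
                if source not in fused[name]["sources"]:
                    fused[name]["sources"].append(source)
            else:
                # Conflict: keep the disagreeing value for later correction.
                fused[name].setdefault("conflicts", []).append((source, value))
    return fused


records = [
    ("org_a", {"registration_date": "2021-11-30", "registrar": "ExampleRegistrar"}),
    ("org_b", {"registration_date": "2021-11-30", "first_seen": "2021-12-05"}),
    ("org_c", {"registrar": "OtherRegistrar"}),
]
fused = fuse_attributes(records)
for name, entry in fused.items():
    print(name, entry)
```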

Step 3) builds on data fusion: feature extraction and assessment are performed on a group's resources as a unit rather than on single network entities analyzed in isolation.

In step 3):

the data features of the relationship fusion data include: the in-degree and out-degree of graph nodes, the number of IPs associated with a domain name, the number of NS servers associated with a domain name, and the in-degree and out-degree of a domain name node restricted to IP-type neighbor nodes. A topology is constructed from the association relationships among entities, and feature information is then extracted from the topology.
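The degree features named above can be sketched with a small directed graph in plain Python. The node names, node types and edge directions below are hypothetical sample data, and `degree_features` is an illustrative helper, not a function defined by the source.

```python
# Hypothetical fused relationship library: node types and directed edges.
node_type = {
    "evil.example": "domain",
    "192.0.2.10": "ip",
    "192.0.2.11": "ip",
    "ns1.example": "ns",
    "sample_abc": "file",
}
edges = [
    ("evil.example", "192.0.2.10"),   # domain resolves to IP
    ("evil.example", "192.0.2.11"),
    ("evil.example", "ns1.example"),  # domain uses NS server
    ("sample_abc", "evil.example"),   # sample contacts domain
]


def degree_features(node, restrict_type=None):
    """Return (in_degree, out_degree) of a node.

    If restrict_type is given, only neighbors of that type are counted.
    """
    def allowed(neighbor):
        return restrict_type is None or node_type[neighbor] == restrict_type

    out_deg = sum(1 for src, dst in edges if src == node and allowed(dst))
    in_deg = sum(1 for src, dst in edges if dst == node and allowed(src))
    return in_deg, out_deg


print(degree_features("evil.example"))        # overall in-degree and out-degree
print(degree_features("evil.example", "ip"))  # restricted to IP-type neighbors
print(degree_features("evil.example", "ns"))  # restricted to NS servers
```

In this toy graph the out-degree restricted to IP-type neighbors equals the number of IPs associated with the domain name, and the out-degree restricted to NS-type neighbors equals the number of associated NS servers.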

In step 3), the label data shared by a particular external participant may also be used as feature data rather than as training labels for the final model. There are two reasons. First, participants differ in credibility, and the assessment labels provided by a participant with low credibility cannot be trusted directly and fed to the machine learning model as labels. Second, the learning targets are not necessarily the same. For example, an external participant may provide URL links labeled "advertisement" while other participants are more concerned with network intrusion behavior; because of this inconsistency in business goals, some external labels can serve only as candidate features.

The machine learning model requires that the feature data is in a specific form such as continuous real number, discrete ordinal number, Boolean type variable and the like. Part of the attribute data does not meet the requirement, but can be converted into valuable characteristic information through certain preprocessing. This process generally requires extracting information on demand in conjunction with business semantics.
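As a small example of such preprocessing, raw domain name attributes can be turned into continuous, ordinal and Boolean features. The chosen features and the function name `domain_attribute_features` are illustrative assumptions; which features are useful depends on the business semantics.

```python
from datetime import date


def domain_attribute_features(domain, registration_date, observed_on):
    """Convert raw domain attributes into model-ready feature values."""
    labels = domain.split(".")
    return {
        # Continuous: age of the domain when it was observed
        "domain_age_days": (observed_on - registration_date).days,
        # Discrete ordinal: number of dot-separated labels
        "label_count": len(labels),
        # Discrete ordinal: length of the domain text
        "name_length": len(domain),
        # Boolean: whether the domain text contains a digit
        "has_digit": any(ch.isdigit() for ch in domain),
    }


features = domain_attribute_features(
    "login-update7.example.com", date(2021, 12, 1), date(2021, 12, 22)
)
print(features)
```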

The feature extraction process is decided by each participant. Participants should follow the PDCA (plan-do-check-act) methodology, that is, continuously adjust their feature engineering according to business direction and model performance.

4) Modeling, referring to fig. 2, selecting different types of modeling methods according to needs, selectively loading data features and label fusion data, performing a machine learning specific training process, generating a training model, and outputting the training model;

In step 4), the modeling methods include local training, federated learning, intelligence aggregation and ensemble learning. Specifically:

When the data features and the label fusion data are selectively loaded, feature data and label data are chosen according to the modeling type. If the type is local training or federated learning, the loading policy is set according to the modeling requirements and a feature selection function is executed. If the type is intelligence aggregation, the desensitized identifiers that meet the requirements are loaded according to the trust policy, and custom IOC indicators are generated.

When the specific machine learning training process is carried out to generate a trained model, the specific procedure of local supervised training or federated learning training is executed. If the type is intelligence aggregation, the model training process converts the custom IOC indicators into an IOC matching model, which provides IOC matching capability and judges whether an input desensitized identifier hits an IOC indicator.
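The IOC matching model can be sketched as a set lookup over hashed identifiers whose malicious-label confidence exceeds the participant's threshold. The names `build_ioc_model` and `token` are hypothetical, and a single SHA-256 digest stands in for the full hash combination to keep the example short.

```python
import hashlib


def token(identifier):
    # Simplification for this sketch: one SHA-256 digest stands in for
    # the hashed (desensitized) form of an identifier.
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def build_ioc_model(label_confidence, threshold):
    """Build an IOC matcher from fused labels.

    label_confidence: {hashed_identifier: confidence of the 'malicious' label}
    threshold:        the participant's own confidence threshold.
    Returns a function that reports whether a hashed identifier hits an IOC.
    """
    ioc_set = {t for t, conf in label_confidence.items() if conf > threshold}

    def matches(hashed_identifier):
        return hashed_identifier in ioc_set

    return matches


confidences = {token("evil.example"): 0.8, token("ads.example"): 0.3}
is_ioc = build_ioc_model(confidences, threshold=0.5)
print(is_ioc(token("evil.example")))     # confidence above the threshold
print(is_ioc(token("ads.example")))      # confidence below the threshold
print(is_ioc(token("unknown.example")))  # never reported
```

Because matching works on hashed identifiers only, a participant can check its own traffic against shared indicators without the sharer revealing the original identifiers.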

Outputting the trained model means exposing the model's capability externally, including the prediction capability of the AI model and the match-judgment capability of the IOC model.

5) Auditing, namely running in a block chain or a cooperation platform shared by multiple parties, accounting according to the data flow in the steps 1), 2), 3) and 4), accounting contribution points for different participants according to set rules, and rewarding the participants according to the contribution points.

In step 5), auditing runs on a blockchain or collaboration platform shared by the participants. On a blockchain, data exchange records are stored in blocks and new blocks are appended continuously. Alternatively, the auditing process can run on a public collaboration platform, where the data exchange records are stored in the platform database and the auditing system obtains each party's contributions by reading the database.
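To illustrate how data exchange records can be stored in blocks that are continuously appended, the sketch below keeps a hash-chained, append-only log. It is a toy stand-in for a blockchain: there is no consensus protocol or distribution among participants, and the class name `AuditChain` and the record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first block


def _digest(prev_hash, record):
    """Hash a record together with the hash of the previous block."""
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AuditChain:
    """Append-only, hash-chained log of data exchange records."""

    def __init__(self):
        self.blocks = []

    def append(self, record):
        prev_hash = self.blocks[-1]["hash"] if self.blocks else GENESIS
        block = {"prev": prev_hash, "record": record,
                 "hash": _digest(prev_hash, record)}
        self.blocks.append(block)
        return block

    def verify(self):
        """Recompute every hash; any edit to an earlier record breaks the chain."""
        prev_hash = GENESIS
        for block in self.blocks:
            expected = _digest(prev_hash, block["record"])
            if block["prev"] != prev_hash or block["hash"] != expected:
                return False
            prev_hash = block["hash"]
        return True


chain = AuditChain()
chain.append({"sharer": "org_a", "reader": "org_b", "kind": "relationship", "items": 3})
chain.append({"sharer": "org_b", "reader": "org_c", "kind": "label", "items": 1})
print(chain.verify())                     # intact chain

chain.blocks[0]["record"]["items"] = 30   # tamper with an earlier record
print(chain.verify())                     # tampering is detected
```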

In step 5), the rule for counting the contribution score is set as:

A participant obtains contribution scores only after the data it shared has been read by other participants. The contribution score obtained is the product of the basic contribution score and the number of readers. For example, if an item of relationship data is read by 3 participants, the sharer obtains a contribution score of 3Sr.

After a reader reads data, the corresponding contribution score is deducted from the reader; for example, a reader that reads 3 items of relationship data has 3Sr deducted. The reader may optionally rate the data it has read, for example on a scale of 0 to 10. If the reader's rating is within a set margin, for example 2 points, of the data's average rating, the reader earns a set contribution score, denoted Ss. Let r denote the data's average rating and Sw the basic reward score. When r is above a set value, for example 8 points, the participant who shared the data receives a reward of Sw × (r − 8). When r is below a set value, for example 3 points, a penalty of Sw × (3 − r) is deducted from the participant who shared the data;

the participants initially have a certain initial score by default.

The sum of the contribution points currently held by the participants in the system is constantly changing.
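The scoring rules can be worked through with concrete numbers. The sketch below uses the example thresholds from the text (reward above 8 points, penalty below 3 points, rating margin of 2 points); the numeric values chosen for Sr, Sw and Ss and the function names are assumptions made for illustration.

```python
def sharer_score(base, readers, avg_rating=None, sw=0.0, high=8.0, low=3.0):
    """Contribution score earned by a sharer for one data item.

    base:       basic contribution score of the item (e.g. Sr for one relationship)
    readers:    number of participants who read the item
    avg_rating: average rating r on the 0-10 scale, if any ratings exist
    sw:         basic reward score Sw
    """
    score = base * readers
    if avg_rating is not None:
        if avg_rating > high:
            score += sw * (avg_rating - high)  # reward Sw * (r - 8)
        elif avg_rating < low:
            score -= sw * (low - avg_rating)   # penalty Sw * (3 - r)
    return score


def reader_delta(base, items, rating=None, avg_rating=None, ss=0.0, margin=2.0):
    """Net score change of a reader: it pays for what it reads and may earn Ss."""
    delta = -base * items
    if rating is not None and avg_rating is not None:
        if abs(rating - avg_rating) <= margin:
            delta += ss  # rating close to the average earns Ss
    return delta


SR, SW, SS = 2, 5, 1  # hypothetical values for Sr, Sw and Ss

print(sharer_score(SR, readers=3))                        # 3 * Sr
print(sharer_score(SR, readers=3, avg_rating=9, sw=SW))   # 3 * Sr + Sw * (9 - 8)
print(sharer_score(SR, readers=3, avg_rating=2, sw=SW))   # 3 * Sr - Sw * (3 - 2)
print(reader_delta(SR, items=3, rating=8, avg_rating=9, ss=SS))  # -3 * Sr + Ss
```

Because readers pay and sharers earn, and rewards and penalties depend on ratings, the total score held across all participants is not fixed.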

Referring to fig. 3, the present invention further provides a collaborative modeling system for security threats, including the following modules:

the data sharing unit, configured to share network entity data partially (not completely) among a plurality of participants and to desensitize, before sharing, the network entity data that requires desensitization. If the original identifier of a network entity is not sensitive in the current environment, the desensitization function can be switched off, that is, the original identifier is used directly as the desensitized identifier. For example, when the participants belong to the same department or fully trust one another, no desensitization is needed during information exchange.

The data fusion unit is configured to perform attribute fusion, relationship fusion, behavior fusion and label fusion on the shared data to form attribute fusion data, relationship fusion data, behavior fusion data and label fusion data. The data fusion unit provides interface support for this process and can offer a visual interface that presents the fusion of entities and relationships. The association relationships between network entities can be presented visually as a graph, with a convenient interface for the user to edit and change them. To distinguish data sources, nodes and edges are marked with icons, with different icons representing different participants. The color and shape of nodes and edges may be used to reflect the type of entity or relationship.

Claims

1. A collaborative modeling method for security threats is characterized by comprising the following steps:

2. The collaborative modeling method for security threats according to claim 1, wherein in step 1), the shared data includes the following types:

3. The collaborative modeling method for security threats according to claim 1, wherein in step 1), different network entities are identified by different identifiers, network entity data that requires desensitization is identified by desensitized identifiers, and a rainbow table mapping the identifiers of such data to their desensitized identifiers is established.

4. The collaborative modeling method for security threats according to claim 1, wherein in step 1), the triggering modes of data sharing include active sharing and help-response sharing; the modes of data sharing include community release and point-to-point release; and the data sharing scenarios include voluntary sharing and sharing under statutory obligation.

5. The collaborative modeling method for security threats according to claim 1, wherein in step 2):

6. The collaborative modeling method for security threats according to claim 1, wherein in step 3):

7. The collaborative modeling method for security threats according to claim 1, wherein in step 4), the modeling methods include local training, federated learning, intelligence aggregation and ensemble learning, specifically:

8. The collaborative modeling method for security threats according to claim 1, wherein in the step 5), the rule for calculating the contribution score is set as:

the participants initially have a certain initial score by default.

9. A security threat collaborative modeling system, comprising the following modules: