CN115344563B

CN115344563B - Data deduplication method and device, storage medium and electronic equipment

Info

Publication number: CN115344563B
Application number: CN202210987429.9A
Authority: CN
Inventors: 高岩; 袁涵; 郭实秋; 鞠港
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2022-08-17
Filing date: 2022-08-17
Publication date: 2024-02-02
Anticipated expiration: 2042-08-17
Also published as: CN115344563A

Abstract

The disclosure belongs to the technical field of network security and relates to a data deduplication method and device, a storage medium, and electronic equipment. The method comprises: acquiring threat intelligence data and preprocessing it to determine a data type; when the data type is unstructured, performing text similarity calculation on the threat intelligence data to obtain semantic feature vectors, and deduplicating the threat intelligence data according to the semantic feature vectors; or, when the data type is structured, performing data compression processing on the data type and storing the compressed threat intelligence data for deduplication. The method addresses the excessive memory use and long processing time of threat intelligence deduplication, allows text information that the original deduplication method could not capture to be taken into account, improves the retrieval efficiency of unstructured threat intelligence data, and reduces the system resources consumed by deduplicating and storing threat intelligence data.

Description

Data deduplication method and device, storage medium and electronic equipment

Technical Field

The disclosure relates to the technical field of network security, and in particular relates to a data deduplication method, a data deduplication device, a computer readable storage medium and electronic equipment.

Background

With the rapid development of the Internet, and of the mobile Internet in particular, more and more network devices and Internet of Things devices are connected to the backbone network. The topology of the Internet is becoming more complex, attacks are becoming more industrialized, and intrusion techniques are becoming more diverse and sophisticated, so traditional security solutions face growing challenges. Meanwhile, as China's international standing rises, the network attacks it suffers are also becoming more diverse and complex. Against this background, enterprises are paying increasing attention to threat intelligence: security equipment is more effective when combined with threat intelligence, and enterprise security operations can respond to security events more quickly.

As network attacks become more frequent, millions of pieces of threat intelligence are generated every day. However, the quality of threat intelligence is uneven, whether it comes from commercial sources or from open source websites. A large amount of duplicate data exists across threat intelligence from different sources, and intelligence from the same source can also repeat earlier data. The duplicates occupy excessive memory and adversely affect platform operation, storage, and operation and maintenance.

In view of this, there is a need in the art to develop a new data deduplication method and apparatus.

It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The disclosure aims to provide a data deduplication method, a data deduplication device, a computer readable storage medium, and an electronic device, so as to overcome, at least to some extent, the technical problem of excessive memory occupation caused by the limitations of the related art.

Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.

According to a first aspect of an embodiment of the present invention, there is provided a data deduplication method, the method including:

acquiring threat information data, and preprocessing the threat information data to determine a data type;

when the data type is unstructured, performing text similarity calculation on the threat intelligence data to obtain a semantic feature vector, and performing deduplication processing on the threat intelligence data according to the semantic feature vector; or

when the data type is structured, performing data compression processing on the data type, and storing the compressed threat intelligence data so as to perform deduplication processing.

In an exemplary embodiment of the present invention, the preprocessing the threat intelligence data to determine a data type includes:

carrying out data standardization processing on the threat information data, and extracting the processed threat information data to obtain key data;

and carrying out data cleaning treatment on the key data, and classifying the cleaned key data to obtain data types.

In an exemplary embodiment of the present invention, the performing data compression processing on the data type includes:

encoding the data type to obtain a first bit vector, and carrying out hash calculation on the key data to obtain a second bit vector;

and calculating the first bit vector and the second bit vector to obtain a target bit vector so as to obtain the compressed threat information data.

In an exemplary embodiment of the present invention, before the text similarity calculation is performed on the threat intelligence data to obtain a semantic feature vector, the method further includes:

inputting the threat intelligence data into a joint extraction model so that the joint extraction model outputs intelligence keywords and intelligence categories;

and scoring the intelligence keywords and the intelligence categories by using a structured deduplication algorithm to obtain a first deduplication score.

In an exemplary embodiment of the present invention, the joint extraction model is trained by the following method:

performing character vector training on training samples by using a pre-training algorithm to obtain text vectors, and encoding the text vectors to obtain encoded vectors;

and performing sequence label prediction on the encoded vectors to obtain keyword data, and performing category prediction on the encoded vectors to obtain category data.

In an exemplary embodiment of the invention, the semantic feature vectors include a high-level semantic vector and a medium-level semantic vector,

the text similarity calculation is carried out on the threat information data to obtain semantic feature vectors, and the method comprises the following steps:

and inputting the threat information data into a full binary quantized language representation model so that the language representation model outputs the high-level semantic vector and the medium-level semantic vector.

In an exemplary embodiment of the present invention, the performing a deduplication process on the threat intelligence data according to the semantic feature vector includes:

acquiring stored information data in an information database, and performing first distance calculation on the medium-level semantic vector and the stored information data to determine an information candidate set;

performing second distance calculation on the high-level semantic vector and the stored intelligence data in the intelligence candidate set to determine a second deduplication score, and performing calculation on the first deduplication score and the second deduplication score to obtain a duplicate confidence;

and performing deduplication processing on the threat intelligence data according to the duplicate confidence.
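The two-stage comparison above can be sketched as follows. This is only a minimal illustration: it assumes the semantic vectors are binary codes stored as Python integers and compared by Hamming distance, that the candidate set is the `top_k` nearest stored items, and that the two scores are combined by a weighted average. None of these choices (`top_k`, `weight`, the field names `mid` and `high`) is specified in the disclosure.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


def duplicate_confidence(query, stored, first_score, bits=16, top_k=2, weight=0.5):
    """Return (id of the closest stored item, duplicate confidence in [0, 1])."""
    # First distance calculation: medium-level vectors select the candidate set.
    candidates = sorted(stored, key=lambda s: hamming(query["mid"], s["mid"]))[:top_k]
    # Second distance calculation: high-level vectors rank the candidates.
    best = min(candidates, key=lambda s: hamming(query["high"], s["high"]))
    second_score = 1.0 - hamming(query["high"], best["high"]) / bits
    # Assumed combination rule: weighted average of the two deduplication scores.
    return best["id"], weight * first_score + (1.0 - weight) * second_score


# Toy 16-bit codes, invented for illustration only.
stored_items = [
    {"id": "A", "mid": 0b1010101010101010, "high": 0b1111000011110000},
    {"id": "B", "mid": 0b1010101010100101, "high": 0b0000111100001111},
    {"id": "C", "mid": 0b0101010101010101, "high": 0b1111000011110000},
]
new_item = {"mid": 0b1010101010101010, "high": 0b1111000011110000}

match_id, confidence = duplicate_confidence(new_item, stored_items, first_score=1.0)
print(match_id, confidence)
```

In this toy data, item C shares the high-level code with the query but is filtered out in the first stage, which shows why the coarse filter reduces the number of fine-grained comparisons.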

According to a second aspect of an embodiment of the present invention, there is provided a data deduplication apparatus, including:

the data acquisition module is configured to acquire threat information data and preprocess the threat information data to determine a data type;

the first deduplication module is configured to perform text similarity calculation on the threat intelligence data to obtain a semantic feature vector when the data type is unstructured, and perform deduplication processing on the threat intelligence data according to the semantic feature vector; or

the second deduplication module is configured to perform data compression processing on the data type when the data type is structured, and store the compressed threat intelligence data so as to perform deduplication processing.

According to a third aspect of an embodiment of the present invention, there is provided an electronic apparatus including: a processor and a memory; wherein the memory has stored thereon computer readable instructions which, when executed by the processor, implement the data deduplication method of any of the exemplary embodiments described above.

According to a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data deduplication method in any of the above-described exemplary embodiments.

As can be seen from the above technical solutions, the data deduplication method, the data deduplication device, the computer storage medium, and the electronic device in the exemplary embodiments of the present disclosure have at least the following advantages and positive effects:

In the method and device provided by the exemplary embodiments of the disclosure, determining the data type of the threat intelligence data provides the data basis and theoretical support for applying different deduplication approaches to different types of threat intelligence data. On the one hand, deduplicating the threat intelligence data according to semantic feature vectors addresses the excessive memory use and long processing time of the deduplication process, allows text information that the original deduplication method could not capture to be taken into account, and improves the retrieval efficiency of unstructured threat intelligence data. On the other hand, performing data compression processing on the data type of structured threat intelligence data reduces the system resources consumed both by deduplicating massive amounts of threat intelligence data and by storing them.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.

FIG. 1 schematically illustrates a flow diagram of a data deduplication method in an exemplary embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow diagram of a method of determining a data type in an exemplary embodiment of the present disclosure;

FIG. 3 schematically illustrates a flow diagram of a method for determining a first deduplication score of threat intelligence data using a joint extraction model in an exemplary embodiment of the disclosure;

FIG. 4 schematically illustrates a flow diagram of a method of training a joint extraction model in an exemplary embodiment of the present disclosure;

FIG. 5 schematically illustrates a flow diagram of a method of deduplication of threat intelligence data in an exemplary embodiment of the disclosure;

FIG. 6 schematically illustrates a flow diagram of a method of data compression processing of data types in an exemplary embodiment of the present disclosure;

FIG. 7 schematically illustrates a structural diagram of a data deduplication system in an application scenario in an exemplary embodiment of the present disclosure;

FIG. 8 schematically illustrates a structural schematic of a data processing module in an exemplary embodiment of the present disclosure;

FIG. 9 schematically illustrates a structural diagram of an information keyword-type joint extraction model in an exemplary embodiment of the present disclosure;

FIG. 10 schematically illustrates a structural schematic of a full binary quantized language representation model in an exemplary embodiment of the present disclosure;

FIG. 11 schematically illustrates a structural diagram of a data compression module in an exemplary embodiment of the present disclosure;

FIG. 12 schematically illustrates a structural diagram of a data deduplication apparatus in an exemplary embodiment of the present disclosure;

FIG. 13 schematically illustrates an electronic device for implementing a data deduplication method in an exemplary embodiment of the present disclosure;

FIG. 14 schematically illustrates a computer-readable storage medium for implementing a data deduplication method in an exemplary embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.

The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. in addition to the listed elements/components/etc.; the terms "first" and "second" and the like are used merely as labels, and are not intended to limit the number of their objects.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.

With the rapid development of the Internet, and of the mobile Internet in particular, more and more network devices and Internet of Things devices are connected to the backbone network. The topology of the Internet is becoming more complex, attacks are becoming more industrialized, and intrusion techniques are becoming more diverse and sophisticated, so traditional security solutions face growing challenges. Meanwhile, as China's international standing rises, the network attacks it suffers are also becoming more diverse and complex.

Against this background, enterprises are paying increasing attention to threat intelligence: security equipment is more effective when combined with threat intelligence, and enterprise security operations can respond to security events more quickly. Threat intelligence is therefore becoming increasingly important in network security.

Threat intelligence refers to an intelligence knowledge base that covers multiple types and multiple dimensions. It may include vulnerability intelligence, asset intelligence, IOC (Indicator of Compromise, threat indicator) intelligence, event intelligence, and the like.

As an evidence-based body of knowledge covering scenarios, mechanisms, indicators, and actionable advice, threat intelligence can effectively close blind spots in network security defense and turn passive protection into active defense. While existing attacks are being detected, it also supports threat tracing, evidence discovery, attack prediction, attack graph construction, and the like, which improves the overall protection capability of network security equipment, reduces the impact of network attacks, and gives security decision makers an important reference for network defense.

As network attacks become more frequent, millions of pieces of threat intelligence are generated every day. However, the quality of threat intelligence is uneven, whether it comes from commercial sources or from open source websites. A large amount of duplicate data exists across threat intelligence from different sources, and intelligence from the same source can also repeat earlier data. A threat intelligence platform must provide accurate, high-quality data. The duplicate data produced by data sources every day affects platform operation, storage, and operation and maintenance, so deduplication of threat intelligence data has become an important part of intelligence processing and bears directly on intelligence quality and on the construction of the threat intelligence platform.

In view of the problems in the related art, the present disclosure proposes a data deduplication method. FIG. 1 shows a flow chart of the data deduplication method. As shown in FIG. 1, the data deduplication method comprises at least the following steps:

Step S110: threat intelligence data is acquired, and the threat intelligence data is preprocessed to determine the data type.

Step S120: when the data type is unstructured, text similarity calculation is performed on the threat intelligence data to obtain a semantic feature vector, and deduplication processing is performed on the threat intelligence data according to the semantic feature vector.

Step S130: when the data type is structured, data compression processing is performed on the data type, and the compressed threat intelligence data is stored so as to perform deduplication processing.
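The branching in steps S110 to S130 can be sketched as a small dispatcher. This is a minimal illustration only: the labels `structured` and `unstructured`, the function names, and the placeholder type rule are assumptions, and the two handlers merely stand in for the processing that the disclosure describes for each branch.

```python
from typing import Callable, Dict

# Hypothetical type labels; the text only distinguishes structured from unstructured data.
STRUCTURED = "structured"
UNSTRUCTURED = "unstructured"


def deduplicate(record: dict,
                determine_type: Callable[[dict], str],
                handlers: Dict[str, Callable[[dict], str]]) -> str:
    """Route one threat intelligence record to the branch that matches its data type."""
    data_type = determine_type(record)      # S110: preprocess and determine the type
    if data_type not in handlers:
        raise ValueError(f"unknown data type: {data_type}")
    return handlers[data_type](record)      # S120 or S130


def demo_type(record: dict) -> str:
    # Placeholder rule for the demo: records carrying free text count as unstructured.
    return UNSTRUCTURED if "text" in record else STRUCTURED


demo_handlers = {
    UNSTRUCTURED: lambda r: "semantic-vector deduplication (S120)",
    STRUCTURED: lambda r: "compress, store, then deduplicate (S130)",
}

print(deduplicate({"text": "Trojan backdoor ..."}, demo_type, demo_handlers))
print(deduplicate({"ip": "192.168.0.1"}, demo_type, demo_handlers))
```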

In an exemplary embodiment of the present disclosure, determining the data type of the threat intelligence data provides the data basis and theoretical support for applying different deduplication approaches to different types of threat intelligence data. On the one hand, deduplicating the threat intelligence data according to semantic feature vectors addresses the excessive memory use and long processing time of the deduplication process, allows text information that the original deduplication method could not capture to be taken into account, and improves the retrieval efficiency of unstructured threat intelligence data. On the other hand, performing data compression processing on the data type of structured threat intelligence data reduces the system resources consumed both by deduplicating massive amounts of threat intelligence data and by storing them.

The steps of the data deduplication method are described in detail below.

In step S110, threat intelligence data is acquired, and the threat intelligence data is preprocessed to determine a data type.

In exemplary embodiments of the present disclosure, threat intelligence refers to an intelligence knowledge base that covers multiple types and multiple dimensions.

The threat intelligence may include vulnerability intelligence, asset intelligence, IOC intelligence, event intelligence, and the like.

As an evidence-based body of knowledge covering scenarios, mechanisms, indicators, and actionable advice, threat intelligence can effectively close blind spots in network security defense and turn passive protection into active defense. While existing attacks are being detected, it also supports threat tracing, evidence discovery, attack prediction, attack graph construction, and the like, which improves the overall protection capability of network security equipment, reduces the impact of network attacks, and gives security decision makers an important reference for network defense.

Classifying threat intelligence data by attribute allows the intelligence to be matched to its usage scenario.

On this basis, threat intelligence data may be divided into basic intelligence, asset intelligence, vulnerability intelligence, event intelligence, IOC intelligence, attack organization intelligence, and other intelligence types.

Basic intelligence covers common objects in the network, such as IP (Internet Protocol) addresses (192.168.0.X), domain names (www.xxxxx.com), mailboxes ([email protected]), URLs (Uniform Resource Locators, e.g. http://www.xxxxxx.com), and certificates.

For each of these categories, the basic intelligence includes, for example, the ports used, the types of service provided, whois (domain name query protocol) information (whether the domain name has been registered and the registration details), and geographic location information for the IP address, domain name, or URL, such as latitude and longitude and the country and city to which it belongs.

Asset intelligence can be classified by content into risk asset intelligence, asset change intelligence, and asset discovery intelligence. An asset is a physical or virtual device on the Internet, such as a router, switch, server, or host.

Vulnerability intelligence refers to a knowledge base formed by collecting, analyzing, and describing existing vulnerabilities using threat intelligence techniques.

Examples include national vulnerability databases, such as NVD (National Vulnerability Database), CNVD (China National Vulnerability Database), and CNNVD (China National Vulnerability Database of Information Security), and Common Vulnerabilities and Exposures (CVE), which mainly describe the name, description, type, hazard score, implementation principle, impact, and patch measures of each vulnerability.

Event intelligence refers to information about various security events, such as the time of occurrence and the impact caused. It describes in text the type, source, and potential impact of a security event and the associated vulnerabilities or attack organizations, helping security operators and non-specialists learn of external threats in time and respond.

IOCs are threat indicators that describe the detection features of network attacks, such as the attack source IP address, the domain name, the MD5 (Message-Digest Algorithm 5) hash value of a malicious file, traffic characteristics, or the mailbox from which a phishing email was sent. Security personnel can use IOC intelligence for risk assessment, security hardening, and the like.

Attack organization intelligence contains threat actor names, such as the names of hacker organizations; attacker roles, such as hackers and white hats; and the industries, countries, and so on targeted by the attack organization.

Other intelligence types may include threat reports, important-activity assurance intelligence, internal intelligence, and the like.

After the threat intelligence data is acquired, the threat intelligence data may be preprocessed to determine a data type of the threat intelligence data.

In an alternative embodiment, FIG. 2 shows a flow diagram of a method of determining a data type. As shown in FIG. 2, the method may comprise at least the following steps. In step S210, data standardization processing is performed on the threat intelligence data, and the processed threat intelligence data is extracted to obtain key data.

Threat intelligence data from different intelligence sources is subjected to data standardization processing. For example, the data may be converted into JSON (JavaScript Object Notation) format, which is not particularly limited by the present exemplary embodiment.

Further, key data such as the attacker IP, attack type, and threat level may be extracted from the standardized threat intelligence data.

The keywords that the platform needs to extract differ by threat intelligence type; after extraction, JSON documents in a unified format are formed as the key data.
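The standardization and extraction in step S210 might look like the following sketch. The source field names (`src_ip`, `category`, `severity`, and so on) and the alias table are hypothetical; the text only names attacker IP, attack type, and threat level as examples of key data and JSON as one possible unified format.

```python
import json

# Hypothetical aliases per unified key; real intelligence sources would need their own mappings.
FIELD_ALIASES = {
    "attacker_ip": ("attacker_ip", "src_ip", "source"),
    "attack_type": ("attack_type", "category"),
    "threat_level": ("threat_level", "severity"),
}


def normalize(raw: dict) -> dict:
    """Map source-specific field names onto one unified set of keys."""
    key_data = {}
    for unified, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            if alias in raw:
                key_data[unified] = raw[alias]
                break
    return key_data


def to_json_document(raw: dict) -> str:
    """Serialize the extracted key data as a JSON document with a stable key order."""
    return json.dumps(normalize(raw), sort_keys=True, ensure_ascii=False)


# Two made-up records describing the same fact in different source formats.
feed_a = {"src_ip": "192.168.0.7", "category": "trojan", "severity": "high"}
feed_b = {"attacker_ip": "192.168.0.7", "attack_type": "trojan",
          "threat_level": "high", "note": "extra field that is not key data"}

print(to_json_document(feed_a))
print(to_json_document(feed_a) == to_json_document(feed_b))
```

Sorting the keys means that two sources reporting the same key data produce byte-identical documents, which makes later comparison simpler.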

In step S220, data cleaning processing is performed on the key data, and the cleaned key data is classified to obtain the data type.

Because intelligence from different sources varies in quality and may contain characters such as the line feed "\n" and the tab "\t", data cleaning can delete or replace characters and remove sensitive words, stop words, and the like from the key data, so that the cleaned key data meets the requirements of the subsequent processing flow.
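A minimal sketch of such a cleaning step is shown below. The stop-word list is purely illustrative, and sensitive-word removal would follow the same pattern with a different word list; neither list is given in the text.

```python
import re

# Illustrative stop-word list; an actual deployment would supply its own lists.
STOP_WORDS = {"the", "a", "an", "of"}


def clean_text(text: str, stop_words=STOP_WORDS) -> str:
    """Replace control characters with spaces, collapse whitespace, and drop stop words."""
    text = re.sub(r"[\n\t\r]+", " ", text)
    tokens = [t for t in text.split() if t.lower() not in stop_words]
    return " ".join(tokens)


print(clean_text("Trojan\tbackdoor\nexploits the vulnerability CVE-2022-26134"))
```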

Data classification prepares the threat intelligence data for deduplication. According to the threat intelligence type, the cleaned key data is divided into unstructured intelligence data, such as attack organization intelligence, event intelligence, and reports, and structured intelligence data, such as basic intelligence, vulnerability intelligence, and IOC intelligence.

The different ways of dividing threat intelligence data refer to table 1:

TABLE 1

Specifically, intelligence can be divided into structured intelligence and unstructured intelligence according to the data format.

Structured threat intelligence refers to data such as IPs, assets, and vulnerabilities that can be uniquely identified by a character string, such as a specific IP address or a vulnerability number; that string uniquely identifies one piece of intelligence. Threat reports, important-activity assurance intelligence, and internal intelligence, which belong to the other intelligence types, may also be structured.

Unstructured threat intelligence refers to event intelligence and the like, in which an attack event is described in text that includes vulnerability information, attack organization information, and so on. Such intelligence cannot be used directly; it typically requires human or machine reading to extract and organize the desired information into usable intelligence.
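The division described above can be written as a simple lookup. The class names below are hypothetical identifiers for the intelligence types named in the text, and the handling of the "other" types is left to a caller-supplied default because the text allows them to fall on either side.

```python
# Hypothetical identifiers for the intelligence classes named in the text.
STRUCTURED_CLASSES = {"basic", "asset", "vulnerability", "ioc"}
UNSTRUCTURED_CLASSES = {"event", "attack_organization"}


def data_type_of(intel_class: str, other_default: str = "unstructured") -> str:
    """Return 'structured' or 'unstructured' for an intelligence class."""
    if intel_class in STRUCTURED_CLASSES:
        return "structured"
    if intel_class in UNSTRUCTURED_CLASSES:
        return "unstructured"
    # Other types (threat reports, internal intelligence, ...) are decided by the caller.
    return other_default


print(data_type_of("ioc"))
print(data_type_of("event"))
```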

In the present exemplary embodiment, the data type of threat intelligence data can be determined through preprocessing, so that data base and theoretical support are provided for subsequent deduplication processing, and accuracy and timeliness of data deduplication are ensured.

In step S120, when the data type is unstructured, text similarity calculation is performed on the threat intelligence data to obtain a semantic feature vector, and duplication removal is performed on the threat intelligence data according to the semantic feature vector.

In an exemplary embodiment of the present disclosure, when the threat intelligence data is unstructured, observation of a large amount of data shows that texts can be similar while their threat intelligence keywords differ. For example, intelligence 1: "Trojan backdoor, vulnerability exploitation: CVE-2022-26134"; intelligence 2: "Trojan backdoor, security vulnerability: CVE-2022-30716".

With a text similarity method based on word co-occurrence, shared words such as "Trojan backdoor" and "vulnerability" give the two texts a similarity of 62.5% under SimHash (a hashing method commonly used for web page deduplication). However, the key CVE vulnerability numbers of the two texts differ, so they are clearly two different pieces of intelligence.
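To make the limitation concrete, the following is a minimal, unweighted SimHash sketch. The tokenization (whitespace splitting), the use of MD5 as the per-token hash, and the 64-bit fingerprint size are all assumptions, so the printed value is not expected to reproduce the 62.5% figure quoted above.

```python
import hashlib


def simhash(tokens, bits: int = 64) -> int:
    """Compute an unweighted SimHash fingerprint from a list of tokens."""
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def simhash_similarity(a, b, bits: int = 64) -> float:
    """Similarity as the fraction of fingerprint bits on which two texts agree."""
    distance = bin(simhash(a, bits) ^ simhash(b, bits)).count("1")
    return 1.0 - distance / bits


intel_1 = "trojan backdoor vulnerability exploitation CVE-2022-26134".split()
intel_2 = "trojan backdoor security vulnerability CVE-2022-30716".split()
print(simhash_similarity(intel_1, intel_2))
```

Because every token contributes equally, the differing CVE numbers are outvoted by the shared words, which is the weakness the keyword extraction step is meant to address.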

To solve this problem, a joint intelligence keyword and type extraction model is proposed. The model extracts intelligence keywords such as IP addresses, attack organizations, IOC intelligence, and vulnerability numbers (CVE) from the intelligence text; for example, CVE-2022-30716 is extracted from "Trojan backdoor, security vulnerability: CVE-2022-30716". While extracting the keywords, the model also determines the intelligence type.
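The effect of keyword extraction on the two example texts can be illustrated with a regular expression. This is only a stand-in: the disclosure uses a trained joint extraction model, not pattern matching, and the pattern below covers CVE numbers alone.

```python
import re

# Stand-in for the learned extractor: matches identifiers of the form CVE-YYYY-NNNN...
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def extract_cve_ids(text: str) -> set:
    """Return the set of CVE identifiers found in a piece of intelligence text."""
    return set(CVE_PATTERN.findall(text))


intel_1 = "Trojan backdoor, vulnerability exploitation: CVE-2022-26134"
intel_2 = "Trojan backdoor, security vulnerability: CVE-2022-30716"

keys_1, keys_2 = extract_cve_ids(intel_1), extract_cve_ids(intel_2)
print(keys_1, keys_2)
print("same keywords:", keys_1 == keys_2)
```

Comparing the extracted keyword sets separates the two texts even though most of their words are shared.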

In an alternative embodiment, FIG. 3 shows a flow chart of a method for determining a first deduplication score of threat intelligence data using a joint extraction model. As shown in FIG. 3, the method may comprise at least the following steps. In step S310, the threat intelligence data is input into the joint extraction model so that the joint extraction model outputs intelligence keywords and intelligence categories.

In an alternative embodiment, FIG. 4 shows a flow diagram of a method of training the joint extraction model. As shown in FIG. 4, the method may comprise at least the following steps. In step S410, character vector training is performed on the training samples using a pre-training algorithm to obtain text vectors, and the text vectors are encoded to obtain encoded vectors.

The joint extraction model applies the idea of joint training. Its input is the text vector of unstructured intelligence, which passes through a word embedding layer, an encoding layer, a conditional random field layer, a sequence prediction layer, and a category prediction layer; its outputs are the start and end positions of the intelligence keywords and the intelligence category.

The sequence labels are [B_T, O_T, E_T, X], which respectively denote a keyword start position, a keyword interval position, a keyword end position, and a non-keyword.

The intelligence type labels are {0: basic intelligence, 1: vulnerability intelligence, 2: asset intelligence, 3: event intelligence, 4: IOC intelligence, 5: attack organization intelligence, 6: other types of intelligence}.

Specifically, character vector training is performed on all labeled data through a threat intelligence vector pre-training algorithm.

The basic idea of a character vector is that each character is represented as a K-dimensional vector. The relationships between characters are learned during character vector training, and the vector form of the vocabulary is convenient for computation. The calculation is:

$$e_i = E[x_i] \tag{1}$$

where $E$ is the character embedding matrix, $x_i$ is the index of the $i$-th character, and $e_i$ is the vector representation of the $i$-th character.

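Furthermore, the text vector is encoded by a bidirectional long short-term memory (BiLSTM) neural network to obtain a deep representation of the text vector. The formulas are as follows:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(e_i) \quad (2)$$

$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(e_i) \quad (3)$$

$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}] \quad (4)$$

where $e_i$ is the character vector, which is encoded by the forward and backward long short-term memory networks to obtain the forward hidden state $\overrightarrow{h_i}$ and the backward hidden state $\overleftarrow{h_i}$. The two vectors are concatenated as the encoded representation of the text, denoted $h_i$.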

In step S420, the coded vector is subjected to sequence tag prediction to obtain key data, and the coded vector is subjected to class prediction to obtain class data.

During sequence label prediction, the influence of the preceding and following context of the encoded vector on label prediction is taken into account, so the encoded vector can be processed by a conditional random field layer.

Further, label prediction is performed on each hidden state of the aligned encoded features by a sequence label prediction method.

In general, the label prediction stage of the model can use a softmax function (a logistic regression model). For each character, the probabilities that the character is a keyword start, a keyword end, a keyword interval or a non-keyword are predicted. The label with the highest probability is selected for each character, and the keywords are extracted by means of the start and end labels to obtain the keyword data.
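
The extraction of keywords from start and end labels can be illustrated with a minimal sketch. It assumes the per-character labels have already been predicted; the function name `extract_keywords` and the sample labels are hypothetical and are not taken from the disclosure.

```python
from typing import List

# Label names follow the sequence labels given in the text.
START, INSIDE, END, OTHER = "B_T", "O_T", "E_T", "X"


def extract_keywords(chars: List[str], labels: List[str]) -> List[str]:
    """Collect character spans that run from a start label to an end label."""
    keywords = []
    current = None  # characters of the span being built, or None if no span is open
    for ch, label in zip(chars, labels):
        if label == START:
            current = [ch]                      # open a new span
        elif label == INSIDE and current is not None:
            current.append(ch)                  # extend the open span
        elif label == END and current is not None:
            current.append(ch)
            keywords.append("".join(current))   # close the span and keep it
            current = None
        else:
            current = None                      # non-keyword or malformed sequence
    return keywords


# Hypothetical per-character predictions for a short intelligence text.
prefix, keyword = "security hole: ", "CVE-2022-30716"
labels = [OTHER] * len(prefix) + [START] + [INSIDE] * (len(keyword) - 2) + [END]
print(extract_keywords(list(prefix + keyword), labels))
```

Spans that never reach an end label are discarded in this sketch; a real decoder could handle such cases differently.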

At the same time, the intelligence category is predicted from the hidden-layer vector $h_i$ by a feedforward neural network, with the following formula:

$$P = \mathrm{softmax}(W h_i + b), \quad i \in \{1, \dots, M\} \quad (5)$$

where $W$ and $b$ are the parameters to be learned, $h_i$ is the hidden-layer vector representation of the threat intelligence text, and $P$ is the probability that the threat intelligence text belongs to each category. The category with the highest probability is taken as the type of the intelligence, which yields the category data.
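
As an illustration of formula (5), the following sketch applies a linear layer and a softmax to a single hidden vector and selects the most probable category. The toy weights, the two-dimensional hidden vector and the function names are assumptions made for the example; only the category tags come from the description.

```python
import math
from typing import List, Tuple

# Category tags as listed in the description.
CATEGORIES = {
    0: "basic information",
    1: "vulnerability information",
    2: "asset intelligence",
    3: "event intelligence",
    4: "IOC intelligence",
    5: "attack organization intelligence",
    6: "other types of intelligence",
}


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_category(h: List[float], W: List[List[float]],
                     b: List[float]) -> Tuple[int, List[float]]:
    """Compute softmax(W h + b) and return the most probable category index."""
    scores = [sum(w * x for w, x in zip(row, h)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy parameters (hypothetical): only the row of category 1 reacts to h[0].
W_demo = [[0.0, 0.0] for _ in CATEGORIES]
W_demo[1] = [2.0, 0.0]
b_demo = [0.0] * len(CATEGORIES)

best, probs = predict_category([1.0, 0.0], W_demo, b_demo)
print(CATEGORIES[best], round(probs[best], 3))
```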

As for the loss function, because keyword extraction and category prediction are learned jointly, the loss function is as follows:

$$L = \alpha\, l_\alpha + \beta\, l_\beta \quad (6)$$

where $l_\alpha$ and $l_\beta$ are the loss functions of the keyword extraction model and the category prediction model respectively, and $\alpha$ and $\beta$ are hyper-parameters to be learned that are iteratively optimized during training.

The intelligence keyword-type joint extraction model is trained on a self-owned, annotated threat intelligence data set until the model performance reaches the expected result.

Further, at prediction time, the input text is a piece of unstructured threat intelligence data to be identified, and the model outputs the extracted intelligence keywords and intelligence category.

For example, when the unstructured threat intelligence data is "Trojan backdoor, security hole: CVE-2022-30716", the output intelligence keyword is "CVE-2022-30716" and the output intelligence type is vulnerability.

The joint extraction model can extract the intelligence type and intelligence keywords from event intelligence and attack organization intelligence, and these features are passed through the structured deduplication algorithm to generate a deduplication confidence score A.

In step S320, the intelligence keywords and the intelligence categories are scored using a structured deduplication algorithm to obtain a first deduplication score.

Specifically, a series of rules for determining the first deduplication score are stored in the structured deduplication algorithm, and the first deduplication score corresponding to the intelligence keyword and the intelligence category can be obtained through the combination of the rules.

For example, the rules cover the case where the intelligence keyword and/or the intelligence type has a corresponding score in the database, the case where it has no corresponding score in the database, and whether the intelligence category is consistent or inconsistent with the threat intelligence data source, and so on.

For unstructured intelligence, keywords alone are not sufficient to achieve deduplication. Consider, for example, the intelligence text "Vim is a cross-platform text editor. Versions of Vim prior to 8.2 have a security hole that stems from a use-after-free problem." This text has no obvious keyword such as an attack source IP address or a CVE number. For intelligence texts without intelligence keywords, deduplication against the local intelligence library therefore has to rely on text similarity calculation.

However, text similarity algorithms based on hash features generally rely on word co-occurrence. They are poorly suited to intelligence data and produce a certain rate of misjudgment, so similarity has to be judged from deep semantic features. Text similarity judgment based on deep learning has developed in recent years, but few similarity measures target textual threat intelligence. Moreover, the platform of this scheme has to judge on the order of millions of important intelligence records every day, and deep-learning similarity calculation alone cannot meet this performance requirement.

Therefore, a Bit-BERT algorithm is used, that is, a binary-quantized BERT (Bidirectional Encoder Representations from Transformers, a pre-trained language representation model based on a bidirectional Transformer encoder). A feedforward neural network learns from the semantic vector to generate a binary bit vector; a coarse-grained similarity calculation yields a coarse-grained candidate set, and a fine-grained similarity calculation over that candidate set yields the deduplication confidence B.

The BERT pre-trained model is a language representation model trained by *** in an unsupervised manner on massive amounts of unlabeled text. It is a general semantic representation model with strong transfer capability: it uses the Transformer as its basic network component, takes the masked bidirectional language model and next sentence prediction as training objectives, and obtains general semantic representations through pre-training.

Whether learning is supervised depends on whether the input data has labels. If the input data is labeled, the learning is supervised; if the input data is unlabeled, the learning is unsupervised.

In contrast to conventional models such as Word2Vec (word to vector), a family of models for generating word vectors, and GloVe (Global Vectors for Word Representation), a word representation tool based on global word-frequency statistics, BERT follows the idea of contextual word representation that has become popular in recent years: the context is taken into account, so the same word has different representations in different contexts. Intuitively, this also matches how human natural language works, since the same word very often has different meanings in different situations.

In an alternative embodiment, the semantic feature vectors include a high-level semantic vector and a medium-level semantic vector.

The threat information data is input into a language characterization model of full binary quantization, so that the language characterization model outputs a high-level semantic vector and a medium-level semantic vector.

The unstructured threat intelligence text is encoded as follows: a BERT pre-trained language model encodes the character vectors, and a max-pooling layer produces the text vector. The text vector can then be used to compute the similarity between intelligence texts, typically with cosine similarity. However, because the volume of threat intelligence data is large, this approach cannot meet the performance requirement. A bit encoding layer is therefore used to generate a representative hash value for the text vector, and a binary code representation learning layer is introduced.

Specifically, a layer for hash representation learning is added between the output layer and the semantic hidden layer. This layer is fully connected and uses a sigmoid activation function (an S-shaped curve, used as the activation function of neural networks and in logistic regression) to map the floating-point hidden representation in each dimension to a Boolean binary representation [0, 1]. Through training, a binary encoding (Bit Encoding) carrying medium-level semantic features and a high-level semantic representation (Semantic Hidden Layer) are generated. The specific formulas are as follows:

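$$h_{bert} = \mathrm{BERT}(x) \quad (7)$$

$$h_{semantic} = \mathrm{MaxPool}(h_{bert}) \quad (8)$$

$$h_{bit} = \mathrm{sigmoid}(W h_{semantic} + B) \quad (9)$$

where $x$ is the input threat intelligence text, BERT is a pre-trained language model used for text feature extraction and representation, MaxPool is a max-pooling layer that extracts the important components of the features, $W$ and $B$ are the weight and bias of the fully connected layer, $h_{semantic}$ is the 256-dimensional high-level semantic vector extracted by the model, and $h_{bit}$ is the 64-dimensional medium-level semantic vector obtained through the fully connected layer and the activation function.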
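
The pooling and bit encoding steps can be sketched in plain Python. The sketch assumes that token vectors from the language model are already available, uses tiny toy dimensions instead of the 256 and 64 dimensions named above, and assumes that bits are obtained by thresholding the sigmoid output at 0.5; the threshold and all names are illustrative only.

```python
import math
from typing import List


def max_pool(token_vectors: List[List[float]]) -> List[float]:
    """Element-wise maximum over all token vectors (one value per dimension)."""
    return [max(column) for column in zip(*token_vectors)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bit_encode(h_semantic: List[float], W: List[List[float]],
               B: List[float], threshold: float = 0.5) -> List[int]:
    """Fully connected layer plus sigmoid, then thresholding into bits."""
    activations = [sigmoid(sum(w * x for w, x in zip(row, h_semantic)) + bias)
                   for row, bias in zip(W, B)]
    return [1 if a >= threshold else 0 for a in activations]


# Toy data (hypothetical): two token vectors of dimension 3, encoded into 4 bits.
tokens = [[0.1, -0.4, 0.3],
          [0.5, 0.2, -0.1]]
W_toy = [[1.0, 0.0, 0.0],
         [-1.0, 0.0, 0.0],
         [0.0, 1.0, -1.0],
         [0.0, -1.0, -1.0]]
B_toy = [0.0, 0.0, 0.0, 0.0]

h_sem = max_pool(tokens)               # plays the role of the high-level semantic vector
bits = bit_encode(h_sem, W_toy, B_toy)  # plays the role of the medium-level bit vector
print(h_sem, bits)
```

Packing the resulting bits into a single integer would allow the distance between two codes to be computed with one exclusive-OR.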

After the semantic feature vector is obtained, the threat intelligence data can be subjected to deduplication processing by utilizing the semantic feature vector.

In an alternative embodiment, Fig. 5 shows a flow chart of a method for deduplicating threat intelligence data. As shown in Fig. 5, the method may include at least the following steps. In step S510, the stored intelligence data in the intelligence database is acquired, and a first distance calculation is performed between the medium-level semantic vector and the stored intelligence data to determine an intelligence candidate set.

For each piece of intelligence data, the high-performance storage database Redis stores its medium-level semantic features and high-level semantic features, so the stored intelligence data can be acquired from the intelligence database.

For threat intelligence data to be checked, a medium-level semantic vector and a high-level semantic vector are obtained after the data passes through the Bit-BERT model. A Hamming distance calculation is then performed between the medium-level semantic vector and the stored intelligence data, and the intelligence data whose similarity is higher than a preset threshold is placed into the intelligence candidate set.

In step S520, a second distance calculation is performed between the high-level semantic vector and the stored intelligence data in the intelligence candidate set to determine a second deduplication score, and the first deduplication score and the second deduplication score are combined to obtain a duplicate confidence.

After the intelligence candidate set is determined, the similarity between the corresponding high-level semantic vector and the stored intelligence data can be calculated by cosine similarity. A similarity score above the corresponding threshold is taken as the second deduplication score; a similarity score below the threshold means the data is judged not to be a duplicate.

Because the Hamming distance is computed with an exclusive-OR operation, the calculation is highly efficient. The fast coarse-grained calculation narrows the comparison range, and the fine-grained calculation produces a similarity score with higher confidence, so the method achieves both accuracy and computational efficiency.
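
The coarse-then-fine search can be sketched as follows. A dictionary stands in for the Redis store, bit vectors are held as integers so that the Hamming distance is a single exclusive-OR followed by a bit count, and the two thresholds are hypothetical values. The sketch keeps candidates whose Hamming distance is at most `max_hamming`, which is one way to express "similarity above a threshold".

```python
import math
from typing import Dict, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits, obtained from one XOR."""
    return bin(a ^ b).count("1")


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def second_dedup_score(query_bits: int, query_vec: List[float],
                       store: Dict[str, Tuple[int, List[float]]],
                       max_hamming: int = 1, min_cosine: float = 0.9) -> float:
    """Return the best cosine similarity among coarse candidates, or 0.0."""
    # Coarse stage: cheap Hamming filter on the medium-level bit vectors.
    candidates = [key for key, (bits, _) in store.items()
                  if hamming_distance(query_bits, bits) <= max_hamming]
    # Fine stage: cosine similarity on the high-level vectors of the candidates.
    best = 0.0
    for key in candidates:
        sim = cosine_similarity(query_vec, store[key][1])
        if sim >= min_cosine:
            best = max(best, sim)
    return best


# Toy store (hypothetical): id -> (medium-level bits, high-level vector).
store = {
    "intel-a": (0b1011, [1.0, 0.0]),
    "intel-b": (0b0100, [0.0, 1.0]),
}
print(second_dedup_score(0b1010, [0.9, 0.1], store))
```

Only the entries that survive the Hamming filter pay the cost of the floating-point cosine computation, which is the source of the efficiency gain described above.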

After the first deduplication score A and the second deduplication score B are calculated, the final duplicate confidence may be calculated as: duplicate confidence = 0.6 × confidence A + 0.4 × confidence B.

In step S530, the threat intelligence data is deduplicated according to the repetition confidence.

After calculating the duplicate confidence level, the duplicate confidence level may be compared to a corresponding threshold to deduplicate threat intelligence data. Typically, the threshold may be set to 0.6.

When the repetition confidence is greater than or equal to 0.6, determining that the threat intelligence data is repetitive; when the repetition confidence is less than 0.6, the threat intelligence data may be stored and the storage format may be < medium level semantic vector, high level semantic vector >.
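
The weighting and threshold comparison amount to a few lines of arithmetic. The sketch below uses the weights 0.6 and 0.4 and the threshold 0.6 given in the text; the function names are hypothetical.

```python
def duplicate_confidence(score_a: float, score_b: float,
                         weight_a: float = 0.6, weight_b: float = 0.4) -> float:
    """Weighted combination of the first and second deduplication scores."""
    return weight_a * score_a + weight_b * score_b


def is_duplicate(score_a: float, score_b: float, threshold: float = 0.6) -> bool:
    """Data is treated as a duplicate when the confidence reaches the threshold."""
    return duplicate_confidence(score_a, score_b) >= threshold


print(is_duplicate(1.0, 0.0))  # keyword match only
print(is_duplicate(0.0, 1.0))  # semantic match only
```

With these values a keyword match alone (A = 1, B = 0) gives 0.6 and reaches the threshold, whereas a semantic match alone (A = 0, B = 1) gives 0.4 and does not.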

In this embodiment, the fully binary-quantized language representation model enables both similarity calculation and vector hash learning for unstructured threat intelligence data, which effectively improves the detection rate of duplicate texts. The similarity measurement method, which works at two levels of granularity with medium-level and high-level semantic vectors, effectively improves the efficiency of similarity calculation.

In step S130, when the data type is the structured type, data compression processing is performed on the data type, and the compressed threat intelligence data is stored to perform deduplication processing.

In the exemplary embodiment of the present disclosure, when the threat intelligence data is structured intelligence data, the deduplication service for structured intelligence data often uses a high-performance storage database such as Redis because of its high concurrency. However, because the data volume is huge, loading all of the data into Redis can cause problems such as memory overflow. The data therefore needs to be compressed for the deduplication service, which reduces the system overhead of the service.

In an alternative embodiment, fig. 6 shows a flow chart of a method for performing data compression processing on data types, and as shown in fig. 6, the method may at least include the following steps: in step S610, the data type is encoded to obtain a first bit vector, and the key data is hashed to obtain a second bit vector.

And processing the threat information data of the structured type through bit hash. Specifically, a 4-bit vector is generated according to the data type of threat intelligence data to be used as a first bit vector, and a 60-bit vector is generated according to key data of threat intelligence data through a hash algorithm to be used as a second bit vector.

In step S620, the first bit vector and the second bit vector are calculated to obtain a target bit vector, so as to obtain compressed threat information data.

After the first bit vector and the second bit vector are generated, the first bit vector and the second bit vector can be weighted and summed to obtain a 64-bit vector as a target bit vector to convert each feature into a 64-bit hash vector, for example:

1 0 0 1 0 1 0 0 1 0 … 0 1 0 0    feature value 1

1 1 0 1 0 1 1 0 1 0 … 0 1 0 0    feature value 2

1 1 0 1 0 0 0 0 1 0 … 0 1 0 0    feature value 3

The weight corresponding to the first bit vector is the frequency of the corresponding content in the threat intelligence data, and the weight corresponding to the second bit vector is the frequency of the corresponding content in the threat intelligence data.
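
One possible way to build a 64-bit target value from a 4-bit type code and a 60-bit key hash is sketched below. The sketch makes two assumptions that are not stated in the disclosure: SHA-256 is used as the hash algorithm, and the two parts are combined by simple concatenation (type bits in the top four positions) instead of the weighted summation described above.

```python
import hashlib

TYPE_BITS = 4   # width of the first bit vector (data type)
KEY_BITS = 60   # width of the second bit vector (key data hash)


def compress(type_code: int, key_data: str) -> int:
    """Pack a 4-bit type code and a 60-bit hash of the key data into 64 bits."""
    if not 0 <= type_code < (1 << TYPE_BITS):
        raise ValueError("type_code must fit into 4 bits")
    digest = hashlib.sha256(key_data.encode("utf-8")).digest()
    # Take the first 8 bytes (64 bits) and drop the lowest 4 bits to keep 60.
    key_hash = int.from_bytes(digest[:8], "big") >> (64 - KEY_BITS)
    return (type_code << KEY_BITS) | key_hash


# Hypothetical usage: type code 1 stands for vulnerability intelligence.
value = compress(1, "CVE-2022-30716")
print(format(value, "064b"))
```

Because the result is a fixed-width integer, each record costs 8 bytes regardless of the length of the original key data, which is the memory saving the compression step aims at.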

In the present exemplary embodiment, the deduplication process uses Redis as a temporary database, and Redis keeps data in memory. When massive data is deduplicated, converting the keywords into bit features therefore effectively reduces the resources required by the system and speeds up the retrieval of duplicate data.

After the compressed threat intelligence data is obtained, the threat intelligence data can be stored, so that the retrieval speed is increased and the resource consumption is reduced during the duplicate removal process.

The following describes the data deduplication method in the embodiment of the present disclosure in detail in connection with an application scenario.

Threat intelligence refers to an intelligence knowledge base that contains multiple types, multiple dimensions. The threat intelligence may include vulnerability intelligence, asset intelligence, IOC intelligence, event intelligence, etc.

Fig. 7 shows a schematic structural diagram of a data deduplication system in an application scenario. As shown in Fig. 7, the system includes a data processing module, a data compression module, an intelligence keyword-type joint extraction model, a Bit-BERT semantic encoding model and a data storage module.

Fig. 8 shows a schematic diagram of a data processing module, and as shown in fig. 8, the data processing module is composed of three parts of data extraction, data cleaning and data classification.

The basic information includes common objects in the network, such as an IP (Internet Protocol) address (192.168.0.X), a domain name (www.xxxxx.com), a mailbox address, a URL (http://www.xxxxxx.com) and a certificate.

Vulnerability intelligence, for example from national vulnerability databases (e.g., NVD, CNVD, CNNVD) or Common Vulnerabilities and Exposures (CVE), mainly describes the name, description, type, hazard score, implementation principle, impact, patch measures, etc. of a vulnerability.

IOC intelligence refers to threat indicators that describe the detection features of network attacks, such as the attack source IP, the domain name, the MD5 hash of a malicious file, traffic characteristics, or the mailbox to which a phishing mail belongs. Security personnel can use IOC intelligence for risk assessment, security hardening and the like.

The data extraction part performs data standardization processing on threat information data of different information sources. For example, the data normalization process may be forming JSON format or the like, which is not particularly limited by the present exemplary embodiment.

In the data cleaning part, intelligence from different sources varies in quality and contains characters such as the line feed "\n" and the tab "\t". Data cleaning therefore deletes or replaces such characters in the key data and removes sensitive words, stop words and the like, so that the cleaned key data meets the requirements of the subsequent processing flow.
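
A minimal cleaning step might look like the following sketch. The stop-word list and the function name `clean_text` are hypothetical; the disclosure does not specify which words are removed or how replacement is performed.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of"}


def clean_text(raw: str, stop_words=STOP_WORDS) -> str:
    """Replace line feeds and tabs with spaces, then drop stop words."""
    text = re.sub(r"[\n\t\r]+", " ", raw)
    tokens = [tok for tok in text.split() if tok.lower() not in stop_words]
    return " ".join(tokens)


print(clean_text("Trojan\tbackdoor,\nthe security hole: CVE-2022-30716"))
```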

The data classification part serves the threat intelligence deduplication flow. According to the threat intelligence data type, the cleaned key data is divided as follows: attack organization intelligence, event intelligence, reports and the like are classified as unstructured intelligence data, while basic information, vulnerability intelligence, IOC intelligence and the like are classified as structured intelligence data.
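
The split into structured and unstructured data can be expressed as a simple lookup. The type names below mirror the examples in the text; raising an error for an unlisted type is an assumption of this sketch, since the disclosure does not say how other types are handled.

```python
STRUCTURED_TYPES = {"basic information", "vulnerability intelligence", "IOC intelligence"}
UNSTRUCTURED_TYPES = {"attack organization intelligence", "event intelligence", "report"}


def data_type(intel_type: str) -> str:
    """Map an intelligence type to 'structured' or 'unstructured'."""
    if intel_type in STRUCTURED_TYPES:
        return "structured"
    if intel_type in UNSTRUCTURED_TYPES:
        return "unstructured"
    raise ValueError(f"unknown intelligence type: {intel_type!r}")


print(data_type("IOC intelligence"), data_type("event intelligence"))
```

The returned value decides which branch runs next: bit-hash compression for structured data, or keyword extraction plus semantic similarity for unstructured data.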

The data type of threat information data can be determined through preprocessing, a data base and theoretical support are provided for subsequent deduplication processing, and the accuracy and timeliness of data deduplication are ensured.

When the threat intelligence data is unstructured intelligence data, deduplication is divided into two steps. First, keywords are extracted from the intelligence text and, at the same time, its intelligence type is judged; the extracted intelligence keywords and intelligence type then go through the structured deduplication flow to obtain a duplicate confidence A. Next, the similarity between the intelligence text and the existing texts in the database is calculated to obtain a duplicate confidence B. The two scores are weighted to obtain a final score, which determines whether the text is a duplicate.

Observation of a large amount of data shows that threat intelligence texts can be similar while their intelligence keywords differ, for example: intelligence 1: "Trojan backdoor, vulnerability exploitation: CVE-2022-26134"; intelligence 2: "Trojan backdoor, security hole: CVE-2022-30716".

With a text similarity method based on word co-occurrence, SimHash yields a similarity of 62.5% for these two texts because of co-occurring words such as "Trojan backdoor" and "vulnerability". However, the CVE vulnerability numbers, which are the keywords of the two texts, are different, so these are clearly two different intelligence texts. The intelligence keyword-type joint extraction model is proposed to address this problem. It extracts intelligence keywords such as IP addresses, attack organizations, IOC intelligence and vulnerability numbers (CVE) from the intelligence text; for example, from "Trojan backdoor, security hole: CVE-2022-30716" it extracts CVE-2022-30716. The model also judges the intelligence type while extracting the intelligence keywords.

Fig. 9 shows a schematic structural diagram of an information keyword-type joint extraction model, and as shown in fig. 9, the joint extraction model uses the idea of joint training to input an information text vector which is unstructured information, and finally outputs the result of the initial and final positions of the information keyword and the information category through a word embedding layer, a coding layer, a conditional random field layer, a sequence prediction layer and a category prediction layer.

The basic idea of the character vector is that each character is characterized as a K-dimensional vector, the relation between the characters can be learned in the character vector training process, and meanwhile, the vocabulary representation mode of the vector form is beneficial to calculation. The specific calculation formula is shown in (1).

Further, the text vector is encoded by the bi-directional long-short-term memory neural network to obtain a deep representation of the text vector, and the formulas are shown in (2) - (4).

In general, the model label prediction phase can be processed using a softmax function. For each character, predicting the probability of the character as a keyword start, a keyword end, a keyword interval and a non-keyword, finally selecting the item with the highest probability as the label of each character, and extracting the keyword through the start label and the end label to obtain keyword data.

At the same time pass through hidden layer vector h _i The intelligence type is predicted by a feed-forward neural network, and the formula is shown as (5).

In terms of the loss function, since the keyword extraction and the class prediction are used in combination learning, the loss function formula is shown as formula (6).

Scoring the intelligence keywords and the intelligence categories using a structured deduplication algorithm to obtain a first deduplication score.

For unstructured intelligence, keywords alone are not sufficient to achieve deduplication. Consider, for example, the intelligence text "Vim is a cross-platform text editor. Versions of Vim prior to 8.2 have a security hole that stems from a use-after-free problem." This text has no obvious keyword such as an attack source IP address or a CVE number. For intelligence texts without intelligence keywords, deduplication against the local intelligence library therefore has to rely on text similarity calculation.

Therefore, a Bit-BERT algorithm is used: a feedforward neural network learns from the semantic vector to generate a binary bit vector, a coarse-grained similarity calculation yields a coarse-grained candidate set, and a fine-grained similarity calculation over that candidate set yields the deduplication confidence B.

Fig. 10 shows a schematic structural diagram of the fully binary-quantized language representation model. As shown in Fig. 10, the unstructured threat intelligence text is encoded: a BERT pre-trained language model encodes the character vectors, and a max-pooling layer produces the text vector. The text vector can be used to compute the similarity between intelligence texts, typically with cosine similarity, but because the volume of threat intelligence data is large, this approach cannot meet the performance requirement. A bit encoding layer is therefore used to generate a representative hash value for the text vector, and a binary code representation learning layer is introduced. Specifically, a layer for hash representation learning is added between the output layer and the semantic hidden layer; it is fully connected and uses a sigmoid activation function to map the floating-point hidden representation in each dimension to a binary representation [0, 1]. Through training, a binary encoding carrying medium-level semantic features and a high-level semantic representation are generated, as shown in formulas (7) to (9).

The fully binary-quantized language representation model enables both similarity calculation and vector hash learning for unstructured threat intelligence data, which effectively improves the detection rate of duplicate texts. The similarity measurement method, which works at two levels of granularity with medium-level and high-level semantic vectors, effectively improves the efficiency of similarity calculation.

When the threat intelligence data is structured intelligence data, the deduplication service for structured intelligence data often uses a high-performance storage database such as Redis because of its high concurrency. However, because the data volume is huge, loading all of the data into Redis can cause problems such as memory overflow. The data therefore needs to be compressed for the deduplication service, which reduces the system overhead of the service.

Fig. 11 shows a schematic diagram of a data compression module, and as shown in fig. 11, threat intelligence data of a structured type is processed through bit hashing. Specifically, a 4-bit vector is generated according to the data type of threat intelligence data to be used as a first bit vector, and a 60-bit vector is generated according to key data of threat intelligence data through a hash algorithm to be used as a second bit vector.

After the first bit vector and the second bit vector are generated, the first bit vector and the second bit vector can be weighted and summed to obtain a 64-bit vector as a target bit vector to convert each feature into a 64-bit hash vector.

The deduplication process uses Redis as a temporary database, and Redis keeps data in memory. When massive data is deduplicated, converting the keywords into bit features therefore effectively reduces the resources required by the system and speeds up the retrieval of duplicate data.

Based on this, for structured threat intelligence data, intelligence keywords, such as IP intelligence, vulnerability numbers, etc., may be transferred into bit vectors using a bit hashing algorithm, and the storage format is < bit vector, intelligence ID >.

For unstructured threat information data, extracting information keywords and information types through an information keyword-type joint extraction model, and sending the keywords and the information types into a structured information database, wherein the storage format is < bit vector, information ID >. If the structured database has the element, generating a duplicate confidence A (default value of 1); and if not, storing the data in a database.
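
The lookup against the structured store can be sketched with a dictionary standing in for the database, keyed by bit vector and holding the intelligence ID. Returning 0.0 when the element is absent is an assumption of this sketch; the text only states that the confidence defaults to 1 when the element exists and that the data is stored otherwise.

```python
from typing import Dict


def first_dedup_score(bit_vector: int, intel_id: str, store: Dict[int, str]) -> float:
    """Return confidence A = 1.0 if the bit vector is known, else store it."""
    if bit_vector in store:
        return 1.0
    store[bit_vector] = intel_id  # storage format: <bit vector, intelligence ID>
    return 0.0


# Hypothetical usage with an in-memory stand-in for the structured database.
structured_store: Dict[int, str] = {}
print(first_dedup_score(0x1F3A, "intel-001", structured_store))  # first sighting
print(first_dedup_score(0x1F3A, "intel-002", structured_store))  # repeat
```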

Further, a middle-level semantic vector (Bit vector) and a high-level semantic vector (deep semantic vector) are generated through a Bit-BERT model, a candidate set is generated through a middle-level semantic vector retrieval semantic storage module (Redis), and finally, the maximum approximation value is calculated through the high-level semantic vector to generate the repetition confidence coefficient B (the default value is 1).

Further, the final duplicate confidence is obtained as 0.6 × confidence A + 0.4 × confidence B. If the value is greater than or equal to 0.6, the data is judged to be a duplicate; otherwise, the data is written into the semantic storage module, and the storage format is <medium-level semantic vector, high-level semantic vector>.

If the threat information data newly added every day is judged to be repeated, the threat information data is not written into a system background database; otherwise, the processes of database writing, subsequent information data fusion, aging and the like are carried out.

The data deduplication method in the application scene determines the data type of threat information data, and provides data basis and theoretical support for providing different deduplication modes for different types of threat information data. On one hand, the threat information data is subjected to duplication elimination processing according to the semantic feature vector, so that the problems that the memory occupied in the duplication elimination process of the threat information data is overlarge and the processing flow is time-consuming are solved, the problem that text information cannot be captured by an original duplication elimination method is effectively solved, and meanwhile, the searching efficiency of unstructured threat information data is improved. On the other hand, the data type of the structured threat information data is subjected to data compression processing, so that the problem that excessive system resources are consumed due to the duplication removal process of massive threat information data is solved, and the resource consumption caused by storing the threat information data is reduced.

Fig. 12 shows a schematic structural diagram of a data deduplication apparatus, and as shown in fig. 12, the data deduplication apparatus 1200 may include: a data acquisition module 1210, a first deduplication module 1220, and a second deduplication module 1230. Wherein:

a data acquisition module 1210 configured to acquire threat intelligence data and to pre-process the threat intelligence data to determine a data type;

The first deduplication module 1220 is configured to perform text similarity calculation on the threat intelligence data to obtain a semantic feature vector when the data type is unstructured, and perform deduplication processing on the threat intelligence data according to the semantic feature vector; or

And a second deduplication module 1230 configured to perform data compression processing on the data type when the data type is a structured type, and store the threat intelligence data after compression to perform deduplication processing.

The details of the data deduplication device 1200 are described in detail in the corresponding data deduplication method, and thus are not described herein.

It should be noted that although several modules or units of the data deduplication apparatus 1200 are mentioned in the above detailed description, this partitioning is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.

In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

An electronic device 1300 according to such an embodiment of the invention is described below with reference to fig. 13. The electronic device 1300 shown in fig. 13 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.

As shown in fig. 13, the electronic device 1300 is embodied in the form of a general purpose computing device. The components of the electronic device 1300 may include, but are not limited to: at least one processing unit 1310, at least one storage unit 1320, a bus 1330 connecting the different system components (including the storage unit 1320 and the processing unit 1310), and a display unit 1340.

The storage unit 1320 stores program code that is executable by the processing unit 1310, such that the processing unit 1310 performs the steps according to the various exemplary embodiments of the present invention described in the "exemplary method" section of this specification.

The storage unit 1320 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323.

The storage unit 1320 may also include a program/utility 1324 having a set (at least one) of program modules 1325, such program modules 1325 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.

The bus 1330 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, and a processing unit bus or local bus using any of a variety of bus architectures.

The electronic device 1300 may also communicate with one or more external devices 1500 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1300, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 1300 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1350. The electronic device 1300 may also communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network (for example, the Internet), through a network adapter 1360. As shown, the network adapter 1360 communicates with the other modules of the electronic device 1300 over the bus 1330. It should be appreciated that, although not shown, other hardware and/or software modules may be used in connection with the electronic device 1300, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.

From the above description of the embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to perform the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when said program product is run on the terminal device.

Referring to fig. 14, a program product 1400 for implementing the above-described method according to an embodiment of the present invention is described. The program product may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto. In this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, C++ or the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

Claims

1. A method of deduplication of data, the method comprising:

when the data type is unstructured, performing text similarity calculation on the threat intelligence data to obtain a semantic feature vector, and performing deduplication processing on the threat intelligence data according to the semantic feature vector, wherein the semantic feature vector comprises a high-level semantic vector and a medium-level semantic vector;

when the data type is structured, performing data compression processing, and storing the compressed threat intelligence data so as to perform deduplication processing;

The step of performing deduplication processing on the threat intelligence data according to the semantic feature vector comprises the following steps:

performing a second distance calculation on the high-level semantic vector and the stored intelligence data in the intelligence candidate set to determine a second deduplication score, and performing a calculation on a first deduplication score and the second deduplication score to obtain a duplicate confidence, wherein the first deduplication score is obtained by scoring intelligence keywords and intelligence categories obtained based on the threat intelligence data;
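By way of illustration only, and without limiting the claim, the combination of the two scores into a duplicate confidence may be sketched as follows. The claim text does not specify the distance measure or the combining calculation; this sketch assumes cosine similarity as the second distance calculation and a weighted sum as the combination. The function names and the `weight` parameter are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def second_score(high_level_vector, candidate_vectors):
    """Assumed second score: best similarity against the candidate set."""
    return max(
        (cosine_similarity(high_level_vector, c) for c in candidate_vectors),
        default=0.0,  # an empty candidate set yields no evidence of duplication
    )


def duplicate_confidence(first, second, weight=0.5):
    """Assumed combination: weighted sum of the first and second scores."""
    return weight * first + (1.0 - weight) * second
```

Under these assumptions, a record whose keyword and category score is 0.5 and whose high-level vector exactly matches a candidate receives a confidence of 0.75 at the default weight; a threshold on this value would then decide whether the record is discarded.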

2. The method of claim 1, wherein preprocessing the threat intelligence data to determine a data type comprises:

3. The method for deduplication of data according to claim 2, wherein the performing data compression processing on the data type includes:

4. The data deduplication method of claim 1, wherein, before performing the text similarity calculation on the threat intelligence data to obtain the semantic feature vector, the method further comprises:

5. The method of data deduplication according to claim 4, wherein the joint extraction model is trained by:

6. The method of data deduplication according to claim 4, wherein performing the text similarity calculation on the threat intelligence data to obtain the semantic feature vector comprises:

7. A data deduplication apparatus, comprising:

the first deduplication module is configured to, when the data type is unstructured, perform text similarity calculation on the threat intelligence data to obtain a semantic feature vector, and perform deduplication processing on the threat intelligence data according to the semantic feature vector, wherein the semantic feature vector comprises a high-level semantic vector and a medium-level semantic vector;

the second deduplication module is configured to, when the data type is structured, perform data compression processing, and store the compressed threat intelligence data so as to perform deduplication processing;

the first deduplication module is further configured to acquire stored intelligence data from an intelligence database, and perform a first distance calculation on the medium-level semantic vector and the stored intelligence data to determine an intelligence candidate set;
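By way of illustration only, and without limiting the claim, the determination of the candidate set from the medium-level semantic vector may be sketched as follows. The claim text does not name the distance measure or the selection rule; this sketch assumes Euclidean distance and a fixed distance threshold. The function name `build_candidate_set`, the `max_distance` parameter, and the representation of the database as a mapping from identifier to vector are hypothetical.

```python
import math


def build_candidate_set(mid_level_vector, stored_vectors, max_distance=0.5):
    """Return identifiers of stored items close to the query vector.

    stored_vectors maps an item identifier to its medium-level vector.
    Assumption: items within max_distance (Euclidean) become candidates.
    """
    return [
        item_id
        for item_id, vector in stored_vectors.items()
        if math.dist(mid_level_vector, vector) <= max_distance
    ]
```

The purpose of such a coarse first pass is to restrict the more detailed comparison on the high-level semantic vector to a small number of stored items instead of the whole database.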

8. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the data deduplication method of any of claims 1 to 6.

9. An electronic device, comprising:

a processor;

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the data deduplication method of any of claims 1-6 via execution of the executable instructions.