CN111209735A - Document sensitivity calculation method and device

Info

Publication number: CN111209735A
Application number: CN202010004721.5A
Authority: CN
Inventors: 蒋仕宝
Original assignee: GCI Science and Technology Co Ltd
Current assignee: GCI Science and Technology Co Ltd
Priority date: 2020-01-03
Filing date: 2020-01-03
Publication date: 2020-05-29
Anticipated expiration: 2040-01-03
Also published as: CN111209735B

Abstract

The invention discloses a method for calculating document sensitivity, comprising: obtaining the value elements of a document to be identified; vectorizing each value element; splicing the value element vectors of those value elements that meet a preset feature contribution degree threshold to obtain a target value element vector of the document to be identified; calculating the similarity entropy of the document to be identified from the similarity between the target value element vector and a preset value element vector of a preset document; and calculating the sensitivity of the document to be identified from that similarity entropy. An embodiment of the invention also discloses a corresponding device for calculating document sensitivity. By identifying the value elements of a document and applying a feature vector similarity method, the embodiments identify and analyse sensitive data, calculate document sensitivity with improved accuracy, and keep the calculation simple.

Description

Document sensitivity calculation method and device

Technical Field

The invention relates to the technical field of information security, in particular to a method and a device for calculating document sensitivity.

Background

The confidentiality, integrity and availability of data bear on national security, the core competitiveness of enterprises and personal privacy, and data security is receiving increasing attention as an important subject in the field of information security. The wide use of e-mail, instant messaging and removable storage media has improved working efficiency, but it has also inevitably widened the channels through which data can leak and heightened users' concern about the security of stored data. Many scholars at home and abroad have studied methods for the secure storage of sensitive data, for example a hierarchical model based on data security requirements proposed for sensitive data attributes, and the identification and grading of sensitive attributes in structured data sets.

However, in the process of implementing the invention, the inventor found that the prior art has at least the following problem: existing sensitive data classification methods set the sensitive attributes of data in advance and are mostly designed for structured data sets, so they are not suitable for identifying and classifying the sensitivity of the various semi-structured or unstructured data found in cloud computing.

Disclosure of Invention

The embodiment of the invention aims to provide a method and a device for calculating the document sensitivity, which realize the calculation of the document sensitivity by identifying the value elements of a document, effectively improve the accuracy of the calculation of the document sensitivity and have a simple and convenient calculation method.

In order to achieve the above object, an embodiment of the present invention provides a method for calculating document sensitivity, including:

obtaining value elements of a document to be identified, and vectorizing each value element to obtain a corresponding value element vector; wherein the value element comprises metadata of the document to be identified;

carrying out vector splicing on value element vectors corresponding to the value elements which accord with a preset feature contribution degree threshold value to obtain a target value element vector of the document to be identified;

calculating a similarity entropy of the document to be identified according to the similarity of the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity;

calculating the sensitivity of the document to be identified according to the similarity entropy of the document to be identified; wherein the sensitivity of the document to be identified is in negative correlation with the similarity entropy.

As an improvement of the above scheme, the calculating a similarity entropy of the document to be recognized according to the similarity between the target value element vector and a preset value element vector of a preset document specifically includes:

calculating the similarity between the target value element vector of the document to be identified and a preset value element vector of each preset document in a preset document set, and taking the sum of the similarities between all the target value element vectors and the preset value element vector of each preset document as the similarity of the document to be identified;

calculating the similarity of each document in a platform data set and a preset value element vector of each preset document, and taking the sum of the similarities of all documents in the platform data set and the preset value element vector of each preset document as the similarity of the platform data set; wherein the platform data set comprises the document to be identified;

according to the similarity of the document to be recognized and the similarity of the platform data set, calculating the similarity entropy of the document to be recognized through the following calculation formula:

[formula omitted in source]

wherein H(d_r) is the similarity entropy of the document to be identified; the formula takes as inputs the similarity of the document to be identified and the similarity of the platform data set; m is the number of the preset documents, and k is the number of documents in the platform data set.

As an improvement of the above solution, the method for calculating document sensitivity further includes:

calculating the use entropy of the document to be identified according to the user access times of the document to be identified; wherein the usage entropy is positively correlated with the user access times of the document to be identified;

calculating the quality entropy of the document to be identified according to the source credibility of the document to be identified; wherein the quality entropy is positively correlated with the source credibility of the document to be identified;

then, the calculating the sensitivity of the document to be recognized according to the similarity entropy of the document to be recognized specifically includes:

calculating the product of the similarity entropy, the use entropy and the quality entropy of the document to be identified as the combined entropy of the document to be identified;

calculating the sensitivity of the document to be identified according to the combined entropy of the document to be identified; wherein the sensitivity of the document to be identified is inversely related to the combined entropy.

As an improvement of the above scheme, the calculating the use entropy of the document to be recognized according to the user access times of the document to be recognized specifically includes:

acquiring the times of each user accessing the document to be identified, and calculating the sum of the times of all the users accessing the document to be identified as the user access times of the document to be identified;

acquiring the times of each user respectively accessing each document in a platform data set, and calculating the sum of the times of all users respectively accessing each document in the platform data set as the user access times of the platform data set; wherein the platform data set comprises the document to be identified;

according to the user access times of the document to be recognized and the user access times of the platform data set, calculating the use entropy of the document to be recognized through the following calculation formula:

[formula omitted in source]

wherein H(c_r) is the usage entropy of the document to be identified; the formula takes as inputs the number of user accesses of the document to be identified and the number of user accesses of the platform data set; n is the number of users, and k is the number of documents in the platform data set.

As an improvement of the above scheme, the calculating the quality entropy of the document to be recognized according to the source reliability of the document to be recognized specifically includes:

acquiring source credibility of the document to be identified under each credibility factor, and taking the sum of the source credibility of the document to be identified under all credibility factors as the source credibility of the document to be identified; wherein the credibility factors comprise uploading level, document source, document creation or modification date;

acquiring the source credibility of each document in a platform data set under each credibility factor, and taking the sum of the source credibility of all documents in the platform data set under each credibility factor as the source credibility of the platform data set; wherein the platform data set comprises the document to be identified;

according to the source credibility of the document to be identified and the source credibility of the platform data set, calculating the quality entropy of the document to be identified through the following calculation formula:

[formula omitted in source]

wherein H(N_r) is the quality entropy of the document to be identified; the formula takes as inputs the source credibility of the document to be identified and the source credibility of the platform data set; l is the number of credibility factors, and k is the number of documents in the platform data set.

As an improvement of the above scheme, the calculating the sensitivity of the document to be recognized according to the combined entropy of the document to be recognized specifically includes:

according to the combined entropy of the documents to be recognized, calculating the sensitivity of the documents to be recognized through the following calculation formula:

[formula omitted in source]

wherein C(r) is the sensitivity of the document to be identified; H(r) is the combined entropy of the document to be identified; the formula also uses the sum of the combined entropies of all documents in the platform data set; and k is the number of documents in the platform data set.
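The combination step can be illustrated with a short Python sketch. The product of the three entropies follows the text above; the mapping from combined entropy to sensitivity is an assumption, because the formula itself is not available here. The form C(r) = 1 - H(r) / sum of H(i) is chosen only because it decreases as H(r) grows and uses the platform-wide sum mentioned in the definitions. The function names and sample values are hypothetical.

```python
def combined_entropy(similarity_h, usage_h, quality_h):
    """Combined entropy: product of the three per-document entropies."""
    return similarity_h * usage_h * quality_h


def sensitivities(combined):
    """Map the combined entropies H(i) of the k platform documents to C(i).

    ASSUMPTION: C(r) = 1 - H(r) / sum_i H(i). The source only states that
    C(r) decreases as H(r) increases and that the sum over the platform
    data set appears in the formula.
    """
    total = sum(combined)
    if total == 0:
        return [1.0 for _ in combined]
    return [1.0 - h / total for h in combined]


# Hypothetical (similarity, usage, quality) entropies for k = 3 documents.
entropies = [(0.5, 0.4, 0.3), (0.2, 0.1, 0.5), (0.3, 0.3, 0.3)]
combined = [combined_entropy(*e) for e in entropies]
scores = sensitivities(combined)
for h, c in zip(combined, scores):
    print(f"H = {h:.3f}  C = {c:.3f}")
```

Under this assumed mapping, the document with the largest combined entropy receives the smallest sensitivity value, which matches the inverse relation stated above.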

As an improvement of the above scheme, the vector splicing is performed on the value element vectors corresponding to the value elements that meet the preset feature contribution threshold to obtain the target value element vector of the document to be identified, and the method specifically includes:

calculating the characteristic contribution degree of each value element of the document to be identified;

obtaining value elements corresponding to the maximum N characteristic contribution degrees as the value elements meeting the preset contribution degree threshold;

carrying out vector splicing on the value element vectors corresponding to the value elements which accord with the preset feature contribution degree threshold value to obtain target value element vectors of the document to be identified;

wherein, the calculating the feature contribution degree of each value element of the document to be identified specifically comprises: calculating the characteristic contribution degree of each value element by the following calculation formula:

[formula omitted in source]

wherein FCD(t, c_i) is the feature contribution degree of value element t of the document to be identified; df(t, c_i) is the number of documents of class c_i in which value element t appears; df(t, c_j) is the number of documents not of class c_i in which value element t appears; M is the number of document categories of the platform data set; and class c_i is the sensitive document category.

As an improvement of the above scheme, the method for calculating the document sensitivity further includes:

determining the sensitivity level of the document to be identified according to the sensitivity of the document to be identified; wherein the sensitivity level is inversely related to the sensitivity of the document to be identified.

An embodiment of the invention also provides a device for calculating document sensitivity, comprising a value element acquisition module, a target value element vector acquisition module, a similarity entropy calculation module and a sensitivity calculation module, wherein:

the value element acquisition module is used for acquiring the value elements of the document to be identified and vectorizing each value element to obtain a corresponding value element vector; wherein the value element comprises metadata of the document to be identified;

the target value element vector acquisition module is used for carrying out vector splicing on value element vectors corresponding to the value elements which accord with a preset feature contribution degree threshold value to obtain a target value element vector of the document to be identified;

the similarity entropy calculation module is used for calculating the similarity entropy of the document to be identified according to the similarity between the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity;

the sensitivity calculation module is used for calculating the sensitivity of the document to be identified according to the similarity entropy of the document to be identified; wherein the sensitivity of the document to be identified is in negative correlation with the similarity entropy.

The embodiment of the present invention further provides a document sensitivity calculation apparatus, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the processor implements the document sensitivity calculation method according to any one of the preceding claims.

Compared with the prior art, the document sensitivity calculation method and the document sensitivity calculation device disclosed by the invention have the advantages that the value elements of the document to be identified are obtained, and each value element is vectorized; carrying out vector splicing on value element vectors corresponding to the value elements which accord with a preset feature contribution degree threshold value to obtain a target value element vector of the document to be identified; and calculating the similarity entropy of the document to be identified according to the similarity of the target value element vector and a preset value element vector of a preset document. And then, calculating the use entropy and the quality entropy of the document to be identified according to the user access times and the source credibility of the document to be identified so as to calculate the combined entropy of the document to be identified, and further calculating the sensitivity of the document to be identified according to the combined entropy of the document to be identified. Compared with the prior art, the method takes the unstructured data set of the value elements of the document as an object, realizes the identification and analysis of sensitive data by adopting a feature vector similarity method, and can realize the sensitivity calculation of the unstructured data document without acquiring the content attribute, the sensitive dictionary and other information of the document in advance by combining the use frequency and the source credibility of the document, thereby effectively improving the accuracy of the sensitivity calculation of the document, and the calculation method is simple and convenient.

Drawings

FIG. 1 is a flowchart illustrating steps of a method for calculating document sensitivity according to a first embodiment of the present invention;

FIG. 2 is a flowchart illustrating steps of calculating feature contribution degrees in the method for calculating document sensitivity according to the first embodiment of the present invention;

FIG. 3 is a flowchart illustrating steps of a method for calculating document sensitivity according to a second embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a document sensitivity calculation apparatus according to a third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of another document sensitivity calculation apparatus according to the fourth embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flowchart of the steps of a method for calculating document sensitivity according to a first embodiment of the present invention. The method of the first embodiment is executed through steps S11 to S14:

S11, obtaining the value elements of the document to be identified, and vectorizing each value element to obtain a corresponding value element vector; wherein the value elements comprise metadata of the document to be identified.

Specifically, the document to be identified is a document in the platform data set whose sensitivity needs to be identified. Accurate and useful information is extracted from the document through a big data semantic recognition technology and serves as the value elements of the document, which are used to assess its data quality and value. The value elements form an unstructured data set that includes the metadata of the document. Metadata is the key data generated while a data warehouse is built, covering data source definitions, target definitions, conversion rules and the like, together with business information about the meaning of the data; for a document it specifically includes the data source, information producer, title, keywords, abstract, creation time, language, format, number of views and similar information.

After the value elements have been extracted, each one is vectorized with a word vectorization method such as word2vec. All value element vectors have the same number of dimensions; in this embodiment each has 128 dimensions. Converting the symbolic information of natural language into numerical vectors makes it possible, in the subsequent sensitivity calculation, to compute the distance between these vectors and the value element vector of a specific document, and hence their similarity.

S12, carrying out vector splicing on the value element vectors corresponding to the value elements meeting the preset feature contribution degree threshold value to obtain the target value element vector of the document to be identified.

Fig. 2 is a schematic flowchart of the steps of calculating feature contribution degrees in the document sensitivity calculation method according to the first embodiment of the present invention. Step S12 is executed through steps S121 to S123:

S121, calculating the feature contribution degree of each value element of the document to be identified.

Feature Contribution Degree (FCD) is a method of feature selection for identifying the degree of contribution of a value element to the discriminative power between different classes.

Specifically, the feature contribution degree of each value element is calculated by the following calculation formula:

[formula omitted in source]

wherein FCD(t, c_i) is the feature contribution degree of value element t of the document to be identified, df(t, c_i) is the number of documents of class c_i in which value element t appears, df(t, c_j) is the number of documents not of class c_i in which value element t appears, and M is the number of document classes of the platform data set.

In one embodiment, the documents in the platform data set are divided, according to a preset algorithm or division rule, into two document categories, sensitive documents and non-sensitive documents, i.e., M = 2. The feature contribution degree of each value element is calculated to measure that element's ability to distinguish the sensitive document category from the non-sensitive document category. When class c_i is the sensitive document category, a larger FCD means that value element t contributes more to distinguishing sensitive documents, that is, it is a stronger indicator of a sensitive document. When class c_i is the non-sensitive document category, a larger FCD means that value element t contributes more to distinguishing non-sensitive documents. The value of FCD lies in the interval [0, 1].
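As an illustration, the document frequencies df(t, c) that feed the feature contribution degree can be counted as in the Python sketch below. The counting follows the definitions above. The score `fcd_share` is a hypothetical stand-in, not the patent's FCD formula: it is simply the share of documents containing t that belong to class c_i, chosen because it also lies in [0, 1]. The class labels and sample value elements are invented for the example.

```python
from collections import defaultdict


def document_frequencies(docs):
    """Count df(t, c): the number of documents of class c containing element t.

    docs is an iterable of (class_label, iterable_of_value_elements).
    """
    df = defaultdict(lambda: defaultdict(int))
    for label, elements in docs:
        for t in set(elements):  # each document counts once per element
            df[t][label] += 1
    return df


def fcd_share(df, t, c_i):
    """HYPOTHETICAL stand-in for FCD(t, c_i), not the source's formula.

    Returns df(t, c_i) divided by the total number of documents containing t.
    """
    total = sum(df[t].values())
    return df[t][c_i] / total if total else 0.0


# Invented platform data set with M = 2 classes.
docs = [
    ("sensitive", ["salary", "contract", "2020"]),
    ("sensitive", ["salary", "id_number"]),
    ("non_sensitive", ["newsletter", "2020"]),
    ("non_sensitive", ["menu", "salary"]),
]
df = document_frequencies(docs)
for element in ("salary", "2020", "menu"):
    print(element, round(fcd_share(df, element, "sensitive"), 3))
```

An element that appears mostly in sensitive documents scores close to 1 under this stand-in, and one that appears only in non-sensitive documents scores 0.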

S122, obtaining the value elements corresponding to the N largest feature contribution degrees as the value elements meeting the preset feature contribution degree threshold; wherein N ≥ 1.

In this embodiment, class c_i is the sensitive document category, so that the feature contribution degree measures how strongly a value element indicates a sensitive document. After the feature contribution degree of each value element of the document to be identified has been calculated, the value elements corresponding to the N largest feature contribution degrees are taken as the value elements meeting the preset feature contribution degree threshold. Preferably, N = 8. Selecting the key value elements by feature contribution degree, based on the distribution of the value element vectors, reduces the complexity of the sensitivity calculation.

S123, carrying out vector splicing on the value element vectors corresponding to the value elements that meet the preset feature contribution degree threshold to obtain the target value element vector of the document to be identified.

The value element vectors of the selected value elements are spliced (concatenated) into a single vector, which serves as the target value element vector of the document to be identified. When N = 8 value elements are selected, the target value element vector has 128 × 8 = 1024 dimensions.
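Steps S122 and S123 can be sketched in Python as follows: keep the N value elements with the largest feature contribution degrees and concatenate their vectors. The function name, the toy vectors and the scores are hypothetical; only N = 8 and the 128-dimensional vectors come from the text.

```python
def target_vector(element_vectors, fcd_scores, n=8):
    """Concatenate the vectors of the n elements with the largest FCD.

    element_vectors: dict mapping element -> list of floats (equal lengths)
    fcd_scores: dict mapping element -> feature contribution degree
    Returns the selected elements and the spliced vector.
    """
    top = sorted(fcd_scores, key=fcd_scores.get, reverse=True)[:n]
    spliced = []
    for element in top:
        spliced.extend(element_vectors[element])
    return top, spliced


dim = 128
elements = [f"e{i}" for i in range(10)]
# Toy data: element e_i has a constant vector of value i and score i / 10.
vectors = {e: [float(i)] * dim for i, e in enumerate(elements)}
scores = {e: i / 10 for i, e in enumerate(elements)}

top, vec = target_vector(vectors, scores, n=8)
print(top)
print(len(vec))  # 128 * 8 = 1024
```

The text does not say how a document with fewer than N value elements is handled; this sketch simply returns a shorter vector in that case.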

S13, calculating a similarity entropy of the document to be identified according to the similarity of the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity.

The preset document is a typical, representative sensitive document. Specifically, for each preset document the value element vectors of its N value elements with the largest feature contribution degrees are obtained and spliced to form the preset value element vector of that document, which therefore also has 1024 dimensions. The similarity between the target value element vector and the preset value element vector is then calculated, that is, the distance between the two vectors is measured: the smaller the distance, the greater the similarity. Any existing similarity measure may be used, such as cosine similarity or the Tanimoto coefficient; no specific limitation is imposed here.
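The two similarity measures named above can be written in a few lines of Python. This is a minimal sketch using the standard definitions for real-valued vectors; the function names and the short sample vectors are illustrative, and in the embodiment the inputs would be the 1024-dimensional spliced vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tanimoto_coefficient(a, b):
    """Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(cosine_similarity(a, b), tanimoto_coefficient(a, b))
print(cosine_similarity(a, c), tanimoto_coefficient(a, c))
```

Cosine similarity ignores vector length, so `a` and `b` count as identical, whereas the Tanimoto coefficient penalises the difference in magnitude.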

As a preferred embodiment, step S13 is executed by steps S131 to S133:

S131, calculating the similarity between the target value element vector of the document to be identified and the preset value element vector of each preset document in a preset document set, and taking the sum of the similarities between the target value element vector and the preset value element vectors of all preset documents in the preset document set as the similarity of the document to be identified.

The preset document set is a data set of preset sensitive documents, the preset document set comprises m preset sensitive documents, and m is larger than or equal to 1. Each preset document is pre-calculated with a corresponding preset value element vector for measuring the sensitivity distinguishing contribution degree of the corresponding preset document.

Specifically, the similarity between the target value element vector of the document to be identified and the preset value element vector of each preset document in the preset document set is calculated, and the sum of these similarities over all m preset documents is taken as the similarity of the document to be identified.

S132, calculating the similarity between each document in the platform data set and the preset value element vector of each preset document, and taking the sum of the similarities between all the documents in the platform data set and the preset value element vectors of all preset documents as the similarity of the platform data set; wherein the platform data set comprises the document to be identified.

The platform data set is the collection of all documents, including the document to be identified. It comprises k documents, k ≥ 1. The similarity between each document in the platform data set and the preset value element vector of each preset document is calculated, and the sum of these similarities over all k documents and all m preset documents is taken as the similarity of the platform data set.

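S133, calculating the similarity entropy of the document to be identified according to the ratio of the similarity of the document to be identified to the similarity of the platform data set, the similarity entropy specifically satisfying a calculation formula:

[formula omitted in source]

wherein H(d_r) is the similarity entropy of the document to be identified; the ratio is formed from the similarity of the document to be identified and the similarity of the platform data set; m is the number of preset documents, and k is the number of documents in the platform data set.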
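Steps S131 to S133 can be sketched in Python as follows. The two sums and their ratio follow the text. The entropy term is an assumption, since the formula itself is not available here: the sketch uses the single-outcome term -p·log2(p) from the entropy of a discrete random variable, which increases with p only while p stays below 1/e. All names and sample similarities are hypothetical.

```python
import math


def similarity_ratio(sim, r):
    """Ratio of document r's similarity to the platform's similarity.

    sim[i][j] is the similarity between platform document i (k rows)
    and preset document j (m columns).
    """
    doc_similarity = sum(sim[r])                        # S131: sum over m presets
    platform_similarity = sum(sum(row) for row in sim)  # S132: sum over k x m
    if platform_similarity == 0:
        return 0.0
    return doc_similarity / platform_similarity


def entropy_term(p):
    """ASSUMED form of the entropy: -p * log2(p), with 0 for p = 0."""
    return -p * math.log2(p) if p > 0 else 0.0


# Invented similarities for k = 5 platform documents and m = 2 preset documents.
sim = [
    [0.9, 0.8],
    [0.1, 0.2],
    [0.5, 0.5],
    [0.6, 0.4],
    [0.7, 0.8],
]
ratios = [similarity_ratio(sim, r) for r in range(len(sim))]
entropies = [entropy_term(p) for p in ratios]
for p, h in zip(ratios, entropies):
    print(f"ratio = {p:.3f}  entropy term = {h:.3f}")
```

In this sample every ratio is below 1/e, so the document most similar to the preset sensitive documents also has the largest entropy term, consistent with the positive correlation stated in step S13.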

S14, calculating the sensitivity of the document to be identified according to the similarity entropy of the document to be identified; wherein the sensitivity of the document to be identified is negatively correlated with the similarity entropy.

The similarity to the value element vectors of the preset sensitive document set is calculated, the similarity entropy of the document to be identified is obtained by a discrete random variable entropy method, and the sensitivity of the document is then evaluated. A functional relation is preset under which the similarity entropy is negatively correlated with the document sensitivity, and the sensitivity is obtained by substituting the similarity entropy of the document into that relation. Specifically, the higher the similarity entropy, the smaller the sensitivity value, and the more sensitive the document to be identified.

The embodiment of the invention provides a method for calculating document sensitivity, which comprises the steps of obtaining value elements of a document to be identified and vectorizing each value element; carrying out vector splicing on value element vectors corresponding to the value elements which accord with a preset feature contribution degree threshold value to obtain a target value element vector of the document to be identified; and calculating the similarity entropy of the document to be identified according to the similarity of the target value element vector and a preset value element vector of a preset document, and further calculating the sensitivity of the document to be identified according to the similarity entropy of the document to be identified. In the embodiment of the invention, the unstructured data set of the value element of the document is taken as an object, the identification and analysis of sensitive data are realized by adopting a feature vector similarity method, the sensitivity calculation of the unstructured data document can be realized without acquiring the content attribute, the sensitive dictionary and other information of the document in advance, the accuracy of the sensitivity calculation of the document is effectively improved, and the calculation method is simple and convenient.

Fig. 3 is a schematic flow chart illustrating steps of a document sensitivity calculation method according to a second embodiment of the present invention. The method for calculating the document sensitivity provided by the second embodiment of the present invention is implemented on the basis of the first embodiment, and specifically includes steps S21 to S27:

S21, obtaining the value elements of the document to be identified, and vectorizing each value element to obtain a corresponding value element vector; wherein the value elements comprise metadata of the document to be identified.

S22, carrying out vector splicing on the value element vectors corresponding to the value elements meeting the preset feature contribution degree threshold value to obtain the target value element vector of the document to be identified.

S23, calculating a similarity entropy of the document to be identified according to the similarity of the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity.

In the embodiment of the present invention, the steps S21 to S23 are the same as the steps S11 to S13 in the first embodiment, and are not repeated herein.

S24, calculating the use entropy of the document to be identified according to the user access times of the document to be identified; wherein the usage entropy is positively correlated with the user access times of the document to be identified.

As a preferred embodiment, step S24 is executed by steps S241 to S243:

S241, obtaining the number of times each user accesses the document to be identified, and calculating the sum of the numbers of times all users access the document to be identified as the number of user accesses of the document to be identified.

S242, obtaining the times of each user respectively accessing each document in the platform data set, and calculating the sum of the times of all users respectively accessing each document in the platform data set as the user access times of the platform data set; wherein the platform data set includes the document to be identified.

Assume that n users access the platform data set, where n ≥ 1, and that the platform data set comprises k documents, where k ≥ 1. The number of times each user accesses the document to be identified is counted, and the sum of these counts over the n users is taken as the user access times of the document to be identified. Likewise, the number of times each user accesses each document in the platform data set is counted, and the sum of these counts over the n users and the k documents is taken as the user access times of the platform data set.

S243, calculating the usage entropy of the document to be identified according to the ratio of the user access times of the document to be identified to the user access times of the platform data set; the following calculation formula is specifically satisfied:

[formula not reproduced in the source text]

wherein H(c_r) is the usage entropy of the document to be identified, and the two quantities in the ratio are the user access times of the document to be identified and the user access times of the platform data set.

In the embodiment of the invention, the more frequently the document data is used, that is, the more times the document is accessed by users, the higher the possibility that it is abused, and the more sensitive the document is. The usage entropy of the document to be identified is obtained by the entropy method for discrete random variables and is then used to evaluate the sensitivity of the document to be identified, so that the sensitivity calculation of the document to be identified is more accurate and reliable.
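The counting in steps S241 to S243 can be sketched in Python as follows. The data layout `access_counts[doc_id][user_id]` and all function names are hypothetical. Because the usage entropy formula itself is not reproduced in this text, the sketch computes only the ratio that the formula is stated to depend on, and takes the mapping from ratio to entropy as a caller-supplied function.

```python
from typing import Callable, Dict

# access_counts[doc_id][user_id] = times that user accessed that document
AccessCounts = Dict[str, Dict[str, int]]


def document_access_total(access_counts: AccessCounts, doc_id: str) -> int:
    """S241: sum over all users of their accesses to one document."""
    return sum(access_counts[doc_id].values())


def platform_access_total(access_counts: AccessCounts) -> int:
    """S242: sum over all users and all documents in the platform data set."""
    return sum(sum(per_user.values()) for per_user in access_counts.values())


def usage_ratio(access_counts: AccessCounts, doc_id: str) -> float:
    """Ratio of the document's access total to the platform's access total."""
    total = platform_access_total(access_counts)
    if total == 0:
        return 0.0  # assumption: no recorded accesses means a zero ratio
    return document_access_total(access_counts, doc_id) / total


def usage_entropy(access_counts: AccessCounts, doc_id: str,
                  entropy_of_ratio: Callable[[float], float]) -> float:
    """S243: apply a caller-supplied entropy mapping to the ratio.

    The mapping is a placeholder for the formula that is missing here.
    """
    return entropy_of_ratio(usage_ratio(access_counts, doc_id))


# Illustrative counts for k = 2 documents and n = 2 users.
counts = {
    "r": {"u1": 3, "u2": 1},
    "other": {"u1": 4, "u2": 2},
}
ratio_r = usage_ratio(counts, "r")
```

Whatever mapping is supplied should be increasing in the ratio over the range of interest, since the text states that the usage entropy is positively correlated with the user access times.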

S25, calculating the quality entropy of the document to be identified according to the source credibility of the document to be identified; wherein the quality entropy is positively correlated with the source credibility of the document to be identified.

As a preferred embodiment, step S25 is executed by steps S251 to S253:

S251, obtaining the source credibility of the document to be identified under each credibility factor, and taking the sum of the source credibility of the document to be identified under all credibility factors as the source credibility of the document to be identified; wherein the credibility factors comprise uploader level, document source, and document creation or modification date.

S252, obtaining the source credibility of each document in the platform data set under each credibility factor, and taking the sum of the source credibility of all documents in the platform data set under each credibility factor as the source credibility of the platform data set; wherein the platform data set includes the document to be identified.

Specifically, the source credibility of a document is related to credibility factors such as the uploader level, the document source, and the document creation or modification date. Mapping relations are set between each of these three credibility factors and the source credibility. The source credibility of the document to be identified under each credibility factor is then obtained, and the sum over the three credibility factors is taken as the source credibility of the document to be identified. Likewise, the source credibility of each document in the platform data set under each credibility factor is obtained, and the sum over all documents and all credibility factors is taken as the source credibility of the platform data set.

S253, calculating the quality entropy of the document to be identified according to the ratio of the source credibility of the document to be identified to the source credibility of the platform data set; the following calculation formula is specifically satisfied:

[formula not reproduced in the source text]

wherein H(N_r) is the quality entropy of the document to be identified, and the two quantities in the ratio are the source credibility of the document to be identified and the source credibility of the platform data set; l is the number of credibility factors, and k is the number of documents in the platform data set. In the present embodiment, l = 3.

In the embodiment of the invention, the higher the quality of the document data, that is, the more reliable the document source, the more sensitive the document is. The quality entropy of the document to be identified is obtained by the entropy method for discrete random variables and is then used to evaluate the sensitivity of the document to be identified, so that the sensitivity calculation of the document to be identified is more accurate and reliable.
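The summation in steps S251 to S253 can be sketched in Python in the same spirit. The factor keys, the function names and the numeric scores are hypothetical; the scores stand in for the preset mapping from each credibility factor to a source credibility value, which this text does not specify. Only the ratio is computed, because the quality entropy formula is not reproduced here.

```python
from typing import Dict

# The l = 3 credibility factors named in the text (keys are hypothetical).
FACTORS = ("uploader_level", "document_source", "creation_or_modification_date")

# scores[doc_id][factor] = source credibility of that document under that factor
Scores = Dict[str, Dict[str, float]]


def document_credibility(scores: Scores, doc_id: str) -> float:
    """S251: sum of one document's source credibility over all factors."""
    return sum(scores[doc_id][factor] for factor in FACTORS)


def platform_credibility(scores: Scores) -> float:
    """S252: sum over all documents and all factors in the platform data set."""
    return sum(document_credibility(scores, doc_id) for doc_id in scores)


def credibility_ratio(scores: Scores, doc_id: str) -> float:
    """Ratio that the quality entropy in S253 is stated to depend on."""
    total = platform_credibility(scores)
    if total == 0:
        return 0.0  # assumption: an all-zero platform yields a zero ratio
    return document_credibility(scores, doc_id) / total


# Illustrative scores for k = 2 documents.
scores = {
    "r": {"uploader_level": 0.5, "document_source": 0.25,
          "creation_or_modification_date": 0.25},
    "other": {"uploader_level": 1.0, "document_source": 1.0,
              "creation_or_modification_date": 1.0},
}
ratio_r = credibility_ratio(scores, "r")
```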

S26, calculating the product of the similarity entropy, the use entropy and the quality entropy of the to-be-identified document as the combined entropy of the to-be-identified document.

The combined entropy H(r) of the document to be identified specifically satisfies the following calculation formula:

H(r) = H(c_r) · H(d_r) · H(N_r);

wherein H(d_r) is the similarity entropy of the document to be identified, H(c_r) is the usage entropy of the document to be identified, and H(N_r) is the quality entropy of the document to be identified.

S27, calculating the sensitivity of the document to be recognized according to the combined entropy of the document to be recognized; wherein the sensitivity of the document to be identified is inversely related to the combined entropy.

Specifically, the combined entropy H(r_i) of each document in the platform data set is calculated in the same way as the combined entropy of the document to be identified, and the sum of the combined entropies of all documents in the platform data set is taken as the combined entropy of the platform data set. According to the combined entropy H(r) of the document to be identified and the combined entropy of the platform data set, the sensitivity C(r) of the document to be identified is calculated by the following calculation formula:

[formula not reproduced in the source text]

The value range of the sensitivity C(r) of the document to be identified is [0,1]; the closer the sensitivity value is to 0, the more sensitive the document to be identified is, and the closer the sensitivity value is to 1, the less sensitive the document to be identified is.
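Steps S26 and S27 can be illustrated with a short Python sketch. The product in `combined_entropy` follows the formula for H(r) given in the text. The formula for C(r) is not reproduced here, so the form 1 − H(r)/ΣH(r_i) used in `sensitivity` is an assumption, chosen only because it matches the stated properties: it lies in [0,1] for non-negative entropies and decreases as the combined entropy grows. All names and values are hypothetical.

```python
from typing import Dict, Tuple

# (similarity entropy H(d_r), usage entropy H(c_r), quality entropy H(N_r))
Entropies = Tuple[float, float, float]


def combined_entropy(entropies: Entropies) -> float:
    """S26: H(r) = H(c_r) * H(d_r) * H(N_r)."""
    h_d, h_c, h_n = entropies
    return h_c * h_d * h_n


def sensitivity(platform: Dict[str, Entropies], doc_id: str) -> float:
    """S27 under an ASSUMED normalisation: C(r) = 1 - H(r) / sum_i H(r_i)."""
    total = sum(combined_entropy(e) for e in platform.values())
    if total == 0:
        return 1.0  # assumption: no entropy anywhere is treated as least sensitive
    return 1.0 - combined_entropy(platform[doc_id]) / total


# Illustrative entropies for a platform data set of two documents.
platform = {
    "r": (0.5, 0.5, 3.0),      # H(r) = 0.75
    "other": (0.5, 0.5, 1.0),  # H(other) = 0.25
}
c_r = sensitivity(platform, "r")
c_other = sensitivity(platform, "other")
```

Under this assumed form, document "r" has the larger combined entropy and therefore the smaller sensitivity value, which the text interprets as the more sensitive document.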

In the embodiment of the invention, the sensitivity of the document to be recognized in the platform data set is evaluated by combining the similarity entropy, the use entropy and the quality entropy of the document, so that the sensitivity recognition and analysis of the document to be recognized are realized.

Further, the method for calculating the document sensitivity further comprises the following steps: determining the sensitivity level of the document to be identified according to the sensitivity of the document to be identified; wherein the sensitivity level is inversely related to the sensitivity of the document to be identified.

The sensitivity level of the document to be identified is determined according to a preset mapping relation between the sensitivity and the sensitivity level of a document, wherein the sensitivity levels comprise: a non-sensitive level, a light sensitive level, a medium sensitive level, and a high sensitive level. The mapping relation between the sensitivity and the sensitivity level is shown in Table 1:

Table 1. Mapping of sensitivity to sensitivity level

[table contents not reproduced in the source text]
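A lookup of this kind can be sketched in Python as follows. The contents of Table 1 are not available in this text, so the boundaries `(0.25, 0.5, 0.75)` are purely hypothetical defaults and should be replaced by the preset mapping; only the four level names and the inverse relation (a lower sensitivity value gives a higher sensitivity level) come from the description.

```python
from bisect import bisect_right
from typing import Sequence

# Ordered from the lowest sensitivity values (most sensitive) upward.
LEVELS = ("high sensitive", "medium sensitive", "light sensitive", "non-sensitive")


def sensitivity_level(c: float,
                      boundaries: Sequence[float] = (0.25, 0.5, 0.75)) -> str:
    """Map a sensitivity value C(r) in [0, 1] to one of the four levels.

    `boundaries` are HYPOTHETICAL cut points, not the values of Table 1.
    A value equal to a boundary falls into the less sensitive interval.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("sensitivity must lie in [0, 1]")
    if len(boundaries) != len(LEVELS) - 1:
        raise ValueError("need exactly three boundaries for four levels")
    return LEVELS[bisect_right(boundaries, c)]


level_low = sensitivity_level(0.1)
level_high = sensitivity_level(0.9)
```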

The second embodiment of the invention provides a method for calculating document sensitivity. The value elements of a document to be identified are obtained and each value element is vectorized; the value element vectors corresponding to the value elements that meet a preset feature contribution degree threshold value are spliced to obtain a target value element vector of the document to be identified; and the similarity entropy of the document to be identified is calculated according to the similarity of the target value element vector and a preset value element vector of a preset document. The usage entropy and the quality entropy of the document to be identified are then calculated according to its user access times and its source credibility, the combined entropy of the document to be identified is calculated from these entropies, and the sensitivity of the document to be identified is calculated according to the combined entropy. The embodiment takes the unstructured data set formed by the value elements of documents as its object, identifies and analyzes sensitive data by a feature vector similarity method, and combines the usage frequency and the source credibility of the document, so that the sensitivity of an unstructured data document can be calculated without acquiring information such as the content attributes or a sensitive dictionary of the document in advance. The sensitivity level of the document is further determined according to its sensitivity. The accuracy of the document sensitivity calculation is thereby effectively improved, and the calculation method is simple and convenient.

Fig. 4 is a schematic structural diagram of a document sensitivity calculation apparatus according to a third embodiment of the present invention. In the embodiment of the present invention, the document sensitivity calculation apparatus 30 includes: a value element acquisition module 31, a target value element vector acquisition module 32, a similarity entropy calculation module 33 and a sensitivity calculation module 34; wherein:

the value element obtaining module 31 is configured to obtain value elements of a document to be identified, and vectorize each value element to obtain a corresponding value element vector; wherein the value element comprises metadata of the document to be identified;

the target value element vector obtaining module 32 is configured to perform vector splicing on the value element vectors corresponding to the value elements that meet a preset feature contribution threshold value, so as to obtain a target value element vector of the document to be identified;

the similarity entropy calculation module 33 is configured to calculate a similarity entropy of the document to be recognized according to a similarity between the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity;

the sensitivity calculation module 34 is configured to calculate the sensitivity of the document to be identified according to the similarity entropy of the document to be identified; wherein the sensitivity of the document to be identified is in negative correlation with the similarity entropy.
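The module structure of apparatus 30 can be pictured as a simple pipeline. The Python sketch below is illustrative only: the class name, field names and the stub functions in the usage example are hypothetical, and each module is represented as a caller-supplied function because the internal formulas are described in the method embodiments rather than here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class DocumentSensitivityApparatus:
    """Chains the four modules 31 to 34 in the order described."""
    acquire_value_vectors: Callable[[str], Dict[str, Vector]]   # module 31
    build_target_vector: Callable[[Dict[str, Vector]], Vector]  # module 32
    similarity_entropy: Callable[[Vector], float]               # module 33
    sensitivity_from_entropy: Callable[[float], float]          # module 34

    def run(self, document_id: str) -> float:
        vectors = self.acquire_value_vectors(document_id)
        target = self.build_target_vector(vectors)
        entropy = self.similarity_entropy(target)
        return self.sensitivity_from_entropy(entropy)


# Usage with stub modules; every lambda is a placeholder, not the patented logic.
apparatus = DocumentSensitivityApparatus(
    acquire_value_vectors=lambda doc_id: {"title": [1.0, 2.0]},
    build_target_vector=lambda vs: [x for name in sorted(vs) for x in vs[name]],
    similarity_entropy=lambda target: sum(target) / 10.0,
    sensitivity_from_entropy=lambda h: 1.0 - h,  # decreasing, as the text requires
)
result = apparatus.run("doc-1")
```

Keeping each module behind a function boundary mirrors the one-to-one correspondence between the apparatus modules and the method steps.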

It should be noted that, the document sensitivity calculation apparatus provided in the embodiment of the present invention is configured to execute all the process steps of the document sensitivity calculation method in the first embodiment or the second embodiment, and working principles and beneficial effects of the two are in one-to-one correspondence, so that details are not repeated.

The third embodiment of the invention provides an apparatus for calculating document sensitivity. The value elements of a document to be identified are obtained and each value element is vectorized; the value element vectors corresponding to the value elements that meet a preset feature contribution degree threshold value are spliced to obtain a target value element vector of the document to be identified; and the similarity entropy of the document to be identified is calculated according to the similarity of the target value element vector and a preset value element vector of a preset document. The usage entropy and the quality entropy of the document to be identified are then calculated according to its user access times and its source credibility, the combined entropy of the document to be identified is calculated from these entropies, and the sensitivity of the document to be identified is calculated according to the combined entropy. The embodiment takes the unstructured data set formed by the value elements of documents as its object, identifies and analyzes sensitive data by a feature vector similarity method, and combines the usage frequency and the source credibility of the document, so that the sensitivity of an unstructured data document can be calculated without acquiring information such as the content attributes or a sensitive dictionary of the document in advance. The accuracy of the document sensitivity calculation is thereby effectively improved, and the calculation method is simple and convenient.

Fig. 5 is a schematic structural diagram of another document sensitivity calculation apparatus according to the fourth embodiment of the present invention. In an embodiment of the present invention, the document sensitivity calculation apparatus 40 includes a processor 41, a memory 42, and a computer program stored in the memory and configured to be executed by the processor, and the processor executes the computer program to implement the document sensitivity calculation method according to any one of the first embodiment and the second embodiment.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A method for calculating document sensitivity is characterized by comprising the following steps:

2. The method for calculating document sensitivity according to claim 1, wherein the calculating the similarity entropy of the document to be recognized according to the similarity between the target value element vector and a preset value element vector of a preset document specifically comprises:

for the similarity of the documents to be identified,

3. The method of calculating document sensitivity of claim 1, wherein the method of calculating document sensitivity further comprises:

4. The method for calculating the document sensitivity according to claim 3, wherein the calculating the use entropy of the document to be recognized according to the user access times of the document to be recognized specifically comprises:

for the number of user accesses of the document to be identified,

5. The method for calculating the sensitivity of a document according to claim 3, wherein the calculating the quality entropy of the document to be recognized according to the source reliability of the document to be recognized specifically comprises:

for the source confidence of the document to be identified,

6. The method for calculating the sensitivity of a document according to claim 3, wherein the calculating the sensitivity of the document to be recognized according to the combined entropy of the document to be recognized specifically comprises:

7. The method for calculating document sensitivity according to any one of claims 1 to 6, wherein the vector splicing is performed on the value element vectors corresponding to the value elements that meet a preset feature contribution threshold value to obtain the target value element vector of the document to be identified, and specifically includes:

8. The method of calculating document sensitivity of claim 1, further comprising:

9. A document sensitivity calculation apparatus, comprising: a value element acquisition module, a target value element vector acquisition module, a similarity entropy calculation module and a sensitivity calculation module; wherein:

10. A document sensitivity calculation apparatus, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the document sensitivity calculation method according to any one of claims 1 to 8 when executing the computer program.