CN111209735A - Document sensitivity calculation method and device - Google Patents


Info

Publication number
CN111209735A
Authority
CN
China
Prior art keywords
document
identified
calculating
entropy
sensitivity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010004721.5A
Other languages
Chinese (zh)
Other versions
CN111209735B (en
Inventor
蒋仕宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GCI Science and Technology Co Ltd
Original Assignee
GCI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GCI Science and Technology Co Ltd filed Critical GCI Science and Technology Co Ltd
Priority to CN202010004721.5A priority Critical patent/CN111209735B/en
Publication of CN111209735A publication Critical patent/CN111209735A/en
Application granted granted Critical
Publication of CN111209735B publication Critical patent/CN111209735B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for calculating document sensitivity, which comprises the following steps: obtaining value elements of a document to be identified, vectorizing each value element, and carrying out vector splicing on the value element vectors corresponding to the value elements that meet a preset feature contribution degree threshold to obtain a target value element vector of the document to be identified; and calculating the similarity entropy of the document to be identified according to the similarity between the target value element vector and a preset value element vector of a preset document, and further calculating the sensitivity of the document to be identified. The embodiment of the invention also discloses a corresponding device for calculating the document sensitivity. By identifying the value elements of the document and applying a feature vector similarity method, the embodiment of the invention identifies and analyzes sensitive data and calculates the document sensitivity, effectively improving the accuracy of the calculation while keeping the calculation method simple and convenient.

Description

Document sensitivity calculation method and device
Technical Field
The invention relates to the technical field of information security, in particular to a method and a device for calculating document sensitivity.
Background
The confidentiality, integrity and availability of data bear on national security, the core competitiveness of enterprises, personal privacy and other matters, and data security is receiving increasing attention as an important topic in the field of information security. The wide use of e-mail, instant messaging and removable storage media has improved working efficiency, but it has also inevitably widened the channels for data leakage and heightened users' concerns about the security of data storage. Many researchers in China and abroad have studied secure storage methods for sensitive data, for example hierarchical models based on data security requirements proposed for data sensitive attributes, and the identification and grading of sensitive attributes in structured data sets.
However, in the process of implementing the invention, the inventor found that the prior art has at least the following problem: existing sensitive data classification methods set the data sensitive attributes in advance and are mostly designed for structured data sets, so they are not suitable for sensitivity identification and classification of the various semi-structured or unstructured data found in cloud computing.
Disclosure of Invention
The embodiment of the invention aims to provide a method and a device for calculating the document sensitivity, which realize the calculation of the document sensitivity by identifying the value elements of a document, effectively improve the accuracy of the calculation of the document sensitivity and have a simple and convenient calculation method.
In order to achieve the above object, an embodiment of the present invention provides a method for calculating document sensitivity, including:
obtaining value elements of a document to be identified, and vectorizing each value element to obtain a corresponding value element vector; wherein the value element comprises metadata of the document to be identified;
carrying out vector splicing on value element vectors corresponding to the value elements which accord with a preset feature contribution degree threshold value to obtain a target value element vector of the document to be identified;
calculating a similarity entropy of the document to be identified according to the similarity of the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity;
calculating the sensitivity of the document to be identified according to the similarity entropy of the document to be identified; wherein the sensitivity of the document to be identified is in negative correlation with the similarity entropy.
As an improvement of the above scheme, the calculating a similarity entropy of the document to be recognized according to the similarity between the target value element vector and a preset value element vector of a preset document specifically includes:
calculating the similarity between the target value element vector of the document to be identified and a preset value element vector of each preset document in a preset document set, and taking the sum of the similarities between all the target value element vectors and the preset value element vector of each preset document as the similarity of the document to be identified;
calculating the similarity of each document in a platform data set and a preset value element vector of each preset document, and taking the sum of the similarities of all documents in the platform data set and the preset value element vector of each preset document as the similarity of the platform data set; wherein the platform data set comprises the document to be identified;
according to the similarity of the document to be recognized and the similarity of the platform data set, calculating the similarity entropy of the document to be recognized through the following calculation formula:
Figure BDA0002354799470000021
wherein H(d_r) is the similarity entropy of the document to be identified;
Figure BDA0002354799470000022
is the similarity of the document to be identified;
Figure BDA0002354799470000023
is the similarity of the platform data set; m is the number of the preset documents, and k is the number of documents in the platform data set.
As an improvement of the above solution, the method for calculating document sensitivity further includes:
calculating the use entropy of the document to be identified according to the user access times of the document to be identified; wherein the usage entropy is positively correlated with the user access times of the document to be identified;
calculating the quality entropy of the document to be identified according to the source credibility of the document to be identified; wherein the quality entropy is positively correlated with the source credibility of the document to be identified;
then, the calculating the sensitivity of the document to be recognized according to the similarity entropy of the document to be recognized specifically includes:
calculating the product of the similarity entropy, the use entropy and the quality entropy of the document to be identified as the combined entropy of the document to be identified;
calculating the sensitivity of the document to be identified according to the combined entropy of the document to be identified; wherein the sensitivity of the document to be identified is inversely related to the combined entropy.
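The combination step described above can be sketched in Python as follows. This is a minimal illustration only: the function name `combined_entropy` and its parameter names are hypothetical, and the three input entropies are assumed to have been computed elsewhere.

```python
def combined_entropy(similarity_entropy: float,
                     usage_entropy: float,
                     quality_entropy: float) -> float:
    """Return the product of the three per-document entropies.

    The text defines the combined entropy of a document as the product of
    its similarity entropy, usage entropy and quality entropy.
    """
    return similarity_entropy * usage_entropy * quality_entropy


# Example with made-up entropy values for one document.
h_r = combined_entropy(0.5, 0.2, 0.4)
```

Because the three factors are multiplied, a value of zero for any one entropy drives the combined entropy to zero in this sketch.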
As an improvement of the above scheme, the calculating the use entropy of the document to be recognized according to the user access times of the document to be recognized specifically includes:
acquiring the times of each user accessing the document to be identified, and calculating the sum of the times of all the users accessing the document to be identified as the user access times of the document to be identified;
acquiring the times of each user respectively accessing each document in a platform data set, and calculating the sum of the times of all users respectively accessing each document in the platform data set as the user access times of the platform data set; wherein the platform data set comprises the document to be identified;
according to the user access times of the document to be recognized and the user access times of the platform data set, calculating the use entropy of the document to be recognized through the following calculation formula:
Figure BDA0002354799470000031
wherein H(c_r) is the usage entropy of the document to be identified;
Figure BDA0002354799470000032
is the number of user accesses of the document to be identified;
Figure BDA0002354799470000033
is the number of user accesses of the platform data set; n is the number of the users, and k is the number of documents in the platform data set.
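The two access-count totals described above can be sketched in Python as follows. The layout of `access_counts` (one row per document, one column per user) and the function names are assumptions made for illustration. The usage entropy formula itself appears only in a figure that is not reproduced here, so the sketch stops at the totals and the document's share of all accesses, which is offered as a plausible input to that formula rather than as the formula.

```python
def access_totals(access_counts):
    """Sum accesses per document and over the whole platform data set.

    access_counts[i][u] is the number of times user u accessed document i
    (hypothetical layout: k rows for documents, n columns for users).
    """
    per_document = [sum(row) for row in access_counts]
    platform_total = sum(per_document)
    return per_document, platform_total


def access_share(access_counts, r):
    """Share of all platform accesses that went to document r (assumed quantity)."""
    per_document, platform_total = access_totals(access_counts)
    if platform_total == 0:
        return 0.0  # no accesses recorded anywhere
    return per_document[r] / platform_total


# Example: k = 3 documents, n = 2 users.
counts = [[3, 1], [0, 2], [4, 0]]
per_doc, total = access_totals(counts)
```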
As an improvement of the above scheme, the calculating the quality entropy of the document to be recognized according to the source reliability of the document to be recognized specifically includes:
acquiring source credibility of the document to be identified under each credibility factor, and taking the sum of the source credibility of the document to be identified under all credibility factors as the source credibility of the document to be identified; wherein the credibility factors comprise uploading level, document source, document creation or modification date;
acquiring the source credibility of each document in a platform data set under each credibility factor, and taking the sum of the source credibility of all documents in the platform data set under each credibility factor as the source credibility of the platform data set; wherein the platform data set comprises the document to be identified;
according to the source credibility of the document to be identified and the source credibility of the platform data set, calculating the quality entropy of the document to be identified through the following calculation formula:
Figure BDA0002354799470000041
wherein H(N_r) is the quality entropy of the document to be identified;
Figure BDA0002354799470000042
is the source credibility of the document to be identified;
Figure BDA0002354799470000043
is the source credibility of the platform data set; l is the number of the credibility factors, and k is the number of documents in the platform data set.
As an improvement of the above scheme, the calculating the sensitivity of the document to be recognized according to the combined entropy of the document to be recognized specifically includes:
according to the combined entropy of the documents to be recognized, calculating the sensitivity of the documents to be recognized through the following calculation formula:
Figure BDA0002354799470000044
wherein C(r) is the sensitivity of the document to be identified; H(r) is the combined entropy of the document to be identified;
Figure BDA0002354799470000051
is the sum of the combined entropies of all documents in the platform data set; and k is the number of documents in the platform data set.
As an improvement of the above scheme, the vector splicing is performed on the value element vectors corresponding to the value elements that meet the preset feature contribution threshold to obtain the target value element vector of the document to be identified, and the method specifically includes:
calculating the characteristic contribution degree of each value element of the document to be identified;
obtaining value elements corresponding to the maximum N characteristic contribution degrees as the value elements meeting the preset contribution degree threshold;
carrying out vector splicing on the value element vectors corresponding to the value elements which accord with the preset feature contribution degree threshold value to obtain target value element vectors of the document to be identified;
wherein the calculating the feature contribution degree of each value element of the document to be identified specifically comprises: calculating the feature contribution degree of each value element by the following calculation formula:
Figure BDA0002354799470000052
wherein FCD(t, c_i) is the feature contribution degree of value element t of the document to be identified; df(t, c_i) is the number of documents of class c_i in which value element t appears; df(t, c_j) is the number of documents not of class c_i in which value element t appears; M is the number of document categories of the platform data set; and class c_i is the sensitive document category.
As an improvement of the above scheme, the method for calculating the document sensitivity further includes:
determining the sensitivity level of the document to be identified according to the sensitivity of the document to be identified; wherein the sensitivity level is inversely related to the sensitivity of the document to be identified.
The embodiment of the invention also provides a device for calculating the document sensitivity, which comprises a value element acquisition module, a target value element vector acquisition module, a similarity entropy calculation module and a sensitivity calculation module, wherein:
the value element acquisition module is used for acquiring the value elements of the document to be identified and vectorizing each value element to obtain a corresponding value element vector; wherein the value element comprises metadata of the document to be identified;
the target value element vector acquisition module is used for carrying out vector splicing on value element vectors corresponding to the value elements which accord with a preset feature contribution degree threshold value to obtain a target value element vector of the document to be identified;
the similarity entropy calculation module is used for calculating the similarity entropy of the document to be identified according to the similarity between the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity;
the sensitivity calculation module is used for calculating the sensitivity of the document to be identified according to the similarity entropy of the document to be identified; wherein the sensitivity of the document to be identified is in negative correlation with the similarity entropy.
The embodiment of the present invention further provides a document sensitivity calculation apparatus, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the processor implements the document sensitivity calculation method according to any one of the preceding claims.
Compared with the prior art, the document sensitivity calculation method and the document sensitivity calculation device disclosed by the invention have the advantages that the value elements of the document to be identified are obtained, and each value element is vectorized; carrying out vector splicing on value element vectors corresponding to the value elements which accord with a preset feature contribution degree threshold value to obtain a target value element vector of the document to be identified; and calculating the similarity entropy of the document to be identified according to the similarity of the target value element vector and a preset value element vector of a preset document. And then, calculating the use entropy and the quality entropy of the document to be identified according to the user access times and the source credibility of the document to be identified so as to calculate the combined entropy of the document to be identified, and further calculating the sensitivity of the document to be identified according to the combined entropy of the document to be identified. Compared with the prior art, the method takes the unstructured data set of the value elements of the document as an object, realizes the identification and analysis of sensitive data by adopting a feature vector similarity method, and can realize the sensitivity calculation of the unstructured data document without acquiring the content attribute, the sensitive dictionary and other information of the document in advance by combining the use frequency and the source credibility of the document, thereby effectively improving the accuracy of the sensitivity calculation of the document, and the calculation method is simple and convenient.
Drawings
FIG. 1 is a flowchart illustrating steps of a method for calculating document sensitivity according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of calculating feature contribution degrees in a method for calculating document sensitivity according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating steps of a method for calculating document sensitivity according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a document sensitivity calculation apparatus according to a third embodiment of the present invention;
FIG. 5 is a schematic structural diagram of another document sensitivity calculation apparatus according to the fourth embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flowchart illustrating steps of a method for calculating document sensitivity according to an embodiment of the present invention. The method for calculating the document sensitivity according to the first embodiment of the present invention is executed through steps S11 to S14:
S11, obtaining the value elements of the document to be identified, and vectorizing each value element to obtain a corresponding value element vector; wherein the value element comprises metadata of the document to be identified.
Specifically, the document to be identified is a document in the platform data set whose sensitivity needs to be identified. Accurate and useful information is extracted from the document to be identified by a big data semantic recognition technology and serves as the value elements of the document, so that the data quality and value information of the document can be identified. The value elements are an unstructured data set that includes the metadata of the document to be identified. Metadata is key data generated during the construction of a data warehouse that relates to data source definitions, target definitions, conversion rules and the like; it also contains business information related to the meaning of the data, specifically including the document's data source, information producer, title, keywords, abstract, creation time, language, format, number of views and the like.
After the value elements of the document to be identified are extracted by the big data semantic recognition technology, each value element is vectorized by a word vectorization method such as word2vec. All value element vectors have the same number of dimensions; in this embodiment, each value element vector has 128 dimensions. Converting the symbolic information of natural language into numerical information in vector form makes it convenient, in the subsequent sensitivity calculation, to compute the distance between the value element vector and the value element vector of a specific document, and hence the similarity of the value element vectors.
S12, carrying out vector splicing on the value element vectors corresponding to the value elements meeting the preset feature contribution degree threshold value to obtain the target value element vector of the document to be identified.
Fig. 2 is a schematic flow chart illustrating steps of calculating feature contribution degrees in the document sensitivity calculation method according to an embodiment of the present invention. Step S12 is executed by steps S121 to S123:
S121, calculating the feature contribution degree of each value element of the document to be identified.
Feature Contribution Degree (FCD) is a method of feature selection for identifying the degree of contribution of a value element to the discriminative power between different classes.
Specifically, the feature contribution degree of each value element is calculated by the following calculation formula:
Figure BDA0002354799470000081
wherein FCD(t, c_i) is the feature contribution degree of value element t of the document to be identified; df(t, c_i) is the number of documents of class c_i in which value element t appears; df(t, c_j) is the number of documents not of class c_i in which value element t appears; and M is the number of document categories of the platform data set.
In one embodiment, the documents in the platform data set are divided, according to a preset algorithm or division rule, into two document categories, sensitive documents and non-sensitive documents, that is, M = 2. The feature contribution degree of each value element is calculated to measure that value element's ability to distinguish the sensitive document category from the non-sensitive document category. When class c_i is the sensitive document category, a larger FCD means that value element t contributes more to distinguishing sensitive documents, that is, it is a stronger guide for identifying sensitive documents. When class c_i is the non-sensitive document category, a larger FCD means that value element t contributes more to distinguishing non-sensitive documents, that is, it is a stronger guide for identifying non-sensitive documents. The value range of FCD is the interval [0, 1].
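The FCD formula itself appears only in a figure that is not reproduced in this text, so it is not implemented here. The following Python sketch covers only the two document-frequency counts that the formula is defined over, df(t, c_i) and df(t, c_j). The function name `document_frequencies`, the class labels and the input layout are hypothetical.

```python
from collections import Counter


def document_frequencies(documents, target_class):
    """Count, per value element, how many documents contain it.

    documents: list of (class_label, value_elements) pairs, where
    value_elements is an iterable of value element names.
    Returns (df_in, df_out): counts over documents of target_class and
    counts over documents of every other class.
    """
    df_in, df_out = Counter(), Counter()
    for label, elements in documents:
        counter = df_in if label == target_class else df_out
        # set() ensures each element is counted once per document.
        counter.update(set(elements))
    return df_in, df_out


# Example with M = 2 categories and made-up value elements.
docs = [
    ("sensitive", ["salary", "contract"]),
    ("sensitive", ["salary"]),
    ("non-sensitive", ["menu", "salary"]),
]
df_in, df_out = document_frequencies(docs, "sensitive")
```

A `Counter` returns 0 for elements it has never seen, so a value element that never occurs in one category needs no special handling.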
S122, obtaining value elements corresponding to the maximum N characteristic contribution degrees as the value elements meeting the preset contribution degree threshold value; wherein N is more than or equal to 1.
In the present embodiment, class c_i is the sensitive document category, so that the feature contribution degree effectively measures how strongly a value element guides the identification of sensitive documents. After the feature contribution degree of each value element of the document to be identified is calculated, the value elements corresponding to the N largest feature contribution degrees are taken as the value elements meeting the preset feature contribution degree threshold. Preferably, N = 8. In the embodiment of the invention, the key value elements are selected by the feature contribution degree method based on the distribution of the value element vectors, which reduces the complexity of the sensitivity calculation.
S123, carrying out vector splicing on the value element vectors corresponding to the value elements which accord with the preset feature contribution degree threshold value to obtain the target value element vector of the document to be identified.
The value element vectors corresponding to the value elements meeting the preset feature contribution degree threshold are spliced (concatenated) into the target value element vector of the document to be identified. When N = 8 value elements meet the preset feature contribution degree threshold, the dimension of the target value element vector is 128 × 8 = 1024.
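Steps S122 and S123 can be sketched in Python as follows, assuming the feature contribution degree of each value element has already been computed. The function name `splice_top_n` and the dictionary layout are hypothetical, and the vectors are plain lists of floats rather than the output of any particular embedding library.

```python
def splice_top_n(element_vectors, contributions, n=8):
    """Select the n value elements with the largest FCD and concatenate their vectors.

    element_vectors: dict mapping value element name -> list of floats
                     (all of the same length, e.g. 128).
    contributions:   dict mapping value element name -> feature contribution degree.
    Returns (selected_names, target_vector).
    """
    # Rank value elements by feature contribution degree, largest first.
    ranked = sorted(element_vectors, key=lambda name: contributions[name], reverse=True)
    selected = ranked[:n]
    target = []
    for name in selected:
        target.extend(element_vectors[name])  # vector splicing = concatenation
    return selected, target


# Example: 10 value elements with 128-dimensional vectors and made-up scores.
vectors = {f"e{i}": [float(i)] * 128 for i in range(10)}
scores = {f"e{i}": i / 10 for i in range(10)}
selected, target = splice_top_n(vectors, scores, n=8)
```

With 128-dimensional inputs and N = 8 the spliced vector has 1024 entries, matching the dimension stated in the text. The order of concatenation follows the FCD ranking here; the source does not specify an order, so that choice is an assumption.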
S13, calculating a similarity entropy of the document to be identified according to the similarity of the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity.
The preset documents are representative, typical sensitive documents. Specifically, for each preset document, the value element vectors corresponding to the N value elements with the largest feature contribution degrees are obtained and spliced to form the preset value element vector of that preset document, which therefore also has 1024 dimensions. The similarity between the target value element vector and the preset value element vector of a preset document is then calculated, that is, the distance between the two vectors is calculated; the smaller the distance, the greater the similarity. The similarity may be calculated by any method known in the prior art, such as cosine similarity or the Tanimoto coefficient, and is not specifically limited herein.
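The two similarity measures named above can be sketched in plain Python as follows. These are the standard textbook definitions for real-valued vectors; the function names and the handling of zero vectors are choices made for this illustration and are not specified by the source.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tanimoto_coefficient(a, b):
    """Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b), 0.0 if the denominator is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        return 0.0
    return dot / denom


# Example on two short vectors.
cos_example = cosine_similarity([1.0, 1.0], [1.0, 0.0])
tan_example = tanimoto_coefficient([1.0, 1.0], [1.0, 0.0])
```

Both measures grow as the vectors become more alike, so a larger value corresponds to a smaller distance between the target and preset value element vectors.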
As a preferred embodiment, step S13 is executed by steps S131 to S133:
S131, calculating the similarity between the target value element vector of the document to be identified and the preset value element vector of each preset document in a preset document set, and taking the sum of the similarities between the target value element vector and the preset value element vectors of all preset documents in the preset document set as the similarity of the document to be identified.
The preset document set is a data set of preset sensitive documents, the preset document set comprises m preset sensitive documents, and m is larger than or equal to 1. Each preset document is pre-calculated with a corresponding preset value element vector for measuring the sensitivity distinguishing contribution degree of the corresponding preset document.
Specifically, the similarity between the target value element vector of the document to be identified and the preset value element vector of each preset document in the preset document set is calculated as
Figure BDA0002354799470000101
and the sum of the similarities between the target value element vector and the preset value element vectors of all preset documents,
Figure BDA0002354799470000102
is taken as the similarity of the document to be identified.
S132, calculating the similarity between each document in the platform data set and the preset value element vector of each preset document, and taking the sum of the similarities between all the documents in the platform data set and the preset value element vector of each preset document as the similarity of the platform data set; wherein the platform data set comprises the document to be identified;
The platform data set is the collection of all documents, including the document to be identified. The platform data set comprises k documents, where k is larger than or equal to 1. The similarity between each document in the platform data set and the preset value element vector of each preset document is calculated as
Figure BDA0002354799470000103
and the total similarity between all documents and the preset value element vectors of all preset documents,
Figure BDA0002354799470000104
is taken as the similarity of the platform data set.
S133, calculating a similarity entropy of the document to be identified according to a ratio of the similarity of the document to be identified to the similarity of the platform data set, wherein the similarity entropy specifically satisfies a calculation formula:
Figure BDA0002354799470000111
wherein H(d_r) is the similarity entropy of the document to be identified;
Figure BDA0002354799470000112
is the similarity of the document to be identified;
Figure BDA0002354799470000113
is the similarity of the platform data set; m is the number of the preset documents, and k is the number of documents in the platform data set.
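Steps S131 to S133 can be sketched in Python as follows. The two sums and their ratio follow the text. The final entropy term is an assumption: the formula itself is in a figure not reproduced here, so the sketch uses the standard discrete entropy term -p·log2(p), with p being the ratio. The function name `similarity_entropy` and the matrix layout are hypothetical.

```python
import math


def similarity_entropy(sim, r):
    """Similarity entropy of document r under an ASSUMED -p*log2(p) form.

    sim[i][j] is the similarity between document i of the platform data set
    (k rows) and preset document j (m columns).
    """
    doc_similarity = sum(sim[r])                         # S131: sum over m preset documents
    platform_similarity = sum(sum(row) for row in sim)   # S132: sum over k documents
    if doc_similarity <= 0 or platform_similarity <= 0:
        return 0.0
    p = doc_similarity / platform_similarity             # S133: ratio of the two sums
    return -p * math.log2(p)                             # assumed entropy term


# Example: k = 3 documents, m = 2 preset documents, made-up similarities.
sim = [[0.5, 0.5], [1.0, 0.5], [1.0, 0.5]]
h_d0 = similarity_entropy(sim, 0)
```

Under this assumed form, the entropy increases with p only while p is below 1/e, which holds for any document whose share of the platform similarity is small, as is typical when k is large.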
S14, calculating the sensitivity of the document to be identified according to the similarity entropy of the document to be identified; wherein the sensitivity of the document to be identified is negatively correlated with the similarity entropy.
The similarity between the value element vector of the document to be identified and those of the preset sensitive document set is calculated, the similarity entropy of the document to be identified is obtained by the discrete random variable entropy method, and the sensitivity of the document to be identified is then evaluated. A functional relation in which the document sensitivity is negatively correlated with the similarity entropy is preset, and the sensitivity of the document to be identified is calculated by substituting its similarity entropy into that relation. Specifically, the higher the similarity entropy, the smaller the sensitivity value of the document to be identified, and the more sensitive the document is.
The embodiment of the invention provides a method for calculating document sensitivity, which comprises the steps of obtaining value elements of a document to be identified and vectorizing each value element; carrying out vector splicing on value element vectors corresponding to the value elements which accord with a preset feature contribution degree threshold value to obtain a target value element vector of the document to be identified; and calculating the similarity entropy of the document to be identified according to the similarity of the target value element vector and a preset value element vector of a preset document, and further calculating the sensitivity of the document to be identified according to the similarity entropy of the document to be identified. In the embodiment of the invention, the unstructured data set of the value element of the document is taken as an object, the identification and analysis of sensitive data are realized by adopting a feature vector similarity method, the sensitivity calculation of the unstructured data document can be realized without acquiring the content attribute, the sensitive dictionary and other information of the document in advance, the accuracy of the sensitivity calculation of the document is effectively improved, and the calculation method is simple and convenient.
Fig. 3 is a schematic flow chart illustrating steps of a document sensitivity calculation method according to a second embodiment of the present invention. The method for calculating the document sensitivity provided by the second embodiment of the present invention is implemented on the basis of the first embodiment, and specifically includes steps S21 to S27:
S21, obtaining the value elements of the document to be identified, and vectorizing each value element to obtain a corresponding value element vector; wherein the value element comprises metadata of the document to be identified.
And S22, carrying out vector splicing on the value element vectors corresponding to the value elements meeting the preset feature contribution degree threshold value to obtain the target value element vector of the document to be identified.
S23, calculating a similarity entropy of the document to be identified according to the similarity of the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity.
In the embodiment of the present invention, the steps S21 to S23 are the same as the steps S11 to S13 in the first embodiment, and are not repeated herein.
S24, calculating the usage entropy of the document to be identified according to the user access times of the document to be identified; wherein the usage entropy is positively correlated with the user access times of the document to be identified.
As a preferred embodiment, step S24 is executed by steps S241 to S243:
S241, obtaining the number of times each user accesses the document to be identified, and calculating the sum of the numbers of times all users access the document to be identified as the user access times of the document to be identified.
S242, obtaining the times of each user respectively accessing each document in the platform data set, and calculating the sum of the times of all users respectively accessing each document in the platform data set as the user access times of the platform data set; wherein the platform data set includes the document to be identified.
Assume that n users access the platform data set, where n ≥ 1, and that the platform data set contains k documents, where k ≥ 1. The number of times each user accesses the document to be identified is counted as
Figure BDA0002354799470000121
and the sum of the numbers of times the n users access the document to be identified,
Figure BDA0002354799470000122
is taken as the user access times of the document to be identified. Likewise, the number of times each user accesses each document in the platform data set is counted as
Figure BDA0002354799470000123
and the sum of the numbers of times the n users access each document in the platform data set,
Figure BDA0002354799470000124
is taken as the user access times of the platform data set.
S243, calculating the usage entropy of the document to be identified according to the ratio of the user access times of the document to be identified to the user access times of the platform data set; specifically, the following formula is satisfied:
Figure BDA0002354799470000131
wherein H(c_r) is the usage entropy of the document to be identified;
Figure BDA0002354799470000132
is the user access times of the document to be identified;
Figure BDA0002354799470000133
is the user access times of the platform data set; n is the number of users, and k is the number of documents in the platform data set.
In the embodiment of the invention, the more frequently document data is used, that is, the more times a document is accessed by users, the more likely it is to be abused and the more sensitive it is. The usage entropy of the document to be identified is obtained using the discrete random variable entropy method and is then used in evaluating the sensitivity of the document to be identified, which makes the sensitivity calculation more accurate and reliable.
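Steps S241 to S243 can be sketched in Python as below. The formula itself is only available as a figure in this text, so the single-term form -p·log2(p) is an assumption, as are the nested-dictionary layout of the access log and the user and document names.

```python
import math
from typing import Dict

# access_counts[user][doc_id] -> times that user accessed that document (assumed layout)
AccessLog = Dict[str, Dict[str, int]]


def usage_entropy(access_counts: AccessLog, doc_id: str) -> float:
    """Assumed form: -p * log2(p), with p = document accesses / platform accesses."""
    # S241: accesses to the document to be identified, summed over all users.
    doc_total = sum(per_doc.get(doc_id, 0) for per_doc in access_counts.values())
    # S242: accesses to every document in the platform data set, summed over all users.
    platform_total = sum(sum(per_doc.values()) for per_doc in access_counts.values())
    if doc_total == 0 or platform_total == 0:
        return 0.0
    # S243: entropy term of the access ratio.
    p = doc_total / platform_total
    return -p * math.log2(p)


# Hypothetical log: two users, three documents; "r" is the document to be identified.
log = {"alice": {"r": 2, "a": 2}, "bob": {"r": 2, "b": 10}}
print(usage_entropy(log, "r"))
```

Here the document receives 4 of the 16 recorded accesses, so p is 0.25.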
S25, calculating the quality entropy of the document to be identified according to the source credibility of the document to be identified; wherein the quality entropy is positively correlated with the source credibility of the document to be identified.
As a preferred embodiment, step S25 is executed by steps S251 to S253:
S251, obtaining the source credibility of the document to be identified under each credibility factor, and taking the sum of the source credibility of the document to be identified under all credibility factors as the source credibility of the document to be identified; wherein the credibility factors comprise uploader level, document source, and document creation or modification date.
S252, obtaining the source credibility of each document in the platform data set under each credibility factor, and taking the sum of the source credibility of all documents in the platform data set under each credibility factor as the source credibility of the platform data set; wherein the platform data set includes the document to be identified.
Specifically, the source credibility of a document is related to credibility factors such as the uploader level, the document source, and the document creation or modification date. A mapping between each of these three credibility factors and the source credibility is set, and the source credibility of the document to be identified under each credibility factor is obtained as
Figure BDA0002354799470000141
The sum of the source credibility of the document to be identified under the three credibility factors,
Figure BDA0002354799470000142
is taken as the source credibility of the document to be identified. The source credibility of each document in the platform data set under each credibility factor is obtained as
Figure BDA0002354799470000143
and the sum of the source credibility of all documents in the platform data set under each credibility factor,
Figure BDA0002354799470000144
is taken as the source credibility of the platform data set.
S253, calculating the quality entropy of the document to be identified according to the ratio of the source credibility of the document to be identified to the source credibility of the platform data set; specifically, the following calculation formula is satisfied:
Figure BDA0002354799470000145
wherein H(N_r) is the quality entropy of the document to be identified;
Figure BDA0002354799470000146
is the source credibility of the document to be identified;
Figure BDA0002354799470000147
is the source credibility of the platform data set; l is the number of credibility factors, and k is the number of documents in the platform data set. In the present embodiment, l = 3.
In the embodiment of the invention, the higher the quality of the document data, that is, the more credible the document source, the more sensitive the document. The quality entropy of the document to be identified is obtained using the discrete random variable entropy method and is then used in evaluating the sensitivity of the document to be identified, which makes the sensitivity calculation more accurate and reliable.
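Steps S251 to S253 follow the same pattern and can be sketched as below. The per-factor credibility scores are assumed to have been produced already by the preset mappings; the score values, the factor keys, and the single-term form -p·log2(p) are illustrative assumptions, since the formula is only available as a figure in this text.

```python
import math
from typing import Dict

# scores[doc_id][factor] -> source credibility under that factor (assumed layout)
Credibility = Dict[str, Dict[str, float]]


def quality_entropy(scores: Credibility, doc_id: str) -> float:
    """Assumed form: -p * log2(p), with p = document credibility / platform credibility."""
    # S251: credibility of the document, summed over all credibility factors.
    doc_total = sum(scores.get(doc_id, {}).values())
    # S252: credibility of every document in the platform data set, over all factors.
    platform_total = sum(sum(factors.values()) for factors in scores.values())
    if doc_total <= 0.0 or platform_total <= 0.0:
        return 0.0
    # S253: entropy term of the credibility ratio.
    p = doc_total / platform_total
    return -p * math.log2(p)


# Hypothetical scores under the three factors named in the text (l = 3).
scores = {
    "r": {"uploader_level": 1.0, "source": 0.5, "date": 0.5},
    "a": {"uploader_level": 2.0, "source": 2.0, "date": 2.0},
}
print(quality_entropy(scores, "r"))
```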
S26, calculating the product of the similarity entropy, the usage entropy and the quality entropy of the document to be identified as the combined entropy of the document to be identified.
The combined entropy H(r) of the document to be identified specifically satisfies the following calculation formula:
H(r) = H(c_r) · H(d_r) · H(N_r);
wherein H(d_r) is the similarity entropy of the document to be identified, H(c_r) is the usage entropy of the document to be identified, and H(N_r) is the quality entropy of the document to be identified.
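The product in step S26 is short enough to state directly in Python; the function and parameter names are hypothetical, and the input values are made up for illustration.

```python
def combined_entropy(similarity_h: float, usage_h: float, quality_h: float) -> float:
    """Combined entropy: the product of the usage, similarity and quality entropies."""
    return usage_h * similarity_h * quality_h


# Hypothetical entropy values for two documents.
print(combined_entropy(0.5, 0.5, 0.25))  # 0.0625
print(combined_entropy(0.5, 0.0, 0.25))  # 0.0
```

Because the three terms are multiplied, a zero in any one of them makes the combined entropy zero regardless of the other two.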
S27, calculating the sensitivity of the document to be recognized according to the combined entropy of the document to be recognized; wherein the sensitivity of the document to be identified is inversely related to the combined entropy.
Specifically, the combined entropy H(r_i) of each document in the platform data set is calculated in the same way as the combined entropy of the document to be identified, and the sum of the combined entropies of all documents in the platform data set is obtained as
Figure BDA0002354799470000151
According to the combined entropy H(r) of the document to be identified and the combined entropy of the platform data set
Figure BDA0002354799470000152
the sensitivity C(r) of the document to be identified is calculated by the following calculation formula:
Figure BDA0002354799470000153
The sensitivity C(r) of the document to be identified takes values in [0,1]. The closer the value is to 0, the more sensitive the document to be identified; the closer the value is to 1, the less sensitive it is.
In the embodiment of the invention, the sensitivity of the document to be recognized in the platform data set is evaluated by combining the similarity entropy, the use entropy and the quality entropy of the document, so that the sensitivity recognition and analysis of the document to be recognized are realized.
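A sketch of step S27 follows. The formula for C(r) is only available as a figure in this text, so the form used here, one minus the document's share of the platform's total combined entropy, is an assumption chosen because it stays in [0,1] and decreases as the combined entropy rises. The function name and the sample values are hypothetical.

```python
from typing import Dict


def sensitivity(combined: Dict[str, float], doc_id: str) -> float:
    """Assumed form: C(r) = 1 - H(r) / sum of H(r_i) over the platform data set."""
    total = sum(combined.values())
    if total <= 0.0:
        # Assumption: with no entropy anywhere, report the least sensitive value.
        return 1.0
    return 1.0 - combined[doc_id] / total


# Hypothetical combined entropies of a two-document platform data set.
combined = {"r": 0.75, "a": 0.25}
print(sensitivity(combined, "r"))  # 0.25
print(sensitivity(combined, "a"))  # 0.75
```

Document "r" holds the larger share of combined entropy, so its value lies closer to 0, which the text reads as more sensitive.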
Further, the method for calculating the document sensitivity further comprises the following steps: determining the sensitivity level of the document to be identified according to the sensitivity of the document to be identified; wherein the sensitivity level is inversely related to the sensitivity of the document to be identified.
The sensitivity level of the document to be identified is determined according to a preset mapping between the sensitivity and the sensitivity level of a document, wherein the sensitivity levels comprise: a non-sensitive level, a lightly sensitive level, a moderately sensitive level, and a highly sensitive level. The mapping between the sensitivity and the sensitivity level is shown in Table 1:
TABLE 1 mapping of sensitivity to sensitivity rating
Figure BDA0002354799470000154
Figure BDA0002354799470000161
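The level lookup can be sketched as a threshold search. Table 1 is only available as an image in this text, so the cut points 0.25, 0.5 and 0.75 below are hypothetical placeholders rather than the patent's values; only the four level names and the inverse relation between sensitivity value and level come from the text.

```python
from bisect import bisect_right

# Hypothetical cut points; the actual ranges are those of Table 1.
THRESHOLDS = [0.25, 0.5, 0.75]
# Lower sensitivity values map to higher sensitivity levels.
LEVELS = ["highly sensitive", "moderately sensitive", "lightly sensitive", "non-sensitive"]


def sensitivity_level(c: float) -> str:
    """Map a sensitivity value in [0, 1] to one of the four levels."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("sensitivity must lie in [0, 1]")
    # bisect_right counts how many cut points are less than or equal to c.
    return LEVELS[bisect_right(THRESHOLDS, c)]


print(sensitivity_level(0.1))  # highly sensitive
print(sensitivity_level(0.9))  # non-sensitive
```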
The second embodiment of the invention provides a method for calculating document sensitivity. The method obtains the value elements of a document to be identified and vectorizes each value element; splices the value element vectors corresponding to the value elements that meet a preset feature contribution degree threshold to obtain the target value element vector of the document to be identified; and calculates the similarity entropy of the document to be identified from the similarity between the target value element vector and the preset value element vector of a preset document. It then calculates the usage entropy and the quality entropy of the document to be identified from its user access times and source credibility, combines the three entropies into the combined entropy of the document to be identified, and calculates the sensitivity of the document from that combined entropy. The embodiment takes the unstructured data set formed by the value elements of documents as its object, identifies and analyzes sensitive data by feature vector similarity, and additionally takes the usage frequency and source credibility of the document into account. The sensitivity of an unstructured document can therefore be calculated without obtaining the document's content attributes, a sensitive dictionary, or similar information in advance, and the sensitivity level of the document is determined from its sensitivity. This improves the accuracy of the document sensitivity calculation while keeping the calculation method simple.
Fig. 4 is a schematic structural diagram of a document sensitivity calculation apparatus according to a third embodiment of the present invention. In the embodiment of the present invention, the document sensitivity calculating device 30 includes: a value element acquisition module 31, a target value element vector acquisition module 32, a similarity entropy calculation module 33 and a sensitivity calculation module 34; wherein,
the value element obtaining module 31 is configured to obtain value elements of a document to be identified, and vectorize each value element to obtain a corresponding value element vector; wherein the value element comprises metadata of the document to be identified;
the target value element vector obtaining module 32 is configured to perform vector splicing on the value element vectors corresponding to the value elements that meet a preset feature contribution threshold value, so as to obtain a target value element vector of the document to be identified;
the similarity entropy calculation module 33 is configured to calculate a similarity entropy of the document to be recognized according to a similarity between the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity;
the sensitivity calculation module 34 is configured to calculate the sensitivity of the document to be identified according to the similarity entropy of the document to be identified; wherein the sensitivity of the document to be identified is in negative correlation with the similarity entropy.
It should be noted that, the document sensitivity calculation apparatus provided in the embodiment of the present invention is configured to execute all the process steps of the document sensitivity calculation method in the first embodiment or the second embodiment, and working principles and beneficial effects of the two are in one-to-one correspondence, so that details are not repeated.
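The four-module structure can be sketched as a small pipeline object. This is an illustrative composition only: the class name, the field names and the callable signatures are assumptions, and the stub functions in the usage lines stand in for real module implementations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vectors = Dict[str, List[float]]


@dataclass
class SensitivityDevice:
    """Chains four modules in the order of modules 31 to 34."""

    acquire: Callable[[dict], Vectors]                  # value element acquisition (31)
    build_target: Callable[[Vectors], List[float]]      # target vector acquisition (32)
    similarity_entropy: Callable[[List[float]], float]  # similarity entropy calculation (33)
    to_sensitivity: Callable[[float], float]            # sensitivity calculation (34)

    def run(self, document: dict) -> float:
        vectors = self.acquire(document)
        target = self.build_target(vectors)
        entropy = self.similarity_entropy(target)
        return self.to_sensitivity(entropy)


# Stub modules, for illustration only.
device = SensitivityDevice(
    acquire=lambda doc: {"title": [float(len(doc["title"]))]},
    build_target=lambda vecs: [x for vec in vecs.values() for x in vec],
    similarity_entropy=lambda target: 0.25,
    to_sensitivity=lambda h: 1.0 - h,
)
print(device.run({"title": "abc"}))  # 0.75
```

Passing the modules in as callables keeps each one replaceable, for example to swap in a calculation that also uses the usage and quality entropies of the second embodiment.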
The third embodiment of the invention provides a device for calculating document sensitivity. The device obtains the value elements of a document to be identified and vectorizes each value element; splices the value element vectors corresponding to the value elements that meet a preset feature contribution degree threshold to obtain the target value element vector of the document to be identified; and calculates the similarity entropy of the document to be identified from the similarity between the target value element vector and the preset value element vector of a preset document. It then calculates the usage entropy and the quality entropy of the document to be identified from its user access times and source credibility, combines the three entropies into the combined entropy of the document to be identified, and calculates the sensitivity of the document from that combined entropy. The embodiment takes the unstructured data set formed by the value elements of documents as its object, identifies and analyzes sensitive data by feature vector similarity, and additionally takes the usage frequency and source credibility of the document into account. The sensitivity of an unstructured document can therefore be calculated without obtaining the document's content attributes, a sensitive dictionary, or similar information in advance, which improves the accuracy of the document sensitivity calculation while keeping the calculation method simple.
Fig. 5 is a schematic structural diagram of another document sensitivity calculation apparatus according to the fourth embodiment of the present invention. In an embodiment of the present invention, the document sensitivity calculation apparatus 40 includes a processor 41, a memory 42, and a computer program stored in the memory and configured to be executed by the processor, and the processor executes the computer program to implement the document sensitivity calculation method according to any one of the first embodiment and the second embodiment.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method for calculating document sensitivity is characterized by comprising the following steps:
obtaining value elements of a document to be identified, and vectorizing each value element to obtain a corresponding value element vector; wherein the value element comprises metadata of the document to be identified;
carrying out vector splicing on value element vectors corresponding to the value elements which accord with a preset feature contribution degree threshold value to obtain a target value element vector of the document to be identified;
calculating a similarity entropy of the document to be identified according to the similarity of the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity;
calculating the sensitivity of the document to be identified according to the similarity entropy of the document to be identified; wherein the sensitivity of the document to be identified is in negative correlation with the similarity entropy.
2. The method for calculating document sensitivity according to claim 1, wherein the calculating the similarity entropy of the document to be recognized according to the similarity between the target value element vector and a preset value element vector of a preset document specifically comprises:
calculating the similarity between the target value element vector of the document to be identified and a preset value element vector of each preset document in a preset document set, and taking the sum of the similarities between all the target value element vectors and the preset value element vector of each preset document as the similarity of the document to be identified;
calculating the similarity of each document in a platform data set and a preset value element vector of each preset document, and taking the sum of the similarities of all documents in the platform data set and the preset value element vector of each preset document as the similarity of the platform data set; wherein the platform data set comprises the document to be identified;
according to the similarity of the document to be recognized and the similarity of the platform data set, calculating the similarity entropy of the document to be recognized through the following calculation formula:
Figure FDA0002354799460000021
wherein H(d_r) is the similarity entropy of the document to be identified;
Figure FDA0002354799460000022
is the similarity of the document to be identified;
Figure FDA0002354799460000023
is the similarity of the platform data set; m is the number of the preset documents, and k is the number of documents in the platform data set.
3. The method of calculating document sensitivity of claim 1, wherein the method of calculating document sensitivity further comprises:
calculating the use entropy of the document to be identified according to the user access times of the document to be identified; wherein the usage entropy is positively correlated with the user access times of the document to be identified;
calculating the quality entropy of the document to be identified according to the source credibility of the document to be identified; wherein the quality entropy is positively correlated with the source credibility of the document to be identified;
then, the calculating the sensitivity of the document to be recognized according to the similarity entropy of the document to be recognized specifically includes:
calculating the product of the similarity entropy, the use entropy and the quality entropy of the document to be identified as the combined entropy of the document to be identified;
calculating the sensitivity of the document to be identified according to the combined entropy of the document to be identified; wherein the sensitivity of the document to be identified is inversely related to the combined entropy.
4. The method for calculating the document sensitivity according to claim 3, wherein the calculating the use entropy of the document to be recognized according to the user access times of the document to be recognized specifically comprises:
acquiring the times of each user accessing the document to be identified, and calculating the sum of the times of all the users accessing the document to be identified as the user access times of the document to be identified;
acquiring the times of each user respectively accessing each document in a platform data set, and calculating the sum of the times of all users respectively accessing each document in the platform data set as the user access times of the platform data set; wherein the platform data set comprises the document to be identified;
according to the user access times of the document to be recognized and the user access times of the platform data set, calculating the use entropy of the document to be recognized through the following calculation formula:
Figure FDA0002354799460000031
wherein H(c_r) is the usage entropy of the document to be identified;
Figure FDA0002354799460000032
is the user access times of the document to be identified;
Figure FDA0002354799460000033
is the user access times of the platform data set; n is the number of users, and k is the number of documents in the platform data set.
5. The method for calculating the sensitivity of a document according to claim 3, wherein the calculating the quality entropy of the document to be recognized according to the source reliability of the document to be recognized specifically comprises:
acquiring the source credibility of the document to be identified under each credibility factor, and taking the sum of the source credibility of the document to be identified under all credibility factors as the source credibility of the document to be identified; wherein the credibility factors comprise uploader level, document source, and document creation or modification date;
acquiring the source credibility of each document in a platform data set under each credibility factor, and taking the sum of the source credibility of all documents in the platform data set under each credibility factor as the source credibility of the platform data set; wherein the platform data set comprises the document to be identified;
according to the source credibility of the document to be identified and the source credibility of the platform data set, calculating the quality entropy of the document to be identified through the following calculation formula:
Figure FDA0002354799460000041
wherein H(N_r) is the quality entropy of the document to be identified;
Figure FDA0002354799460000042
is the source credibility of the document to be identified;
Figure FDA0002354799460000043
is the source credibility of the platform data set; l is the number of credibility factors, and k is the number of documents in the platform data set.
6. The method for calculating the sensitivity of a document according to claim 3, wherein the calculating the sensitivity of the document to be recognized according to the combined entropy of the document to be recognized specifically comprises:
according to the combined entropy of the documents to be recognized, calculating the sensitivity of the documents to be recognized through the following calculation formula:
Figure FDA0002354799460000044
wherein C(r) is the sensitivity of the document to be identified; H(r) is the combined entropy of the document to be identified;
Figure FDA0002354799460000045
is the sum of the combined entropies of all documents in the platform data set; and k is the number of documents in the platform data set.
7. The method for calculating document sensitivity according to any one of claims 1 to 6, wherein the vector splicing is performed on the value element vectors corresponding to the value elements that meet a preset feature contribution threshold value to obtain the target value element vector of the document to be identified, and specifically includes:
calculating the characteristic contribution degree of each value element of the document to be identified;
obtaining value elements corresponding to the maximum N characteristic contribution degrees as the value elements meeting the preset contribution degree threshold;
carrying out vector splicing on the value element vectors corresponding to the value elements which accord with the preset feature contribution degree threshold value to obtain target value element vectors of the document to be identified;
wherein, the calculating the feature contribution degree of each value element of the document to be identified specifically comprises: calculating the characteristic contribution degree of each value element by the following calculation formula:
Figure FDA0002354799460000051
wherein FCD(t, c_i) is the feature contribution degree of value element t of the document to be identified; df(t, c_i) is the number of documents of class c_i in which value element t appears; df(t, c_j) is the number of documents not of class c_i in which value element t appears; M is the number of document categories of the platform data set; and class c_i is the sensitive document category.
8. The method of calculating document sensitivity of claim 1, further comprising:
determining the sensitivity level of the document to be identified according to the sensitivity of the document to be identified; wherein the sensitivity level is inversely related to the sensitivity of the document to be identified.
9. A document sensitivity calculation apparatus, comprising: a value element acquisition module, a target value element vector acquisition module, a similarity entropy calculation module and a sensitivity calculation module; wherein,
the value element acquisition module is used for acquiring the value elements of the document to be identified and vectorizing each value element to obtain a corresponding value element vector; wherein the value element comprises metadata of the document to be identified;
the target value element vector acquisition module is used for carrying out vector splicing on value element vectors corresponding to the value elements which accord with a preset feature contribution degree threshold value to obtain a target value element vector of the document to be identified;
the similarity entropy calculation module is used for calculating the similarity entropy of the document to be identified according to the similarity between the target value element vector and a preset value element vector of a preset document; wherein the similarity entropy is positively correlated with the similarity;
the sensitivity calculation module is used for calculating the sensitivity of the document to be identified according to the similarity entropy of the document to be identified; wherein the sensitivity of the document to be identified is in negative correlation with the similarity entropy.
10. A document sensitivity calculation apparatus, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the document sensitivity calculation method according to any one of claims 1 to 8 when executing the computer program.
CN202010004721.5A 2020-01-03 2020-01-03 Document sensitivity calculation method and device Active CN111209735B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010004721.5A CN111209735B (en) 2020-01-03 2020-01-03 Document sensitivity calculation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010004721.5A CN111209735B (en) 2020-01-03 2020-01-03 Document sensitivity calculation method and device

Publications (2)

Publication Number Publication Date
CN111209735A true CN111209735A (en) 2020-05-29
CN111209735B CN111209735B (en) 2023-06-02

Family

ID=70789555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010004721.5A Active CN111209735B (en) 2020-01-03 2020-01-03 Document sensitivity calculation method and device

Country Status (1)

Country Link
CN (1) CN111209735B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090157603A1 (en) * 2007-12-14 2009-06-18 Petter Moe Method for improving security in distribution of electronic documents
CN104391835A (en) * 2014-09-30 2015-03-04 中南大学 Method and device for selecting feature words in texts
US20170004128A1 (en) * 2015-07-01 2017-01-05 Institute for Sustainable Development Device and method for analyzing reputation for objects by data mining
CN106649262A (en) * 2016-10-31 2017-05-10 复旦大学 Protection method for enterprise hardware facility sensitive information in social media
CN106845265A (en) * 2016-12-01 2017-06-13 北京计算机技术及应用研究所 A kind of document security level automatic identifying method
CN109446288A (en) * 2018-10-18 2019-03-08 重庆邮电大学 One kind being based on the internet Spark concerning security matters map detection algorithm

Also Published As

Publication number Publication date
CN111209735B (en) 2023-06-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant