CN109800600B - Ocean big data sensitivity evaluation system and prevention method facing privacy requirements

Info

Publication number: CN109800600B
Application number: CN201910060928.1A
Authority: CN
Inventors: 王晓东; 罗祥裕; 解玮玮; 魏志强; 王雪
Original assignee: Ocean University of China
Current assignee: Shandong Ocean Instrument Technology Center Co ltd
Priority date: 2019-01-23
Filing date: 2019-01-23
Publication date: 2020-11-24
Anticipated expiration: 2039-01-23
Also published as: CN109800600A

Abstract

The invention discloses a security-requirement-oriented marine big data sensitivity evaluation system and a prevention method. The method comprises the following steps: step one, establishing a sensitive feature library; step two, matching and analyzing the data set to be processed; step three, constructing a sensitivity evaluation model; step four, calculating attribute sensitivity; step five, protecting sensitive data; step six, dynamically optimizing the sensitive feature library. The invention is used to identify and protect sea-related data that come from various marine service systems and differ in data sensitivity and value intensity.

Description

Ocean big data sensitivity evaluation system and prevention method facing privacy requirements

Technical Field

The invention belongs to the technical field of data identification processing, and particularly relates to a security requirement-oriented marine big data sensitivity evaluation system and a prevention method.

Background

Data is an important asset for organizations, businesses, and individuals, and data leakage causes significant losses to data owners. In recent years, data leakage is the source of the black industry chain of information selling. The safety protection of the sensitive data in the big data environment has extremely high theoretical research value and engineering practice significance, and how to protect the sensitive data becomes the key point of big data safety attention.

In the prior art, data is mostly protected by access control, privacy techniques for sensitive data based on data distortion (preserving certain statistical characteristics), data encryption, and restricted release (withholding certain attributes of the data). Most of these techniques operate directly on sensitive data and do not consider how the sensitive data is to be discovered. Existing sensitivity evaluation methods rely mainly on prior definitions by experts, so their accuracy cannot be verified and they cannot meet the sensitivity evaluation needs of dynamic data. Even advanced information security standards do not define a method for fully calculating data sensitivity.

Secondly, current research focuses on general big data, and dedicated research on ocean data is still at an early stage. Ocean data are multi-element, multi-type and multi-modal and carry spatio-temporal characteristics, so how to discover and identify sensitive ocean data still needs deep and systematic study. According to the Regulations on State Secrets and the Specific Scope of Their Classification Levels in Marine Work, marine secrets mainly comprise relevant marine survey results, schemes, important guidelines and policies, original data of sensitive areas, and the like. Internal schemes, countermeasures and materials are mostly text data, while original survey data are numerical. The text data are valuable secret information produced by processing the original data through human intellectual work.

Therefore, the key for preventing the leakage of the marine secret information is to identify and discover sensitive data and take targeted protection measures according to the characteristics of the identified and discovered sensitive data.

In the prior art, sensitive data identification mostly combines a dictionary matching method with manual identification. The main process is as follows: the user defines a sensitive data pattern matching formula; the dictionary matching range is determined according to a predefined model; the target is scanned by dictionary matching; after the scan, the matching results are filtered manually and the pattern matching formula is optimized.

The sensitive data dictionary matching method has the following defects: 1. the recognition accuracy is low, and the dictionary matching adopts a mode of patterned matching, so that the establishment of a data dictionary determines the recognition accuracy of sensitive data, and when the dictionary is incomplete or the dictionary is wrongly established, the problem of reduced recognition accuracy can occur; 2. the classification result is interfered, because dictionary matching is adopted, the same data information can be matched with a plurality of data dictionaries, and because the traditional data dictionaries are not subjected to weighting calculation, the interference of the classification result can be caused, and the inaccuracy of the classification result is caused.

The sensitive data manual identification method has the following defects: 1. the identification speed is low, and because a manual processing mode is adopted, the manual carding speed has a longer period relative to the machine identification speed when a large amount of data is faced, and the requirement on the professional quality of processing personnel is higher; 2. the judgment standards are not uniform, and the sensitive data identification process mainly depends on subjective judgment of people, so different people may have different judgment standards on the same data, and even the results identified by the same person at different times are still different, which may cause the difference of the sensitive data identification results.

Patent CN104933443A discloses a method for automatically recognizing and classifying sensitive data. It improves on the conventional dictionary matching and manual recognition methods and can recognize sensitive data when the data dictionary and matching rules are incomplete, but it still has many defects.

In the first step of that method, a training data set is preprocessed by word segmentation, a word set is obtained from it, and word weights are obtained through the TF-IDF algorithm to establish a basic corpus. Such a self-built corpus may not be comprehensive enough when the amount of training data is insufficient. In the second step, the basic corpus still has to be identified and classified manually, which is highly subjective: different people may apply different criteria to the same data, and even the same person may reach different results at different times, so the identification results for sensitive data can vary. In the third step, the category of the target data is obtained through weighted sorting of the sensitive words, but the sensitivity of the data is not quantified.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a security-requirement-oriented marine big data sensitivity evaluation system and a prevention method, which are used for realizing the identification and protection of marine data from various marine business systems with different data sensitivities and value intensity.

In order to solve the technical problems, the invention adopts the technical scheme that:

a marine big data sensitivity evaluation system facing privacy requirements comprises:

the data feature extraction module is used for extracting data features and establishing a sensitive feature library; the system is used for extracting data attribute characteristics from the marine big data target processing data;

the sensitive characteristic matching module is used for matching the target processing data characteristics in the ocean big data with the sensitive characteristic library and carrying out similarity analysis on the target processing data set and the original sensitive data set;

the sensitive characteristic quantification module is used for establishing a sensitivity evaluation model by taking the extracted sensitive characteristics as parameters and quantifying the characteristic weights by combining machine processing with manual review;

the data relevance analysis module is used for further analyzing and finding the sensitive data, obtaining a relevant data set of the data set identified through feature matching through relevance analysis, and forming a sensitive database by the matching data set and the relevant data set;

the data sensitivity measurement module is used for determining the sensitivity of the sensitive data from the data set found by the matching analysis through analysis calculation;

and the sensitive data protection module is used for protecting sensitive data in a targeted manner according to the sensitivity of the sensitive attribute obtained by analysis and quantification.

Further, the data feature extraction module is used for extracting data samples and metadata information from the structured marine secret original data, extracting keywords, subject words and associated words from the unstructured marine secret information document to establish a corpus, and performing feature extraction to form a sensitive feature library; and the method is used for extracting the data attribute characteristics of the marine big data target processing data and preparing for finding the sensitive data by matching with the sensitive characteristic library in the next step.

The precaution method by using the marine big data sensitivity evaluation system facing the confidentiality requirement comprises the following steps of:

step one, establishing a sensitive feature library;

step two, matching and analyzing a data set to be processed;

step three, constructing a sensitivity evaluation model;

step four, calculating attribute sensitivity;

step five, protecting sensitive data;

and step six, dynamically optimizing the sensitive feature library.

Further, in the first step, features are extracted from the generated secret information set by combining machine processing with manual review. The document metadata of the secret texts, the keywords, subject words and associated words of the text content, and the metadata of the original sensitive data sets are extracted. Candidate word weights are determined with a feature weight method based on word frequency and security level in order to select features, and the feature words are classified into different attribute categories to form a sensitive feature library.

The steps for extracting the sensitive features are as follows:

(1) classifying the sample data set;

(2) performing numerical semantic analysis;

(3) extracting keywords: performing word segmentation on each type of data set by using a word segmentation tool, counting word frequency, recording the type of the data set to which the word frequency belongs, determining candidate word weight by using a feature weight method based on the word frequency and the classification to construct a word weight function, and endowing different weights to candidate words from different classification types; the word frequency statistics in each type of samples uses a maximum value taking algorithm based on a TF-IDF algorithm to determine the weight of a candidate word; sorting the weights, taking the first n words as final feature words, performing semantic analysis on the feature words by combining with the knowledge of the ocean field, and expanding a sensitive feature word library to form a final sensitive feature library;

(4) classifying sensitive features: and analyzing the feature words in the sensitive feature library, and classifying the feature words into different attribute categories including but not limited to sensitive time, sensitive area and sensitive observation elements.

Further, in the second step, features of the data to be processed are first extracted from the marine big data target data set with a word-frequency-based feature extraction method, and a discretized feature vector method converts the data set into binary vectors. Similarity matching between the data set to be processed and the sample data set then finds the data sets with sensitive features and yields a matched data set. Relevance analysis on the matched data set finds its related data sets, and the related data sets and the matched data set together form the sensitive data set.

The matching analysis comprises the following specific processes:

a. constructing a sensitive attribute group, extracting the classified sensitive attribute categories according to the step one, and setting a sensitive attribute set as follows:

S＝{s₁，s₂，…，s_n}(n≥2)

b. binarization of a data set, namely converting an original data set into a binary data set, wherein all attribute values are only 0 or 1;

c. matching the extracted attribute features with the sensitive feature library, recording the hit as 1, otherwise recording the hit as 0, converting the data set into a binary data set, and accumulating the number of different values of all the attributes to obtain the dimension of the data set; after the original data set is binarized, counting the frequency of occurrence of "1" in each sensitive attribute value, and obtaining the characteristic frequency of the data set:

V_Ⅰ＝{v₁₁,v₁₂,...,v_nm}

Further, in the third step, each class of data in the sample set is converted into feature vectors, the center and the threshold of each class are determined, and the distance between the vector of the data to be processed and each class center is evaluated to determine the data sensitivity class; data within the threshold of a class belongs to that sensitivity class.

Further, in the fourth step, the sensitivity of each sensitive attribute s_i is calculated from the weights of the sensitive attribute values extracted in step one and the characteristic frequencies of the binary data set established in step two. The calculation combines the weighted TF-IDF value of the j-th attribute value of the i-th sensitive attribute with v_ij, the characteristic frequency of the j-th attribute value of the i-th sensitive attribute.
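A minimal sketch of how the step-four quantities could be combined is given below. It assumes that the sensitivity of an attribute is the sum, over its attribute values, of the weighted TF-IDF value multiplied by the characteristic frequency; that combining rule is an assumption made for illustration, and `attribute_sensitivity` and the sample numbers are hypothetical.

```python
def attribute_sensitivity(weights, frequencies):
    """Assumed rule: sum of (weighted TF-IDF value * characteristic frequency).

    weights[j]     -- weighted TF-IDF value of the j-th value of one attribute
    frequencies[j] -- characteristic frequency v_ij of the same attribute value
    """
    if len(weights) != len(frequencies):
        raise ValueError("one weight is needed per attribute value")
    return sum(w * v for w, v in zip(weights, frequencies))


# Hypothetical attribute with two values.
sensitivity = attribute_sensitivity([0.5, 0.25], [0.6, 0.4])
```

Attributes with a higher result would then be candidates for stricter desensitization in step five.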

Compared with the prior art, the invention has the advantages that:

(1) In step one, features are extracted automatically from existing secret information to form the sensitive feature library. This meets the confidentiality requirement, avoids the limitations of manually defined sensitive features, and improves efficiency and accuracy. In addition, the invention is not limited to the words themselves but also considers the various associated forms of sensitive words, which improves the recall of matching.

(2) In step two, features are extracted from the target data set and matched, which makes the operation simpler and identification faster. On top of the sensitive data found by matching, the relevance among data sets is considered, so the sensitive data set is further expanded and the identification result is more complete.

(3) In step three, a sensitivity evaluation method is constructed and the identified sensitive data is quantified. The class of the sensitive data is determined by comparing the feature vector position of the data to be processed with that of the secret data sets, which overcomes the inconsistent standards of manual identification.

(4) In step four, the sensitivity of each sensitive attribute is obtained by calculation, so a desensitization strategy can be applied in a targeted way and the data can then be shared openly.

(5) In step five, because the sensitivity of each sensitive attribute is known, sensitive data is desensitized in a targeted way, which makes the desensitization strategy more convenient and efficient.

(6) In step six, the real-time and dynamic nature of ocean data is fully considered: the sensitive feature library is dynamically updated and cyclically optimized, so that the identification and protection rules for sensitive data keep pace with actual requirements.

Drawings

FIG. 1 is a schematic flow chart of a prevention method according to the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific embodiments.

Example 1

This embodiment provides a confidentiality-requirement-oriented marine big data sensitivity evaluation system, which mainly comprises the following modules:

the data feature extraction module is used for extracting data samples and metadata information from the structured marine secret original data, extracting keywords, subject words and associated words from the unstructured marine secret information document by adopting a natural language processing correlation technology to establish a corpus, and performing feature extraction to form a sensitive feature library; the method is used for extracting data attribute characteristics of the marine big data target processing data and preparing for finding sensitive data by matching with a sensitive characteristic library in the next step.

And the sensitive characteristic matching module is used for matching the target processing data characteristics in the ocean big data with the sensitive characteristic library, performing similarity analysis on the target processing data set and the original sensitive data set, and finding out possible sensitive data sets from mass data.

The sensitive characteristic quantification module is used for establishing a sensitivity evaluation model by taking the extracted sensitive characteristics as parameters, and for quantifying the characteristic weights by combining machine processing with manual review, which facilitates accurate identification of sensitive data.

And the data relevance analysis module is used for further analyzing and finding the sensitive data, obtaining a relevant data set of the data set identified through feature matching through relevance analysis, and forming a sensitive database by the matching data set and the relevant data set.

And the data sensitivity measurement module is used for determining the sensitivity of the sensitive data from the data set found by the matching analysis through analysis and calculation, so that different protective measures can be taken for the data with different sensitivities in the next step.

Example 2

The precaution method based on the marine big data sensitivity evaluation system facing the confidentiality requirement in the embodiment 1 includes that sensitive data is identified, evaluated and protected, sensitive information hidden in big data is accurately found through identification of sensitive content, correlation analysis of the sensitive data and sensitivity evaluation of the sensitive data, an illegal user is effectively prevented from stealing the hidden sensitive information in the big data through data mining, data integration and data correlation analysis, and reliable protection of the sensitive information is achieved.

Referring to fig. 1, the method of the present embodiment includes the following steps:

step one, establishing a sensitive feature library.

Starting from the regulations on the scope of secrets in marine work, the secret information already produced, and internal business information, features are extracted from the secret information set by combining machine processing with manual review. Keyword extraction, topic modeling and similar techniques are used to extract the document metadata of secret texts, the keywords, subject words and associated words of the text content, and the metadata of original sensitive data sets. Candidate word weights are determined with a feature weight method based on word frequency and security level in order to select features, and the feature words are classified into different attribute categories to form the sensitive feature library.

Analysis of the Regulations on State Secrets and the Specific Scope of Their Classification Levels in Marine Work shows that marine secrets mainly comprise text information and original sensitive data. For unstructured text information, text feature items are selected through text analysis; for structured and semi-structured original sensitive data, sensitive attributes are extracted from the data description of the original sensitive data.

The steps for extracting the sensitive features are as follows:

(1) Classifying the sample data set. The sample data set includes: the document "Regulations on State Secrets and the Specific Scope of Their Classification Levels in Marine Work", the "Catalogue of State Secrets in Marine Work", secret information whose classification level has been determined, and the metadata of existing sensitive data, including the names, types and annotations of data tables and fields, such as data set names and data set abstracts. First, the items listed under the three classification levels in the Regulations are stored as documents and divided into three sample sets, namely the secret item names and remarks under the top secret, secret and confidential levels of the Catalogue. Classified information and original data are then placed in the document set matching their classification level. Internal data is also added to the sample data set according to business standards. Each sample is given a security class attribute, with '1', '2', '3' and '4' denoting top secret, secret, confidential and internal respectively, so that weights can be determined in the next step.

(2) Performing numerical semantic analysis. In original ocean data, items such as time, longitude and latitude mostly exist in numerical form, so they must be converted into text descriptions through semantic analysis before text keywords can be extracted. This matters most when defining sensitive areas and sensitive times, where numerical information must be linked to its text description. For example, a state secret has a specific confidentiality period, so a sensitive time range is determined from the classification date and the confidentiality period, and any time point within that range is marked as sensitive time. Military areas (including training areas) and sea areas not open to the public are sensitive areas; if the longitude and latitude of observation data fall within such an area, the data belongs to a sensitive area and the name of the area is added to the sensitive feature library.
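The conversion from numerical values to text labels described in this step might look like the following sketch. The rectangular region bounds, the region name and all function names are hypothetical; real sensitive areas would be defined by the applicable regulations and need not be rectangles.

```python
from datetime import date

# Hypothetical sensitive regions: (name, lat_min, lat_max, lon_min, lon_max).
SENSITIVE_REGIONS = [("training area A", 35.0, 36.0, 120.0, 121.5)]


def confidentiality_period(classified_on, years):
    """Return (start, end) of the sensitive time range."""
    try:
        end = classified_on.replace(year=classified_on.year + years)
    except ValueError:  # 29 February mapped onto a non-leap end year
        end = classified_on.replace(year=classified_on.year + years, day=28)
    return classified_on, end


def is_sensitive_time(day, classified_on, years):
    """True if the time point lies inside the confidentiality period."""
    start, end = confidentiality_period(classified_on, years)
    return start <= day <= end


def sensitive_region_name(lat, lon):
    """Return the name of the sensitive region containing the point, if any."""
    for name, lat_min, lat_max, lon_min, lon_max in SENSITIVE_REGIONS:
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return name
    return None
```

A region name returned by `sensitive_region_name` is the kind of text label that would be added to the sensitive feature library.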

(3) Extracting keywords.

With the oceanographic terms in national standards as the base corpus for word segmentation, a word segmentation tool segments each class of data set, word frequencies are counted, and the class of data set each word comes from is recorded. Candidate word weights are determined with a feature weight method based on word frequency and classification level: a word weight function is constructed that assigns different weights to candidate words from different classification levels, so that keywords from highly classified sample data stand out.

Word frequency statistics within each class of samples use a maximum-value rule based on the TF-IDF algorithm. A word a may appear in n documents; the TF-IDF values of a in those n documents are computed, and the maximum is taken as the TF-IDF value of a in the sample set:

$$x_i = \max\{tf_i \cdot idf_i\}$$

where the maximum runs over the n documents, $x_i$ is the TF-IDF value of the word a in the class-i samples, $tf_i$ is the frequency of the word a in a given document of the class-i samples, and $idf_i$ is the inverse document frequency in the class-i samples, which in its standard form is:

$$idf_i = \log\frac{N}{N(a)}$$

where N is the total number of texts in the sample library and N(a) is the number of texts in the corpus that contain the word a.
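The maximum-value rule can be sketched as follows. The sketch assumes the usual definitions, with term frequency taken as the relative frequency of the word in a document and inverse document frequency as log(N / N(a)); the function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def max_tfidf(word, documents):
    """Largest TF-IDF value of `word` over the documents of one sample class.

    documents -- list of token lists, one list per document
    """
    n_docs = len(documents)
    n_with_word = sum(1 for doc in documents if word in doc)
    if n_with_word == 0:
        return 0.0
    idf = math.log(n_docs / n_with_word)  # assumed standard IDF form
    tf_values = [Counter(doc)[word] / len(doc) for doc in documents if word in doc]
    return max(tf_values) * idf


# Hypothetical, already segmented documents of one class.
docs = [
    ["tide", "tide", "salinity", "depth"],
    ["tide", "depth", "current"],
    ["depth", "depth"],
]
x_tide = max_tfidf("tide", docs)
```

A word that occurs in every document of the class gets an IDF of zero under this form, so it never ranks as a feature word.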

Across the whole sample set, the security weights (M, f) are set according to the maximum confidentiality period of each security level: (top secret, 1/2), (secret, 1/3), (confidential, 1/6). The weighted average used to score the final feature words is:

$$\bar{x} = \sum_{i=1}^{3} x_i f_i$$

where $\bar{x}$ is the average weight value, $x_i$ is the TF-IDF value of the word a in each class of samples, $f_i$ is the security weight of each class, and i = 1, 2, 3 indexes the security level. The three weights sum to 1, so no normalization term is needed. (Because internal information has no definite confidentiality period, it is not weighted; keywords that do not appear in the first three sample sets enter the weight ranking with their $x_i$ values alone.)

The weights are sorted, and the first n words are taken as the final feature words. The feature words are then analyzed semantically using knowledge of the ocean domain, and the sensitive feature library is expanded with concept words, synonyms and associated words to reflect semantic-level requirements, forming the final sensitive feature word library.
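The weighting and ranking just described can be sketched as below, assuming the score of a word is the sum of its per-class TF-IDF values multiplied by the security weights 1/2, 1/3 and 1/6. The level labels used as dictionary keys, the function names and the sample scores are assumptions for illustration.

```python
# Security weights for the three classified levels, highest level first.
SECURITY_WEIGHTS = {"top secret": 1 / 2, "secret": 1 / 3, "confidential": 1 / 6}


def weighted_score(tfidf_by_level):
    """Weighted average of per-class TF-IDF values; absent classes count as 0."""
    return sum(
        weight * tfidf_by_level.get(level, 0.0)
        for level, weight in SECURITY_WEIGHTS.items()
    )


def top_feature_words(scores, n):
    """Return the n words with the highest weighted scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Hypothetical scores for three candidate words.
scores = {"tide": 0.4, "salinity": 0.2, "depth": 0.05}
features = top_feature_words(scores, 2)
```

The semantic expansion with concept words, synonyms and associated words is a separate, knowledge-driven step and is not covered by this sketch.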

(4) Classifying the sensitive features.

From experience, ocean data describe spatio-temporal processes, so every observation carries information such as an acquisition position and an observation period. Because information is time-sensitive, the time attribute of ocean data is also an important factor in data sensitivity. Under the Regulations on State Secrets and the Specific Scope of Their Classification Levels in Marine Work, observation indexes such as ocean currents, temperature and salinity, and tides in certain areas are state secrets, so observation indexes are a further factor in sensitivity. The feature words in the sensitive feature library are therefore analyzed and classified into attribute categories including, but not limited to, sensitive time, sensitive area and sensitive observation element. For example, if the extracted keywords include "Yellow Sea, East China Sea, salinity, tide", then "Yellow Sea" and "East China Sea" are placed in the "region" category, and "salinity" and "tide" in the "observation index" category.

Step two, matching and analyzing the data set to be processed.

First, features of the data to be processed are extracted from the marine big data target data set with a word-frequency-based feature extraction method, and a discretized feature vector method converts the data set into binary vectors. Similarity matching between the data set to be processed and the sample data set then finds the data sets with sensitive features, which reduces the amount of data to process and the computational cost and yields the matched data set. Relevance analysis on the matched data set finds its related data sets, and the related data sets and the matched data set together form the sensitive data set.

The specific process of matching analysis is as follows (referring to a discretized feature vector construction method):

(1) Preprocessing the attribute values. To simplify calculation of the feature vector, attributes with continuous values are divided into intervals by the MDL value-domain method (prior art, not described here) and converted into discrete attribute values. Values of the same attribute may come from different instruments, so the format and standard of the collected data may differ; regular expressions are therefore used for pattern matching. Taking the time attribute of ocean data as an example, several formats occur, such as "20181126", "2018/11/26" and "2018年11月26日"; data in these different formats can be matched and identified by setting pattern matching rules.
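The pattern matching rules for the time attribute could be sketched with regular expressions as follows. Three date layouts are covered: a compact eight-digit form, a slash-separated form, and a year-month-day form written with Chinese date characters; the last pattern and the ISO-style output format are assumptions, and `normalize_date` is a hypothetical helper.

```python
import re

# One pattern per supported layout; each captures year, month and day.
DATE_PATTERNS = [
    re.compile(r"^(\d{4})(\d{2})(\d{2})$"),            # 20181126
    re.compile(r"^(\d{4})/(\d{1,2})/(\d{1,2})$"),      # 2018/11/26
    re.compile(r"^(\d{4})年(\d{1,2})月(\d{1,2})日$"),  # 2018年11月26日 (assumed)
]


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no pattern matches."""
    for pattern in DATE_PATTERNS:
        match = pattern.match(text.strip())
        if match:
            year, month, day = (int(group) for group in match.groups())
            return f"{year:04d}-{month:02d}-{day:02d}"
    return None
```

Once all values of an attribute share one format, they can be compared and discretized consistently; the sketch does not validate that the month and day are in range.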

(2) Constructing a sensitive attribute group. From the sensitive attribute categories extracted and classified in step one, the sensitive attribute set is defined as:

S＝{s₁，s₂，…，s_n}(n≥2)

(3) Binarizing the data set. The original data set is converted into a binary data set in which every attribute value is either 0 or 1. The sensitive attribute of each instance is represented by a binary vector. Suppose the value set of an attribute is:

Vs₁＝{s₁₁，s₁₂}

where s₁₁ and s₁₂ are two different values of attribute s₁, as shown in Table (a):

Table (a) Sensitive attribute types and attribute values

Sensitive attribute	Different attribute values
s₁	s₁₁, s₁₂
s₂	s₂₁, s₂₂, s₂₃
…	…
s_n	s_n1, s_n2, …, s_nm

Then the attribute values of s₁ can be expressed as:

Vs₁＝{1，0}²

For example, if the sensitive attribute s₁ is the data use, divided into military data and civil data, then s₁₁ is military data, represented by the binary vector [1, 0], and s₁₂ is civil data, represented by the binary vector [0, 1].

And matching the attribute features extracted from the data set with the sensitive feature lexicon, recording the hit as 1, otherwise recording the hit as 0, converting the data set into a binary data set, and accumulating the number of different values of all the attributes to obtain the dimension of the data set. After the original data set is binarized, counting the frequency of occurrence of "1" in each sensitive attribute value, and obtaining the characteristic frequency of the data set:

V_Ⅰ＝{v₁₁,v₁₂,...,v_nm}

Assume that the binary form of a data set is as shown in Table (b):

Table (b) Binary data set

From it, V_Ⅰ＝{0.6, 0.4, 0.4, 0.4, 0.2, ...} can be calculated.
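Binarization and the characteristic frequency can be sketched as follows, following the military/civil example: each attribute value becomes one 0/1 column, and the characteristic frequency of a column is the share of records with a 1 in it. The five records below are hypothetical and were chosen only so that the result begins 0.6, 0.4, 0.4, 0.4, 0.2; the attribute names and function names are likewise assumptions.

```python
def binarize(records, attribute_values):
    """Encode each record as a 0/1 vector with one position per attribute value.

    attribute_values -- dict mapping attribute name to its ordered list of values
    """
    rows = []
    for record in records:
        row = []
        for attribute, values in attribute_values.items():
            row.extend(1 if record.get(attribute) == value else 0 for value in values)
        rows.append(row)
    return rows


def characteristic_frequency(rows):
    """Share of records with a 1 in each column of the binary data set."""
    return [sum(column) / len(rows) for column in zip(*rows)]


attribute_values = {
    "use": ["military", "civil"],
    "region": ["Yellow Sea", "East China Sea", "other"],
}
records = [
    {"use": "military", "region": "Yellow Sea"},
    {"use": "military", "region": "East China Sea"},
    {"use": "military", "region": "other"},
    {"use": "civil", "region": "Yellow Sea"},
    {"use": "civil", "region": "East China Sea"},
]
v = characteristic_frequency(binarize(records, attribute_values))
```

The length of each row, here 5, is the dimension of the data set, obtained by accumulating the number of distinct values of all attributes.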

(4) Determining similar data sets. Each data set can be feature-quantized by the method above to obtain a uniquely determined multi-dimensional vector, so judging the similarity of data sets becomes a comparison of the positions of vectors. Euclidean distance is used as the basis for comparing similarity: the distance between each data set in the target data and the existing sensitive data sets is calculated, and a data set is determined to be sensitive when the distance is below a given limit.
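A minimal sketch of the distance test follows. It treats a candidate as sensitive when its feature vector lies within the limit of at least one known sensitive vector; comparing against every known vector rather than a summary of them is an assumption, and the function names and numbers are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_similar(candidate, sensitive_vectors, limit):
    """True if the candidate is closer than `limit` to any sensitive vector."""
    return any(euclidean_distance(candidate, s) < limit for s in sensitive_vectors)


# Hypothetical characteristic-frequency vectors.
known_sensitive = [[0.5, 0.5]]
match = is_similar([0.6, 0.4], known_sensitive, limit=0.2)
```

The limit controls the trade-off between missed sensitive data sets and false matches, so it would need tuning on labeled samples.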

After the matched data set is obtained, relevance analysis finds the related data sets, that is, data sets that share no attributes with the sensitive data set but are associated with it; the two together form the sensitive data set.

Step three, constructing a sensitivity evaluation model.

The data sensitivity category can be determined by converting the data in the sample set into the characteristic vector, determining the center and the threshold value of each category, and judging the distance between the data to be processed and the category center.

The sensitivity evaluation procedure is as follows:

Definition 1 (marine data schema). The marine data schema is M＝<Q, S, R, C>, where Q is the quasi-identifier attribute set, S is the sensitive attribute set, R is the non-sensitive attribute set, and C is the set of sensitivity levels, with Q＝{q₁,…,q_m}, S＝{s₁,…,s_n}, R＝{r₁,…,r_p}, C＝{c₁,c₂,c₃,c₄,c₅}.

The sensitivity evaluation of marine data mainly considers the sensitive attributes of the data; the influence of other attributes on sensitivity is ignored.

Definition 2 (marine data sensitivity). Marine data sensitivity refers to the importance of a data file to national security, economic construction, and social life, and to the degree of harm caused by its leakage. It is divided into five levels: c₁: first-level sensitive; c₂: second-level sensitive; c₃: third-level sensitive; c₄: fourth-level sensitive; c₅: non-sensitive. The first three levels correspond to the three classification levels top secret, secret, and confidential, respectively; the fourth level corresponds to internal data, and the fifth level corresponds to public data.

Each class of data in the sample is converted into a binary data set using the conversion method of step two, giving a feature vector for each data set. The class center of each class, i.e., the central point of its set of feature vectors, is calculated, and a threshold is set for each class. The distance between the vector of the data set to be processed and each class center is then compared: if the distance is within the threshold of a class, the data set belongs to that sensitive class; if the distance exceeds the thresholds of all four sensitive classes, the data set belongs to the non-sensitive class.
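The class-center comparison can be sketched as follows. This is an illustrative sketch under assumptions: the class center is taken as the component-wise mean of the sample vectors, and when a vector falls within the thresholds of several classes the nearest center is chosen (the text does not specify a tie rule). The sample vectors and threshold values are hypothetical.

```python
import math


def class_center(vectors):
    """Component-wise mean of a class's feature vectors (assumed definition of the center)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def classify(vector, centers, thresholds, default="c5"):
    """Return the nearest class whose threshold contains the vector, else the non-sensitive class."""
    best_label, best_dist = default, math.inf
    for label, center in centers.items():
        d = math.dist(vector, center)
        if d <= thresholds[label] and d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Hypothetical two-dimensional samples for two of the sensitive classes.
samples = {
    "c1": [[1.0, 0.0], [0.8, 0.2]],
    "c2": [[0.0, 1.0], [0.2, 0.8]],
}
centers = {label: class_center(vs) for label, vs in samples.items()}
thresholds = {"c1": 0.3, "c2": 0.3}  # hypothetical threshold values

label_a = classify([0.85, 0.15], centers, thresholds)  # close to the c1 center
label_b = classify([0.5, 0.5], centers, thresholds)    # outside every threshold
```

In a full setting there would be one center and one threshold for each of c₁ to c₄, with c₅ returned only when all four thresholds are exceeded.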

Step four, calculating the attribute sensitivity.

Because different attributes of marine data have different sensitivities, the attribute sensitivities need to be calculated so that different desensitization strategies can be applied to attributes of different sensitivity.

The sensitivity of the sensitive attribute s_i is calculated from the weights of the sensitive attribute values extracted in step one and the characteristic frequencies of the binary data set established in step two:

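where the weight term is the weighted TF-IDF value of the j-th attribute value of the i-th sensitive attribute, and v_ij is the characteristic frequency of the j-th attribute value of the i-th sensitive attribute.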
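The formula itself is not reproduced in this text, so the following is only an illustrative sketch and not the formula of the invention. It assumes, hypothetically, that the sensitivity of s_i is the sum over its attribute values of the weight multiplied by the characteristic frequency v_ij. The function name and the numbers are made up for illustration.

```python
def attribute_sensitivity(weights, frequencies):
    """Assumed form: sum over attribute values j of weight_ij * v_ij."""
    if len(weights) != len(frequencies):
        raise ValueError("one weight is required per attribute value")
    return sum(w * v for w, v in zip(weights, frequencies))


# Hypothetical weights (e.g. weighted TF-IDF values) and characteristic
# frequencies for the two values of one sensitive attribute.
weights_s1 = [0.7, 0.3]
frequencies_s1 = [0.6, 0.4]

sensitivity_s1 = attribute_sensitivity(weights_s1, frequencies_s1)
```

Under this assumption an attribute scores higher when its heavily weighted values occur frequently in the data set.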

Step five, protecting the sensitive data.

Based on the sensitivity of each sensitive attribute obtained through analysis and quantification, targeted protective measures are taken. The sensitive attributes are sorted by sensitivity, and desensitization strategies of different strengths are set for attributes of different sensitivity.
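The sorting and strategy assignment can be sketched as follows. The strategy names ("suppress", "generalize", "keep"), the score floors, and the attribute scores are all hypothetical; the text does not name specific desensitization strategies.

```python
def assign_strategies(sensitivities, tiers):
    """Sort attributes by sensitivity (descending) and pick the first tier whose floor is met.

    sensitivities: attribute name -> sensitivity score
    tiers: list of (minimum score, strategy name), ordered from strongest to weakest
    """
    ranked = sorted(sensitivities.items(), key=lambda kv: kv[1], reverse=True)
    plan = []
    for attr, score in ranked:
        strategy = next((name for floor, name in tiers if score >= floor), "none")
        plan.append((attr, score, strategy))
    return plan


# Hypothetical tiers and attribute sensitivities.
tiers = [(0.6, "suppress"), (0.3, "generalize"), (0.0, "keep")]
sensitivities = {"time": 0.35, "region": 0.72, "element": 0.10}

plan = assign_strategies(sensitivities, tiers)
```

Keeping the tiers in a separate table means the strength of protection can be adjusted without changing the ranking logic.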

Step six, dynamically optimizing the sensitive feature library.

Because marine data is generated in real time and changes dynamically, the sensitive feature library also needs to be continuously updated and optimized. Features are dynamically extracted from the data sets that have been determined to be sensitive, and the sensitive feature library is continuously optimized. In step one, the existing secret information samples are limited, so the extracted sensitive features are limited as well. After sensitive data has been identified several times, the determined sensitive data sets and the original secret information can jointly form the feature library sample, which enlarges the sample size. The feature extraction of step one is then repeated on the determined sensitive data sets, so that the sensitive feature library is continuously supplemented and completed.
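The update loop of step six can be sketched as follows. This is a minimal sketch under assumptions: the feature library is modeled as a mapping from attribute category to a set of feature words, and `toy_extractor` is a hypothetical stand-in for the feature extraction of step one.

```python
def update_feature_library(library, confirmed_datasets, extract_features):
    """Merge features extracted from newly confirmed sensitive data sets into the library.

    Returns the number of features that were not already in the library.
    """
    added = 0
    for dataset in confirmed_datasets:
        for category, word in extract_features(dataset):
            words = library.setdefault(category, set())
            if word not in words:
                words.add(word)
                added += 1
    return added


def toy_extractor(dataset):
    """Hypothetical extractor: each data set is already a list of (category, word) pairs."""
    return dataset


# Hypothetical initial library and newly confirmed sensitive data sets.
library = {"sensitive_area": {"area_a"}}
confirmed = [
    [("sensitive_area", "area_b"), ("sensitive_time", "2019-01")],
    [("sensitive_area", "area_a")],
]

added = update_feature_library(library, confirmed, toy_extractor)
```

Running the same call after each identification round keeps the library growing with the confirmed sensitive data while ignoring features it already holds.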

In summary, the key points of the technical scheme of the invention are as follows:

(1) Targeted at marine data

Starting from the characteristics of marine big data, a sensitivity evaluation model system and a prevention method are designed specifically for marine big data.

(2) Reverse thinking

To meet the confidentiality requirement, the secret information set that has already been produced is used as the original sample set for feature extraction, a sensitive data feature library is derived in reverse, the sensitivity grading of the original data is established on this basis, and a strategy for raising the blocking value, i.e., the prevention method, is generated.

(3) Dynamic circulation

The sensitive feature library is updated in real time to match the real-time nature of marine data, and features are extracted cyclically from the original secret information set and the identified sensitive data to refine the sensitive feature library.

It is understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art should understand that they can make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.

Claims

1. A marine big data sensitivity evaluation system facing privacy requirements is characterized by comprising:

the data feature extraction module is used for extracting data features and establishing a sensitive feature library, and for extracting data attribute features from the marine big data target processing data; the data feature extraction module extracts data samples and metadata information from structured marine secret original data, extracts keywords, subject words and associated words from unstructured marine secret information documents to establish a corpus, performs feature extraction, determines candidate word weights by a feature weight method based on word frequency and confidentiality to perform feature selection, and classifies the feature words into different attribute categories to form the sensitive feature library; the module also extracts data attribute features from the marine big data target processing data, in preparation for the next step of finding sensitive data by matching against the sensitive feature library;

the specific steps for extracting the sensitive features are as follows:

(1) classifying the sample data set;

(2) performing numerical semantic analysis;

(4) classifying sensitive features: analyzing the feature words in the sensitive feature library, and classifying the feature words into different attribute categories including but not limited to sensitive time, sensitive areas and sensitive observation elements;

the sensitive characteristic quantification module is used for establishing a sensitivity evaluation model by taking the extracted sensitive characteristics as parameters and quantifying the characteristic weight;

the data sensitivity measurement module is used for determining the sensitivity of the sensitive data from the data set found by the matching analysis through analysis calculation; the specific process is as follows:

a. constructing a sensitive attribute group, extracting the classified sensitive attribute categories according to a data feature extraction module, and setting a sensitive attribute set as follows:

S＝{s₁，s₂，…，s_n}(n≥2)

V_Ⅰ＝{v₁₁,v₁₂,...,v_nm}

finally, according to the extracted weights of the sensitive attribute values and the characteristic frequencies of the data set, the sensitivity of the sensitive attribute s_i can be calculated:

wherein the weight term is the weighted TF-IDF value of the j-th attribute value of the i-th sensitive attribute, and v_ij is the characteristic frequency of the j-th attribute value of the i-th sensitive attribute;

2. A precautionary method using the marine big data sensitivity evaluation system facing privacy requirements of claim 1, comprising the steps of:

step one, establishing a sensitive feature library;

step two, matching and analyzing a data set to be processed;

step three, constructing a sensitivity evaluation model;

step four, calculating attribute sensitivity;

step five, protecting sensitive data;

and step six, dynamically optimizing the sensitive feature library.

3. The precaution method according to claim 2, characterized in that in step two, a feature extraction method based on word frequency is first used to extract the features of the data to be processed from the marine big data target processing data set, a discretized feature vector method is used to convert the data set into binary vectors, and similarity matching is performed between the data set to be processed and the sample data set, so that a data set with sensitive features is found and a matched data set is obtained; relevance analysis is then performed on the matched data set to find its associated data set, and the associated data set and the matched data set jointly form the sensitive data set.

4. The precaution method according to claim 2, characterized in that in step three, the data sensitivity category is determined by converting each class of data in the sample set into feature vectors, determining the center and the threshold of each class of data set, and measuring the distance between the vector of the data to be processed and each class center; if the distance is within the threshold of a class, the data belongs to that sensitivity class.