CN113742781B

CN113742781B - K anonymous clustering privacy protection method, system, computer equipment and terminal

Info

Publication number: CN113742781B
Application number: CN202111123601.8A
Authority: CN
Inventors: 吴珺; 朱嘉辉; 王春枝; 董佳明; 周显敬; 刘虎; 李天意; 朱天亮
Original assignee: Wuhan Zhuoer Information Technology Co ltd; Hubei University of Technology
Current assignee: Wuhan Zhuoer Information Technology Co ltd; Hubei University of Technology
Priority date: 2021-09-24
Filing date: 2021-09-24
Publication date: 2024-04-05
Anticipated expiration: 2041-09-24
Also published as: CN113742781A

Abstract

The invention belongs to the technical field of information security and discloses a K anonymous clustering privacy protection method, system, computer equipment and terminal. The method comprises the following steps: reducing the dimension of the data by principal component analysis and determining the sensitive attribute, quasi-identifier attributes and identifier attributes; calculating the degree of association between the sensitive attribute and the quasi-identifier attributes by applying grey relational analysis to the dimension-reduced data; determining the generalization hierarchy of each quasi-identifier according to its degree of association with the sensitive attribute; determining the number of clusters suitable for the data set by the elbow method; judging, according to a threshold a, whether data are clustered directly or merged with other data; clustering the data set; and K-anonymizing the clustered data according to the generalization hierarchy of the quasi-identifier attributes. The invention can reduce the dimension of medical data, avoid falling into a local optimum during clustering, reduce the information loss rate of K-anonymization, and protect the security of private data.

Description

K anonymous clustering privacy protection method, system, computer equipment and terminal

Technical Field

The invention belongs to the technical field of information security, and particularly relates to a K anonymous clustering privacy protection method, a K anonymous clustering privacy protection system, computer equipment and a K anonymous clustering privacy protection terminal.

Background

Currently, with the development of medical technology, medical data sharing is becoming more and more common, and the problem of medical data leakage is becoming more serious. The privacy protection problem is an important direction in the field of information security, and how to guarantee the security of information is a key for realizing personal privacy protection.

Early data privacy protection mainly set different permissions in a database and protected individual privacy according to those permissions. However, some individuals with high permissions sell personal information to others for profit, so personal information is still leaked. As the concept of privacy protection has taken shape, people pay more attention to it, and privacy protection technology is needed to strengthen the protection of private information.

The K-anonymity privacy protection model protects information during data release. Unlike privacy protection based on access control and similar mechanisms, it preprocesses the original data and then releases an anonymized data set, thereby protecting personal private data. It is applicable to fields such as medical treatment and job hunting, in which obvious personal information needs to be hidden: an attacker cannot deduce specific personal private data from the released data through a link attack, so private data is effectively protected during data release. However, the traditional K-anonymity model mostly improves the strength of privacy protection at the expense of information loss. Therefore, a new K anonymous clustering privacy protection method and system are needed to remedy the problems existing in the prior art.

Through the above analysis, the problems and defects of the prior art are as follows: the traditional K-anonymity model mostly improves the strength of privacy protection at the expense of information loss. In addition, the dimension of the data to be K-anonymized is too large, which increases the time cost of processing the data, and K-anonymizing the data over all dimensions causes more data loss.

The difficulty of solving the above problems and defects lies in effectively reducing the dimension of the data set and effectively reducing the information loss of the data during K-anonymization.

The significance of solving the above problems and defects is as follows: data dimension reduction reduces the time cost of processing the data, and reducing the information loss during K-anonymization preserves more of the original character of the data, which supports subsequent data analysis work.

Disclosure of Invention

Aiming at the problems existing in the prior art, the invention provides a K anonymous clustering privacy protection method and system, computer equipment and a terminal, and particularly relates to a K anonymous clustering privacy protection method and system based on medical data utility.

The invention is realized in such a way that the K anonymous clustering privacy protection method comprises the following steps:

reducing the dimension of the data by principal component analysis and determining the sensitive attribute, quasi-identifier attributes and identifier attributes; calculating the degree of association between the sensitive attribute and the quasi-identifier attributes by applying grey relational analysis to the dimension-reduced data; determining the generalization hierarchy of each quasi-identifier according to its degree of association with the sensitive attribute; determining the number of clusters suitable for the data set by the elbow method; judging, according to a threshold a, whether data are clustered directly or merged with other data; clustering the data set; and K-anonymizing the clustered data according to the generalization hierarchy of the quasi-identifier attributes.

Further, the K anonymous cluster privacy protection method comprises the following steps:

step one, reducing the dimension of a medical data set T according to a principal component analysis method;

step two, determining the degree of association between each quasi-identifier and the sensitive attribute by grey relational analysis;

step three, determining the generalization hierarchy of each quasi-identifier attribute according to the degree of association between the quasi-identifier and the sensitive attribute;

step four, determining the optimal number of clusters for the data by the elbow method, based on the selected identifiers, quasi-identifiers and sensitive attribute;

step five, clustering the data set with the optimal cluster number L as the number of clusters;

step six, performing K-anonymization with a given value a as the threshold, placing the records in the data set that already satisfy K-anonymity into a K-anonymous table, and counting the number of records in each T_m.

Further, in step one, reducing the dimension of the medical data set T by principal component analysis includes:

(1) The principal components that may be present are expressed as:

$$Z_i = c_{i1}x_1 + c_{i2}x_2 + \cdots + c_{ip}x_p, \quad i = 1, 2, \ldots, q$$

wherein p denotes the dimension of the attributes in each record, c_ij denotes the weight (load value) of attribute x_j in principal component Z_i, Z denotes a principal component, q denotes the number of principal components that may be present, and the principal components are mutually independent; Z_1, Z_2, …, Z_q are formed from the different quasi-identifiers x_1, x_2, …, x_p.

(2) According to the load values c_ij, the principal component with the smallest attribute dimension is selected from the set of principal components, suitable QI (quasi-identifier) attributes are selected from it, and the identifiers, quasi-identifiers and sensitive attribute are determined.

Further, in step two, determining the degree of association between the quasi-identifiers and the sensitive attribute by grey relational analysis includes:

(1) The sensitive attribute is taken as the reference sequence, expressed as:

Y = {Y(k) | k = 1, 2, ..., n};

wherein Y is a specific sensitive attribute.

(2) The attributes whose degree of association with the sensitive attribute is to be determined are taken as the comparison sequences, expressed as:

X_i = {X_i(k) | k = 1, 2, ..., n}, i = 1, 2, ..., m;

wherein X_i(k) represents the kth value in the ith comparison sequence, and m represents the number of QI attributes.

(3) The measurement units of different data are different, and the data are normalized by the following formula:

(4) After normalization, the grey relational coefficient between each quasi-identifier attribute and the sensitive attribute is calculated by the following formula:

$$\xi_i(k) = \frac{\min_i \min_k |y(k) - x_i(k)| + \rho \max_i \max_k |y(k) - x_i(k)|}{|y(k) - x_i(k)| + \rho \max_i \max_k |y(k) - x_i(k)|}$$

wherein ξ_i(k) is the grey relational coefficient, |y(k) - x_i(k)| is the distance between the kth value of the reference sequence and the corresponding kth value of the ith comparison sequence, max denotes the maximum distance, and min denotes the minimum distance; ρ is called the resolution coefficient and takes values in the interval (0, 1); the resolution is higher when ρ ≤ 0.5463, and ρ = 0.5 is taken.

(5) The degree of association is determined from the relational coefficients at all positions k by the following formula:

$$r_i = \frac{1}{n} \sum_{k=1}^{n} \xi_i(k)$$

wherein r_i is the degree of association; the closer r_i is to 1, the stronger the association between the quasi-identifier attribute and the sensitive attribute.
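The grey relational analysis steps above can be illustrated with a minimal sketch. The normalization formula is not reproduced in the text, so min-max scaling is used here purely as an assumption; the function names and the sample sequences are hypothetical.

```python
def min_max(seq):
    """Scale a sequence to [0, 1]. Min-max scaling is an assumption, not the source's formula."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0 for _ in seq]
    return [(v - lo) / (hi - lo) for v in seq]


def grey_relational_degrees(reference, comparisons, rho=0.5):
    """Return one degree of association r_i per comparison sequence X_i."""
    y = min_max(reference)
    xs = [min_max(x) for x in comparisons]
    # Distances |y(k) - x_i(k)| for every sequence i and position k.
    deltas = [[abs(yk - xk) for yk, xk in zip(y, x)] for x in xs]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0:
        return [1.0 for _ in xs]  # every comparison sequence equals the reference
    degrees = []
    for row in deltas:
        # Grey relational coefficient at each position k.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        # Degree of association: mean of the coefficients.
        degrees.append(sum(coeffs) / len(coeffs))
    return degrees


# Hypothetical data: a sensitive attribute Y and two quasi-identifier sequences.
sensitive = [1.0, 2.0, 3.0, 4.0, 5.0]
quasi_identifiers = [
    [10.0, 20.0, 30.0, 40.0, 50.0],  # moves exactly with Y
    [5.0, 1.0, 4.0, 2.0, 3.0],       # unrelated ordering
]
degrees = grey_relational_degrees(sensitive, quasi_identifiers)
print(degrees)
```

In this sketch the first quasi-identifier tracks the reference exactly and so receives the maximum degree, while the second receives a lower one; a real data set would use the normalization specified by the method.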

In step three, the higher the degree of association, the stronger the association between the data and the finer the generalization hierarchy of the quasi-identifier should be; for a quasi-identifier with a low degree of association, the generalization hierarchy is relatively coarse. The generalization hierarchy of each quasi-identifier is determined accordingly.

Further, in step four, determining the optimal number of clusters for the data by the elbow method, based on the selected identifiers, quasi-identifiers and sensitive attribute, includes:

(1) A cluster number range m is given for the data set T, and the data set is partitioned for each cluster number in the range, starting from a cluster number of 2; the Euclidean distance from the centroid of each cluster to each data point is calculated by the following formula:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

wherein x_i and y_i are the values of the two data points in the ith dimension and n is the number of dimensions; the data points are assigned to clusters by Euclidean distance according to the nearest-centroid principle.

(2) Based on the partition, the SSE (sum of squared errors) of each cluster is calculated, and the current number of clusters and the total sum of squared errors are plotted as coordinates; the SSE is calculated by the following formula:

$$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

wherein k is the current number of clusters, C_i represents the ith cluster, p represents a sample point in C_i, and m_i represents the mean of all samples in C_i; the optimal cluster number L is determined from the elbow plot of the medical data set T.
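As a hedged illustration of the elbow procedure, the sketch below computes the SSE for each cluster number from 2 upward and picks the bend of the curve. The source reads the elbow from a plot; the second-difference rule in `elbow`, the farthest-point seeding in `k_means`, and the sample points are all assumptions made for this example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]


def k_means(points, k, iterations=50):
    # Deterministic farthest-point seeding (assumed; the source does not specify seeding).
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        centers.append(list(far))
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # nearest-centroid assignment
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters


def sse(clusters):
    total = 0.0
    for c in clusters:
        if c:
            m = centroid(c)
            total += sum(euclidean(p, m) ** 2 for p in c)
    return total


def sse_curve(points, max_k):
    return {k: sse(k_means(points, k)) for k in range(2, max_k + 1)}


def elbow(curve):
    # Heuristic stand-in for reading the plot: largest second difference of the SSE curve.
    ks = sorted(curve)
    return max(ks[1:-1], key=lambda k: curve[k - 1] - 2 * curve[k] + curve[k + 1])


# Hypothetical 2-D data with three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (30, 0), (30, 1), (31, 0), (31, 1)]
curve = sse_curve(points, 5)
best_k = elbow(curve)
print(curve, best_k)
```

The SSE drops sharply up to the true number of groups and flattens afterwards, which is the bend the elbow method looks for.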

In step five, clustering the data set with the optimal cluster number L as the number of clusters includes:

(1) All data are put into a queue {d_1} as one cluster; mean clustering with cluster number m = 2 is performed on this cluster, the SSE of each resulting cluster is calculated, and the resulting clusters are put into the queue, giving {d_1, d_2, d_3}.

(2) The cluster with the minimum SSE is selected from the queue and split by m = 2 mean clustering, the resulting clusters are put into the queue, and step (1) is repeated until the number of clusters is greater than L.

(3) Through the above clustering steps, the medical data set T is divided into m data sets (T_1, T_2, …, T_m).
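The queue-based splitting described above resembles bisecting k-means, and a minimal sketch follows. It is written under several assumptions: the parent cluster is replaced by its two halves, splitting stops once L clusters exist, and seeding uses the first point and the point farthest from it. The text selects the cluster with the minimum SSE, which is the default here; the hypothetical `choose` parameter also allows the largest-SSE rule used by conventional bisecting k-means for comparison.

```python
def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points):
    m = mean(points)
    return sum(sq_dist(p, m) for p in points)


def two_means(points, iterations=20):
    """Split one cluster into two by 2-means (m = 2 mean clustering)."""
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))  # assumed seeding
    left, right = list(points), []
    for _ in range(iterations):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not right:
            break
        n0, n1 = mean(left), mean(right)
        if (n0, n1) == (c0, c1):
            break
        c0, c1 = n0, n1
    return left, right


def bisecting_clusters(points, L, choose=min):
    """Repeatedly split the cluster picked by `choose` (by SSE) until L clusters exist."""
    queue = [list(points)]
    while len(queue) < L:
        splittable = [c for c in queue if len(c) > 1]
        if not splittable:
            break
        target = choose(splittable, key=sse)
        left, right = two_means(target)
        if not right:  # all points identical, nothing to split
            break
        queue.remove(target)
        queue.extend([left, right])
    return queue


# Hypothetical 1-D data with three natural groups.
data = [(0,), (1,), (10,), (11,), (30,), (31,)]
by_min = bisecting_clusters(data, 3)              # rule stated in the text
by_max = bisecting_clusters(data, 3, choose=max)  # conventional bisecting k-means rule
print(sorted(by_min), sorted(by_max))
```

On this toy input the two selection rules give different partitions, so the choice of rule matters and should follow the intended method.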

In step six, performing K-anonymization with the given value a as the threshold, placing the records in the data set that already satisfy K-anonymity into the K-anonymous table, and counting the number of records in each T_m includes:

(1) In T_m, the quasi-identifier attribute A with the largest number of distinct values and the highest degree of association is found, and its generalization level is raised by one layer according to the generalization hierarchy.

(2) The records in the current T_m that satisfy the K-anonymity rule and those that do not are counted.

(3) The records in T_m that satisfy the K-anonymity rule are placed in the K-anonymous table, and step (1) is repeated for the records that do not, until the number of records remaining in T_m is less than K.

(4) After each data set has been K-anonymized, the remaining records, whose number is less than the threshold a, are merged into a new data table T_s, which is K-anonymized according to step (1).
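The generalize-and-check loop for a single cluster T_m can be sketched as follows. The hierarchies, the association degrees, the attribute names and the records are hypothetical, and the tie-breaking rule (distinct-value count first, then association degree) is one possible reading of step (1), not a confirmed detail of the method.

```python
from collections import Counter

# Hypothetical generalization hierarchies: one function per level, level 0 is the raw value.
HIERARCHIES = {
    "age": [lambda v: str(v),
            lambda v: f"{v // 10 * 10}-{v // 10 * 10 + 9}",
            lambda v: "*"],
    "zip": [lambda v: v, lambda v: v[:3] + "**", lambda v: "*"],
}
# Hypothetical degrees of association with the sensitive attribute.
ASSOCIATION = {"age": 0.8, "zip": 0.6}


def view(record, levels):
    """Quasi-identifier values of a record at the current generalization levels."""
    return tuple(HIERARCHIES[a][levels[a]](record[a]) for a in sorted(levels))


def anonymize_cluster(records, k, sensitive="disease"):
    levels = {a: 0 for a in HIERARCHIES}
    released, remaining = [], list(records)
    while True:
        # Count records that share each quasi-identifier combination.
        counts = Counter(view(r, levels) for r in remaining)
        for r in remaining:
            if counts[view(r, levels)] >= k:  # satisfies the K-anonymity rule
                row = dict(zip(sorted(levels), view(r, levels)))
                row[sensitive] = r[sensitive]
                released.append(row)
        remaining = [r for r in remaining if counts[view(r, levels)] < k]
        raisable = [a for a in levels if levels[a] < len(HIERARCHIES[a]) - 1]
        if len(remaining) < k or not raisable:
            break
        # Attribute with the most distinct values, ties broken by association degree.
        attr = max(raisable, key=lambda a: (
            len({HIERARCHIES[a][levels[a]](r[a]) for r in remaining}), ASSOCIATION[a]))
        levels[attr] += 1  # raise its generalization level by one layer
    return released, remaining


cluster = [
    {"age": 34, "zip": "43001", "disease": "flu"},
    {"age": 36, "zip": "43002", "disease": "cold"},
    {"age": 52, "zip": "43055", "disease": "flu"},
    {"age": 57, "zip": "43056", "disease": "asthma"},
    {"age": 71, "zip": "99001", "disease": "flu"},
]
released, remaining = anonymize_cluster(cluster, k=2)
print(released)
print(remaining)
```

The records left in `remaining` correspond to the residual records that step (4) would merge into T_s together with the residuals of the other clusters.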

Another object of the present invention is to provide a K-anonymous clustering privacy protection system applying the K-anonymous clustering privacy protection method, where the K-anonymous clustering privacy protection system includes:

the data dimension reduction module is used for reducing dimension of the medical data set T according to the principal component analysis method;

the association degree determining module is used for determining the degree of association between each quasi-identifier and the sensitive attribute by grey relational analysis;

the generalization level determining module is used for determining the generalization level of the quasi-identifier attribute according to the association degree of the quasi-identifier and the sensitive attribute;

an optimal cluster number determining module for determining the number of optimal clusters of data according to the selected identifier, the quasi identifier and the sensitive attribute and according to an elbow method;

the data set clustering module is used for clustering the data set with the optimal cluster number L as the number of clusters;

the K anonymization module is used for performing K-anonymization with a given value a as the threshold, placing the records in the data set that already satisfy K-anonymity into a K-anonymous table, and counting the number of records in each T_m.

It is a further object of the present invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the K anonymous clustering privacy protection method described above.

The invention further aims to provide an information data processing terminal which is used for realizing the K anonymous clustering privacy protection system.

Combining all the above technical schemes, the advantages and positive effects of the invention are as follows: the K anonymous clustering privacy protection method provided by the invention can reduce the dimension of medical data, avoid falling into a local optimum in the clustering process, and reduce the information loss rate of the K-anonymization process. The invention can also effectively reduce the risk of data leakage, mitigate homogeneity attacks and protect private data.

According to the invention, the dimension of the medical data is reduced by principal component analysis. To avoid falling into a local optimum during clustering, the clustering algorithm repeatedly selects, in the bisecting mean clustering process, the data set with the smallest sum of squared errors for further clustering, so as to achieve optimal processing of the global data. To reduce the overall information loss rate of K-anonymization, grey relational analysis is used so that the degree of association controls the generalization hierarchy of each quasi-identifier, and data sets that do not meet the K-anonymity threshold are merged with other data sets that do not meet it before being K-anonymized, which reduces the information loss rate.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a K anonymous cluster privacy protection method provided by an embodiment of the invention.

Fig. 2 is a schematic diagram of a K anonymous clustering privacy protection method provided by an embodiment of the invention.

FIG. 3 is a block diagram of a K anonymous cluster privacy protection system provided by an embodiment of the invention;

in the figure: 1. data dimension reduction module; 2. association degree determining module; 3. generalization hierarchy determining module; 4. optimal cluster number determining module; 5. data set clustering module; 6. K anonymization module.

Fig. 4 is a flow chart of principal component analysis provided by an embodiment of the present invention.

Fig. 5 is a flowchart of grey relational analysis provided in an embodiment of the present invention.

Fig. 6 is a generalized hierarchical structure diagram provided by an embodiment of the present invention.

FIG. 7 is a flowchart of an elbow method according to an embodiment of the present invention.

Fig. 8 is a flowchart of a clustering method provided by an embodiment of the present invention.

Fig. 9 is a flowchart of K anonymization provided by an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Aiming at the problems existing in the prior art, the invention provides a K anonymous clustering privacy protection method and a K anonymous clustering privacy protection system, and the K anonymous clustering privacy protection method and the K anonymous clustering privacy protection system are described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the K anonymous clustering privacy protection method provided by the embodiment of the invention includes the following steps:

S101, reducing the dimension of a medical data set T by principal component analysis;

S102, determining the degree of association between each quasi-identifier and the sensitive attribute by grey relational analysis;

S103, determining the generalization hierarchy of each quasi-identifier attribute according to the degree of association between the quasi-identifier and the sensitive attribute;

S104, determining the optimal number of clusters for the data by the elbow method, based on the selected identifiers, quasi-identifiers and sensitive attribute;

S105, clustering the data set with the optimal cluster number L as the number of clusters;

S106, performing K-anonymization with a given value a as the threshold, placing the records in the data set that already satisfy K-anonymity into a K-anonymous table, and counting the number of records in each T_m.

The schematic diagram of the K anonymous clustering privacy protection method provided by the embodiment of the invention is shown in figure 2.

As shown in fig. 3, the K anonymous cluster privacy protection system provided by the embodiment of the invention includes:

the data dimension reduction module 1, used for reducing the dimension of the medical data set T by principal component analysis;

the association degree determining module 2, used for determining the degree of association between each quasi-identifier and the sensitive attribute by grey relational analysis;

the generalization hierarchy determining module 3, used for determining the generalization hierarchy of each quasi-identifier attribute according to the degree of association between the quasi-identifier and the sensitive attribute;

the optimal cluster number determining module 4, used for determining the optimal number of clusters for the data by the elbow method, based on the selected identifiers, quasi-identifiers and sensitive attribute;

the data set clustering module 5, used for clustering the data set with the optimal cluster number L as the number of clusters;

the K anonymization module 6, used for performing K-anonymization with a given value a as the threshold, placing the records in the data set that already satisfy K-anonymity into a K-anonymous table, and counting the number of records in each T_m.

The technical scheme of the invention is further described below with reference to specific embodiments.

Example 1

The K anonymous clustering algorithm based on medical data provided by the embodiment of the invention comprises the following steps: (1) reducing the dimension of the data by principal component analysis and determining the sensitive attribute, quasi-identifier attributes and identifier attributes; (2) calculating the degree of association between the sensitive attribute and the quasi-identifier attributes by applying grey relational analysis to the dimension-reduced data; (3) determining the generalization hierarchy of each quasi-identifier according to its degree of association with the sensitive attribute; (4) determining the number of clusters suitable for the data set by the elbow method; (5) judging, according to the threshold a, whether data are clustered directly or merged with other data; (6) clustering the data set; (7) K-anonymizing the clustered data according to the generalization hierarchy of the quasi-identifier attributes. After this processing, the risk of data leakage is effectively reduced, homogeneity attacks are mitigated, and private data is protected.

Specifically, the K anonymous clustering privacy protection method based on medical data utility provided by the embodiment of the invention comprises the following steps:

Step 1: Reduce the dimension of the medical data set T by principal component analysis.

Step 1.1: The principal components that may be present are expressed as:

$$Z_i = c_{i1}x_1 + c_{i2}x_2 + \cdots + c_{ip}x_p, \quad i = 1, 2, \ldots, q$$

wherein p denotes the dimension of the attributes in each record, c_ij denotes the weight (load value) of attribute x_j in principal component Z_i, Z denotes a principal component, q denotes the number of principal components that may be present, and the principal components are mutually independent. Z_1, Z_2, …, Z_q are formed from the different quasi-identifiers x_1, x_2, …, x_p.

Step 1.2: According to the load values c_ij, the principal component with the smallest attribute dimension is selected from the set of principal components, suitable QI (quasi-identifier) attributes are selected from it, and the identifiers, quasi-identifiers and sensitive attribute are determined.

Step 2: Determine the degree of association between each quasi-identifier and the sensitive attribute by grey relational analysis.

Step 2.1: Take the sensitive attribute as the reference sequence Y = {Y(k) | k = 1, 2, ..., n}.

Step 2.2: The attributes whose degree of association with the sensitive attribute is to be determined are taken as the comparison sequences, expressed as X_i = {X_i(k) | k = 1, 2, ..., n}, i = 1, 2, ..., m, wherein X_i(k) represents the kth value in the ith comparison sequence, and m represents the number of QI attributes.

Step 2.3: the measurement units of different data are different, and the data are normalized by the following formula:

Step 2.4: After normalization, the grey relational coefficient between each quasi-identifier attribute and the sensitive attribute is calculated by the following formula:

$$\xi_i(k) = \frac{\min_i \min_k |y(k) - x_i(k)| + \rho \max_i \max_k |y(k) - x_i(k)|}{|y(k) - x_i(k)| + \rho \max_i \max_k |y(k) - x_i(k)|}$$

wherein ξ_i(k) is the grey relational coefficient, |y(k) - x_i(k)| is the distance between the kth value of the reference sequence and the corresponding kth value of the ith comparison sequence, max denotes the maximum distance, and min denotes the minimum distance. ρ is called the resolution coefficient and generally takes values in the interval (0, 1); the resolution is higher when ρ ≤ 0.5463, and ρ = 0.5 is usually taken.

Step 2.5: The degree of association is determined from the relational coefficients at all positions k by the following formula:

$$r_i = \frac{1}{n} \sum_{k=1}^{n} \xi_i(k)$$

wherein r_i is the degree of association between the ith quasi-identifier attribute and the sensitive attribute.

Step 3: Determine the generalization hierarchy of each quasi-identifier attribute according to the degree of association between the quasi-identifier and the sensitive attribute. The higher the degree of association, the stronger the association between the data and the finer the generalization hierarchy of the quasi-identifier should be; for a quasi-identifier with a low degree of association, the generalization hierarchy is relatively coarse. The generalization hierarchy of each quasi-identifier is determined accordingly.

Step 4: Determine the optimal number of clusters for the data by the elbow method, based on the selected identifiers, quasi-identifiers and sensitive attribute.

Step 4.1: A cluster number range m is given for the data set T, and the data set is partitioned for each cluster number in the range, starting from a cluster number of 2; the Euclidean distance from the centroid of each cluster to each data point is calculated by the following formula:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

wherein x_i and y_i are the values of the two data points in the ith dimension and n is the number of dimensions. The data points are assigned to clusters by Euclidean distance according to the nearest-centroid principle.

Step 4.2: Based on the partition, the SSE (Sum of Squared Errors) of each cluster is calculated, and the current number of clusters and the total sum of squared errors are plotted as coordinates. The SSE is calculated by the following formula:

$$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

wherein k is the current number of clusters, C_i represents the ith cluster, p represents a sample point in C_i, and m_i represents the mean of all samples in C_i. The optimal cluster number L is determined from the elbow plot of the medical data set T, as shown in fig. 2.

Step 5: Cluster the data set with the optimal cluster number L as the number of clusters.

Step 5.1: Put all data into a queue {d_1} as one cluster, perform mean clustering with cluster number m = 2 on this cluster, calculate the SSE of each resulting cluster, and put the resulting clusters into the queue, giving {d_1, d_2, d_3}.

Step 5.2: Select the cluster with the minimum SSE from the queue and split it by m = 2 mean clustering, put the resulting clusters into the queue, and repeat these steps until the number of clusters is greater than L.

Step 5.3: Through the above clustering steps, the medical data set T is divided into m data sets (T_1, T_2, …, T_m).

Step 6: Perform K-anonymization with the given value a as the threshold, place the records in the data set that already satisfy K-anonymity into the K-anonymous table, and count the number of records in each T_m.

Step 6.1: In T_m, find the quasi-identifier attribute A with the largest number of distinct values and the highest degree of association, and raise its generalization level by one layer according to the generalization hierarchy.

Step 6.2: Count the records in the current T_m that satisfy the K-anonymity rule and those that do not.

Step 6.3: Place the records in T_m that satisfy the K-anonymity rule in the K-anonymous table, and repeat step 6.1 for the records that do not, until the number of records remaining in T_m is less than K.

Step 6.4: After each data set has been K-anonymized, merge the remaining records, whose number is less than the threshold a, into a new data table T_s, and K-anonymize it according to step 6.1.

Example 2

The K anonymous clustering privacy protection method based on medical data utility provided by the embodiment of the invention comprises the following steps:

As shown in FIG. 2, the implementation includes principal component analysis, grey relational analysis, generalization, elbow method, clustering and K-anonymization modules. The steps are as follows:

Step 1: Principal component analysis is performed on the medical data, as shown in fig. 4. First, the medical data are mean-centered and the covariance matrix is calculated; the eigenvalues and eigenvectors of the covariance matrix are then calculated and the eigenvalues are sorted from largest to smallest; the eigenvectors corresponding to the K largest eigenvalues are retained, and the data are transformed into the new space spanned by these K eigenvectors, which completes the dimension reduction. The identifiers, quasi-identifiers and sensitive attribute are then selected according to the load values.
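The sequence of operations in this step (mean-centering, covariance, eigen-decomposition, projection) can be sketched in plain Python. Power iteration with deflation is used here only as a simple stand-in for a full eigen-decomposition, and the function names and sample rows are hypothetical; a production implementation would more likely call a numerical library.

```python
import math


def mean_center(rows):
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered):
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix by power iteration."""
    p = len(matrix)
    # Asymmetric start vector; it must not be orthogonal to the dominant eigenvector.
    vec = [float(i + 1) for i in range(p)]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    value = 0.0
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
        value = norm
    return value, vec


def pca(rows, n_components):
    centered = mean_center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        components.append((value, vec))
        # Deflation: remove the component just found so the next one can be extracted.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    # Project the centered data onto the retained eigenvectors.
    projected = [[sum(x * c for x, c in zip(row, vec)) for _, vec in components]
                 for row in centered]
    return components, projected


# Hypothetical records with three attributes; the third is constant.
rows = [[1.0, 1.0, 5.0], [2.0, 2.0, 5.0], [3.0, 3.0, 5.0], [4.0, 4.0, 5.0]]
components, projected = pca(rows, 1)
eigenvalue, loadings = components[0]
print(eigenvalue, loadings, projected)
```

In this toy input the constant third attribute receives a zero loading, which illustrates how load values can indicate which attributes carry information.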

Identifier attribute: data that can directly identify an individual, such as name, phone number or identification number. Identifier attributes are deleted from the data table directly before the data is released.

Quasi-identifier: the minimal set of attributes that can be linked with an external table to identify an individual, that is, attributes that exist in both the released data table and the external table, such as zip code, birthday and gender. By combining these attributes with an external data table, a link attack can identify specific personal information.

Sensitive attribute: an attribute that the data subject does not want others to know when the data is released, such as disease information, purchasing preferences or salary; this information needs to be protected before release.
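The three attribute types can be made concrete with a small sketch that deletes identifiers and then checks whether the quasi-identifier combinations satisfy K-anonymity. The table, the attribute names and both helper functions are hypothetical.

```python
from collections import Counter


def drop_identifiers(records, identifiers):
    """Remove identifier attributes (for example, name) before release."""
    return [{k: v for k, v in r.items() if k not in identifiers} for r in records]


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())


# Hypothetical table: name is an identifier, zip and gender are quasi-identifiers,
# and disease is the sensitive attribute.
raw = [
    {"name": "A", "zip": "43001", "gender": "F", "disease": "flu"},
    {"name": "B", "zip": "43002", "gender": "F", "disease": "asthma"},
    {"name": "C", "zip": "43001", "gender": "M", "disease": "flu"},
    {"name": "D", "zip": "43002", "gender": "M", "disease": "cold"},
]
released = drop_identifiers(raw, {"name"})
# Generalize the zip code by masking its last two digits.
generalized = [dict(r, zip=r["zip"][:3] + "**") for r in released]

before = is_k_anonymous(released, ["zip", "gender"], 2)
after = is_k_anonymous(generalized, ["zip", "gender"], 2)
print(before, after)
```

Deleting the name alone leaves every (zip, gender) pair unique, so a link attack through an external table remains possible; generalizing the zip code makes each pair shared by two records.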

Step 2: Perform grey relational analysis on the data set T obtained in step 1, as shown in fig. 5.

Step 2.1: Determine the reference sequence Y = {Y(k) | k = 1, 2, ..., n}, which corresponds to the sensitive attribute, and the comparison sequences X_i = {X_i(k) | k = 1, 2, ..., n}, i = 1, 2, ..., m, which correspond to the attributes whose association with the sensitive attribute is to be determined. X_i(k) represents the kth value in the ith comparison sequence, and m represents the number of QI attributes.

Step 2.2: the measurement units of different data are different, and the data are normalized by the following formula:

As shown in FIG. 6, when the degree of association between an attribute and the sensitive attribute is lower, the generalization hierarchy is smaller (coarser), such as the hierarchy on the left side of FIG. 6; when the degree of association is higher, the generalization hierarchy is finer, such as the hierarchy on the right side of FIG. 6. The finer the generalization hierarchy, the lower the information loss rate in the anonymization process and the better the original character of the data is preserved.
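One way to picture the relationship between the degree of association and the fineness of the hierarchy is the sketch below, which gives a numeric attribute more and narrower interval levels when its degree is high. The threshold of 0.7, the interval widths and the function names are all hypothetical and are not taken from FIG. 6.

```python
def interval_label(value, width):
    """Label of the interval of the given width that contains value."""
    low = value // width * width
    return f"{low}-{low + width - 1}"


def build_hierarchy(degree, threshold=0.7):
    """Interval widths, finest first; a higher degree yields more, narrower levels."""
    return [5, 10, 20, 40] if degree >= threshold else [20, 40]


def generalize(value, degree, level):
    """Generalize value to the given level; level 0 is the raw value, the top is '*'."""
    widths = build_hierarchy(degree)
    if level == 0:
        return str(value)
    if level > len(widths):
        return "*"
    return interval_label(value, widths[level - 1])


fine = generalize(34, 0.9, 1)    # strongly associated attribute: narrow first level
coarse = generalize(34, 0.4, 1)  # weakly associated attribute: wide first level
top = generalize(34, 0.4, 3)     # beyond the last level everything is suppressed
print(fine, coarse, top)
```

With the finer hierarchy, one generalization step hides less of the original value, which is consistent with the lower information loss described above.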

Step 4: the optimal cluster number is determined for the medical dataset T using an elbow method, as shown in fig. 7.

where $C_i$ denotes the i-th cluster, p denotes a sample point in $C_i$, and $m_i$ denotes the mean of all samples in $C_i$. The optimal cluster number L is determined from the elbow plot of the medical dataset T, as shown in FIG. 2.
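The SSE expression that these symbols belong to is not reproduced in this text. Under the symbol definitions above, the conventional sum of squared errors used by the elbow method would read as follows; this is the standard form, offered as an assumption about the omitted expression, with k (the number of clusters being tried) introduced here for illustration:

```latex
\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^{2}
```

SSE is evaluated for increasing k, and the value of k at which the decrease in SSE flattens (the "elbow" of the curve) is taken as the cluster number.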

Step 5: Cluster the data set using the optimal cluster number L as the number of clusters, as shown in FIG. 8.

Step 5.1: Place all data, as a single cluster, into a queue $D = \{d_1\}$; perform mean clustering with cluster number m = 2 on that cluster, calculate the SSE of each resulting cluster, and put the divided clusters into the queue $D = \{d_1, d_2, d_3\}$.

Step 5.2: Select the cluster with the minimum SSE from the queue and perform mean clustering with m = 2 on it, then place the divided clusters into the queue. The optimal cluster number serves as the clustering threshold, and this step is repeated until the threshold is met.
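Steps 5.1 and 5.2 resemble bisecting k-means and can be sketched in plain Python as below. Several details are assumptions: the split cluster is replaced in the queue by its two parts, clusters that cannot be split (a single distinct point) are skipped, the 2-means initialization uses a fixed random seed, and the function names are illustrative. The `pick` parameter defaults to `min` to follow the text; the more common bisecting k-means variant splits the cluster with the largest SSE, which corresponds to `pick=max`.

```python
import random

def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    m = mean(points)
    return sum(sq_dist(p, m) for p in points)

def two_means(points, iters=20, seed=0):
    """Split one cluster into two with plain 2-means (Lloyd iterations)."""
    rng = random.Random(seed)
    # Start from two distinct points so that both groups are non-empty.
    centers = rng.sample(sorted(set(points)), 2)
    split = None
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            break  # defensive: keep the last valid split
        split = groups
        centers = [mean(groups[0]), mean(groups[1])]
    return list(split)

def bisecting_kmeans(points, n_clusters, pick=min):
    """Repeatedly bisect the cluster chosen by `pick` on SSE."""
    queue = [list(points)]
    while len(queue) < n_clusters:
        candidates = [c for c in queue if len(set(c)) > 1]
        if not candidates:
            break  # nothing left that can be split
        target = pick(candidates, key=sse)
        queue.remove(target)
        queue.extend(two_means(target))
    return queue

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0),
       (10.0, 11.0), (11.0, 10.0), (20.0, 0.0), (20.0, 1.0)]
clusters = bisecting_kmeans(pts, 3)
```

Because each step only ever splits one cluster in two, the result does not depend on choosing L initial centroids at once, which is the motivation the abstract gives for avoiding local optima.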

Step 6: the data is K anonymized as shown in FIG. 9.

Step 6.1: Perform K-anonymization with a as the threshold. Records in each data subset that already satisfy K-anonymity are listed in the K-anonymity table, and the number of records in each $T_m$ is counted.

Step 6.2: For each $T_m$, find the quasi-identifier attribute A that has the largest number of values and the highest association degree, and raise the generalization level of A by one layer, starting from the bottom of its generalization hierarchy.

Step 6.3: Count the records in the current $T_m$ that conform to the K-anonymity rule and those that do not.

Step 6.4: List the records in $T_m$ that conform to the K-anonymity rule in the K-anonymity table, and repeat Step 6.2 for the records that do not conform, until the number of records remaining in $T_m$ is less than K.

Step 6.5: After the data sets have been K-anonymized, combine the remaining records, whose number is less than the threshold a, into a new data table $T_s$, and K-anonymize it according to Step 6.2.
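Steps 6.1 to 6.5 can be sketched as follows. This is an illustrative reading, not the patented procedure itself: the sketch uses a single parameter `k` both as the anonymity requirement and as the stopping threshold, because the text does not fully specify how a and K relate; `hierarchies` is a hypothetical layout mapping each quasi-identifier to a list of generalization functions ordered from the bottom level upward; and ties in the number of distinct values are broken by dictionary order, so listing attributes by descending association degree favors the more strongly associated one.

```python
from collections import Counter

def qi_key(record, qi_attrs):
    """Tuple of quasi-identifier values used to form equivalence classes."""
    return tuple(record[a] for a in qi_attrs)

def anonymize_cluster(records, hierarchies, k):
    """Generalize one cluster T_m level by level (Steps 6.1 to 6.4)."""
    qi_attrs = list(hierarchies)
    levels = {a: 0 for a in qi_attrs}
    released, pending = [], list(records)

    def view(rec):
        # Apply the current generalization levels to the original values.
        out = dict(rec)
        for a in qi_attrs:
            if levels[a] > 0:
                out[a] = hierarchies[a][levels[a] - 1](rec[a])
        return out

    while True:
        views = [view(r) for r in pending]
        counts = Counter(qi_key(v, qi_attrs) for v in views)
        ok = [counts[qi_key(v, qi_attrs)] >= k for v in views]
        released += [v for v, good in zip(views, ok) if good]
        pending = [r for r, good in zip(pending, ok) if not good]
        if len(pending) < k:
            break
        raisable = [a for a in qi_attrs if levels[a] < len(hierarchies[a])]
        if not raisable:
            break
        # Raise the attribute with the most distinct values by one level.
        target = max(raisable, key=lambda a: len({view(r)[a] for r in pending}))
        levels[target] += 1
    return released, pending

def anonymize_all(clusters, hierarchies, k):
    """Anonymize every cluster, then merge leftovers into T_s (Step 6.5)."""
    table, leftovers = [], []
    for cluster in clusters:
        released, pending = anonymize_cluster(cluster, hierarchies, k)
        table += released
        leftovers += pending
    released, dropped = anonymize_cluster(leftovers, hierarchies, k)
    return table + released, dropped

hierarchies = {
    "age": [lambda v: v // 10 * 10, lambda v: "*"],
    "sex": [lambda v: "*"],
}
clusters = [
    [{"age": 21, "sex": "F", "d": 1}, {"age": 25, "sex": "F", "d": 0},
     {"age": 34, "sex": "M", "d": 1}],
    [{"age": 38, "sex": "M", "d": 0}],
]
table, dropped = anonymize_all(clusters, hierarchies, k=2)
```

In this example the two leftover records (ages 34 and 38) come from different clusters, are merged into the second pass, and are released together once age is generalized to the decade 30. Records still unmatched after the merged pass are returned in `dropped`; how to treat them is left open here.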

The technical scheme of the invention is further described below in connection with specific experimental data.

The 14 attributes in the raw dataset are age, sex, chest pain type, resting blood pressure, plasma steroid content, fasting blood glucose, resting electrocardiographic results, highest heart rate, exercise-induced angina, exercise-induced ST decline value, slope of the electrocardiographic ST segment at maximum exercise, number of main vessels measured using fluorescence, THAL (thalassemia), and whether heart disease is present. See Table 1.

TABLE 1

After principal component analysis is performed on the 13 attributes of the original data set, the principal component with the smallest dimension is selected according to the magnitude of the correlation coefficients. Five attributes (gender, plasma steroid content, resting electrocardiogram result, highest heart rate, and exercise-induced angina) are used as quasi-identifiers, and one attribute (whether heart disease is present) is used as the sensitive attribute, so the original 13 quasi-identifier attributes are reduced to 5. See Table 2.

TABLE 2

After these 5 quasi-identifier attributes are determined, gray relational analysis is used to determine the degree of association between each quasi-identifier and the sensitive attribute. As shown in FIG. 6, a quasi-identifier attribute with a higher association degree is given a more finely divided generalization hierarchy, while a quasi-identifier attribute with a low association degree is given a coarser one.

In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more; the terms "upper," "lower," "left," "right," "inner," "outer," "front," "rear," "head," "tail," and the like are used as an orientation or positional relationship based on that shown in the drawings, merely to facilitate description of the invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and therefore should not be construed as limiting the invention. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in whole or in part in software, the implementation takes the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).

The foregoing is merely a description of specific embodiments of the present invention, and the scope of protection of the invention is not limited thereto. Any modification, equivalent replacement, or improvement made by a person skilled in the art within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A K-anonymous clustering privacy protection method, characterized by comprising the following steps:

completing the dimension reduction of the data by principal component analysis and determining the sensitive attribute, the quasi-identifier attributes, and the identifier attributes; calculating the association degree between the sensitive attribute and the quasi-identifier attributes by applying gray relational analysis to the dimension-reduced data; determining the generalization hierarchy of each quasi-identifier according to its association degree with the sensitive attribute; determining the number of clusters suitable for the data set by the elbow method; judging, according to the threshold a, whether data are clustered directly or combined with other data values; clustering the data set; and K-anonymizing the clustered data according to the generalization hierarchy of the quasi-identifier attributes;

the K anonymous clustering privacy protection method comprises the following steps of:

step six, performing K-anonymization with a as the threshold, listing the records in the data set that conform to K-anonymity in a K-anonymity table, and counting the number of records in the K-anonymity table;

in step six, performing K-anonymization with a as the threshold, listing the records in the data set that conform to K-anonymity in the K-anonymity table, and counting the number of records in the K-anonymity table comprises:

(1) Finding, in the K-anonymity table, the quasi-identifier attribute A with the largest number of values and the highest association degree, and raising the generalization level of A by one layer according to the generalization hierarchy;

(2) Counting records of the current K anonymity table conforming to the K anonymity rule and records not conforming to the K anonymity rule;

(3) Listing the records that conform to the K-anonymity rule in the K-anonymity table, and repeating step (1) for the records that do not conform, until the number of records in the K-anonymity table is smaller than K;

2. The K-anonymous clustering privacy protection method as set forth in claim 1, wherein in the first step, the dimension reduction of the medical data set T according to the principal component analysis method includes:

(1) The principal component identifiers that may be present are expressed as:

where p denotes the dimension of the attributes in each record, c denotes the weight of an attribute in each record, Z denotes a principal component, and q denotes the number of principal components that may exist; the principal components are mutually independent; $Z_1, Z_2, \ldots, Z_q$ are composed of different quasi-identifiers $x_1, x_2, \ldots, x_p$;

(2) According to the loading values $C_{ij}$, the principal component with the smallest attribute dimension is selected from the principal component set, suitable QI attributes are selected from that principal component, and the identifier, quasi-identifier, and sensitive attributes are determined.
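The expression referred to in item (1) is not reproduced in this text. Assuming the conventional form of a principal component as a weighted linear combination, and using the symbols defined above, it would read as follows; the index convention $c_{ij}$ is an assumption chosen to match the loading values of item (2):

```latex
Z_i = c_{i1} x_1 + c_{i2} x_2 + \cdots + c_{ip} x_p, \qquad i = 1, 2, \ldots, q
```

Under this reading, a quasi-identifier $x_j$ contributes strongly to principal component $Z_i$ when the magnitude of its loading $c_{ij}$ is large, which is the basis for selecting QI attributes from the chosen component.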

3. The K-anonymous cluster privacy protection method as set forth in claim 1, wherein in the second step, the determining the association degree of the quasi-identifier and the sensitive attribute using the gray-scale association analysis method comprises:

(1) The sensitive attribute is taken as a reference sequence and expressed as:

$Y = \{Y(k) \mid k = 1, 2, \ldots, n\}$

wherein Y is a specific sensitive attribute;

$X_i = \{X_i(k) \mid k = 1, 2, \ldots, n\}$, $i = 1, 2, \ldots, m$;

where $X_i(k)$ denotes the k-th value in the i-th comparison sequence, and m denotes the number of QI attributes;

where $|y(k) - x_i(k)|$ is the distance between the k-th data of the reference sequence and the corresponding k-th data of the i-th comparison sequence, max denotes the maximum distance, and min denotes the minimum distance; ρ is called the resolution coefficient and takes values in the interval (0, 1); when ρ ≤ 0.5463 the resolution is higher, and ρ = 0.5 is taken;

where $r_i$ denotes the association degree; the closer the association degree is to 1, the stronger the association between the quasi-identifier attribute and the sensitive attribute;
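The coefficient and association-degree formulas themselves are not reproduced in this text. A minimal Python sketch follows, assuming the commonly used gray relational coefficient, which is consistent with the symbols described above (distance, min, max, ρ), and assuming min-max scaling for the normalization step. Both choices and the function names are assumptions for illustration.

```python
def normalize(seq):
    """Min-max scale a sequence to [0, 1] (assumed normalization)."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0 for _ in seq]
    return [(v - lo) / (hi - lo) for v in seq]

def gray_relational_degrees(reference, comparisons, rho=0.5):
    """Return one association degree r_i per comparison sequence.

    Assumed coefficient for the k-th value of the i-th sequence:
        (d_min + rho * d_max) / (|y(k) - x_i(k)| + rho * d_max)
    where d_min and d_max are taken over all i and k.
    r_i is the mean of the coefficients over k.
    """
    y = normalize(reference)
    xs = [normalize(x) for x in comparisons]
    deltas = [[abs(yk - xk) for yk, xk in zip(y, x)] for x in xs]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0:
        # Every comparison sequence coincides with the reference.
        return [1.0 for _ in xs]
    degrees = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        degrees.append(sum(coeffs) / len(coeffs))
    return degrees

sensitive = [0, 1, 0, 1]
quasi_identifiers = [[0, 2, 0, 2], [5, 1, 5, 1]]
degrees = gray_relational_degrees(sensitive, quasi_identifiers)
```

In this example the first sequence follows the sensitive attribute exactly after normalization and gets a degree of 1, while the second moves in the opposite direction and gets a degree of one third.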

4. The K-anonymous cluster privacy protection method as set forth in claim 1, wherein in the fourth step, the determining the number of best clusters of data according to the elbow method based on the selected identifier, the quasi identifier and the sensitive attribute comprises:

where $x_i$ and $y_i$ are the values of the two data points in the i-th dimension used in the calculation; the data points are assigned to clusters by Euclidean distance according to the nearest-centroid principle;
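The nearest-centroid assignment by Euclidean distance can be sketched as follows. The function names and the sample points are illustrative assumptions.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def assign_to_nearest(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        clusters[idx].append(p)
    return clusters

points = [(0, 0), (1, 1), (9, 9), (10, 10)]
centroids = [(0, 0), (10, 10)]
clusters = assign_to_nearest(points, centroids)
```

When a point is equally distant from two centroids, `min` returns the first of them, so ties go to the centroid with the lower index.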

5. The K anonymous clustering privacy protection method as set forth in claim 1, wherein in the fifth step, clustering the data set with L as the number of clusters according to the optimal number of clusters L, comprises:

(1) All data are regarded as a single cluster and put into the queue $\{d_1\}$; mean clustering with cluster number m = 2 is performed on the cluster, the SSE of each resulting cluster is calculated, and the divided clusters are placed into the queue $\{d_1, d_2, d_3\}$;

(2) The cluster with the minimum SSE is selected from the queue and subjected to mean clustering with m = 2, the divided clusters are then placed into the queue, and step (1) is repeated until the number of clusters is larger than L;

6. A K-anonymous cluster privacy protection system applying the K-anonymous cluster privacy protection method as defined in any one of claims 1 to 5, characterized in that the K-anonymous cluster privacy protection system comprises:

7. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the K-anonymous cluster privacy protection method of any of claims 1-5, comprising the steps of:

8. An information data processing terminal, wherein the information data processing terminal is configured to implement the K anonymous cluster privacy protection system as set forth in claim 6.