CN102542209A

CN102542209A - Data anonymization method and system

Info

Publication number: CN102542209A
Application number: CN2010106132608A
Authority: CN
Inventors: 赵彧; 李建强; 刘博�
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd
Priority date: 2010-12-21
Filing date: 2010-12-21
Publication date: 2012-07-04
Anticipated expiration: 2030-12-21
Also published as: CN102542209B

Abstract

The invention provides a data anonymization method and system. The data anonymization method comprises the steps of: carrying out text analysis on the attribute value of a text type in data; replacing the attribute value of the text type in the data with the attribute value of a value type or a class type according to text analysis result; and carrying out anonymization processing on the data in which the attribute value of the text type is replaced by the attribute value of the value type or the class type. According to the invention, after anonymization processing, the data comprising the attribute value of the text type not only can prevent the privacy leakage based on the attribute value, but also still has use value.

Description

Data anonymization method and system

Technical field

The present invention relates to the field of computers, and more specifically to a data anonymization method and system.

Background technology

In statistics, microdata refers to data that contains information about individuals, for example the records in a hospital's medical database that hold each patient's age, sex, diagnosis and similar information. When microdata is published or shared, protecting individual privacy is a problem that cannot be ignored. Microdata usually comprises three kinds of attributes: explicit identifiers, quasi-identifiers (QIs) and sensitive attributes. For a data record, the value of an explicit identifier unambiguously identifies the individual associated with the record; "name" and "ID card number" are explicit identifiers. A data record usually also contains a set of quasi-identifiers whose values, taken together, can weakly identify the individual associated with the record; for example, the combination of the values of "age", "sex" and "postcode" may narrow the record down to one or a few individuals. Finally, a sensitive attribute holds sensitive data (for example, private information) about the individual associated with the record, such as "disease" or "salary". When microdata is published or shared, the association between an individual and that individual's sensitive attribute values must usually not be disclosed, that is, the data must be anonymized, while the usefulness of the shared data must not be impaired.

The simplest anonymization approach, and the one most commonly used in everyday practice, is to delete the explicit identifiers directly when the data is published. This is not a safe method, however, because quasi-identifiers can also potentially identify an individual; for uncommon quasi-identifier values in particular, the probability that the individual is identified increases greatly. The generally accepted way of dealing with this problem is a technique called k-anonymity. Its core idea is as follows: after the explicit identifiers have been removed, the quasi-identifier values of each record are processed (for example, generalized or suppressed) so that the records in the microdata are divided into several record groups, where all records in a group have identical quasi-identifier values and every group contains at least k records.
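The k-anonymity condition described above can be checked mechanically. The following is a minimal sketch, assuming records are Python dictionaries; the function name `is_k_anonymous` and the sample values are illustrative and do not come from the patent.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in the data is shared by at least k records."""
    group_sizes = Counter(
        tuple(rec[attr] for attr in quasi_identifiers) for rec in records
    )
    return all(size >= k for size in group_sizes.values())


# Hypothetical, already-generalized records (illustrative values only).
records = [
    {"postcode": "476*", "age": "2*", "disease": "heart disease"},
    {"postcode": "476*", "age": "2*", "disease": "heart disease"},
    {"postcode": "476*", "age": "2*", "disease": "heart disease"},
    {"postcode": "4790*", "age": ">=40", "disease": "influenza"},
    {"postcode": "4790*", "age": ">=40", "disease": "cancer"},
    {"postcode": "4790*", "age": ">=40", "disease": "influenza"},
]

print(is_k_anonymous(records, ["postcode", "age"], 3))
```

Note that the first group in this sample is 3-anonymous even though all of its records share one disease value, which is the weakness of plain k-anonymity.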

However, the traditional k-anonymity method only prevents an individual from being identified, and privacy from being leaked, through the quasi-identifiers; it cannot prevent privacy disclosure caused by the sensitive attribute itself. For example, even if the microdata satisfies k-anonymity, the sensitive attribute values of all records in some record group may be identical. If someone's quasi-identifier values are known, then although it is impossible to tell which record in the group corresponds to that person, the person's sensitive attribute value can still be inferred, so privacy has in fact been disclosed. An example of k-anonymity is given below. Table 1 is the original data table, and Table 2 is a data table that satisfies 3-anonymity (k = 3) after the quasi-identifier values have been generalized (that is, the postcode has been generalized to 476* and 4790*, and the age has been generalized to three age ranges: 2*, 3*, and over 40). As Table 2 shows, if a person's information is known to be among records 1 to 9, and this person's postcode is 476* and age is in the twenties, then it can be concluded with certainty that this person has heart disease. Anonymization techniques that take sensitive attributes into account are therefore a focus of current anonymization research.

Table 1

Table 2

The mainstream anonymization techniques at present are k-anonymity and its extensions. The k-anonymity method ensures that an individual cannot be identified through the quasi-identifiers (identity disclosure), but privacy may still be exposed through the sensitive attribute values (attribute disclosure). Most extensions of k-anonymity therefore require, when the records in the microdata are grouped by their processed quasi-identifier values, that the distribution of the sensitive attribute values in each record group satisfy a predetermined condition.

For example, the following anonymization methods exist in the prior art:

The paper "l-diversity: Privacy beyond k-anonymity" by A. Machanavajjhala, D. Kifer, J. Gehrke and M. Venkitasubramaniam, published in ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, 2007, proposes an improved anonymization method based on k-anonymity called l-diversity. Specifically, so that the published data does not expose privacy through the sensitive attribute values, this method requires, on top of k-anonymity, that the records in the microdata be divided into groups by processing the quasi-identifier values and that the records in each group (that is, each set of records with identical quasi-identifier values) contain at least l "different" sensitive attribute values. What counts as l "different" sensitive attribute values can be defined in various ways; the simplest and most direct definition is that there are l distinct data values. For the example shown in Table 1, the following anonymized result satisfies both 3-anonymity (k = 3) and 2-diversity (l = 2): the records are divided into groups of at least 3 records each, and the records in each group contain 2 different sensitive attribute values.
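The simplest definition mentioned above, l distinct values per group, can be sketched as follows. This is an illustrative check only; the function name `is_distinct_l_diverse` and the sample records are assumptions, not part of the cited paper or the patent.

```python
from collections import defaultdict


def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """Return True if every group of records with identical quasi-identifier
    values contains at least l distinct sensitive attribute values."""
    values_per_group = defaultdict(set)
    for rec in records:
        key = tuple(rec[attr] for attr in quasi_identifiers)
        values_per_group[key].add(rec[sensitive])
    return all(len(values) >= l for values in values_per_group.values())


# Hypothetical generalized records (illustrative values only).
records = [
    {"postcode": "476*", "age": "2*", "disease": "heart disease"},
    {"postcode": "476*", "age": "2*", "disease": "influenza"},
    {"postcode": "476*", "age": "2*", "disease": "heart disease"},
    {"postcode": "4790*", "age": ">=40", "disease": "cancer"},
    {"postcode": "4790*", "age": ">=40", "disease": "influenza"},
    {"postcode": "4790*", "age": ">=40", "disease": "cancer"},
]

print(is_distinct_l_diverse(records, ["postcode", "age"], "disease", 2))
```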

The paper "t-closeness: Privacy beyond k-anonymity and l-diversity" by N. Li, T. Li and S. Venkatasubramanian, presented at the 2007 International Conference on Data Engineering (ICDE), proposes t-closeness, an improved anonymization method based on k-anonymity that offers stronger privacy protection than l-diversity. The l-diversity method only requires l different sensitive attribute values in each record group and places no requirement on how these values are distributed. This can lead to the following kind of privacy leak: if a certain sensitive attribute value accounts for a large proportion of a record group, and a user is known to be in that group, then it can be inferred that this user has that sensitive attribute value with correspondingly high probability. To avoid this, the t-closeness method, unlike l-diversity, requires that the difference between the distribution of the sensitive attribute values in each record group and the distribution of the sensitive attribute values in the whole data set be no greater than t. The measure of the difference between distributions can be computed in different ways.
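Since the text leaves the distance measure open, the sketch below uses total variation distance purely as a simple stand-in (the cited paper itself works with Earth Mover's Distance). The function names and the sample data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value in a non-empty list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions
    (an assumed stand-in for the distance measure)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def is_t_close(records, quasi_identifiers, sensitive, t):
    """Return True if, for every record group, the distance between the
    group's sensitive-value distribution and the overall one is at most t."""
    overall = distribution([rec[sensitive] for rec in records])
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[attr] for attr in quasi_identifiers)
        groups[key].append(rec[sensitive])
    return all(
        total_variation(distribution(values), overall) <= t
        for values in groups.values()
    )


# Two groups, each containing only one disease: far from the overall mix.
skewed = [
    {"zip": "476*", "disease": "influenza"},
    {"zip": "476*", "disease": "influenza"},
    {"zip": "4790*", "disease": "cancer"},
    {"zip": "4790*", "disease": "cancer"},
]
print(is_t_close(skewed, ["zip"], "disease", 0.4))
```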

The paper "Distribution based microdata anonymization" by N. Koudas, D. Srivastava, T. Yu and Q. Zhang, presented at the 2009 International Conference on Very Large Data Bases (VLDB), proposes an improved anonymization method that, like t-closeness, takes the distribution of the sensitive attribute values into account. Unlike techniques based on traditional k-anonymity, however, this method processes the sensitive attribute values, that is, it generalizes and permutes them, and leaves the quasi-identifier values unchanged. In addition, the goal of anonymization in this method is a given target distribution of the sensitive attribute values: the sensitive attribute values in the data table are processed so that the processed data reveals only that the sensitive attribute value of each record follows the target distribution given in advance, and nothing else. An example of this method follows. Table 3 is the original data table, and the preset privacy protection goal is that the records are grouped by the first three digits of the postcode and that the sensitive attribute values in each record group are uniformly distributed over the set {$30K, $40K, $50K, $60K}. Table 4 is the processed data table. As Table 4 shows, the sensitive attribute values have been generalized and permuted so that the distribution of the sensitive attribute values in each record group is uniform over the set {$30K, $40K, $50K, $60K}.

Table 3

Table 4

Although the existing methods described above all deal effectively with privacy disclosure caused by sensitive attributes, they can only handle sensitive attributes of numerical or categorical type and cannot handle sensitive attributes of text type. In practical applications, sensitive attributes are often text, for example the symptom description attribute in medical record data. When the sensitive attribute is text, the prior art cannot be applied directly to anonymize the data, and in particular cannot prevent privacy disclosure based on the sensitive attribute.

Summary of the invention

The present invention has been made in view of one or more of the above problems.

The present invention proposes a data anonymization method and system for data that contains sensitive attribute values of text type. The method and system integrate text analysis techniques into anonymization: according to the privacy protection requirements, text analysis is used to mine the semantic associations between the sensitive attribute values of different data records, and these semantic associations are then combined with anonymization techniques to prevent privacy disclosure based on the sensitive attribute.

A data anonymization method according to an embodiment of the present invention comprises: performing text analysis on attribute values of text type in data; replacing, according to the text analysis result, the attribute values of text type in the data with attribute values of numerical type or categorical type; and anonymizing the data in which the attribute values of text type have been replaced with attribute values of numerical type or categorical type.

A data anonymization system according to an embodiment of the present invention comprises: a text analysis device configured to perform text analysis on attribute values of text type in data; an attribute replacement device configured to replace, according to the text analysis result, the attribute values of text type in the data with attribute values of numerical type or categorical type; and an anonymization device configured to anonymize the data in which the attribute values of text type have been replaced with attribute values of numerical type or categorical type.

Compared with existing anonymization methods, the present invention enables data that contains attribute values of text type, once anonymized, both to prevent privacy disclosure based on those attribute values and to retain its usefulness.

Description of drawings

The present invention will be better understood from the following detailed description of embodiments of the invention taken in conjunction with the accompanying drawings, in which like reference numerals indicate like parts, and in which:

Fig. 1 shows a schematic block diagram of a microdata anonymization system according to an embodiment of the present invention;

Fig. 2 shows a schematic flowchart of a microdata anonymization method according to an embodiment of the present invention;

Fig. 3 shows a detailed block diagram of a microdata anonymization system according to an embodiment of the present invention;

Fig. 4 shows a schematic flowchart of the process of setting the clustering parameter K according to an embodiment of the present invention; and

Fig. 5 shows another detailed block diagram of a microdata anonymization system according to an embodiment of the present invention.

Embodiment

Features and exemplary embodiments of various aspects of the present invention are described in detail below. The following description covers many specific details in order to provide a thorough understanding of the invention. It will be apparent to those skilled in the art, however, that the invention can be practiced without some of these details. The description of the embodiments below is intended only to provide a clearer understanding of the invention by showing examples of it. The invention is in no way limited to any specific configuration or algorithm set forth below, but covers any modification, replacement and improvement of the relevant elements, components and algorithms that does not depart from the spirit of the invention.

It should be noted that the sensitive attribute values of text type, numerical type and/or categorical type referred to herein are private information of the individual associated with a record entry in the microdata. Moreover, a sensitive attribute value of text type here does not mean a category name expressed as text, such as "heart disease", "cancer" or "influenza" in Table 1 and Table 2; it means a textual description of an event or state other than such a category name, for example a symptom description in medical case data.

In view of the privacy disclosure problem and other problems that exist in microdata containing sensitive attribute values of text type, the present invention proposes a data anonymization method and system. An example of a microdata anonymization method and system according to an embodiment of the invention is described below with reference to Fig. 1 and Fig. 2. Fig. 1 shows a schematic block diagram of the microdata anonymization system, and Fig. 2 shows a schematic flowchart of the microdata anonymization method.

As shown in Fig. 1, a microdata anonymization system according to an embodiment of the invention comprises a text analysis device 102, an attribute replacement device 104 and an anonymization device 106, whose functions are as follows. The text analysis device 102 performs text analysis on the sensitive attribute values of text type in the microdata (that is, it executes step S202). The attribute replacement device 104 replaces, according to the text analysis result, the sensitive attribute values of text type in the microdata with sensitive attribute values of numerical type or categorical type (that is, it executes step S204). The anonymization device 106 anonymizes the microdata in which the sensitive attribute values of text type have been replaced with sensitive attribute values of numerical type or categorical type (that is, it executes step S206).
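The three-stage structure can be pictured as a simple function pipeline. The sketch below only shows how the stages plug together; the function `anonymize_microdata` and the toy stage functions are hypothetical stand-ins and do not represent the actual devices 102, 104 and 106.

```python
def anonymize_microdata(records, text_attr, analyze, replace, anonymize):
    """Chain the three stages; each stage is a caller-supplied function."""
    texts = [rec[text_attr] for rec in records]
    analysis = analyze(texts)                         # role of device 102, step S202
    replaced = replace(records, text_attr, analysis)  # role of device 104, step S204
    return anonymize(replaced)                        # role of device 106, step S206


# Toy stand-ins, only to show the data flow between the stages.
def toy_analyze(texts):
    # "Analysis result": the first word of each description, lower-cased.
    return [t.split()[0].lower() if t.split() else "" for t in texts]


def toy_replace(records, text_attr, analysis):
    # Replace each text value with its analysis result as a categorical value.
    return [{**rec, text_attr: cat} for rec, cat in zip(records, analysis)]


def toy_anonymize(records):
    # Placeholder anonymization: drop the explicit identifier "name".
    return [{k: v for k, v in rec.items() if k != "name"} for rec in records]


sample = [
    {"name": "A", "symptoms": "Fever and cough"},
    {"name": "B", "symptoms": "Chest pain"},
]
print(anonymize_microdata(sample, "symptoms", toy_analyze, toy_replace, toy_anonymize))
```

A real system would substitute text clustering or text classification for the toy analysis and replacement stages, and an l-diversity or t-closeness procedure for the final stage.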

Specifically, the attribute replacement device 104 may replace the sensitive attribute values of text type in the microdata with sensitive attribute values of numerical type or categorical type based on a higher-level text analysis method such as text clustering or text classification. The anonymization device 106 may use the microdata anonymization methods described above, or any other method for anonymizing microdata that contains sensitive attribute values of numerical type or categorical type, to anonymize the microdata in which the sensitive attribute values of text type have been replaced. Examples of anonymization based on text clustering and anonymization based on text classification, as implemented by a microdata anonymization system according to an embodiment of the invention, are given below:

1. <Anonymization based on text clustering>

Text clustering is a text processing technique that uses the characteristics of the texts themselves to organize them automatically and without supervision. In general, text clustering first performs similarity analysis on the sensitive attribute values of text type to obtain a similarity measure between any two of them, and then uses these similarity measures to cluster the sensitive attribute values of text type automatically. The clustering result naturally divides all sensitive attribute values of text type into several groups. If these groups are numbered in sequence (for example, 0, 1, 2, ...) or given names (for example, unique category names, or unique identifiers composed of letters and/or digits), and each sensitive attribute value of text type is then replaced with the number or category name (or identifier) of the group to which it belongs, the sensitive attribute values in the same group receive the same replacement value. In this way the sensitive attribute values of text type in the microdata are converted into sensitive attribute values of numerical type or categorical type, and existing anonymization techniques such as l-diversity and t-closeness become applicable.
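The similarity, clustering and replacement steps described above can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: it uses Jaccard similarity on word sets and a greedy threshold ("leader") clustering, neither of which is prescribed by the patent, and all function names and sample texts are hypothetical.

```python
def jaccard(a, b):
    """Similarity of two texts as the overlap of their lower-cased word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def cluster_texts(texts, threshold):
    """Greedy clustering: a text joins the first group whose first member
    (its 'leader') is similar enough, otherwise it starts a new group."""
    leaders = []
    labels = []
    for text in texts:
        for group_id, leader in enumerate(leaders):
            if jaccard(text, leader) >= threshold:
                labels.append(group_id)
                break
        else:
            leaders.append(text)
            labels.append(len(leaders) - 1)
    return labels


def replace_text_attribute(records, attr, threshold=0.3):
    """Return copies of the records with the text value replaced by a group number."""
    labels = cluster_texts([rec[attr] for rec in records], threshold)
    return [{**rec, attr: label} for rec, label in zip(records, labels)]


records = [
    {"age": "2*", "symptoms": "chest pain and shortness of breath"},
    {"age": "2*", "symptoms": "severe chest pain shortness of breath"},
    {"age": "3*", "symptoms": "high fever and cough"},
    {"age": "3*", "symptoms": "fever cough sore throat"},
]
print(replace_text_attribute(records, "symptoms"))
```

After the replacement, the `symptoms` column holds small integers, so checks written for numerical or categorical sensitive attributes can be applied to it unchanged.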

However, clustering usually requires some clustering parameters to be set manually in advance. These parameters directly affect the clustering result and therefore the subsequent anonymization. An optional step can thus be added before text clustering is carried out: setting the clustering parameters according to the privacy protection requirements.

When anonymization based on text clustering is implemented, the functions of the devices in the microdata anonymization system according to an embodiment of the invention are as follows:

The text analysis device 102 performs similarity analysis on the sensitive attribute values of text type in the microdata to obtain a similarity measure between any two of these sensitive attribute values.

The attribute replacement device 104 clusters the sensitive attribute values of text type in the microdata according to the similarity measures between any two of them, and thereby replaces the sensitive attribute values of text type in the microdata with sensitive attribute values of numerical type or categorical type. Specifically, Fig. 3 shows a detailed block diagram of a microdata anonymization system according to an embodiment of the invention. As shown in Fig. 3, the attribute replacement device 104 comprises a parameter setting unit 1042, an attribute clustering unit 1044 and an attribute substitution unit 1046, whose functions are as follows. The parameter setting unit 1042 sets the clustering parameter according to the privacy protection requirements. The attribute clustering unit 1044 clusters the sensitive attribute values of text type in the microdata according to the similarity measures between any two of them and the clustering parameter that has been set, so as to divide them into a plurality of sensitive attribute value groups (for example, according to the similarity measures between the sensitive attribute values, the sensitive attribute values of text type are divided into as many groups as the clustering parameter specifies). The attribute substitution unit 1046 replaces the sensitive attribute values of text type that belong to the respective groups with different sensitive attribute values of numerical type or categorical type, so that the sensitive attribute values of text type in the same group are replaced with the same sensitive attribute value of numerical type or categorical type (in other words, the attribute values of text type in each sensitive attribute value group are replaced with an identical attribute value of numerical type or categorical type).

The anonymization device 106 anonymizes the microdata in which the sensitive attribute values of text type have been replaced with sensitive attribute values of numerical type or categorical type, based on an existing anonymization method such as a k-anonymity method (for example, the l-diversity method).

For example, when a k-anonymity-based method is used to anonymize the microdata in which the sensitive attribute values of text type have been replaced with sensitive attribute values of numerical type or categorical type, the record entries in the microdata are divided into a plurality of record groups such that either the diversity measure of the numerical or categorical sensitive attribute values contained in each record group meets a predetermined requirement (for example, is greater than a second predetermined value), or the distribution of the numerical or categorical sensitive attribute values contained in each record group differs little from the distribution of the sensitive attribute values over all record groups in the microdata (for example, the difference measure is not greater than a first predetermined value).

The following describes how the clustering parameter is set when the K-means clustering method and the l-diversity anonymization method are used.

The goal of K-means clustering is to partition the objects being clustered into K groups such that each object belongs to exactly one group, namely the group whose center (the mean of the group) is nearest to the object. The clustering parameter that K-means requires as input is the value K. The most basic l-diversity method requires the sensitive attribute values in each group of records (that is, each set of records with identical quasi-identifier values) to take at least l different values, so K must be set such that K ≥ l. Fig. 4 shows a schematic flowchart of the process of setting the clustering parameter K according to an embodiment of the invention.
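For readers unfamiliar with K-means, a minimal version of the procedure is sketched below. It assumes the text values have already been turned into numeric feature vectors (how that is done is not specified here), initializes the centers with the first K points for determinism, and is not meant as a production implementation.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=50):
    """Plain Lloyd iteration: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [list(p) for p in points[:k]]  # assumed deterministic initialization
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical 2-D feature vectors for four text values.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

When the result feeds a basic l-diversity check, the caller would choose `k` no smaller than l, since fewer than l clusters cannot yield l different replacement values.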

For entropy l-diversity, an extension of the basic l-diversity method, the clustering parameter K is set as follows. For any group of records E (that is, any set of records with identical quasi-identifier values), the more varied the sensitive attribute values it contains, the larger the entropy of the group with respect to the sensitive attribute. Entropy l-diversity therefore requires this entropy to be at least log l, that is,

\[ -\sum_{s \in S} p(E, s) \log p(E, s) \ge \log l, \]

where S is the set of all sensitive attribute values and p(E, s) is the proportion of records in E whose sensitive attribute value is s. For K, l is taken as the initial value and K-means clustering is performed. If

\[ -\sum_{x \in X} p(x) \log p(x) \ge \log l, \]

this K is used as the final clustering parameter; otherwise K is set to K + 1 and the process is repeated. Here X is the set of all clusters and p(x) is the proportion of all elements that fall in cluster x.

In other words, for the entropy l-diversity method, the diversity measure of the sensitive attribute values contained in each group of records must meet a predetermined requirement (for example, must be no less than a predetermined value), and setting the clustering parameter K comprises: setting the initial value of K to l; judging whether, when the sensitive attribute values of text type are divided into K groups, the following condition is met: \( -\sum_{x \in X} p(x) \log p(x) \ge \log l \), where X denotes the set of the K groups into which all sensitive attribute values of text type in the microdata are divided, and p(x) denotes the ratio of the number of sensitive attribute values of text type in group x to the number of all sensitive attribute values of text type in the microdata; and, if the condition is met, taking K as the final clustering parameter, otherwise setting K = K + 1 and repeating the judgment.
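The loop described above (start at K = l, cluster, test the entropy condition, otherwise increase K) can be sketched as follows. The clustering step is passed in as a function; `skewed_toy_clusters` is a deliberately unbalanced, hypothetical stand-in for K-means used only to make the loop run more than once, and the small tolerance is an assumption added to absorb floating-point rounding.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Entropy of the cluster-size distribution: -sum p(x) log p(x)."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def choose_k_entropy(items, l, cluster_fn, tol=1e-12):
    """Start with K = l and increase K until the clustering produced by
    cluster_fn(items, K) has entropy of at least log(l)."""
    k = l
    while k <= len(items):
        labels = cluster_fn(items, k)
        if cluster_entropy(labels) >= math.log(l) - tol:
            return k, labels
        k += 1
    raise ValueError("no K up to len(items) satisfies the entropy condition")


def skewed_toy_clusters(items, k):
    # Hypothetical stand-in for K-means: k - 1 singleton clusters,
    # with every remaining item placed in the last cluster.
    return [min(i, k - 1) for i in range(len(items))]


k, labels = choose_k_entropy(list(range(10)), l=2, cluster_fn=skewed_toy_clusters)
print(k, labels)
```

With a balanced clustering the loop would typically stop at K = l; the skewed stand-in shows the case where one dominant cluster forces a larger K.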

Another extension of the basic l-diversity method is recursive (c, l)-diversity. The core idea of this definition is to examine how many times each sensitive attribute value occurs in a group of records and to use two parameters, c and l, to ensure that the most frequent sensitive attribute value does not occur too often, that the least frequent sensitive attribute value does not occur too rarely, and that the occurrence counts of the sensitive attribute values in the group do not differ too much. The definition is as follows. For a record group E (that is, a set of records with identical quasi-identifier values) that contains m different sensitive attribute values, let \( r_i \) denote the number of times the i-th sensitive attribute value occurs in E, ordered so that \( r_1 \ge r_2 \ge \dots \ge r_m \). The data table satisfies recursive (c, l)-diversity if and only if \( r_1 < c \, (r_l + r_{l+1} + \dots + r_m) \) holds for every such record group, where c may be any number greater than 0 and, in general, 0 < c ≤ 1. For K, l is again taken as the initial value and K-means clustering is performed. Let \( t_j \) denote the number of elements in the j-th cluster, ordered so that \( t_1 \ge t_2 \ge \dots \ge t_K \). If \( t_1 < c \, (t_l + t_{l+1} + \dots + t_K) \), this K is used as the final clustering parameter; otherwise K is set to K + 1 and the process is repeated.

In other words, for the recursive (c, l)-diversity method, setting the clustering parameter K comprises: setting the initial value of K to l; judging whether, when the sensitive attribute values of text type are divided into K groups, the following condition is met: \( t_1 < c \, (t_l + t_{l+1} + \dots + t_K) \), where \( t_i \) denotes the number of sensitive attribute values of text type in the i-th group and \( t_1 \ge t_2 \ge \dots \ge t_K \); and, if the condition is met, taking K as the final clustering parameter, otherwise setting K = K + 1 and repeating the judgment.
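The recursive (c, l) condition reduces to a comparison on sorted counts. The sketch below checks it for one list of counts (occurrence counts in a record group, or cluster sizes when choosing K); the function name `satisfies_recursive_cl` is an illustrative assumption.

```python
def satisfies_recursive_cl(counts, c, l):
    """Check r_1 < c * (r_l + r_{l+1} + ... + r_m) for counts sorted
    in descending order (1-based indices as in the definition)."""
    r = sorted(counts, reverse=True)
    if len(r) < l:
        return False  # fewer than l distinct values or clusters
    return r[0] < c * sum(r[l - 1:])


# Cluster sizes 5, 3, 2 with l = 2: the condition compares 5 with c * (3 + 2).
print(satisfies_recursive_cl([5, 3, 2], c=1, l=2))
print(satisfies_recursive_cl([5, 3, 2], c=2, l=2))
```

In the K-selection loop, this function would play the same role as the entropy test: if it fails for the current clustering, K is increased by one and the clustering is repeated.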

It should be noted that, although the setting of the clustering parameter has been described above for the K-means clustering method, those skilled in the art will understand that it applies equally when other clustering methods are used; that is, the clustering parameter must satisfy the above requirements in other clustering methods as well.

2. <Anonymization based on text classification>

If a known classification system (taxonomy) is given in advance for the domain to which the sensitive attribute belongs (the taxonomy contains a number of concepts and the hierarchical relationships between them), text classification techniques can be used to map each sensitive attribute value of text type to the corresponding concept. The sensitive attribute of text type is thereby transformed into a sensitive attribute of categorical type whose value space is the given taxonomy. For example, the text-type sensitive attribute "diagnosis and treatment record" in medical data can be mapped to a taxonomy made up of disease names. The transformed sensitive attribute values of categorical type can then be anonymized using existing anonymization techniques for sensitive attribute values of numerical and categorical type (such as l-diversity or t-closeness).
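As a rough illustration of mapping free text onto taxonomy concepts, the sketch below uses a keyword-overlap classifier. This is a toy stand-in for a real text classification technique; the taxonomy, the keyword sets and the function name `classify` are all hypothetical.

```python
# Hypothetical flat taxonomy: concept name -> indicative keywords.
TAXONOMY_KEYWORDS = {
    "heart disease": {"chest", "palpitations", "angina"},
    "influenza": {"fever", "cough", "chills"},
}


def classify(text, taxonomy=TAXONOMY_KEYWORDS, default="other"):
    """Map a text value to exactly one concept: the one whose keywords
    overlap most with the words of the text, or a default if none match."""
    tokens = set(text.lower().replace(",", " ").split())
    best_concept, best_hits = default, 0
    for concept, keywords in taxonomy.items():
        hits = len(tokens & keywords)
        if hits > best_hits:
            best_concept, best_hits = concept, hits
    return best_concept


records = [
    {"age": "2*", "notes": "Patient reports chest pain and palpitations"},
    {"age": "3*", "notes": "High fever, cough and chills"},
]
replaced = [{**rec, "notes": classify(rec["notes"])} for rec in records]
print(replaced)
```

Each text value ends up associated with exactly one concept, so the replaced column behaves like an ordinary categorical sensitive attribute.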

When anonymization based on text classification is implemented, the functions of the devices in the microdata anonymization system according to an embodiment of the invention are as follows:

The text analysis device 102 performs the low-level text analysis that needs to be carried out before the sensitive attribute values of text type in the microdata are classified.

The attribute replacement device 104 associates, according to the text analysis result, the sensitive attribute values of text type in the microdata with the sensitive attribute values of categorical type in the taxonomy given in advance, and thereby replaces the sensitive attribute values of text type in the microdata with sensitive attribute values of categorical type. Specifically, Fig. 5 shows another detailed block diagram of a microdata anonymization system according to an embodiment of the invention. As shown in Fig. 5, the attribute replacement device 104 comprises an attribute association unit 1048 and an attribute substitution unit 1050, whose functions are as follows. The attribute association unit 1048 associates, according to the text analysis result, the sensitive attribute values of text type in the microdata with the sensitive attribute values of categorical type in the taxonomy given in advance, where each sensitive attribute value of text type is associated with exactly one sensitive attribute value of categorical type. The attribute substitution unit 1050 replaces each sensitive attribute value of text type in the microdata with the sensitive attribute value of categorical type associated with it. In other words, the attribute replacement device 104 replaces, according to the text analysis result and the taxonomy given in advance, the sensitive attribute values of text type with the corresponding sensitive attribute values of categorical type.

The anonymization processing device 106 anonymizes, based on an existing anonymization method such as the k-anonymity method, the microdata in which the text-type sensitive attribute values have been replaced with category-type sensitive attribute values.
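As a rough illustration of what a k-anonymity-based step checks, the Python sketch below tests whether every combination of quasi-identifier values is shared by at least k records, and coarsens an age attribute into intervals. The record layout, the attribute names, and the helper names `is_k_anonymous` and `generalize_age` are assumptions made for this example; the embodiment does not prescribe a particular generalization scheme.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if each quasi-identifier value combination occurs in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(n >= k for n in counts.values())


def generalize_age(records, width):
    """Coarsen exact ages into intervals of the given width, e.g. 34 -> '30-39'."""
    out = []
    for r in records:
        low = (r["age"] // width) * width
        out.append({**r, "age": f"{low}-{low + width - 1}"})
    return out
```

In this sketch the sensitive attribute is left untouched; only the quasi-identifiers are generalized until the records in each group become indistinguishable from one another on those attributes.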

The microdata anonymization method and system according to embodiments of the invention have been described in detail above with reference to the accompanying drawings. As noted earlier, the invention incorporates text analysis techniques into anonymization processing: text analysis is used to mine, in accordance with the privacy protection requirements, the semantic associations between the sensitive attribute values of different data records, and these semantic associations are then combined with anonymization techniques to prevent privacy leakage based on sensitive attributes. Compared with existing microdata anonymization techniques, the invention enables microdata containing text-type sensitive attribute values, once anonymized, both to resist privacy leakage based on sensitive attributes and to retain its use value. In addition, those skilled in the art will understand that the method and system described above are applicable not only to microdata and its sensitive attribute values, but also to anonymizing any data that contains privacy information of the individuals associated with its data records.

It should be made clear, however, that the invention is not limited to the specific configurations and processes described above and shown in the drawings. For brevity, detailed descriptions of known methods and techniques are omitted here. In the embodiments above, several specific steps are described and shown as examples. The method of the invention is not, however, limited to the specific steps described and shown; after grasping the spirit of the invention, those skilled in the art may make various changes, modifications, and additions, or change the order of the steps.

The functional blocks shown in the block diagrams described above may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, a block may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, and so on. When implemented in software, the elements of the invention are programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium, or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical discs, hard disks, fiber-optic media, radio frequency (RF) links, and so on. Code segments may be downloaded via a computer network such as the Internet or an intranet.

The invention may be implemented in other specific forms without departing from its spirit and essential characteristics. For example, the algorithms described in the specific embodiments may be modified while the system architecture remains within the essential spirit of the invention. The present embodiments are therefore to be regarded in all respects as illustrative rather than restrictive; the scope of the invention is defined by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and range of equivalents of the claims are accordingly included within the scope of the invention.

Claims

1. A data anonymization method, comprising:

performing text analysis on text-type attribute values in data;

replacing, according to the text analysis result, the text-type attribute values in the data with numeric-type or category-type attribute values; and

performing anonymization processing on the data in which the text-type attribute values have been replaced with the numeric-type or category-type attribute values.

2. The data anonymization method according to claim 1, characterized in that the attribute values are privacy information of the individuals associated with the data records in the data.

3. The data anonymization method according to claim 1 or 2, characterized in that replacing the text-type attribute values in the data with numeric-type or category-type attribute values comprises:

replacing, according to the text analysis result, the text-type attribute values with the corresponding category-type attribute values according to a predefined classification system.

4. The data anonymization method according to claim 1 or 2, characterized in that performing text analysis on the text-type attribute values comprises:

performing similarity analysis on the text-type attribute values to obtain similarity measurement values between the text-type attribute values,

and wherein replacing the text-type attribute values in the data with numeric-type or category-type attribute values comprises:

setting a cluster parameter according to the privacy protection requirement,

clustering the text-type attribute values according to the similarity measurement values between the text-type attribute values and the cluster parameter, so as to divide the text-type attribute values into a plurality of attribute value groups, and

replacing the text-type attribute values in each attribute value group with the same numeric-type or category-type sensitive attribute value.

5. The data anonymization method according to claim 3 or 4, characterized in that anonymization processing is performed, based on the k-anonymity method, on the data in which the text-type attribute values have been replaced with the numeric-type or category-type attribute values, wherein the data records in the data are divided into a plurality of record groups, and a metric of the difference between the distribution of the numeric-type or category-type attribute values in each record group and the overall distribution of the numeric-type or category-type attribute values of all data records in the data is not greater than a first predetermined value.

6. The data anonymization method according to claim 3 or 4, characterized in that anonymization processing is performed, based on the k-anonymity method, on the data in which the text-type attribute values have been replaced with the numeric-type or category-type attribute values, wherein the data records in the data are divided into a plurality of record groups, and a diversity metric of the numeric-type or category-type attribute values contained in each record group is greater than a second predetermined value.

7. The data anonymization method according to claim 6, characterized in that, when the cluster parameter represents the number of attribute value groups, the following condition must be satisfied: when the text-type attribute values are divided into a number of attribute value groups equal to the cluster parameter, each record group contains a number of distinct numeric-type or category-type attribute values that is at least a fourth predetermined value.

8. The data anonymization method according to claim 7, characterized in that, when the text-type attribute values are divided into a number of attribute value groups equal to the cluster parameter, the entropy of the numeric-type or category-type attribute values contained in each record group is greater than a third predetermined value.

9. The data anonymization method according to claim 7, characterized in that, when the text-type attribute values are divided into a number of attribute value groups equal to the cluster parameter, the numbers of text-type attribute values contained in the attribute value groups satisfy the following relation: $t_1 < c \times (t_l + t_{l+1} + \cdots + t_K)$, where $t_j$ denotes the number of text-type sensitive attribute values in the $j$-th attribute value group, $K$ denotes the cluster parameter, $0 < c \le 1$, and $t_1 \ge t_2 \ge \cdots \ge t_K$.

10. A data anonymization system, comprising:

a text analysis device configured to perform text analysis on text-type attribute values in data;

an attribute replacement device configured to replace, according to the text analysis result, the text-type attribute values in the data with numeric-type or category-type attribute values; and

an anonymization processing device configured to perform anonymization processing on the data in which the text-type attribute values have been replaced with the numeric-type or category-type attribute values.

11. The data anonymization system according to claim 10, characterized in that the attribute values are privacy information of the individuals associated with the data records in the data.

12. The data anonymization system according to claim 10 or 11, characterized in that the attribute replacement device is further configured to replace, according to the text analysis result, the text-type attribute values with the corresponding category-type attribute values according to a predefined classification system.

13. The data anonymization system according to claim 10 or 11, characterized in that

the text analysis device is further configured to perform similarity analysis on the text-type attribute values to obtain similarity measurement values between the text-type attribute values,

and wherein the attribute replacement device comprises:

a parameter setting unit configured to set a cluster parameter according to the privacy protection requirement,

an attribute clustering unit configured to cluster the text-type attribute values according to the similarity measurement values between the text-type attribute values and the cluster parameter, so as to divide the text-type attribute values into a plurality of attribute value groups, and

an attribute replacement unit configured to replace the text-type attribute values in each attribute value group with the same numeric-type or category-type attribute value.

14. The data anonymization system according to claim 12 or 13, characterized in that the anonymization processing device is further configured to perform anonymization processing, based on the k-anonymity method, on the data in which the text-type attribute values have been replaced with the numeric-type or category-type attribute values, wherein the data records in the data are divided into a plurality of record groups, and a metric of the difference between the distribution of the numeric-type or category-type attribute values in each record group and the overall distribution of the numeric-type or category-type attribute values of all data records in the data is not greater than a first predetermined value.

15. The data anonymization system according to claim 12 or 13, characterized in that the anonymization processing device is further configured to perform anonymization processing, based on the k-anonymity method, on the data in which the text-type attribute values have been replaced with the numeric-type or category-type attribute values, wherein the data records in the data are divided into a plurality of record groups, and a diversity metric of the numeric-type or category-type attribute values contained in each record group is greater than a second predetermined value.

16. The data anonymization system according to claim 15, characterized in that, when the cluster parameter represents the number of attribute value groups, the following condition must be satisfied: when the text-type attribute values are divided into a number of attribute value groups equal to the cluster parameter, each record group contains a number of distinct numeric-type or category-type attribute values that is at least a fourth predetermined value.

17. The data anonymization system according to claim 16, characterized in that, when the text-type attribute values are divided into a number of attribute value groups equal to the cluster parameter, the entropy of the numeric-type or category-type attribute values contained in each record group is greater than a third predetermined value.

18. The data anonymization system according to claim 16, characterized in that, when the text-type attribute values are divided into a number of attribute value groups equal to the cluster parameter, the numbers of text-type attribute values contained in the attribute value groups satisfy the following relation: $t_1 < c \times (t_l + t_{l+1} + \cdots + t_K)$, where $t_j$ denotes the number of text-type sensitive attribute values in the $j$-th attribute value group, $K$ denotes the cluster parameter, $0 < c \le 1$, and $t_1 \ge t_2 \ge \cdots \ge t_K$.
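
Illustrative note (not part of the claims): the entropy condition and the inequality on attribute value group sizes recited above can be sketched in Python as follows. This is a hedged example only. The use of the natural logarithm, the treatment of l as a 1-based index supplied by the caller (the claim text does not define it), and the function names `entropy` and `satisfies_group_size_relation` are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log, an assumption) of the values in one record group."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def satisfies_group_size_relation(group_sizes, c, l):
    """Check t_1 < c * (t_l + t_{l+1} + ... + t_K) for group sizes sorted descending.

    l is a 1-based index assumed to be supplied by the caller.
    """
    t = sorted(group_sizes, reverse=True)
    if not t:
        return False
    return t[0] < c * sum(t[l - 1:])
```

For example, with group sizes 5, 4, and 3, c = 1, and l = 2, the relation holds because 5 < 4 + 3; with sizes 10, 2, and 1 it fails, since the largest group dominates the others.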