CN106649262A

CN106649262A - Protection method for enterprise hardware facility sensitive information in social media

Info

Publication number: CN106649262A
Application number: CN201610971014.7A
Authority: CN
Inventors: 曾剑平; 崔战伟
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2016-10-31
Filing date: 2016-10-31
Publication date: 2017-05-10
Anticipated expiration: 2036-10-31
Also published as: CN106649262B

Abstract

The invention belongs to the technical field of privacy protection and provides a method for protecting sensitive information about enterprise hardware facilities in social media. First, a hardware infrastructure information base is established. Second, the hardware type described by a piece of social media text is determined by constructing a hardware classification model and a hardware type matching algorithm. Finally, keywords in the hardware description that may leak sensitive information are masked or replaced in a targeted way according to the hardware type obtained. Keywords can be handled differently according to their sensitivity levels, and the method is highly extensible.

Description

A method for protecting sensitive information about enterprise hardware facilities in social media

Technical field

The present invention relates to a method for protecting sensitive information about enterprise hardware facilities in social media, and belongs to the technical field of privacy protection.

Background

With the emergence of social media, from traditional microblogs and online forums to WeChat, Facebook, and Twitter, people have entered the social media era. The rapid rise of social media has accelerated the flow of information and made interpersonal communication increasingly convenient. Its widespread use, however, also brings security risks: social media users, intentionally or unintentionally, threaten the confidential and sensitive information of enterprises and institutions. If this information is obtained, aggregated, and exploited in bad faith by commercial organizations or criminals, it may lead to leaks of personal or institutional privacy [1]. Mobile device users can easily obtain their own location and related service information through location-based services. Although location-based services offer users great convenience, they must first obtain the mobile user's location before they can provide the corresponding service, and a location-based service system cannot guarantee that the server will not leak or misuse that location. Location-based services therefore pose a major challenge to protecting users' location privacy [2]. In addition, with the rise of big data technology in recent years, privacy protection techniques based on big data have multiplied. Overall, though, research at home and abroad on big data security and privacy protection remains insufficient; only by combining technical means with relevant policies and regulations can the problem be better solved [3].

With the widespread use of the Internet, research at home and abroad on privacy protection and trade secret protection is also growing. The main research directions include general privacy protection techniques, privacy protection for data mining, privacy-preserving data publishing principles, and privacy protection algorithms. General privacy protection techniques aim to protect data privacy at a lower application level, usually by introducing statistical and probabilistic models. Privacy protection for data mining addresses how to protect privacy in higher-level data applications according to the characteristics of different data mining operations. Privacy-preserving data publishing principles aim to provide a privacy protection method common to all kinds of applications, so that privacy protection algorithms designed on this basis are also general. As an emerging research focus, privacy protection technology is of considerable value in both theoretical research and practical application [4].

Traditional sensitive information protection relies mainly on filtering by keyword matching. This approach ignores the semantic context, so its accuracy is low; it is hard to defend against deliberate human evasion; and it requires maintaining large keyword dictionaries at a high labor cost. Emerging protection methods include those based on natural language processing and artificial intelligence, but these techniques are still at the research stage and cannot yet meet practical requirements for filtering accuracy.

Summary of the invention

Rather than studying sensitive information protection from a macroscopic perspective, the present invention focuses on one specific aspect of privacy and trade secret protection, namely the protection of enterprise hardware information in social media, and provides a corresponding information protection method.

As noted above, social media users may leak private information when posting their views. Similarly, when internal staff post on social media such as microblogs or forums, they may leak sensitive information such as the models and configurations of the enterprise's internal hardware.

To solve this technical problem, the present invention proposes a new approach that combines text classification with semantic replacement to protect information. The basic idea is first to determine, by classification, the hardware category and model described by the publisher, then to look up all attribute information of that hardware model in a previously built hardware information base, and finally to mask or replace keywords in the published hardware description according to the keywords in that attribute information. The main innovations of the invention are the construction of the hardware information base, the design of the hardware classification model and the hardware model matching algorithm, and the replacement method for sensitive keywords.

The technical solution is described in detail below.

The present invention provides a method for protecting sensitive information about enterprise hardware facilities in social media, comprising the following steps:

Step 1: build the models

(1) Build the hardware information base

Obtain hardware information, extract multi-level information including hardware category, vendor and model, attributes, and attribute values, organize it into an XML hierarchy, and build the hardware information base;

(2) Perform Chinese word segmentation on the hardware descriptions in the hardware information base

(3) Build the hardware classification model and the hardware model matching algorithm

After the hardware descriptions in the hardware information base have been segmented, the feature information of the hardware categories is extracted first; then, on the basis of the category classification, the feature information of the vendors is extracted and a vendor classification model is built; finally, using the category and vendor classification results, the hardware model matching algorithm is built to determine the model of the hardware;

(4) Build the keyword masking and replacement model

For each hardware category, the attribute keywords that occur in hardware descriptions are assigned sensitivity levels, and keywords at different levels are handled differently; together these form the keyword masking and replacement model. The sensitivity levels are 0, 1, 2, 3, and 4. Keywords at level 0 are left unchanged, keywords at level 4 are masked directly with asterisks, and keywords at levels 1, 2, and 3 are processed through a keyword semantic tree. The keyword semantic tree is built from the keywords at the different levels of the hardware information base according to their XML structure. The tree has four layers, and the replacement policy based on it is as follows:

A keyword at sensitivity level 1 is replaced with its parent node; a keyword at level 2 is replaced with the parent of its parent node; a keyword at level 3 is replaced directly with the root node;

Step 2: detection and protection

After the input social media content has been segmented, the hardware classification model and the hardware model matching algorithm from Step 1 determine the category, vendor, and model it belongs to. Once the model is determined, the keyword masking and replacement model built in Step 1 is applied to the attribute keywords in the segmented content: each keyword is masked, replaced, or left unchanged according to its sensitivity level and the corresponding handling.

In the present invention, the hardware classification model classifies hardware categories and hardware vendors by means of feature selection algorithms and classification algorithms.

In the present invention, when hardware categories are classified, the feature selection algorithm uses an improved information gain method. The specific formula is as follows:

where t is a feature, c is a class, k is the number of classes, and dis(t) is the between-class distribution of feature t, defined as the ratio of the number of samples in which t occurs to the total number of samples. P(t) is the probability that the feature occurs, P(c) is the probability that the class occurs, and P(c, t) is the probability that the feature and the class occur together. The remaining two terms are the probability that the feature does not occur and the probability that a sample belongs to class c when the feature does not occur.

The classification algorithm uses an improved KNN method, whose distance formula is as follows:

where x is an unclassified sample and y is a classified sample; both are n-dimensional vectors, x = (x₁, x₂, …, x_n) and y = (y₁, y₂, …, y_n), in which each dimension represents one feature value. IG′(t_i) is the information gain value of the i-th feature t_i, d(x, y) is the distance between x and y, and x_i and y_i are the i-th feature values of the samples.

In the present invention, when hardware vendors are classified, features are selected by a feature similarity method, that is, according to the similarity between classes on each feature. Let the p classes be c₁, c₂, …, c_p. The similarity of these p classes on feature t_i is defined as the mean of the similarities on t_i between any two of the classes, i.e.:

If this similarity is greater than or equal to a threshold, feature t_i is considered too similar across the p classes and unsuitable as a classification feature; otherwise it can be used as a classification feature;

The classification algorithm uses an improved KNN method in which the reciprocal of the similarity serves as the feature weight in the KNN calculation. The specific KNN distance formula is as follows:

where c_i is the i-th class, p is the total number of classes, t_i is the i-th feature, n is the total number of features, and x = (x₁, x₂, …, x_n) and y = (y₁, y₂, …, y_n) are the unclassified sample and the classified sample respectively, with the n feature values x_i and y_i.

In the present invention, the hardware model matching algorithm is based on sets of hardware models: hardware models that share the same attribute value are placed in one set. Determining the attribute values of the hardware to be matched on several attributes identifies the model sets it belongs to, and the intersection of these sets gives the model of the hardware.

In the present invention, the leaf nodes at the bottom of the keyword semantic tree are the sub-feature words of the innermost attribute keywords of the XML structure in the hardware information base; the second-to-last layer of the tree corresponds to the innermost attribute keywords of the XML structure; the third-to-last layer holds the second-layer attribute keywords of the XML structure; and the fourth layer is the root node, which is the name of the hardware category.

Compared with the prior art, the present invention has the following substantive features and notable advantages:

(1) It can detect content that may leak sensitive enterprise hardware information when social media content is published, and it provides fine-grained content control. Compared with existing coarse-grained methods, which can only control a whole piece of content, this is an advance, and it preserves as far as possible the basic need to share content on social media.

(2) It provides classification and matching at three levels (category, vendor, and model), which makes full use of information such as category-specific vocabulary and attributes, improves detection recall, and avoids leaks of sensitive hardware information. It also narrows the search range during matching: only the information base of the same vendor needs to be searched, which improves matching efficiency.

(3) It proposes new ideas and implementations for the structure of the hardware information base, feature selection, classifier construction, and protection: it designs the XML structure, improves the information gain calculation, designs a feature selection method based on the similarity of vendor class features, constructs the keyword semantic tree, and gives a concrete protection policy.

Brief description of the drawings

Fig. 1 is the overall flow chart of the invention.

Fig. 2 is a schematic flow diagram of hardware vendor classification.

Fig. 3 is a schematic flow diagram of the hardware model matching method.

Fig. 4 is a flow chart of the keyword masking and replacement method.

Fig. 5 is a diagram of the hardware information base (XML structure).

Fig. 6 shows the correspondence between the keywords in each layer of the semantic tree and the keywords in each layer of the XML in the embodiment.

Fig. 7 shows a sample of the final semantic tree built in the embodiment.

Detailed description of the embodiments

The technical solution is described in detail below with reference to the drawings and an embodiment.

The overall flow of the invention is shown in Fig. 1. It comprises the model building flow on the left of Fig. 1 and the detection and protection flow on the right; the results of three stages of the model building flow provide the necessary basic data for the detection and protection flow.

The main work of the invention comprises:

(1) building the hardware information base;

(2) performing Chinese word segmentation on the hardware descriptions;

(3) building the hardware classification model and the hardware model matching algorithm;

(4) building the keyword masking and replacement method.

The key techniques involved in this process are explained in turn below.

1. Building the hardware information base

In the embodiment, a web crawler was designed for a large computer website and automatically crawled hardware information for tens of thousands of models in 36 categories, including mobile phones, notebooks, switches, and routers. The hardware information is organized into XML files, in which each tag represents an attribute of the hardware and the text content of the tag represents the attribute value. Using the structural descriptive power of XML, a tree-shaped hardware information base is constructed. This base is the basic information source for the subsequent processing. The hardware information base (XML structure) is shown in Fig. 5.
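As an illustration only, the following Python sketch builds one record of such a base with the standard library's `xml.etree.ElementTree`. The tag names (`model`, `basic`, `hardware`, `ports`, and so on) and all values are hypothetical placeholders; the actual layout is the one shown in Fig. 5.

```python
import xml.etree.ElementTree as ET

def build_record(model_name, attributes):
    """Build one XML record: model -> attribute group -> attribute -> value text."""
    root = ET.Element("model", {"name": model_name})
    for group, attrs in attributes.items():
        group_el = ET.SubElement(root, group)
        for attr, value in attrs.items():
            # Each tag is an attribute; its text is the attribute value.
            ET.SubElement(group_el, attr).text = value
    return root

# Hypothetical example data, not taken from the patent.
record = build_record(
    "ExampleSwitch-X1",
    {
        "basic": {"vendor": "ExampleVendor", "category": "switch"},
        "hardware": {"ports": "24", "dimensions": "440x220x44"},
    },
)
xml_text = ET.tostring(record, encoding="unicode")
```

Nesting attribute groups inside the model element mirrors the layered structure that the semantic tree is later derived from.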

2. Chinese word segmentation of the hardware information

Although the hardware information for all models was obtained in step 1, it cannot be processed by computer directly. It must first be segmented into Chinese words, with auxiliary words removed and keywords extracted; the extracted keywords are then used for the subsequent classification and other processing. Any common word segmentation method can be used for this step, for example ICTCLAS, the Chinese lexical analysis system based on a hierarchical HMM developed by the Institute of Computing Technology of the Chinese Academy of Sciences, which supports user dictionaries and multiple encoding formats and achieves a segmentation accuracy of up to 97.5%.

3. Building the hardware classification model and the hardware model matching algorithm

On the basis of word segmentation, the invention determines the hardware model described by a hardware description by building a classification model and a hardware model matching algorithm. The hardware classification model comprises two sub-classification processes, the classification of hardware categories and the classification of hardware vendors, the latter being carried out on the basis of the former. These two steps determine the category and vendor of the hardware, and the hardware model matching process then determines its model. The basic ideas of these three processes are described below.

(1) Classification of hardware categories

The classification of hardware categories draws on the KNN method from text classification: feature selection first picks the feature words that contribute most to classification, and a classification algorithm then classifies the hardware. The feature selection and classification algorithms draw on the information gain method and the KNN method respectively, but both are improved to suit the characteristics of the hardware information base, which helps improve classification accuracy.

The traditional information gain method considers only how the presence or absence of a feature word affects the global information entropy, not how frequently the feature word occurs within and between classes. The invention improves the traditional method by taking the between-class frequency of feature words into account, which improves feature selection.

The improved information gain is computed as follows:

where dis(t) is the between-class distribution of feature t, defined as the ratio of the number of samples in which t occurs to the total number of samples. The adjustment coefficient is chosen for two reasons. First, it is a decreasing function of dis(t): when the between-class distribution value of feature t is very small, the coefficient is relatively large, which is exactly what is required. Second, it balances the weight between the traditional information gain value IG(t) and the between-class distribution value dis(t), so that the result does not rely excessively on either one.
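The two ingredients named above can be sketched in Python as follows. This is a minimal sketch that computes only the traditional information gain IG(t) and the ratio dis(t) as defined in the text; the adjustment coefficient that combines them is not recoverable from this text and is deliberately left out. The sample data and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(samples, feature):
    """Traditional IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t).

    samples: list of (set_of_keywords, class_label) pairs.
    """
    labels = [c for _, c in samples]
    with_t = [c for words, c in samples if feature in words]
    without_t = [c for words, c in samples if feature not in words]
    p_t = len(with_t) / len(samples)
    return entropy(labels) - p_t * entropy(with_t) - (1 - p_t) * entropy(without_t)

def dis(samples, feature):
    """dis(t): share of all samples in which feature t occurs."""
    return sum(1 for words, _ in samples if feature in words) / len(samples)

# Hypothetical segmented descriptions with their hardware categories.
samples = [
    ({"port", "gigabit"}, "switch"),
    ({"port", "vlan"}, "switch"),
    ({"screen", "camera"}, "phone"),
    ({"screen", "battery"}, "phone"),
]
ig_port = information_gain(samples, "port")
ig_gigabit = information_gain(samples, "gigabit")
```

In this toy data "port" separates the two categories perfectly, so it receives a higher gain than "gigabit", which occurs in only one sample.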

Similarly, the invention improves the traditional KNN algorithm. The improvement recognizes that different features affect classification differently, and uses the information gain values from feature selection as the weights in the KNN algorithm. The information gain value of a feature represents how much the feature affects the information entropy: the larger the value, the greater the feature's effect on the classification result. Using the information gain value of a feature directly as its weight in the KNN algorithm therefore reflects how much features with different information gain values contribute to classification. The distance formula of the improved KNN algorithm is given below.

where x is an unclassified sample and y is a classified sample; both are n-dimensional vectors, x = (x₁, x₂, …, x_n) and y = (y₁, y₂, …, y_n), in which each dimension represents one feature value. IG′(t_i) is the information gain value of the i-th feature t_i.
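A feature-weighted KNN of this kind can be sketched as follows. The exact distance formula is not reproduced in this text, so the sketch assumes a weighted Euclidean distance in which each squared difference is multiplied by the feature's weight; that form, the function names, and the sample data are assumptions made for illustration.

```python
import math
from collections import Counter

def weighted_distance(x, y, weights):
    """Assumed form: weighted Euclidean distance, weights[i] scales feature i."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def knn_classify(x, labelled, weights, k=3):
    """Return the majority label among the k labelled samples nearest to x."""
    nearest = sorted(labelled, key=lambda item: weighted_distance(x, item[0], weights))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical two-feature samples; the first feature carries most of the weight.
labelled = [
    ((1.0, 0.0), "switch"),
    ((1.0, 1.0), "switch"),
    ((0.0, 1.0), "phone"),
    ((0.0, 0.0), "phone"),
]
weights = (1.0, 0.1)  # stand-ins for per-feature information gain values
predicted = knn_classify((0.9, 1.0), labelled, weights, k=3)
```

Because the second feature has a small weight, differences along it barely move the distance, and the query is assigned to the category that agrees with it on the first feature.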

(2) Classification of hardware vendors

After the hardware category has been classified, vendor classification determines which vendor within that category the hardware belongs to. As before, this step requires feature selection and a suitable classification algorithm.

The feature selection algorithm of the invention is based on feature similarity. For each feature, the similarity of that feature between the different vendor classes is examined. If the similarity is greater than or equal to a certain threshold, the feature is considered too similar across vendors and unsuitable as a classification feature; otherwise it can be used as a classification feature. This part again uses the improved KNN classification algorithm, except that the feature weight is changed to the reciprocal of the feature similarity, as described below.

In the hardware information base, each hardware feature may contain several sub-features. For example, the value of the feature "external dimensions" consists of three values: length, width, and height. Here length, width, and height are the three sub-features of "external dimensions". Suppose feature t_i consists of n sub-features, i.e. t_i = (t_i1, t_i2, …, t_in). One sample then has a vector of n sub-feature values on feature t_i, and another sample has a second such vector. The similarity between the two vectors is defined as:

The similarity between two feature values is defined as the cosine of the angle between their vectors. Because the features under examination may contain different numbers of sub-features, that is, different dimensions, this definition makes it possible to disregard the dimension of the vectors and focus on the angle between them: when the two vectors, that is, the two feature values, are similar, the cosine of the angle is large; otherwise it is small.

With the similarity of a single feature value defined, the similarity between two classes on a feature is given next. Each class may contain multiple samples. Suppose two classes c₁ and c₂ contain m₁ and m₂ samples respectively; their similarity on feature t_i is then computed as follows:

As the formula shows, the similarity of two classes on feature t_i is simply the mean of the similarities on t_i over all sample pairs from the two classes, so that the similarity on t_i of every pair of samples between the two classes is taken into account.

On the basis of the similarity between two classes on feature t_i, the similarity among p classes on feature t_i is defined next. Let the p classes be c₁, c₂, …, c_p. Their similarity on feature t_i is defined as the mean of the similarities on t_i between any two of the classes, i.e.:

If the similarity of these p classes on feature t_i is greater than or equal to a threshold δ, feature t_i is considered too similar across the p classes and unsuitable as a classification feature; otherwise it can be used as a classification feature.
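The three similarity definitions described in words above can be written out as follows. This is a reconstruction from the prose, not the original formulas; the symbols v, w, sim, and Sim are notation introduced here for illustration, with v^(a) and w^(b) denoting the sub-feature value vectors on t_i of the a-th sample of c₁ and the b-th sample of c₂.

```latex
% Cosine similarity between two sub-feature value vectors v and w on feature t_i
\mathrm{sim}(\mathbf{v},\mathbf{w}) =
  \frac{\sum_{j=1}^{n} v_j w_j}
       {\sqrt{\sum_{j=1}^{n} v_j^{2}}\;\sqrt{\sum_{j=1}^{n} w_j^{2}}}

% Similarity of two classes on t_i: mean over all cross-class sample pairs
\mathrm{Sim}(c_1,c_2;t_i) =
  \frac{1}{m_1 m_2}\sum_{a=1}^{m_1}\sum_{b=1}^{m_2}
  \mathrm{sim}\bigl(\mathbf{v}^{(a)},\mathbf{w}^{(b)}\bigr)

% Similarity of p classes on t_i: mean over all pairs of classes
\mathrm{Sim}(c_1,\dots,c_p;t_i) =
  \frac{2}{p(p-1)}\sum_{1\le j<k\le p}\mathrm{Sim}(c_j,c_k;t_i)

% Feature t_i is rejected as a classification feature when
\mathrm{Sim}(c_1,\dots,c_p;t_i) \ge \delta
```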

This step again uses the improved KNN algorithm for classification, except that the feature weight changes: it is no longer the information gain value but the reciprocal of the feature's similarity. The reciprocal is chosen for the following reason. Feature similarity expresses how similar different classes are on a feature. Features with higher similarity contribute little to classification and should receive smaller weights, whereas features with lower similarity contribute more and should receive larger weights. It is therefore reasonable to use the reciprocal of the similarity as the feature weight in the KNN calculation. The specific KNN distance formula is as follows:

The vendor classification process is as follows; Fig. 2 shows the corresponding flow chart.

1) Select samples of the different vendors under a given category from the hardware information base;

2) For each feature, compute its feature similarity between the different vendors;

3) If the feature similarity is below a certain threshold, use the feature as a classification feature; otherwise return to 2) and select the next feature to continue computing feature similarity;

4) Classify using the selected features and the improved KNN algorithm to obtain the corresponding vendor class.
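Steps 1) to 3) above can be sketched in Python as follows. This is an illustrative sketch under the definitions given in the text (cosine similarity, mean over cross-class sample pairs, mean over class pairs); the feature names, sample values, and threshold are hypothetical.

```python
import math
from itertools import combinations

def cosine(v, w):
    """Cosine of the angle between two sub-feature value vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0

def class_pair_similarity(class_a, class_b):
    """Mean cosine similarity over all cross-class sample pairs."""
    total = sum(cosine(v, w) for v in class_a for w in class_b)
    return total / (len(class_a) * len(class_b))

def multi_class_similarity(classes):
    """Mean of the pairwise class similarities."""
    pairs = list(combinations(classes, 2))
    return sum(class_pair_similarity(a, b) for a, b in pairs) / len(pairs)

def select_features(feature_values, delta):
    """Keep features whose similarity across vendor classes is below delta.

    feature_values: {feature: [samples_of_vendor_1, samples_of_vendor_2, ...]}
    """
    return [f for f, classes in feature_values.items()
            if multi_class_similarity(classes) < delta]

# Hypothetical data: two vendors, two features.
feature_values = {
    "dimensions": [[(440, 220, 44), (442, 220, 44)], [(440, 221, 44)]],
    "ports": [[(24, 0)], [(0, 48)]],
}
selected = select_features(feature_values, delta=0.9)
```

Here "dimensions" is nearly identical across the two vendors and is rejected, while "ports" differs sharply and is kept as a classification feature.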

(3) Matching of hardware models

Once the category of the hardware and the vendor within that category have been determined, the invention determines the model of the hardware under that vendor by means of a hardware model matching algorithm. The algorithm is based on sets of hardware models: hardware models that share the same attribute value are placed in one set. To determine the model of a piece of hardware, it is enough to determine its values on several attributes; this identifies the model sets it belongs to, and the intersection of these sets gives the model of the hardware. Compared with comparing hardware models one by one, this method is far more efficient and greatly reduces the number of comparisons.

Hardware model matching does not compare against every product one by one; instead a new algorithm is used to make the comparison more efficient. Specifically, suppose the products in the category have n attributes (t₁, t₂, …, t_n), and each attribute t_i comprises a_i sub-features. Among the products made by the vendor, those that are identical on attribute t_i are grouped into one set. Because a product of a given model may be identical to other products on more than one attribute, it may appear in several sets; in other words, the sets may intersect one another.

Suppose p attributes appear in the description of the hardware, each with its own characteristic value. The hardware model matching algorithm is then described as follows:

1) Place the hardware models that have the same attribute value on attribute t_i in the same set;

2) Let i = 1 and C = Ω, where Ω denotes the universal set;

3) Find the set whose attribute value matches the value of the i-th attribute in the description;

4) Replace C with the intersection of C and the set found in 3);

5) If C contains only one element or i > p, go to 6); otherwise let i = i + 1 and return to 3);

6) Return set C, which is the final result of the hardware model comparison.

Fig. 3 shows the detailed flow chart of the hardware model matching method. The main steps are as follows.

1) For each attribute, build the sets of hardware models that share the same attribute value;

2) Take an attribute, examine the value of the hardware on that attribute, and obtain the set of hardware models corresponding to that value;

3) Intersect this set with the set of hardware models obtained so far. If the intersection contains only one element, or all attributes have been used, stop; the elements of the intersection are the model of the hardware. Otherwise return to 2);
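The set-intersection matching described above can be sketched in Python as follows. The catalog, attribute names, and function names are hypothetical; the sketch also stops early when the candidate set becomes empty, a case the text does not discuss.

```python
def build_index(catalog):
    """Map (attribute, value) -> set of models that share that value."""
    index = {}
    for model, attrs in catalog.items():
        for attr, value in attrs.items():
            index.setdefault((attr, value), set()).add(model)
    return index

def match_model(index, all_models, observed):
    """Intersect the model sets for each observed attribute value."""
    candidates = set(all_models)  # start from the universal set
    for attr, value in observed.items():
        candidates &= index.get((attr, value), set())
        if len(candidates) <= 1:  # unique match found, or nothing matches
            break
    return candidates

# Hypothetical catalog for one vendor within one category.
catalog = {
    "X1": {"ports": "24", "speed": "1G"},
    "X2": {"ports": "24", "speed": "10G"},
    "X3": {"ports": "48", "speed": "10G"},
}
index = build_index(catalog)
matched = match_model(index, catalog, {"ports": "24", "speed": "10G"})
```

Building the index once per vendor means each lookup touches only the sets for the attributes actually mentioned in the description, rather than every model in turn.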

4. Building the keyword masking and replacement model

By designing a keyword masking and replacement model, the invention masks or replaces keywords in a hardware description that may leak sensitive hardware information. Different keywords are assigned different sensitivity levels, and keywords at different levels are handled differently.

(1) Assigning keyword sensitivity levels

For each hardware category, five sensitivity levels are defined in advance for all attribute value keywords, denoted 0, 1, 2, 3, and 4 in order of increasing sensitivity; see Table 1.

Table 1. Sensitivity levels

Sensitivity level	0	1	2	3	4
Meaning	Not sensitive	Slightly sensitive	Moderately sensitive	Fairly sensitive	Highly sensitive
Handling	None	Replace	Replace	Replace	Mask

Different processing modes are taken to the other keyword of different sensitivity levels.Wherein, for the keyword that sensitive rank is 0 Do not deal with, it is logical for the keyword that sensitive rank is 1,2,3 for the keyword that sensitive rank is 4 is directly shielded with asterisk The mode for crossing structure semantic tree is processed.

(2) construction of keywords semantics tree

The keyword that sensitive rank is 1,2,3 is replaced by way of building semantic tree.Semantic tree leaf node It is semantic keyword most specifically, with the rising of node level, semanteme is gradually obscured, root node is semantic most fuzzy section Point.For hardware description information, its semantic tree is a total of 4 layers, and the replacement policy based on semantic tree is as follows：

For the keyword that sensitive rank is 1, it is replaced using its father node；For the keyword that sensitive rank is 2, It is replaced using the father node of its father node；For the keyword that sensitive rank is 3 is directly replaced using root node.

In the hardware information base, the XML document of each hardware model is a hierarchical structure, and attribute keywords in upper layers are semantically vaguer than those in lower layers, so the XML document can be used to build the keyword semantic tree.

The present invention builds the semantic tree as follows. The leaf nodes at the bottom layer are the sub-feature words of the innermost-layer attribute keywords. The second-to-last layer of the semantic tree corresponds to the innermost-layer attribute keywords of the XML structure in the hardware information base, which are semantically vaguer than their respective sub-feature words. The third-to-last layer of the semantic tree consists of the second-layer attribute keywords of the XML structure. Because the first layer of the XML document is the specific model of the hardware, which is highly sensitive information, the fourth-to-last layer of the semantic tree does not correspond to the first layer of the XML document; instead, the name of the hardware category, which is semantically vaguer than the third-to-last layer, is taken as the keyword of this layer. Since the fourth-to-last layer has already risen to the name of the hardware category, it is also the first layer of the whole semantic tree, i.e. the root node. Fig. 6 shows the correspondence between the keywords at each layer of the semantic tree and those at each layer of the XML document. Fig. 7 shows a sample of the final semantic tree, in which "second-layer attribute keywords" and "third-layer attribute keywords" refer to the second-layer and third-layer attribute keywords of the XML document.
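The construction can be sketched in Python with the standard `xml.etree.ElementTree` module. The XML layout below (a `model` root element, second-layer attribute elements, innermost attribute elements whose text is the sub-feature word) is an assumed layout for illustration only, as are the element names and the function name `build_parent_map`; the actual schema of the hardware information base may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document for one hardware model.
SAMPLE_XML = """
<model name="X100">
  <network>
    <network_type>China Mobile 4G</network_type>
  </network>
  <screen>
    <size>5.5 inch</size>
  </screen>
</model>
"""

def build_parent_map(xml_text, category):
    """Build a child -> parent map for the 4-layer semantic tree.

    The first XML layer (the specific model) is deliberately skipped;
    the hardware category name is used as the root instead.
    """
    model = ET.fromstring(xml_text)
    parent = {}
    for second in model:                      # second-layer attribute keywords
        parent[second.tag] = category         # their parent is the root
        for inner in second:                  # innermost-layer attribute keywords
            parent[inner.tag] = second.tag
            value = (inner.text or "").strip()
            if value:                         # sub-feature word becomes a leaf
                parent[value] = inner.tag
    return parent

parent = build_parent_map(SAMPLE_XML, "mobile phone")
print(parent["China Mobile 4G"], "->", parent["network_type"], "->", parent["network"])
```

A plain dictionary is enough here because the replacement policy only ever walks upward. If two attributes shared the same sub-feature word, the keys would need to be qualified, which this sketch does not handle.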

Application example

Because little information about enterprise IT hardware facilities is currently available in Internet social media, it is difficult to collect. For case verification, partial information from 5000 hardware descriptions was therefore first extracted from the hardware information base and organized into text documents, one text document per description. The keyword samples obtained after word segmentation (with some keywords randomly deleted) are consistent with what is obtained after processing content from social media, so the processed data can approximately simulate hardware description samples in social media.

From each category, 60 samples are selected at random as training samples, giving 2160 training samples in total; the remaining 40 samples of each category are used as samples to be classified, giving 1440 test samples in total. The relationship between classification performance and the value of k is shown in Table 2.

Table 2 Correct classification rate and mean F₁ for hardware categories under different values of k

Parameter k	1	5	10	15	20	25	30
Correct classification rate	80.1%	72.8%	69.3%	67.3%	65.7%	63.8%	60%
Mean F₁	0.805	0.734	0.706	0.689	0.676	0.663	0.639

For manufacturer classification, the "mobile phone" hardware category is taken as an example, and eight mobile phone manufacturers are selected: Samsung, Apple, Huawei, OPPO, vivo, Meizu, Lenovo and Coolpad. The proportion of correctly classified samples and the mean F₁ are tested under different values of k; the results are shown in Table 3.

Table 3 Correct classification rate and mean F₁ for manufacturers under different values of k

Parameter k	1	5	10	15	20	25	30	35
Correct classification rate	42.4%	36.0%	34.7%	35.6%	31.8%	35.6%	33.5%	31.4%
Mean F₁	0.422	0.350	0.339	0.328	0.295	0.319	0.299	0.281

200 texts in the mobile phone category are selected at random, and each sub-feature value is processed according to the sensitivity level of its corresponding sub-feature word. The final statistics are shown in Table 4.

Table 4 Masking and replacement performance for selected keywords

Sub-feature word	All-network access	China Mobile 4G	China Unicom 4G	China Telecom 4G	Laterally
Number of sub-feature words	20	89	76	41	138
Number correctly processed	20	89	76	41	138
Accuracy	100%	100%	100%	100%	100%

Bibliography

[1] Guo Qing. User information privacy and protection in social media use [J]. China Information Security, 2014, (7): 90-93.

[2] Wei Qiong, Lu Yansheng. Research progress on location privacy protection [J]. Computer Science, 2008, 35(9): 21-25.

[3] Feng Dengguo, Zhang Min, Li Hao. Big data security and privacy protection [J]. Chinese Journal of Computers, 2014, 37(1): 246-258.

[4] Zhou Shuigeng, Li Feng, Tao Yufei, Xiao Xiaokui. A survey of privacy protection for database applications [J]. Chinese Journal of Computers, 2009, 32(5): 847-861.

Claims

1. A method for protecting sensitive information about enterprise hardware facilities in social media, characterized by comprising the following steps:

Step 1: build the models

(1) Build the hardware information base

Obtain hardware information, extract information at multiple levels, including hardware category, manufacturer and model, attributes and attribute values, organize it into an XML hierarchy, and build the hardware information base;

(2) Perform Chinese word segmentation on the hardware description information in the hardware information base;

After the hardware description information in the hardware information base has been segmented, first extract the feature information of the hardware categories; then, on the basis of the category classification, extract the feature information of the manufacturers and build the manufacturer classification model; finally, using the category and manufacturer classification information, build the hardware model matching algorithm to determine the model of the hardware;

(4) Build the keyword masking-and-replacement model

For each hardware category, divide the attribute keywords that appear in hardware description information into sensitivity levels, handle keywords of different sensitivity levels in different ways, and build the keyword masking-and-replacement model; the sensitivity levels are 0, 1, 2, 3 and 4; keywords with sensitivity level 0 are left unprocessed, keywords with sensitivity level 4 are masked directly, and keywords with sensitivity level 1, 2 or 3 are processed through the keyword semantic tree; the keyword semantic tree is built from the keywords at different levels of the hardware information base according to the XML structure; the keyword semantic tree has four layers, and the replacement policy based on it is as follows:

A keyword with sensitivity level 1 is replaced by its parent node; a keyword with sensitivity level 2 is replaced by the parent of its parent node; a keyword with sensitivity level 3 is replaced directly by the root node;

Step 2: detection and protection

After word segmentation of the input social media content, determine the category, manufacturer and model to which it belongs using the hardware classification model and the hardware model matching algorithm from Step 1; once the model is determined, apply the keyword masking-and-replacement model built in Step 1 to the attribute keywords in the segmented social media content, performing the action that corresponds to each keyword's sensitivity level and handling mode, namely masking, replacing, or no processing.

2. The sensitive information protection method according to claim 1, characterized in that, in the hardware classification model, hardware categories and hardware manufacturers are classified using a feature selection algorithm and a classification algorithm.

3. The sensitive information protection method according to claim 2, characterized in that, when classifying hardware categories, the feature selection algorithm uses an improved information gain method, calculated as follows:

IG'(t) = \lg\frac{1}{\mathrm{dis}(t)} \left[ \sum_{j=1}^{k} P(c_j, t) \log\frac{P(c_j, t)}{P(c_j) P(t)} + \sum_{j=1}^{k} P(c_j, \bar{t}) \log\frac{P(c_j, \bar{t})}{P(c_j) P(\bar{t})} \right]

where t is a feature, c denotes a class, k is the number of classes, dis(t) is the distribution of feature t between classes, taken as the ratio of the number of samples in which t occurs to the total number of samples, P(t) is the probability that the feature occurs, P(c) is the probability that the class occurs, P(c, t) is the probability that the feature and the class occur together, P(\bar{t}) is the probability that the feature does not occur, and P(c, \bar{t}) is the probability that a sample in which the feature does not occur belongs to class c;
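As a hedged illustration of the improved information gain, the Python sketch below estimates every probability by simple counting over a labelled sample set. The function name `improved_information_gain` and the toy data are hypothetical; \lg is read as the base-10 logarithm, and the unspecified base of \log is assumed to be the natural logarithm. Terms with zero joint probability are skipped, following the usual convention 0·log 0 = 0.

```python
import math
from collections import Counter

def improved_information_gain(docs, labels, t):
    """Estimate IG'(t) from documents given as sets of features."""
    n = len(docs)
    class_count = Counter(labels)
    # Number of samples of each class in which feature t occurs.
    with_t = Counter(lab for d, lab in zip(docs, labels) if t in d)
    n_t = sum(with_t.values())
    if n_t == 0:
        return 0.0                          # feature never occurs
    dis = n_t / n                           # share of samples containing t
    p_t, p_not_t = n_t / n, 1 - n_t / n
    total = 0.0
    for c, n_c in class_count.items():
        p_c = n_c / n
        joint_t = with_t[c] / n             # P(c, t)
        joint_not_t = (n_c - with_t[c]) / n # P(c, not t)
        if joint_t > 0:
            total += joint_t * math.log(joint_t / (p_c * p_t))
        if joint_not_t > 0:
            total += joint_not_t * math.log(joint_not_t / (p_c * p_not_t))
    return math.log10(1 / dis) * total

docs = [{"a"}, {"a"}, {"b"}, {"b"}]
labels = ["x", "x", "y", "y"]
print(improved_information_gain(docs, labels, "a"))
```

In this sketch the factor \lg(1/dis(t)) vanishes for a feature that occurs in every sample, so such a feature receives zero weight regardless of the bracketed term.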

d(x, y) = \left[ \sum_{i=1}^{n} IG'(t_i) (x_i - y_i)^2 \right]^{\frac{1}{2}}

where x is the unclassified sample and y is a classified sample, both n-dimensional vectors in which each dimension holds one feature value; IG'(t_i) is the information gain value of the i-th feature t_i; x = (x₁, x₂, …, x_n), y = (y₁, y₂, …, y_n); d(x, y) is the distance between x and y; and x_i, y_i are the i-th feature values of the samples.
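The distance is a feature-weighted Euclidean distance. A minimal Python sketch follows; the function name `weighted_distance` and the sample vectors are illustrative only, and the weights are assumed to have been computed beforehand (for example as IG'(t_i) values).

```python
import math

def weighted_distance(x, y, weights):
    """Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2)."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

# A difference on a heavily weighted feature counts for more.
print(weighted_distance([1, 0], [0, 0], [4, 1]))
print(weighted_distance([0, 1], [0, 0], [4, 1]))
```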

4. The sensitive information protection method according to claim 2, characterized in that, when classifying hardware manufacturers, the feature selection algorithm selects features by feature similarity, that is, by the similarity between classes on a feature; the similarity among p classes on feature t_i is defined as follows: let the p classes be c₁, c₂, …, c_p; their similarity on feature t_i is the mean of the similarities on t_i over every pair of classes, i.e.:

\mathrm{sim}_{t_i}(c_1, c_2, \ldots, c_p) = \frac{2}{p(p-1)} \sum_{i=1}^{p} \sum_{j=i+1}^{p} \mathrm{sim}_{t_i}(c_i, c_j)

If this similarity exceeds a given threshold, feature t_i is considered too similar across the p classes and is not suitable as a classification feature; otherwise it can be used as a classification feature;
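The averaging over class pairs and the resulting feature filter can be sketched in Python as follows. The pairwise similarity between two classes on a feature is taken as a given input (`pair_sim`), and the names `mean_pairwise_similarity`, `select_features` and the `threshold` parameter are hypothetical; the concrete threshold value is an assumption of this sketch.

```python
from itertools import combinations

def mean_pairwise_similarity(classes, pair_sim):
    """Mean of pair_sim(a, b) over all unordered pairs of classes."""
    p = len(classes)
    if p < 2:
        raise ValueError("at least two classes are required")
    total = sum(pair_sim(a, b) for a, b in combinations(classes, 2))
    return 2 * total / (p * (p - 1))

def select_features(features, classes, pair_sim_on, threshold):
    """Keep features whose mean between-class similarity is not above threshold."""
    return [
        t for t in features
        if mean_pairwise_similarity(classes, lambda a, b: pair_sim_on(t, a, b)) <= threshold
    ]

# Hypothetical similarities of three classes on two features.
SIMS = {
    "color": {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 1.0},
    "ui":    {("A", "B"): 0.2, ("A", "C"): 0.4, ("B", "C"): 0.6},
}
lookup = lambda t, a, b: SIMS[t][(a, b)]
print(select_features(["color", "ui"], ["A", "B", "C"], lookup, threshold=0.5))
```

The factor 2 / (p(p-1)) is the reciprocal of the number of unordered class pairs, so the result is an ordinary mean of the pairwise similarities.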

The classification algorithm uses an improved KNN method, in which the reciprocal of the similarity serves as the feature weight in the KNN calculation; the KNN distance formula is:

d(x, y) = \left[ \sum_{i=1}^{n} \frac{1}{\mathrm{sim}_{t_i}(c_1, c_2, \ldots, c_p)} (x_i - y_i)^2 \right]^{\frac{1}{2}}

where c_i is the i-th class, p is the total number of classes, t_i is the i-th feature, n is the total number of features, and x = (x₁, x₂, …, x_n) and y = (y₁, y₂, …, y_n) are the unclassified sample and a classified sample respectively, each with n feature values x_i and y_i.
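A small Python sketch of KNN with reciprocal-similarity weights is given below. The function name `knn_classify` and the sample data are hypothetical, the per-feature similarities are assumed to be strictly positive, and the majority vote among the k nearest neighbours is the standard KNN decision rule, assumed here because the text only specifies the distance.

```python
import math
from collections import Counter

def knn_classify(x, training, feature_sim, k=1):
    """Classify x by majority vote among its k nearest training samples.

    training    : list of (vector, label) pairs
    feature_sim : between-class similarity of each feature (all > 0);
                  its reciprocal is used as the feature weight
    """
    weights = [1.0 / s for s in feature_sim]

    def dist(y):
        return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

    nearest = sorted(training, key=lambda item: dist(item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((4.0, 0.0), "A"), ((0.0, 3.0), "B")]
x = (0.0, 0.0)
print(knn_classify(x, training, feature_sim=[1.0, 1.0]))   # equal weights
print(knn_classify(x, training, feature_sim=[1.0, 0.1]))   # second feature weighted 10x
```

A feature on which the classes are very similar gets a small weight and contributes little to the distance, while a feature that separates the classes well dominates it.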

5. The sensitive information protection method according to claim 1, characterized in that the hardware model matching algorithm uses hardware model sets: hardware models that share the same attribute value are placed in one set; the attribute values of the hardware to be matched on certain attributes determine the model sets to which the hardware belongs, and the intersection of these sets gives the model of the hardware.
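The set-based matching can be sketched in Python as follows. The functions `build_index` and `match_models`, the attribute names and the model names are all hypothetical; the sketch assumes each model is described by a flat attribute-to-value dictionary.

```python
from collections import defaultdict

def build_index(models):
    """Map each (attribute, value) pair to the set of models that have it."""
    index = defaultdict(set)
    for name, attrs in models.items():
        for attr, value in attrs.items():
            index[(attr, value)].add(name)
    return index

def match_models(index, observed):
    """Intersect the model sets selected by the observed attribute values."""
    candidate_sets = [index.get(pair, set()) for pair in observed.items()]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)

MODELS = {
    "M1": {"screen": "5.5 inch", "ram": "4GB"},
    "M2": {"screen": "5.5 inch", "ram": "6GB"},
    "M3": {"screen": "6.0 inch", "ram": "6GB"},
}
index = build_index(MODELS)
print(match_models(index, {"screen": "5.5 inch", "ram": "6GB"}))
```

Each additional observed attribute can only shrink the candidate set, so a description that mentions few attributes may still leave several candidate models.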

6. The sensitive information protection method according to claim 1, characterized in that the leaf nodes at the bottom layer of the keyword semantic tree are the sub-feature words of the innermost-layer attribute keywords of the XML structure in the hardware information base, the second-to-last layer of the semantic tree corresponds to the innermost-layer attribute keywords of the XML structure in the hardware information base, the third-to-last layer of the semantic tree consists of the second-layer attribute keywords of the XML structure, and the fourth layer is the root node, which is the name of the hardware category.