CN106649262A - Protection method for enterprise hardware facility sensitive information in social media - Google Patents

Protection method for enterprise hardware facility sensitive information in social media

Publication number
CN106649262A
Authority
CN
China
Prior art keywords
hardware
feature
information
classification
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610971014.7A
Other languages
Chinese (zh)
Other versions
CN106649262B (en
Inventor
曾剑平
崔战伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201610971014.7A priority Critical patent/CN106649262B/en
Publication of CN106649262A publication Critical patent/CN106649262A/en
Application granted granted Critical
Publication of CN106649262B publication Critical patent/CN106649262B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Computing Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of privacy protection and provides a method for protecting sensitive enterprise hardware facility information in social media. The method first builds a hardware infrastructure information base; it then determines the hardware type that a social media description refers to by constructing a hardware classification model and a hardware model matching algorithm; finally, according to the hardware type obtained, it masks or replaces the keywords in the hardware description that could leak sensitive information. The method handles keywords differently according to their sensitivity levels and is highly extensible.

Description

A method for protecting sensitive enterprise hardware facility information in social media
Technical field
The present invention relates to a method for protecting sensitive enterprise hardware facility information in social media, and belongs to the technical field of privacy protection.
Background
With the emergence of traditional social media such as microblogs and online forums, and of newer social media such as WeChat, Facebook and Twitter, people have entered the social media era. The rapid rise of social media has accelerated the flow of information and made interpersonal communication increasingly convenient. However, the widespread use of social media also brings security risks: social media users may, intentionally or unintentionally, threaten the confidential and sensitive information of enterprises or institutions. If such information is obtained, integrated and exploited with ill intent by commercial organizations or by lawbreakers, the privacy of individuals or institutions may be leaked [1]. Mobile device users can easily obtain their own location and related service information through location-based services. Although location-based services offer users great convenience, they must first obtain the mobile user's location before providing the corresponding service, and a location-based service system cannot guarantee that the server will not leak or misuse that location information. Location-based services therefore pose a great challenge to protecting users' location privacy [2]. In addition, with the rise of big data technology in recent years, privacy protection techniques based on big data have multiplied, but on the whole, research at home and abroad on big data security and privacy protection remains insufficient; the problem can be better solved only by combining technical means with relevant policies and regulations [3].
With the widespread use of the Internet, research at home and abroad on privacy protection and trade secret protection has also grown. The main research directions in privacy protection include general privacy protection techniques, privacy protection techniques for data mining, privacy-preserving data publishing principles, and privacy protection algorithms. General privacy protection techniques aim to protect data privacy at a lower application level, typically by introducing statistical and probabilistic models. Privacy protection techniques for data mining mainly address how, in higher-level data applications, to protect privacy according to the characteristics of different data mining operations. Privacy-preserving data publishing principles aim to provide a privacy protection method that is general across applications, so that privacy protection algorithms designed on that basis are general as well. As an emerging research focus, privacy protection technology is of great value in both theoretical research and practical application [4].
Traditional sensitive information protection mainly relies on filtering by keyword matching. This approach ignores the semantic context, has low accuracy, is hard to defend against deliberate human evasion, and requires maintaining large keyword dictionaries at considerable labor cost. Emerging methods include protection based on natural language processing and artificial intelligence, but these techniques are still at the research stage and cannot yet meet the filtering accuracy required in practice.
Summary of the invention
Rather than studying the protection of sensitive information from a macroscopic perspective, the present invention addresses one specific aspect of privacy or trade secret protection, namely the protection of enterprise hardware information in social media, and provides a corresponding protection method.
As noted above, social media users may leak private information when posting. Likewise, when insiders post on social media such as microblogs or forums, they may leak sensitive information such as the models and configurations of hardware used within the enterprise.
To solve the above technical problem, the present invention proposes a new approach that combines text classification with a semantic replacement strategy to protect information. The basic idea is first to determine, by classification, the hardware category and model described by the publisher; then to look up all attribute information of that hardware model in a previously built hardware information base; and finally, according to the keywords in that attribute information, to mask or replace keywords in the hardware description the publisher is posting. The main innovations are the construction of the hardware information base, the design of the hardware classification model and the hardware model matching algorithm, and the method for replacing sensitive keywords.
The technical solution of the present invention is described in detail as follows.
The present invention provides a method for protecting sensitive enterprise hardware facility information in social media, comprising the following steps:
Step 1: building the models
(1) Building the hardware information base
Hardware information is obtained, and information at multiple levels, including hardware category, vendor and model, attributes, and attribute values, is extracted and organized into a hierarchical XML structure to build the hardware information base;
(2) Performing Chinese word segmentation on the hardware description information in the hardware information base
(3) Building the hardware classification model and the hardware model matching algorithm
After the hardware description information in the hardware information base has been segmented, the feature information of the hardware categories is extracted first; then, on the basis of the category classification, the feature information of vendors is extracted and a vendor classification model is built; finally, using the category and vendor classification results, a hardware model matching algorithm is built to determine the hardware model;
(4) Building the keyword masking and replacement model
For each hardware category, the attribute keywords that appear in hardware description information are assigned sensitivity levels, and keywords of different sensitivity levels are handled differently, which yields the keyword masking and replacement model. The sensitivity levels are 0, 1, 2, 3 and 4. Keywords of level 0 are left unchanged; keywords of level 4 are masked directly with asterisks; keywords of levels 1, 2 and 3 are handled through a keyword semantic tree. The keyword semantic tree is built from the keywords at the different levels of the hardware information base according to their relationships in the XML structure. It has four layers, and the replacement strategy based on it is as follows:
A keyword of sensitivity level 1 is replaced with its parent node; a keyword of level 2 is replaced with the parent of its parent node; a keyword of level 3 is replaced directly with the root node;
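The per-level handling described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the tree contents, the sample tokens, and the names `protect`, `ancestor`, `level_of` and `parent_of` are hypothetical, and only the handling of levels 0 to 4 follows the text.

```python
def ancestor(node, parent_of, steps):
    """Walk `steps` levels up the semantic tree, stopping early at the root."""
    for _ in range(steps):
        if node not in parent_of:
            break
        node = parent_of[node]
    return node


def protect(tokens, level_of, parent_of, root):
    """Apply the masking and replacement policy to a list of segmented tokens."""
    out = []
    for tok in tokens:
        level = level_of.get(tok, 0)      # unknown tokens are treated as level 0
        if level == 0:
            out.append(tok)               # not sensitive: keep unchanged
        elif level == 4:
            out.append("*" * len(tok))    # most sensitive: mask with asterisks
        elif level == 3:
            out.append(root)              # replace with the root (category name)
        else:
            # level 1 -> parent node, level 2 -> parent of the parent node
            out.append(ancestor(tok, parent_of, level))
    return out


# Hypothetical four-layer tree: router > network parameters > wireless rate > 1200Mbps
parent_of = {
    "1200Mbps": "wireless rate",
    "wireless rate": "network parameters",
    "network parameters": "router",
}
tokens = ["our", "router", "runs", "at", "1200Mbps", "firmware", "v2.1"]
level_of = {"1200Mbps": 1, "v2.1": 4}
protected = protect(tokens, level_of, parent_of, root="router")
```

In this sketch a level-1 value is generalized to the attribute it belongs to, so the post still reads naturally while the concrete value is withheld.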
Step 2: detection and protection
After the input social media content has been segmented, the hardware classification model and hardware model matching algorithm from Step 1 determine the category, vendor and model to which it belongs. Once the model is determined, the keyword masking and replacement model built in Step 1 is applied: each attribute keyword in the segmented content is handled according to its sensitivity level, that is, it is masked, replaced, or left unchanged.
In the present invention, the hardware classification model classifies hardware categories and hardware vendors by means of a feature selection algorithm and a classification algorithm.
In the present invention, when classifying hardware categories, the feature selection algorithm uses an improved information gain method. The formula is as follows:
where t is a feature, c is a class, k is the number of classes, and dis(t) is the distribution of feature t across classes, defined as the ratio of the number of samples in which t occurs to the total number of samples; P(t) is the probability that the feature occurs, P(c) is the probability that the class occurs, P(c, t) is the probability that the feature and the class occur together, $P(\bar{t})$ is the probability that the feature does not occur, and $P(c \mid \bar{t})$ is the probability that a sample belongs to class c given that the feature does not occur.
The classification algorithm uses an improved KNN method, whose distance formula is as follows:
where x is an unclassified sample and y is a classified sample; both are n-dimensional vectors, each dimension of which is one feature value; IG'(t_i) is the information gain value of the i-th feature t_i; x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n); d(x, y) is the distance between x and y; and x_i and y_i are the i-th feature values of the samples.
In the present invention, when classifying hardware vendors, the feature selection algorithm selects features by feature similarity, that is, by the similarity between classes on each feature. The similarity among p classes on feature t_i is defined as follows: let the p classes be c_1, c_2, …, c_p; their similarity on feature t_i is the mean of the similarities of every pair of classes on t_i.
If this similarity is greater than or equal to a threshold δ, feature t_i is considered too similar among the p classes and is unsuitable as a classification feature; otherwise it can be used as a classification feature;
The classification algorithm uses an improved KNN method in which the reciprocal of the similarity serves as the feature weight in the KNN computation. The KNN distance formula is as follows:
where c_i is the i-th class, p is the total number of classes, t_i is the i-th feature, n is the total number of features, and x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n) are the unclassified sample and the classified sample respectively, with n feature values x_i and y_i.
In the present invention, the hardware model matching algorithm is based on sets of hardware models: hardware models that share the same value of an attribute are placed in one set. By determining the values of the hardware to be matched on several attributes, the model sets to which the hardware belongs are determined; the intersection of these sets then gives the model of the hardware.
In the present invention, the leaf nodes at the bottom layer of the keyword semantic tree are the subfeature words of the innermost-layer attribute keywords of the XML structure in the hardware information base; the second layer from the bottom corresponds to the innermost-layer attribute keywords of the XML structure; the third layer from the bottom corresponds to the second-layer attribute keywords of the XML structure; and the fourth layer is the root node, which is the name of the hardware category.
Compared with the prior art, the present invention has the following substantive features and notable advances:
(1) It can detect content that may leak sensitive enterprise hardware information when social media content is published, and it provides fine-grained content control. Compared with existing coarse-grained methods, which can only control an entire post, this is an advance, and it preserves as far as possible the basic need to share content on social media.
(2) It provides a classification and matching method over three levels (category, vendor and model) that makes full use of information such as category-specific vocabulary and attributes, improving detection recall and avoiding leakage of sensitive hardware information. It also narrows the search range during matching, since matching is needed only within the information base of a single vendor, which improves matching efficiency.
(3) It proposes new ideas and implementations for the hardware information base structure, feature selection, classifier construction and protection method: an XML organization is designed, the information gain computation is improved, a feature selection method based on the similarity of vendor-class features is designed, a keyword semantic tree is constructed, and a specific protection strategy is given.
Brief description of the drawings
Fig. 1 is the overall flowchart of the present invention.
Fig. 2 is a flowchart of the hardware vendor classification process.
Fig. 3 is a flowchart of the hardware model matching process.
Fig. 4 is a flowchart of the keyword masking and replacement method.
Fig. 5 is a diagram of the hardware information base (XML structure).
Fig. 6 shows the correspondence between the keywords at each layer of the semantic tree and the keywords at each layer of the XML in the embodiment.
Fig. 7 shows a sample of the final semantic tree built in the embodiment.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below with reference to the drawings and embodiments.
The overall flow of the present invention is shown in Fig. 1. It comprises the model building flow on the left of Fig. 1 and the detection and protection flow on the right; the results of three stages of the model building flow provide the basic data required by the detection and protection flow.
The main work of the present invention includes:
(1) building the hardware information base;
(2) performing Chinese word segmentation on the hardware description information;
(3) building the hardware classification model and the hardware model matching algorithm;
(4) building the keyword masking and replacement method.
The key techniques involved in the above process are explained in turn below.
1. Building the hardware information base
In the embodiment, a web crawler was designed for a large computer website and automatically crawled hardware information for more than ten thousand models in 36 categories, including mobile phones, notebooks, switches and routers. This hardware information is organized into XML files in which each tag represents an attribute of the hardware and the text content of the tag is the value of that attribute. Using the structural descriptive power of XML, a tree-shaped hardware information base is built. This base is the essential information source for the subsequent processing. The resulting hardware information base (XML structure) is shown in Fig. 5.
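As an illustration of the tree-shaped organization described above, the following minimal Python sketch builds a small hierarchy of category, vendor, model and attributes with the standard library. The element names (`hardware_base`, `category`, `vendor`, `model`, `attribute`), the sample vendor and model names, and the choice to store attribute names in a `name` attribute rather than as tag names are assumptions made for this sketch; the embodiment itself uses one tag per hardware attribute.

```python
import xml.etree.ElementTree as ET


def _child(parent, tag, name):
    """Return the child of `parent` with this tag and name, creating it if absent."""
    for node in parent.findall(tag):
        if node.get("name") == name:
            return node
    return ET.SubElement(parent, tag, name=name)


def build_hardware_base(records):
    """Build the tree: root > category > vendor > model > attribute (text = value)."""
    root = ET.Element("hardware_base")
    for rec in records:
        category = _child(root, "category", rec["category"])
        vendor = _child(category, "vendor", rec["vendor"])
        model = _child(vendor, "model", rec["model"])
        for attr_name, attr_value in rec["attributes"].items():
            attr = ET.SubElement(model, "attribute", name=attr_name)
            attr.text = attr_value
    return root


# Hypothetical crawled records
records = [
    {"category": "router", "vendor": "VendorA", "model": "R-100",
     "attributes": {"ports": "4", "wireless_rate": "300Mbps"}},
    {"category": "router", "vendor": "VendorA", "model": "R-200",
     "attributes": {"ports": "8", "wireless_rate": "1200Mbps"}},
]
base = build_hardware_base(records)
xml_text = ET.tostring(base, encoding="unicode")
```

Keeping the attribute name in an XML attribute avoids problems with names that are not valid XML tag names, at the cost of departing slightly from the one-tag-per-attribute layout of the embodiment.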
2. Performing Chinese word segmentation on the hardware information
Although step 1 yields the hardware information for all models, this information cannot be processed by computer directly. It must first be segmented into Chinese words, function words removed, and keywords extracted; the extracted keywords are then used for subsequent work such as classification. Any common segmentation method can be used for this step, for example ICTCLAS, the Chinese lexical analysis system based on a hierarchical hidden Markov model developed by the Institute of Computing Technology of the Chinese Academy of Sciences, which supports user dictionaries and multiple encoding formats and reaches a segmentation accuracy of 97.5%.
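The segmentation and keyword extraction step can be illustrated with a much simpler stand-in than ICTCLAS. The sketch below uses forward maximum matching against a small dictionary and then drops stop words; it is not the hierarchical HMM method named above, and the dictionary, the stop-word list and the function name `segment` are hypothetical.

```python
def segment(text, dictionary, stop_words, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character when nothing matches
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    # Drop function words and whitespace so that only keywords remain
    return [tok for tok in tokens if tok not in stop_words and not tok.isspace()]


dictionary = {"路由器", "交换机", "千兆", "端口", "公司", "采购"}
stop_words = {"的", "了"}
keywords = segment("公司采购了千兆的路由器", dictionary, stop_words)
```

A production system would replace this function with a full segmenter and a user dictionary containing hardware terms.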
3. Building the hardware classification model and the hardware model matching algorithm
On the basis of word segmentation, the present invention determines the hardware model described by a piece of hardware description information by building a classification model and a hardware model matching algorithm. The hardware classification model comprises two sub-classification processes, the classification of hardware categories and the classification of hardware vendors, the latter being carried out on the basis of the former. These two steps determine the category and vendor of the hardware; the hardware model matching method then determines its model. The basic ideas of these three processes are described below.
(1) Classification of hardware categories
The classification of hardware categories borrows the KNN method from text classification: feature selection first picks the feature words that contribute most to classification, and a classification algorithm then classifies the hardware. The feature selection and classification algorithms borrow from information gain and KNN respectively, but both are improved to suit the characteristics of the hardware information base, which helps improve classification accuracy.
The traditional information gain method considers only the effect of whether a feature word occurs on the global information entropy, and not how frequently the feature word occurs within and between classes. The present invention improves the traditional method by taking the between-class frequency of feature words into account, which improves feature selection.
The formula of the improved information gain method is as follows:
where dis(t) is the distribution of feature t across classes, defined as the ratio of the number of samples in which t occurs to the total number of samples. The adjustment coefficient is chosen for two reasons. First, it is a decreasing function of dis(t): when the between-class distribution value of feature t is very small, the coefficient is relatively large, which is exactly what is required. Second, it balances the weight between the traditional information gain value IG(t) and the between-class distribution value dis(t), so that the result does not rely excessively on either one.
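To make the quantities concrete, the sketch below computes the traditional information gain IG(t) and the distribution value dis(t) from a small labelled corpus. The adjustment coefficient itself is not reproduced in this text, so `adjusted_gain` takes it as a caller-supplied function and simply multiplies; both the multiplicative form and the example coefficient `1 - d` are assumptions for illustration only, as are all function names and sample data.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Traditional IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t)."""
    has = [term in d for d in docs]
    p_t = sum(has) / len(docs)
    with_t = [lab for lab, h in zip(labels, has) if h]
    without_t = [lab for lab, h in zip(labels, has) if not h]
    return _entropy(labels) - p_t * _entropy(with_t) - (1 - p_t) * _entropy(without_t)


def dis(docs, term):
    """Share of samples in which the term occurs, as dis(t) is defined in the text."""
    return sum(term in d for d in docs) / len(docs)


def adjusted_gain(docs, labels, term, coeff):
    """IG(t) scaled by coeff(dis(t)); the form of `coeff` is an assumption."""
    return coeff(dis(docs, term)) * information_gain(docs, labels, term)


# Hypothetical segmented descriptions, each as a set of feature words
docs = [{"wifi", "port"}, {"wifi", "antenna"}, {"screen", "battery"}, {"screen", "port"}]
labels = ["router", "router", "phone", "phone"]
gain_wifi = information_gain(docs, labels, "wifi")
gain_port = information_gain(docs, labels, "port")
scaled = adjusted_gain(docs, labels, "wifi", coeff=lambda d: 1.0 - d)
```

Here "wifi" separates the two classes perfectly while "port" carries no class information, which is the contrast that feature selection relies on.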
Similarly, the present invention improves the traditional KNN algorithm by taking into account that different features affect classification differently: the information gain values from feature selection serve as weights in the KNN algorithm. The information gain value of a feature reflects how strongly that feature affects the information entropy; the larger the value, the greater the feature's influence on the classification result. Using the information gain value of a feature directly as its weight in KNN therefore reflects how much features with different information gain values contribute to classification. The distance formula of the improved KNN algorithm is as follows.
where x is an unclassified sample and y is a classified sample; both are n-dimensional vectors, each dimension of which is one feature value; IG(t_i) is the information gain value of the i-th feature t_i; x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n).
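A feature-weighted KNN of this kind can be sketched as follows. The exact distance formula is not reproduced in this text, so the sketch assumes a weighted Euclidean distance in which each squared difference is multiplied by the feature weight; the weights, sample vectors and function names are hypothetical.

```python
import math
from collections import Counter


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance: weights[i] scales the i-th squared difference."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def knn_classify(x, samples, weights, k=3):
    """samples is a list of (vector, label); majority vote among the k nearest."""
    nearest = sorted(samples, key=lambda s: weighted_distance(x, s[0], weights))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


samples = [
    ((1, 0, 1), "router"),
    ((1, 0, 0), "router"),
    ((0, 1, 1), "phone"),
    ((0, 1, 0), "phone"),
]
weights = [0.9, 0.9, 0.1]  # hypothetical information gain value of each feature
predicted = knn_classify((1, 0, 1), samples, weights, k=3)
```

With these weights the third feature, which has a low information gain, barely affects the distance, so samples are grouped mainly by the first two features.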
(2) Classification of hardware vendors
After the hardware category has been classified, the classification of hardware vendors determines the vendor of the hardware within that category. As before, this step requires feature selection and a suitable classification algorithm.
The feature selection algorithm of the present invention is based on feature similarity: for each feature, the similarity of that feature between the different vendor classes is examined. If the similarity is greater than or equal to a threshold, the feature is considered too similar across vendors and is unsuitable as a classification feature; otherwise it can be used as one. The improved KNN classification algorithm is again used in this step, except that the feature weight is changed to the reciprocal of the feature similarity, as described below.
In the hardware information base, each hardware feature may contain several subfeatures. For example, the value of the feature "external dimensions" consists of three values, length, width and height, which are the three subfeatures of this feature. Suppose feature t_i consists of n subfeatures, i.e., $t_i = (t_{i1}, t_{i2}, \ldots, t_{in})$. Let the value of one sample on feature t_i be $t_i^{(1)} = (t_{i1}^{(1)}, t_{i2}^{(1)}, \ldots, t_{in}^{(1)})$ and the value of another sample on t_i be $t_i^{(2)} = (t_{i1}^{(2)}, t_{i2}^{(2)}, \ldots, t_{in}^{(2)})$. The similarity between $t_i^{(1)}$ and $t_i^{(2)}$ is then defined as: $\mathrm{sim}(t_i^{(1)}, t_i^{(2)}) = \dfrac{\sum_{j=1}^{n} t_{ij}^{(1)} t_{ij}^{(2)}}{\sqrt{\sum_{j=1}^{n} (t_{ij}^{(1)})^2}\,\sqrt{\sum_{j=1}^{n} (t_{ij}^{(2)})^2}}$
That is, the similarity between two feature values is defined as the cosine of the angle between their vectors. Because the features under examination may contain different numbers of subfeatures, i.e., different dimensions, this definition makes it possible to disregard the dimension of the vectors and focus on the angle between them: when two vectors, i.e., two feature values, are similar, the cosine of the angle is large; otherwise it is small.
With the similarity on a single feature defined, the similarity between two classes on a feature is given next. Each class may contain several samples. Suppose classes c_1 and c_2 contain m_1 and m_2 samples respectively; their similarity on feature t_i is then defined as: $\mathrm{sim}(c_1, c_2, t_i) = \dfrac{1}{m_1 m_2} \sum_{a=1}^{m_1} \sum_{b=1}^{m_2} \mathrm{sim}(t_i^{(1,a)}, t_i^{(2,b)})$, where $t_i^{(1,a)}$ is the value of the a-th sample of c_1 on t_i and $t_i^{(2,b)}$ is the value of the b-th sample of c_2 on t_i.
As the formula shows, the similarity of two classes on feature t_i is simply the average, over all sample pairs drawn from the two classes, of the similarity on t_i, so that the similarity of all samples between the two classes on t_i is taken into account.
On the basis of the similarity between two classes on feature t_i, the similarity among p classes on t_i is defined next. Let the p classes be c_1, c_2, …, c_p; their similarity on feature t_i is the mean of the similarities of every pair of classes on t_i, i.e.: $\mathrm{sim}(c_1, \ldots, c_p, t_i) = \dfrac{2}{p(p-1)} \sum_{1 \le j < k \le p} \mathrm{sim}(c_j, c_k, t_i)$
If the similarity of the p classes on feature t_i is greater than or equal to a threshold δ, i.e., $\mathrm{sim}(c_1, \ldots, c_p, t_i) \ge \delta$, feature t_i is considered too similar among the p classes and is unsuitable as a classification feature; otherwise it can be used as a classification feature.
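The three similarity levels and the threshold test can be sketched in Python as follows. The sketch assumes, as the prose states, cosine similarity between feature values, an average over all sample pairs for two classes, and an average over all class pairs for p classes; the function names, vendor names, sample vectors and threshold value are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two subfeature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def class_pair_similarity(values_1, values_2):
    """Mean cosine similarity over all sample pairs drawn from two classes."""
    total = sum(cosine(u, v) for u in values_1 for v in values_2)
    return total / (len(values_1) * len(values_2))


def multi_class_similarity(values_by_class):
    """Mean of the two-class similarities over all pairs of classes."""
    pairs = list(combinations(values_by_class.values(), 2))
    return sum(class_pair_similarity(a, b) for a, b in pairs) / len(pairs)


def is_classification_feature(values_by_class, threshold):
    """Keep the feature only if its between-class similarity is below the threshold."""
    return multi_class_similarity(values_by_class) < threshold


# Hypothetical values of one feature (three subfeatures) for the samples of three vendors
values_by_class = {
    "VendorA": [(1.0, 0.0, 0.0)],
    "VendorB": [(0.0, 1.0, 0.0)],
    "VendorC": [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
}
overall = multi_class_similarity(values_by_class)
keep = is_classification_feature(values_by_class, threshold=0.5)
```

In this example only VendorA and VendorC look alike on the feature, so the average over the three class pairs stays below the threshold and the feature is kept.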
In this step, classification still uses the improved KNN algorithm, except that the feature weight changes: it is no longer the information gain value but the reciprocal of the feature similarity. The reason is as follows. Feature similarity expresses how similar different classes are on a feature. Features with higher similarity contribute little to classification and should receive smaller weights, while features with lower similarity contribute more and should receive larger weights. It is therefore reasonable to use the reciprocal of the similarity as the feature weight in the KNN computation. The KNN distance formula is as follows:
The classification process for hardware vendors is as follows; Fig. 2 shows the corresponding flowchart.
1) Select samples of the different vendors under a given category from the hardware information base;
2) For each feature, compute its feature similarity between the different vendors;
3) If the feature similarity of the feature is below the threshold, use it as a classification feature; otherwise return to 2) and select the next feature to continue computing feature similarity;
4) Classify using the selected features and the improved KNN algorithm to obtain the vendor class.
(3) Matching of hardware models
After the category and the vendor within it are determined, the present invention determines the model of the hardware under that vendor by means of a hardware model matching algorithm. The algorithm is based on sets of hardware models: hardware models sharing the same value of an attribute are placed in one set. To determine the model of a piece of hardware, it suffices to determine its values on several attributes, which identifies the model sets it belongs to; the intersection of these sets gives its model. Compared with comparing hardware models one by one, this method is much more efficient and greatly reduces the number of comparisons.
Hardware model matching thus does not compare against every product one by one but uses a new, more efficient algorithm. Specifically, suppose the products in the category have n attributes (t_1, t_2, …, t_n), and each attribute t_i contains a_i subfeatures, i.e., $t_i = (t_{i1}, t_{i2}, \ldots, t_{ia_i})$. Products of the vendor that are identical on attribute t_i are grouped into one set. Because a product of a given model may be identical to other products on more than one attribute, it may appear in several sets, so the sets may intersect one another.
Suppose p attributes appear in the description information of the hardware, each with an observed attribute value. The hardware model matching algorithm is then described as follows:
1) for each attribute ti, place the hardware models that have the same attribute value on ti in the same set;
2) let i = 1 and C = Ω, where Ω denotes the universal set;
3) find the set of models whose value on the i-th observed attribute equals that of the hardware;
4) replace C with the intersection of C and the set found in 3);
5) if C contains only one element or i > p, go to 6); otherwise let i = i + 1 and return to 3);
6) return the set C, which is the final hardware model matching result.
Fig. 3 shows the detailed flow chart of the hardware model matching process. The key steps are as follows.
1) for each attribute, build the sets of hardware models that share the same attribute value;
2) take one attribute, examine the value of the hardware on that attribute, and obtain the hardware model set corresponding to that value;
3) intersect this hardware model set with the hardware model set obtained so far; if the intersection contains only one element or all attributes have been taken, stop, and the elements of the intersection are the model to which the hardware belongs; otherwise return to 2).
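The set-intersection matching described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the catalog layout (model name mapped to a dictionary of attribute values), the function names and the example models are all hypothetical and do not come from the hardware information library of the patent.

```python
from collections import defaultdict

def build_model_sets(catalog):
    """For every (attribute, value) pair, collect the models that share it."""
    sets = defaultdict(set)
    for model, attrs in catalog.items():
        for attr, value in attrs.items():
            sets[(attr, value)].add(model)
    return sets

def match_model(catalog, observed):
    """Intersect the per-attribute model sets until at most one candidate remains."""
    sets = build_model_sets(catalog)
    candidates = set(catalog)                  # C starts as the universal set
    for attr, value in observed.items():
        candidates &= sets.get((attr, value), set())
        if len(candidates) <= 1:               # early stop: unique model or no match
            break
    return candidates

# Hypothetical catalog of one manufacturer's models.
catalog = {
    "A1": {"screen": "5.5", "ram": "4GB", "network": "4G"},
    "A2": {"screen": "5.5", "ram": "6GB", "network": "4G"},
    "B7": {"screen": "6.0", "ram": "6GB", "network": "4G"},
}
matched = match_model(catalog, {"network": "4G", "screen": "5.5", "ram": "6GB"})
unmatched = match_model(catalog, {"screen": "7.0"})
```

Each observed attribute costs one dictionary lookup and one set intersection, instead of a comparison against every model in the catalog.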
4. Building the keyword shielding substitution model
The present invention designs a keyword shielding substitution model to shield or replace keywords in hardware description information that may reveal sensitive hardware information. Keywords are assigned different sensitivity levels, and keywords of different sensitivity levels are handled in different ways.
(1) Keyword sensitivity levels
For each major hardware category, five sensitivity levels are established in advance for all attribute-value keywords. They are denoted by the numbers 0, 1, 2, 3 and 4 in order of increasing sensitivity; see Table 1.
Table 1. Sensitivity levels
Sensitivity level | 0 | 1 | 2 | 3 | 4
Meaning | Not sensitive | Slightly sensitive | Moderately sensitive | Fairly sensitive | Highly sensitive
Handling | None | Replace | Replace | Replace | Shield
Keywords of different sensitivity levels are handled differently. Keywords of sensitivity level 0 are left unprocessed, keywords of sensitivity level 4 are shielded directly with asterisks, and keywords of sensitivity levels 1, 2 and 3 are handled by building a semantic tree.
(2) Building the keyword semantic tree
Keywords of sensitivity levels 1, 2 and 3 are replaced by means of a semantic tree. The leaf nodes of the semantic tree are the keywords with the most specific semantics; the semantics become gradually fuzzier as the node level rises, and the root node is the node with the fuzziest semantics. For hardware description information, the semantic tree has 4 layers in total, and the replacement policy based on the semantic tree is as follows:
A keyword of sensitivity level 1 is replaced by its parent node; a keyword of sensitivity level 2 is replaced by the parent of its parent node; a keyword of sensitivity level 3 is replaced directly by the root node.
In the hardware information library, the XML document of each hardware model is a hierarchical structure, and the attribute keywords of an upper layer are semantically fuzzier than those of a lower layer, so the XML document can be used to build the keyword semantic tree.
The present invention builds the semantic tree as follows. The leaf nodes at the bottom layer are the sub-feature words of the innermost attribute keywords. The second layer from the bottom corresponds to the innermost attribute keywords of the XML structure in the hardware information library, which are semantically fuzzier than their respective sub-feature words. The third layer from the bottom consists of the second-layer attribute keywords of the XML structure. Because the first layer of the XML document is the specific model of the hardware, which is highly sensitive information, the fourth layer from the bottom does not correspond to the first layer of the XML document; instead it takes the name of the major hardware category, which is semantically fuzzier than the third layer from the bottom, as the keyword of that layer. Since this layer has already risen to the name of the major hardware category, it is also the first layer of the whole semantic tree, i.e. the root node. Fig. 6 shows the correspondence between the keywords at each layer of the semantic tree and the keywords at each layer of the XML document. Fig. 7 shows a sample of the resulting semantic tree, in which "second-layer attribute keywords" and "third-layer attribute keywords" refer to the second-layer and third-layer attribute keywords of the XML document.
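As a rough illustration of the tree construction and the level-based handling described above, the following sketch derives a child-to-parent map from a small XML document and applies the five sensitivity levels. The XML content, the tag names, the category name "mobile phone" and the functions `build_parent_map` and `process` are hypothetical examples chosen for this sketch; they are not taken from the hardware information library of the patent.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document for one model: layer 1 is the model itself,
# layer 2 and layer 3 are attribute keywords, the text nodes are sub-feature words.
XML = """
<model name="X100">
  <network>
    <sim_type>dual SIM</sim_type>
    <standard>Telecom 4G</standard>
  </network>
  <display>
    <orientation>landscape</orientation>
  </display>
</model>
"""

def build_parent_map(xml_text, category):
    """Map every semantic-tree node to its parent; the category name is the root."""
    root = ET.fromstring(xml_text)
    parent = {}
    for second in root:                        # second-layer attribute keywords
        parent[second.tag] = category          # the model name is never used as a node
        for inner in second:                   # innermost attribute keywords
            parent[inner.tag] = second.tag
            parent[inner.text.strip()] = inner.tag   # leaf: sub-feature word
    return parent

def process(word, level, parent, category):
    """Apply the handling for sensitivity levels 0-4 to one keyword."""
    if level == 0:
        return word                            # not sensitive: leave unchanged
    if level == 4:
        return "*" * len(word)                 # highly sensitive: shield with asterisks
    if level == 3:
        return category                        # replace with the root node
    node = word
    for _ in range(level):                     # level 1: parent, level 2: grandparent
        node = parent[node]
    return node

parent = build_parent_map(XML, "mobile phone")
replaced = [process("Telecom 4G", lvl, parent, "mobile phone") for lvl in range(5)]
```

Because the tree has four layers, climbing three steps from a leaf also reaches the root, so the level-3 shortcut is consistent with the climbing used for levels 1 and 2.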
Application example
Because the information related to enterprise IT hardware facilities that is available in Internet social media is not yet plentiful, it is relatively difficult to collect. For this case verification, partial information from 5000 hardware descriptions was therefore first extracted from the hardware information library and organized into text documents, one text document per description. After word segmentation, the keyword samples (with some keywords randomly deleted) are consistent with the processed content obtained from social media, so the processed data can approximately simulate hardware description samples in social media.
From each major category, 60 samples are chosen at random as training samples, giving 2160 training samples in total; the remaining 40 samples of each category are used as test samples to be classified, giving 1440 test samples in total. The relationship between classification performance and the value of k is shown in Table 2.
Table 2. Proportion of correct classifications and mean F1 for major hardware categories at different values of k
Parameter k | 1 | 5 | 10 | 15 | 20 | 25 | 30
Proportion correctly classified | 80.1% | 72.8% | 69.3% | 67.3% | 65.7% | 63.8% | 60%
Mean F1 | 0.805 | 0.734 | 0.706 | 0.689 | 0.676 | 0.663 | 0.639
For hardware manufacturer classification, the major hardware category "mobile phone" is taken as an example. Eight mobile phone manufacturers are chosen: Samsung, Apple, Huawei, OPPO, vivo, Meizu, Lenovo and Coolpad. The proportion of correctly classified samples and the mean F1 are tested at different values of k; the results are shown in Table 3.
Table 3. Proportion of correctly classified samples and mean F1 for manufacturers at different values of k
Parameter k | 1 | 5 | 10 | 15 | 20 | 25 | 30 | 35
Proportion correctly classified | 42.4% | 36.0% | 34.7% | 35.6% | 31.8% | 35.6% | 33.5% | 31.4%
Mean F1 | 0.422 | 0.350 | 0.339 | 0.328 | 0.295 | 0.319 | 0.299 | 0.281
200 texts are selected at random from the mobile phone category, and each sub-feature value is processed according to the sensitivity level of its corresponding sub-feature word. The final statistics are shown in Table 4.
Table 4. Performance data for shielding and replacement of some keywords
Sub-feature word | Full-network | Mobile 4G | Unicom 4G | Telecom 4G | Horizontal
Number of sub-feature words | 20 | 89 | 76 | 41 | 138
Number processed correctly | 20 | 89 | 76 | 41 | 138
Accuracy | 100% | 100% | 100% | 100% | 100%
Bibliography
[1] Guo Qing. User information privacy and protection in social media use [J]. China Information Security, 2014, (7): 90-93.
[2] Wei Qiong, Lu Yansheng. Research progress on location privacy protection [J]. Computer Science, 2008, 35(9): 21-25.
[3] Feng Dengguo, Zhang Min, Li Hao. Big data security and privacy protection [J]. Chinese Journal of Computers, 2014, 37(1): 246-258.
[4] Zhou Shuigeng, Li Feng, Tao Yufei, Xiao Xiaokui. A survey of privacy protection for database applications [J]. Chinese Journal of Computers, 2009, 32(5): 847-861.

Claims (6)

1. A method for protecting sensitive information of enterprise hardware facilities in social media, characterized by comprising the following steps:
Step 1, building the models
(1) Building the hardware information library
Hardware information is obtained; information at multiple levels, including major hardware category, manufacturer and model, attributes and attribute values, is extracted and organized into an XML hierarchical structure to build the hardware information library;
(2) Chinese word segmentation is performed on the hardware description information in the hardware information library;
(3) Building the hardware classification model and the hardware model matching algorithm
After word segmentation of the hardware description information in the hardware information library, the feature information of the major categories is extracted first; then, on the basis of the major-category classification, the feature information of the manufacturers is extracted and a manufacturer classification model is built; finally, using the classification information of major category and manufacturer, the hardware model matching algorithm is built to determine the model of the hardware;
(4) Building the keyword shielding substitution model
For each major hardware category, the attribute keywords occurring in the hardware description information are divided into sensitivity levels, and keywords of different sensitivity levels are handled in different ways, thereby building the keyword shielding substitution model; wherein the sensitivity levels are 0, 1, 2, 3 and 4; keywords of sensitivity level 0 are left unprocessed, keywords of sensitivity level 4 are shielded directly, and keywords of sensitivity levels 1, 2 and 3 are handled by means of a keyword semantic tree; the keyword semantic tree is built from the keywords at different levels of the hardware information library according to the XML structure; the keyword semantic tree has four layers, and the replacement policy based on the keyword semantic tree is as follows:
A keyword of sensitivity level 1 is replaced by its parent node; a keyword of sensitivity level 2 is replaced by the parent of its parent node; a keyword of sensitivity level 3 is replaced directly by the root node;
Step 2, detection and protection
After word segmentation of the input social media content, the major category, manufacturer and model to which the hardware belongs are determined according to the hardware classification model and the hardware model matching algorithm of step 1; after the model is determined, the keyword shielding substitution model built in step 1 is used to perform, on the attribute keywords in the segmented social media content, the action corresponding to their sensitivity level and handling method, namely shielding, replacing or leaving unprocessed.
2. The sensitive information protection method according to claim 1, characterized in that, in the hardware classification model, a feature selection algorithm and a classification algorithm are used to classify the major hardware category and the hardware manufacturer.
3. The sensitive information protection method according to claim 2, characterized in that, when the major hardware category is classified, the feature selection algorithm uses an improved information gain method, calculated as follows:
$$IG'(t) = \lg\frac{1}{dis(t)}\left[\sum_{j=1}^{k} P(c_j,t)\log\frac{P(c_j,t)}{P(c_j)P(t)} + \sum_{j=1}^{k} P(c_j,\bar{t})\log\frac{P(c_j,\bar{t})}{P(c_j)P(\bar{t})}\right]$$
where t is a feature, c denotes a class, k denotes the number of classes, dis(t) denotes the distribution of feature t across the classes, namely the ratio of the number of samples in which feature t occurs to the total number of samples, P(t) denotes the probability that the feature occurs, P(c) denotes the probability that the class occurs, P(c, t) denotes the probability that the feature and the class occur together, $P(\bar{t})$ denotes the probability that the feature does not occur, and $P(c,\bar{t})$ denotes the probability that the feature does not occur and the sample belongs to class c;
The classification algorithm uses an improved KNN method, whose distance formula is as follows:
$$d(x,y) = \left[\sum_{i=1}^{n} IG'(t_i)\,(x_i-y_i)^2\right]^{\frac{1}{2}}$$
where x denotes an unclassified sample and y denotes a classified sample; both are n-dimensional vectors in which each dimension represents one feature value; $IG'(t_i)$ denotes the information gain value of the i-th feature $t_i$; $x=(x_1,x_2,\dots,x_n)$, $y=(y_1,y_2,\dots,y_n)$; d(x, y) denotes the distance between x and y; and $x_i$, $y_i$ denote the i-th feature values of the samples.
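For illustration only, the improved information gain and the weighted distance can be computed along the following lines. This sketch makes several assumptions that are not stated in the text: probabilities are estimated as simple relative frequencies over a list of samples, `log` is taken as base 2 and `lg` as base 10, terms with zero joint probability are skipped, and dis(t) is estimated by the fraction of samples containing t. The function names and the tiny data set are hypothetical.

```python
import math

def improved_ig(docs, labels, t):
    """Estimate IG'(t) from samples given as sets of features with class labels."""
    n = len(docs)
    has_t = [t in d for d in docs]
    p_t = sum(has_t) / n                       # also used as dis(t) in this sketch
    if p_t in (0.0, 1.0):
        return 0.0                             # feature never or always occurs
    total = 0.0
    for c in set(labels):
        p_c = labels.count(c) / n
        for present, p_feat in ((True, p_t), (False, 1 - p_t)):
            joint = sum(1 for h, y in zip(has_t, labels)
                        if h == present and y == c) / n
            if joint > 0:                      # skip zero-probability terms
                total += joint * math.log2(joint / (p_c * p_feat))
    return math.log10(1 / p_t) * total

def weighted_distance(x, y, weights):
    """d(x, y) with each squared difference weighted by the feature's IG' value."""
    total = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
    return math.sqrt(max(total, 0.0))          # clamp tiny negative rounding errors

# Hypothetical samples: two phone descriptions and two PC descriptions.
docs = [{"4g", "screen"}, {"4g", "camera"}, {"cpu", "screen"}, {"cpu", "ram"}]
labels = ["phone", "phone", "pc", "pc"]
ig_4g = improved_ig(docs, labels, "4g")          # separates the classes perfectly
ig_screen = improved_ig(docs, labels, "screen")  # occurs equally in both classes
```

In this example the feature that occurs in exactly one class receives a positive weight, while the feature spread evenly over both classes receives weight zero and therefore does not affect the distance.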
4. The sensitive information protection method according to claim 2, characterized in that, when the hardware manufacturer is classified, the feature selection algorithm selects features by means of feature similarity; features are selected according to the similarity between classes on each feature; the similarity among p classes on feature $t_i$ is defined as follows: let the p classes be $c_1,c_2,\dots,c_p$; the similarity of these p classes on feature $t_i$ is defined as the mean of the similarities on $t_i$ of every pair of classes, i.e.:
$$sim_{t_i}(c_1,c_2,\dots,c_p) = \frac{2}{p(p-1)}\sum_{u=1}^{p}\sum_{v=u+1}^{p} sim_{t_i}(c_u,c_v)$$
If $sim_{t_i}(c_1,c_2,\dots,c_p)$ exceeds the threshold, feature $t_i$ is considered too similar across these p classes and is not suitable as a classification feature; otherwise it can be used as a classification feature;
The classification algorithm uses an improved KNN method, in which the reciprocal of the similarity is used as the weight of the feature in the KNN calculation; the specific KNN distance formula is as follows:
$$d(x,y) = \left[\sum_{i=1}^{n} \frac{1}{sim_{t_i}(c_1,c_2,\dots,c_p)}\,(x_i-y_i)^2\right]^{\frac{1}{2}}$$
where $c_u$ denotes the u-th class, p is the total number of classes, $t_i$ denotes the i-th feature, n is the total number of features, and $x=(x_1,x_2,\dots,x_n)$ and $y=(y_1,y_2,\dots,y_n)$ denote an unclassified sample and a classified sample respectively, each having n feature values $x_i$ and $y_i$.
5. The sensitive information protection method according to claim 1, characterized in that the hardware model matching algorithm uses the method of hardware model sets, in which hardware models with the same attribute value are placed in one set; the values of the hardware to be matched on certain attributes are determined, thereby determining the model sets to which the hardware belongs, and the intersection of these sets is then taken to obtain the model to which the hardware belongs.
6. The sensitive information protection method according to claim 1, characterized in that the leaf nodes at the bottom layer of the keyword semantic tree are the sub-feature words of the innermost attribute keywords of the XML structure in the hardware information library, the second layer from the bottom of the semantic tree corresponds to the innermost attribute keywords of the XML structure in the hardware information library, the third layer from the bottom of the semantic tree consists of the second-layer attribute keywords of the XML structure, and the fourth layer is the root node, which is the name of the major hardware category.
CN201610971014.7A 2016-10-31 2016-10-31 Method for protecting sensitive information of enterprise hardware facilities in social media Expired - Fee Related CN106649262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610971014.7A CN106649262B (en) 2016-10-31 2016-10-31 Method for protecting sensitive information of enterprise hardware facilities in social media

Publications (2)

Publication Number Publication Date
CN106649262A true CN106649262A (en) 2017-05-10
CN106649262B CN106649262B (en) 2020-07-07

Family

ID=58821041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610971014.7A Expired - Fee Related CN106649262B (en) 2016-10-31 2016-10-31 Method for protecting sensitive information of enterprise hardware facilities in social media

Country Status (1)

Country Link
CN (1) CN106649262B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108390865A (en) * 2018-01-30 2018-08-10 南京航空航天大学 A kind of fine-grained access control mechanisms and system based on privacy driving
CN111209735A (en) * 2020-01-03 2020-05-29 广州杰赛科技股份有限公司 Document sensitivity calculation method and device
CN112000867A (en) * 2020-08-17 2020-11-27 桂林电子科技大学 Text classification method based on social media platform
CN112100646A (en) * 2020-04-09 2020-12-18 南京邮电大学 Spatial data privacy protection matching method based on two-stage grid conversion

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827102A (en) * 2010-04-20 2010-09-08 中国人民解放军理工大学指挥自动化学院 Data prevention method based on content filtering
US20120254085A1 (en) * 2008-03-28 2012-10-04 International Business Machines Corporation Information classification system, information processing apparatus, information classification method and program
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN104866465A (en) * 2014-02-25 2015-08-26 腾讯科技(深圳)有限公司 Sensitive text detection method and device
CN105426361A (en) * 2015-12-02 2016-03-23 上海智臻智能网络科技股份有限公司 Keyword extraction method and device
CN105955978A (en) * 2016-04-15 2016-09-21 宝利九章(北京)数据技术有限公司 Method and system for data leakage protection

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120254085A1 (en) * 2008-03-28 2012-10-04 International Business Machines Corporation Information classification system, information processing apparatus, information classification method and program
US9245012B2 (en) * 2008-03-28 2016-01-26 International Business Machines Corporation Information classification system, information processing apparatus, information classification method and program
CN101827102A (en) * 2010-04-20 2010-09-08 中国人民解放军理工大学指挥自动化学院 Data prevention method based on content filtering
CN104866465A (en) * 2014-02-25 2015-08-26 腾讯科技(深圳)有限公司 Sensitive text detection method and device
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN105426361A (en) * 2015-12-02 2016-03-23 上海智臻智能网络科技股份有限公司 Keyword extraction method and device
CN105955978A (en) * 2016-04-15 2016-09-21 宝利九章(北京)数据技术有限公司 Method and system for data leakage protection

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108390865A (en) * 2018-01-30 2018-08-10 南京航空航天大学 A kind of fine-grained access control mechanisms and system based on privacy driving
CN111209735A (en) * 2020-01-03 2020-05-29 广州杰赛科技股份有限公司 Document sensitivity calculation method and device
CN111209735B (en) * 2020-01-03 2023-06-02 广州杰赛科技股份有限公司 Document sensitivity calculation method and device
CN112100646A (en) * 2020-04-09 2020-12-18 南京邮电大学 Spatial data privacy protection matching method based on two-stage grid conversion
CN112000867A (en) * 2020-08-17 2020-11-27 桂林电子科技大学 Text classification method based on social media platform

Also Published As

Publication number Publication date
CN106649262B (en) 2020-07-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200707