CN106649262A - Protection method for enterprise hardware facility sensitive information in social media - Google Patents
- Publication number
- CN106649262A CN201610971014.7A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention belongs to the technical field of privacy protection and provides a method for protecting sensitive information about enterprise hardware facilities in social media. The method first builds a hardware infrastructure information base. It then determines the hardware type that a piece of social media description refers to by constructing a hardware classification model and a hardware type matching algorithm. Finally, according to the hardware type obtained, it masks or replaces, in a targeted way, the keywords in the hardware description that could leak sensitive information. The method handles keywords differently according to their sensitivity levels and is highly extensible.
Description
Technical field
The present invention relates to a method for protecting sensitive information about enterprise hardware facilities in social media, and belongs to the technical field of privacy protection.
Background technology
With the emergence of established social media such as microblogs and online forums, and of newer platforms such as WeChat, Facebook, and Twitter, people have entered the social media era. The rapid rise of social media has accelerated the flow of information and made communication between people increasingly convenient. However, the widespread use of social media also brings security risks: social media users may, intentionally or unintentionally, threaten the confidential and sensitive information of enterprises or institutions. If such information is obtained, aggregated, and exploited in bad faith by commercial organizations or criminals, the privacy of individuals or institutions may be leaked [1]. Mobile device users can easily obtain their own location and related service information through location-based services. Although location-based services offer users great convenience, they must first obtain the mobile user's location before they can provide the corresponding service, and a location-based service system cannot guarantee that the server will not leak or misuse that location information. Location-based services therefore pose a major challenge to protecting users' location privacy [2]. In addition, with the rise of big data technology in recent years, privacy protection techniques based on big data have multiplied, but on the whole, research at home and abroad on big data security and privacy protection remains insufficient. Only by combining technical means with relevant policies and regulations can the problem of big data security and privacy protection be solved well [3].
With the widespread use of the Internet, research at home and abroad on privacy protection and trade secret protection has also grown. The main research directions in privacy protection include general privacy protection techniques, privacy protection techniques for data mining, privacy-preserving data publishing principles, and privacy protection algorithms. General privacy protection techniques aim to protect data privacy at a lower application level, typically by introducing statistical and probabilistic models. Privacy protection techniques for data mining address how to protect privacy in high-level data applications according to the characteristics of different data mining operations. Privacy-preserving data publishing principles aim to provide a privacy protection method common to all kinds of applications, so that privacy protection algorithms designed on that basis are also general. As an emerging research focus, privacy protection technology has significant value in both theoretical research and practical application [4].
Traditional sensitive-information protection relies mainly on filtering by keyword matching. This approach ignores the semantic context, so its accuracy is low; it is hard to defend against deliberate human evasion; and it requires maintaining large keyword dictionaries, so its labor cost is high. Emerging methods include protection based on natural language processing and artificial intelligence, but these techniques are still at the research stage and cannot yet meet the filtering accuracy required in practice.
Summary of the invention
The present invention does not study sensitive-information protection from a macroscopic perspective. Instead, it selects one specific aspect of privacy or trade secret protection, namely the protection of enterprise hardware information in social media, and provides a corresponding information protection method.
As noted above, social media users may leak private information when posting their views. Similarly, when internal staff post on social media such as microblogs or forums, they may leak sensitive information such as the models and configurations of the enterprise's internal hardware.
To solve the above technical problem, the present invention proposes a new approach that combines text classification with a semantic replacement strategy to protect information. The basic idea is first to determine, by classification, the hardware category and model described by the information publisher; then to look up all attribute information of that hardware model in a previously built hardware information base; and finally, according to the keywords in that attribute information, to mask or replace the keywords in the hardware description that the publisher posted. The main innovations of the invention are the construction of the hardware information base, the design of the hardware classification model and the hardware model matching algorithm, and the method for replacing sensitive keywords.
The technical solution is described in detail as follows.
The present invention provides a method for protecting sensitive information about enterprise hardware facilities in social media, comprising the following steps:
Step 1, build the models
(1) Build the hardware information base
Obtain hardware information; extract information at multiple levels, including the hardware major category, manufacturer and model, attributes, and attribute values; organize it into an XML hierarchy; and build the hardware information base;
(2) Perform Chinese word segmentation on the hardware description information in the hardware information base
(3) Build the hardware classification model and the hardware model matching algorithm
After segmenting the hardware description information in the hardware information base, first extract the feature information of the major categories; then, on the basis of the major-category classification, extract the feature information of the manufacturers and build the manufacturer classification model; finally, using the major-category and manufacturer classification results, build the hardware model matching algorithm to determine the hardware model;
(4) Build the keyword masking and replacement model
For each hardware major category, assign a sensitivity level to each attribute keyword that appears in the hardware description information, and handle keywords of different sensitivity levels differently, thereby building the keyword masking and replacement model. The sensitivity levels are 0, 1, 2, 3, and 4. Keywords of level 0 are left unchanged; keywords of level 4 are masked directly with asterisks; keywords of levels 1, 2, and 3 are processed using a keyword semantic tree. The keyword semantic tree is built from the keywords at the different levels of the hardware information base according to the XML structure. The tree has four layers, and the replacement policy based on it is as follows:
A keyword of sensitivity level 1 is replaced by its parent node; a keyword of level 2 is replaced by the parent of its parent node; a keyword of level 3 is replaced directly by the root node;
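The replacement policy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree contents, the `PARENT` mapping, and the function names are hypothetical and do not come from the patent, and the sketch stops at the root if a keyword sits too high in the tree to have the required ancestor.

```python
from typing import Dict, Optional

# Hypothetical four-layer semantic tree stored as child -> parent links:
# root (major category) -> second-layer attribute -> innermost attribute -> sub-feature word.
PARENT: Dict[str, Optional[str]] = {
    "mobile phone": None,        # root: name of the hardware major category
    "hardware": "mobile phone",  # second-layer attribute keyword
    "CPU": "hardware",           # innermost-layer attribute keyword
    "quad-core": "CPU",          # leaf: sub-feature word
}


def ancestor(word: str, steps: int) -> str:
    """Walk up `steps` parents, stopping early at the root."""
    node = word
    for _ in range(steps):
        parent = PARENT.get(node)
        if parent is None:
            break
        node = parent
    return node


def root_of(word: str) -> str:
    """Follow parent links until the root is reached."""
    node = word
    while PARENT.get(node) is not None:
        node = PARENT[node]
    return node


def protect(word: str, level: int) -> str:
    """Apply the action that corresponds to a sensitivity level from 0 to 4."""
    if level == 0:
        return word                  # no processing
    if level in (1, 2):
        return ancestor(word, level) # parent, or parent of parent
    if level == 3:
        return root_of(word)         # root node
    if level == 4:
        return "*" * len(word)       # mask with asterisks
    raise ValueError(f"unknown sensitivity level: {level}")
```

With this toy tree, a level-1 occurrence of "quad-core" becomes "CPU", while a level-3 occurrence becomes "mobile phone".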
Step 2, detection and protection
After word segmentation of the input social media content, determine the major category, manufacturer, and model to which it belongs using the hardware classification model and the hardware model matching algorithm from Step 1. Once the model is determined, apply the keyword masking and replacement model built in Step 1 to the attribute keywords in the segmented content, performing the action that corresponds to each keyword's sensitivity level: masking, replacement, or no processing.
In the present invention, the hardware classification model classifies the hardware major category and the hardware manufacturer using a feature selection algorithm and a classification algorithm.
In the present invention, for major-category classification the feature selection algorithm uses an improved information gain method; the specific formula is as follows:
where $t$ is a feature, $c$ denotes a class, $k$ is the number of classes, and $dis(t)$ denotes the distribution of feature $t$ between classes, namely the ratio of the number of samples in which $t$ appears to the total number of samples. $P(t)$ is the probability that the feature appears, $P(c)$ is the probability that the class appears, $P(c,t)$ is the probability that the feature and the class appear together, $P(\bar{t})$ is the probability that the feature does not appear, and $P(c,\bar{t})$ is the probability that the feature does not appear and the sample belongs to class $c$.
The classification algorithm uses an improved KNN method, whose distance formula is as follows:
where $x$ is an unclassified sample and $y$ is a classified sample; both are $n$-dimensional vectors in which each dimension represents one feature value, $x=(x_1,x_2,\ldots,x_n)$ and $y=(y_1,y_2,\ldots,y_n)$. $IG'(t_i)$ is the information gain value of the $i$-th feature $t_i$, $d(x,y)$ is the distance between $x$ and $y$, and $x_i$ and $y_i$ are the $i$-th feature values of the two samples.
In the present invention, for manufacturer classification the feature selection algorithm selects features by feature similarity, that is, by the similarity between classes on each feature. The similarity among $p$ classes on feature $t_i$ is defined as follows: let the $p$ classes be $c_1,c_2,\ldots,c_p$; their similarity on $t_i$ is the mean of the pairwise similarities on $t_i$ between any two of the classes, that is:
If this similarity is greater than or equal to a threshold $\delta$, feature $t_i$ is considered too similar across the $p$ classes and unsuitable as a classification feature; otherwise it can be used as a classification feature;
The classification algorithm uses an improved KNN method in which the reciprocal of the similarity serves as the feature weight in the KNN calculation. The specific KNN distance formula is as follows:
where $c_i$ is the $i$-th class, $p$ is the total number of classes, $t_i$ is the $i$-th feature, and $n$ is the total number of features. $x=(x_1,x_2,\ldots,x_n)$ and $y=(y_1,y_2,\ldots,y_n)$ are the unclassified sample and the classified sample respectively, each with $n$ feature values $x_i$ and $y_i$.
In the present invention, the hardware model matching algorithm is based on sets of hardware models: models that share the same value of an attribute are placed in one set. Determining the values of the hardware to be matched on some attributes identifies the model sets it belongs to, and intersecting these sets yields the model of the hardware.
In the present invention, the leaf nodes at the bottom layer of the keyword semantic tree are the sub-feature words of the innermost-layer attribute keywords of the XML structure in the hardware information base. The second layer from the bottom corresponds to the innermost-layer attribute keywords of the XML structure, the third layer from the bottom corresponds to the second-layer attribute keywords of the XML structure, and the fourth layer is the root node, which is the name of the hardware major category.
Compared with the prior art, the present invention has substantive features and represents notable progress:
(1) It can detect content that may leak sensitive enterprise hardware information when social media content is posted, and it provides fine-grained content control. Compared with existing coarse-grained methods that can only control an entire post, this is an advance, and it preserves as far as possible the basic need to share social media content.
(2) It provides a classification and matching method at three levels (major category, manufacturer, and model) that makes full use of information such as category-specific vocabulary and attributes, improving detection recall and avoiding leaks of sensitive hardware information. It also narrows the search range during matching: matching is needed only within the information base of a single manufacturer, which improves matching efficiency.
(3) It proposes new ideas and implementations for the hardware information base structure, feature selection, classifier construction, and protection method: it designs an XML organizational form, improves the information gain calculation, designs a feature selection method based on feature similarity between manufacturer classes, constructs the keyword semantic tree, and gives specific protection policies.
Description of the drawings
Fig. 1 is the overview flow chart of the present invention.
Fig. 2 is a schematic flowchart of hardware manufacturer classification.
Fig. 3 is a schematic flowchart of the hardware model matching method.
Fig. 4 is a flowchart of the keyword masking and replacement method.
Fig. 5 is a diagram of the hardware information base (XML structure).
Fig. 6 shows the correspondence between the keywords at each layer of the semantic tree and the keywords at each layer of the XML in the embodiment.
Fig. 7 is an example of the final semantic tree built in the embodiment.
Detailed description of the embodiments
The technical solution is described in detail below with reference to the drawings and embodiments.
The overall flow of the invention is shown in Fig. 1. It comprises the model-building flow on the left of Fig. 1 and the detection and protection flow on the right; the results of three stages of the model-building flow provide the basic data needed by the detection and protection flow.
The main work of the invention includes:
(1) building the hardware information base;
(2) performing Chinese word segmentation on the hardware description information;
(3) building the hardware classification model and the hardware model matching algorithm;
(4) building the keyword masking and replacement method.
The key techniques involved in this process are explained in turn below.
1. Building the hardware information base
In the embodiment, a web crawler was designed for a large computer website and automatically crawled hardware information for more than ten thousand models in 36 major categories, including mobile phones, notebooks, switches, and routers. This information is organized as XML files in which each tag represents an attribute of the hardware and the text content of the tag represents the value of that attribute. Using XML's own structural expressiveness, a tree-shaped hardware information base is built. It is the basic information source for the subsequent processing. The hardware information base (XML structure) is shown in Fig. 5.
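As a rough illustration of such a tree-shaped information base, the sketch below parses a tiny XML fragment with Python's standard `xml.etree.ElementTree` and flattens it into rows. The element and attribute names (`hardware_base`, `category`, `vendor`, `model`, `screen`, `cpu`) and the sample values are assumptions made for this example; the actual schema is the one shown in Fig. 5, which is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: tags are hardware attributes, tag text is the attribute value.
SAMPLE = """
<hardware_base>
  <category name="mobile phone">
    <vendor name="VendorA">
      <model name="A1">
        <screen><size>5.5 inch</size></screen>
        <cpu><cores>quad-core</cores></cpu>
      </model>
    </vendor>
  </category>
</hardware_base>
"""


def attribute_rows(xml_text):
    """Flatten the hierarchy into (category, vendor, model, attribute, sub-attribute, value) rows."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for category in root.findall("category"):
        for vendor in category.findall("vendor"):
            for model in vendor.findall("model"):
                for attr in model:      # second-layer attribute tags
                    for sub in attr:    # innermost attribute tags
                        rows.append((
                            category.get("name"),
                            vendor.get("name"),
                            model.get("name"),
                            attr.tag,
                            sub.tag,
                            (sub.text or "").strip(),
                        ))
    return rows
```

Flat rows like these are convenient inputs for the later steps, since each row already carries the category, manufacturer, and model that an attribute value belongs to.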
2. Chinese word segmentation of the hardware information
Although step 1 yields hardware information for all models, the information cannot be processed by computer directly. It must first be segmented into Chinese words, with function words removed and keywords extracted; the extracted keywords are then used for subsequent work such as classification. Any common segmentation method can be used for this step, for example ICTCLAS, the Chinese lexical analysis system based on a hierarchical hidden Markov model developed by the Institute of Computing Technology of the Chinese Academy of Sciences, which supports user dictionaries and multiple encoding formats and reaches a segmentation accuracy of 97.5%.
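To make the segmentation and keyword extraction step concrete, here is a minimal sketch that swaps in forward maximum matching, a much simpler dictionary-based technique than the HMM-based ICTCLAS named above. The vocabulary, the stop-word list, and the function names are hypothetical and serve only to show the shape of the step.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def extract_keywords(text, vocab, stop_words):
    """Segment the text, then drop function words and whitespace tokens."""
    return [tok for tok in forward_max_match(text, vocab)
            if tok not in stop_words and tok.strip()]
```

For example, with the vocabulary `{"手机", "处理器", "四核"}` and stop words `{"的", "是"}`, the sentence "手机的处理器是四核" is reduced to the three keywords 手机, 处理器, and 四核.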
3. Building the hardware classification model and the hardware model matching algorithm
On the basis of word segmentation, the invention determines the hardware model described by a hardware description by building a classification model and a hardware model matching algorithm. The hardware classification model comprises two sub-classification processes: classification of the hardware major category and classification of the hardware manufacturer, the latter being performed on the basis of the former. These two steps determine the category and manufacturer of the hardware; the hardware model matching method then determines its model. The basic ideas of these three processes are described below.
(1) Classification of the hardware major category
Major-category classification borrows from KNN text classification: feature selection first picks the feature words that contribute most to classification, and a classification algorithm then classifies the hardware. The feature selection and classification algorithms of the invention borrow from the information gain method and the KNN method respectively, but both are improved to suit the characteristics of the hardware information base, which helps improve classification accuracy.
The traditional information gain method considers only the effect of a feature word's presence or absence on the global information entropy, not how frequently the feature word occurs within and between classes. The invention improves on the traditional method by taking into account the frequency of feature words between classes, which improves feature selection.
The formula of the improved information gain method is as follows:
where $dis(t)$ denotes the distribution of feature $t$ between classes, namely the ratio of the number of samples in which $t$ appears to the total number of samples. The adjustment coefficient is chosen for two reasons. First, it is a decreasing function of $dis(t)$: when the between-class distribution value of feature $t$ is very small, the coefficient is relatively large, which is exactly what is required. Second, it balances the weights of the traditional information gain value $IG(t)$ and the between-class distribution value $dis(t)$, so that the result does not depend too heavily on either one.
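The formula itself is not reproduced in this text, so the sketch below only illustrates the ingredients it names. It computes the standard information gain $IG(t)$ (written in its equivalent mutual-information form over feature presence and class), computes $dis(t)$ as defined above, and then combines them through a caller-supplied coefficient function. Both the multiplicative combination and the example coefficient `lambda d: 1.0 - d` are assumptions made for illustration, not the patent's actual adjustment coefficient.

```python
import math
from collections import Counter


def information_gain(samples, term):
    """Standard IG of `term`; samples is a list of (set_of_terms, class_label)."""
    n = len(samples)
    class_counts = Counter(label for _, label in samples)
    presence_counts = Counter(term in terms for terms, _ in samples)
    joint_counts = Counter((label, term in terms) for terms, label in samples)
    gain = 0.0
    for (label, present), count in joint_counts.items():
        p_joint = count / n                      # P(c, t) or P(c, not t)
        p_class = class_counts[label] / n        # P(c)
        p_term = presence_counts[present] / n    # P(t) or P(not t)
        gain += p_joint * math.log2(p_joint / (p_class * p_term))
    return gain


def dis(samples, term):
    """Share of samples in which the term appears."""
    return sum(term in terms for terms, _ in samples) / len(samples)


def improved_information_gain(samples, term, coeff):
    """Assumed form: a decreasing coefficient of dis(t) multiplied by IG(t)."""
    return coeff(dis(samples, term)) * information_gain(samples, term)


# Hypothetical usage with a placeholder decreasing coefficient.
samples = [
    ({"screen", "cpu"}, "phone"),
    ({"screen"}, "phone"),
    ({"port", "cpu"}, "switch"),
    ({"port"}, "switch"),
]
score = improved_information_gain(samples, "screen", coeff=lambda d: 1.0 - d)
```

In this toy data "screen" separates the two classes perfectly and gets the maximum gain of one bit, whereas "cpu" appears equally often in both classes and gets a gain of zero.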
Similarly, the invention improves the traditional KNN algorithm by taking into account that different features affect classification differently, using the information gain values from feature selection as weights in KNN. The information gain value of a feature expresses how strongly the feature affects the information entropy: the larger the value, the greater the feature's effect on the classification result. The information gain value of a feature is therefore used directly as that feature's weight in the KNN algorithm, so that features with different information gain values contribute to classification to different degrees. The distance formula in the improved KNN algorithm is given below.
where $x$ is an unclassified sample and $y$ is a classified sample; both are $n$-dimensional vectors in which each dimension represents one feature value, $x=(x_1,x_2,\ldots,x_n)$ and $y=(y_1,y_2,\ldots,y_n)$. $IG(t_i)$ is the information gain value of the $i$-th feature $t_i$.
(2) Classification of the hardware manufacturer
After major-category classification, manufacturer classification determines which manufacturer under that category the hardware belongs to. Likewise, this step requires feature selection and a suitable classification algorithm.
The feature selection algorithm is based on feature similarity: for each feature, its similarity across the different manufacturer classes is examined. If the similarity is greater than or equal to a threshold, the feature is considered too similar across manufacturers and unsuitable as a classification feature; otherwise it can be used as one. This step again uses the improved KNN classification algorithm, except that the feature weight is changed to the reciprocal of the feature similarity, as detailed below.
In the hardware information base, each hardware feature may contain several sub-features. For example, the value of the feature "dimensions" consists of three values: length, width, and height, which are the three sub-features of "dimensions". Suppose feature $t_i$ consists of $n$ sub-features, that is, $t_i=(t_{i1},t_{i2},\ldots,t_{in})$. The values of two samples on feature $t_i$ are then two $n$-dimensional vectors of sub-feature values, and the similarity between them is defined as the cosine of the angle between the two vectors, that is, their dot product divided by the product of their norms. Because the features under study may contain different numbers of sub-features, and hence have different dimensions, using the angle makes it possible to ignore the dimensionality of the vectors and to focus on the similarity between them: when two vectors (that is, two feature values) are similar, the cosine of the angle is large; otherwise it is small.
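The cosine measure described above is straightforward to sketch. The function name and the convention of returning 0.0 for an all-zero vector are choices made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sub-feature value vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of sub-features")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)
```

Note that the measure depends only on direction: two "dimensions" values such as (150, 70, 8) and (300, 140, 16) have similarity 1 even though one device is twice the size of the other.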
Having defined the similarity on a single feature, the similarity between two classes on a feature is defined next. Each class may contain multiple samples. Suppose two classes $c_1$ and $c_2$ contain $m_1$ and $m_2$ samples respectively; the similarity of the two classes on feature $t_i$ is then defined as the average of the similarity on $t_i$ over all $m_1 m_2$ pairs of samples, one from each class. Defining it this way takes into account the similarity on $t_i$ of every sample in both classes.
Building on the similarity between two classes on feature $t_i$, the similarity among $p$ classes on $t_i$ is defined as follows. Let the $p$ classes be $c_1,c_2,\ldots,c_p$; their similarity on $t_i$ is the mean of the pairwise similarities on $t_i$ between any two of the classes. If the similarity of the $p$ classes on $t_i$ is greater than or equal to a threshold $\delta$, then $t_i$ is considered too similar across the $p$ classes and unsuitable as a classification feature; otherwise it can be used as a classification feature.
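The two averaging steps and the threshold test can be sketched as follows. The pairwise sample similarity is passed in as a function `sim` so that the sketch stays independent of how feature values are compared; reading "any two classes" as every unordered pair of distinct classes is an assumption, as are all names used here.

```python
from itertools import combinations


def class_pair_similarity(values_a, values_b, sim):
    """Average of sim over all sample pairs, one value from each class."""
    total = sum(sim(u, v) for u in values_a for v in values_b)
    return total / (len(values_a) * len(values_b))


def multi_class_similarity(values_by_class, sim):
    """Mean of the two-class similarities over every unordered pair of classes."""
    pairs = list(combinations(values_by_class.values(), 2))
    return sum(class_pair_similarity(a, b, sim) for a, b in pairs) / len(pairs)


def is_classification_feature(values_by_class, sim, delta):
    """Keep the feature only if its between-class similarity is below the threshold delta."""
    return multi_class_similarity(values_by_class, sim) < delta
```

Here `values_by_class` maps each manufacturer class to the list of values its samples take on the feature under test; at least two classes are needed for the mean to be defined.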
This step also classifies with the improved KNN algorithm, except that the feature weight changes: it is no longer the information gain value but the reciprocal of the feature's similarity. The reciprocal is chosen because feature similarity expresses how similar different classes are on that feature. Features with higher similarity contribute little to classification and should receive smaller weights, whereas features with lower similarity contribute more and should receive larger weights. It is therefore reasonable to use the reciprocal of the similarity as the feature weight in the KNN calculation. The specific KNN distance formula is as follows:
The vendor classification process is as follows; Fig. 2 shows the corresponding flow chart.
1) Select samples of the different vendors under a given category from the hardware information library;
2) For each feature, calculate its similarity across the different vendors;
3) If the feature's similarity is below the threshold, use it as a classification feature; otherwise
return to step 2) and select the next feature to continue calculating feature similarity;
4) Classify with the selected features and the improved KNN algorithm to obtain the vendor class.
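The steps above can be sketched in Python as follows. This is only an illustration: the exact distance formula is not reproduced in this text, so a weighted Euclidean distance is assumed, and the function names, sample vectors and similarity values are all hypothetical.

```python
import math
from collections import Counter

def weighted_distance(x, y, weights):
    """Weighted Euclidean distance (an assumed form, not taken from the source)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def knn_classify(sample, training, similarities, k=1):
    """Classify `sample` by majority vote among its k nearest training samples.

    training: list of (feature_vector, vendor_label) pairs.
    similarities: per-feature similarity across vendors; each value must be > 0
    because its reciprocal is used as the feature weight.
    """
    weights = [1.0 / s for s in similarities]
    neighbours = sorted(
        training,
        key=lambda item: weighted_distance(sample, item[0], weights),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical data: two selected features, two vendors.
training = [((1, 0), "Samsung"), ((0, 1), "Huawei")]
similarities = [0.1, 0.5]   # feature 0 separates the vendors better
predicted = knn_classify((1, 1), training, similarities, k=1)
```

Because feature 0 has the lower similarity, it receives the larger weight, so a mismatch on feature 0 costs more than a mismatch on feature 1.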
(3) Hardware model matching
After the hardware's category and its vendor within that category have been determined, the present
invention determines the hardware's model under that vendor with a hardware model matching algorithm.
The algorithm is based on hardware model sets: models that share the same attribute value are placed
in one set. To determine the model of a given hardware item, it is only necessary to determine its
values on some attributes, which identifies the model sets it belongs to; the intersection of these
sets then gives its model. Compared with checking hardware models one by one, this matching method
is far more efficient because it greatly reduces the number of comparisons.
Model matching does not compare the hardware against every product one by one; a new, more efficient
comparison algorithm is established instead. Specifically, suppose the products of the category have
n attributes (t_1, t_2, …, t_n) and each attribute t_i contains a_i sub-features. Among the products
of the vendor, those that are identical on attribute t_i are grouped into one set. Because a product
of a given model may match other products on more than one attribute, the same model may appear in
several sets, so the sets may intersect one another.
Suppose p attributes appear in the hardware's description information, each with its own attribute
value. The hardware model matching algorithm is then as follows:
1) For each attribute t_i, place the hardware models with the same attribute value in the same set;
2) Let i = 1 and C = Ω, where Ω denotes the universal set;
3) Find the set whose attribute value equals the hardware's value on the i-th attribute;
4) Replace C with the intersection of C and the set found in step 3);
5) If C contains only one element or i > p, go to step 6); otherwise let i = i + 1 and return to step 3);
6) Return the set C, which is the final hardware model matching result.
Fig. 3 shows the flow chart of the hardware model matching process. The main steps are as follows.
1) For each attribute, build the sets of hardware models that share the same attribute value;
2) Take one attribute, examine the hardware's value on it, and obtain the hardware model set
corresponding to that value;
3) Intersect this hardware model set with the set obtained so far. If the intersection contains only
one element or all attributes have been used, stop: the element in the intersection is the hardware's
model. Otherwise return to step 2).
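The set-intersection matching described above can be sketched in Python as follows. The model names, attribute names and values are hypothetical, and the sketch also stops early when the candidate set becomes empty, which the text does not discuss.

```python
from collections import defaultdict

def build_index(models):
    """Map each (attribute, value) pair to the set of models that share it."""
    index = defaultdict(set)
    for model, attributes in models.items():
        for attribute, value in attributes.items():
            index[(attribute, value)].add(model)
    return index

def match_model(observed, index, all_models):
    """Intersect the model sets of the observed attribute values."""
    candidates = set(all_models)          # C = Ω, the universal set
    for attribute, value in observed.items():
        candidates &= index.get((attribute, value), set())
        if len(candidates) <= 1:          # a single model (or none) is left
            break
    return candidates

# Hypothetical models of one vendor within one category.
models = {
    "A1": {"screen": "5.5 inch", "network": "Mobile 4G"},
    "A2": {"screen": "5.5 inch", "network": "Unicom 4G"},
    "A3": {"screen": "5.0 inch", "network": "Mobile 4G"},
}
index = build_index(models)
result = match_model({"screen": "5.5 inch", "network": "Mobile 4G"}, index, models)
```

Each observed attribute costs one dictionary lookup and one set intersection, instead of a comparison against every model.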
4. Building the keyword shielding and replacement model
The present invention designs a keyword shielding and replacement model to shield or replace keywords
in hardware description information that may reveal sensitive hardware information. Keywords are
assigned different sensitivity levels, and keywords at different levels are handled in different ways.
(1) Division of keyword sensitivity levels
For each hardware category, 5 sensitivity levels are defined in advance for all attribute-value
keywords. They are denoted by the numbers 0, 1, 2, 3 and 4 in order of increasing sensitivity, as
shown in Table 1.
Table 1. Sensitivity levels
Sensitivity level | 0 | 1 | 2 | 3 | 4 |
Meaning | Not sensitive | Slightly sensitive | Moderately sensitive | Fairly sensitive | Highly sensitive |
Handling | None | Replace | Replace | Replace | Shield |
Keywords at different sensitivity levels are handled differently. Keywords at level 0 are left
unchanged, keywords at level 4 are shielded directly with asterisks, and keywords at levels 1, 2 and
3 are handled by building a semantic tree.
(2) Construction of the keyword semantic tree
Keywords at sensitivity levels 1, 2 and 3 are replaced by means of a semantic tree. The leaf nodes of
the semantic tree are the keywords with the most specific semantics; semantics become progressively
vaguer as the node level rises, and the root node is the vaguest. For hardware description
information the semantic tree has 4 layers in total, and the replacement policy based on it is as follows:
A keyword at sensitivity level 1 is replaced by its parent node; a keyword at level 2 is replaced by
the parent of its parent node; a keyword at level 3 is replaced directly by the root node.
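The handling of the five sensitivity levels can be sketched in Python as follows. The tree is represented as a child-to-parent dictionary, which is an assumption made for this illustration, as are the function name and the sample keywords.

```python
def replace_keyword(keyword, level, parent):
    """Apply the handling for one sensitivity level to a leaf keyword.

    parent maps each node of the semantic tree to its parent node;
    the root node has no entry.
    """
    if level == 0:
        return keyword                    # not sensitive: leave unchanged
    if level == 4:
        return "*" * len(keyword)         # highly sensitive: shield with asterisks
    if level not in (1, 2, 3):
        raise ValueError("sensitivity level must be 0, 1, 2, 3 or 4")
    node = keyword
    if level == 3:
        while node in parent:             # climb all the way to the root node
            node = parent[node]
        return node
    for _ in range(level):                # level 1: parent, level 2: grandparent
        node = parent[node]
    return node

# Hypothetical four-layer tree: leaf -> innermost attribute -> second-layer
# attribute -> hardware category name (root).
parent = {
    "Mobile 4G": "network type",
    "network type": "network",
    "network": "mobile phone",
}
generalized = replace_keyword("Mobile 4G", 2, parent)
```

In a four-layer tree, the grandparent of a leaf is the second layer, so levels 1, 2 and 3 each move a leaf keyword up exactly one more layer.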
In the hardware information library, the XML document of each hardware model is hierarchical, and
attribute keywords in upper layers are semantically vaguer than those in lower layers, so the XML
document can be used to build the keyword semantic tree.
The present invention builds the semantic tree as follows. The leaf nodes at the bottom layer are the
sub-feature words of the innermost attribute keywords. The second-to-last layer corresponds to the
innermost attribute keywords of the XML structure in the hardware information library, which are
semantically vaguer than their sub-feature words. The third-to-last layer holds the second-layer
attribute keywords of the XML structure. The first layer of the XML document is the hardware's
concrete model, which is highly sensitive, so the fourth-to-last layer of the semantic tree does not
correspond to the first layer of the XML document. Instead, it uses the name of the hardware
category, which is semantically vaguer than the third-to-last layer. Because this layer has already
risen to the hardware category name, it is also the first layer of the whole semantic tree, i.e. the
root node. Fig. 6 shows the correspondence between the keyword layers of the semantic tree and those
of the XML document, and Fig. 7 shows a sample of the resulting semantic tree; in the sample,
"second-layer attribute keywords" and "third-layer attribute keywords" refer to the second-layer and
third-layer attribute keywords of the XML document.
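The layer mapping described above can be sketched in Python with the standard `xml.etree.ElementTree` module. The XML layout, tag names and values below are hypothetical, since the actual schema of the hardware information library is not shown here; the sketch also assumes that every keyword is unique within the tree.

```python
import xml.etree.ElementTree as ET

def build_semantic_tree(xml_text, category):
    """Return a child -> parent map for the four-layer keyword semantic tree."""
    model = ET.fromstring(xml_text)       # XML layer 1: concrete model, left out of the tree
    parent = {}
    for group in model:                   # XML layer 2 -> third-to-last tree layer
        parent[group.tag] = category      # the root node is the hardware category name
        for attribute in group:           # XML layer 3 (innermost) -> second-to-last layer
            parent[attribute.tag] = group.tag
            value = (attribute.text or "").strip()
            if value:                     # sub-feature word -> leaf node
                parent[value] = attribute.tag
    return parent

# Hypothetical XML document for one hardware model.
SAMPLE_XML = """
<model name="X1">
  <network>
    <network_type>Mobile 4G</network_type>
  </network>
  <screen>
    <screen_size>5.5 inch</screen_size>
  </screen>
</model>
"""
tree = build_semantic_tree(SAMPLE_XML, "mobile phone")
```

The model name in the first XML layer never enters the map, so no replacement can expose it.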
Application example
Because little information about enterprise IT hardware facilities is currently available in Internet
social media, it is difficult to collect. For this case study, partial information from 5,000
hardware descriptions was therefore first extracted from the hardware information library and
organized into text documents, one document per description. After word segmentation, and with some
keywords randomly deleted, the resulting keyword samples are consistent with what is obtained by
processing content from social media, so the processed data can approximately simulate hardware
description samples in social media.
From each category, 60 samples were selected arbitrarily as training samples, giving 2,160 training
samples in total. The remaining 40 samples of each category were used as test samples to be
classified, giving 1,440 test samples in total. The relationship between classification performance
and the value of k is shown in Table 2.
Table 2. Correct classification rate and mean F1 for hardware categories at different values of k
Parameter k | 1 | 5 | 10 | 15 | 20 | 25 | 30 |
Correct classification rate | 80.1% | 72.8% | 69.3% | 67.3% | 65.7% | 63.8% | 60% |
Mean F1 | 0.805 | 0.734 | 0.706 | 0.689 | 0.676 | 0.663 | 0.639 |
For vendor classification, the hardware category "mobile phone" was taken as an example. Eight mobile
phone vendors were chosen: Samsung, Apple, Huawei, OPPO, vivo, Meizu, Lenovo and Coolpad. The
proportion of correctly classified samples and the mean F1 were measured for different values of k;
the results are shown in Table 3.
Table 3. Correct vendor classification rate and mean F1 at different values of k
Parameter k | 1 | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
Correct classification rate | 42.4% | 36.0% | 34.7% | 35.6% | 31.8% | 35.6% | 33.5% | 31.4% |
Mean F1 | 0.422 | 0.350 | 0.339 | 0.328 | 0.295 | 0.319 | 0.299 | 0.281 |
200 texts under the mobile phone category were selected at random, and each sub-feature value was
processed according to the sensitivity level of its corresponding sub-feature word. The final
statistics are shown in Table 4.
Table 4. Shielding and replacement results for selected keywords
Sub-feature word | All-network access | China Mobile 4G | China Unicom 4G | China Telecom 4G | Laterally |
Occurrences of sub-feature word | 20 | 89 | 76 | 41 | 138 |
Correctly processed | 20 | 89 | 76 | 41 | 138 |
Accuracy | 100% | 100% | 100% | 100% | 100% |
Claims (6)
1. A method for protecting sensitive information about enterprise hardware facilities in social media, characterized by comprising the following steps:
Step 1: model building
(1) Building the hardware information library
Hardware information is obtained; multiple levels of information, including hardware category, vendor
and model, attributes and attribute values, are extracted and organized into an XML hierarchy to
build the hardware information library;
(2) Chinese word segmentation is performed on the hardware description information in the hardware information library;
(3) Building the hardware classification model and the hardware model matching algorithm
After the hardware description information in the hardware information library has been segmented,
the feature information of the categories is extracted first; then, on the basis of the category
classification, the feature information of the vendors is extracted and a vendor classification
model is built; finally, using the category and vendor classification information, a hardware model
matching algorithm is built to determine the model of the hardware;
(4) Building the keyword shielding and replacement model
For each hardware category, the attribute keywords appearing in the hardware description information
are divided into sensitivity levels, keywords at different sensitivity levels are handled in
different ways, and the keyword shielding and replacement model is built; the sensitivity levels are
0, 1, 2, 3 and 4; keywords at level 0 are left unchanged, keywords at level 4 are shielded directly,
and keywords at levels 1, 2 and 3 are handled through a keyword semantic tree; the keyword semantic
tree is built from the keywords at the different levels of the hardware information library according
to the XML structure; the keyword semantic tree has four layers, and the replacement policy based on
it is as follows:
A keyword at sensitivity level 1 is replaced by its parent node; a keyword at level 2 is replaced by
the parent of its parent node; a keyword at level 3 is replaced directly by the root node;
Step 2: detection and protection
After word segmentation of the input social media content, the category, vendor and model it belongs
to are determined according to the hardware classification model and the hardware model matching
algorithm of Step 1; once the model is determined, the keyword shielding and replacement model built
in Step 1 is used to apply to each attribute keyword in the segmented social media content the action
given by its sensitivity level and handling mode, namely shielding, replacement or no processing.
2. The sensitive information protection method according to claim 1, characterized in that the
hardware classification model classifies the hardware category and the hardware vendor by means of a
feature selection algorithm and a classification algorithm.
3. The sensitive information protection method according to claim 2, characterized in that, when the
hardware category is classified, the feature selection algorithm uses an improved information gain
method, and the specific calculation formula is as follows:
where t is a feature, c denotes a class, k denotes the number of classes, dis(t) denotes the
distribution of feature t among the classes, namely the ratio of the number of samples in which
feature t occurs to the total number of samples, P(t) denotes the probability that the feature
occurs, P(c) denotes the probability that the class occurs, and P(c, t) denotes the probability that
the feature and the class occur together; the remaining two terms denote the probability that the
feature does not occur and the probability that a sample in which the feature does not occur belongs
to class c;
the classification algorithm uses an improved KNN method, whose distance formula is as follows:
where x denotes an unclassified sample and y a classified sample, both being n-dimensional vectors in
which each dimension represents one feature value, IG'(t_i) denotes the information gain value of the
i-th feature t_i, x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n), d(x, y) denotes the distance
between x and y, and x_i, y_i denote the i-th feature values of the samples.
4. The sensitive information protection method according to claim 2, characterized in that, when the
hardware vendor is classified, the feature selection algorithm selects features by a feature
similarity method; features are selected according to the similarity between classes on each
feature; the similarity among p classes on feature t_i is defined by letting the p classes be c_1,
c_2, …, c_p and taking the mean of the similarities of every pair of classes on t_i;
if this similarity is greater than or equal to a threshold δ, feature t_i is considered too similar
across the p classes and is unsuitable as a classification feature; otherwise it can be used as a
classification feature;
the classification algorithm uses an improved KNN method in which the reciprocal of the similarity
serves as the feature weight in the KNN calculation; the specific KNN distance formula is as follows:
where c_i denotes the i-th class, p is the total number of classes, t_i denotes the i-th feature, n
is the total number of features, and x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n) denote the
unclassified sample and the classified sample respectively, each having n feature values x_i, y_i.
5. The sensitive information protection method according to claim 1, characterized in that the
hardware model matching algorithm uses hardware model sets: hardware models with the same attribute
value are placed in one set; the attribute values of the hardware to be matched are determined on
certain attributes, which determines the model sets the hardware belongs to; the intersection of
these sets is then taken to obtain the model of the hardware.
6. The sensitive information protection method according to claim 1, characterized in that the leaf
nodes at the bottom layer of the keyword semantic tree are the sub-feature words of the innermost
attribute keywords of the XML structure in the hardware information library, the second-to-last layer
of the semantic tree corresponds to the innermost attribute keywords of the XML structure in the
hardware information library, the third-to-last layer holds the second-layer attribute keywords of
the XML structure, and the fourth layer is the root node, which is the name of the hardware category.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610971014.7A CN106649262B (en) | 2016-10-31 | 2016-10-31 | Method for protecting sensitive information of enterprise hardware facilities in social media |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649262A true CN106649262A (en) | 2017-05-10 |
CN106649262B CN106649262B (en) | 2020-07-07 |
Family
ID=58821041
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610971014.7A Expired - Fee Related CN106649262B (en) | 2016-10-31 | 2016-10-31 | Method for protecting sensitive information of enterprise hardware facilities in social media |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649262B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120254085A1 (en) * | 2008-03-28 | 2012-10-04 | International Business Machines Corporation | Information classification system, information processing apparatus, information classification method and program |
US9245012B2 (en) * | 2008-03-28 | 2016-01-26 | International Business Machines Corporation | Information classification system, information processing apparatus, information classification method and program |
CN101827102A (en) * | 2010-04-20 | 2010-09-08 | 中国人民解放军理工大学指挥自动化学院 | Data prevention method based on content filtering |
CN104866465A (en) * | 2014-02-25 | 2015-08-26 | 腾讯科技(深圳)有限公司 | Sensitive text detection method and device |
CN104239436A (en) * | 2014-08-27 | 2014-12-24 | 南京邮电大学 | Network hot event detection method based on text classification and clustering analysis |
CN105426361A (en) * | 2015-12-02 | 2016-03-23 | 上海智臻智能网络科技股份有限公司 | Keyword extraction method and device |
CN105955978A (en) * | 2016-04-15 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for data leakage protection |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108390865A (en) * | 2018-01-30 | 2018-08-10 | 南京航空航天大学 | A kind of fine-grained access control mechanisms and system based on privacy driving |
CN111209735A (en) * | 2020-01-03 | 2020-05-29 | 广州杰赛科技股份有限公司 | Document sensitivity calculation method and device |
CN111209735B (en) * | 2020-01-03 | 2023-06-02 | 广州杰赛科技股份有限公司 | Document sensitivity calculation method and device |
CN112100646A (en) * | 2020-04-09 | 2020-12-18 | 南京邮电大学 | Spatial data privacy protection matching method based on two-stage grid conversion |
CN112000867A (en) * | 2020-08-17 | 2020-11-27 | 桂林电子科技大学 | Text classification method based on social media platform |
Also Published As
Publication number | Publication date |
---|---|
CN106649262B (en) | 2020-07-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200707 |