CN105740329B

CN105740329B - A content semantic mining method for unstructured big data streams

Info

Publication number: CN105740329B
Application number: CN201610041935.3A
Authority: CN
Inventors: 张少中
Original assignee: Zhejiang Wanli College
Current assignee: Zhejiang Wanli College
Priority date: 2016-01-21
Filing date: 2016-01-21
Publication date: 2019-04-05
Anticipated expiration: 2036-01-21
Also published as: CN105740329A

Abstract

The invention discloses a content semantic mining method for unstructured big data streams, comprising: S1: extracting the text links, tag attributes and semantic tendency keywords in a big data stream and defining them as text nodes, label nodes and content nodes respectively; S2: building a text node set containing every text node and a label node set containing every label node, and calculating and outputting the weights between text nodes and label nodes and the weights between any label node and all other label nodes; S3: according to the text node set, the label node set, the weights between text nodes and label nodes, and the weights between any label node and all other label nodes, semantically classifying each content node and building the different content node classification sets; S4: according to the text node set and the content node classification sets, performing a weighted small-world network clustering calculation on the text nodes to obtain the text node cluster sets.

Description

A content semantic mining method for unstructured big data streams

Technical field

The present invention relates to the field of computer technology, and in particular to a content semantic mining method for unstructured big data streams.

Background art

With the rapid development and application of Web 2.0 technology, online interaction through blogs, microblogs, WeChat and similar services has become an important mode of information exchange. The data these services generate include structured, semi-structured and unstructured data, with unstructured data making up the majority. Spread and updated by large numbers of people, these data accumulate over time into big data sets that are structurally complex, diverse in content, heterogeneous and massive. Such big data contain all kinds of information, for example users' evaluations of, attitudes toward and behavior regarding particular things, events, goods or services. Extracting this valuable content from the huge volume of data, so as to provide valuable services to enterprises, institutions and individual users, is very important.

Take the microblog big data stream as an example. It is built around real-time online microblog data of all kinds, mainly in unstructured form, and mining such a stream focuses on extracting, classifying and clustering its core content. In addition, data generated in microblog-like services often carry the intentions and tendencies of authors and commenters and so have strong semantic and sentiment features, that is, the attitudes and tendencies that an author or commenter expresses toward some content in a microblog. Extracting the semantics and sentiment tendencies contained in this content and combining them with content mining is the focus of content mining for unstructured microblog big data streams.

Existing data stream processing and mining methods include frequent pattern mining for traditional data sets, as well as frequent pattern mining, efficient pattern mining and sliding window control techniques for large data sets. On the one hand, however, these mining methods can only handle structured or semi-structured data and have difficulty with unstructured data such as microblogs and blogs; on the other hand, they do not consider the semantics and sentiment tendency of the data content, which makes it hard to grasp the core points of that content correctly.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method that performs content semantic mining and content semantic clustering on unstructured big data streams, so that the semantics and sentiment tendency of the stream content can be grasped in time.

The technical solution of the present invention is to provide a content semantic mining method for unstructured big data streams, comprising the following steps:

Step S1: provide a big data stream, extract the text links, tag attributes and semantic tendency keywords in the big data stream, and define each text link as a text node, each tag attribute as a label node, and each semantic tendency keyword as a content node;

Step S2: build a text node set containing each text node and a label node set containing each label node, and calculate and output the weights between the text nodes and the label nodes and the weights between any label node and all other label nodes;

Step S3: according to the text node set, the label node set, the weights between text nodes and label nodes, and the weights between any label node and all other label nodes, semantically classify each content node and build the different content node classification sets;

Step S4: according to the text node set and the content node classification sets, perform a weighted small-world network clustering calculation on the text nodes to obtain the text node cluster sets.

Further, step S2 comprises the following steps:

Step S20: build a text node set containing each text node and a label node set containing each label node;

Step S21: mark the frequency of the feature values of each text node and each label node;

Step S22: calculate and output the frequency of each text node with respect to all label nodes;

Step S23: calculate and output the feature-value frequency between any label node and all other label nodes.

Further, step S3 comprises the following steps:

Step S30: traverse all paths from each starting text node to each label node;

Step S31: compare the lengths of the paths and find the shortest one, where the shortest path is the path with the largest weight;

Step S32: continue comparing the lengths of the remaining paths and rank them;

Step S33: perform steps S30 to S32 in turn for each remaining text node, and determine the ranking of the paths from each text node to all label nodes;

Step S34: set a path length threshold, and determine that the text nodes on paths whose length is less than or equal to the threshold meet the requirement;

Step S35: for the text nodes that meet the requirement, calculate the attributes of the corresponding label nodes, semantically classify each content node, and build the different content node classification sets.

Further, step S4 comprises the following steps:

Step S40: calculate the path lengths from all starting text nodes to the different content nodes;

Step S41: determine the shortest path from a starting text node to the content node;

Step S42: find, among the paths from all starting text nodes to the content nodes, those that are shortest paths;

Step S43: set a length threshold for the shortest paths;

Step S44: cluster into one set the text nodes that match the content node semantics and have the shortest path length, thereby obtaining a text node cluster set.
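Steps S40 to S44 can be illustrated with a minimal Python sketch. It assumes that the path lengths from each text node to each reachable content node have already been computed, and it simply groups every text node under its nearest content node when that shortest length is within the threshold. The function name `cluster_texts`, the dictionary layout and the sample values are hypothetical, and the sketch does not attempt the weighted small-world network computation itself.

```python
def cluster_texts(lengths, threshold):
    """Group text nodes by their nearest content node.

    lengths maps text node -> {content node: path length} (assumed precomputed, S40).
    """
    clusters = {}
    for text, by_content in lengths.items():
        if not by_content:
            continue  # this text node reaches no content node
        # S41/S42: shortest path from this text node to any content node
        content, dist = min(by_content.items(), key=lambda kv: kv[1])
        # S43: keep only shortest paths within the length threshold
        if dist <= threshold:
            # S44: collect the text node into the cluster of that content node
            clusters.setdefault(content, []).append(text)
    return clusters


# Illustrative values only, not data from the patent.
lengths = {
    "U0": {"W0": 4.0, "W1": 9.0},
    "U1": {"W0": 6.0, "W1": 2.0},
    "U2": {"W0": 3.5},
    "U3": {"W1": 12.0},
}
clusters = cluster_texts(lengths, threshold=5.0)
```

With these sample values, U0 and U2 fall into the cluster of W0, U1 into the cluster of W1, and U3 is left out because its shortest path exceeds the threshold.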

The beneficial effects of the technical solution of the present invention are as follows. Text, labels and content are extracted from the big data stream, weights are calculated between texts and labels and between labels, weighted content node sets are then built, and finally a weighted small-world network clustering calculation is performed, which yields the cluster set of text nodes corresponding to a given class of content node semantics. The content semantics and sentiment tendency in the big data stream can thus be grasped promptly and accurately. For microblog-like big data streams, the hot issues that currently concern the public in microblog comment texts can be identified in time and public opinion and its dynamics can be tracked, providing basic technical support for monitoring public opinion on microblogs.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a directed graph comprising text stream nodes, label nodes and content semantic nodes, provided by an embodiment of the present invention;

Fig. 2 is a flowchart of the content semantic mining method provided by an embodiment of the present invention;

Fig. 3 is a flowchart of the weight calculation between texts and labels and between labels, provided by an embodiment of the present invention;

Fig. 4 is a flowchart of building the weighted content node sets, provided by an embodiment of the present invention;

Fig. 5 is a flowchart of the weighted small-world network clustering calculation provided by an embodiment of the present invention;

Fig. 6 shows, for the verification test, the proportion of training data simulation results that fall into the actual classification sets;

Fig. 7 compares, for the verification test, the numbers of correct, missing and extra edges between the generated groups and the actual groups.

Detailed description of the embodiments

The present invention will be further explained below with reference to the attached drawings and specific examples.

The content semantic mining method disclosed in the embodiments of the present invention combines a weighted directed graph method with a small-world network clustering method to analyze content semantics and sentiment tendency and to extract the content feature items of a data stream. Taking the microblog big data stream as an example, as shown in Fig. 1, a weighted directed graph is used to describe the semantics, sentiment tendency and structure of microblog texts and links, and the theory and methods of directed graph structures are used to carry out content semantic mining.

A directed graph can be represented by the pair T = <V, E>, where V is the set of nodes and E is the set of edges between nodes. In the present embodiment, the labeled and ordered nature of a directed graph is used to describe the texts and links and their label attributes. The directed graph of a microblog big data stream contains three classes of nodes: text data nodes, label attribute nodes and content semantic nodes. A text data node represents big data text or a link taken from microblogs; a label attribute node corresponds to a specific label and carries its own attributes; a content semantic node represents extracted content knowledge with specific semantics and sentiment tendency.

The strength of content semantics and sentiment tendency is calculated from the weights of the paths in the directed graph. Directed edges connect text data nodes to label attribute nodes, label attribute nodes to other label attribute nodes, and label attribute nodes to content semantic nodes, and the strength of each connection is represented by a weight function K attached to the edge. A given content semantics expressed by a microblog text can thus be expressed by a path carrying a series of weights and a series of labels, and the strength of content semantics and sentiment tendency can be analyzed by weight calculation.
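As an illustration of this structure, the following Python sketch stores the three node classes and the weighted directed edges in a small container. The class name `DirectedGraph`, its fields and the sample weights are hypothetical choices made for this example and do not come from the original text.

```python
from dataclasses import dataclass, field

# The three node classes named in the description.
TEXT, LABEL, CONTENT = "text", "label", "content"


@dataclass
class DirectedGraph:
    kinds: dict = field(default_factory=dict)    # node id -> TEXT / LABEL / CONTENT
    weights: dict = field(default_factory=dict)  # (src, dst) -> weight K_ij

    def add_node(self, node, kind):
        self.kinds[node] = kind

    def add_edge(self, src, dst, weight):
        # Both endpoints must already be declared as nodes.
        if src not in self.kinds or dst not in self.kinds:
            raise KeyError("unknown node")
        self.weights[(src, dst)] = weight

    def nodes_of(self, kind):
        return sorted(n for n, k in self.kinds.items() if k == kind)

    def successors(self, node):
        return {d: w for (s, d), w in self.weights.items() if s == node}


# A path text -> label -> label -> content, with illustrative weights.
g = DirectedGraph()
g.add_node("U1", TEXT)
g.add_node("L1", LABEL)
g.add_node("L2", LABEL)
g.add_node("W1", CONTENT)
g.add_edge("U1", "L1", 0.5)
g.add_edge("L1", "L2", 0.25)
g.add_edge("L2", "W1", 0.8)
```

Keeping the weights in a dictionary keyed by (source, destination) mirrors the notation K_ij and makes the later path calculations a matter of looking up consecutive pairs.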

All content nodes concerning content semantics and sentiment tendency are extracted from the directed graph, and content cluster analysis then yields content-based small-world network clusters. These clusters are the required content clusters with semantics, sentiment tendency and domain-specific features. The small-world network model differs from the scale-free model and the random graph model: it is a regular network with a certain degree of randomness, and by adjusting a parameter it can transition from a regular network to a random network. Small-world network models tend to form clusters and groups, so they work well for clustering microblog information.

As in the analysis of other regular networks, random networks and complex networks, small-world network cluster analysis can use the clustering coefficient to describe how strongly the network nodes aggregate, which characterizes the aggregation of the network structure. The clustering coefficient represents how tightly a node is connected with its neighboring nodes in the network; in microblog comments it represents the degree of aggregation between one microblog comment and all other associated microblog comments.

For a node V with K connecting edges, the clustering coefficient can be expressed as: [formula not reproduced], where k_v is the number of nodes that share a connecting edge with node V. Considering all microblog comment nodes in the range, [formula not reproduced], where n is the number of all network nodes and the coefficient C is the clustering coefficient.
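For orientation, the sketch below computes the standard local clustering coefficient of an undirected graph, 2e / (k(k - 1)) for a node with k neighbors and e edges among them, together with its average over all n nodes. Whether the patent uses exactly this form is an assumption; the function names and the sample graph are hypothetical.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Standard local clustering coefficient; adj maps node -> set of neighbors."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no neighbor pair can be linked
    # Count the edges that actually exist among the neighbors of v.
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local coefficients over all n nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Small undirected example: a triangle a-b-c plus a pendant node d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
c_a = local_clustering(adj, "a")
c_avg = average_clustering(adj)
```

In this example only one of the three neighbor pairs of node a is linked, so its coefficient is 1/3, while b and c each have coefficient 1 and d has 0.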

For the directed graph T = <V, E>, define the directed edge <v_i, v_j> as E_ij. With V = <U, L, W>, the directed graph can also be written T = <U, L, W, E>, where U is the set of text data nodes from microblogs, L is the set of label attribute nodes extracted from the text, W is the set of content semantic nodes representing different content semantics and sentiment tendencies, and E is the set of directed edges between these nodes. Let N denote the number of directed edges connected to a node (counting both out-degree and in-degree); then the content semantic node W can be defined as: [formula not reproduced]

In this directed graph, a weight represents the frequency between a text data node and a label attribute node carrying a given label, the closeness of the relation between a label attribute node and other label attribute nodes, or the closeness between a label attribute node and a content semantic node. K_ij denotes the weight of the directed edge between node V_i and node V_j.

In the present embodiment, define the function N(V_ij) as the set of nodes that have a connecting edge in the directed edge set E of V_i and whose other endpoint is V_j, with V_j ∈ L. The weight between nodes can then be defined as: [formula not reproduced]. For the directed graph, ∑_{ij∈T} K_ij = |V(U, L, W)|.

As shown in Fig. 2, the present invention provides a content semantic mining method for unstructured big data streams, comprising the following steps:

Step S1: provide a big data stream, extract the text links, tag attributes and semantic tendency keywords in the big data stream, and define each text link as a text node, each tag attribute as a label node, and each semantic tendency keyword as a content node;

Step S2: build a text node set containing each text node and a label node set containing each label node, and calculate and output the weights between the text nodes and the label nodes and the weights between any label node and all other label nodes;

Step S3: according to the text node set, the label node set, the weights between text nodes and label nodes, and the weights between any label node and all other label nodes, semantically classify each content node and build the different content node classification sets;

Step S4: according to the text node set and the content node classification sets, perform a weighted small-world network clustering calculation on the text nodes to obtain the text node cluster sets.

Further, as shown in Fig. 3, step S2 specifically comprises the following steps:

Step S20: build a text node set containing each text node and a label node set containing each label node;

Step S23: calculate and output the feature-value frequency between any label node and all other label nodes.

In the present embodiment, the specific calculation in step S2 proceeds as follows:

Step a1: define parameters i and j and initialize them so that i = 0, j = 1;

Step a2: for text node U_i, calculate the frequency with which the feature value of label node L_j occurs in the feature value set contained in U_i, denoted N(V_ij);

Step a3: take the next label node L_j by setting j = j + 1;

Step a4: return to step a2 and repeat until all label nodes have been traversed, which gives the frequency of text node U_i with respect to all label nodes L_j;

Step a5: take label node L_k and calculate the feature-value frequency of L_k with respect to the other label nodes L_j, denoted N(V_kj);

Step a6: take the next label node L_k by setting k = k + 1, return to step a5, and repeat until all nodes L_k have been traversed;

Step a7: take the next text node U_i by setting i = i + 1, and return to step a2 until all text nodes have been traversed;

Step a8: calculate [formula not reproduced]

The above calculation gives the content semantic closeness between each text node and all label nodes and between each label node and the other label nodes. By setting different thresholds, a directed edge is added between any two nodes V_i and V_j whose weight reaches a given range; the edge from V_i to V_j is denoted E_ij and its weight is K_ij. The text nodes and label nodes then form a weighted directed graph. Taking the text nodes as starting points and classifying the paths of this weighted directed graph by content semantics yields content semantic sets that have the different semantics as their end points.
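The counting and thresholding described above can be sketched in Python as follows. Two points are assumptions rather than statements of the patent: the label-to-label frequency is taken to be the number of texts in which both labels occur, and each weight is taken to be a node's count divided by the sum of its counts, which is at least consistent with the total of all weights equaling the number of nodes. All function names and sample data are hypothetical.

```python
from collections import Counter


def text_label_counts(texts, labels):
    """N(V_ij): how often label j's feature value occurs among text i's features."""
    counts = {}
    for text_id, features in texts.items():
        seen = Counter(features)
        counts[text_id] = {lab: seen[lab] for lab in labels if seen[lab] > 0}
    return counts


def label_label_counts(texts, labels):
    """Assumed co-occurrence count: number of texts in which both labels appear."""
    counts = {lab: Counter() for lab in labels}
    for features in texts.values():
        present = [lab for lab in labels if lab in features]
        for a in present:
            for b in present:
                if a != b:
                    counts[a][b] += 1
    return {lab: dict(c) for lab, c in counts.items()}


def weighted_edges(counts, threshold):
    """Normalize each node's counts to sum to 1 (assumption); keep edges >= threshold."""
    edges = {}
    for src, row in counts.items():
        total = sum(row.values())
        for dst, n in row.items():
            weight = n / total
            if weight >= threshold:
                edges[(src, dst)] = weight
    return edges


# Illustrative feature values only.
texts = {
    "U0": ["#ai", "#ml", "#ai", "@bob"],
    "U1": ["#ml", "@bob"],
}
labels = ["#ai", "#ml", "@bob"]
n_text = text_label_counts(texts, labels)
n_label = label_label_counts(texts, labels)
edges = weighted_edges(n_text, threshold=0.3)
```

Raising or lowering `threshold` changes how many directed edges survive, which is the effect the description attributes to setting different thresholds.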

The basic method of content semantic classification is to calculate the weights along each directed path, accumulate them, and find the paths whose total weight exceeds a given threshold. The nodes on each such path represent one distinct content semantics and can be regarded as one class. In practice it is usually more convenient to work with path lengths than with path weights, so the reciprocal of the weight is taken as the path length: the larger the weight between two nodes, the shorter the path between them, and the smaller the weight, the longer the path. The task is therefore to find those paths whose total length lies within a given threshold, which indicates that the content semantics expressed by the nodes on them are close and can be assigned to the same class. Setting different thresholds admits different total path lengths, so the number of content semantic classes obtained, and the number of content semantic nodes in each, differ accordingly.

The length of a path between two nodes is defined as the reciprocal of the weight between them, denoted S(V_i, V_j) = 1/K_ij. Therefore when K_ij = 0 the path length between the nodes is S(V_i, V_j) = ∞. A shorter path length between nodes indicates a higher closeness between them, and a longer one a lower closeness. Finding the nodes on all shortest paths gives the content semantics close to those nodes; searching next for the second-shortest paths gives another class of close content semantics. With a threshold range set, all content semantics corresponding to such paths are found, and the set of these different content semantics is one content semantic classification.
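The reciprocal relation between weight and path length can be written as a short Python sketch. The function names are hypothetical, and summing the edge lengths along a path is an assumption about how the total path length is formed.

```python
import math


def path_length(weight):
    """S(V_i, V_j) = 1 / K_ij, with a zero weight mapped to an infinite length."""
    return math.inf if weight == 0 else 1.0 / weight


def total_length(weights_on_path):
    """Total length of a path, taken here as the sum of its edge lengths."""
    return sum(path_length(w) for w in weights_on_path)


# A path whose edges have weights 0.5 and 0.25 has length 2 + 4.
example_total = total_length([0.5, 0.25])
```

A single zero-weight edge makes the whole path infinitely long, so such a path can never pass a finite length threshold.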

Further, as shown in Fig. 4, step S3 specifically comprises the following steps:

Step S31: compare the lengths of the paths and find the shortest one, where the shortest path is the path with the largest weight;

Step S32: continue comparing the lengths of the remaining paths and rank them;

Step S33: perform steps S30 to S32 in turn for each remaining text node, and determine the ranking of the paths from each text node to all label nodes;

Step S34: set a path length threshold, and determine that the text nodes on paths whose length is less than or equal to the threshold meet the requirement;

Step S35: for the text nodes that meet the requirement, calculate the attributes of the corresponding label nodes, semantically classify each content node, and build the different content node classification sets.

In the present embodiment, the specific calculation in step S3 proceeds as follows:

Step b1: initialize the path node set: s ← U_i; R = {s: U_i};

Step b2: choose a label node L_j in the directed graph and test it: if the path length is greater than the given threshold ε, the closeness is low and the node is discarded; otherwise, if the path length is less than or equal to ε, the closeness between the nodes is high and processing continues;

Step b3: if label node L_j is already in the path node set R, it has been processed, so go to step b2; if label node L_j is not in R, add L_j to R and set node s as the source node of L_j, that is, R = R ∪ {L_j, …};

Step b4: for each label node L_k on the paths from s to L_j, test:

(1) if the path s → L_k is the shortest, merge that path into s;

(2) if L_k is an intermediate node of a path and also lies on the path from s to L_j, delete L_k and connect its source node s directly to the destination node L_j;

(3) if the path from L_k to L_j lies on the path from s to L_j, that is, L_k → L_j is a subset of s → L_j, merge that path into L_j, until no L_k → L_j node remains on any path s → L_i; if, among the resulting paths, the node set of one path is a subset of the node set of another, delete that path and merge its nodes into a single node;

(4) calculate the global shortest path among the paths to node L_j: [formula not reproduced]

Step b5: take the next text node U_i and return to step b1, until all text nodes to be processed have been traversed;

Step b6: choose and set a classification closeness threshold function f(ξ), and proceed as follows:

(1) for a text node U_i, if [condition not reproduced], all nodes on its path form a valid path corresponding to one content semantic class, denoted W_k;

(2) list all valid paths that satisfy [condition not reproduced], denoted W_i;

Step b7: output W_i and the nodes on its paths.
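The overall effect of steps b1 to b7 can be approximated with a generic shortest-path search. The sketch below swaps in Dijkstra's algorithm over edge lengths 1/K_ij and a plain numeric threshold `epsilon` in place of the path-merging procedure and the threshold function f(ξ), so it is an illustration under those assumptions and not the procedure of this embodiment. All names and sample weights are hypothetical.

```python
import heapq
import math


def shortest_lengths(edges, source):
    """Dijkstra over lengths 1/K_ij; edges maps (src, dst) -> weight K_ij > 0."""
    adjacency = {}
    for (src, dst), weight in edges.items():
        adjacency.setdefault(src, []).append((dst, 1.0 / weight))
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, math.inf):
            continue  # stale queue entry, a shorter route was already found
        for nxt, length in adjacency.get(node, []):
            cand = dist + length
            if cand < best.get(nxt, math.inf):
                best[nxt] = cand
                heapq.heappush(heap, (cand, nxt))
    return best


def classify(edges, text_nodes, content_nodes, epsilon):
    """Map each content node to the text nodes that reach it within epsilon."""
    classes = {c: [] for c in content_nodes}
    for text in text_nodes:
        dist = shortest_lengths(edges, text)
        for c in content_nodes:
            if dist.get(c, math.inf) <= epsilon:
                classes[c].append(text)
    return classes


# Illustrative weights only.
edges = {
    ("U0", "L0"): 0.5, ("L0", "W0"): 0.5,   # U0 -> W0 has length 2 + 2 = 4
    ("U1", "L1"): 0.25, ("L1", "W0"): 0.5,  # U1 -> W0 has length 4 + 2 = 6
    ("U1", "L2"): 1.0, ("L2", "W1"): 1.0,   # U1 -> W1 has length 1 + 1 = 2
}
classes = classify(edges, ["U0", "U1"], ["W0", "W1"], epsilon=5.0)
```

With a threshold of 5.0, U0 is assigned to W0 and U1 to W1; the longer route from U1 to W0 is rejected, which corresponds to discarding nodes whose path length exceeds ε.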

Through the calculations in steps S2 and S3 above, the different content semantics reflected in the text node stream are obtained. These content semantics may be mined from a single text stream or from several, and different content semantics appear as the text streams keep changing. Analyzing them makes it possible to identify in time the hot issues that currently concern the public in microblog comment texts and to track public opinion and its dynamics.

In microblog-like big data streams, beyond the hot issues and sentiment tendencies of current public concern, it is sometimes also necessary to know which text comments address the same kind of thing, express the same kind of sentiment, or describe the same kind of content. Cluster analysis of text content is therefore very important.

Text node clustering mainly analyzes the path lengths and node attributes along the paths from text nodes to the different content nodes, in order to find the most essential content and the core semantic sentiment that each text node reflects. Text streams with identical or similar content semantics are placed in one group, so that all text streams in a set tend to express identical or similar content and share common or similar semantics and sentiment tendency. In practical applications this shows which microblogs express the same or similar content and, by extension, which publishers, commenters and forwarders of microblogs share identical or similar semantic sentiment, providing basic technical support for monitoring public opinion on microblogs.

In the present embodiment, text node clustering uses a method based on small-world network analysis. The two core parameters of small-world network analysis are the path length and the clustering coefficient, denoted S and C, and determining them is crucial for text node clustering. Ordinarily, the associated paths between microblog nodes and their lengths can be judged by counting behaviors such as reads, comments, bumps, likes and dislikes. That approach, however, has difficulty capturing the content relevance and personal semantic sentiment of the different participants, and so cannot accurately capture the true associations between microblog nodes. To solve this problem, the association and closeness of microblog nodes are analyzed here through content semantics: the associated paths and path lengths between text nodes are calculated from content semantics, and the nodes are then clustered as a small-world network using suitable clustering coefficients.

Let G = <U, W, E> be the microblog text stream network space, where U and W are the sets of text nodes and content semantic nodes respectively and E is the global set of path edges. The goal of small-world network clustering of microblog text streams is to use the content semantic nodes W as intermediaries and find network aggregation subspaces over the text nodes U, such that all text nodes within each aggregation subspace are closely related.

Further, as shown in Fig. 5, step S4 specifically comprises the following steps:

Step S41: determine the shortest path from a starting text node to the content node;

Step S43: set a length threshold for the shortest paths;

In the present embodiment, the specific calculation of step S4 proceeds as follows:

Step c1: choose a clustering coefficient C, where [formula not reproduced] and n is the number of all nodes U_i;

Step c2: take a node U_k out of the network node space and calculate the value of [formula not reproduced];

Step c3: test: if [condition not reproduced], then let [assignment not reproduced] and go to step c2, repeating until no node U_k satisfies the condition; this gives a corresponding subset G^* = <U^*, E_U, W^*> of G, where [formula not reproduced];

Step c4: for the subset G^*, repeatedly delete one path [formula not reproduced] so that the function [formula not reproduced] is minimized;

Step c5: for the subset G^*, repeatedly add one path [formula not reproduced] so that the function [formula not reproduced] is minimized;

Step c6: during this repetition, when the deleted path is identical to the added path, the minimum path distance of the subspace [formula not reproduced] and the corresponding subset [formula not reproduced] are obtained; repeat steps c3 to c6 until all such subsets [formula not reproduced] have been found;

Step c7: adjust the clustering coefficient C, go to step c1 and continue, obtaining all subsets under the new coefficient;

Step c8: calculate the minimum of the total path length of the subsets under each clustering coefficient value: [formula not reproduced]

The clustering coefficient C^* and the subset G′ that satisfy this condition give the small-world network cluster of text nodes with high content clustering, high semantic similarity and strong sentiment tendency, denoted: [formula not reproduced]

To verify the validity of the embodiment of the present invention, an experiment was run on a public data set. The data set comes from Twitter and contains about 10,000,000 Twitter posts from about 200,000 users. The attributes of each record are: ID, User, Date, Url, Hashtag and MentionID. In this experimental setup, the data set is divided into two parts, a training set and a test set: part of the data serves as training data for feature selection and for building the semantic sentiment vectors, and the remainder serves as the test set. First, 10,000 records are drawn at random from the data set, ensuring that about half of them are records whose ID is referred to by a MentionID; these form the training data. Then 1,000 records are drawn whose ID is referred to at least 10 times by the MentionID of other IDs; these form the test data set. To verify the accuracy of the results, the test data also have to be analyzed manually, with every record classified by content semantics, and the correctness of the algorithm is verified by comparing the training results with the manual classification. The manual classification is shown in Table 1 below:

The content semantic classification accuracy for different numbers of training records is shown in Fig. 6.

The accuracy of microblog node clustering is the other experimental measure. Within one cluster, the posts published by the Twitter Users should have similar attributes and similar semantics, that is, they should be consistent in content and in sentiment tendency. The MentionID field in the data set is used to judge content consistency, expressed as the numbers of correct connecting edges, missing connecting edges and extra connecting edges between the user nodes in a cluster. The sentiment classification in Table 1 is used to judge the accuracy of the tendency. The experimental results for correct, missing and extra edges in the different clusters are shown in Fig. 7: in every cluster the number of correct edges far exceeds the number of extra, erroneous edges, which indicates that the clustering is accurate.
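The three edge counts used in this comparison can be computed with simple set operations. The Python sketch below assumes that edges are undirected pairs of user nodes; the function name and the sample edges are hypothetical and are not data from the experiment.

```python
def compare_edges(predicted, actual):
    """Count correct, missing and extra edges, treating edges as undirected pairs."""
    pred = {frozenset(e) for e in predicted}
    act = {frozenset(e) for e in actual}
    return {
        "correct": len(pred & act),   # edges present in both groups
        "missing": len(act - pred),   # actual edges the clustering did not produce
        "extra": len(pred - act),     # produced edges with no actual counterpart
    }


# Illustrative edges only.
predicted = [("a", "b"), ("b", "c"), ("c", "d")]
actual = [("b", "a"), ("b", "c"), ("a", "c")]
result = compare_edges(predicted, actual)
```

Using `frozenset` makes ("a", "b") and ("b", "a") count as the same edge; for directed mention relations, plain tuples would be used instead.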

The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions within the concept of the present invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A content semantic mining method for unstructured big data streams, characterized by comprising the following steps:

Step S1: provide a big data stream, extract the text links, tag attributes and semantic tendency keywords in the big data stream, and define each text link as a text node, each tag attribute as a label node, and each semantic tendency keyword as a content node;

Step S2: build a text node set containing each text node and a label node set containing each label node, and calculate and output the weights between the text nodes and the label nodes and the weights between any label node and all other label nodes;

Step S3: according to the text node set, the label node set, the weights between text nodes and label nodes, and the weights between any label node and all other label nodes, semantically classify each content node and build the different content node classification sets;

Step S4: according to the text node set and the content node classification sets, perform a weighted small-world network clustering calculation on the text nodes to obtain the text node cluster sets.

2. The content semantic mining method for unstructured big data streams according to claim 1, characterized in that step S2 comprises the following steps:

Step S20: build a text node set containing each text node and a label node set containing each label node;

3. The content semantic mining method for unstructured big data streams according to claim 1, characterized in that step S3 comprises the following steps:

Step S31: compare the lengths of the paths and find the shortest one, where the shortest path is the path with the largest weight;

Step S32: continue comparing the lengths of the remaining paths and rank them;

Step S34: set a path length threshold, and determine that the text nodes on paths whose length is less than or equal to the threshold meet the requirement;