CN105740329B - A content semantic mining method for unstructured big data streams - Google Patents
A content semantic mining method for unstructured big data streams
- Publication number
- CN105740329B (application number CN201610041935.3A)
- Authority
- CN
- China
- Prior art keywords
- node
- text
- label
- path
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a content semantic mining method for an unstructured big data stream, comprising: S1: extracting the text links, tag attributes, and semantic-tendency keywords from the big data stream, and defining them as text nodes, tag nodes, and content nodes, respectively; S2: constructing a text node set containing every text node and a tag node set containing every tag node, then calculating and outputting the weight between each text node and the tag nodes and the weight between any tag node and all other tag nodes; S3: semantically classifying the content nodes, according to the text node set, the tag node set, and these two kinds of weights, to construct distinct content node classification sets; S4: performing a weighted small-world network clustering computation on the text nodes, according to the text node set and the content node classification sets, to obtain text node cluster sets.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a content semantic mining method for an unstructured big data stream.
Background art
With the rapid development and adoption of Web 2.0 technology, network interaction through blogs, microblogs, WeChat, and similar channels has become an important means of exchanging information. The data these channels generate includes structured, semi-structured, and unstructured data, with unstructured data making up the majority. Propagated and updated by large numbers of people, this data accumulates over time into a big data set that is structurally complex, diverse in content, heterogeneous, and massive. It contains many kinds of information, such as users' evaluations of, attitudes toward, and behavior regarding particular things, events, goods, or services. Extracting this valuable content from such a huge body of data, so as to provide valuable services to enterprises, institutions, and individual users, is very important.
Take the microblog big data stream as an example. It is built around all kinds of real-time online microblog data, mainly in unstructured form, and mining such a stream centers on extracting, classifying, and clustering its core content. In addition, data generated in microblog-like services often carries the intentions and leanings of authors and commenters and therefore has strong semantic and emotional characteristics; that is, it reveals the attitude and leaning of an author or commenter toward some piece of content. How to extract the semantics of this content together with the sentiment orientation it contains, and combine that with content mining, is the key issue in mining unstructured microblog big data streams.
Existing data stream processing and mining methods include frequent pattern mining for traditional data set types and, for large data sets, frequent pattern mining, efficient pattern mining, sliding window control techniques, and so on. However, these methods can only handle structured or semi-structured data and struggle with unstructured data such as microblogs and blogs. They also ignore the semantics and sentiment orientation of the data content, which makes it hard to grasp the core points of that content correctly.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method that can perform content semantic mining and content semantic cluster analysis on an unstructured big data stream, and can grasp the semantics and sentiment orientation of the stream's content in a timely manner.
The technical solution of the present invention is a content semantic mining method for an unstructured big data stream, comprising the following steps:
Step S1: providing a big data stream, extracting the text links, tag attributes, and semantic-tendency keywords in the big data stream, and defining each text link as a text node, each tag attribute as a tag node, and each semantic-tendency keyword as a content node;
Step S2: constructing a text node set containing every text node and a tag node set containing every tag node, and calculating and outputting the weight between each text node and the tag nodes and the weight between any tag node and all other tag nodes;
Step S3: semantically classifying each content node and constructing distinct content node classification sets, according to the text node set, the tag node set, the weights between text nodes and tag nodes, and the weights between any tag node and all other tag nodes;
Step S4: performing a weighted small-world network clustering computation on the text nodes, according to the text node set and the content node classification sets, to obtain text node cluster sets.
Further, step S2 comprises the following steps:
Step S20: constructing a text node set containing every text node and a tag node set containing every tag node;
Step S21: marking the frequency of each text node and the feature values of the tag nodes;
Step S22: calculating and outputting the frequency from each text node to all tag nodes;
Step S23: calculating and outputting the feature-value frequency from any tag node to all other tag nodes.
Further, step S3 comprises the following steps:
Step S30: traversing all paths from each starting text node to each tag node;
Step S31: comparing the lengths of the paths and finding the shortest one, where the shortest path is the path with the largest weight;
Step S32: continuing to compare the lengths of the remaining paths and ranking them;
Step S33: performing steps S30 to S32 in turn for each remaining text node, to determine the ranking of the paths from each text node to all tag nodes;
Step S34: setting a path length threshold, and determining that the text nodes on paths whose length is less than or equal to the threshold meet the requirement;
Step S35: for the text nodes that meet the requirement, calculating the attributes of the corresponding tag nodes, semantically classifying each content node, and constructing distinct content node classification sets.
Further, step S4 comprises the following steps:
Step S40: calculating the path lengths from all starting text nodes to the different content nodes;
Step S41: determining the shortest path from a starting text node to the content node;
Step S42: finding, for all starting text nodes, the paths to the content nodes that have the shortest length;
Step S43: setting a length threshold for the shortest paths;
Step S44: clustering into one set the text nodes that match the content node's semantics and have the shortest path length, thereby obtaining a text node cluster set.
The beneficial effects of the technical solution are as follows. Text, tags, and content are extracted from the big data stream; weights are computed between texts and tags and between tags and tags; weighted content node sets are then constructed; and finally a weighted small-world network clustering computation yields the cluster set of text nodes corresponding to a given class of content node semantics. The content semantics and sentiment orientation in the big data stream can thus be grasped promptly and accurately. For microblog-like big data streams, the hot issues currently attracting public attention in microblog comment text can be understood in time and public opinion and its dynamics can be tracked, providing basic technical support for monitoring public opinion on microblogs.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a directed graph containing text stream nodes, tag nodes, and content semantic nodes, provided by an embodiment of the present invention;
Fig. 2 is a flow chart of the content semantic mining method provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the weight computation between texts and tags and between tags and tags, provided by an embodiment of the present invention;
Fig. 4 is a flow chart of constructing the weighted content node sets, provided by an embodiment of the present invention;
Fig. 5 is a flow chart of the weighted small-world network clustering computation, provided by an embodiment of the present invention;
Fig. 6 shows, for the verification test, the proportion of training-data simulation results that fall into the actual classification sets;
Fig. 7 compares, for the verification test, the numbers of correct, missing, and redundant edges between the generated groups and the actual groups.
Detailed description of the embodiments
The present invention is further explained below with reference to the drawings and specific embodiments.
The content semantic mining method disclosed in the embodiments of the present invention combines a weighted directed graph method with a small-world network clustering method to perform content semantic and sentiment orientation analysis and to extract content feature items from a data stream. Taking the microblog big data stream as an example, and as shown in Fig. 1, a weighted directed graph describes the semantics, sentiment orientation, and structure in the text and links of microblogs, and content semantic mining is carried out with the theory and methods of directed graph structures.
A directed graph can be represented by the pair $T = \langle V, E \rangle$, where $V$ is the set of nodes and $E$ is the set of edges between nodes. This embodiment uses the tagging and ordering properties of a directed graph to describe the text and tag attributes in links. The directed graph of a microblog big data stream contains three classes of nodes: text data nodes, tag attribute nodes, and content semantic nodes. A text data node represents big data text and links taken from microblogs; a tag attribute node corresponds to a specific tag together with its own attributes; and a content semantic node represents extracted content knowledge that has a specific semantics and sentiment orientation.
The strength of content semantics and sentiment orientation is calculated from the weights of paths in the directed graph. Directed edges connect text data nodes to tag attribute nodes, tag attribute nodes to other tag attribute nodes, and tag attribute nodes to content semantic nodes. The strength of each connection is expressed by a weight function $K$ attached to the edge. A given content semantics expressed by microblog text can thus be represented by a path carrying a series of weights and a series of tags, and the strength of the content semantics and sentiment orientation can be analyzed by computing over those weights.
All content nodes bearing on content semantics and sentiment orientation are extracted from the directed graph, and content cluster analysis on them yields content-based small-world network clusters. These clusters are the required content clusters, which carry semantics and sentiment orientation and have domain-specific features. The small-world network model differs from the scale-free model and the random graph model: it is a regular network with a certain degree of randomness, and adjusting its parameter moves it from a regular network toward a random network. Small-world network models tend to form clusters and groups, so they work well for clustering microblog information.
As in the analysis of other regular networks, random networks, and complex networks, small-world network clustering can use a clustering coefficient to describe how tightly network nodes aggregate; the coefficient characterizes the aggregation of the network structure. The clustering coefficient of a node represents how closely the node and its neighboring nodes are connected to one another; for microblog comments, it represents how strongly a given comment aggregates with all other associated microblogs.
For a node $V$ with $K$ connecting edges, the clustering coefficient can be expressed as: [equation image missing], where $k_v$ is the number of nodes that share a connecting edge with node $V$. Considering all microblog comment nodes in the full range gives: [equation image missing], where $n$ is the total number of network nodes and the coefficient $C$ is the clustering coefficient.
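As a hedged illustration only, the sketch below uses the standard local clustering coefficient for an undirected graph, $C_v = 2 e_v / (k_v (k_v - 1))$, where $e_v$ is the number of edges among the $k_v$ neighbors of $v$, and averages it over all $n$ nodes. This textbook definition is an assumption and may differ from the exact formula intended here; the function names and the toy graph are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Standard local clustering coefficient of node v in an undirected graph.

    adj maps each node to the set of its neighbours (assumed symmetric).
    """
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: fewer than two neighbours counts as zero
    # Count the edges that exist between pairs of neighbours of v.
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local coefficients over all n nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

# Hypothetical toy graph: a triangle (a, b, c) plus a pendant node d.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```

In the toy graph, nodes `a` and `b` sit in a closed triangle, node `c` has one linked neighbour pair out of three, and node `d` has a single neighbour.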
For the directed graph $T = \langle V, E \rangle$, define the directed edge $\langle v_i, v_j \rangle$ as $E_{ij}$. With $V = \langle U, L, W \rangle$, the directed graph can also be written $T = \langle U, L, W, E \rangle$, where $U$ is the set of text data nodes from microblogs, $L$ is the set of tag attribute nodes extracted from the text, $W$ is the set of content semantic nodes representing different content semantics and sentiment orientations, and $E$ is the set of directed edges between these nodes. Let $N$ denote the number of directed edges connected to any node (counting both out-degree and in-degree); the content semantic node $W$ can then be defined as: [equation image missing]
In this directed graph, a weight represents the frequency between a text data node and a tag attribute node carrying a given tag, the closeness between a tag attribute node and other tag attribute nodes, or the closeness between a tag attribute node and a content semantic node. $K_{ij}$ denotes the weight of the directed edge between node $V_i$ and node $V_j$.
In this embodiment, the function $N(V_{ij})$ is defined as the set of nodes in the directed edge set $E$ of $V_i$ that have a connecting edge whose other endpoint is $V_j$, with $V_j \in L$. The weight between nodes can then be defined as: [equation image missing]. For this directed graph, $\sum_{ij \in T} K_{ij} = |V(U, L, W)|$.
As shown in Fig. 2, the present invention provides a content semantic mining method for an unstructured big data stream, comprising the following steps:
Step S1: providing a big data stream, extracting the text links, tag attributes, and semantic-tendency keywords in the big data stream, and defining each text link as a text node, each tag attribute as a tag node, and each semantic-tendency keyword as a content node;
Step S2: constructing a text node set containing every text node and a tag node set containing every tag node, and calculating and outputting the weight between each text node and the tag nodes and the weight between any tag node and all other tag nodes;
Step S3: semantically classifying each content node and constructing distinct content node classification sets, according to the text node set, the tag node set, the weights between text nodes and tag nodes, and the weights between any tag node and all other tag nodes;
Step S4: performing a weighted small-world network clustering computation on the text nodes, according to the text node set and the content node classification sets, to obtain text node cluster sets.
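Step S1 can be illustrated with a minimal sketch. It assumes that text links are URLs, that tag attributes are hashtags and @-mentions, and that semantic-tendency keywords come from a small hand-made lexicon. The regular expressions, the `SENTIMENT_LEXICON` set, and the function name are all hypothetical; the source does not specify how extraction is performed.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"[#@]\w+")

# Hypothetical lexicon of semantic-tendency keywords (an assumption).
SENTIMENT_LEXICON = {"love", "hate", "great", "terrible"}

def extract_nodes(post):
    """Split one microblog post into text, tag, and content node candidates."""
    text_nodes = URL_RE.findall(post)
    # Remove URLs first so '#fragment' parts of links are not read as tags.
    stripped = URL_RE.sub(" ", post)
    tag_nodes = TAG_RE.findall(stripped)
    # Drop the tags, then keep only the words found in the lexicon.
    words = re.findall(r"[a-z']+", TAG_RE.sub(" ", stripped).lower())
    content_nodes = [w for w in words if w in SENTIMENT_LEXICON]
    return {"text": text_nodes, "tag": tag_nodes, "content": content_nodes}
```

A real system would likely replace the lexicon lookup with a proper sentiment or keyword extraction component; the lookup is used here only to keep the sketch self-contained.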
Further, as shown in Fig. 3, step S2 specifically comprises the following steps:
Step S20: constructing a text node set containing every text node and a tag node set containing every tag node;
Step S21: marking the frequency of each text node and the feature values of the tag nodes;
Step S22: calculating and outputting the frequency from each text node to all tag nodes;
Step S23: calculating and outputting the feature-value frequency from any tag node to all other tag nodes.
In this embodiment, the specific computation in step S2 is as follows:
Step a1: define parameters $i$ and $j$ and initialize them so that $i = 0$, $j = 1$;
Step a2: for text node $U_i$, calculate the frequency with which the feature values of tag node $L_j$ occur in the feature value set contained in $U_i$, denoted $N(V_{ij})$;
Step a3: take the next tag node $L_j$ and assign $j = j + 1$;
Step a4: return to step a2 and repeat until all tag nodes have been traversed, giving the frequency from text node $U_i$ to every tag node $L_j$;
Step a5: take tag node $L_k$ and calculate the feature-value frequency from $L_k$ to each other tag node $L_j$, denoted $N(V_{kj})$;
Step a6: take the next tag node $L_k$, assign $k = k + 1$, and return to step a5 until all nodes $L_k$ have been traversed;
Step a7: take the next text node $U_i$, assign $i = i + 1$, and return to step a2 until all text nodes have been traversed;
Step a8: calculate [equation image missing]
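The loops in steps a1 to a7 amount to building two frequency tables: text-to-tag counts and tag-to-tag counts. The sketch below is one possible reading, assuming each text node is represented by the list of tags it contains, that $N(V_{ij})$ is the number of times tag $j$ occurs in text $i$, and that the tag-to-tag count is the number of texts in which two tags co-occur. The formula of step a8 is not available, so the sketch stops at raw counts; all names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def frequency_tables(texts):
    """texts maps a text node id to the list of tags found in that text."""
    text_to_tag = {}        # (U_i, L_j) -> occurrence count
    tag_to_tag = Counter()  # (L_k, L_j) -> number of texts containing both
    for text_id, tags in texts.items():
        for tag, count in Counter(tags).items():
            text_to_tag[(text_id, tag)] = count
        # Every ordered pair of distinct tags in the same text co-occurs once.
        for pair in permutations(sorted(set(tags)), 2):
            tag_to_tag[pair] += 1
    return text_to_tag, tag_to_tag

# Hypothetical input: three posts and the tags extracted from each.
posts = {
    "u0": ["#tech", "#phone", "#tech"],
    "u1": ["#tech", "#price"],
    "u2": ["#phone"],
}
text_to_tag, tag_to_tag = frequency_tables(posts)
```

Iterating over dictionary items replaces the explicit index bookkeeping ($i$, $j$, $k$) of the step list but visits the same node pairs.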
The computation above yields the content semantic closeness between each text node and all tag nodes and between each tag node and the other tag nodes. By setting different thresholds, a directed edge can be added between any two nodes $V_i$, $V_j$ whose weight reaches a given range; the edge $E_{ij}$ runs from $V_i$ to $V_j$ and carries the weight $K_{ij}$. The text nodes and tag nodes then form a weighted directed graph. Starting from the text nodes, content semantic classification is carried out on each path of this weighted directed graph, forming content semantic sets with different semantics as their end points.
The basic method of content semantic classification is to compute the weight of each directed path by accumulating the edge weights along it and to find the paths whose total weight exceeds a threshold. The nodes on each such path represent one distinct content semantics and can be regarded as one class. In practice it is usually more convenient to work with path lengths than with path weights, so the reciprocal of a weight is taken as the length: the larger the weight between two nodes, the shorter the path between them, and vice versa. A set of paths is therefore sought whose total length lies within a threshold; the content semantics expressed by the nodes on these paths are close and can be assigned to the same class. Setting different thresholds admits different total path lengths, so the number of content semantic classes obtained, and the number of content semantic nodes in each, differs accordingly.
The length of a path is defined as the reciprocal of the weight between nodes, denoted $S(V_i, V_j) = 1 / K_{ij}$. When $K_{ij} = 0$, the path length between the nodes is therefore $S(V_i, V_j) = \infty$. A shorter path length between nodes indicates greater closeness, and a longer one indicates less. Finding the nodes traversed by all shortest paths identifies the content semantics close to those nodes; searching next for the second-shortest paths identifies another class of close content semantics. A threshold range is set and the content semantics corresponding to all paths within it are found; the set of all these different content semantics constitutes one content semantic classification.
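A minimal sketch of the reciprocal-weight path length follows. It assumes that the length of a multi-edge path is the sum of its edge lengths and uses Dijkstra's algorithm to find shortest paths; Dijkstra is a standard choice that the source does not name. The graph layout and all identifiers are hypothetical.

```python
import heapq
import math

def edge_length(weight):
    """Reciprocal of the edge weight; a zero weight gives infinite length."""
    return math.inf if weight == 0 else 1.0 / weight

def shortest_lengths(weights, source):
    """Dijkstra over a weighted digraph given as {node: {neighbour: K}}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale queue entry
        for nxt, k in weights.get(node, {}).items():
            cand = d + edge_length(k)
            if cand < dist.get(nxt, math.inf):
                dist[nxt] = cand
                heapq.heappush(heap, (cand, nxt))
    return dist

# Hypothetical graph: text node u0, tag nodes l1 and l2, content node w1.
graph = {
    "u0": {"l1": 4, "l2": 1},
    "l1": {"l2": 2, "w1": 2},
    "l2": {"w1": 0},
}
dist = shortest_lengths(graph, "u0")
```

In this example the route from `u0` to `l2` through `l1` (length 0.25 + 0.5) is shorter than the direct edge (length 1.0), because the two heavier edges translate into shorter lengths. The zero-weight edge from `l2` to `w1` never improves any distance.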
Further, as shown in Fig. 4, step S3 specifically comprises the following steps:
Step S30: traversing all paths from each starting text node to each tag node;
Step S31: comparing the lengths of the paths and finding the shortest one, where the shortest path is the path with the largest weight;
Step S32: continuing to compare the lengths of the remaining paths and ranking them;
Step S33: performing steps S30 to S32 in turn for each remaining text node, to determine the ranking of the paths from each text node to all tag nodes;
Step S34: setting a path length threshold, and determining that the text nodes on paths whose length is less than or equal to the threshold meet the requirement;
Step S35: for the text nodes that meet the requirement, calculating the attributes of the corresponding tag nodes, semantically classifying each content node, and constructing distinct content node classification sets.
In this embodiment, the specific computation in step S3 is as follows:
Step b1: initialize the path node set: $s \leftarrow U_i$; $R = \{s : U_i\}$;
Step b2: choose a tag node $L_j$ in the directed graph and test its path length [equation image missing]: if it is greater than the given threshold $\varepsilon$, the closeness is low and the node is discarded; otherwise, if the path length is less than or equal to $\varepsilon$, the closeness between the nodes is high and processing continues;
Step b3: if tag node $L_j$ is already in the path node set $R$, it has been processed, so go to step b2; if $L_j$ is not in $R$, add $L_j$ to $R$ and set node $s$ as the source node of $L_j$, expressed as [equation image missing], i.e. $R = R \cup \{L_j\}$;
Step b4: for each tag node $L_k$ on a path from $s$ to $L_j$, test:
(1) if the path $s \to L_k$ has the shortest distance, merge that path into $s$;
(2) if $L_k$ is an intermediate node of a path and also lies on the path from $s$ to $L_j$, delete $L_k$ and connect its source node $s$ to the destination node $L_j$;
(3) if the path from $L_k$ to $L_j$ lies on the path from $s$ to $L_j$, i.e. $L_k \to L_j$ is a subset of $s \to L_j$, merge that path into $L_j$, until no $L_k \to L_j$ node remains on any $s \to L_i$ path; if the node set of one of the resulting paths is a subset of the node set of another path, delete that path and merge its nodes into one node;
(4) calculate the global shortest path to node $L_j$: [equation image missing]
Step b5: take the next text node $U_i$ and return to step b1 until all text nodes to be processed have been traversed;
Step b6: choose and set a classification closeness threshold function $f(\xi)$ and proceed as follows:
(1) for a text node $U_i$, if [equation image missing], all nodes on its path form a valid path corresponding to one content semantic class, denoted $W_k$;
(2) list all valid paths satisfying [equation image missing], denoted $W_i$;
Step b7: output $W_i$ and the nodes on its paths.
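Because several formulas in steps b1 to b7 are missing, the sketch below covers only the general idea of steps S30 to S34: enumerate every simple path from a text node to the tag nodes, rank the paths by length, and keep those within a threshold. It assumes positive edge weights and a path length equal to the sum of reciprocal weights; the path merging of step b4 is not implemented, and all names and values are hypothetical.

```python
def simple_paths(graph, source, target, path=None):
    """Yield every simple path from source to target in {node: {nbr: K}}."""
    path = [source] if path is None else path
    if source == target:
        yield list(path)
        return
    for nxt in graph.get(source, {}):
        if nxt not in path:  # never revisit a node, so no cycles
            path.append(nxt)
            yield from simple_paths(graph, nxt, target, path)
            path.pop()

def path_length(graph, path):
    """Sum of reciprocal edge weights along the path (weights assumed > 0)."""
    return sum(1.0 / graph[a][b] for a, b in zip(path, path[1:]))

def ranked_paths(graph, text_node, tag_nodes, threshold):
    """Enumerate, rank, and threshold the paths from one text node."""
    found = []
    for tag in tag_nodes:
        for p in simple_paths(graph, text_node, tag):
            found.append((path_length(graph, p), p))
    found.sort()  # shortest (largest-weight) paths first
    return [(length, p) for length, p in found if length <= threshold]

# Hypothetical weighted digraph: text node u0 and tag nodes l1, l2, l3.
g = {
    "u0": {"l1": 4, "l2": 1},
    "l1": {"l2": 2, "l3": 1},
    "l2": {"l3": 5},
}
kept = ranked_paths(g, "u0", ["l1", "l2", "l3"], threshold=1.0)
```

Exhaustive enumeration is exponential in the worst case, so this form suits only small graphs; a shortest-path algorithm would be the usual replacement at scale.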
Through the computations in steps S2 and S3, the different content semantics reflected in the text node stream are obtained. These content semantics may be mined from a single text stream or from several, and different content semantics emerge as the text streams keep changing. Analyzing them makes it possible to understand in time the hot issues currently attracting public attention in microblog comment text and to track public opinion and its dynamics.
In microblog-type big data streams, besides the hot issues and sentiment orientation of current public concern, it is sometimes also necessary to know which text comments concern the same class of things, express the same class of emotion, or describe the same class of content. Cluster analysis of text content is therefore very important.
The core of text node clustering is to analyze the path lengths and node attributes along the paths from text nodes to the different content nodes, find the most essential content and the core semantic emotion that each text node reflects, and group text streams with identical or similar content semantics together, so that all text streams in a group tend to express identical or similar content and share common or similar semantics and sentiment orientation. In practice this reveals which microblogs express the same or similar content, and by extension which publishers, commenters, and disseminators of microblogs share identical or similar semantic emotion, providing basic technical support for monitoring public opinion on microblogs.
In this embodiment, text node clustering uses a method based on small-world network analysis. The two core parameters in small-world network analysis are the path length and the clustering coefficient, denoted $S$ and $C$. Determining these two parameters is crucial in text node clustering. Commonly, the associated paths between microblog nodes and their lengths are judged by counting behaviors such as reads, comments, upvotes, likes, and downvotes. That approach, however, has difficulty uncovering the content relevance of different participants and their personal semantic emotion, and so struggles to capture the real associations between microblog nodes. To solve this problem, the associations and closeness of microblog nodes are analyzed here through content semantics: the associated paths and path lengths between text nodes are computed through content semantics, and the nodes are then clustered as a small-world network using suitable clustering coefficients.
Let $G = \langle U, W, E \rangle$ be the microblog text stream network space, where $U$ and $W$ are the sets of text nodes and content semantic nodes, respectively, and $E$ is the global set of path edges. The goal of small-world network clustering of the microblog text stream is to find, with the content semantic nodes $W$ as intermediaries, network aggregation subspaces over the text nodes $U$ such that all text nodes within a subspace are closely related.
Further, as shown in Fig. 5, step S4 specifically comprises the following steps:
Step S40: calculating the path lengths from all starting text nodes to the different content nodes;
Step S41: determining the shortest path from a starting text node to the content node;
Step S42: finding, for all starting text nodes, the paths to the content nodes that have the shortest length;
Step S43: setting a length threshold for the shortest paths;
Step S44: clustering into one set the text nodes that match the content node's semantics and have the shortest path length, thereby obtaining a text node cluster set.
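Steps S40 to S44 can be read, in simplified form, as assigning each text node to its nearest content node and keeping only assignments within the length threshold. The sketch below follows that reading. It is a simplification that does not implement any clustering-coefficient optimization, and it assumes the path lengths from text nodes to content nodes are already known; the function name and sample values are hypothetical.

```python
from collections import defaultdict

def cluster_text_nodes(lengths, threshold):
    """Cluster text nodes by their nearest content node.

    lengths maps (text_node, content_node) to a path length.
    """
    nearest = {}  # text_node -> (shortest length, content_node)
    for (text_node, content_node), length in lengths.items():
        if text_node not in nearest or length < nearest[text_node][0]:
            nearest[text_node] = (length, content_node)
    clusters = defaultdict(set)
    for text_node, (length, content_node) in nearest.items():
        if length <= threshold:  # keep only sufficiently short paths
            clusters[content_node].add(text_node)
    return dict(clusters)

# Hypothetical path lengths from text nodes to content nodes.
sample = {
    ("u0", "w_pos"): 0.75,
    ("u0", "w_neg"): 2.50,
    ("u1", "w_pos"): 1.20,
    ("u2", "w_neg"): 0.40,
    ("u3", "w_neg"): 3.00,
}
clusters = cluster_text_nodes(sample, threshold=1.5)
```

Text node `u3` is left out of every cluster because even its shortest path exceeds the threshold.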
In this embodiment, the specific computation in step S4 is as follows:
Step c1: choose a clustering coefficient $C$, where [equation image missing] and $n$ is the number of all nodes $U_i$;
Step c2: remove a node $U_k$ from the network node space and calculate the value of [equation image missing];
Step c3: test: if [equation image missing], then let [equation image missing] and go to step c2, repeating until no node $U_k$ satisfies the condition; this yields a corresponding subset of $G$, $G^* = \langle U^*, E^*_{U,W} \rangle$, where [equation image missing];
Step c4: for subset $G^*$, repeatedly delete a path [equation image missing] so that the function [equation image missing] is minimized;
Step c5: for subset $G^*$, repeatedly add a path [equation image missing] so that the function [equation image missing] is minimized;
Step c6: during this repetition, when the deleted path is the same as the added path, the minimum subspace path distance [equation image missing] and the corresponding subset [equation image missing] are obtained; repeat steps c3 to c6 until all such subsets [equation image missing] have been found;
Step c7: adjust the clustering coefficient $C$ and go to step c1 to continue, obtaining all subsets under the new coefficient;
Step c8: calculate the minimum of the total path lengths of the subsets under each clustering coefficient value: [equation image missing]. The clustering coefficient $C^*$ and subset $G'$ that satisfy this condition form the small-world network clusters of text nodes with high content cohesion, high semantic similarity, and strong sentiment orientation, denoted: [equation image missing]
To verify the effectiveness of the embodiment, an experimental test was carried out on a public data set. The data set consists of Twitter posts: about 10 million tweets from about 200,000 users. The attributes of each record are ID, User, Date, Url, Hashtag, and MentionID. In this experiment the data set is divided into two parts, a training set and a test set. The training set provides the data for feature selection and for building the semantic emotion vectors, and the remainder serves as the test set. First, 10,000 records are drawn at random from the data set, ensuring that in about half of them the MentionID refers to an ID; these records form the training data. Then 1,000 further records are drawn such that the ID of each record is mentioned at least 10 times by the MentionID of other IDs; these form the test data set. To verify the accuracy of the results, the test data must also be analyzed manually: every record is classified by content semantics, and the correctness of the algorithm is verified by comparing its output with the manual classification. The manual classification is shown in Table 1 below:
[Table 1 missing]
The content semantic classification accuracy for different numbers of training records is shown in Fig. 6.
The accuracy of microblog node clustering is the other experimental metric. Within the same cluster, the
posts published by Twitter users should have similar attributes and similar semantics, that is, they should
be consistent in content and in sentiment orientation. We judge content consistency using the MentionID
field of the data set, expressed as the number of correct connection edges, missing connection edges, and
extra connection edges between user nodes within a cluster. The sentiment classification in Table 1 is used
to judge the accuracy of the orientation. Figure 7 shows the numbers of correct, missing, and extra edges in
the different clusters. In every cluster the number of correct edges far exceeds the number of extra
(erroneous) edges, which indicates that the clustering is highly accurate.
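As a rough illustration of the edge-based check described above, the three counts for one cluster can be obtained with set operations. The sketch assumes undirected edges, takes the MentionID relations among the cluster's members as ground truth, and uses the hypothetical function name `edge_counts`; none of these details are specified in the source.

```python
def edge_counts(cluster_edges, mention_pairs, members):
    """Count correct, missing, and extra edges for one cluster.

    cluster_edges: iterable of (user_a, user_b) edges produced by the clustering.
    mention_pairs: iterable of (user, mentioned_user) pairs taken from MentionID.
    members: set of users that belong to the cluster.
    Assumption: edges are undirected, so (a, b) and (b, a) are the same edge.
    """
    predicted = {frozenset(edge) for edge in cluster_edges}
    # Ground truth: mention relations whose two endpoints are both in the cluster.
    truth = {
        frozenset((a, b)) for a, b in mention_pairs
        if a in members and b in members and a != b
    }
    return {
        "correct": len(predicted & truth),  # edges confirmed by a mention
        "missing": len(truth - predicted),  # mentions with no cluster edge
        "extra": len(predicted - truth),    # cluster edges with no mention
    }
```

A cluster is judged accurate when `correct` is much larger than `extra`, which is the comparison the text reads off Figure 7.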
The above is only a preferred embodiment of the present invention. The protection scope of the present
invention is not limited to the above embodiment; all technical solutions falling under the concept of the
present invention belong to its protection scope. It should be noted that, for those of ordinary skill in
the art, improvements and modifications made without departing from the principles of the present invention
shall also be regarded as falling within the protection scope of the present invention.
Claims (3)
1. A content semantic mining method for an unstructured high-volume data stream, characterized by comprising the following steps:
Step S1: providing a high-volume data stream, and extracting the text links, tag attributes, and semantic
orientation keywords in the data stream, where each text link is defined as a text node, each tag attribute
as a label node, and each semantic orientation keyword as a content node;
Step S2: building a text node set containing each text node and a label node set containing each label
node, and calculating and outputting the weight between each text node and each label node and the weight
between any label node and all other label nodes;
Step S3: according to the text node set, the label node set, the weights between text nodes and label
nodes, and the weights between any label node and all other label nodes, semantically classifying each
content node and constructing different content node classification sets;
Step S4: according to the text node set and the content node classification sets, performing a weighted
small-world network cluster calculation on the text nodes to obtain the text node cluster sets.
2. The content semantic mining method for an unstructured high-volume data stream according to claim 1,
characterized in that step S2 comprises the following steps:
Step S20: building a text node set containing each text node and a label node set containing each label
node;
Step S21: marking the frequency of each text node and the feature value of each label node;
Step S22: calculating and outputting the frequency from each text node to all label nodes;
Step S23: calculating and outputting the feature frequency between any label node and all other label nodes.
3. The content semantic mining method for an unstructured high-volume data stream according to claim 1,
characterized in that step S3 comprises the following steps:
Step S30: traversing all paths from each starting text node to each label node;
Step S31: comparing the lengths of the paths and finding the path with the shortest length, where the
shortest path is the path with the maximum weight;
Step S32: continuing to compare the lengths of the remaining paths and sorting them;
Step S33: performing steps S30 to S32 on each of the remaining text nodes in turn, and determining the
ordering of the paths from each text node to all label nodes;
Step S34: setting a path length threshold, and determining that the text nodes on paths whose length is
less than or equal to the path length threshold meet the requirement;
Step S35: for the text nodes that meet the requirement, calculating the attributes of the corresponding
label nodes, semantically classifying each content node, and constructing different content node
classification sets.
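Steps S30 to S34 of claim 3 can be sketched as a path enumeration followed by a sort and a threshold test. The claims do not state how path length is derived from the weights, so this sketch makes an explicit assumption: edge weights lie in (0, 1], the weight of a path is the product of its edge weights, and its length is the negative logarithm of that product, so that the shortest path is exactly the path with the maximum weight. The graph layout and all function names are hypothetical, step S34 is interpreted as applying to the starting text node, and the classification of step S35 is not shown.

```python
import math

def all_simple_paths(graph, start, goal, path=None):
    """Yield every cycle-free path from start to goal (step S30)."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in graph.get(start, {}):
        if nxt not in path:
            yield from all_simple_paths(graph, nxt, goal, path)

def path_length(graph, path):
    """Assumed length: -log of the product of edge weights in (0, 1]."""
    return sum(-math.log(graph[a][b]) for a, b in zip(path, path[1:]))

def rank_paths(graph, text_nodes, label_nodes):
    """For each text node, sort its paths to all label nodes by length (S31 to S33)."""
    ranked = {}
    for t in text_nodes:
        paths = [p for l in label_nodes for p in all_simple_paths(graph, t, l)]
        ranked[t] = sorted(
            ((path_length(graph, p), p) for p in paths), key=lambda item: item[0]
        )
    return ranked

def qualifying_text_nodes(ranked, threshold):
    """Text nodes whose shortest path does not exceed the threshold (S34)."""
    return {t for t, paths in ranked.items() if paths and paths[0][0] <= threshold}
```

Exhaustive enumeration is exponential in the worst case; on a large graph a shortest-path algorithm over the same `-log` lengths would be the natural substitute, but the enumeration mirrors the traversal wording of step S30 more directly.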
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610041935.3A CN105740329B (en) | 2016-01-21 | 2016-01-21 | A kind of contents semantic method for digging of unstructured high amount of traffic |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105740329A CN105740329A (en) | 2016-07-06 |
CN105740329B true CN105740329B (en) | 2019-04-05 |
Family
ID=56247442
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610041935.3A Active CN105740329B (en) | 2016-01-21 | 2016-01-21 | A kind of contents semantic method for digging of unstructured high amount of traffic |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105740329B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106777124B (en) * | 2016-05-26 | 2018-06-22 | 中科鼎富(北京)科技发展有限公司 | Semantic knowledge method, apparatus and system |
CN111639181A (en) * | 2020-04-30 | 2020-09-08 | 深圳壹账通智能科技有限公司 | Paper classification method and device based on classification model, electronic equipment and medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103236978A (en) * | 2013-04-17 | 2013-08-07 | 清华大学 | Determination method and device of topologic top AS (autonomous system) nodes |
CN104035917A (en) * | 2014-06-10 | 2014-09-10 | 复旦大学 | Knowledge graph management method and system based on semantic space mapping |
CN105005554A (en) * | 2015-06-30 | 2015-10-28 | 北京信息科技大学 | Method for calculating word semantic relevancy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||