CN109189936A - Label semantic learning method based on network structure and semantic correlation measurement - Google Patents

Label semantic learning method based on network structure and semantic correlation measurement

Info

Publication number
CN109189936A
CN109189936A
Authority
CN
China
Prior art keywords
label
network
text
semantic
random walk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810914904.3A
Other languages
Chinese (zh)
Other versions
CN109189936B (en)
Inventor
王嫄
杨巨成
李政
赵婷婷
陈亚瑞
赵青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Contention Technology Co ltd
Original Assignee
Tianjin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Science and Technology filed Critical Tianjin University of Science and Technology
Priority to CN201810914904.3A
Publication of CN109189936A
Application granted
Publication of CN109189936B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a label semantic learning method based on network structure and semantic correlation measurement, comprising: initializing a tag network from observed user behavior to obtain the factual label network G; constructing the normalized label network G_R from G; applying an improved random-walk strategy on G_R to construct the random-walk-based label network G_C; constructing the label network G_T from the text associated with each label; and normalizing G_C and G_T and learning label semantic vector representations through the random-walk strategy and a word-vector learning method. The design is rational: it makes full use of the network topology while also taking into account the textual information attached to each node, so that label semantic vectors that are easy to operate on, highly reliable, sufficiently expressive, and low-noise can be learned in a short time from both topology and text. It can be widely applied to label-network learning and label semantics learning on tagged text collections.

Description

Label semantic learning method based on network structure and semantic correlation measurement
Technical field
The invention belongs to the field of network representation learning, and in particular relates to a label semantic learning method based on network structure and semantic correlation measurement.
Background technique
In network representation learning, text semantic learning chiefly concerns the feature representation of text, i.e., expressing a target text object (a word, sentence, paragraph, or article) in numerical form (a scalar, vector, or matrix). Given the needs of computation and of applied semantic modeling, commonly used models include Latent Semantic Analysis (LSA) based on singular value decomposition, Latent Dirichlet Allocation (LDA) based on a probabilistic model, and word-vector models solved with neural networks such as the Neural Network Language Model (NNLM) and word2vec; these models are mainly suited to long text. For highly sparse, highly noisy short text, researchers have proposed either expanding the original short-text corpus with external corpora such as WordNet, MeSH, and Wikipedia, or aggregating short texts into "pseudo-documents" according to fixed rules. Both approaches have obvious drawbacks: the former is limited by the external corpus, the latter by hand-crafted rules. Analysis of user behavior on short text shows that users extensively use tags to extend the information in short text. For tagged short text, researchers have exploited tags to build topic models such as Tag-LDA, TWTM (Tag-Weighted Topic Model), TWDA (Tag-Weighted Dirichlet Allocation), Labeled LDA, MB-LDA, and HGTM (Hashtag-Graph Based Topic Model), with good results; as a byproduct, these models also output vector representations of the tags.
Classification, clustering, outlier detection, and link prediction in network research all suffer from the sparsity of links in the network; representing the network nodes as vectors effectively avoids this problem, and learning such node vector representations is an important branch of network representation learning. Conventional methods mainly include Graph Factorization and Laplacian Eigenmaps. Random walks are widely used for node similarity learning on graphs: Perozzi et al. sample linear sequences from the network topology via random walks and, borrowing Mikolov's Skip-Gram word-vector learning method, learn node representations; however, this DeepWalk method learns only from the connections between nodes and does not account for the content information a node may carry. Yang et al. proved that DeepWalk is essentially equivalent to factorizing the network's transition probability matrix and, on this basis, proposed Text Associated DeepWalk, which introduces node features into a joint factorization. Tang et al. proposed LINE (Large-scale Information Network Embedding) for learning on large networks, but LINE considers only first- and second-order proximity between nodes, heavily pruning the propagation of information across the network, even though three- and four-step and farther associations between nodes are an important component of the network's global structure and are of great significance. Motivated by this, Cao et al. proposed GraRep, which defines different transfer matrices and different loss functions for different step counts so as to fully learn the association information between nodes.
In conclusion existing algorithm is difficult to solve original label network from the unconfined collaboration fact of user, information Height is made an uproar the problem of changeable, theme boundary is unintelligible or topic drift, semantic model fail.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by proposing a label semantic learning method based on network structure and semantic correlation measurement, which uses text information together with the network topology to mine stable association information that reflects label topics, obtains the core paths along which consensus semantics propagate among users, and learns label semantic representations on this basis, thereby solving the problems that the original label network arises from unconstrained user collaboration, is highly noisy and variable, has unclear topic boundaries or drifting topics, and defeats semantic models.
The present invention solves this technical problem with the following technical solution:
A label semantic learning method based on network structure and semantic correlation measurement, comprising the following steps:
Step 1: initialize a tag network from observed user behavior to obtain the factual label network G;
Step 2: construct the normalized label network G_R from the factual label network G;
Step 3: apply the improved random-walk strategy on G_R to construct the random-walk-based label network G_C;
Step 4: construct the label network G_T from the text associated with each label;
Step 5: normalize G_C and G_T, and learn label semantic vector representations through the random-walk strategy and a word-vector learning method.
Further, the factual label network G is as follows:
define the label network G = {V, E} based on co-occurrence within texts, where V is the set of all labels in the text collection D; if any labels i and j appear together in a text d, there is an edge between them, denoted e_ij; the weight g_ij of an edge in the network is defined as:
g_ij = |D_i,j|, where D_i,j = { d ∈ D : i ∈ h_d and j ∈ h_d },
with D_i,j the set of texts containing both labels i and j, and h_d the tag set of document d.
Further, the processing of step 2 is as follows:
first, account for random noise in semantic node associations: prune away the weak, highly random edges so as to constrain the network's effective edges and reduce noise; let the pruned incidence matrix be T; its element t_ij, the post-pruning association value of labels i and j, is:
t_ij = g_ij if g_ij ≥ δ, and t_ij = 0 otherwise,
where δ is the association threshold that truncates the lowest-frequency 20% of edges, and g_ij is the edge weight in the network;
then, account for the differing divergence of label nodes: for any edge in the network, adjust the weight of the label association according to the number of endpoints associated with the edge's two endpoints, enhancing label topic association:
t′_ij = t_ij * log(N/N(i)) * log(N/N(j)),
where t′_ij is the edge weight in the network G_R, N is the number of nodes in the network, N(i) and N(j) are the out-degrees of labels i and j in the network graph, and log(N/N(i)) is the logarithm of the reciprocal of the probability p(i) = N(i)/N that the current label is associated with a node.
Further, the improved random-walk strategy of step 3 samples the noisy complex network structure into multiple linear sequences, obtaining local microscopic description of the network in the manner of breadth-first search and global macroscopic information of the network in the manner of depth-first search; a window is slid over the sampled linear sequences to obtain new label associations and hence new edge weights.
Further, the processing of step 4 is as follows:
first, define W_i as the set of words having a co-occurrence relation with label i, and let w_ij denote the number of co-occurrences of label i and word j;
then, weight w using the inverse document frequency (IDF); the idf value of word t_i is computed as:
idf_i = log( |D| / |{ j : t_i ∈ d_j }| ),
where |D| is the number of documents in the collection and |{ j : t_i ∈ d_j }| is the number of documents containing word t_i;
next, compute the products w_ij · idf_j to obtain the text representation vector of each label;
finally, compute the pairwise cosine similarity of the label text vectors, define a truncation threshold cutting off the lowest 80% of edges, remove edges whose cosine similarity falls below the threshold, and label the weights of the retained edges with the cosine similarity values.
Further, in step 5, during the random walk on the network, a graph-sampling preference parameter is introduced to switch between the label networks G_T and G_C so as to benefit from both the network-structure and text-information associations; during semantic vector updating, each linear sequence obtained by the random walk is treated as a sentence, and the label semantics are learned from the left and right context.
The advantages and positive effects of the present invention are:
1. The present invention combines text semantic learning with network representation learning. Through multi-stage denoising, network normalization, random walks, and text-similarity computation, it reduces network noise and obtains the core network, and on this basis learns label vector representations. It not only makes full use of the network topology but also takes into account the textual information attached to each node, and can in a short time learn, from both topology and text, label semantic vectors that are easy to operate on, highly reliable, sufficiently expressive, and low-noise. It can be widely applied to label-network learning and label semantics learning on tagged text collections, and to various text-mining applications.
2. The design of the present invention is rational: on the basis of factual information it abstracts the core of the label association network, sampling the graph's edges and updating their weights, which helps obtain a stable label network structure, makes the model insensitive to noise interference, and improves modeling generalization.
Detailed description of the invention
Fig. 1 is a schematic diagram of the overall architecture of the invention.
Specific embodiment
Embodiments of the present invention are further described below with reference to the accompanying drawing.
The design philosophy of the invention is as follows: based mainly on statistical machine learning theory and text mining technology, highly credible label networks are constructed from user behavior data in order to learn pure semantic representations of labels. The label network is first initialized from the user behavior facts; it is then reconstructed using the normalization techniques and the improved random-walk technique; a further label network is built from the text similarity of the labels' associated texts; finally, label semantic vector representations are learned from the reconstructed network together with the text-similarity network. In the present invention, the network reconstruction can be regarded as an automatic filter that discovers strong multi-path topic associations among labels: weaker associations are filtered out and topic-irrelevant noisy associations are suppressed, which helps discover and strengthen a label network representation more consistent with, and closer to, the topic semantics, and thereby obtain topic-enhanced label semantic vectors.
Based on the above design philosophy, the label semantic learning method of the invention based on network structure and semantic correlation measurement, as shown in Fig. 1, comprises the following steps:
Step 1: initialize a tag network from observed user behavior to obtain the label network G based on the user behavior facts.
In the application field of Internet text, essentially "label = topic". A topic here may be fine-grained, such as a specific event, time, or place, or coarse-grained, such as news or an abstract aggregate concept. Labels appear as users write texts, and the relationships between labels are grounded in user behavior facts. Labels therefore naturally stand in co-occurrence relations with texts.
The most direct relationship is co-occurrence between labels within a single text. Define the label network G = {V, E} based on co-occurrence within texts, where V is the set of all labels in the text collection D. If any labels i and j appear together in a text d, there is an edge between them, denoted e_ij. The weight g_ij of an edge in this network is defined as the number of texts containing both labels, i.e.
g_ij = |D_i,j|, where D_i,j = { d ∈ D : i ∈ h_d and j ∈ h_d },
with D_i,j the set of texts containing both labels i and j, and h_d the tag set of document d. Intuitively, the more texts two labels appear in together, the larger the weight of their connecting edge, indicating a closer semantic association between them.
This yields the label network based on the user behavior facts.
Besides the within-text co-occurrence described above, this step can also construct factual label networks that indirectly exploit usage relations between users and labels, or co-occurrence relations between external links and labels.
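As a minimal sketch of the co-occurrence construction of step 1 (the document tag sets below are invented for illustration), the edge weights g_ij can be computed directly from the tag sets h_d:

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_network(tag_sets):
    """Edge weight g_ij = number of documents whose tag set h_d contains both label i and label j."""
    g = Counter()
    for h_d in tag_sets:
        # every unordered label pair inside one document gains one co-occurrence
        for i, j in combinations(sorted(h_d), 2):
            g[(i, j)] += 1
    return g

# toy tag sets h_d for three documents
docs = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
g = build_cooccurrence_network(docs)
print(g[("a", "b")])  # labels a and b co-occur in 2 documents
```

Sorting each tag set before pairing keeps every edge stored under one canonical key (i, j) with i < j, so both orders of co-occurrence accumulate into the same weight.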
Step 2: construct the normalized label network G_R from the factual label network G.
The label network built directly from the facts is rather noisy, and this noise mostly comes from the randomness and variability of user behavior. Randomness means that users add labels to texts arbitrarily, so many-words-one-meaning situations are likely, e.g. "Tsinghua University" versus "Tsing-Hua"; variability means that users understand the intent of the same text differently and therefore add different labels, e.g. for one text different users add the label "wrapped in a padded jacket" or "bikini".
To agglomerate semantics, this step rests on four simple assumptions: (1) the more often two labels are associated, the closer their topic association; (2) the more identical neighbors two labels share, the closer their topic association; (3) label sets of the same topic exhibit block- or community-like aggregation; (4) a label associated with a great many labels is, on the contrary, not closely associated with the topics of those other labels.
Random noise in semantic node associations is considered first. Weak, highly random edges are pruned away to constrain the network's effective edges and reduce noise. Let the pruned incidence matrix be T; its element t_ij is the post-pruning association value of labels i and j:
t_ij = g_ij if g_ij ≥ δ, and t_ij = 0 otherwise,
where δ is the association threshold that truncates the lowest-frequency 20% of edges.
Second, the differing divergence of label nodes is considered: for any edge in the network, the weight of the label association is adjusted according to the number of endpoints associated with the edge's two endpoints, enhancing label topic association:
t′_ij = t_ij * log(N/N(i)) * log(N/N(j)).
Here t′_ij is the edge weight in the network G_R, N is the number of nodes in the network, and N(i), N(j) are the out-degrees of labels i and j in the network, i.e. the number of edges associated with each vertex. log(N/N(i)) is the logarithm of the reciprocal of the probability p(i) = N(i)/N that the current label is associated with a node; it reflects the information content of label i in the network and may be called the inverse relation frequency (IRF, Inverse Relation Frequency). The larger this value, the better the label indexes the topic semantic space and the more effective information its topic associations carry.
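The pruning and IRF reweighting of step 2 can be sketched as below. Two assumptions are made for illustration: δ is passed in directly as a weight cutoff (the patent chooses it so that the lowest-frequency 20% of edges are truncated), and N(i) is counted on the pruned network:

```python
import math
from collections import Counter

def prune_and_reweight(g, delta):
    """t_ij = g_ij if g_ij >= delta else 0, then t'_ij = t_ij * log(N/N(i)) * log(N/N(j))."""
    kept = {e: w for e, w in g.items() if w >= delta}  # pruning: drop weak, random edges
    degree = Counter()                                 # N(i): edges associated with vertex i (assumed post-pruning)
    for i, j in kept:
        degree[i] += 1
        degree[j] += 1
    N = len({v for e in g for v in e})                 # number of nodes in the network
    return {(i, j): w * math.log(N / degree[i]) * math.log(N / degree[j])
            for (i, j), w in kept.items()}

g = {("a", "b"): 5, ("a", "c"): 1, ("b", "c"): 4, ("c", "d"): 3}
t_prime = prune_and_reweight(g, delta=2)  # edge ("a","c") is pruned
```

A node connected to nearly every other node gets log(N/N(i)) close to zero, so all of its edges are down-weighted, matching assumption (4) above.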
Step 3: apply the improved random-walk strategy on the normalized label network G_R obtained in step 2 to construct the random-walk-based label network G_C.
This step samples the nodes related to each node using the improved random-walk strategy. Step 2 gave us the edge weights t′_ij of the network; here two-degree transfer weights are defined to guide the transfer. After walking from label t to label i, the unnormalized transition probability of stepping next to label j is defined as:
π^t_ij = α(t, j) · t′_ij,
where α(t, j) is a regulatory factor defined piecewise according to d_tj, the shortest distance on the graph from label t to label j; only distances of at most 2 are defined here. Label t is the node preceding i on the random walk. When d_tj = 0, label t is label j itself, and π^t_ij refers to the probability of walking from label j to label i and then stepping from label i back to label j.
With π^t_ij normalized in this way, m random walks of n steps each are performed, yielding m paths of length n. A window of size s is slid over these paths, and for every pair of labels co-occurring within a window an edge is added between them, its weight incremented by 1. When all sliding is complete, each weight equals the total number of co-occurrence windows of the label pair; this yields the new label network G_C.
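The sampling-and-windowing stage of step 3 can be sketched as follows. For brevity the walk below is a plain first-order weighted walk — the two-degree regulatory factor α is omitted, since its exact piecewise values are not reproduced in the text above — so this is an assumption-laden simplification, not the patent's exact walk:

```python
import random
from collections import Counter, defaultdict

def weighted_walks(t_prime, m, n, seed=0):
    """Sample m walks of length n from every node; the next hop is drawn proportionally to t'_ij."""
    rng = random.Random(seed)
    adj = defaultdict(list)
    for (i, j), w in t_prime.items():
        adj[i].append((j, w))
        adj[j].append((i, w))
    walks = []
    for start in sorted(adj):
        for _ in range(m):
            walk = [start]
            for _ in range(n - 1):
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break
                nodes, weights = zip(*nbrs)
                walk.append(rng.choices(nodes, weights=weights)[0])
            walks.append(walk)
    return walks

def window_cooccurrence(walks, s):
    """Slide a window of size s over each walk; each label pair co-occurring in a window adds 1."""
    g_c = Counter()
    for walk in walks:
        for lo in range(len(walk) - s + 1):
            win = walk[lo:lo + s]
            for a in range(len(win)):
                for b in range(a + 1, len(win)):
                    if win[a] != win[b]:
                        g_c[tuple(sorted((win[a], win[b])))] += 1
    return g_c

g_c = window_cooccurrence([["a", "b", "c", "d"]], s=2)
```

With s = 2 the new weights simply count adjacent pairs along the walks; larger windows also connect labels that are two or more hops apart on a sampled path.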
Step 4: construct the label network G_T from the text associated with each label.
This step uses the texts associated with labels, i.e. builds a label network from the content features of the network nodes themselves. In the present invention, labels and text words stand in co-occurrence relations, and the more often a word co-occurs with a label, the better the word reflects the meaning the label represents. Define W_i as the set of words having a co-occurrence relation with label i, and let w_ij denote the number of co-occurrences of label i and word j.
For example, suppose the text collection contains three texts:
"#South Africa BRICS Summit# [Current-affairs focus on the BRICS summit: the city of gold applauds the BRICS]"
"#South Africa BRICS Summit# An economic community; science and technology know no borders"
"#South Africa BRICS Summit# BRICS countries, friendship endures; braving wind and rain together, sharing win-win. May our motherland grow ever stronger, joyful and at peace."
Here the weight w of the word "BRICS" for the label "#South Africa BRICS Summit#" is 3.
After the texts are represented, w is weighted using the inverse document frequency (IDF, Inverse Document Frequency). The idf value of word t_i is computed as:
idf_i = log( |D| / |{ j : t_i ∈ d_j }| ),
where |D| is the number of documents in the collection and |{ j : t_i ∈ d_j }| is the number of documents containing word t_i. The products w_ij · idf_j are then computed to obtain the text representation vector of each label.
The pairwise cosine similarity of the label text vectors is then computed. As in step 2, a truncation threshold cutting off the lowest 80% of edges is defined; edges whose cosine similarity falls below this threshold are removed, and the weights of the retained edges are labeled with the cosine similarity values.
Step 5: normalize the label networks G_C and G_T obtained in steps 3 and 4, and learn label semantic vector representations through the random-walk strategy and a word-vector learning method.
G_C from step 3 embodies label associations based on network structure, while G_T from step 4 embodies label associations based on semantic correlation. This step fuses the two networks G_C and G_T and learns the semantic representation of the labels on the fused network.
Since the weight ranges of the two networks differ, their weights are first scaled uniformly into [0, 1] using the linear function:
x_norm = (x − x_min) / (x_max − x_min),
which rescales the raw data proportionally; x_norm is the normalized value of x, and x_min and x_max are respectively the minimum and maximum edge weights in the network. The normalized networks are denoted G_C-norm and G_T-norm. Random walks are then performed on the two networks to obtain the neighbor sequence of each node; word2vec learning is carried out on these sequences, and the vectors of the label nodes are updated with the SkipGram method. The computation of the label semantic vectors Φ proceeds as follows:
Here Shuffle(·) reorders the label nodes to avoid bias caused by sequential processing. RandomWalk(G_C-norm, j, t) performs a random walk of t steps starting from j on the graph G_C-norm, producing a label node sequence W_j of length t; RandomWalk(G_T-norm, j, t) is analogous. SkipGram(Φ, W_j, d) is the word-vector learning method, which updates the d-dimensional label semantic vectors Φ according to the node sequence W_j.
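Assuming a single switching probability beta for the graph-sampling preference parameter (the patent names the parameter but not its symbol or exact mechanics), the normalization and fused-walk corpus generation of step 5 might look like:

```python
import random
from collections import defaultdict

def min_max_normalize(g):
    """x_norm = (x - x_min) / (x_max - x_min): scale edge weights into [0, 1]."""
    lo, hi = min(g.values()), max(g.values())
    if hi == lo:
        return {e: 1.0 for e in g}
    return {e: (w - lo) / (hi - lo) for e, w in g.items()}

def adjacency(g):
    adj = defaultdict(list)
    for (i, j), w in g.items():
        adj[i].append((j, w))
        adj[j].append((i, w))
    return adj

def fused_walks(g_c, g_t, t, beta, seed=0):
    """From each node, walk t steps; each hop samples G_C-norm with probability beta, else G_T-norm."""
    rng = random.Random(seed)
    adj_c, adj_t = adjacency(min_max_normalize(g_c)), adjacency(min_max_normalize(g_t))
    walks = []
    for start in sorted(set(adj_c) | set(adj_t)):
        walk = [start]
        for _ in range(t - 1):
            adj = adj_c if rng.random() < beta else adj_t
            nbrs = adj.get(walk[-1], [])
            if not nbrs:
                break
            nxt, wts = zip(*nbrs)
            # tiny epsilon guards against neighbors whose normalized weight is exactly 0
            walk.append(rng.choices(nxt, weights=[w + 1e-12 for w in wts])[0])
        walks.append(walk)
    return walks

walks = fused_walks({("a", "b"): 2.0, ("b", "c"): 4.0}, {("a", "c"): 1.0}, t=5, beta=1.0)
```

The resulting walk corpus would then be shuffled and fed, one sequence per "sentence", to a SkipGram learner (e.g. gensim's Word2Vec with sg=1) to update the d-dimensional label vectors Φ.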
It should be emphasized that the embodiments described are illustrative rather than restrictive; the invention therefore includes, but is not limited to, the embodiments given in the detailed description, and any other embodiments derived by those skilled in the art from the technical solution of the invention also fall within the scope of protection of the invention.

Claims (6)

1. A label semantic learning method based on network structure and semantic correlation measurement, characterized by comprising the following steps:
Step 1: initialize a tag network from observed user behavior to obtain the factual label network G;
Step 2: construct the normalized label network G_R from the factual label network G;
Step 3: apply the improved random-walk strategy on G_R to construct the random-walk-based label network G_C;
Step 4: construct the label network G_T from the text associated with each label;
Step 5: normalize G_C and G_T, and learn label semantic vector representations through the random-walk strategy and a word-vector learning method.
2. The label semantic learning method based on network structure and semantic correlation measurement according to claim 1, characterized in that the factual label network G is as follows:
define the label network G = {V, E} based on co-occurrence within texts, where V is the set of all labels in the text collection D; if any labels i and j appear together in a text d, there is an edge between them, denoted e_ij; the weight g_ij of an edge in the network is defined as:
g_ij = |D_i,j|, where D_i,j = { d ∈ D : i ∈ h_d and j ∈ h_d },
with D_i,j the set of texts containing both labels i and j, and h_d the tag set of document d.
3. The label semantic learning method based on network structure and semantic correlation measurement according to claim 1, characterized in that the processing of step 2 is as follows:
first, account for random noise in semantic node associations: prune away the weak, highly random edges so as to constrain the network's effective edges and reduce noise; let the pruned incidence matrix be T; its element t_ij, the post-pruning association value of labels i and j, is:
t_ij = g_ij if g_ij ≥ δ, and t_ij = 0 otherwise,
where δ is the association threshold truncating the lowest-frequency 20% of edges, and g_ij is the edge weight in the network;
then, account for the differing divergence of label nodes: for any edge in the network, adjust the weight of the label association according to the number of endpoints associated with the edge's two endpoints, enhancing label topic association:
t′_ij = t_ij * log(N/N(i)) * log(N/N(j)),
where t′_ij is the edge weight in the network G_R, N is the number of nodes in the network, N(i) and N(j) are the out-degrees of labels i and j in the network graph, and log(N/N(i)) is the logarithm of the reciprocal of the probability p(i) = N(i)/N that the current label is associated with a node.
4. The label semantic learning method based on network structure and semantic correlation measurement according to claim 1, characterized in that the improved random-walk strategy of step 3 samples the noisy complex network structure into multiple linear sequences, obtaining local microscopic description of the network in the manner of breadth-first search and global macroscopic information of the network in the manner of depth-first search; a window is slid over the sampled linear sequences to obtain new label associations and hence new edge weights.
5. The label semantic learning method based on network structure and semantic correlation measurement according to claim 1, characterized in that the processing of step 4 is as follows:
first, define W_i as the set of words having a co-occurrence relation with label i, and let w_ij denote the number of co-occurrences of label i and word j;
then, weight w using the inverse document frequency (idf); the idf value of word t_i is computed as:
idf_i = log( |D| / |{ j : t_i ∈ d_j }| ),
where |D| is the number of documents in the collection and |{ j : t_i ∈ d_j }| is the number of documents containing word t_i;
next, compute the products w_ij · idf_j to obtain the text representation vector of each label;
finally, compute the pairwise cosine similarity of the label text vectors, define a truncation threshold cutting off the lowest 80% of edges, remove edges whose cosine similarity falls below the threshold, and label the weights of the retained edges with the cosine similarity values.
6. The label semantic learning method based on network structure and semantic correlation measurement according to claim 1, characterized in that in step 5, during the random walk on the network, a graph-sampling preference parameter is introduced to switch between the label networks G_T and G_C so as to benefit from both the network-structure and text-information associations; during semantic vector updating, each linear sequence obtained by the random walk is treated as a sentence, and the label semantics are learned from the left and right context.
CN201810914904.3A 2018-08-13 2018-08-13 Label semantic learning method based on network structure and semantic correlation measurement Active CN109189936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810914904.3A CN109189936B (en) 2018-08-13 2018-08-13 Label semantic learning method based on network structure and semantic correlation measurement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810914904.3A CN109189936B (en) 2018-08-13 2018-08-13 Label semantic learning method based on network structure and semantic correlation measurement

Publications (2)

Publication Number Publication Date
CN109189936A true CN109189936A (en) 2019-01-11
CN109189936B CN109189936B (en) 2021-07-27

Family

ID=64921583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810914904.3A Active CN109189936B (en) 2018-08-13 2018-08-13 Label semantic learning method based on network structure and semantic correlation measurement

Country Status (1)

Country Link
CN (1) CN109189936B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948066A * 2019-04-16 2019-06-28 杭州电子科技大学 Point-of-interest recommendation method based on heterogeneous information networks
CN110413865A * 2019-08-02 2019-11-05 知者信息技术服务成都有限公司 Semantic representation model based on an alternating binary encoder representation model, and method therefor
CN110851491A * 2019-10-17 2020-02-28 天津大学 Network link prediction method based on multiple semantic influences of neighbor nodes
CN110889001A * 2019-11-25 2020-03-17 浙江财经大学 Large-graph sampling visualization method based on graph representation learning
CN111460118A * 2020-03-26 2020-07-28 聚好看科技股份有限公司 Artificial intelligence conflict semantic recognition method and device
CN111611498A * 2020-04-26 2020-09-01 北京科技大学 Network representation learning method and system based on intra-domain semantics
CN111723301A * 2020-06-01 2020-09-29 山西大学 Attention relation identification and labeling method based on a hierarchical topic-preference semantic matrix
CN112131569A * 2020-09-15 2020-12-25 上海交通大学 Risky-user prediction method based on graph-network random walks
CN112182511A * 2020-11-27 2021-01-05 中国人民解放军国防科技大学 Complex-semantics-enhanced heterogeneous information network representation learning method and device
CN112529621A * 2020-12-10 2021-03-19 中山大学 Advertisement-audience basic-attribute estimation method based on heterogeneous graph embedding
CN112580916A * 2019-09-30 2021-03-30 深圳无域科技技术有限公司 Data evaluation method and device, computer equipment, and storage medium
CN112688813A * 2020-12-24 2021-04-20 中国人民解放军战略支援部队信息工程大学 Routing-node importance ranking method and system based on routing characteristics
CN112989209A * 2021-05-10 2021-06-18 腾讯科技(深圳)有限公司 Content recommendation method, device, and storage medium
WO2021190091A1 * 2020-03-26 2021-09-30 深圳壹账通智能科技有限公司 Knowledge graph construction method and device based on knowledge-node affiliation degree

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956197A (en) * 2016-06-15 2016-09-21 杭州量知数据科技有限公司 Social risk event extraction method based on a social media graph representation model
CN106055604A (en) * 2016-05-25 2016-10-26 南京大学 Short-text topic model mining method based on word-network feature expansion
CN106874931A (en) * 2016-12-30 2017-06-20 东软集团股份有限公司 User portrait grouping method and device
CN107122455A (en) * 2017-04-26 2017-09-01 中国人民解放军国防科学技术大学 Microblog-based enhanced representation method for network users
CN107291803A (en) * 2017-05-15 2017-10-24 广东工业大学 Network representation method fusing multiple types of information
WO2018118546A1 (en) * 2016-12-21 2018-06-28 Microsoft Technology Licensing, Llc Systems and methods for an emotionally intelligent chat bot
CN108255809A (en) * 2018-01-10 2018-07-06 北京海存志合科技股份有限公司 Method for computing document topics with consideration of word similarity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Zhiyu, Liang Xun, Zhou Xiaoping, Zhang Haiyan, Ma Yuefeng: "A link prediction method based on node structural feature mapping in large-scale networks", Chinese Journal of Computers *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948066A (en) * 2019-04-16 2019-06-28 杭州电子科技大学 Point-of-interest recommendation method based on heterogeneous information networks
CN110413865A (en) * 2019-08-02 2019-11-05 知者信息技术服务成都有限公司 Semantic representation model and method based on a bidirectional encoder representation model
CN112580916A (en) * 2019-09-30 2021-03-30 深圳无域科技技术有限公司 Data evaluation method and device, computer equipment and storage medium
CN112580916B (en) * 2019-09-30 2024-05-28 深圳无域科技技术有限公司 Data evaluation method, device, computer equipment and storage medium
CN110851491A (en) * 2019-10-17 2020-02-28 天津大学 Network link prediction method based on multiple semantic influences of multiple neighbor nodes
CN110851491B (en) * 2019-10-17 2023-06-30 天津大学 Network link prediction method based on multiple semantic influence of multiple neighbor nodes
CN110889001A (en) * 2019-11-25 2020-03-17 浙江财经大学 Large-graph sampling visualization method based on graph representation learning
CN110889001B (en) * 2019-11-25 2021-11-05 浙江财经大学 Large-graph sampling visualization method based on graph representation learning
CN111460118A (en) * 2020-03-26 2020-07-28 聚好看科技股份有限公司 Artificial intelligence conflict semantic recognition method and device
CN111460118B (en) * 2020-03-26 2023-10-20 聚好看科技股份有限公司 Artificial intelligence conflict semantic recognition method and device
WO2021190091A1 (en) * 2020-03-26 2021-09-30 深圳壹账通智能科技有限公司 Knowledge graph construction method and device based on knowledge node membership degree
CN111611498A (en) * 2020-04-26 2020-09-01 北京科技大学 Network representation learning method and system based on domain internal semantics
CN111611498B (en) * 2020-04-26 2024-01-02 北京科技大学 Network representation learning method and system based on field internal semantics
CN111723301B (en) * 2020-06-01 2022-05-27 山西大学 Attention relation identification and labeling method based on hierarchical theme preference semantic matrix
CN111723301A (en) * 2020-06-01 2020-09-29 山西大学 Attention relation identification and labeling method based on hierarchical theme preference semantic matrix
CN112131569A (en) * 2020-09-15 2020-12-25 上海交通大学 Risk user prediction method based on graph network random walk
CN112131569B (en) * 2020-09-15 2024-01-05 上海交通大学 Risk user prediction method based on graph network random walk
CN112182511B (en) * 2020-11-27 2021-02-19 中国人民解放军国防科技大学 Complex semantic enhanced heterogeneous information network representation learning method and device
CN112182511A (en) * 2020-11-27 2021-01-05 中国人民解放军国防科技大学 Complex semantic enhanced heterogeneous information network representation learning method and device
CN112529621A (en) * 2020-12-10 2021-03-19 中山大学 Advertisement audience basic attribute estimation method based on heterogeneous graph embedding technology
CN112688813A (en) * 2020-12-24 2021-04-20 中国人民解放军战略支援部队信息工程大学 Routing node importance ordering method and system based on routing characteristics
CN112989209A (en) * 2021-05-10 2021-06-18 腾讯科技(深圳)有限公司 Content recommendation method, device and storage medium

Also Published As

Publication number Publication date
CN109189936B (en) 2021-07-27

Similar Documents

Publication Publication Date Title
CN109189936A (en) Label semantic learning method based on network structure and semantic correlation measurement
CN107133213B (en) Algorithm-based method and system for automatic text summarization
CN111581983A (en) Method for predicting social concern hotspots in network public opinion events based on group analysis
Li et al. Image sentiment prediction based on textual descriptions with adjective noun pairs
Dutta et al. A graph based clustering technique for tweet summarization
CN103544242A (en) Microblog-oriented emotion entity searching system
JP2021508866A (en) Promote area- and client-specific application program interface recommendations
JP2021508391A (en) Promote area- and client-specific application program interface recommendations
CN112559747A (en) Event classification processing method and device, electronic equipment and storage medium
CN116975271A (en) Text relevance determining method, device, computer equipment and storage medium
Fersini et al. A probabilistic relational approach for web document clustering
Yong et al. A new emotion analysis fusion and complementary model based on online food reviews
Li et al. A coarse-to-fine collective entity linking method for heterogeneous information networks
CN116467461A (en) Data processing method, device, equipment and medium applied to power distribution network
Xiao et al. Research on multimodal emotion analysis algorithm based on deep learning
Wang et al. A tutorial and survey on fault knowledge graph
Ravindranath et al. Inferring structure and meaning of semi-structured documents by using a gibbs sampling based approach
CN113297854A (en) Method, device and equipment for mapping text to knowledge graph entity and storage medium
Yun et al. Combining vector space features and convolution neural network for text sentiment analysis
Zhou et al. Spatiotemporal data cleaning and knowledge fusion
Lee et al. Exploiting online social data in ontology learning for event tracking and emergency response
Fan et al. Topic modeling methods for short texts: A survey
Tian et al. Label importance ranking with entropy variation complex networks for structured video captioning.
Ben-Lhachemi et al. Hashtag recommendation using word sequences’ embeddings
Jiang et al. Bidirectional LSTM-CRF models for keyword extraction in Chinese sport news

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240403

Address after: Room 1518B, Unit 2, 12th Floor, Huizhi Building, No. 9 Xueqing Road, Haidian District, Beijing, 100080

Patentee after: Beijing contention Technology Co.,Ltd.

Country or region after: China

Address before: No.9, 13th Street, economic and Technological Development Zone, Binhai New Area, Tianjin

Patentee before: TIANJIN University OF SCIENCE AND TECHNOLOGY

Country or region before: China
