CN106970910A - Keyword extraction method and device based on a graph model - Google Patents

Keyword extraction method and device based on a graph model

Info

Publication number
CN106970910A
CN106970910A (application CN201710208956.4A)
Authority
CN
China
Prior art keywords
candidate keywords
word
value
vector
represent
Prior art date
Legal status
Granted
Application number
CN201710208956.4A
Other languages
Chinese (zh)
Other versions
CN106970910B (en)
Inventor
Wang Liang (王亮)
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201710208956.4A
Publication of CN106970910A
Application granted
Publication of CN106970910B
Legal status: Active

Classifications

    • G06F Electric digital data processing (G Physics; G06 Computing, calculating or counting)
    • G06F16/35 Information retrieval of unstructured textual data: clustering; classification
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/9024 Indexing and data structures: graphs; linked lists
    • G06F40/30 Handling natural language data: semantic analysis


Abstract

Embodiments of the invention provide a keyword extraction method and device based on a graph model. The method includes: obtaining a text to be processed and segmenting it into words to obtain the candidate keywords of the text; looking up the word vector of each candidate keyword in a word-vector model, the word-vector model including the word vectors of the candidate keywords; building a word-similarity matrix for the candidate keywords from the word vectors; and ranking the candidate keywords according to the word-similarity matrix to extract the keywords of the text to be processed. Embodiments of the invention effectively improve the accuracy of keyword extraction.

Description

Keyword extraction method and device based on a graph model
Technical field
The present invention relates to the field of keyword extraction, and in particular to a keyword extraction method and device based on a graph model.
Background art
Keywords, as the most representative words of a passage of text, are widely used in information retrieval, text classification, and similar applications. Graph-model-based keyword extraction in particular has been widely applied to search ranking, citation analysis, social networks, and natural language processing (for example keyword extraction and topic-sentence extraction). A graph model is the general name for a class of techniques that represent a probability distribution with a graph: a text can be mapped to a network whose nodes are words and whose edges are the association relations between words. Graph-model-based keyword extraction rests on two basic assumptions. 1. The quantity assumption: the more links a node has to other nodes, the more important that node is. 2. The quality assumption: the nodes connected to a node A differ in quality, and a high-quality node passes more weight through its links to other nodes, so a node A that is linked to by high-quality nodes is more important. The key to graph-model-based keyword extraction is therefore the computation of link weights, and the link weight between two nodes is the similarity between the corresponding words.
In existing graph-model-based keyword extraction, the text is split into constituent units (words, sentences) from which a graph model is built; the units are ranked by a voting mechanism, and the highest-ranked units are chosen as keywords. Specifically, the given text is first split into complete sentences. Each sentence is then segmented into words and part-of-speech tagged, yielding the words and their tags. Based on the words and tags, stop words such as prepositions, auxiliary words, conjunctions, and interjections are filtered out, words of the specified parts of speech (nouns, verbs, adjectives, and so on) are retained, and the retained words become the candidate keywords. A candidate-keyword graph model is then built, with the candidate keywords as nodes and the association relations between candidate keywords as edges, where the association relation between two candidate keywords is obtained by computing their similarity. In this graph-model-based method, the similarity between words is built with a sliding window: the words in each window vote for the adjacent windows, and the weight of a vote depends on a word's own vote count. Because adjacent windows share co-occurring words, the similarity between two words can equally be understood as obtained from their co-occurrence. Finally, iterative voting on the graph yields a ranking of the candidate keywords by vote count, and the top-ranked candidates are chosen as keywords.
However, the existing graph-model-based method can only obtain word-to-word similarity through co-occurrence, which over-weights repeated words: some candidate keywords that should not be keywords, but that are repeated many times (for example "content", "computing", "processing", "solution", "highest"), are extracted anyway, so extraction accuracy suffers. In addition, the extraction result is sensitive to the window size, which has to be set by hand. For example, if a sentence consists of the words w1, w2, w3, w4, w5 ... wn in order, and the window size is set to k, then w1, w2, w3 ... wk is a window, as are w2, w3, w4 ... wk+1 and w3, w4, w5 ... wk+2, and so on; the nodes corresponding to any two words in the same window are connected by an unweighted undirected edge. Different choices of window size can therefore lead to completely different results, which also keeps extraction accuracy low.
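The window-based construction criticized above can be sketched as follows. This is an illustrative reconstruction, not code from the patent: with window size k, every run of k consecutive words forms a window, and any two words sharing a window get an unweighted, undirected edge.

```python
# Sketch of the prior-art windowing scheme: build co-occurrence edges
# between any two words that appear in the same window of size k.

def cooccurrence_edges(words, k):
    edges = set()
    for i in range(len(words) - k + 1):
        window = words[i:i + k]
        for a in range(len(window)):
            for b in range(a + 1, len(window)):
                # store endpoints sorted so (u, v) and (v, u) are one edge
                edges.add(tuple(sorted((window[a], window[b]))))
    return edges

words = ["w1", "w2", "w3", "w4"]
print(len(cooccurrence_edges(words, 2)))   # 3 edges: w1-w2, w2-w3, w3-w4
print(len(cooccurrence_edges(words, 3)))   # 5 edges
```

Running the same sentence with k = 2 and k = 3 already yields different edge sets, which is exactly the window-size sensitivity the text describes.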
Summary of the invention
The purpose of embodiments of the present invention is to provide a keyword extraction method and device based on a graph model that improve the accuracy of keyword extraction. The specific technical scheme is as follows.
An embodiment of the invention discloses a keyword extraction method based on a graph model, the method including:
obtaining a text to be processed and segmenting it into words to obtain the candidate keywords of the text;
looking up the word vector of each candidate keyword in a word-vector model, the word-vector model including the word vectors of the candidate keywords;
building a word-similarity matrix for the candidate keywords from the word vectors; and
ranking the candidate keywords according to the word-similarity matrix of the candidate keywords and extracting the keywords of the text to be processed.
Optionally, building the word-similarity matrix of the candidate keywords from the word vectors includes:
computing, according to the formula
cos θ = ( Σ_{k=1..n} x1k · x2k ) / ( √(Σ_{k=1..n} x1k²) · √(Σ_{k=1..n} x2k²) )
the cosine of the angle between the word vectors of each pair of candidate keywords, where θ is the angle between the vectors, x1k is the k-th component of one candidate keyword's vector in the n-dimensional space, x2k is the k-th component of the other candidate keyword's vector, and n is the dimension of the vector space; and
building the candidate-keyword similarity matrix from the cosines of the word-vector angles.
Optionally, ranking the candidate keywords according to their word-similarity matrix includes:
applying the PageRank algorithm to the word-similarity matrix of the candidate keywords to obtain each candidate keyword's PageRank value;
ranking the candidate keywords by PageRank value to obtain their degrees of importance; and
extracting the keywords of the text to be processed according to the degrees of importance.
Optionally, applying the PageRank algorithm to the word-similarity matrix of the candidate keywords includes:
determining the initial value of the PageRank algorithm according to the order of the word-similarity matrix;
computing the initial feature-vector value of the candidate keywords according to the initial value and the word-similarity matrix;
computing the feature-vector value of the candidate keywords according to the formula
p_t = M^T · p_{t-1}
where, when t = 1, p_1 is the initial feature-vector value and p_0 the initial value; p_t is the feature-vector value of the word-similarity matrix at step t and p_{t-1} the feature-vector value at step t-1; M is the word-similarity matrix of the candidate keywords and M^T the transpose of the word-similarity matrix; and t, the number of the computation step, takes values greater than or equal to 1; and
when the norm of the difference between the feature-vector value at step t and the feature-vector value at step t-1 is smaller than the error tolerance of the PageRank algorithm, taking the feature-vector value at step t as the PageRank values of the candidate keywords.
Optionally, obtaining the text to be processed, segmenting it into words, and obtaining the candidate keywords of the text includes:
obtaining the text to be processed and segmenting it into stop words and words of the specified parts of speech, the stop words including at least prepositions, auxiliary words, conjunctions, and interjections, and the specified parts of speech including at least nouns, verbs, and adjectives; and
filtering out the stop words to obtain the words of the specified parts of speech, which are the candidate keywords of the text to be processed.
Optionally, the word vectors are obtained by word2vec training.
An embodiment of the invention also discloses a keyword extraction device based on a graph model, the device including:
an acquisition module for obtaining a text to be processed, segmenting it into words, and obtaining the candidate keywords of the text;
a lookup module for looking up the word vectors of the candidate keywords in a word-vector model, the word-vector model including the word vectors of the candidate keywords;
a processing module for building the word-similarity matrix of the candidate keywords from the word vectors; and
an extraction module for ranking the candidate keywords according to the word-similarity matrix of the candidate keywords and extracting the keywords of the text to be processed.
Optionally, the processing module includes:
a first computing unit for computing, according to the formula
cos θ = ( Σ_{k=1..n} x1k · x2k ) / ( √(Σ_{k=1..n} x1k²) · √(Σ_{k=1..n} x2k²) )
the cosine of the angle between the word vectors of each pair of candidate keywords, where θ is the angle between the vectors, x1k is the k-th component of one candidate keyword's vector in the n-dimensional space, x2k is the k-th component of the other candidate keyword's vector, and n is the dimension of the vector space; and
a construction unit for building the candidate-keyword similarity matrix from the cosines of the word-vector angles.
Optionally, the extraction module includes:
a second computing unit for applying the PageRank algorithm to the word-similarity matrix of the candidate keywords to obtain their PageRank values;
a sorting unit for ranking the candidate keywords by PageRank value to obtain their degrees of importance; and
an extraction unit for extracting the keywords of the text to be processed according to the degrees of importance.
Optionally, the second computing unit includes:
a first determining subunit for determining the initial value of the PageRank algorithm according to the order of the word-similarity matrix;
a first computing subunit for computing the initial feature-vector value of the candidate keywords according to the initial value and the word-similarity matrix;
a second computing subunit for computing the feature-vector value of the candidate keywords according to the formula
p_t = M^T · p_{t-1}
where, when t = 1, p_1 is the initial feature-vector value and p_0 the initial value; p_t is the feature-vector value of the word-similarity matrix at step t and p_{t-1} the feature-vector value at step t-1; M is the word-similarity matrix of the candidate keywords and M^T the transpose of the word-similarity matrix; and t, the number of the computation step, takes values greater than or equal to 1; and
a second determining subunit for taking, when the norm of the difference between the feature-vector value at step t and the feature-vector value at step t-1 is smaller than the error tolerance of the PageRank algorithm, the feature-vector value at step t as the PageRank values of the candidate keywords.
Optionally, the acquisition module includes:
an acquiring unit for obtaining the text to be processed and segmenting it into stop words and words of the specified parts of speech, the stop words including at least prepositions, auxiliary words, conjunctions, and interjections, and the specified parts of speech including at least nouns, verbs, and adjectives; and
a processing unit for filtering out the stop words to obtain the words of the specified parts of speech, which are the candidate keywords of the text to be processed.
Optionally, the word vectors are obtained by word2vec training.
The keyword extraction method and device based on a graph model provided by embodiments of the present invention compute the similarity between words in the text from word vectors and build a similarity matrix from them, so that the extracted keywords reflect, to some extent, their semantic importance in the current text. When the similarity matrix is built, the similarity between words does not rely on word co-occurrence but is computed from word vectors. This avoids the over-weighting of repeated words caused by co-occurrence during keyword extraction, requires no manually set window size, and selects, through semantic similarity, keywords that better match the topic of the document, improving the accuracy of keyword extraction. Of course, no single product or method embodying the present invention necessarily achieves all of the above advantages at once.
Brief description of the drawings
To explain the embodiments of the present invention or the prior art more clearly, the drawings needed in their description are introduced briefly below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a structural diagram of the graph model in an existing graph-model-based keyword extraction method;
Fig. 2 is a flowchart of a keyword extraction method based on a graph model provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a keyword extraction device based on a graph model provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical scheme in the embodiments of the present invention is described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained from the embodiments of the invention by those of ordinary skill in the art without creative effort fall within the scope of protection of the invention.
Graph-model-based keyword extraction is an effective way of extracting keywords. A graph model is the general name for a class of techniques that represent a probability distribution with a graph: a text can be mapped to a network whose nodes are words and whose edges are the association relations between words. As shown in Fig. 1, a structural diagram of the graph model in an existing graph-model-based keyword extraction method, w1, w2, w3 ... w10, w11 are the candidate keywords and also the nodes of the graph model; the lines between nodes form the edges and represent the association relations between the candidate keywords, and the thicker a line, the larger the weight of the edge, i.e. the stronger the association between the two keywords it connects. The present invention extracts keywords on the basis of such a graph model.
Referring to Fig. 2, a flowchart of a keyword extraction method based on a graph model provided by an embodiment of the present invention, the method comprises the following steps:
S201: obtain the text to be processed and segment it into words to obtain the candidate keywords of the text.
Specifically, the text to be processed is obtained and first segmented. The purpose of segmentation is to split the text into words according to certain rules so that candidate keywords can be extracted. Because Chinese is often expressed in words, phrases, and common sayings, Chinese word segmentation carries considerable uncertainty. The main segmentation methods at present are: string-matching (mechanical) segmentation, a mature and widely used approach whose core is matching the text against a dictionary vocabulary and whose key is the completeness of the dictionary; understanding-based segmentation, i.e. artificial-intelligence methods, which segment precisely but with complex algorithms; and statistical segmentation, whose advantage is recognizing unregistered words and proper nouns but which needs a large amount of training text. All of these achieve high segmentation accuracy and fast segmentation. Here, an existing segmentation method is applied to the text to be processed; stop words such as prepositions, auxiliary words, conjunctions, and interjections are filtered out automatically, words of the specified parts of speech such as nouns, verbs, and adjectives are retained, and the retained words are taken as the candidate keywords. In this way the candidate keywords of the text to be processed are obtained.
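Step S201 can be sketched as follows. This is a minimal illustration with hard-coded (word, part-of-speech) pairs; a real system would obtain the pairs from a segmenter and tagger such as jieba, and the tag set here is an assumption.

```python
# Hypothetical sketch of S201: keep only words of the specified parts of
# speech as candidate keywords, filtering out stop-word parts of speech.

STOP_POS = {"p", "u", "c", "e"}   # preposition, auxiliary, conjunction, interjection
KEEP_POS = {"n", "v", "a"}        # noun, verb, adjective

def candidate_keywords(tagged_words):
    """Filter a segmented, POS-tagged text down to candidate keywords."""
    return [w for w, pos in tagged_words if pos in KEEP_POS]

tagged = [("graph", "n"), ("of", "p"), ("model", "n"),
          ("extract", "v"), ("and", "c"), ("keyword", "n")]
print(candidate_keywords(tagged))  # ['graph', 'model', 'extract', 'keyword']
```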
S202: look up the word vectors of the candidate keywords in a word-vector model; the word-vector model includes the word vectors of the candidate keywords.
Generally, a neural network takes the words of a vocabulary as input, outputs a low-dimensional vector representation of each word, and then keeps optimizing its parameters by backpropagation; the output low-dimensional vectors are the parameters of the network's first layer. Neural models that generate word vectors fall into two kinds. One kind, such as the word-vector models trained by word2vec or GloVe (Global Vectors for Word Representation), exists precisely to generate word vectors; the other kind produces word vectors as a by-product. The two differ in the amount of computation they need and in their training objectives: word2vec and GloVe aim to train word vectors that capture semantic relations, so those vectors can be used in subsequent tasks, whereas if a subsequent task does not need semantic relations, vectors generated this way are of no particular use. The other kind of model trains its word vectors for a specific task. Of course, if the specific task is language modelling itself, the word vectors generated by the two kinds of model are very similar.
Specifically, to turn a natural-language-understanding problem into a machine-learning problem, one must first find a way to mathematize the symbols. Word vectors are the usual way of representing word features and have good semantic properties. A word vector is a multi-dimensional real-valued vector that encodes the semantic and grammatical relations of natural language; the value in each dimension represents a feature with some semantic or grammatical interpretation, so each dimension of a word vector can be called a word feature. Word vectors are written as distributed representations: dense, low-dimensional real-valued vectors in which each dimension represents a latent feature of the word. Such a representation captures useful syntactic and semantic properties by distributing the word's different syntactic and semantic features across its dimensions. Using a low-dimensional representation not only avoids the curse of dimensionality but also uncovers the associations between words: by computing the distance between two word vectors one obtains the similarity between the two words, which improves the semantic accuracy of the vectors.
The word-vector model contains the word vector corresponding to each candidate keyword. The vectors of the candidate keywords are looked up in the model chiefly in order to compute the distances between candidate keywords and hence the similarities between them. The present invention introduces word vectors into the existing graph-model-based keyword extraction method and computes the similarity between candidate keywords from them, avoiding the existing method's window-based construction of word-to-word similarity and its manually set window size, which keep candidate-keyword extraction accuracy low.
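Step S202 reduces to a lookup once a trained model is available. The sketch below stands in for that model with a plain dictionary; the words and vector values are made up for illustration, and a real system would load vectors trained by word2vec or GloVe.

```python
# Toy stand-in for S202: a "word-vector model" as a mapping from word to
# vector; looking up a candidate keyword's vector is a dictionary access.

model = {
    "graph":   [0.9, 0.1, 0.0],
    "model":   [0.8, 0.2, 0.1],
    "keyword": [0.1, 0.9, 0.3],
}

def lookup(words, model):
    # Skip words the model has no vector for (out-of-vocabulary words)
    return {w: model[w] for w in words if w in model}

vectors = lookup(["graph", "keyword", "unknown"], model)
print(sorted(vectors))  # ['graph', 'keyword']
```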
S203: build the word-similarity matrix of the candidate keywords from the word vectors.
Specifically, the cosine distance between two word vectors expresses how closely the corresponding words are related: by computing the cosine distances between word vectors, the similarities between candidate keywords are obtained. The similarities between the candidate keywords are numerical values, and these values form the elements of the word-similarity matrix, an N-by-N matrix. As shown in Table 1, A, B, C, D, E, F, G, H stand for the candidate keywords, and the values in the table are the cosine distances between their word vectors, i.e. the similarities between the candidate keywords.
Table 1
The similarity matrix of the candidate keywords, denoted M, is then built from these similarities: the entry of M in row i and column j is the similarity between the i-th and j-th candidate keywords.
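Step S203 can be sketched as pairwise cosine similarity over the looked-up vectors. The vectors below are illustrative; the code assumes non-zero vectors, since the cosine is undefined for a zero vector.

```python
import math

# Sketch of S203: build the word-similarity matrix M from word vectors
# by pairwise cosine similarity.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similarity_matrix(vectors):
    return [[cosine(a, b) for b in vectors] for a in vectors]

vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
M = similarity_matrix(vecs)
print(round(M[0][2], 4))  # 0.7071
```

The diagonal of M is 1 (every word is maximally similar to itself), matching the intuition that M is a symmetric similarity matrix.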
S204: rank the candidate keywords according to their word-similarity matrix and extract the keywords of the text to be processed.
Specifically, the keyword-ranking algorithm of the graph-model-based keyword extraction method is applied to the word-similarity matrix of the candidate keywords, yielding each candidate keyword's ranking score. The candidate keywords are then sorted by score, and finally the top-ranked candidate keywords are chosen as the keywords of the text to be processed, the number chosen depending on actual needs.
As can be seen, the keyword extraction method based on a graph model provided by this embodiment of the invention computes the similarity between words in the text from word vectors and builds a similarity matrix from them, so that the extracted keywords reflect, to some extent, their semantic importance in the current text. When the similarity matrix is built, the similarity between words does not rely on co-occurrence between words but is computed from word vectors. This avoids the over-weighting of repeated words caused by co-occurrence during keyword extraction, requires no manually set window size, and selects, through semantic similarity, keywords that better match the topic of the document, improving the accuracy of keyword extraction.
In an optional embodiment of the invention, building the word-similarity matrix of the candidate keywords from the word vectors includes:
computing, according to the formula
cos θ = ( Σ_{k=1..n} x1k · x2k ) / ( √(Σ_{k=1..n} x1k²) · √(Σ_{k=1..n} x2k²) )
the cosine of the angle between the word vectors of each pair of candidate keywords, where θ is the angle between the vectors, x1k is the k-th component of one candidate keyword's vector in the n-dimensional space, x2k is the k-th component of the other candidate keyword's vector, and n is the dimension of the vector space; and
building the candidate-keyword similarity matrix from the cosines of the word-vector angles.
Specifically, the similarity between two words is obtained by computing the distance between their word vectors, and that distance is computed through the cosine of the angle between the vectors. The present invention therefore computes the cosine of the angle between the word vectors of each pair of candidate keywords and then builds the candidate-keyword similarity matrix from those cosines.
The cosine of the angle between the word vectors of two candidate keywords is obtained from the n-dimensional cosine formula. In n-dimensional space, for two vectors A(x11, x12 ... x1n) and B(x21, x22 ... x2n), the cosine of the angle between A and B is
cos θ = ( Σ_{k=1..n} x1k · x2k ) / ( √(Σ_{k=1..n} x1k²) · √(Σ_{k=1..n} x2k²) )
where θ is the angle between A and B, x1k are the components of A, x2k are the components of B, and n is the dimension of the vector space.
In two-dimensional space, for two vectors A(x11, x12) and B(x21, x22), the cosine of the angle between A and B is
cos θ = ( x11·x21 + x12·x22 ) / ( √(x11² + x12²) · √(x21² + x22²) )
where θ is the angle between A and B, x11 and x12 are the components of A, and x21 and x22 are the components of B.
In three-dimensional space, for two vectors A(x11, x12, x13) and B(x21, x22, x23), the cosine of the angle between A and B is
cos θ = ( x11·x21 + x12·x22 + x13·x23 ) / ( √(x11² + x12² + x13²) · √(x21² + x22² + x23²) )
where θ is the angle between A and B, x11, x12, and x13 are the components of A, and x21, x22, and x23 are the components of B.
The cosines of vector angles in higher-dimensional spaces are too numerous to list here; every formula that satisfies the n-dimensional cosine formula above falls within the scope of protection of the invention.
In an embodiment of the invention, ranking the candidate keywords according to their word-similarity matrix includes:
applying the PageRank algorithm to the word-similarity matrix of the candidate keywords to obtain each candidate keyword's PageRank value.
Specifically, PageRank (page rank) is part of Google's ranking formula; it is Google's way of expressing the grade or importance of a web page and its criterion for weighing the quality of a website. The present invention ranks keywords on the principle of the PageRank algorithm: the PageRank algorithm is applied to the word-similarity matrix of the candidate keywords, and this iterative computation finally yields each candidate keyword's PageRank value.
ranking the candidate keywords according to their PageRank values to obtain the importance degree of each candidate keyword;
Here, the candidate keyword with the largest PageRank value is the keyword the user is most interested in when searching for keywords, and the other keywords decrease in turn; at the same time, the largest PageRank value also indicates that the candidate keyword is the most important. For example, if the resulting order of the candidate keywords is B: 1.47, H: 1.41, E: 1.39, A: 1.30, F: 1.14, G: 1.12, D: 1.09, C: 1.08, then candidate keyword B is the most important, and the importance degree of the other candidate keywords decreases in turn according to the order.
extracting the keywords of the pending text according to the importance degree.
Here, according to actual requirements, the top-N candidate keywords in the ranking are extracted as the keywords of the pending text.
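The ranking and top-N extraction described above can be sketched as follows, reusing the example PageRank values given earlier (B: 1.47, H: 1.41, ...). The function name is hypothetical.

```python
# PageRank values of the candidate keywords from the example above
scores = {'B': 1.47, 'H': 1.41, 'E': 1.39, 'A': 1.30,
          'F': 1.14, 'G': 1.12, 'D': 1.09, 'C': 1.08}

def top_n_keywords(scores, n):
    # Sort candidate keywords by PageRank value, descending, and keep the top N
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]

print(top_n_keywords(scores, 3))  # ['B', 'H', 'E']
```

With N = 3, the extracted keywords are B, H and E, matching the importance order in the example.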
In the embodiments of the present invention, calculating on the word similarity matrix of the candidate keywords according to the PageRank algorithm includes:
determining the initial value of the PageRank algorithm according to the order of the word similarity matrix;
Specifically, the initial value of the PageRank algorithm is determined according to the size N of the matrix, i.e.

$$p_0 = \left(\tfrac{1}{N}, \tfrac{1}{N}, \ldots, \tfrac{1}{N}\right)^{T}$$

where p_0 represents the initial value of the PageRank algorithm. Here, since the PageRank algorithm assumes that the probability of each web page is equal, it is assumed accordingly that the probability of each candidate keyword occurring is equal, i.e. 1/N, and (1/N, 1/N, …, 1/N)^T is taken as the initial value of the PageRank algorithm.

Calculating the initial feature vector value of the candidate keywords according to the initial value and the word similarity matrix;
Specifically, the initial feature vector value of the candidate keywords is calculated according to the formula

$$p_1 = M^{T} p_0$$

where p_1 represents the initial feature vector value of the PageRank algorithm, p_0 represents the initial value of the PageRank algorithm, M represents the word similarity matrix of the candidate keywords, and M^T represents the transpose of the word similarity matrix.
The feature vector value of the candidate keywords is calculated according to the formula

$$p_t = M^{T} p_{t-1}$$

where, when t = 1, p_1 represents the initial feature vector value and p_0 represents the initial value; p_t represents the feature vector value of the word similarity matrix at step t, p_{t-1} represents the feature vector value of the word similarity matrix at step t-1, M represents the word similarity matrix of the candidate keywords, M^T represents the transpose of the word similarity matrix, and t represents the step number of the calculation, with t greater than or equal to 1.
Specifically, the PageRank algorithm is an iterative, recursive algorithm: by repeatedly iterating on the word similarity matrix of the candidate keywords, the final PageRank value corresponding to each candidate keyword is obtained, which makes the extracted keywords more accurate.
When the norm of the difference between the feature vector value at step t and the feature vector value at step t-1 is less than the error tolerance of the PageRank algorithm, the feature vector value at step t is the PageRank value corresponding to the candidate keywords.
Here, since the calculation process of the vectors has errors, the PageRank algorithm can preset an error tolerance ∈. When the norm of the difference between the feature vector values at step t and step t-1 is less than the error tolerance of the PageRank algorithm, the PageRank value corresponding to the candidate keywords obtained at that point is more accurate, which helps to improve the extraction accuracy of the keywords.
The specific process of the algorithm is as follows:
First, the input to the PageRank algorithm is a stochastic (random), irreducible, aperiodic matrix M, the size N of the matrix, and an error tolerance ∈. Here, the matrix M is built from the term vectors, i.e. it is the word similarity matrix of the present invention, and the size N of the matrix is the order of the matrix. In addition, since the calculation process of the vectors has errors, the PageRank algorithm can preset an error tolerance ∈.
Then, the PageRank algorithm calculates the feature vector value of the candidate keywords through the following steps:
Step 1: determine the initial value of the PageRank algorithm according to the size N of the matrix, i.e. p_0 = (1/N, 1/N, …, 1/N)^T, where p_0 represents the initial value of the PageRank algorithm. Here, since the PageRank algorithm assumes that the probability of each web page is equal, it is assumed accordingly that the probability of each candidate keyword occurring is equal, i.e. 1/N, and (1/N, 1/N, …, 1/N)^T is taken as the initial value of the PageRank algorithm.
Step 2: set t = 0. Here t represents the step number of the PageRank calculation, so t = 0 indicates that the similarity matrix M has not yet been calculated on.
Steps 3 and 4: set t = t + 1 and begin the repeated calculation.
Step 5: calculate the feature vector value of the word similarity matrix according to the formula

$$p_t = M^{T} p_{t-1}$$

where p_t represents the feature vector value of the word similarity matrix at step t, p_{t-1} represents the feature vector value of the word similarity matrix at step t-1, M represents the word similarity matrix of the candidate keywords, and t represents the step number of the calculation. Here, since the PageRank algorithm is an iterative, recursive algorithm, the word similarity matrix M must be iterated on repeatedly in order to obtain the feature vector value of the word similarity matrix more accurately.
Step 6: δ = ||p_t − p_{t-1}||
Step 7: repeat until δ < ∈, i.e. the calculation stops only when the norm of the difference between the feature vector value of the word similarity matrix at step t and that at step t-1 is less than the error tolerance ∈.
Step 8: return p_t, the final feature vector value of the word similarity matrix.
Finally, the feature vector P is output, i.e. the final feature vector value p_t of the word similarity matrix.
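Steps 1 to 8 above can be sketched in Python as follows. This is a minimal illustration; the function name, the toy matrix and the default tolerance are hypothetical and not taken from the patent.

```python
import math

def pagerank_values(M, eps=1e-6, max_steps=100):
    """Iterate p_t = M^T p_{t-1} until ||p_t - p_{t-1}|| < eps.

    M is the N x N word-similarity matrix (assumed row-stochastic so the
    iteration converges); the initial value is p_0 = (1/N, ..., 1/N)^T.
    """
    n = len(M)
    p = [1.0 / n] * n                      # step 1: p0 with every entry 1/N
    for _ in range(max_steps):             # steps 3-4: t = t + 1, repeat
        # step 5: p_t = M^T p_{t-1} (entry j is the dot product of column j with p)
        p_next = [sum(M[i][j] * p[i] for i in range(n)) for j in range(n)]
        # step 6: delta = ||p_t - p_{t-1}|| (Euclidean norm of the difference)
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(p_next, p)))
        p = p_next
        if delta < eps:                    # step 7: stop when delta < eps
            break
    return p                               # step 8: return p_t

# Toy 2 x 2 row-stochastic similarity matrix (hypothetical values)
M = [[0.9, 0.1],
     [0.4, 0.6]]
print(pagerank_values(M))
```

The returned vector is the PageRank value of each candidate keyword; its entries sum to 1 because each row of the toy matrix sums to 1.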
In the embodiments of the present invention, obtaining the pending text and segmenting the pending text into words to obtain the candidate keywords corresponding to the pending text includes:
obtaining the pending text, and segmenting the pending text to obtain stop words and words of specified parts of speech, where the stop words include at least prepositions, auxiliary words, conjunctions and interjections, and the words of specified parts of speech include at least nouns, verbs and adjectives.
Specifically, the words obtained after segmenting the pending text can be divided into two classes: stop words and words of specified parts of speech. In information retrieval, to save storage space and improve retrieval efficiency, certain words or characters are automatically filtered out before or after processing natural-language data (or text); these words or characters are called stop words. The stop words are filtered out to obtain the words of the specified parts of speech, and the words of the specified parts of speech are the candidate keywords corresponding to the pending text. Here, stop words are words that occur in large numbers in a text but contribute almost nothing to characterizing its features; for example, function words in a text such as "I", "then", "is", "so" and "in addition" have no effect on the features of the text. To filter the stop words, a stop-word list must first be constructed, consisting mainly of the adverbs, conjunctions, prepositions, modal auxiliary words and so on mentioned above. After Chinese word segmentation, therefore, the stop words must be filtered out; this not only effectively increases the density of the keywords, but also greatly reduces the dimensionality of the text and avoids the "curse of dimensionality".
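The filtering step can be sketched as follows. The token list is assumed to come from a Chinese word segmenter (the segmentation itself is not shown), and the stop-word set below is a small hypothetical sample, not an actual stop-word list from the patent.

```python
# Hypothetical stop-word set: particles, conjunctions, adverbs, etc.
STOP_WORDS = {'的', '了', '是', '和', '然后', '此外'}

def filter_candidates(tokens, stop_words=STOP_WORDS):
    # Drop stop words; the remaining content words are the candidate keywords
    return [t for t in tokens if t not in stop_words]

tokens = ['关键词', '的', '提取', '是', '图', '模型']
print(filter_candidates(tokens))  # ['关键词', '提取', '图', '模型']
```

In practice the filter would also keep only tokens tagged as nouns, verbs or adjectives by the segmenter, per the specified-part-of-speech rule above.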
In the embodiments of the present invention, the term vectors are trained by word2vec, which expresses words in vector form.
Specifically, word2vec is an efficient tool open-sourced by Google in 2013 for representing words as real-valued vectors. Drawing on ideas from deep learning, it reduces, through training, the processing of text content to vector operations in a K-dimensional vector space, and similarity in the vector space can be used to represent semantic similarity of the text. Word2vec uses the Distributed Representation form of term vectors, first proposed by Hinton in 1986. Its basic idea is to map, through training, each word to a K-dimensional real-valued vector (K is generally a hyperparameter of the model) and to judge the semantic similarity between words by the distance between their vectors (such as cosine similarity or Euclidean distance). It uses a three-layer neural network: input layer - hidden layer - output layer. A core technique is Huffman coding according to word frequency, so that the hidden-layer activations of words with similar frequencies are basically the same; the higher the frequency of occurrence of a word, the fewer hidden-layer units it activates, which effectively reduces the computational complexity. Based on deep learning, the word2vec algorithm reduces the processing of text content, through model training, to vector operations in a K-dimensional vector space. Since similarity in the vector space represents semantic similarity of the text, words can be converted into vectors and synonyms can be found.
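Once trained, a term vector model behaves as a lookup from word to K-dimensional vector, and semantic similarity is the cosine between the looked-up vectors. The sketch below uses a toy hand-written model in place of real word2vec output (in practice the vectors would come from training, e.g. with a word2vec implementation such as gensim); the words and vector values are hypothetical.

```python
import math

# Toy term-vector model: word -> K-dimensional real-valued vector
# (values are illustrative; a real model is produced by word2vec training)
term_vector_model = {
    '电影': [0.9, 0.1, 0.0],
    '影片': [0.85, 0.15, 0.05],
    '足球': [0.0, 0.2, 0.95],
}

def semantic_similarity(w1, w2, model=term_vector_model):
    # Look up the two term vectors and return their included-angle cosine
    a, b = model[w1], model[w2]
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Near-synonyms get a higher cosine than unrelated words
print(semantic_similarity('电影', '影片') > semantic_similarity('电影', '足球'))  # True
```

This is exactly the property the method relies on: words with similar meanings ("电影"/"影片", i.e. "movie"/"film") lie close together in the vector space even when they never co-occur.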
Compared with the existing keyword extraction methods, the keyword extraction method based on a graph model proposed by the present invention achieves better results. Table 2 shows the ranking of the keywords obtained by the extraction method proposed by the present invention, contrasted with the ranking of the keywords obtained by the existing extraction method.
Table 2
As can be seen from Table 2, the 1st and 2nd texts are short texts. Since each candidate keyword appears only once in such a text, each candidate keyword has the same probability of being extracted as a keyword; thus with the existing keyword extraction method, the keywords of text 1 and text 2 cannot be extracted accurately, whereas the keyword extraction method provided by the present invention obtains a ranking of the candidate keywords and can therefore extract the keywords. The 3rd text is a long text, and the candidate keywords appearing in it also occur repeatedly in the text. From the results it can be seen that in the keyword ranking obtained by the existing extraction method, words such as "popularity", "reporter", "media", "leave for" and "quite by" have no actual meaning as keywords and are taken as candidate keywords merely because they are repeated many times in the text; the keyword ranking obtained by the extraction method proposed by the present invention makes the extraction accuracy of the keywords higher.
Referring to Fig. 3, Fig. 3 is a structural diagram of a keyword extraction apparatus based on a graph model provided by an embodiment of the present invention. The apparatus includes the following modules:
an acquisition module 301, configured to obtain a pending text and segment the pending text to obtain the candidate keywords corresponding to the pending text;
a searching module 302, configured to search a term vector model for the term vectors corresponding to the candidate keywords, the term vector model including the term vectors of the candidate keywords;
a processing module 303, configured to build the word similarity matrix of the candidate keywords according to the term vectors;
an extraction module 304, configured to rank the candidate keywords according to the word similarity matrix of the candidate keywords and extract the keywords of the pending text.
Further, the processing module 303 includes:
a first computing unit, configured to calculate, according to the formula

$$\cos(\theta)=\frac{\sum_{k=1}^{n} x_{1k}x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\,\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$

the cosine of the angle between the term vectors corresponding to the candidate keywords, where θ represents the angle between the vectors of the candidate keywords, x_1k represents the characteristic value of the vector corresponding to one of the candidate keywords in the n-dimensional space, x_2k represents the characteristic value of the vector corresponding to another of the candidate keywords in the n-dimensional space, and n represents the dimension of the vector space;
a construction unit, configured to build the candidate-keyword similarity matrix according to the cosine values of the term vector angles.
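The work of the first computing unit and the construction unit can be sketched together: pairwise included-angle cosines are assembled into the N x N word similarity matrix M. The function name and the sample vectors are hypothetical.

```python
import math

def build_similarity_matrix(vectors):
    """Build the N x N word-similarity matrix M, where M[i][j] is the
    included-angle cosine of the term vectors of candidates i and j."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    n = len(vectors)
    return [[cos(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Three hypothetical 2-dimensional term vectors
M = build_similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# M is symmetric, with 1.0 on the diagonal (each vector's cosine with itself)
```

The resulting matrix is the input to the PageRank-style iteration described earlier (in practice its rows would first be normalised so the iteration converges).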
Further, the extraction module 304 includes:
a second computing unit, configured to calculate on the word similarity matrix of the candidate keywords according to the PageRank algorithm to obtain the PageRank values corresponding to the candidate keywords;
a sequencing unit, configured to rank the candidate keywords according to the PageRank values to obtain the importance degree of the candidate keywords;
an extraction unit, configured to extract the keywords of the pending text according to the importance degree.
Further, the second computing unit includes:
a first determination subunit, configured to determine the initial value of the PageRank algorithm according to the order of the word similarity matrix;
a first computation subunit, configured to calculate the initial feature vector value of the candidate keywords according to the initial value and the word similarity matrix;
a second computation subunit, configured to calculate the feature vector value of the candidate keywords according to the formula

$$p_t = M^{T} p_{t-1}$$

where, when t = 1, p_1 represents the initial feature vector value and p_0 represents the initial value; p_t represents the feature vector value of the word similarity matrix at step t, p_{t-1} represents the feature vector value of the word similarity matrix at step t-1, M represents the word similarity matrix of the candidate keywords, M^T represents the transpose of the word similarity matrix, and t represents the step number of the calculation, with t greater than or equal to 1;
a second determination subunit, configured to determine, when the norm of the difference between the feature vector values at step t and step t-1 is less than the error tolerance of the PageRank algorithm, the feature vector value at step t as the PageRank value corresponding to the candidate keywords.
Further, the acquisition module 301 includes:
an acquiring unit, configured to obtain a pending text and segment the pending text to obtain stop words and words of specified parts of speech, the stop words including at least prepositions, auxiliary words, conjunctions and interjections, and the words of specified parts of speech including at least nouns, verbs and adjectives;
a processing unit, configured to filter out the stop words to obtain the words of the specified parts of speech, the words of the specified parts of speech being the candidate keywords corresponding to the pending text.
Further, the term vectors are obtained by word2vec training.
As can be seen, in the keyword extraction apparatus based on a graph model provided by the embodiment of the present invention, the processing module calculates the similarity between words in the text from their term vectors and builds the similarity matrix, so that the extracted keywords reflect, to a certain extent, their semantic importance in the current text. When the similarity matrix is built, the similarity between words does not depend on the co-occurrence of the words but is calculated from the term vectors; this avoids the problem in the keyword extraction process of excessive weighting of words caused by repeated co-occurrence, requires no manual setting of a window size, selects keywords that better match the subject of the document through semantic similarity, and improves the accuracy of keyword extraction.
It should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relation or order between these entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device comprising that element.
The embodiments in this specification are described in a related manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiment is described relatively simply because it is substantially similar to the method embodiment; for relevant parts, reference may be made to the description of the method embodiment.
The above are merely preferred embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (12)

1. A keyword extraction method based on a graph model, characterized in that the method comprises:
obtaining a pending text, and segmenting the pending text to obtain candidate keywords corresponding to the pending text;
searching a term vector model for the term vectors corresponding to the candidate keywords, the term vector model comprising the term vectors of the candidate keywords;
building a word similarity matrix of the candidate keywords according to the term vectors;
ranking the candidate keywords according to the word similarity matrix of the candidate keywords, and extracting the keywords of the pending text.
2. The method according to claim 1, characterized in that building the word similarity matrix of the candidate keywords according to the term vectors comprises:
calculating, according to the formula

$$\cos(\theta)=\frac{\sum_{k=1}^{n} x_{1k}x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\,\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$

the cosine of the angle between the term vectors corresponding to the candidate keywords, wherein θ represents the angle between the vectors of the candidate keywords, x_1k represents the characteristic value of the vector corresponding to one of the candidate keywords in the n-dimensional space, x_2k represents the characteristic value of the vector corresponding to another of the candidate keywords in the n-dimensional space, and n represents the dimension of the vector space;
building the candidate-keyword similarity matrix according to the cosine values of the term vector angles.
3. The method according to claim 1, characterized in that ranking the candidate keywords according to the word similarity matrix of the candidate keywords comprises:
calculating on the word similarity matrix of the candidate keywords according to a PageRank algorithm to obtain the PageRank values corresponding to the candidate keywords;
ranking the candidate keywords according to the PageRank values to obtain the importance degree of the candidate keywords;
extracting the keywords of the pending text according to the importance degree.
4. The method according to claim 3, characterized in that calculating on the word similarity matrix of the candidate keywords according to the PageRank algorithm comprises:
determining an initial value of the PageRank algorithm according to the order of the word similarity matrix;
calculating an initial feature vector value of the candidate keywords according to the initial value and the word similarity matrix;
calculating, according to the formula

$$p_t = M^{T} p_{t-1}$$

the feature vector value of the candidate keywords, wherein, when t = 1, p_1 represents the initial feature vector value and p_0 represents the initial value; p_t represents the feature vector value of the word similarity matrix at step t, p_{t-1} represents the feature vector value of the word similarity matrix at step t-1, M represents the word similarity matrix of the candidate keywords, M^T represents the transpose of the word similarity matrix, and t represents the step number of the calculation, with t greater than or equal to 1;
when the norm of the difference between the feature vector values at step t and step t-1 is less than the error tolerance of the PageRank algorithm, taking the feature vector value at step t as the PageRank value corresponding to the candidate keywords.
5. The method according to any one of claims 1 to 4, characterized in that obtaining the pending text and segmenting the pending text to obtain the candidate keywords corresponding to the pending text comprises:
obtaining the pending text, and segmenting the pending text to obtain stop words and words of specified parts of speech, the stop words comprising at least prepositions, auxiliary words, conjunctions and interjections, and the words of specified parts of speech comprising at least nouns, verbs and adjectives;
filtering out the stop words to obtain the words of the specified parts of speech, the words of the specified parts of speech being the candidate keywords corresponding to the pending text.
6. The method according to any one of claims 1 to 4, characterized in that the term vectors are obtained by word2vec training.
7. A keyword extraction apparatus based on a graph model, characterized in that the apparatus comprises:
an acquisition module, configured to obtain a pending text and segment the pending text to obtain the candidate keywords corresponding to the pending text;
a searching module, configured to search a term vector model for the term vectors corresponding to the candidate keywords, the term vector model comprising the term vectors of the candidate keywords;
a processing module, configured to build the word similarity matrix of the candidate keywords according to the term vectors;
an extraction module, configured to rank the candidate keywords according to the word similarity matrix of the candidate keywords and extract the keywords of the pending text.
8. The apparatus according to claim 7, characterized in that the processing module comprises:
a first computing unit, configured to calculate, according to the formula

$$\cos(\theta)=\frac{\sum_{k=1}^{n} x_{1k}x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\,\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$

the cosine of the angle between the term vectors corresponding to the candidate keywords, wherein θ represents the angle between the vectors of the candidate keywords, x_1k represents the characteristic value of the vector corresponding to one of the candidate keywords in the n-dimensional space, x_2k represents the characteristic value of the vector corresponding to another of the candidate keywords in the n-dimensional space, and n represents the dimension of the vector space;
a construction unit, configured to build the candidate-keyword similarity matrix according to the cosine values of the term vector angles.
9. The apparatus according to claim 7, characterized in that the extraction module comprises:
a second computing unit, configured to calculate on the word similarity matrix of the candidate keywords according to the PageRank algorithm to obtain the PageRank values corresponding to the candidate keywords;
a sequencing unit, configured to rank the candidate keywords according to the PageRank values to obtain the importance degree of the candidate keywords;
an extraction unit, configured to extract the keywords of the pending text according to the importance degree.
10. The apparatus according to claim 9, characterized in that the second computing unit comprises:
a first determination subunit, configured to determine the initial value of the PageRank algorithm according to the order of the word similarity matrix;
a first computation subunit, configured to calculate the initial feature vector value of the candidate keywords according to the initial value and the word similarity matrix;
a second computation subunit, configured to calculate, according to the formula

$$p_t = M^{T} p_{t-1}$$

the feature vector value of the candidate keywords, wherein, when t = 1, p_1 represents the initial feature vector value and p_0 represents the initial value; p_t represents the feature vector value of the word similarity matrix at step t, p_{t-1} represents the feature vector value of the word similarity matrix at step t-1, M represents the word similarity matrix of the candidate keywords, M^T represents the transpose of the word similarity matrix, and t represents the step number of the calculation, with t greater than or equal to 1;
a second determination subunit, configured to determine, when the norm of the difference between the feature vector values at step t and step t-1 is less than the error tolerance of the PageRank algorithm, the feature vector value at step t as the PageRank value corresponding to the candidate keywords.
11. The apparatus according to any one of claims 7 to 10, characterized in that the acquisition module comprises:
an acquiring unit, configured to obtain a pending text and segment the pending text to obtain stop words and words of specified parts of speech, the stop words comprising at least prepositions, auxiliary words, conjunctions and interjections, and the words of specified parts of speech comprising at least nouns, verbs and adjectives;
a processing unit, configured to filter out the stop words to obtain the words of the specified parts of speech, the words of the specified parts of speech being the candidate keywords corresponding to the pending text.
12. The apparatus according to any one of claims 7 to 10, characterized in that the term vectors are obtained by word2vec training.
CN201710208956.4A 2017-03-31 2017-03-31 Keyword extraction method and device based on graph model Active CN106970910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710208956.4A CN106970910B (en) 2017-03-31 2017-03-31 Keyword extraction method and device based on graph model

Publications (2)

Publication Number Publication Date
CN106970910A true CN106970910A (en) 2017-07-21
CN106970910B CN106970910B (en) 2020-03-27


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704503A (en) * 2017-08-29 2018-02-16 平安科技(深圳)有限公司 User's keyword extracting device, method and computer-readable recording medium
CN107992470A (en) * 2017-11-08 2018-05-04 中国科学院计算机网络信息中心 A kind of text duplicate checking method and system based on similarity
CN108009149A (en) * 2017-11-23 2018-05-08 东软集团股份有限公司 A kind of keyword extracting method, extraction element, medium and electronic equipment
CN108804423A (en) * 2018-05-30 2018-11-13 平安医疗健康管理股份有限公司 Medical Text character extraction and automatic matching method and system
CN109325509A (en) * 2017-07-31 2019-02-12 北京国双科技有限公司 Similarity determines method and device
CN109408826A (en) * 2018-11-07 2019-03-01 北京锐安科技有限公司 A kind of text information extracting method, device, server and storage medium
CN110032632A (en) * 2019-04-04 2019-07-19 平安科技(深圳)有限公司 Intelligent customer service answering method, device and storage medium based on text similarity
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 The keyword abstraction method and system of phrase-based vector
CN110489758A (en) * 2019-09-10 2019-11-22 深圳市和讯华谷信息技术有限公司 The values calculation method and device of application program
CN110580290A (en) * 2019-09-12 2019-12-17 北京小米智能科技有限公司 method and device for optimizing training set for text classification
CN110852100A (en) * 2019-10-30 2020-02-28 北京大米科技有限公司 Keyword extraction method, keyword extraction device, electronic equipment and medium
WO2020052061A1 (en) * 2018-09-14 2020-03-19 北京字节跳动网络技术有限公司 Method and device for processing information
CN111353301A (en) * 2020-02-24 2020-06-30 成都网安科技发展有限公司 Auxiliary secret fixing method and device
CN112434188A (en) * 2020-10-23 2021-03-02 杭州未名信科科技有限公司 Data integration method and device for heterogeneous database and storage medium
CN113742602A (en) * 2020-05-29 2021-12-03 中国电信股份有限公司 Method, apparatus, and computer-readable storage medium for sample optimization

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101196904A (en) * 2007-11-09 2008-06-11 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
US20140164370A1 (en) * 2012-12-12 2014-06-12 King Fahd University Of Petroleum And Minerals Method for retrieval of arabic historical manuscripts
CN104281653A (en) * 2014-09-16 2015-01-14 南京弘数信息科技有限公司 Viewpoint mining method for ten million microblog texts
US20150046152A1 (en) * 2013-08-08 2015-02-12 Quryon, Inc. Determining concept blocks based on context
CN105740229A (en) * 2016-01-26 2016-07-06 中国人民解放军国防科学技术大学 Keyword extraction method and device
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
CN101196904A (en) * 2007-11-09 2008-06-11 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
US20140164370A1 (en) * 2012-12-12 2014-06-12 King Fahd University Of Petroleum And Minerals Method for retrieval of arabic historical manuscripts
US20150046152A1 (en) * 2013-08-08 2015-02-12 Quryon, Inc. Determining concept blocks based on context
CN104281653A (en) * 2014-09-16 2015-01-14 南京弘数信息科技有限公司 Viewpoint mining method for ten million microblog texts
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus
CN105740229A (en) * 2016-01-26 2016-07-06 中国人民解放军国防科学技术大学 Keyword extraction method and device

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325509A (en) * 2017-07-31 2019-02-12 Beijing Gridsum Technology Co., Ltd. Similarity determination method and device
CN107704503A (en) * 2017-08-29 2018-02-16 Ping An Technology (Shenzhen) Co., Ltd. User keyword extraction device, method, and computer-readable storage medium
AU2017408801B2 (en) * 2017-08-29 2020-04-02 Ping An Technology (Shenzhen) Co.,Ltd. User keyword extraction device and method, and computer-readable storage medium
KR102170929B1 (en) * 2017-08-29 2020-10-29 Ping An Technology (Shenzhen) Co., Ltd. User keyword extraction device, method, and computer-readable storage medium
KR20190038751A (en) * 2017-08-29 2019-04-09 Ping An Technology (Shenzhen) Co., Ltd. User keyword extraction apparatus, method and computer readable storage medium
EP3477495A4 (en) * 2017-08-29 2019-12-11 Ping An Technology (Shenzhen) Co., Ltd. Apparatus and method for extracting user keyword, and computer-readable storage medium
CN107992470A (en) * 2017-11-08 2018-05-04 Computer Network Information Center, Chinese Academy of Sciences Similarity-based text duplicate-checking method and system
CN108009149A (en) * 2017-11-23 2018-05-08 Neusoft Corporation Keyword extraction method, extraction device, medium, and electronic apparatus
CN108804423A (en) * 2018-05-30 2018-11-13 Ping An Medical and Healthcare Management Co., Ltd. Medical text feature extraction and automatic matching method and system
CN108804423B (en) * 2018-05-30 2023-09-08 Shenzhen Ping An Medical Health Technology Service Co., Ltd. Medical text feature extraction and automatic matching method and system
WO2020052061A1 (en) * 2018-09-14 2020-03-19 Beijing ByteDance Network Technology Co., Ltd. Method and device for processing information
CN109408826A (en) * 2018-11-07 2019-03-01 Beijing Ruian Technology Co., Ltd. Text information extraction method, device, server, and storage medium
CN110032632A (en) * 2019-04-04 2019-07-19 Ping An Technology (Shenzhen) Co., Ltd. Intelligent customer-service answering method, device, and storage medium based on text similarity
CN110263343A (en) * 2019-06-24 2019-09-20 Beijing Institute of Technology Keyword extraction method and system based on phrase vectors
CN110489758A (en) * 2019-09-10 2019-11-22 Shenzhen Hexun Huagu Information Technology Co., Ltd. Value view calculation method and device for an application program
CN110489758B (en) * 2019-09-10 2023-04-18 Shenzhen Hexun Huagu Information Technology Co., Ltd. Value view calculation method and device for application program
CN110580290A (en) * 2019-09-12 2019-12-17 Beijing Xiaomi Intelligent Technology Co., Ltd. Method and device for optimizing a training set for text classification
CN110580290B (en) * 2019-09-12 2022-12-13 Beijing Xiaomi Intelligent Technology Co., Ltd. Method and device for optimizing training set for text classification
CN110852100A (en) * 2019-10-30 2020-02-28 Beijing Dami Technology Co., Ltd. Keyword extraction method and device, electronic device, and medium
CN110852100B (en) * 2019-10-30 2023-07-21 Beijing Dami Technology Co., Ltd. Keyword extraction method and device, electronic equipment and medium
CN111353301A (en) * 2020-02-24 2020-06-30 Chengdu Wangan Technology Development Co., Ltd. Auxiliary security-classification method and device
CN113742602A (en) * 2020-05-29 2021-12-03 China Telecom Corporation Ltd. Method, apparatus, and computer-readable storage medium for sample optimization
CN112434188A (en) * 2020-10-23 2021-03-02 Hangzhou Weiming Information Technology Co., Ltd. Data integration method and device for heterogeneous database and storage medium
CN112434188B (en) * 2020-10-23 2023-09-05 Hangzhou Weiming Information Technology Co., Ltd. Data integration method, device and storage medium of heterogeneous database

Also Published As

Publication number Publication date
CN106970910B (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN106970910A (en) Keyword extraction method and device based on graph model
CN107122413A (en) Keyword extraction method and device based on graph model
CN106777274B (en) Method and system for constructing a knowledge graph for the Chinese tourism domain
CN109190117B (en) Short text semantic similarity calculation method based on word vector
CN101566998B (en) Chinese question-answering system based on neural network
CN100416570C (en) FAQ-based Chinese natural language question answering method
CN110750640B (en) Text data classification method and device based on neural network model and storage medium
JP5537649B2 (en) Method and apparatus for data retrieval and indexing
CN104391942B (en) Short text feature expansion method based on semantic graph
CN101430695B (en) System and method for computing difference affinities of words
CN106776562A (en) Keyword extraction method and extraction system
CN107247780A (en) Ontology-based patent document similarity measurement method
CN104484374B (en) Method and device for creating an online encyclopedia entry
CN107818164A (en) Intelligent question-answering method and system
CN108549634A (en) Chinese patent text similarity calculation method
CN108874896B (en) Humor recognition method based on neural networks and humor features
CN110020189A (en) Article recommendation method based on Chinese similarity measures
CN110674252A (en) High-precision semantic search system for the judicial domain
CN111143672B (en) Knowledge-graph-based method for recommending scholars by professional specialty
CN107463658A (en) Text classification method and device
CN104199965A (en) Semantic information retrieval method
CN105528437A (en) Question-answering system construction method based on structured text knowledge extraction
CN109635083A (en) Topic-based query document retrieval method for searching TED talks
CN100535895C (en) Text search apparatus and method
CN110516145B (en) Information searching method based on sentence vector coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant