CN102298576B - Method and device for generating document keywords - Google Patents

Method and device for generating document keywords

Info

Publication number
CN102298576B
CN102298576B CN201010208994.8A CN201010208994A CN102298576A
Authority
CN
China
Prior art keywords
word
cluster
score
keyword
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010208994.8A
Other languages
Chinese (zh)
Other versions
CN102298576A (en)
Inventor
孙军
谢宣松
姜珊珊
赵利军
郑继川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to CN201010208994.8A priority Critical patent/CN102298576B/en
Publication of CN102298576A publication Critical patent/CN102298576A/en
Application granted granted Critical
Publication of CN102298576B publication Critical patent/CN102298576B/en

Landscapes

  • Information Retrieval; Database Structures Therefor; File System Structures Therefor (AREA)

Abstract

The invention provides a method and device for generating document keywords. The method comprises the steps of: obtaining a plurality of preliminary keyword sets and merging them to obtain a candidate keyword set; determining a word score for each word in the candidate keyword set; clustering the words based on each word's score and the similarity between words; and assigning keywords to the document based on each cluster obtained by the clustering. The method and device can comprehensively exploit existing word sets and word-extraction algorithms, yielding more diverse and balanced results.

Description

Method and apparatus for generating document keywords
Technical field
The present invention relates to information processing and information extraction, and more particularly to a method and apparatus for generating document keywords.
Background art
Keyword extraction (or generation) generally refers to automatically extracting or generating, from a document (or several related documents), a number of words or phrases that represent the document's subject matter well. Keyword extraction is widely used in intelligent text-processing fields such as information retrieval, text classification/clustering, information filtering, and document summarization.
Many keyword extraction or generation techniques have been proposed. For example, U.S. Patent Application Publication US20050278325A1, entitled "Graph-based ranking algorithms for text processing", proposes a method that obtains text units from a natural-language text, constructs a graph of those text units, and derives a ranking of the text units from the graph. As another example, the article "Clustering to Find Exemplar Terms for Keyphrase Extraction" by Zhiyuan Liu, Peng Li, Yabin Zheng, and Maosong Sun (Proceedings of EMNLP 2009, pages 257-266) proposes a method of extracting key phrases from a document using clustering: it first obtains single words and clusters of single words, then selects keywords from the document based on some simple POS (part-of-speech) patterns. As yet another example, U.S. Patent US7478092 (B2), entitled "Key term extraction", proposes a method that attempts to filter a candidate word set with simple elimination techniques, such as detecting near-duplicate phrases against a set of everyday expressions.
In addition, for a given document, some designated word sets may already exist, such as the keywords an author assigns to a scientific paper. A user may also designate "seed" words for a document according to prior knowledge or personal preference. User-designated seed words are words relevant to the user's needs, for example a query word entered by the user, so that when a document covers multiple topics, only keywords related to a particular topic are extracted. Moreover, as noted above, many existing word-extraction algorithms are available, and applying them to a given document yields several different word sets. Each existing word-extraction algorithm generally approaches the word-assignment problem from a different angle.
Therefore, there is a need for a method and apparatus that synthesize these existing word sets, or the word sets produced by existing word-extraction algorithms, to obtain a better keyword set for the document.
Summary of the invention
To address this need, the present invention has been made.
According to one aspect of the present invention, a document keyword generation method is provided, which may comprise: obtaining a plurality of preliminary keyword sets and merging them to obtain a candidate keyword set; determining a word score for each word in the candidate keyword set; clustering the words based on each word's score and the similarity between words; and assigning keywords to the document based on each cluster obtained by the clustering.
The objective function of the clustering may be computed from the following three factors: the sum of similarities between the words within each cluster; the entropy of the distribution of the number of words per cluster; and the entropy of the distribution of the word-score sums per cluster.
The plurality of preliminary keyword sets may comprise at least one of: the keyword sets obtained by a plurality of existing keyword-extraction algorithms; the set of nouns and noun phrases in the document; the set of keywords or index words designated by the document's author; and the set of "seed" words designated by a user.
Determining a word score for each word in the candidate keyword set may comprise: constructing a word recommendation graph from the words in the candidate keyword set, each node of the graph corresponding to a word; links between word nodes are established according to the recommendation relations between words, each link having a weight corresponding to its recommendation relation; each word node has an initial score; the score of each word node is propagated iteratively to its neighbouring word nodes, each node retaining a portion of its original score after each propagation round; the propagation process ends at convergence or upon reaching a maximum number of iterations.
The score a word node obtains from a neighbour after each propagation round may depend on the neighbour's score and the weight of the link between the word node and that neighbour.
If two words belong to the same sentence, the weight of the link between them may depend on the grammatical relation between the two words in that sentence's syntax tree.
The similarity between two words may be determined from co-occurrence statistics, in a predetermined document collection, of the two words or of their constituent single words.
Assigning keywords to the document may comprise: for each cluster, computing a selection score for each word in the cluster, based on the word's own score and its similarity to the other words in the cluster; computing a selection score for each word in a predetermined domain dictionary, based on its similarity to the words in the cluster; and selecting, from among the domain-dictionary words and the cluster's words, the word with the maximum selection score as a keyword of the document.
Alternatively, assigning keywords to the document may comprise: for each cluster, computing a selection score for each word in the cluster, based on the word's own score, the ratio of that score to the sum of the word scores in the cluster, and the word's similarity to the other words in the cluster; computing a selection score for each word in a predetermined domain dictionary, based on its similarity to the words in the cluster; selecting, from among the domain-dictionary words and the cluster's words, a first predetermined number of words with the highest selection scores; sorting the words so selected for all clusters together in descending order of selection score; and taking the top second predetermined number of words as the keywords of the document.
According to another aspect of the present invention, a document keyword generation apparatus is provided, which may comprise: a candidate-keyword-set obtaining component for obtaining a plurality of preliminary keyword sets and merging them into a candidate keyword set; a word-score determining component for determining a word score for each word in the candidate keyword set; a clustering component for clustering the words based on each word's score and the similarity between words; and a keyword-assigning component for assigning keywords to the document based on each cluster obtained by the clustering.
Because each existing word-extraction algorithm generally approaches the word-assignment problem from a different angle, and each existing word set likewise has its own emphasis, the method and apparatus of the present invention can comprehensively exploit existing word sets and word-extraction algorithms, yielding more diverse and balanced results.
Moreover, by having the clustering objective function consider the sum of similarities between the words within each cluster, the entropy of the distribution of the number of words per cluster, and the entropy of the distribution of the word-score sums per cluster, even more diverse and balanced results can be obtained.
In addition, with the assistance of a domain dictionary, the word-assignment method of the present invention can output words that do not appear in the original document but better describe certain aspects of it.
Brief description of the drawings
Fig. 1 is an overall flowchart of a document keyword generation method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a concrete example of a document keyword generation method according to another embodiment of the present invention;
Fig. 3 is a block diagram of a document keyword generation apparatus according to an embodiment of the present invention; and
Fig. 4 illustrates an example hardware configuration of a document keyword generation apparatus according to an embodiment of the present invention.
Detailed description of embodiments
To help those skilled in the art better understand the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.
In this application document, a single word (word) refers to a word in Chinese or in English, while a word (term) may refer either to a single word or to a phrase composed of multiple single words.
In addition, to avoid obscuring the main points of the present invention, well-known features or structures are not described in this application. For example, in keyword extraction the text is usually segmented into words first, and when assessing the importance of a word, features such as term frequency, inverse document frequency, word position, word length, and part of speech are considered. Many known techniques exist for word segmentation (e.g., ICTCLAS) and likewise for word-feature selection, but these are not aspects the present invention focuses on and are therefore not described in detail. Note, however, that this does not mean the present invention cannot include these known features or structures; on the contrary, such segmentation and feature-selection techniques may be used with the present invention.
Fig. 1 is an overall flowchart of a document keyword generation method according to an embodiment of the present invention.
As shown in Fig. 1, the document keyword generation method according to an embodiment of the present invention may comprise a candidate-keyword-set obtaining step S110, a word-score determining step S120, a clustering step S130, and a keyword-assigning step S140. Each step is described in detail below.
In step S110, a plurality of preliminary keyword sets are obtained and merged to obtain the candidate keyword set.
The preliminary keyword sets here may come from any source; for example, they may comprise at least one of: the keyword sets obtained by a plurality of existing keyword-extraction algorithms; the set of nouns and noun phrases in the document; the set of keywords or index words designated by the document's author; and the set of "seed" words designated by a user. The existing keyword-extraction algorithms are arbitrary, e.g., the extraction algorithms mentioned in the Background section. The set of nouns and noun phrases is listed above because common nouns and noun phrases are more likely to represent the document's subject matter; this is merely exemplary, however, and sets composed of words or phrases of other parts of speech, e.g., verbs and verb phrases, can also be used with the present invention.
Merging the preliminary keyword sets may be a simple union operation, e.g., only removing redundant words. It may, however, also include more complex operations, for example:
(1) A filtering operation, such as filtering out words that are unlikely to be keywords according to predetermined conditions, e.g., words with a very high everyday frequency of occurrence such as "today" or "* month * day", or stop words such as "still" and "but";
(2) Assigning an initial word score to each candidate keyword. The computation of the initial word score may differ according to the source of the candidate. Denote the merged preliminary keyword sets as T_1, T_2, ..., T_m, each preliminary keyword set being T_i = {Term_1, Term_2, ..., Term_j, ...}. If a preliminary set T_i was obtained by a known keyword-extraction algorithm, that algorithm usually computes a score or weight Score(Term_j) for each extracted keyword Term_j, and each algorithm k may itself have a different score or confidence Conf_k according to its performance; the initial score of a candidate keyword can then be derived from the score Score(Term_j) in the preliminary set it came from and the confidence Conf_k of the corresponding algorithm, for example as Conf_k * Score(Term_j). If the algorithm computes no word score, the initial score can be assigned manually based on experience. If a candidate keyword comes from the author-designated keywords or index words, or from the user-designated "seed" words, it can be given a higher initial score; if it comes from the set of nouns and noun phrases in the document, it can be given a lower initial score.
(3) When removing redundant words, the initial score of a word may also be increased according to how many times it was repeated.
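The merge-and-score operations above can be sketched as follows. This is an illustrative sketch only: the stop-word list, the algorithm confidences, and the specific high/low initial scores given to author/seed terms and to plain noun terms are assumptions, not values fixed by the patent.

```python
# Sketch: merging several preliminary keyword sets into one candidate set
# with initial word scores, per operations (1)-(3) above.

STOPWORDS = {"today", "but", "still"}  # assumed stop-word list

def merge_keyword_sets(extracted, author_terms, seed_terms, noun_terms):
    """extracted: list of (algorithm_confidence, {term: score}) pairs.
    Returns {term: initial_word_score}."""
    candidates = {}

    def add(term, score):
        if term in STOPWORDS:
            return                      # operation (1): filter unlikely words
        if term in candidates:
            candidates[term] += score   # operation (3): repeats raise the score
        else:
            candidates[term] = score

    for conf, term_scores in extracted:
        for term, score in term_scores.items():
            add(term, conf * score)     # Conf_k * Score(Term_j)
    for term in author_terms | seed_terms:
        add(term, 1.0)                  # author/seed terms: higher initial score
    for term in noun_terms:
        add(term, 0.1)                  # plain nouns/noun phrases: lower score
    return candidates
```

The constants 1.0 and 0.1 merely encode "higher" and "lower"; in practice they would be tuned empirically.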
In step S120, a word score is determined for each word in the candidate keyword set.
The word score here may simply be the initial word score obtained in step S110. Preferably, however, the score can be further refined according to the recommendation relations between words; a description of refining word scores by constructing a word recommendation graph and propagating scores is given below with reference to Fig. 2. Prior knowledge can also be exploited: for example, if a document is known to be a World Cup report, football-related words such as "goal" and "shot" can have their scores boosted.
In step S130, the words are clustered based on each word's score and the similarity between words.
The similarity between two terms can be computed from the co-occurrence statistics, in a predetermined document collection, of the two terms or of their constituent single words.
Suppose two terms term_1 and term_2 are expressed as

term_1 = (word_1^{(1)}, word_2^{(1)}, ..., word_p^{(1)}) and term_2 = (word_1^{(2)}, word_2^{(2)}, ..., word_q^{(2)}),

where word_i^{(j)} is the i-th single word of the j-th term, so term_1 consists of p single words and term_2 consists of q single words. One concrete similarity measure between term_1 and term_2 is formula (1):

similarity(term_1, term_2) = \frac{1}{p \cdot q} \sum_{i=1}^{p} \sum_{j=1}^{q} similarity(word_i^{(1)}, word_j^{(2)})    (1)

where similarity(word_i^{(1)}, word_j^{(2)}) denotes a similarity measure between the single words word_i^{(1)} and word_j^{(2)}. The similarity between two single words can be obtained from co-occurrence statistics over some document collection.

For example, with the mutual-information method, the similarity similarity(word_i, word_j) between two single words word_i and word_j is computed with formula (2):

similarity(word_i, word_j) = \log \frac{p(word_i, word_j)}{p(word_i) \, p(word_j)}    (2)

where p(word_i) and p(word_j) are the probabilities of word_i and word_j occurring in a document, respectively, and p(word_i, word_j) is the probability of word_i and word_j co-occurring in the same sentence, or within a window of predetermined size, in a document.
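Formulas (1) and (2) can be sketched together as follows. The probability tables passed in are assumed to have been estimated beforehand from some document collection; their keys and values here are purely illustrative.

```python
import math

def word_similarity(wi, wj, p_word, p_cooccur):
    """Formula (2): pointwise mutual information between two single words.
    p_word maps a word to its occurrence probability; p_cooccur maps a
    (word, word) pair to its co-occurrence probability."""
    return math.log(p_cooccur[(wi, wj)] / (p_word[wi] * p_word[wj]))

def term_similarity(term1, term2, p_word, p_cooccur):
    """Formula (1): average pairwise single-word similarity over the
    p * q word pairs of the two terms (each term is a list of words)."""
    total = sum(word_similarity(wi, wj, p_word, p_cooccur)
                for wi in term1 for wj in term2)
    return total / (len(term1) * len(term2))
```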
The occurrence probabilities of single words, the similarities between single words, and the similarities between terms may be determined in advance and stored in a word-similarity database, or computed on the fly from the document being processed.
Computing word similarity by mutual information is only an example; statistical methods such as the log-likelihood ratio and the chi-squared test, as well as dictionary-based knowledge methods (e.g., WordNet or HowNet), can also be used.
Furthermore, the similarities between words may be determined before clustering, or computed on the fly during the clustering process.
Clustering is an unsupervised machine-learning technique that groups individuals or samples into classes, each individual being viewed as a point in a feature space. Its basic idea is that points that are close together and dense in the feature space are gathered into one class, or cluster.
In the word clustering here, each word is a sample, and the similarity between words plays the role of the distance between words. Thus various existing clustering algorithms, such as the one in the EMNLP 2009 article "Clustering to Find Exemplar Terms for Keyphrase Extraction" cited in the Background section, can be applied to the present invention.
The number of clusters c finally obtained may be predetermined, e.g., the number of keywords specified by the user or the system, or it may be left undetermined and decided by the final outcome of the clustering algorithm.
According to one embodiment of the present invention, where the number of clusters c is the number of keywords specified by the user or the system, the word clustering process can be as follows:
1) randomly partition the words into c clusters;
2) for each word, tentatively place it in each of the other clusters, compute the change in the objective function, and move the word into the cluster that increases the objective function the most;
3) if the value of the objective function no longer increases for any word, terminate and return the current clustering result.
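The three-step procedure above can be sketched as a greedy reassignment loop. The cap on the number of rounds and the fixed random seed are illustrative safeguards not specified in the patent; `objective` is any function scoring a list of clusters, such as formulas (3)-(5) below in the text.

```python
import random

def greedy_cluster(words, objective, c, max_rounds=20, seed=0):
    """Step 1: random split into c clusters. Step 2: move each word to
    whichever cluster raises objective(clusters) the most. Step 3: stop
    when a full pass makes no improving move (or after max_rounds)."""
    rng = random.Random(seed)
    clusters = [set() for _ in range(c)]
    for w in words:
        clusters[rng.randrange(c)].add(w)           # step 1
    improved, rounds = True, 0
    while improved and rounds < max_rounds:
        improved, rounds = False, rounds + 1
        for w in words:
            src = next(k for k, cl in enumerate(clusters) if w in cl)
            best_k, best_val = src, objective(clusters)
            for k in range(c):                      # step 2: try each move
                if k == src:
                    continue
                clusters[src].discard(w)
                clusters[k].add(w)
                val = objective(clusters)
                if val > best_val:
                    best_k, best_val = k, val
                clusters[k].discard(w)              # revert trial move
                clusters[src].add(w)
            if best_k != src:
                clusters[src].discard(w)
                clusters[best_k].add(w)
                improved = True
    return clusters
```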
The objective function of the clustering algorithm is very important and can be designed according to actual needs. According to one embodiment of the present invention, the word clustering process considers both the similarity between words and the score of each word, and aims for a relatively balanced clustering. Balance here can be embodied in, for example, two respects: the distribution of the word-score sums (sizes) of the clusters is relatively balanced; and the distribution of the numbers of words in the clusters is relatively balanced.
Therefore, according to one embodiment of the present invention, the design of the clustering objective function may consider, besides the usual within-cluster and between-cluster similarities of elements, the balance factors above. The clustering objective according to embodiments of the present invention includes at least one of the following:
1. the score (size) distribution of the clusters is relatively balanced;
2. the distribution of the numbers of elements in the clusters is relatively balanced;
3. the similarity between the elements within each cluster is as large as possible;
4. the similarity between elements of different clusters is as small as possible.
According to one embodiment of the present invention, the objective function of the balanced clustering that considers both node scores and word similarity comprises the following three factors: the within-cluster similarity sum of each cluster; the entropy of the distribution of the number of words per cluster; and the entropy of the distribution of the word-score sums per cluster.
The concrete form of the objective function can be chosen in several ways; for example, any of the following formulas (3), (4), (5) can be adopted:

\sum_{k=1}^{c} \sum_{i,j \in \pi_k} e(i,j) + \alpha \cdot H(n_{1 \sim c}) + \beta \cdot H(S_{1 \sim c})    (3)

\sum_{k=1}^{c} \frac{\sum_{i,j \in \pi_k} e(i,j)}{S_k} + \alpha \cdot H(n_{1 \sim c})    (4)

\sum_{k=1}^{c} \frac{\sum_{i,j \in \pi_k} e(i,j)}{\sum_{i \in \pi_k} e(i,j)} + \alpha \cdot H(n_{1 \sim c}) + \beta \cdot H(S_{1 \sim c})    (5)

where:
e(i, j) is the similarity between words i and j;
\pi_k is the k-th word cluster;
S_{1 \sim c} is shorthand for S_1, S_2, ..., S_c, where S_k = \sum_{i \in \pi_k} s_i is the sum of the scores s_i of all words i in the k-th word cluster;
n_{1 \sim c} is shorthand for n_1, n_2, ..., n_c, where n_i is the number of words in the i-th word cluster;
H(S_{1 \sim c}) is the entropy of the distribution of the word-score sums of the clusters. One example definition of entropy is the classical Shannon entropy, given by formula (6):

H(S_{1 \sim c}) = \sum_{k=1}^{c} - \frac{S_k}{\sum_{i=1}^{c} S_i} \log \left( \frac{S_k}{\sum_{i=1}^{c} S_i} \right)    (6)

H(n_{1 \sim c}) is the entropy of the distribution of the numbers of words in the clusters, likewise definable as the Shannon entropy of formula (7):

H(n_{1 \sim c}) = \sum_{k=1}^{c} - \frac{n_k}{\sum_{i=1}^{c} n_i} \log \left( \frac{n_k}{\sum_{i=1}^{c} n_i} \right)    (7)

\alpha is a pre-specified parameter indicating the importance of balancing the number of words per cluster; and
\beta is a pre-specified parameter indicating the importance of balancing the word-score sums per cluster.
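Formula (3), with the Shannon entropies of formulas (6) and (7), can be sketched as follows; the equal default weights alpha = beta = 1.0 are an assumption.

```python
import math
from itertools import combinations

def shannon_entropy(weights):
    """Shannon entropy of the distribution given by nonnegative weights,
    as in formulas (6) and (7)."""
    total = sum(weights)
    return -sum(w / total * math.log(w / total) for w in weights if w > 0)

def objective(clusters, sim, scores, alpha=1.0, beta=1.0):
    """Formula (3): within-cluster similarity sum plus the two balance
    entropies. clusters is a list of sets of words, sim(a, b) a symmetric
    similarity e(i, j), scores a word -> score mapping s_i."""
    within = sum(sim(a, b)
                 for cl in clusters
                 for a, b in combinations(sorted(cl), 2))
    sizes = [len(cl) for cl in clusters]                          # n_1 .. n_c
    score_sums = [sum(scores[w] for w in cl) for cl in clusters]  # S_1 .. S_c
    return (within
            + alpha * shannon_entropy(sizes)
            + beta * shannon_entropy(score_sums))
```

A function like this can be passed directly as the objective of the greedy reassignment procedure described earlier.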
With the clustering method of the embodiments above, because both the balance of the cluster sizes and the balance of the word-score sums are considered, a more balanced result can be obtained than in the prior art, where words are clustered directly considering only the within-cluster similarity between words, and each cluster obtained reflects the themes of the document better.
The clustering procedure of the above embodiment is only an example. In practice, common techniques such as merging one cluster (e.g., one with fewer elements) into a nearby cluster, or splitting one cluster (e.g., one with more elements) into two, can also be used with the present invention.
Likewise, the form of the objective function above is only an example; other objective functions, e.g., ones that also take into account the similarity between elements of different clusters, can be used with the present invention.
Moreover, in the concrete formulas (3), (4), (5), the balance of the clusters' word-score sums is characterized by the entropy of their distribution, and the balance of the clusters' word counts by the entropy of that distribution; this is only exemplary, and other statistical indicators, e.g., the standard deviation, can also be used to characterize balance.
In step S140, keywords are assigned to the document based on the clusters obtained.
After the clusters are obtained, the word with the maximum word score can simply be selected from each cluster and assigned to the document as a keyword. Alternatively, the word with the maximum similarity to the other words in its cluster can be selected as a keyword. Or both a word's score and its similarity to the other words in its cluster can be considered in deciding whether to assign it as a keyword.
Furthermore, a domain dictionary can be used, so that the keywords assigned to the document are selected either from the clusters or from the domain dictionary. An example of assigning keywords with a domain dictionary is described in detail below with reference to Fig. 2.
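The per-cluster selection combining a word's own score, its in-cluster similarity, and a domain dictionary can be sketched as follows. The multiplicative combination of the score with the average similarity is an illustrative choice, since the patent does not fix one exact selection formula.

```python
def pick_keyword(cluster, scores, sim, domain_dict):
    """Select one keyword for a cluster. In-cluster words are ranked by
    own score times (1 + average similarity to the rest of the cluster);
    domain-dictionary words by average similarity to the cluster alone."""
    def avg_sim(word, others):
        others = [o for o in others if o != word]
        return sum(sim(word, o) for o in others) / len(others) if others else 0.0

    best, best_score = None, float("-inf")
    for w in cluster:                       # candidates from the cluster
        s = scores[w] * (1 + avg_sim(w, cluster))
        if s > best_score:
            best, best_score = w, s
    for w in domain_dict:                   # candidates from the dictionary
        s = avg_sim(w, cluster)
        if s > best_score:
            best, best_score = w, s
    return best
```

Note that a dictionary word can win only when the cluster's own words score poorly, which matches the idea of outputting a descriptive word that never appears in the document.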
By the method for the invention described above embodiment, can consider the result of existing appointment vocabulary or multiple existing word extraction algorithm, the large advantage of one is that the word that the many aspects of document can be assigned covers, and this is because the diversified feature of various aspects that existing appointment word or existing word extraction algorithm may be considered given document.
In addition, with the method of the above embodiment, both word similarity and word scores are considered during clustering, so the output word clusters can cover the diverse themes of the given document. The method not only makes the distribution of the number of elements per cluster relatively balanced, but also makes the distribution of the clusters' score sums relatively balanced, so the output word clusters have the following property: for themes more important to the document, finer-grained sub-topics can be covered, while for less important themes, coarser-grained sub-topics are covered. This property allows the assigned words to represent and describe the original document more effectively.
A flowchart of a concrete example of a document keyword generation method according to another embodiment of the present invention is described below with reference to Fig. 2. In Fig. 2, grey boxes denote inputs, intermediate results, or outputs, and white boxes denote processes or operations.
As shown in Fig. 2, in step S210, a plurality of preliminary keyword sets are obtained. Specifically, step S210 may consist of sub-steps S201, S202, S203, S204. In sub-step S201, the most basic preliminary candidate words are generated, e.g., all nouns and noun phrases in the document; in sub-step S202, a plurality of preliminary keyword sets are generated with a plurality of existing word-extraction algorithms; in sub-step S203, the set of keywords or index words designated by the document's author is obtained, if any; in sub-step S204, the set of "seed words" designated by the user is obtained, if any.
In step S220, the preliminary keyword sets obtained in step S210 are used to construct the candidate keyword set with initial word scores, each candidate keyword in the set having an initial word score. As described with reference to Fig. 1, the initial score can be derived from the confidence of the keyword-extraction algorithm itself and the score of each word in the preliminary set produced by that algorithm, or specified in advance based on experience. As mentioned above, operations such as removing redundant words, filtering stop words, and weighting word scores by occurrence counts can all be performed at this stage.
In step S230, a word recommendation graph is constructed from the words in the candidate keyword set. Specifically, each node of the word recommendation graph corresponds to a word in the candidate keyword set; the links between word nodes are established according to the recommendation relations between words, each link having a weight corresponding to its recommendation relation; and each word node has an initial score, namely the initial word score of the corresponding word.
There are many kinds of word recommendation relations; four examples are given below.
One higher-weight word relation holds between children of the same parent node in a sentence's syntax tree. For example, in the sentence:
“The My Colors features include positive film, lighter skin tone, darker skin tone, vivid blue, vivid green, vivid red, color accent and color swap.”
In the syntax tree of this sentence, the following terms share a common parent node and therefore stand in this relation to one another: "positive film", "lighter skin tone", "darker skin tone", "vivid blue", "vivid green", "vivid red", "color accent", and "color swap".
A second recommendation relation holds between any two words in the same sentence, or between two words that are in the same sentence and whose distance is less than a certain threshold.
A third recommendation relation is the recommendation, by words in the middle sentences of a paragraph, of the words in that paragraph's leading and trailing topic sentences.
A fourth recommendation relation is the recommendation, by words in the full text, of the words in the article title, and, by words in a section's content, of the words in that section's title.
Those skilled in the art can devise other word recommendation relations as needed, and a predetermined weight can be given to each kind of recommendation relation.
Recommend relation if meet one or more words between two words, between these two words, establish the link, and the weight of link can be made as to the weighted mean of different word recommendation relations.
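As an illustrative sketch (not the patent's exact procedure), merging several recommendation relations into a single link weight might look as follows; the relation names and the values in `RELATION_WEIGHTS` are assumed for demonstration, and a plain average stands in for the weighted average mentioned above:

```python
# Pre-assigned weight for each kind of recommendation relation
# (illustrative values, not prescribed by the patent).
RELATION_WEIGHTS = {"same_parent": 1.0, "same_sentence": 0.5, "to_title": 0.8}

def build_link_weights(relation_pairs):
    """relation_pairs: iterable of (word_a, word_b, relation_name).
    Returns a dict (word_a, word_b) -> link weight. When several
    recommendation relations hold between the same two words, the link
    weight is set to the average of the individual relation weights."""
    acc = {}
    for a, b, rel in relation_pairs:
        acc.setdefault((a, b), []).append(RELATION_WEIGHTS[rel])
    return {pair: sum(ws) / len(ws) for pair, ws in acc.items()}
```

For instance, two phrases that are both syntax-tree siblings and sentence co-occurrences would receive the average of the two relation weights.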
At step S240, word scores are propagated over the word recommendation graph. In a sense, propagating word scores over the graph can be regarded as a simulation of recommendation in human society. In human social relations, a person's standing can involve two factors: the person's own merit, and the evaluations or recommendations that others in society give this person. The weight of each person's recommendation is in turn determined by two factors: the recommender's importance, and the closeness of the relation between the recommender and the person being recommended.
Concretely, score propagation can be an iterative process. In each iteration, every node passes its word score to its neighbors (the node's word playing the role of recommender, its neighbors being the recommendees) and also obtains scores from its neighbors (the node's word playing the role of recommendee, its neighbors being the recommenders). After each round of propagation, the score of each word node consists of two parts: the first part is a retained portion of its original score; the second part is the scores obtained from its neighbor nodes, where the score from a neighbor node can depend on the neighbor's score and on the weight of the link between this word node and the neighbor node.
After all word scores have gradually stabilized, or after the maximum number of iterations has been reached, the score propagation process can stop, and the current word scores are returned as the final word scores.
According to one embodiment of the invention, the score of node $i$ after each round of iteration can be computed by the following score propagation formula:
$$s_i^{(l+1)} = a + b \cdot s_i^{(0)} + (1 - a - b) \cdot \sum_{j \in N(i)} g\left(w_{j,i},\, d_i,\, d_j,\, D_i,\, D_j,\, s_j^{(l)}\right) \qquad (8)$$
where:
$a$ is a pre-specified smoothing parameter;
$b$ is a pre-specified parameter indicating the proportion contributed by the initial score;
$s_i^{(0)}$ is the initial merged score of word $i$;
$s_i^{(l)}$ is the score of word $i$ after the $l$-th iteration;
$s_i^{(l+1)}$ is the score of word $i$ after the $(l+1)$-th iteration;
$N(i)$ is the set of neighbors of word node $i$ in the word graph;
$w_{j,i}$ is the weight of the link from word node $j$ to word node $i$ in the word graph;
$d_i$ is the degree of word node $i$, equal to the number of links pointing from node $i$ to other nodes;
$D_i$ is the weighted degree of word node $i$, equal to the sum of the weights of the links pointing from node $i$ to other nodes.
Note that $0 \le a \le 1$, $0 \le b \le 1$ and $0 \le a + b \le 1$.
$g(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)})$ is a predefined function, referred to here as the recommendation function. The recommendation function can satisfy the following properties: (1) it is monotonically increasing in $w_{j,i}$ and $s_j^{(l)}$; (2) it is monotonically decreasing in $d_i$, $d_j$, $D_i$ and $D_j$. There are many possible implementations of this function; three concrete examples are given below:
$$g(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)}) = \frac{w_{j,i}\, s_j^{(l)}}{d_i\, d_j} \qquad (9)$$
$$g(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)}) = \frac{w_{j,i}\, s_j^{(l)}}{D_i\, D_j} \qquad (10)$$
$$g(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)}) = \frac{2\, w_{j,i}\, s_j^{(l)}}{D_i + D_j} \qquad (11)$$
However, it should be clear to those skilled in the art that the score propagation formula (8) and the recommendation functions (9), (10) and (11) above are merely exemplary, not limiting; other formula forms can be designed as needed.
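A minimal sketch of the iterative propagation of formula (8), instantiated with the recommendation function of formula (11); the parameter values `a`, `b`, the tolerance, the toy graph in the usage below, and the function name are illustrative assumptions, not values fixed by the patent:

```python
def propagate_scores(links, init, a=0.1, b=0.3, max_iter=50, tol=1e-6):
    """links: dict mapping (j, i) -> weight w_{j,i} of the link from j to i.
    init: dict mapping word -> initial merged score s_i^(0).
    Returns the final word scores after convergence or max_iter rounds."""
    # Weighted degree D_i: sum of weights of links leaving node i.
    D = {w: 0.0 for w in init}
    neighbors_in = {w: [] for w in init}          # nodes j that link into i
    for (j, i), wt in links.items():
        D[j] += wt
        neighbors_in[i].append((j, wt))

    s = dict(init)
    for _ in range(max_iter):
        nxt = {}
        for i in init:
            # g(...) from formula (11): 2 * w_{j,i} * s_j / (D_i + D_j)
            recv = sum(2.0 * wt * s[j] / (D[i] + D[j])
                       for j, wt in neighbors_in[i] if D[i] + D[j] > 0)
            nxt[i] = a + b * init[i] + (1.0 - a - b) * recv
        if max(abs(nxt[w] - s[w]) for w in init) < tol:  # scores stabilized
            s = nxt
            break
        s = nxt
    return s
```

On a two-node graph with mutual links of weight 1, the iteration contracts to the fixed point of the two coupled linear equations, so the returned scores are independent of the iteration order.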
Step S240 thus finally determines the word score of each word in the candidate keyword set.
Then, at step S250, a word similarity graph is constructed to obtain the similarities between the words in the candidate keyword set. In the word similarity graph, each node corresponds to a word in the candidate keyword set, every pair of nodes is linked, and the weight of a link represents the similarity of the two words at its endpoints. The computation of the similarity between two words is essentially the same as the similarity computation described with reference to step S130 of Fig. 1, and is not repeated here.
Next, the process advances to step S260, where the nodes of the word similarity graph are clustered in a balanced way. Here, based on the word recommendation graph obtained in step S240, each node in the word similarity graph can carry the word score of the corresponding word.
By way of example, as described earlier with reference to step S130 of Fig. 1, the objective function of the clustering process here can adopt the aforementioned formula (3):
$$\sum_{k=1}^{c} \sum_{i,j \in \pi_k} e(i,j) + \alpha \cdot H(n_{1 \sim c}) + \beta \cdot H(S_{1 \sim c}) \qquad (3)$$
The three terms of this formula respectively embody the sum of the intra-cluster similarities of the clusters; the entropy of the distribution of the numbers of words in the clusters; and the entropy of the distribution of the word score sums of the clusters.
According to this example, the number of clusters is set equal to the number of keywords to be obtained, denoted $c$. As described above, the concrete clustering process is:
1) randomly divide the words into $c$ clusters;
2) for each word, tentatively place it in each of the other clusters, compute the resulting change in the objective function, and put the word into the cluster that increases the objective function the most;
3) if, over all words, the value of the objective function no longer increases, or the number of iterations reaches a pre-specified maximum, terminate and return the current clustering result.
The clustering method of the above embodiment not only makes the distribution of the numbers of elements in the clusters relatively balanced, but also makes the score and size distributions of the clusters relatively balanced, so that the output word clusters can cover the diverse topics of the given document.
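The greedy procedure above, with an objective of the shape of formula (3), can be sketched as follows; the similarity function `sim`, the entropy weights `alpha` and `beta`, the move schedule, and the guard against emptying a cluster are illustrative assumptions rather than the patent's prescribed values:

```python
import math
import random

def entropy(values):
    """Entropy of a distribution given as non-negative counts/sums."""
    total = sum(values)
    ps = [v / total for v in values if v > 0]
    return -sum(p * math.log(p) for p in ps)

def objective(clusters, sim, scores, alpha=1.0, beta=1.0):
    """Formula (3): intra-cluster similarity + size entropy + score entropy."""
    intra = sum(sim(i, j)
                for members in clusters
                for i in members for j in members if i != j)
    sizes = [len(m) for m in clusters]
    score_sums = [sum(scores[w] for w in m) for m in clusters]
    return intra + alpha * entropy(sizes) + beta * entropy(score_sums)

def balanced_cluster(words, sim, scores, c, max_iter=20, seed=0):
    rng = random.Random(seed)
    clusters = [set() for _ in range(c)]
    for w in words:                      # 1) random initial assignment
        clusters[rng.randrange(c)].add(w)
    for _ in range(max_iter):
        improved = False
        for w in words:                  # 2) tentatively move each word
            src = next(k for k, m in enumerate(clusters) if w in m)
            best_k, best_val = src, objective(clusters, sim, scores)
            for k in range(c):
                if k == src or len(clusters[src]) == 1:
                    continue
                clusters[src].discard(w); clusters[k].add(w)
                val = objective(clusters, sim, scores)
                clusters[k].discard(w); clusters[src].add(w)
                if val > best_val:
                    best_k, best_val = k, val
            if best_k != src:
                clusters[src].discard(w); clusters[best_k].add(w)
                improved = True
        if not improved:                 # 3) stop when no move helps
            break
    return clusters
```

The entropy terms reward partitions whose cluster sizes and score sums are evenly distributed, which is what yields the balance described above.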
After clustering, the process advances to step S270, where document keywords are chosen based on the clusters obtained by clustering and a domain dictionary.
A domain dictionary contains the most frequently used terms of a certain field; it can be determined by domain experts or chosen by experiment. Using a domain dictionary here rests on the following consideration: in some cases, compared with the words already in a cluster, one or more words in the domain dictionary may describe the topic of the document better, and such words should then have a relatively large similarity or correlation with the words already in the cluster.
Concretely, as a first example, the process of choosing document keywords based on the clusters and the domain dictionary can proceed as follows:
For each cluster,
compute a selection score for each word in the cluster, based on the word's own word score and its similarity to the other words in the cluster;
compute a selection score for each word in the predetermined domain dictionary, based on its similarity to the words in the cluster; and
from the words of the domain dictionary and the words of the cluster, select the word with the maximum selection score as a keyword of the document.
As a second example, the process of choosing document keywords based on the clusters and the domain dictionary can also proceed as follows:
For each cluster,
compute a selection score for each word in the cluster, based on the word's own word score, the ratio of that score to the sum of the word scores in the cluster, and the word's similarity to the other words in the cluster;
compute a selection score for each word in the predetermined domain dictionary, based on its similarity to the words in the cluster; and
from the words of the domain dictionary and the words of the cluster, select a first predetermined number of words with the largest selection scores.
Then sort the words selected for all the clusters together in descending order of selection score, and take the top second predetermined number of words as the keywords of the document.
In the first example, supposing 5 clusters are obtained after clustering, each word cluster is condensed, with the help of the domain dictionary, into one keyword, finally yielding 5 keywords.
In the second example, again supposing 5 clusters are obtained after clustering, for each cluster, for example, 2 words are selected from the cluster or the domain dictionary, yielding 10 words in total; these 10 words are sorted by selection score from high to low, and a number of top-scoring words specified by, for example, the user or the system, say the 8 words with the highest scores, are taken as the document keywords.
The selection score of a word characterizes the merit of choosing that word as a document keyword.
An example of computing the selection scores of the words in a cluster and in the domain dictionary is given below.
Suppose the set of scored words generated in step S240 above is denoted $\{(t_1, s_1), (t_2, s_2), \ldots, (t_u, s_u)\}$, where $t_i$ is the $i$-th word, $s_i$ is the score of the $i$-th word, and $u$ is the number of words in the candidate keyword set. Suppose the domain dictionary is denoted $\{t_{u+1}, t_{u+2}, \ldots, t_{u+v}\}$, where $t_{u+i}$ is the $i$-th word in the domain dictionary and $v$ is the number of words in the domain dictionary. Here we assume the word scores of the domain-dictionary words satisfy $s_{u+1} = s_{u+2} = \cdots = s_{u+v} = 0$. For a word $i$ in word cluster $k$ or in the domain dictionary (that is, $i \in \pi_k$ or $u+1 \le i \le u+v$, where $\pi_k$ denotes the $k$-th word cluster), one kind of selection score can be computed by the following formula (12):
$$\frac{1}{n_k} \sum_{j \in \pi_k} e(i,j) + \gamma \cdot s_i + \eta \cdot \frac{s_i}{\sum_{j \in \pi_k} s_j} \qquad (12)$$
where $n_k$ is the number of words in word cluster $k$; $\gamma$ is a pre-specified parameter representing the importance of the word score itself within the selection score; and $\eta$ is a pre-specified parameter representing the importance, within the selection score, of the ratio of this word's score to the sum of the word scores in the cluster.
The computation formula (12) for the selection score above is only an example; other computation formulas can be designed as needed. For instance, the following formula (13) can alternatively be used to compute the selection score:
$$\frac{1}{n_k} \sum_{j \in \pi_k} e(i,j) + \gamma \cdot s_i + \eta \cdot \frac{s_i}{s_{\max(k)}} \qquad (13)$$
where $n_k$ is the number of words in word cluster $k$; $s_{\max(k)}$ is the maximum word score in word cluster $k$; $\gamma$ is a pre-specified parameter representing the importance of the word score itself within the selection score; and $\eta$ is a pre-specified parameter representing the importance, within the selection score, of the ratio of this word's score to the maximum word score in the cluster.
The selection-score formulas (12) and (13) above take three factors into account: the word's own word score, the ratio of that score to the sum (or maximum) of the word scores in the cluster, and the word's similarity to the other words in the cluster. This is only an example, however; one or two of these factors could be used alone, or further factors could be considered as needed.
In this embodiment, by choosing document keywords from the domain dictionary as well as from the clusters, words that do not appear in the original document but describe some aspect of it better can be output as keywords.
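A minimal sketch of the selection score of formula (12); the parameter values `gamma` and `eta`, the similarity function, and the convention that domain-dictionary words carry a word score of 0 (as assumed in the text above) are illustrative:

```python
def selection_score(word, cluster_scores, sim, gamma=0.5, eta=0.5):
    """Formula (12), one plausible reading.
    cluster_scores: dict word -> word score for the words of cluster k;
    domain-dictionary words are simply absent and get score 0.
    sim(a, b): similarity e(i, j) between two words."""
    n_k = len(cluster_scores)
    # Average similarity of `word` to the words of the cluster.
    avg_sim = sum(sim(word, j) for j in cluster_scores) / n_k
    s_i = cluster_scores.get(word, 0.0)       # 0 for domain-dictionary words
    total = sum(cluster_scores.values())
    return avg_sim + gamma * s_i + eta * (s_i / total if total else 0.0)
```

A domain-dictionary word can therefore only win on the similarity term, which matches the intent that it should beat the cluster's own words only when it fits the cluster's topic especially well.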
Fig. 3 shows a block diagram of a document keyword generating apparatus according to an embodiment of the invention.
The document keyword generating apparatus can comprise: a candidate keyword set obtaining component 102, for obtaining multiple preliminary keyword sets and merging them to obtain the candidate keyword set; a word score determining component 103, for determining a word score for each word in the candidate keyword set; a clustering component 104, for clustering the words based on the word score of each word and the similarities between words; and a keyword assigning component 105, for assigning keywords to the document based on the clusters obtained by clustering.
The clustering component 104 can compute the objective function of the clustering based on the following three factors: the sum of the similarities between the words within each cluster; the entropy of the distribution of the numbers of words in the clusters; and the entropy of the distribution of the word score sums of the clusters.
The multiple preliminary keyword sets can comprise at least one of the following: multiple corresponding keyword sets obtained by multiple existing keyword extraction algorithms; the set of nouns and noun phrases in the document; the set of keywords or index words specified by the document author; and the set of "seed" words specified by the user.
Determining a word score for each word in the candidate keyword set by the word score determining component 103 can comprise: constructing a word recommendation graph from the words in the initial candidate keyword set, in which each node corresponds to a word, links between word nodes are established according to the recommendation relations between words and carry weights corresponding to those relations, and each word node has an initial score; propagating the score of each word node iteratively to its neighboring word nodes, each word node retaining a portion of its original score after each round of propagation; and ending the propagation process upon convergence or upon reaching a certain maximum number of iterations.
After each round of propagation, the score a word node obtains from a neighbor node can depend on the neighbor's score and on the weight of the link between the word node and the neighbor node.
If two words belong to the same sentence, the weight of the link between the two words can depend on the grammatical relation between the two words in the sentence's syntax tree.
In addition, the similarity between two words can be determined from the co-occurrence statistics, in a predetermined document collection, of the two words or of the words they contain.
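The patent does not fix a particular co-occurrence formula, so the sketch below uses the Jaccard coefficient over document-level co-occurrence as one plausible instantiation; the function name and the token-list document representation are assumptions:

```python
def cooccurrence_similarity(word_a, word_b, documents):
    """Similarity from co-occurrence statistics in a document collection.
    documents: iterable of token lists from the predetermined collection.
    Returns the Jaccard coefficient of the sets of documents
    containing each word (one plausible instantiation, not the
    patent's prescribed formula)."""
    docs_a = {k for k, doc in enumerate(documents) if word_a in doc}
    docs_b = {k for k, doc in enumerate(documents) if word_b in doc}
    union = docs_a | docs_b
    if not union:
        return 0.0
    return len(docs_a & docs_b) / len(union)    # Jaccard coefficient
```

For multi-word phrases, the same statistic could be applied to the words the phrases contain, as the text suggests.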
Assigning keywords to the document by the keyword assigning component 105 can comprise: for each cluster, computing a selection score for each word in the cluster, based on the word's own word score and its similarity to the other words in the cluster; computing a selection score for each word in the predetermined domain dictionary, based on its similarity to the words in the cluster; and selecting, from the words of the domain dictionary and the words of the cluster, the word with the maximum selection score as a keyword of the document.
Alternatively, assigning keywords to the document by the keyword assigning component 105 can comprise: for each cluster, computing a selection score for each word in the cluster, based on the word's own word score, the ratio of that score to the sum of the word scores in the cluster, and the word's similarity to the other words in the cluster; computing a selection score for each word in the predetermined domain dictionary, based on its similarity to the words in the cluster; selecting, from the words of the domain dictionary and the words of the cluster, a first predetermined number of words with the largest selection scores; sorting the words thus selected for all the clusters together in descending order of selection score; and taking the top second predetermined number of words as the keywords of the document.
Finally, a description is given, with reference to Fig. 4, of an example hardware configuration for implementing the above document keyword generating apparatus. A CPU (central processing unit) 701 performs various kinds of processing according to programs stored in a ROM (read-only memory) 702 or a storage section 708; for example, the CPU executes the document keyword generation program described in the above embodiments. A RAM (random access memory) 703 stores, as appropriate, the programs executed by the CPU 701, data, and so on. The CPU 701, ROM 702 and RAM 703 are interconnected via a bus 704.
The CPU 701 is connected to an input/output interface 705 via the bus 704. An input section 706 comprising a keyboard, a mouse, a microphone and the like, and an output section 707 comprising a display, a speaker and the like, are connected to the input/output interface 705. The CPU 701 performs various kinds of processing according to instructions input from the input section 706, and outputs the processing results to the output section 707.
The storage section 708 connected to the input/output interface 705 comprises, for example, a hard disk, and stores the programs executed by the CPU 701 and various data. A communication section 709 communicates with external devices via networks such as the Internet or a local area network.
A drive 710 connected to the input/output interface 705 drives removable media 711 such as magnetic disks, optical discs, magneto-optical discs or semiconductor memories, and obtains the programs, data and so on recorded thereon. The obtained programs and data are transferred to the storage section 708 when needed, and stored there.
The basic principles of the present invention have been described above in conjunction with specific embodiments. It should be noted, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the methods and apparatuses of the present invention can be implemented in hardware, firmware, software, or a combination thereof, in any computing device (including processors, storage media and the like) or network of computing devices, which those of ordinary skill in the art can accomplish with their basic programming skills after having read the description of the present invention.
Therefore, the objects of the present invention can also be achieved by running a program or a set of programs on any computing device, which may be a well-known general-purpose device. Accordingly, the objects of the present invention can also be achieved merely by providing a program product containing program code that implements the described methods or apparatuses. That is to say, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium can be any known storage medium or any storage medium developed in the future.
It should also be pointed out that, in the apparatuses and methods of the present invention, the components or steps can obviously be decomposed and/or recombined, and such decompositions and/or recombinations should be regarded as equivalents of the present invention. Moreover, the steps of the above series of processes can naturally be carried out in the chronological order described, but need not necessarily be; some steps can be carried out in parallel or independently of one another. For example, the determination of the word scores and the computation of the similarities between words can be carried out sequentially, concurrently, independently, or in any order.
The above embodiments do not limit the scope of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may occur depending on design requirements and other factors. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A document keyword generation method, comprising:
obtaining multiple preliminary keyword sets and merging them to obtain a candidate keyword set;
determining a word score for each word in the candidate keyword set;
clustering the words based on the word score of each word and the similarities between words, such that the word score sums and sizes of the clusters are balanced and the numbers of words in the clusters are evenly distributed; and
assigning keywords to the document based on the clusters obtained by clustering.
2. The keyword generation method according to claim 1, wherein the objective function of said clustering is computed based on the following three factors:
the sum of the similarities between the words within each cluster;
the entropy of the distribution of the numbers of words in the clusters; and
the entropy of the distribution of the word score sums of the clusters.
3. The keyword generation method according to claim 1, wherein said multiple preliminary keyword sets comprise at least one of the following: multiple corresponding keyword sets obtained by multiple existing keyword extraction algorithms; the set of nouns and noun phrases in the document; the set of keywords or index words specified by the document author; and the set of "seed" words specified by the user.
4. The keyword generation method according to claim 1, wherein said determining a word score for each word in the candidate keyword set comprises:
constructing a word recommendation graph from the words in the initial candidate keyword set, in which each node corresponds to a word, links between word nodes are established according to the recommendation relations between words and carry weights corresponding to those relations, and each word node has an initial score; propagating the score of each word node iteratively to its neighboring word nodes, each word node retaining a portion of its original score after each round of propagation; and ending the propagation process upon convergence or upon reaching a certain maximum number of iterations.
5. The keyword generation method according to claim 4, wherein, after each round of propagation, the score a word node obtains from a neighbor node depends on the neighbor's score and on the weight of the link between the word node and the neighbor node.
6. The keyword generation method according to claim 5, wherein, if two words belong to the same sentence, the weight of the link between the two words depends on the grammatical relation between the two words in the sentence's syntax tree.
7. The keyword generation method according to claim 1, wherein the similarity between two words is determined from the co-occurrence statistics, in a predetermined document collection, of the two words or of the words they contain.
8. The keyword generation method according to claim 1, wherein said assigning keywords to the document comprises:
for each cluster,
computing a selection score for each word in the cluster, based on the word's own word score and its similarity to the other words in the cluster;
computing a selection score for each word in the predetermined domain dictionary, based on its similarity to the words in the cluster; and
selecting, from the words of the domain dictionary and the words of the cluster, the word with the maximum selection score as a keyword of the document.
9. The keyword generation method according to claim 1, wherein said assigning keywords to the document comprises:
for each cluster,
computing a selection score for each word in the cluster, based on the word's own word score, the ratio of that score to the sum of the word scores in the cluster, and the word's similarity to the other words in the cluster;
computing a selection score for each word in the predetermined domain dictionary, based on its similarity to the words in the cluster; and
selecting, from the words of the domain dictionary and the words of the cluster, a first predetermined number of words with the largest selection scores; and
sorting the words thus selected for all the clusters together in descending order of selection score, and taking the top second predetermined number of words as the keywords of the document.
10. A document keyword generating apparatus, comprising:
a candidate keyword set obtaining component, for obtaining multiple preliminary keyword sets and merging them to obtain a candidate keyword set;
a word score determining component, for determining a word score for each word in the candidate keyword set;
a clustering component, for clustering the words based on the word score of each word and the similarities between words, such that the word score sums and sizes of the clusters are balanced and the numbers of words in the clusters are evenly distributed; and
a keyword assigning component, for assigning keywords to the document based on the clusters obtained by clustering.
CN201010208994.8A 2010-06-25 2010-06-25 Method and device for generating document keywords Expired - Fee Related CN102298576B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010208994.8A CN102298576B (en) 2010-06-25 2010-06-25 Method and device for generating document keywords

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010208994.8A CN102298576B (en) 2010-06-25 2010-06-25 Method and device for generating document keywords

Publications (2)

Publication Number Publication Date
CN102298576A CN102298576A (en) 2011-12-28
CN102298576B true CN102298576B (en) 2014-07-02

Family

ID=45358999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010208994.8A Expired - Fee Related CN102298576B (en) 2010-06-25 2010-06-25 Method and device for generating document keywords

Country Status (1)

Country Link
CN (1) CN102298576B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SG11201406913VA (en) * 2012-04-26 2014-12-30 Nec Corp Text mining system, text mining method, and program
CN103455487B (en) * 2012-05-29 2018-07-06 腾讯科技(深圳)有限公司 The extracting method and device of a kind of search term
CN103870458B (en) * 2012-12-07 2017-07-18 富士通株式会社 Data processing equipment, data processing method and program
CN103970756B (en) * 2013-01-28 2018-12-28 腾讯科技(深圳)有限公司 hot topic extracting method, device and server
CN104063387B (en) * 2013-03-19 2017-07-28 三星电子(中国)研发中心 Apparatus and method of extracting keywords in the text
CN104239300B (en) * 2013-06-06 2017-10-20 富士通株式会社 The method and apparatus that semantic key words are excavated from text
US9619450B2 (en) * 2013-06-27 2017-04-11 Google Inc. Automatic generation of headlines
CN103425748B (en) * 2013-07-19 2017-06-06 百度在线网络技术(北京)有限公司 A kind of document resources advise the method for digging and device of word
CN103399901B (en) * 2013-07-25 2016-06-08 三星电子(中国)研发中心 A kind of keyword abstraction method
CN103646074B (en) * 2013-12-11 2017-06-23 北京奇虎科技有限公司 It is a kind of to determine the method and device that picture cluster describes text core word
CN105138523A (en) * 2014-05-30 2015-12-09 富士通株式会社 Method and device for determining semantic keywords in text
CN105224682B (en) * 2015-10-27 2018-06-05 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN106251035A (en) * 2016-07-15 2016-12-21 国网北京市电力公司 The data processing method calculated for the project indicator and device
CN106302650A (en) * 2016-07-29 2017-01-04 深圳市沃特沃德股份有限公司 House pet application message method for pushing and device
CN106599269B (en) * 2016-12-22 2019-12-03 东软集团股份有限公司 Keyword extracting method and device
CN108241652A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 Keyword clustering method and device
CN108536676B (en) * 2018-03-28 2020-10-13 广州华多网络科技有限公司 Data processing method and device, electronic equipment and storage medium
CN109710937A (en) * 2018-12-27 2019-05-03 南京大学 Interdependent syntax tree constructs system
CN109885752B (en) * 2019-01-14 2021-03-02 口碑(上海)信息技术有限公司 Brand word mining method, device, equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1920814A (en) * 2006-05-09 2007-02-28 上海态格文化传播有限公司 Extensible intelligent internet search system
CN101174273A (en) * 2007-12-04 2008-05-07 清华大学 News event detecting method based on metadata analysis
CN101685455A (en) * 2008-09-28 2010-03-31 华为技术有限公司 Method and system of data retrieval

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8229957B2 (en) * 2005-04-22 2012-07-24 Google, Inc. Categorizing objects, such as documents and/or clusters, with respect to a taxonomy and data structures derived from such categorization
US7702680B2 (en) * 2006-11-02 2010-04-20 Microsoft Corporation Document summarization by maximizing informative content words
JP5316158B2 (en) * 2008-05-28 2013-10-16 株式会社リコー Information processing apparatus, full-text search method, full-text search program, and recording medium
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on steam clustering


Also Published As

Publication number Publication date
CN102298576A (en) 2011-12-28

Similar Documents

Publication Publication Date Title
CN102298576B (en) Method and device for generating document keywords
CN109670039B (en) Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis
CN109815336B (en) Text aggregation method and system
EP2486470B1 (en) System and method for inputting text into electronic devices
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
CN111125334A (en) Search question-answering system based on pre-training
CN109597493B (en) Expression recommendation method and device
CN111782961B (en) Answer recommendation method oriented to machine reading understanding
CN109816438B (en) Information pushing method and device
KR20200007713A (en) Method and Apparatus for determining a topic based on sentiment analysis
CN113704416B (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
Pérez-Sancho et al. Genre classification using chords and stochastic language models
Dhola et al. A comparative evaluation of traditional machine learning and deep learning classification techniques for sentiment analysis
CN116304063B (en) Simple emotion knowledge enhancement prompt tuning aspect-level emotion classification method
Tasharofi et al. Evaluation of statistical part of speech tagging of Persian text
Kenter et al. Byte-level machine reading across morphologically varied languages
CN115422939A (en) Fine-grained commodity named entity identification method based on big data
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN111859955A (en) Public opinion data analysis model based on deep learning
CN116403231A (en) Multi-hop reading understanding method and system based on double-view contrast learning and graph pruning
CN115809658A (en) Parallel corpus generation method and device and unsupervised synonymy transcription method and device
CN111538898B (en) Web service package recommendation method and system based on combined feature extraction
Wang et al. Weakly Supervised Chinese short text classification algorithm based on ConWea model
Ivanchyshyn et al. The Film Script Generation Analysis Based on the Fiction Book Text Using Machine Learning
Alvarado et al. Detecting Disaster Tweets using a Natural Language Processing technique

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140702

Termination date: 20200625