CN114912446A

CN114912446A - Keyword extraction method and device and storage medium

Info

Publication number: CN114912446A
Application number: CN202210473957.2A
Authority: CN
Inventors: 施震; 黄晨; 汤文华; 文卫东; 李旭晖
Original assignee: China Securities Credit Investment Co Ltd
Current assignee: China Securities Credit Investment Co Ltd
Priority date: 2022-04-29
Filing date: 2022-04-29
Publication date: 2022-08-16

Abstract

The invention discloses a keyword extraction method, a keyword extraction device and a storage medium. The method comprises the following steps: performing word segmentation on the text to be extracted; constructing a participle word graph; generating a word vector for each participle according to its sememes; calculating the word meaning similarity between adjacent participles in the participle word graph according to the word vector of each participle, and calculating the initial score of each participle according to the word meaning similarity, so as to obtain candidate keywords through screening; and processing the initial score according to the term frequency-inverse document frequency (TF-IDF) value of each candidate keyword to obtain a final score, from which the keywords are obtained through screening. On the basis of a word graph model, sememe information is fused into the word senses of the participles, so that the word vectors of participles with multiple senses are distinguished in different contexts; the score of each participle is then calculated by combining the co-occurrence relation among the participles with their word sense information, and the score is corrected according to the term frequency and the inverse document frequency, so that the keyword extraction effect is improved.

Description

Keyword extraction method and device and storage medium

Technical Field

The present invention relates to the field of natural language processing, and in particular, to a keyword extraction method, device and storage medium.

Background

In recent years, text keyword extraction methods have mainly been divided into two types according to how the model is trained: unsupervised methods and supervised methods. Supervised methods convert keyword extraction into a binary classification problem or a sequence labeling problem that judges whether each word in the text is a keyword. With the rapid development of deep learning, supervised methods that extract keywords with deep learning models have emerged one after another and achieve high precision and recall. However, training such models relies on large-scale corpora and high-quality manual labeling and consumes a large amount of resources. In contrast, unsupervised methods do not depend on large-scale corpora or manual labeling and are convenient and quick. Existing unsupervised keyword extraction methods fall into four main categories: statistics-based, topic-based, clustering-based and graph-model-based. Compared with the other categories, graph-model-based keyword extraction fully considers the structural characteristics of a text and the associations among its words, achieves a good extraction effect, and is widely applied.

Disclosure of Invention

The inventors found that existing unsupervised text keyword extraction methods have limited precision and recall, and that there is considerable room to improve the extraction effect. In order to at least partially solve the technical problems in the prior art, the inventors have made the present invention and provide the following technical solutions through specific embodiments:

in a first aspect, an embodiment of the present invention provides a keyword extraction method, including the following steps:

performing word segmentation on a text to be extracted to obtain a participle set;

constructing a participle word graph corresponding to the participle set according to a preset word graph model;

respectively generating word vectors corresponding to the participles according to the sememes of the participles in the participle set;

calculating word meaning similarity between adjacent participles in the participle word graph according to word vectors of the participles, and calculating initial scores of the participles in the participle word graph according to the word meaning similarity;

screening the participles in the participle set according to the initial scores to obtain at least one candidate keyword;

determining a term frequency-inverse document frequency (TF-IDF) value of each candidate keyword, and processing the TF-IDF value and the initial score to obtain a final score of each candidate keyword;

and screening the at least one candidate keyword according to the final score to obtain at least one keyword.

Further, generating the word vectors corresponding to the participles according to the sememes of the participles in the participle set includes:

determining the senses corresponding to each participle in the participle set and the sememes corresponding to each sense;

generating a sense vector for each sense according to the sememe vectors of the sememes corresponding to that sense;

and, according to the attention mechanism, respectively carrying out weighted summation on the sense vectors of the senses corresponding to each participle to obtain the word vector corresponding to that participle.

Further, generating the sense vector of each sense according to the sememe vectors of the sememes corresponding to that sense specifically includes:

calculating the average of the sememe vectors of all sememes corresponding to the sense to obtain the sense vector corresponding to the sense.

Further, according to the attention mechanism, the weighted summation of the sense vectors of the senses corresponding to each participle adopts the following calculation formula:

$$e = \sum_{j} \mathrm{att}\left(s_j^{(w)}\right) \cdot s_j^{(w)}$$

wherein $e$ represents the word vector of the participle $w$, $s_j^{(w)}$ represents the sense vector of the $j$-th sense of the participle $w$, and $\mathrm{att}(s_j^{(w)})$ represents the weight of the $j$-th sense of the participle $w$;

the weight of the $j$-th sense of the participle $w$ is calculated by adopting the following calculation formula:

$$\mathrm{att}\left(s_j^{(w)}\right) = \frac{\exp\left(w_c' \cdot s_j^{(w)}\right)}{\sum_{k} \exp\left(w_c' \cdot s_k^{(w)}\right)}$$

wherein $s_j^{(w)}$ and $s_k^{(w)}$ respectively represent the sense vectors of the $j$-th and $k$-th senses of the participle $w$, and $w_c'$ denotes the average of the word vectors of a preset number of participles before and after the participle $w$.

Further, the initial score of each participle in the participle word graph is calculated from the word sense similarity by adopting the following calculation formula:

$$S(w_i) = (1 - d) + d \sum_{w_j \in \mathrm{In}(w_i)} \frac{\mathrm{Sim}(w_i, w_j)}{\sum_{w_k \in \mathrm{Out}(w_j)} \mathrm{Sim}(w_k, w_j)} S(w_j)$$

wherein $w_i$, $w_j$ and $w_k$ respectively represent the $i$-th, $j$-th and $k$-th participles in the participle word graph; $S(w_i)$ and $S(w_j)$ respectively represent the initial scores of participles $w_i$ and $w_j$; $\mathrm{In}(w_i)$ represents the set of participles pointing to participle $w_i$ in the participle word graph; $\mathrm{Out}(w_j)$ represents the set of participles pointed to by participle $w_j$ in the participle word graph; $d$ is a damping factor; $\mathrm{Sim}(w_i, w_j)$ represents the word sense similarity between participles $w_i$ and $w_j$; and $\mathrm{Sim}(w_k, w_j)$ represents the word sense similarity between participles $w_k$ and $w_j$.

Further, the word meaning similarity between adjacent participles in the participle word graph is calculated from the word vector of each participle by adopting the following calculation formula:

$$\mathrm{Sim}(w_i, w_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert}$$

wherein $\mathrm{Sim}(w_i, w_j)$ represents the word sense similarity between participles $w_i$ and $w_j$, and $e_i$ and $e_j$ respectively represent the word vectors of participles $w_i$ and $w_j$.

Further, determining the term frequency-inverse document frequency (TF-IDF) value of each candidate keyword, and processing the TF-IDF value and the initial score to obtain the final score of each candidate keyword, includes:

respectively calculating the TF-IDF value of each candidate keyword according to the term frequency of the candidate keyword in the text to be extracted and its inverse document frequency in a preset corpus;

and, for each candidate keyword, normalizing the TF-IDF value and the initial score and performing weighted summation according to a preset weighting coefficient to obtain the final score of each candidate keyword.

Further, the word segmentation is performed on the text to be extracted to obtain a word segmentation set, which includes:

according to the knowledge field to which the text to be extracted belongs, segmenting the text to be extracted by using a dictionary of the corresponding field to obtain the participle set.

In a second aspect, an embodiment of the present invention provides a keyword extraction apparatus, including:

the text preprocessing module is used for performing word segmentation on a text to be extracted to obtain a participle set;

the word graph construction module is used for constructing a participle word graph corresponding to the participle set according to a preset word graph model;

the word vector generation module is used for respectively generating word vectors corresponding to the participles according to the sememes of the participles in the participle set;

the score calculation module is used for calculating word meaning similarity between adjacent participles in the participle word graph according to word vectors of the participles and calculating initial scores of the participles in the participle word graph according to the word meaning similarity;

the candidate keyword screening module is used for screening the participles in the participle set according to the initial scores to obtain at least one candidate keyword;

the score correction module is used for determining a term frequency-inverse document frequency (TF-IDF) value of each candidate keyword, and processing the TF-IDF value and the initial score to obtain a final score of each candidate keyword;

and the keyword screening module is used for screening the at least one candidate keyword according to the final score to obtain at least one keyword.

In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the keyword extraction method according to any one of the above schemes.

The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:

The method performs word segmentation on a text to be extracted to obtain a participle set, constructs a participle word graph according to the participle set, and generates a word vector for each participle according to its sememes; it then calculates the word meaning similarity between adjacent participles in the participle word graph according to the word vector of each participle, and from this calculates the initial score of each participle; finally, it processes the initial score according to the term frequency-inverse document frequency (TF-IDF) value of the participle to obtain a final score, and determines the text keywords according to the final score. On the basis of a word graph model, sememe information is fused into the word senses of the participles to obtain word vectors that carry more semantic information, so that the word vectors of participles with multiple senses are distinguished in different contexts. The score of each participle is calculated by combining the co-occurrence relation among the participles with their word sense information, so that the score is more accurate and the keyword extraction effect is improved. On this basis, the scores are corrected according to the term frequency and the inverse document frequency, which raises the final scores of low-frequency keywords, reduces the influence of irrelevant high-frequency words on the extraction result, and further improves the keyword extraction effect.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a schematic flowchart of a keyword extraction method according to a first embodiment of the present invention;

FIG. 2 is a flowchart illustrating another keyword extraction method according to an embodiment of the present invention;

FIG. 3 is a topological structure diagram of a word segmentation graph according to a first embodiment of the present invention;

FIG. 4 is a diagram illustrating semantic items and semantic information of a word segmentation according to a first embodiment of the present invention;

FIG. 5 is a block diagram illustrating a structure of a keyword extraction apparatus according to a second embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Embodiment One

Existing graph-model-based keyword extraction methods are mainly represented by the TextRank algorithm. Following the basic idea of a word graph model, the TextRank algorithm constructs a word graph from the constituent words of a text according to the contextual co-occurrence relations among the words, calculates the importance of each word in the word graph through a random walk algorithm, and determines the keywords by ranking the words by importance.

In calculating the importance of the words, the TextRank algorithm considers only the structural information of the text, assumes that every adjacent word in the word graph influences the central word to the same degree, and ignores the word meaning information among the words. Keywords are a group of words that can express the subject matter of a text, and words associated with a keyword appear in contexts close to it. Therefore, if the co-occurrence relationships between words in the text can be combined with word sense information, the importance of the words in the word graph can be calculated more accurately, and the keywords of the text can be extracted more effectively.

Based on this, as shown in fig. 1, an embodiment of the present invention provides a keyword extraction method, including the following steps:

S1, performing word segmentation on the text to be extracted to obtain a participle set.

Specifically, a word segmentation tool is used to segment the text to be extracted; the tool may be a preset word segmentation model, such as the open-source jieba segmenter. A participle set is obtained after segmentation, in which the participles are arranged in the order in which they appear in the text to be extracted, and which can be expressed as:

$$W = [w_1, w_2, w_3, \ldots, w_n]$$

wherein $W$ represents the participle set, and $w_n$ represents the $n$-th participle in the participle set.

Preferably, the dictionary of the corresponding field is used for word segmentation aiming at the knowledge field of the text to be extracted, so that the accuracy of the word segmentation result is improved. For example, when a financial text is segmented, a preset financial domain dictionary is loaded into a custom dictionary of a jieba segmenter, so that financial terms in the financial text are prevented from being split by mistake, and the accuracy of segmentation results is improved. The preset financial domain dictionary can be a dictionary formed by combining a highly-contained financial vocabulary English-Chinese dictionary with open financial domain hot words in the Internet.

Preferably, stop words are also removed from the participle set obtained by segmentation, yielding a participle set without stop words. Stop words generally include punctuation marks and irrelevant words such as common function words and modal particles; specifically, they can be removed by using an open-source Chinese stop word list from the Internet.
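As an illustration of the stop-word step, the following is a minimal sketch in Python. The function name `remove_stop_words` and the sample data are hypothetical and are not part of this disclosure; the tokens are assumed to have been produced already by a segmenter (for example jieba's `lcut`), which is not called here so that the sketch has no third-party dependencies.

```python
def remove_stop_words(tokens, stop_words):
    """Return tokens in their original order, dropping stop words and blanks."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# Illustrative data only; a real stop word list would be loaded from a file.
tokens = ["the", "bond", "issuer", ",", "defaulted", "on", "the", "bond"]
stop_words = {"the", "on", ",", "."}

participles = remove_stop_words(tokens, stop_words)
print(participles)
```

The order of the remaining participles is preserved, since the later co-occurrence window depends on it.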

S2, constructing a participle word graph corresponding to the participle set according to a preset word graph model.

It should be noted that the word graph model used in this embodiment may be a general word graph model in the field of keyword extraction; for details, refer to word graph models in the prior art. Specifically, according to the co-occurrence relationships of the participles in the participle set within a co-occurrence window of preset length, a participle word graph $G = (V, E)$ is constructed in the preset word graph model with the participles as nodes and the co-occurrence relationships as edges, wherein $V$ is the node set and $E$ is the edge set. In the keyword extraction task, each node $v$ represents a participle $w$. When participles $w_i$ and $w_j$ appear in the same co-occurrence window, two directed edges are added to the participle word graph, namely $v_i \to v_j$ and $v_j \to v_i$. The topology of the participle word graph is shown in FIG. 3.

In this embodiment, the length of the co-occurrence window refers to the size of the word-taking window, that is, the range of context words taken around the central word as the reference point. For example, when the length of the co-occurrence window is A, the number of context words obtained is 2A, where A is a positive integer. It should be noted that the length of the co-occurrence window may affect the final keyword extraction effect, and the specific length may be determined according to experimental results; for example, in one embodiment, the length of the co-occurrence window is set to 3.
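The graph construction described above can be sketched as follows. This is only an illustrative Python sketch: the function name `build_word_graph` and the adjacency-set representation are assumptions, not part of this disclosure. Because every co-occurrence adds edges in both directions, it is enough to look `window` positions ahead of each participle; the backward half of the window is covered by the symmetric edge.

```python
def build_word_graph(participles, window=3):
    """Return {participle: set of neighbours}; every co-occurrence within
    `window` positions adds the two directed edges v_i -> v_j and v_j -> v_i."""
    graph = {}
    for i, w_i in enumerate(participles):
        graph.setdefault(w_i, set())  # keep isolated participles as nodes
        for w_j in participles[i + 1 : i + 1 + window]:
            if w_j != w_i:  # assumption: no self-loops
                graph[w_i].add(w_j)
                graph.setdefault(w_j, set()).add(w_i)
    return graph


# Illustrative input only.
graph = build_word_graph(["a", "b", "c", "d"], window=1)
print(graph)
```

With `window=1` each participle is linked only to its immediate neighbours; the embodiment above mentions a window length of 3.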

S3, respectively generating word vectors corresponding to the participles according to the sememes of the participles in the participle set.

Specifically, each participle in the participle set, together with its sememe information, is input into a preset word vector training model for training. The preset word vector training model uses the co-occurrence relations among context participles and, by maximizing the conditional probability of a central word generating its surrounding words, generates vectors representing different word senses according to the different sememes of a participle. In addition, the model adopts an attention mechanism to assign different weights to the senses of a participle, so that the participle obtains different word vector representations in different contexts according to which sense applies. For example, "apple" has two senses, a brand and a fruit; in the context "I own an iPhone", the weight of the brand sense is higher. Regarding the hyper-parameters of the preset word vector training model, the length of the word vector training window may be set to 3, and the dimension of the word vectors may be set to 200.

In an embodiment, as shown in fig. 2, the step S3 specifically includes:

S31, determining the senses (Sense) corresponding to each participle (Word) in the participle set and the sememes (Sememe) corresponding to each sense.

Specifically, the senses corresponding to each participle and the sememes corresponding to each sense are determined according to the sense and sememe information of words in a preset knowledge base. The preset knowledge base, such as the HowNet knowledge base, specifies the sense and sememe information corresponding to each word. Taking "apple" as an example, the sense and sememe information corresponding to "apple" in the HowNet knowledge base is shown in FIG. 4.

S32, generating a sense vector for each sense according to the sememe vectors of the sememes corresponding to that sense.

Specifically, the participles and their sense and sememe information are input into the preset word vector training model for training, a sememe vector is generated for each sememe, and a sense vector corresponding to each sense is then generated according to the sememe vectors. When a sense corresponds to multiple sememes, in one embodiment, the sense vector is obtained by calculating the average of the sememe vectors of the sememes corresponding to the sense. In another embodiment, the sense vector may also be obtained by performing weighted summation on the sememe vectors of the sememes corresponding to the sense.

S33, respectively carrying out weighted summation on the sense vectors of the senses corresponding to each participle according to the attention mechanism to obtain the word vector corresponding to that participle.

Specifically, the weighted summation may be performed on the sense vectors of the senses corresponding to the participle by using the following calculation formula:

$$e = \sum_{j} \mathrm{att}\left(s_j^{(w)}\right) \cdot s_j^{(w)}$$

wherein $e$ represents the word vector of the participle $w$, $s_j^{(w)}$ represents the sense vector of the $j$-th sense of the participle $w$, and $\mathrm{att}(s_j^{(w)})$ represents the weight of the $j$-th sense of the participle $w$, which is calculated as:

$$\mathrm{att}\left(s_j^{(w)}\right) = \frac{\exp\left(w_c' \cdot s_j^{(w)}\right)}{\sum_{k} \exp\left(w_c' \cdot s_k^{(w)}\right)}$$

wherein $s_j^{(w)}$ and $s_k^{(w)}$ respectively represent the sense vectors of the $j$-th and $k$-th senses of the participle $w$; $w_c'$ represents the average of the word vectors of a preset number of participles before and after the participle $w$. The value of the preset number is related to the training window length of the word vectors, and an optimal value is obtained according to experimental results; for example, in one embodiment, the preset number is set to 2.

The sense vector may be the average of the sememe vectors of the sememes corresponding to the sense, and the sememe vectors are obtained by training the preset word vector training model.
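Steps S32 and S33 can be illustrated with the following Python sketch, which assumes the sememe vectors and context word vectors have already been trained. It averages sememe vectors into sense vectors and then combines the sense vectors with attention weights, taken here as a softmax over the dot products between the context average and each sense vector. The function names and the two-dimensional toy vectors are hypothetical and serve only to show the arithmetic.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def word_vector(sememe_vectors_per_sense, context_vectors):
    """Return (word vector e, attention weights), one weight per sense."""
    # S32: each sense vector is the average of its sememe vectors.
    sense_vectors = [mean_vector(s) for s in sememe_vectors_per_sense]
    # Context representation: average of the surrounding word vectors.
    w_c = mean_vector(context_vectors)
    # S33: softmax over dot products (max subtracted for numerical stability).
    logits = [dot(w_c, s) for s in sense_vectors]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    weights = [x / total for x in exps]
    e = [
        sum(w * s[dim] for w, s in zip(weights, sense_vectors))
        for dim in range(len(w_c))
    ]
    return e, weights


# Toy data: sense 0 ("brand") and sense 1 ("fruit") of one participle.
senses = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]]
context = [[2.0, 0.0], [2.0, 0.0]]  # context vectors close to the brand sense
e, weights = word_vector(senses, context)
print(weights, e)
```

In this toy case the context lies along the first sense vector, so the first sense receives the larger weight and dominates the resulting word vector.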

S4, according to the word vector of each participle, calculating to obtain the word meaning similarity between adjacent participles in the participle word graph, and according to the word meaning similarity, calculating to obtain the initial score of each participle in the participle word graph.

It is easy to know that, using the basic TextRank algorithm, the calculation formula of each node score in the word graph is as follows:

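$$S(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{1}{\lvert \mathrm{Out}(V_j) \rvert} S(V_j)$$

wherein $S(V_i)$ and $S(V_j)$ respectively represent the scores of nodes $V_i$ and $V_j$; $\mathrm{In}(V_i)$ is the set of nodes pointing to $V_i$; $\mathrm{Out}(V_j)$ is the set of nodes pointed to by node $V_j$; $\lvert \mathrm{In}(V_i) \rvert$ is the number of nodes linking to $V_i$ and, likewise, $\lvert \mathrm{Out}(V_j) \rvert$ is the number of nodes that $V_j$ points to; $d$ is a damping factor, which is typically set to 0.85.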

When the algorithm is used for solving the scores of all the participles in the participle word graph, only the structural information of the text to be extracted is considered, the influence degree of each adjacent participle in the participle word graph on the central participle is considered to be the same, and the word meaning relation among the participles is ignored. On the basis of a TextRank algorithm, the embodiment of the invention uses word sense similarity to replace uniform weight as the weight of the edge in the participle word graph, and calculates the initial score S of each participle, wherein the calculation formula is as follows:

$$S(w_i) = (1 - d) + d \sum_{w_j \in \mathrm{In}(w_i)} \frac{\mathrm{Sim}(w_i, w_j)}{\sum_{w_k \in \mathrm{Out}(w_j)} \mathrm{Sim}(w_k, w_j)} S(w_j)$$

wherein $w_i$, $w_j$ and $w_k$ respectively represent the $i$-th, $j$-th and $k$-th participles in the participle word graph; $S(w_i)$ and $S(w_j)$ respectively represent the initial scores of participles $w_i$ and $w_j$; $\mathrm{In}(w_i)$ represents the set of participles pointing to participle $w_i$ in the participle word graph, and $\mathrm{Out}(w_j)$ represents the set of participles pointed to by participle $w_j$; $d$ is a damping factor, which is typically set to 0.85; $\mathrm{Sim}(w_i, w_j)$ represents the word sense similarity between participles $w_i$ and $w_j$, and $\mathrm{Sim}(w_k, w_j)$ represents the word sense similarity between participles $w_k$ and $w_j$.
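The similarity-weighted score update can be sketched in Python as below. This is a minimal illustration under several assumptions that are not stated in this disclosure: the graph is given as symmetric adjacency sets, similarities are supplied as a dictionary keyed by unordered pairs and are strictly positive, all scores start at 1.0, and a fixed number of iterations is used instead of a convergence test. The name `weighted_textrank` is hypothetical.

```python
def weighted_textrank(graph, sim, d=0.85, iterations=50):
    """graph: {participle: set of neighbours} with symmetric edges.
    sim: {frozenset({a, b}): similarity}, assumed strictly positive."""

    def s(a, b):
        return sim[frozenset((a, b))]

    # Denominator of the update: total similarity on the edges leaving w_j.
    out_sum = {w_j: sum(s(w_k, w_j) for w_k in graph[w_j]) for w_j in graph}

    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        # The right-hand side is evaluated with the previous scores.
        scores = {
            w_i: (1 - d)
            + d * sum(s(w_i, w_j) / out_sum[w_j] * scores[w_j] for w_j in graph[w_i])
            for w_i in graph
        }
    return scores


# Toy path graph a - b - c where b is far more similar to a than to c.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
sim = {frozenset(("a", "b")): 0.9, frozenset(("b", "c")): 0.1}
scores = weighted_textrank(graph, sim)
print(scores)
```

In the toy graph, participles a and c occupy structurally identical positions, so uniform edge weights would score them equally; with similarity weights, a receives most of b's score.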

In this embodiment, the word meaning similarity is calculated according to the word segmentation word vectors trained by the preset word vector training model, and the value of the word meaning similarity may be the cosine similarity between the word segmentation word vectors, and the specific calculation formula is as follows:

$$\mathrm{Sim}(w_i, w_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert}$$

wherein $e_i$ and $e_j$ are respectively the word vectors of participles $w_i$ and $w_j$, trained by the preset word vector training model.
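For reference, cosine similarity between two word vectors can be computed as in the following Python sketch. The function name is hypothetical, and returning 0.0 for a zero vector is an assumption made here to avoid division by zero; this disclosure does not address that case.

```python
import math


def cosine_similarity(e_i, e_j):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(e_i, e_j))
    norm_i = math.sqrt(sum(a * a for a in e_i))
    norm_j = math.sqrt(sum(b * b for b in e_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_i * norm_j)


print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```

Cosine similarity can be zero or negative for some vector pairs, so an implementation that uses it as an edge weight may need to clip or shift the values; how to do so is left open here.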

The embodiment of the invention generates word vectors containing more word meaning information by integrating the sememes in the process of training the word vectors, calculates the word meaning similarity between the participles according to the word vectors, and measures the relation between the adjacent participles and the central word in the participle word graph by using the word meaning similarity between the participles, namely calculates the score of each participle by using the word meaning similarity between the participles as the weight of the edge in the participle word graph, thereby better embodying the importance of each participle in the text semantics.

S5, screening the participles in the participle set according to the initial scores to obtain at least one candidate keyword.

Specifically, the participles in the participle set are screened according to their initial scores and a preset condition. The preset condition can be set in advance according to the application scenario; for example, it may be that a participle ranks among the top N when the participles are sorted by initial score in descending order, in which case N candidate keywords are obtained after screening. N is a positive integer whose value can be preset according to the requirements of the application scenario. In one embodiment, N may also be the number of all participles in the participle set, that is, all participles in the participle set are taken as candidate keywords, so as to avoid missing keywords.

S6, determining a term frequency-inverse document frequency (TF-IDF) value of each candidate keyword, and processing the TF-IDF value and the initial score to obtain a final score of each candidate keyword.

It is understood that the scoring method of step S4 may give a higher initial score to words that occur more frequently in the text, although some common high-frequency words are not keywords of the text, and may give a relatively low initial score to keywords that occur less frequently in the text. In order to raise the scores of low-frequency keywords and lower the scores of irrelevant high-frequency words, this embodiment introduces the TF-IDF value to correct the initial score.

In this embodiment, the TF-IDF value of a candidate keyword equals its term frequency (TF) multiplied by its inverse document frequency (IDF). TF is the frequency with which the candidate keyword appears in the text to be extracted; IDF may be obtained by dividing the total number of texts in the preset corpus by the number of texts containing the candidate keyword and then taking the logarithm of the quotient. The lower the proportion of texts in the preset corpus that contain the candidate keyword, the larger its IDF and the better its ability to distinguish between texts.
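The TF-IDF computation described above can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements of this disclosure: TF is taken as the relative frequency (count divided by the number of participles in the text), and a candidate that appears in no corpus text is treated as if it appeared in one, to avoid division by zero. The function name and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(candidates, participles, corpus):
    """candidates: iterable of candidate keywords.
    participles: participles of the text to be extracted, in order.
    corpus: list of texts, each given as a set of participles."""
    counts = Counter(participles)
    total = len(participles)
    n_texts = len(corpus)
    values = {}
    for w in candidates:
        tf = counts[w] / total  # assumption: relative frequency
        df = sum(1 for text in corpus if w in text)
        idf = math.log(n_texts / max(df, 1))  # assumption: guard for df == 0
        values[w] = tf * idf
    return values


# Illustrative data only.
participles = ["bond", "issuer", "bond", "market"]
corpus = [{"bond", "market"}, {"market"}, {"market", "issuer"}, {"market"}]
values = tf_idf({"bond", "issuer", "market"}, participles, corpus)
print(values)
```

Here "market" appears in every corpus text, so its IDF, and therefore its TF-IDF value, is zero even though it occurs in the text to be extracted.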

After the TF-IDF value of each candidate keyword has been calculated, the TF-IDF value and the initial score S of each candidate keyword are normalized so that both fall within the interval [0,1] and are therefore on the same order of magnitude. Then, according to a preset weighting coefficient α, weighted summation is performed on the normalized TF-IDF value and the normalized initial score to obtain the final score of each candidate keyword. The calculation formula of the final score is as follows:

$$C(w_i) = \alpha \cdot S(w_i)' + (1 - \alpha) \cdot \text{TF-IDF}(w_i)'$$

wherein $C(w_i)$ represents the final score of participle $w_i$; $\alpha$ represents the weighting coefficient; $S(w_i)'$ and $\text{TF-IDF}(w_i)'$ respectively denote the normalized initial score and the normalized TF-IDF value of participle $w_i$.
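The normalization and weighted summation can be sketched in Python as below. The normalization method is not specified in this disclosure beyond mapping values into [0,1]; min-max scaling is used here purely as an assumption, as is mapping all values to 0.0 when they are identical. The function names are hypothetical, and the default α of 0.32 is the value reported for Table 1.

```python
def min_max(values):
    """Scale a {key: value} mapping into [0, 1] (assumed normalization)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 0.0 for k in values}  # assumption for the degenerate case
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}


def final_scores(initial, tfidf, alpha=0.32):
    """C(w) = alpha * S(w)' + (1 - alpha) * TF-IDF(w)' for each candidate."""
    s_norm, t_norm = min_max(initial), min_max(tfidf)
    return {w: alpha * s_norm[w] + (1 - alpha) * t_norm[w] for w in initial}


# Illustrative numbers only.
initial = {"a": 2.0, "b": 1.0, "c": 0.0}
tfidf = {"a": 0.0, "b": 0.5, "c": 1.0}
final = final_scores(initial, tfidf)
print(final)
```

With α below 0.5 the TF-IDF term carries more weight than the graph score, so in this toy case candidate "c" ends up ranked above "a".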

In this embodiment, the preset weighting coefficient is obtained from experimental results; in one embodiment, the extraction effects obtained with different preset weighting coefficients are shown in Table 1.

TABLE 1

As can be seen from Table 1, when the preset weighting coefficient α is 0.32, the precision (Precision), recall (Recall) and F-value (F-Measure) of the keyword extraction are the highest, i.e., the extraction effect is the best.

S7, screening the at least one candidate keyword according to the final score to obtain at least one keyword.

Specifically, the candidate keywords are screened according to their final scores and a preset condition. The preset condition can be set in advance according to the application scenario; for example, it may be that a candidate keyword ranks among the top M when the candidate keywords are sorted by final score in descending order, in which case M keywords are obtained after screening. M is a positive integer whose value can be preset according to the requirements of the application scenario. In this way, at least one keyword is screened out as a keyword of the text to be extracted.
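A top-M screening of this kind can be sketched in a few lines of Python; the same helper would serve for the top-N screening of candidate keywords in step S5. The function name is hypothetical, and the way ties are broken (Python's stable sort keeps the input order of equal scores) is an implementation detail not specified in this disclosure.

```python
def top_keywords(scores, m):
    """Return the m keys with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:m]]


# Illustrative scores only.
keywords = top_keywords({"a": 0.32, "b": 0.5, "c": 0.68}, m=2)
print(keywords)
```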

The method comprises the steps of firstly segmenting words of a text to be extracted to obtain a segmentation set, constructing a segmentation word graph according to the segmentation set, and generating word vectors of the segmentation words according to the sememes corresponding to the segmentation words; then, calculating word meaning similarity between adjacent participles in the participle word graph according to the word vector of each participle, and further calculating to obtain an initial score of each participle; and processing the initial score according to the word frequency-reverse file frequency value of the word segmentation to obtain a final score of the word segmentation, and determining the text keywords according to the final score.

On the basis of the word graph model, semantic information is fused into the word senses of the participles to obtain word vectors that carry more semantic information, so that a participle with multiple meanings receives distinct word vectors in different contexts. The score of each participle is then calculated by combining the co-occurrence relations among the participles with their word sense information, which makes the score more accurate and improves the keyword extraction effect. On this basis, the scores are corrected according to the term frequency and the inverse document frequency, which raises the final scores of low-frequency keywords, reduces the influence of high-frequency irrelevant words on the extraction result, and further improves the keyword extraction effect.
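As a non-limiting illustration, the score correction may be sketched as follows. Both the smoothed TF-IDF variant and the linear blend controlled by the preset weighting coefficient α are assumptions made for this sketch; this passage does not fix the exact correction formula, and the names `tf_idf` and `final_score` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(word, doc_tokens, corpus):
    """TF-IDF of `word` in one segmented text.

    doc_tokens: list of participles of the text to be extracted.
    corpus: list of segmented texts (each a list of participles).
    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    df = sum(1 for doc in corpus if word in doc)  # texts containing word
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
    return tf * idf


def final_score(initial_score, tfidf_value, alpha):
    """Blend the two scores. Assumption: a linear combination weighted
    by the preset weighting coefficient alpha."""
    return alpha * tfidf_value + (1.0 - alpha) * initial_score
```

In such a blend, a word that appears in most texts of the corpus receives a small inverse document frequency, so its final score falls relative to a rarer word with a similar initial score.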

Example two

Based on the inventive concept of the first embodiment, as shown in fig. 5, an embodiment of the present invention further provides a keyword extraction apparatus, including:

the text preprocessing module 100 is configured to perform word segmentation on a text to be extracted to obtain a participle set;

the word graph building module 200 is configured to build a participle word graph corresponding to the participle set according to a preset word graph model;

a word vector generating module 300, configured to generate word vectors corresponding to the participles according to the sememes of the participles in the participle set;

the score calculation module 400 is configured to calculate word sense similarity between adjacent participles in the participle word graph according to the word vectors of the participles, and calculate initial scores of the participles in the participle word graph according to the word sense similarity;

a candidate keyword screening module 500, configured to screen the participles in the participle set according to the initial score to obtain at least one candidate keyword;

a score correction module 600, configured to determine a term frequency-inverse document frequency (TF-IDF) value of each candidate keyword, and process the TF-IDF value and the initial score to obtain a final score of each candidate keyword;

and a keyword screening module 700, configured to screen the at least one candidate keyword according to the final score to obtain at least one keyword.

Because the keyword extraction device solves the problem on a principle similar to that of the keyword extraction method in the first embodiment, the implementation of the device can refer to the implementation of the method in the first embodiment, and details already described are not repeated here.

There is also provided, according to an embodiment of the present invention, a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement any one of the keyword extraction methods according to the first embodiment.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A keyword extraction method is characterized by comprising the following steps:

2. The method for extracting keywords according to claim 1, wherein the generating word vectors corresponding to the participles according to the sememes of the participles in the participle set comprises:

generating a sense item vector of each sense item according to the sememe vectors of the sememes corresponding to the sense item;

3. The method for extracting keywords according to claim 2, wherein the generating of the sense item vector of each sense item according to the sememe vectors of the sememes corresponding to the sense item specifically comprises:

4. The method for extracting keywords according to claim 3, wherein the weighted summation of the sense item vectors of the sense items corresponding to the participles according to the attention mechanism adopts the following calculation formula:

$$e = \sum_{j} \mathrm{att}\left(s_j^{(w)}\right) \cdot s_j^{(w)}$$

wherein $e$ represents the word vector of the participle $w$, $s_j^{(w)}$ represents the sense item vector of the j-th sense item of the participle $w$, and $\mathrm{att}\left(s_j^{(w)}\right)$ represents the weight of the j-th sense item of the participle $w$;

wherein

$$\mathrm{att}\left(s_j^{(w)}\right) = \frac{\exp\left(w_c' \cdot s_j^{(w)}\right)}{\sum_{k} \exp\left(w_c' \cdot s_k^{(w)}\right)}$$

$s_j^{(w)}$ and $s_k^{(w)}$ represent the sense item vectors of the j-th and k-th sense items of the participle $w$, respectively, and $w_c'$ denotes the average of the word vectors of a predetermined number of participles before and after the participle $w$.
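As a non-limiting illustration, the attention-weighted summation of sense item vectors may be sketched as follows. The sketch assumes that the weight of each sense item is a softmax over the dot product between the averaged context vector w_c' and that sense item vector; the function name `word_vector` and the list-based vector representation are likewise assumptions.

```python
import math


def word_vector(sense_vectors, context_vectors):
    """Combine the sense item vectors of one participle into a word vector.

    sense_vectors: one vector (list of floats) per sense item.
    context_vectors: word vectors of the participles in a window
        before and after the participle.
    Returns (e, weights): the word vector and the attention weights.
    """
    dim = len(sense_vectors[0])
    n = len(context_vectors)
    # w_c': element-wise average of the context word vectors.
    w_c = [sum(v[i] for v in context_vectors) / n for i in range(dim)]
    logits = [sum(a * b for a, b in zip(w_c, s)) for s in sense_vectors]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [x / total for x in exps]
    e = [sum(wt * s[i] for wt, s in zip(weights, sense_vectors))
         for i in range(dim)]
    return e, weights
```

Under this sketch, the sense item whose vector best matches the surrounding context receives the largest weight, which is how the same participle can obtain different word vectors in different contexts.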

5. The method for extracting keywords according to claim 1, wherein the initial score of each participle in the participle word graph obtained by calculation according to the word sense similarity adopts the following calculation formula:

$$S(w_i) = (1 - d) + d \sum_{w_j \in \mathrm{In}(w_i)} \frac{\mathrm{Sim}(w_i, w_j)}{\sum_{w_k \in \mathrm{Out}(w_j)} \mathrm{Sim}(w_k, w_j)} \, S(w_j)$$

wherein $w_i$, $w_j$ and $w_k$ respectively represent the i-th, j-th and k-th participles in the participle word graph; $S(w_i)$ and $S(w_j)$ respectively represent the initial scores of participles $w_i$ and $w_j$; $\mathrm{In}(w_i)$ represents the set of participles pointing to participle $w_i$ in the participle word graph; $\mathrm{Out}(w_j)$ represents the set of participles pointed to by participle $w_j$ in the participle word graph; $d$ is a smoothing factor; $\mathrm{Sim}(w_i, w_j)$ represents the word sense similarity between participles $w_i$ and $w_j$; and $\mathrm{Sim}(w_k, w_j)$ represents the word sense similarity between participles $w_k$ and $w_j$.
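As a non-limiting illustration, the initial scores may be obtained by iterating a similarity-weighted, TextRank-style update until the scores stabilise. The update rule used in the sketch, S(w_i) = (1 - d) + d · Σ_{w_j ∈ In(w_i)} [Sim(w_i, w_j) / Σ_{w_k ∈ Out(w_j)} Sim(w_k, w_j)] · S(w_j), is an assumption consistent with the symbols defined above. The function name `initial_scores`, the default d = 0.85, the starting score of 1.0 and the stopping rule are also assumptions.

```python
def initial_scores(out_edges, sim, d=0.85, max_iter=100, tol=1e-6):
    """Iteratively score every participle in the word graph.

    out_edges: dict mapping a participle to the set of participles it
        points to, i.e. Out(w).
    sim: function (u, v) -> word sense similarity.
    """
    nodes = set(out_edges) | {v for ts in out_edges.values() for v in ts}
    if not nodes:
        return {}
    # In(w): participles that point to w.
    in_edges = {w: set() for w in nodes}
    for src, targets in out_edges.items():
        for dst in targets:
            in_edges[dst].add(src)
    # Denominator of the update: total similarity over Out(w_j).
    out_total = {w: sum(sim(k, w) for k in out_edges.get(w, ()))
                 for w in nodes}
    scores = {w: 1.0 for w in nodes}
    for _ in range(max_iter):
        updated = {}
        for wi in nodes:
            rank = sum(sim(wi, wj) / out_total[wj] * scores[wj]
                       for wj in in_edges[wi] if out_total[wj] > 0)
            updated[wi] = (1 - d) + d * rank
        delta = max(abs(updated[w] - scores[w]) for w in nodes)
        scores = updated
        if delta < tol:  # scores have stabilised
            break
    return scores
```

For an undirected co-occurrence graph, each edge is listed in both directions so that In(w) and Out(w) coincide.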

6. The method for extracting keywords according to claim 5, wherein the word sense similarity between adjacent participles in the participle word graph obtained by calculation according to the word vector of each participle is calculated by the following formula:

wherein $\mathrm{Sim}(w_i, w_j)$ represents the word sense similarity between participles $w_i$ and $w_j$, and $e_i$ and $e_j$ represent the word vectors of participles $w_i$ and $w_j$, respectively.
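As a non-limiting illustration, a similarity between two word vectors may be computed as sketched below. Cosine similarity is an assumption made for this sketch, chosen because it depends only on the two word vectors; the function name `word_sense_similarity` is hypothetical.

```python
import math


def word_sense_similarity(e_i, e_j):
    """Cosine similarity between two word vectors (lists of floats).

    Assumption: cosine similarity stands in for Sim(w_i, w_j).
    Returns 0.0 when either vector has zero length.
    """
    dot = sum(a * b for a, b in zip(e_i, e_j))
    norm = math.sqrt(sum(a * a for a in e_i)) * math.sqrt(sum(b * b for b in e_j))
    return dot / norm if norm else 0.0
```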

7. The method of claim 1, wherein the determining a word frequency-inverse document frequency value for each candidate keyword, and processing the word frequency-inverse document frequency value and the initial score to obtain a final score for each candidate keyword comprises:

8. The method for extracting keywords according to claim 1, wherein the segmenting words of the text to be extracted to obtain a segmentation set comprises:

according to the knowledge field to which the text to be extracted belongs, performing word segmentation on the text to be extracted by using a dictionary of the corresponding field to obtain the segmentation set.

9. A keyword extraction device is characterized by comprising:

the text preprocessing module is used for segmenting words of the text to be extracted to obtain a word segmentation set;

the score calculation module is used for calculating word sense similarity between adjacent participles in the participle word graph according to the word vector of each participle and calculating initial scores of each participle in the participle word graph according to the word sense similarity;

10. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the keyword extraction method according to any one of claims 1 to 8.