CN108920456B - Automatic keyword extraction method


Info

Publication number: CN108920456B (application CN201810611476.7A)
Authority: CN (China)
Other versions: CN108920456A (in Chinese, zh)
Inventors: 吕学强, 董志安
Assignee (original and current): Beijing Information Science and Technology University
Legal status: Active (granted)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/216 Parsing using statistical methods


Abstract

The invention relates to an automatic keyword extraction method comprising the following steps: extracting common words in the technical standard; extracting candidate keywords; filtering the common words from the candidate keywords; calculating a candidate keyword weight score by integrating position features, word co-occurrence features and context semantic features; calculating a dynamic threshold from the range of candidate keyword weight scores; and determining result keywords using the dynamic threshold. The method extracts keywords by fusing the position feature, the word co-occurrence feature and the context semantic feature, and comprehensively considers the weight influence of in-document position and context semantics on keywords. It thereby achieves higher accuracy and recall, improves the retrieval quality of the 3GPP technical standards, reduces labor cost, and well meets the requirements of practical application.

Description

Automatic keyword extraction method
Technical Field
The invention belongs to the technical field of automatic extraction of keywords, and particularly relates to an automatic extraction method of keywords facing to a 3GPP technical standard.
Background
The explosive development of mobile communication technology has brought epoch-making changes to human society. As the maker of leading technology standards in the communications field, the 3rd Generation Partnership Project (3GPP) is dedicated to formulating 3G standards based on the evolved Global System for Mobile Communications (GSM) core network, including WCDMA, TD-SCDMA, EDGE, etc.
In recent years there have been many patent infringement lawsuits among large communication technology companies, and the stability of invention patent rights has faced unprecedented challenges. The 3GPP technical standards play an irreplaceable and important role in the examination of communication patents.
The 3GPP technical standard is a scientific non-patent document specific to the patent examination work in the communication field, and is usually used as a comparison document to measure the creativity and novelty of the patent application in the communication field.
A typical 3GPP technical standard comprises: a Cover that mainly includes the standard number, release number, document title and version number information; a Foreword that explains the version number; a Scope that declares the scope of application; a Reference part that gives the reference list; Definitions and Abbreviations parts that list the important definitions and abbreviations of the document; a main body that specifically introduces the technical background and details; and an Annex that mainly records the version change history.
In addition, there is a mutual citation association between the 3GPP technical standard and the patent literature, and the differences from the patent literature are shown in table 1.
Table 1 Differences between patent documents and the 3GPP technical standard
(Table 1 appears as an image in the original document.)
As can be seen from Table 1, the 3GPP technical standard has its own unique organization and types. Of major interest in actual patent examination are Technical Specifications (TS), Technical Reports (TR) and conference files. Technical specifications and technical reports together describe the relevant regulations, principles, simulation and experimental results of a technology, while conference files mainly record the specific meeting information of each working group. Among these, technical specifications and technical reports are similar in content format, carry richer core technical information, and hold greater mining value.
In actual patent examination, the retrieval of 3GPP technical standards is mainly based on keywords manually selected by examiners. The quality of the retrieval results often depends on the quality of the keywords, and the traditional manner is time-consuming and labor-intensive and makes it difficult to guarantee the hit rate of the comparison documents. Compared with patent documents, the 3GPP technical standards have wide coverage, a large amount of information, irregular formats and weak readability, characteristics that directly make automatic keyword extraction from the 3GPP technical standards more difficult than from patent documents. Therefore, improving the effect of automatic keyword extraction for the 3GPP technical standards not only helps improve the examination efficiency of communication patents, but is also of great significance for maintaining the stability of patent authorization.
There is a great deal of domestic and foreign research on automatic keyword extraction, which generally falls into two major branches: supervised learning methods and unsupervised learning methods. Supervised learning methods generally convert the keyword extraction problem into a binary or multi-class classification problem in machine learning, mainly involving classification models such as Naive Bayes, Maximum Entropy and Support Vector Machines. Although such methods achieve good prediction results to a certain extent, the extraction effect often depends on the labeling quality and scale of the training corpus, excessive human input cannot be avoided, and they are difficult to adapt to the massive-data scenarios of practical applications. The most obvious advantage of unsupervised learning methods over supervised ones is that labor cost is greatly saved; by algorithmic idea they can be divided into statistics-based, topic-model-based and word-graph-model-based extraction methods. Statistics-based extraction methods generally measure the weight of candidate keywords by combining statistical indexes such as term frequency, term frequency-inverse document frequency (TF-IDF) and the χ² value; such methods are sensitive to frequency and easily miss some important low-frequency words.
The most classical representative of topic-model-based extraction is the LDA (Latent Dirichlet Allocation) algorithm: by analyzing the training corpus, the LDA model infers the "document-topic" and "topic-term" probability distributions from the known "document-term" matrix, and its extraction effect depends on the topic distribution characteristics of the training set. Among word-graph-based extraction methods, the most widely applied is the TextRank algorithm, whose idea derives from Google's PageRank: sentences or words in the text set form graph nodes, the similarity between nodes (sentences or words) serves as the edge weight, and an iterative voting mechanism ranks the importance of the nodes in the graph model. This method does not depend on the number of texts, but its limitation is that it considers only information internal to a text and ignores the distribution characteristics of vocabulary across different texts. Mainstream methods at the present stage usually extract keywords by fusing the advantages of different methods for specific problems, and still have the following defects: lack of consideration of semantic features, poor recognition of low-frequency keywords, and the like.
Disclosure of Invention
In view of the problems in the prior art, an object of the present invention is to provide an automatic keyword extraction method that avoids the above technical defects.
In order to achieve the above object, the present invention provides the following technical solutions:
an automatic keyword extraction method comprises the following steps: extracting common words in technical standards, extracting candidate keywords, filtering the common words aiming at the candidate keywords, calculating a candidate keyword weight score by integrating position characteristics, word co-occurrence characteristics and context semantic characteristics, calculating a dynamic threshold according to the candidate keyword weight score range, and determining result keywords by using the dynamic threshold.
Further, the automatic keyword extraction method comprises the following steps:
step 1) removing text noise in the 3GPP technical standard;
step 2) extracting common words in the technical standard;
step 3) extracting candidate keywords and filtering common words based on the syntactic analysis tree;
and 4) comprehensively considering the position characteristics, word co-occurrence characteristics and context semantic characteristics of the candidate keywords in the document, calculating weight scores and sequencing, finally calculating a dynamic threshold according to the actual score range of the technical standard, and adding the candidate keywords with the scores exceeding the threshold into a result keyword set.
Further, the step 1) is specifically as follows: parsing the technical standard with Apache POI to remove text noise in the 3GPP technical standard.
Further, the text noise includes pictures, tables, formulas, special symbols, and illegal characters.
Further, step 2) comprises: extracting common words in the technical standard based on the word frequency-document distribution entropy, where the word frequency-document distribution entropy is an uncertainty measure of the distribution state of a word w over the technical standard set. Let a document set consisting of n technical standards be denoted as $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, and let the word frequency-document distribution entropy of the word w be H(w); then H(w) is calculated as

$$H(w) = -\sum_{i=1}^{n} P(w, d_i)\,\log P(w, d_i)$$

where $P(w, d_i)$ is the probability that the word w occurs in technical standard $d_i$, $1 \le i \le n$; according to the maximum likelihood estimation method, $P(w, d_i)$ is calculated as

$$P(w, d_i) = \frac{f(w, d_i)}{\sum_{i=1}^{n} f(w, d_i)}$$

where $f(w, d_i)$ is the number of occurrences of the word w in technical standard $d_i$.
Further, extracting candidate keywords based on the dependency parse tree includes:

Step 1: traverse the technical standard set D; divide each technical standard $d_i$ in D into sentences according to punctuation, the divided sentence set being represented as $sentences(d_i) = \{s_1, s_2, \ldots, s_{n_s}\}$, where $n_s$ is the number of sentences in document $d_i$;

Step 2: perform dependency syntax analysis on each sentence in $sentences(d_i)$ with the Stanford Parser to obtain the corresponding dependency parse tree set $Trees(d_i) = \{T_1, T_2, \ldots, T_{n_s}\}$, where $T_i$ denotes the dependency parse tree corresponding to the i-th sentence of technical standard $d_i$;

Step 3: cyclically read the set $Trees(d_i)$; for any dependency parse tree $T_i \in Trees(d_i)$, take each word together with its part of speech as a leaf node and traverse $T_i$ in order; if the current node is a leaf node, judge whether its part of speech is a noun, verb or adjective, and if so add the node to the candidate keyword set, otherwise jump to the next node; if the current node is not a leaf node, judge whether it is a noun phrase, and if so continue to recursively traverse its right subtree until the subtree contains no non-leaf node with a noun phrase as parent, at which point the child nodes of the noun phrase are added to the candidate keyword set as a whole;

Step 4: further filter the candidate keyword set with the extracted common words: if an element of the candidate keyword set contains a common word, remove that element from the set.
Further, the position feature weight is calculated as follows: for the text corresponding to each heading level of the 3GPP technical standard, sentence sets are divided using punctuation as boundaries, and the sentences in each set are numbered sequentially from 1. Let the candidate keyword set of technical standard $d_i$ be $CK(d_i) = \{ck_1, ck_2, \ldots, ck_i, \ldots, ck_n\}$, where $ck_i$ is any candidate keyword in the set and n is the number of candidate keywords, and let the special position set be

SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE}.

$locate(ck_i)$ denotes the position where candidate keyword $ck_i$ appears, and a characteristic function $Pos(ck_i)$ assigns the weight of $ck_i$ in the position dimension:

$$Pos(ck_i) = \begin{cases} 1, & locate(ck_i) \in SP \\[4pt] \dfrac{Snu_{ck_i} - Sno_{ck_i} + len(ck_i)}{Snu_{ck_i} + len(ck_i)}, & \text{otherwise} \end{cases}$$

where $Sno_{ck_i}$ is the number of the sentence in which $ck_i$ is located, $Snu_{ck_i}$ is the number of sentences in the text paragraph containing $ck_i$, and $len(ck_i)$ is the number of words contained in $ck_i$. The weights over all occurrences are averaged; let $W(Pos(ck_i))$ denote the average position weight, then

$$W(Pos(ck_i)) = \frac{1}{fre(ck_i)} \sum_{j=1}^{fre(ck_i)} Pos_j(ck_i)$$

where $fre(ck_i)$ is the frequency of $ck_i$ in the same technical standard and $Pos_j(ck_i)$ is the position weight of its j-th occurrence.
Further, the word co-occurrence feature weight is calculated as follows:

Let the candidate keyword set of all technical standards be $CK = \{CK(d_1), CK(d_2), \ldots, CK(d_i), \ldots, CK(d_n)\}$. For any candidate keyword $ck_i$ of technical standard $d_i$, let the constituent words of $ck_i$ be $cw_1, cw_2, \ldots, cw_i, \ldots, cw_m$, where m is the number of words contained in $ck_i$, and let the co-occurring word set of $cw_i$ be $cocur_i = \{wco_1, wco_2, \ldots, wco_j, \ldots, wco_p\}$, where p is the size of the co-occurrence set, $wco_j$ denotes a co-occurring word of $cw_i$ with $wco_j \in CK(d_i)$, and $wco_1 \cap wco_2 \cap \ldots \cap wco_j \cap \ldots \cap wco_p = \{cw_i\}$, $1 \le j \le p$. The contribution of $cw_i$ to the word co-occurrence weight of candidate keyword $ck_i$ is expressed as

$$Coo(cw_i) = \sum_{j=1}^{p} fre(wco_j) \times len(wco_j)$$

where $fre(wco_j)$ is the frequency of the co-occurring word $wco_j$ and $len(wco_j)$ is the number of words contained in $wco_j$. When the candidate keyword $ck_i$ comprises several words, the weight of $ck_i$ in the word co-occurrence dimension is calculated as

$$W(Coo(ck_i)) = \frac{1}{m} \sum_{k=1}^{m} Coo(cw_k)$$
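A sketch of the co-occurrence weighting under one reading of the description (co-occurring-word frequency and length combined as a product and averaged over the m constituent words; the exact combination in the original equations, shown as images, may differ):

```python
def cooccurrence_weight(candidate, candidate_freqs):
    """W(Coo(ck_i)): word co-occurrence weight of a candidate keyword.

    `candidate_freqs` maps every candidate keyword of one document to its
    frequency.  For each constituent word cw of `candidate`, its
    co-occurring words are the *other* candidates that contain cw; each
    contributes frequency * length (a hedged reconstruction).
    """
    words = candidate.split()
    total = 0.0
    for cw in words:
        cooc = [ck for ck in candidate_freqs
                if ck != candidate and cw in ck.split()]
        total += sum(candidate_freqs[ck] * len(ck.split()) for ck in cooc)
    return total / len(words)   # average over the m constituent words

freqs = {"MCH": 4, "MCH transmission": 2, "MCH subframe allocation": 1}
# co-occurring words of "MCH": "MCH transmission" (2*2) and
# "MCH subframe allocation" (1*3), so (4 + 3) / 1 = 7.0
print(cooccurrence_weight("MCH", freqs))   # 7.0
```

The example reuses the "MCH" candidates discussed later in the description: a word that participates in many longer candidates accumulates a high co-occurrence weight.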
Further, the context semantic feature weight is calculated as follows:

The computing task is decomposed into independently predicting, from the current word w, each word composing context(w) with maximum probability; the objective function is

$$L(\theta) = \sum_{w \in D} \sum_{c_i \in context(w)} \log P(c_i \mid w)$$

where $c_i \in context(w)$, D is the technical standard corpus, and θ is the model parameter. The conditional probability $P(c_i \mid w)$ is expressed as

$$P(c_i \mid w) = \frac{\exp(v_{c_i} \cdot v_w)}{\sum_{c'} \exp(v_{c'} \cdot v_w)}$$

where $v_{c_i}$ and $v_w$ are the vector representations of the words $c_i$ and w respectively, c' ranges over all non-repeating words in the corpus, and $v_{c'}$ is the vector representation of c'. Each technical standard $d_i$ in the technical standard set D is viewed as being composed of a series of words $w_1 \ldots w_i \ldots w_n$. Assuming mutual independence between the words, for each candidate keyword $ck_i$ in technical standard $d_i$: if $ck_i$ is of the word type, the prediction probability is calculated as

$$P(w_1 \ldots w_i \ldots w_n \mid ck_i) = \prod_{i=1}^{n} P(w_i \mid ck_i)$$

and if $ck_i$ is of the phrase type, with constituent words $cw_1, \ldots, cw_k, \ldots, cw_m$, the calculation formula is

$$P(w_1 \ldots w_i \ldots w_n \mid ck_i) = \prod_{i=1}^{n} \prod_{k=1}^{m} P(w_i \mid cw_k)$$

Taking the logarithm of both sides, the left-hand side $\log P(w_1 \ldots w_i \ldots w_n \mid ck_i)$ is used as the weight measure of candidate keyword $ck_i$ in the semantic dimension, denoted $W(Sem(ck_i))$; $\log P(w_1 \ldots w_i \ldots w_n \mid ck_i)$ is approximated as $\log P(c_1 \ldots c_i \ldots c_n \mid ck_i)$, where $c_1 \ldots c_i \ldots c_n$ is the context of $ck_i$ within the model window, abbreviated $Context(ck_i)$. Then $W(Sem(ck_i))$ is calculated as

$$W(Sem(ck_i)) = \log P(Context(ck_i) \mid ck_i) = \sum_{c_j \in Context(ck_i)} \log P(c_j \mid ck_i)$$
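The semantic scoring can be sketched with a Skip-gram-style softmax over toy embedding matrices (random matrices stand in for a trained model; in practice the vectors would come from training on the technical standard corpus, and all names here are illustrative):

```python
import numpy as np

def semantic_weight(candidate_words, context_words, vocab, emb_in, emb_out):
    """W(Sem(ck_i)): log-probability of the candidate's context under a
    Skip-gram-style model, summed over context words.

    `emb_in`/`emb_out` are the input and output embedding matrices of a
    trained model (toy random matrices here); for a phrase, the scores of
    the component words are summed, one reading of the phrase-type formula.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    score = 0.0
    for cw in candidate_words:
        v_w = emb_in[idx[cw]]
        logits = emb_out @ v_w                               # v_c' . v_w for all c'
        log_probs = logits - np.log(np.sum(np.exp(logits)))  # log softmax
        score += sum(log_probs[idx[c]] for c in context_words)
    return score

rng = np.random.default_rng(0)
vocab = ["mch", "transmission", "subframe", "allocation", "scheduling"]
emb_in = rng.normal(size=(5, 8))
emb_out = rng.normal(size=(5, 8))
w = semantic_weight(["mch"], ["transmission", "scheduling"], vocab, emb_in, emb_out)
print(w)   # a (negative) sum of log-probabilities
```

Candidates whose contexts are well predicted by their own vectors receive a less negative score, i.e. a higher semantic weight.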
Further, the step 4) comprises:

For any candidate keyword $ck_i$ of technical standard $d_i$, the position feature, word co-occurrence feature and context semantic feature are considered comprehensively, and the weight score of $ck_i$ over the three feature dimensions is calculated as

$$W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i))$$

Let the score of each candidate keyword $ck_i$ of $d_i$ be recorded as $Score(d_i) = \{W(ck_1), \ldots, W(ck_i), \ldots, W(ck_n)\}$. The scores in $Score(d_i)$ are ranked from high to low, and a dynamic threshold λ is set as the average of all scores:

$$\lambda = \frac{1}{n} \sum_{i=1}^{n} W(ck_i)$$

If a candidate keyword in $d_i$ satisfies $W(ck_i) \ge \lambda$, then $ck_i$ is added to the result keyword set.
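The final scoring and dynamic-threshold selection can be sketched as follows (the candidate scores are hypothetical; in practice each score is the sum of the three feature weights):

```python
def select_keywords(weight_scores):
    """Result keywords via the dynamic threshold λ = mean of all scores.

    `weight_scores` maps candidate keyword -> W(ck_i), already the sum of
    position, co-occurrence and semantic weights; candidates with
    W(ck_i) >= λ enter the result set, ranked high to low.
    """
    lam = sum(weight_scores.values()) / len(weight_scores)
    kept = [ck for ck, s in weight_scores.items() if s >= lam]
    return sorted(kept, key=weight_scores.get, reverse=True), lam

scores = {"MCH": 3.2, "subframe": 1.1, "MCH transmission": 2.9, "are": 0.4}
keywords, lam = select_keywords(scores)
print(round(lam, 6))   # 1.9
print(keywords)        # ['MCH', 'MCH transmission']
```

Because λ adapts to each document's score range, documents with uniformly strong candidates keep more keywords than a fixed cutoff would allow.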
The automatic keyword extraction method provided by the invention extracts keywords by fusing the position feature, the word co-occurrence feature and the context semantic feature, and comprehensively considers the weight influence of in-document position and context semantics on keywords. It thereby achieves higher accuracy and recall, improves the retrieval quality of the 3GPP technical standard, reduces labor cost, and well meets the requirements of practical application.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a dependency parse tree;
FIG. 3 is a comparison of the CBOW and Skip-gram model frameworks.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides an automatic keyword extraction method. First, common words in the 3GPP technical standard are extracted by a method based on the word frequency-document distribution entropy; then candidate keywords are extracted by an algorithm based on the dependency parse tree; after the common words are filtered from the candidate keywords, a candidate keyword weight score is calculated by integrating position features, word co-occurrence features and context semantic features; a dynamic threshold is calculated according to the candidate keyword weight score range of each technical standard; and finally the result keywords are determined using the dynamic threshold. Specifically, as shown in FIG. 1, the automatic keyword extraction method includes the following steps:
step 1) preprocessing the 3GPP technical standard, which mainly comprises parsing the technical standard with Apache POI to remove text noise such as pictures, tables, formulas, special symbols and illegal characters;
step 2) extracting common words in all technical standards based on the word frequency-document distribution entropy;
step 3) segmenting each technical standard into sentence subsets, performing dependency syntax analysis on each sentence, extracting candidate keywords based on a dependency syntax analysis tree and filtering common words;
and 4) comprehensively considering the position characteristics, word co-occurrence characteristics and context semantic characteristics of the candidate keywords in the document, calculating weight scores and sequencing, finally calculating a dynamic threshold according to the actual score range of the technical standard, and adding the candidate keywords with the scores exceeding the threshold into a result keyword set.
The 3GPP technical standards include not only simple stop words such as "if", "at", "not" and "or", but also general words that run through most technical standards, such as "Figure", "version", "general" and "seven", which are specific to the technical standards and are neither representative nor important. It has been observed that both the simple stop words and the general words unique to the technical standards occur with different frequencies in technical standards of different versions and types, are highly common, and generally cannot generalize or abstract the content of a particular technical standard. These words are collectively referred to as common words.
Clearly, coverage is not comprehensive enough if only a manually collected stop word list is used. Therefore, in order to reduce the interference of common words with the keyword extraction task as much as possible, the concept of word frequency-document distribution entropy is introduced, combining the information entropy principle, to automatically obtain the technical standard common words. Information entropy was first introduced into information theory by Shannon to measure the uncertainty of a discrete random variable; the larger the information entropy, the greater the uncertainty of the corresponding random variable. Similarly, regarding the word w as a random variable, the definition of the word frequency-document distribution entropy is given as follows.
Definition 1 term frequency-document distribution entropy refers to a measure of uncertainty in the state of distribution of a word w in a set of technical standards.
Let a document set consisting of n technical standards be denoted as $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, and let the word frequency-document distribution entropy of the word w be H(w); then H(w) is calculated as shown in formula (1):

$$H(w) = -\sum_{i=1}^{n} P(w, d_i)\,\log P(w, d_i) \qquad (1)$$

where $P(w, d_i)$ is the probability that the word w occurs in technical standard $d_i$, $1 \le i \le n$. According to the maximum likelihood estimation method, $P(w, d_i)$ can be calculated from formula (2):

$$P(w, d_i) = \frac{f(w, d_i)}{\sum_{i=1}^{n} f(w, d_i)} \qquad (2)$$

where $f(w, d_i)$ is the number of occurrences of the word w in technical standard $d_i$. It can be seen that the more technical standards contain w and the more uniformly w is distributed over the technical standard set, the larger the word frequency-document distribution entropy H(w), indicating greater uncertainty of the distribution of w over D; w is then more likely to be a general word of no importance in the technical standard set.
Statistics show that most keywords are generally content-word phrases such as nouns, verbs and adjectives, and contain neither stop words without practical meaning nor general words distributed uniformly across the technical standards. Thus, the keyword categories are defined as verbs, adjectives, nouns and noun phrases after removal of the common words. In order to extract candidate keywords with semantic consistency and complete syntactic modification, dependency syntax analysis is first performed on the sentences in the 3GPP technical standard, and then noun phrases, verbs, adjectives and nouns satisfying syntactic modification consistency are extracted from the dependency parse tree and added to the candidate keyword set. For noun phrases, the NP of minimum granularity in the parse tree is taken as the candidate keyword. Finally, the common words are filtered from the candidate keyword set. For example, the parsing result of the sentence "Logical channels are SAPs between MAC and RLC" is shown in FIG. 2.
As can be seen from FIG. 2, the adjective "logical" modifies the noun "channels", and "logical" and "channels" form a noun phrase (NP); "SAPs" and "MAC and RLC" are noun phrases (NP), and "SAPs between MAC and RLC" as a whole is also a noun phrase (NP). In the dependency parse tree, however, the noun phrase "MAC and RLC" and "between" form a prepositional phrase (PP) that, together with "SAPs", is a child node of the NP, the two being siblings. Clearly, the noun phrase "MAC and RLC" has smaller granularity than the noun phrase "SAPs between MAC and RLC". Therefore "logical", "channels", "logical channels", "are", "SAPs", "MAC", "RLC" and "MAC and RLC" are selected as candidate keywords of the example sentence, and the candidate keywords are then filtered using the extracted common words. According to the above analysis, the candidate keyword extraction algorithm based on the dependency parse tree comprises the following steps:
step 1: traversing the technical standard set D, for each technical standard D in D i Dividing into sentences according to punctuations, and representing the divided sentence sets as
Figure GSB0000175921600000101
n s As a document d i The number of Chinese sentences.
Step 2: for set sequences (d) i ) Each sentence in the tree is subjected to dependency syntax analysis by using a Stanford Parser syntax analyzer to obtain a corresponding dependency syntax analysis tree set Trees (d) i ) Memory for recording
Figure GSB0000175921600000102
Wherein T is i Indicates technical standard d i And (4) the dependency parsing tree corresponding to the ith sentence.
Step 3: cyclic read dependency parse tree set Trees (d) i ) For any dependency syntax tree T i ∈Trees(d i ) Taking the words and corresponding parts of speech in the syntactic dependency tree as a whole as leaf nodes, and traversing the T in a medium-order and orderly mode i If the current node is a leaf node (not the last leaf node), judging whether the part of speech of the node is a noun, a verb and an adjective, adding the node into the candidate keyword set if the conditions are met, and otherwise, jumping to the next node; if the current node is not a leaf node, judging whether the current node is a Noun Phrase (NP) or not, if the current node is the noun phrase and the right subtree is not empty, continuing to recursively traverse the right subtree of the current node until no non-leaf node taking the NP as a father node exists in the subtree, and adding the child nodes of the NP into the candidate keyword set as a whole.
Step 4: because some technical standard common words still exist in the candidate keywords extracted in the previous step, the extracted common words are required to be used for further filtering the candidate keyword set, and if an element containing the common words exists in the candidate keyword set, the element is removed from the candidate keyword set.
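Steps 1-4 can be sketched as follows; for self-containedness the parse tree of the example sentence analyzed in FIG. 2 is hand-built as nested `(label, children)` tuples standing in for Stanford Parser output (a simplified stand-in, not the parser's actual data structure):

```python
# A parse tree node is (label, children); a preterminal is (POS-tag, word).
def leaves(node):
    _, kids = node
    return [kids] if isinstance(kids, str) else [w for k in kids for w in leaves(k)]

def subtrees(node):
    yield node
    _, kids = node
    if not isinstance(kids, str):
        for k in kids:
            yield from subtrees(k)

def extract_candidates(tree, common_words):
    """Minimal-granularity NPs plus single nouns/verbs/adjectives,
    filtered against the extracted common words (Steps 3 and 4)."""
    out = []
    for node in subtrees(tree):
        label, _ = node
        if label == "NP" and all(l != "NP" for l, _ in list(subtrees(node))[1:]):
            out.append(" ".join(leaves(node)))   # NP with no nested NP
    for label, kids in subtrees(tree):
        if isinstance(kids, str) and label.startswith(("NN", "VB", "JJ")):
            out.append(kids)                     # noun / verb / adjective
    return [c for c in dict.fromkeys(out)
            if not set(c.lower().split()) & common_words]

# hand-built parse of "Logical channels are SAPs between MAC and RLC"
parse = ("S", [
    ("NP", [("JJ", "logical"), ("NNS", "channels")]),
    ("VP", [("VBP", "are"),
            ("NP", [("NP", [("NNS", "SAPs")]),
                    ("PP", [("IN", "between"),
                            ("NP", [("NNP", "MAC"), ("CC", "and"),
                                    ("NNP", "RLC")])])])]),
])
print(extract_candidates(parse, common_words={"are"}))
# ['logical channels', 'SAPs', 'MAC and RLC', 'logical', 'channels', 'MAC', 'RLC']
```

Note how "SAPs between MAC and RLC" is skipped in favor of its minimal-granularity inner NPs "SAPs" and "MAC and RLC", and how the common word "are" is filtered in Step 4.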
By analyzing the characteristics of the 3GPP technical standard, it can be found that, apart from the body text, the Scope, Reference, Definitions and Abbreviations portions have important reference value for the whole document and should be regarded as key positions. The content of each chapter of the body generally unfolds around its nearest heading, so headings can be regarded as a condensation of the core content of the corresponding paragraphs, and candidate keywords appearing at these positions should be given higher weight. Similarly, what appears in NOTEs generally serves as additional emphasis or supplementary description of the text, and should therefore also be treated as a special position.
Therefore, the position of the candidate keyword appearing in the 3GPP technical standard is taken as a weight influence factor. The method comprises the steps of respectively dividing sentence subsets by using punctuations as boundaries aiming at text parts corresponding to titles of different levels of the 3GPP technical standard, numbering the sentences in the sentence subsets from 1 in sequence, and if the number of the sentence in which a candidate keyword is positioned is smaller, the keyword is closer to the title, which indicates that the keyword is more likely to be the key word. Recording technical standard d i Middle candidate keyword set CK (d) i )={ck 1 ,ck 2 ...ck i ...ck n In which ck i For any candidate keyword in the set, n is the number of the candidate keywords, and the special position set is recorded as
SP={Title,Scope,Reference,Definitions,Abbrevations,NOTE},
locate(ck i ) Representing candidate keywords ck i Position of occurrence, defining a characteristic function Pos (ck) i ) Representing candidate keywords ck i The weight in the dimension of the occurrence position is assigned, then Pos (ck) i ) Can be expressed as shown in equation (3).
[Equation (3) appears only as an image in the original (Figure GSB0000175921600000111); per the surrounding description, it assigns weight 1 when locate(ck_i) ∈ SP and otherwise a weight that decreases with the sentence number Sno_{ck_i}, with len(ck_i) added to the denominator to keep the weight positive.]
wherein Sno_{ck_i} represents the number of the sentence in which candidate keyword ck_i lies, Snu_{ck_i} represents the number of sentences in that text paragraph, and len(ck_i) represents the number of words ck_i contains. len(ck_i) is added to the denominator to avoid a position weight of 0. Since candidate keyword ck_i may occur multiple times at different positions in technical standard d_i, the weights at the different positions are averaged; denoting the average position weight by W(Pos(ck_i)), its calculation is shown in equation (4).
W(Pos(ck_i)) = (1/fre(ck_i)) · Σ_{k=1}^{fre(ck_i)} Pos_k(ck_i)    (4)
wherein fre(ck_i) represents the frequency of occurrence of candidate keyword ck_i in the same technical standard. Taking the average enhances the weight of a candidate keyword ck_i that occurs with low frequency but in a special position, and weakens the bias that would arise from weighting candidate keywords by the frequency feature alone.
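Since equations (3) and (4) are rendered only as images here, the position-weight computation can be sketched in Python under stated assumptions: the piecewise form of `pos_weight`, the decay rule for body sentences, and all names below are illustrative guesses consistent with the description, not the patent's exact formula.

```python
# Hypothetical sketch of the position weight (equations (3)-(4) are images
# in the original, so the exact decay formula here is an assumption):
# occurrences in a special position get full weight 1.0; occurrences in the
# body decay with the sentence number, with len(ck) added to the denominator
# so the weight never reaches 0.
SPECIAL_POSITIONS = {"Title", "Scope", "Reference", "Definitions",
                     "Abbreviations", "NOTE"}

def pos_weight(location, sno, snu, length):
    """Weight of one occurrence: location name, sentence number sno (1-based),
    sentence count snu of the paragraph, and word count of the candidate."""
    if location in SPECIAL_POSITIONS:
        return 1.0
    return (snu - sno + 1) / (snu + length)

def avg_pos_weight(occurrences, length):
    """Average the per-occurrence weights over all fre(ck) occurrences."""
    weights = [pos_weight(loc, sno, snu, length)
               for loc, sno, snu in occurrences]
    return sum(weights) / len(weights)

# A two-word candidate appearing once in Scope and once as sentence 3 of a
# 5-sentence body paragraph:
w = avg_pos_weight([("Scope", 0, 0), ("body", 3, 5)], length=2)
```

Averaging keeps a candidate that appears once in a special position competitive even if it is rare in the body, which matches the motivation given above.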
Word co-occurrence is a factor that cannot be neglected in keyword extraction. Observing the extracted candidate keywords of the 3GPP technical standards, one finds that a constituent word of one candidate repeatedly appears in other candidates of different lengths. For example, among the three candidate keywords "MCH", "MCH transmission" and "MCH subframe allocation", the word "MCH" occurs in the two other candidates of different lengths, so "MCH transmission" and "MCH subframe allocation" can be regarded as co-occurring words of "MCH"; such co-occurring words often express more specific information than the single constituent word. Therefore, if a word constituting a candidate keyword has many co-occurring words, the word is considered to have a richer meaning and should be given a higher weight. Based on this analysis, the frequency and word length of the co-occurring words of a candidate keyword's constituent words are used as the word co-occurrence feature to calculate the candidate keyword's weight.
Record the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}. For any candidate keyword ck_i of technical standard d_i, record its constituent words as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains, and record the co-occurring word set of cw_i as cocur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, where p is the size of the co-occurring word set (i.e., the number of co-occurring words it holds), wco_j represents a co-occurring word of cw_i with wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p. Then the contribution of cw_i to candidate keyword ck_i can be expressed by equation (5);
con(cw_i) = Σ_{j=1}^{p} fre(wco_j) × len(wco_j)    (5)
wherein fre(wco_j) represents the frequency of occurrence of wco_j, the co-occurring word of cw_i, and len(wco_j) represents the number of words wco_j contains. When candidate keyword ck_i comprises multiple words, the weight of ck_i in the word co-occurrence dimension is calculated as shown in equation (6).
W(Coo(ck_i)) = (1/m) · Σ_{i=1}^{m} con(cw_i)    (6)
It can be seen that the more frequently the constituent words of candidate keyword ck_i occur in co-occurring words, the greater each constituent word's contribution to ck_i, and hence the greater the weight of ck_i in the word co-occurrence dimension.
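A minimal sketch of the word co-occurrence weight follows. Because equations (5) and (6) appear only as images, the combination rule assumed here — `con(cw)` sums fre(wco)·len(wco) over the co-occurring candidates, and the candidate weight averages its constituent-word contributions — is an illustrative reading of the description, and all names are hypothetical.

```python
def contribution(word, candidate_set, frequency):
    """con(cw): assumed sum of fre(wco) * len(wco) over the candidates wco
    that contain cw (a sketch of equation (5), which is an image in the
    original)."""
    total = 0
    for wco in candidate_set:
        words = wco.split()
        if word in words and wco != word:  # wco is a co-occurring word of cw
            total += frequency[wco] * len(words)
    return total

def coo_weight(candidate, candidate_set, frequency):
    """Assumed W(Coo(ck)): average the contributions of the m constituent
    words of the candidate (a sketch of equation (6))."""
    parts = candidate.split()
    return sum(contribution(cw, candidate_set, frequency)
               for cw in parts) / len(parts)

# The MCH example from the text: "MCH" has two co-occurring candidates.
candidates = {"MCH", "MCH transmission", "MCH subframe allocation"}
freq = {"MCH": 5, "MCH transmission": 2, "MCH subframe allocation": 1}
w = coo_weight("MCH", candidates, freq)
```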
Keywords generally condense the core content of a technical standard and share the property of concentrating its gist at different semantic levels. Therefore, the influence of the candidate keywords' contextual semantic features on the weights cannot be ignored. Considering that word vectors represent semantic characteristics well, Word2vec is introduced to calculate the weight of the candidate keywords in the semantic-feature dimension.
Word2vec is a tool, proposed by *** based on the deep-learning idea, that addresses problems such as the lack of generalization power and the curse of dimensionality in statistical language models. Word2vec comprises two training models, CBOW and Skip_gram; to reduce the complexity of model solving, two training optimization methods, Hierarchical Softmax (HS) and Negative Sampling (NS), are provided, and training frameworks are formed by combining the models with the optimization methods. As shown in fig. 3, the frameworks formed by the two models share the structure of an Input Layer, a Projection Layer and an Output Layer; the difference is that the CBOW-based framework predicts the current word w from the semantic context in which words appear, while the Skip_gram-based framework predicts the contextual semantic information from the current word w.
To predict Context(w) (with window size c) from the current word w, the Skip_gram model decomposes the task into independently maximizing the probability of predicting each word of Context(w) from w, with the objective function
L(θ) = Σ_{w∈D} Σ_{c_i∈Context(w)} log P(c_i | w)
wherein c_i ∈ Context(w), D is the technical standard corpus, θ is the model parameter, and the conditional probability P(c_i | w) is expressed by Softmax normalization, as shown in equation (7);
P(c_i | w) = exp(v_{c_i} · v_w) / Σ_{c′} exp(v_{c′} · v_w)    (7)
wherein v_{c_i} and v_w are the vector representations of the words c_i and w respectively; c′ ranges over all non-repeating words in the corpus, whose number is large, so Hierarchical Softmax or Negative Sampling can be adopted for optimization, and v_{c′} is the vector representation of c′. Each technical standard d_i in the technical standard set D is viewed as composed of a series of words w_1 … w_i … w_n. Assuming mutual independence between words, for each candidate keyword ck_i of technical standard d_i, the prediction probability is calculated with equation (8) if ck_i is of word type, and with equation (9) if it is of phrase type;
P(w_1 … w_i … w_n | ck_i) = Π_{j=1}^{n} P(w_j | ck_i)    (8)
[Equation (9) appears only as an image in the original (Figure GSB0000175921600000144); it gives the corresponding prediction probability for phrase-type candidate keywords.]
wherein P (w) j |ck i ) By using the calculation of equation (7) for the variable substitution, it can be seen that when the probability P (w) is predicted 1 …w i …w n |ck i ) The larger the candidate keyword ck is, the larger the candidate keyword ck is i The better the effect of predicting context information, the more likely it is a keyword that characterizes full-text information. In order to avoid as far as possible the occurrence of extremely small errors due to the excessively small conditional probability in the continuous multiplication calculation, the logP (w) on the left side is obtained by taking the logarithm of both sides of the above equation 1 …w i …w n |ck i ) As a measure of candidate keywords ck i The weight measure in the semantic dimension is denoted as W (Sem (ck) i ) And simultaneously considering that the relation is established for similar words when Word2vec training corpus is considered, logP (w) is used for simplifying calculation 1 …w i …w n |ck i ) Approximately as logP (c) 1 …c i …c n |ck i ) Wherein w is 1 …w i …w n As candidate keywords ck i Context within the scope of the model window, abbreviated as Context (ck) i ) Then W (seq (ck) i ) The calculation method is shown in formula (10);
W(Sem(ck_i)) = log P(Context(ck_i) | ck_i) = Σ_{c_j∈Context(ck_i)} log P(c_j | ck_i)    (10)
to technical standard d i Any one of the candidate keywords ck i Comprehensively considering the position feature, the word co-occurrence feature and the context semantic feature, and calculating the candidate keywords ck by adopting a formula (11) i Weight scores in three feature dimensions.
W(ck i )=W(Pos(ck i ))+W(Coo(ck i ))+W(Sem(ck i )) (11)。
Fusing the three different features avoids the impact that any single feature's inadequacy would have on keyword extraction. Record the scores corresponding to the candidate keywords ck_i of d_i as Score(d_i) = {W(ck_1), …, W(ck_i), …, W(ck_n)}, rank the scores in Score(d_i) from high to low, and set a dynamic threshold λ equal to the average of all scores, calculated as shown in equation (12);
λ = (1/n) · Σ_{i=1}^{n} W(ck_i)    (12)
if d is i The middle candidate keyword satisfies W (ck) i ) When the k is more than or equal to lambda, ck is i And adding the result into the result keyword set. The reason why the fixed threshold is not selected is that different technical standards have differences in length, and the candidate keyword score ranges calculated by the different technical standards are different, so that a dynamic threshold is set for the actual score range of a single technical standard.
Experiments were carried out with this method. The experimental data were selected from the technical standards (including technical specifications and technical reports) on the 3GPP website as of 2016; after denoising, 8000 items of experimental data were obtained in total. The valid series numbers of the technical standards range over 01-12, 21-38, 41-46, 48-52 and 55, 42 series in total; each series comprises multiple versions, 14 GB in size altogether, and each technical standard consists of Cover, Foreword, Scope, Reference, Definitions and Abbreviations, topic body and Annex parts.
In the experiments, three evaluation indexes commonly used in natural language processing tasks, precision (P), recall (R) and F-value (F-Score), are adopted to evaluate the keyword extraction effect; their calculation methods are shown in equations (13) to (15) respectively.
P = number of correctly extracted keywords / total number of extracted keywords    (13)
R = number of correctly extracted keywords / number of reference keywords    (14)
F = 2 × P × R / (P + R)    (15)
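Equations (13) to (15) can be sketched as follows; the toy keyword sets are illustrative.

```python
def prf(extracted, reference):
    """Precision, recall and F-score of an extracted keyword set against a
    manually labelled reference set (equations (13)-(15))."""
    correct = len(set(extracted) & set(reference))
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# 2 of 3 extracted keywords appear in a 4-keyword reference set:
p, r, f = prf(["MCH", "subframe", "sync"],
              ["MCH", "subframe", "allocation", "QoS"])
```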
Technical standard common words were extracted from the preprocessed technical standards with the method based on word frequency-document distribution entropy. Multiple experiments gave an optimal word frequency-document distribution entropy threshold of 5.42; words above the threshold were selected as technical standard common words, 13566 common words in total. Part of the common-word extraction results are shown in table 2.
Table 2 Partial common word extraction results

Serial number | Common word   | H(W)    | Serial number | Common word | H(W)
1             | version       | 10.9665 | 11            | all         | 9.9539
2             | should        | 10.8165 | 12            | possible    | 9.8908
3             | latest        | 10.7022 | 13            | foreword    | 9.8543
4             | approve       | 10.6394 | 14            | through     | 9.8097
5             | specification | 10.5639 | 15            | modify      | 9.7739
6             | update        | 10.4934 | 16            | restriction | 9.6978
7             | present       | 10.2963 | 17            | this        | 9.6536
8             | within        | 10.1056 | 18            | available   | 9.6281
9             | be            | 10.0572 | 19            | release     | 9.5941
10            | further       | 10.0188 | 20            | when        | 9.5148
As can be seen from table 2, the algorithm based on word frequency-document distribution entropy extracts not only common stop words such as "all", "this" and "when", but also words common in the technical standards, for example "version", "specification" and "release". With this method, most technical standard common words can be obtained effectively.
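The word frequency-document distribution entropy behind table 2 (defined in claim 4) can be sketched as follows. The patent does not state the logarithm base; base 2 is assumed here, which is consistent with H(W) magnitudes around 10 over 8000 documents.

```python
import math

def distribution_entropy(word_counts):
    """H(w) = -Σ P(w, d_i) log2 P(w, d_i), where P(w, d_i) is the share of
    w's total frequency falling in document d_i (claim 4; log base 2 is an
    assumption). Words spread evenly over many documents get high entropy
    and are treated as common words."""
    total = sum(word_counts)
    probs = [c / total for c in word_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# A word appearing once in each of 8 documents is maximally spread out,
# while one concentrated in a single document has low entropy:
h_even = distribution_entropy([1] * 8)      # log2(8) = 3.0
h_peaky = distribution_entropy([20, 1, 1])  # concentrated -> low entropy
```

Thresholding H(w) (5.42 in the experiments) then separates evenly-spread common words from topical words.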
After the candidate keyword set of each technical standard is filtered with the common-word list, the weights corresponding to the position feature, word co-occurrence feature and contextual semantic feature are calculated respectively. For the contextual semantic features, the experiments trained on the 14 GB of technical standards using the Skip-gram model in Word2vec with the Hierarchical Softmax optimization method, with the context window set to 10 and the vector dimension set to 200; a 965.1 MB model file was obtained after 10 iterations. To analyze the effect of different features on technical standard keyword extraction, the comparison feature combinations set in the experiments are shown in table 3.
TABLE 3 combination of features
[Table 3 appears only as an image in the original (Figure GSB0000175921600000171); it enumerates the feature combinations Feature1-Feature7 built from the position, word co-occurrence and context semantic features.]
Combining equations (3) to (11), the candidate keyword scores of each technical standard under the different feature combinations were calculated; the dynamic threshold was then calculated with equation (12), and the candidate keywords meeting the condition were screened out as the identified keywords. At the same time, 1000 technical standards covering different series and versions were randomly drawn from the 8000, and 2, 4, 6, 8 and 10 keywords were screened from each, by taking the intersection of three annotators' cross-labeling, to form the reference keyword set. The identified keywords and the manually labeled reference keywords were each lemmatized and then compared; an identified keyword was counted as correct if it had the same form as a labeled keyword or the two were abbreviation and full name of each other. The precision, recall and F-value of the different feature combinations at the different keyword counts were tallied, and the experimental results are shown in table 4.
TABLE 4 extraction results of key words under different feature combinations
[Table 4 appears only as images in the original (Figures GSB0000175921600000172 and GSB0000175921600000181); it lists the precision, recall and F-value of each feature combination when the number of keywords is 2, 4, 6, 8 and 10.]
As can be seen from table 4, when the number of keywords is 2, the recall rates of Feature1, Feature4, Feature5 and Feature7 are higher than those of the other feature combinations. This is because, when the number of keywords is small, candidate keywords appearing in special positions are more likely to be correctly recognized as keywords; at the same time, words in special positions provide less contextual semantic information, so the position feature is relatively dominant. As the number of keywords grows from 2, comparing Feature1 and Feature3 shows that the recall corresponding to Feature1 rises slowly and then gradually declines; Feature2 clearly increases precision and recall when the number of keywords is 4-8, after which precision falls somewhat; and when the number of keywords exceeds 6, Feature3 increases recall. This shows that as the number of keywords increases, the influence of position on keyword weight gradually decreases, while the influence of the word co-occurrence and contextual semantic features gradually increases. Meanwhile, comparing Feature5 with Feature7 shows that precision and recall both increase after the word co-occurrence feature is added. This is because word co-occurrence helps identify more phrase-type keywords, which are likely to correspond to abbreviated keywords that carry a certain general meaning but hold no positional advantage; as the number of keywords increases, the keywords identified through the word co-occurrence feature are more likely to fall within the reference keyword set. Comparing Feature4 with Feature7 shows that recall increases markedly from 4 keywords onward after the contextual semantic feature is added. The reason is that, as the number of keywords increases, candidate keywords characterized by rich contextual semantic information are more likely to be selected. At the same keyword count, comparing Feature1, Feature2, Feature3 and Feature7 shows that Feature7, by combining the different features, achieves a better recognition effect than any single feature.
The automatic keyword extraction method provided by the invention fuses the position feature, word co-occurrence feature and contextual semantic feature to extract keywords, comprehensively considering the influence of in-document position and contextual semantics on keyword weight. It thereby achieves higher precision and recall, improves the retrieval quality of the 3GPP technical standards, reduces labor cost, and can well meet the requirements of practical application.
The above-described embodiments only express implementations of the present invention, and while their description is specific and detailed, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (6)

1. An automatic keyword extraction method, characterized by comprising: extracting common words; extracting candidate keywords; filtering the common words from the candidate keywords; calculating a candidate keyword weight score by integrating a position feature, a word co-occurrence feature and a contextual semantic feature; calculating a dynamic threshold according to the range of candidate keyword weight scores; and determining the result keywords using the dynamic threshold;
the method for calculating the position feature weight comprises: for the text corresponding to each title level of the 3GPP technical standard, dividing sentence subsets using punctuation as boundaries and numbering the sentences in each subset sequentially from 1; recording the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_i, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords; and recording the special position set as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE},
locate(ck_i) represents the position at which candidate keyword ck_i appears; a characteristic function Pos(ck_i) is defined to represent the weight assigned to candidate keyword ck_i in the position dimension:
[Equation shown only as an image in the original (Figure FSB0000199005510000011); it assigns weight 1 when locate(ck_i) ∈ SP and otherwise a weight that decreases with the sentence number, with len(ck_i) added to the denominator.]
wherein Sno_{ck_i} represents the number of the sentence in which candidate keyword ck_i lies, Snu_{ck_i} represents the number of sentences in that text paragraph, and len(ck_i) represents the number of words ck_i contains; the weights at the different positions are averaged and denoted W(Pos(ck_i)), the average position weight; then
W(Pos(ck_i)) = (1/fre(ck_i)) · Σ_{k=1}^{fre(ck_i)} Pos_k(ck_i)
wherein fre(ck_i) represents the frequency of occurrence of candidate keyword ck_i in the same technical standard;
the word co-occurrence feature weight calculation method comprises the following steps:
recording the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}; for any candidate keyword ck_i of technical standard d_i, recording its constituent words as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains, and recording the co-occurring word set of cw_i as cocur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, where p is the size of the co-occurring word set, wco_j represents a co-occurring word of cw_i with wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p; then the contribution of cw_i to candidate keyword ck_i is expressed as
con(cw_i) = Σ_{j=1}^{p} fre(wco_j) × len(wco_j)
wherein fre(wco_j) represents the frequency of occurrence of wco_j, the co-occurring word of cw_i, and len(wco_j) represents the number of words wco_j contains; when candidate keyword ck_i comprises multiple words, the weight of ck_i in the word co-occurrence dimension is calculated as
W(Coo(ck_i)) = (1/m) · Σ_{i=1}^{m} con(cw_i)
The method for calculating the context semantic feature weight comprises the following steps:
the calculation task is decomposed into the probability maximum value of each word forming the context (w) which is respectively and independently predicted by the current word w, and the objective function is
L(θ) = Σ_{w∈D} Σ_{c_i∈Context(w)} log P(c_i | w)
wherein c_i ∈ Context(w), D is the technical standard corpus, θ is the model parameter, and the conditional probability P(c_i | w) is expressed as
P(c_i | w) = exp(v_{c_i} · v_w) / Σ_{c′} exp(v_{c′} · v_w)
wherein v_{c_i} and v_w are the vector representations of the words c_i and w respectively, c′ ranges over all non-repeating words in the corpus, and v_{c′} is the vector representation of c′; each technical standard d_i in the technical standard set D is viewed as composed of a series of words w_1 … w_i … w_n; assuming mutual independence between words, for each candidate keyword ck_i of technical standard d_i of word type, the prediction probability is calculated as
P(w_1 … w_i … w_n | ck_i) = Π_{j=1}^{n} P(w_j | ck_i)
for any candidate keyword ck_i of technical standard d_i, the position feature, word co-occurrence feature and contextual semantic feature are considered together, and the weight score of ck_i over the three feature dimensions is calculated as
W(ck i )=W(Pos(ck i ))+W(Coo(ck i ))+W(Sem(ck i ));
recording the scores corresponding to the candidate keywords ck_i of d_i as Score(d_i) = {W(ck_1), …, W(ck_i), …, W(ck_n)}, ranking the scores in Score(d_i) from high to low, and setting a dynamic threshold λ as the average of all the scores, calculated as
λ = (1/n) · Σ_{i=1}^{n} W(ck_i)
if a candidate keyword in d_i satisfies W(ck_i) ≥ λ, then ck_i is added to the result keyword set.
2. The method for automatically extracting keywords according to claim 1, wherein the method for automatically extracting keywords comprises:
step 1) removing text noise in the 3GPP technical standard;
step 2) extracting common words in the technical standard;
step 3) extracting candidate keywords and filtering common words based on the syntactic analysis tree;
and 4) comprehensively considering the position characteristics, word co-occurrence characteristics and context semantic characteristics of the candidate keywords in the document, calculating weight scores and sequencing, finally calculating a dynamic threshold according to the actual score range of the technical standard, and adding the candidate keywords with the scores exceeding the threshold into a result keyword set.
3. The method for automatically extracting keywords according to claim 2, wherein the step 1) is specifically: parsing the technical standards with Apache POI to remove the text noise in the 3GPP technical standards.
4. The method for automatically extracting keywords according to claim 2, wherein the step 2) comprises: extracting common words in the technical standards based on word frequency-document distribution entropy, where the word frequency-document distribution entropy is a measure of the uncertainty of the distribution state of a word w over the technical standard set; let the document set composed of n technical standards be D = {d_1, d_2, …, d_i, …, d_n}, and record the word frequency-document distribution entropy of word w as H(w); then H(w) is calculated as
H(w) = −Σ_{i=1}^{n} P(w, d_i) · log_2 P(w, d_i)
wherein P(w, d_i) is the probability that word w appears in technical standard d_i, 1 ≤ i ≤ n; according to the maximum likelihood estimation method, P(w, d_i) is calculated as
P(w, d_i) = f(w, d_i) / Σ_{i=1}^{n} f(w, d_i)
wherein f(w, d_i) is the number of occurrences of word w in technical standard d_i.
5. The method of any of claims 1-4, wherein extracting candidate keywords based on a dependency parsing tree comprises:
step 1: traversing the technical standard set D; dividing each technical standard d_i in D into sentences according to punctuation, and representing the divided sentence set as Sentences(d_i) = {s_1, s_2, …, s_{n_s}}, where n_s is the number of sentences in document d_i;
step 2: performing dependency syntax analysis on each sentence in the set Sentences(d_i) with the Stanford Parser to obtain the corresponding dependency parse tree set Trees(d_i), recorded as Trees(d_i) = {T_1, T_2, …, T_{n_s}},
wherein T_i represents the dependency parse tree corresponding to the i-th sentence of technical standard d_i;
step 3: cyclically reading the dependency parse tree set Trees(d_i); for any dependency parse tree T_i ∈ Trees(d_i), taking each word together with its part of speech in the tree as a whole as a leaf node, and traversing T_i in-order; if the current node is a leaf node, judging whether its part of speech is noun, verb or adjective, and if so, adding the node to the candidate keyword set, otherwise jumping to the next node; if the current node is not a leaf node, judging whether it is a noun phrase, and if so, continuing to traverse its right subtree recursively until the subtree has no non-leaf node with a noun phrase as parent, at which point the child nodes of the noun phrase are added as a whole to the candidate keyword set;
step 4: further filtering the candidate keyword set with the extracted common words: if an element of the candidate keyword set contains a common word, removing that element from the candidate keyword set.
6. The method of claim 2 or 3, wherein the text noise includes pictures, tables, formulas, special symbols, and illegal characters.
CN201810611476.7A 2018-06-13 2018-06-13 Automatic keyword extraction method Active CN108920456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810611476.7A CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810611476.7A CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method

Publications (2)

Publication Number Publication Date
CN108920456A CN108920456A (en) 2018-11-30
CN108920456B true CN108920456B (en) 2022-08-30

Family

ID=64419617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810611476.7A Active CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method

Country Status (1)

Country Link
CN (1) CN108920456B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614626A (en) * 2018-12-21 2019-04-12 北京信息科技大学 Automatic keyword extraction method based on gravitational model
CN111435405A (en) * 2019-01-15 2020-07-21 北京行数通科技有限公司 Method and device for automatically labeling key sentences of article
CN109960724B (en) * 2019-03-13 2021-06-04 北京工业大学 Text summarization method based on TF-IDF
CN110134767B (en) * 2019-05-10 2021-07-23 云知声(上海)智能科技有限公司 Screening method of vocabulary
CN110147425B (en) * 2019-05-22 2021-04-06 华泰期货有限公司 Keyword extraction method and device, computer equipment and storage medium
CN110377724A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of corpus keyword Automatic algorithm based on data mining
CN111552786B (en) * 2020-04-16 2021-07-09 重庆大学 Question-answering working method based on keyword extraction
CN111597793B (en) * 2020-04-20 2023-06-16 中山大学 Paper innovation measuring method based on SAO-ADV structure
CN111680509A (en) * 2020-06-10 2020-09-18 四川九洲电器集团有限责任公司 Method and device for automatically extracting text keywords based on co-occurrence language network
CN111985217B (en) * 2020-09-09 2022-08-02 吉林大学 Keyword extraction method, computing device and readable storage medium
CN114626361A (en) * 2020-12-10 2022-06-14 广州视源电子科技股份有限公司 Sentence making method, sentence making model training method and device and computer equipment
CN112988951A (en) * 2021-03-16 2021-06-18 福州数据技术研究院有限公司 Scientific research project review expert accurate recommendation method and storage device
CN113191145B (en) * 2021-05-21 2023-08-11 百度在线网络技术(北京)有限公司 Keyword processing method and device, electronic equipment and medium
CN113657113A (en) * 2021-08-24 2021-11-16 北京字跳网络技术有限公司 Text processing method and device and electronic equipment
CN113743090B (en) * 2021-09-08 2024-04-12 度小满科技(北京)有限公司 Keyword extraction method and device
CN113971216B (en) * 2021-10-22 2023-02-03 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and memory
CN114492433A (en) * 2022-01-27 2022-05-13 南京烽火星空通信发展有限公司 Method for automatically selecting proper keyword combination to extract text

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110004465A1 (en) * 2009-07-02 2011-01-06 Battelle Memorial Institute Computation and Analysis of Significant Themes

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Automatic keyword extraction based on character co-occurrence frequency; Du Yuncheng et al.; Journal of Beijing Information Science and Technology University; 2011-12-31; Vol. 26, No. 6; pp. 1-3 *

Also Published As

Publication number Publication date
CN108920456A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
CN108920456B (en) Automatic keyword extraction method
CN107229610B (en) A kind of analysis method and device of affection data
US9317498B2 (en) Systems and methods for generating summaries of documents
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
US9626358B2 (en) Creating ontologies by analyzing natural language texts
Beeferman et al. Statistical models for text segmentation
US9892111B2 (en) Method and device to estimate similarity between documents having multiple segments
US20170293687A1 (en) Evaluating text classifier parameters based on semantic features
US8645418B2 (en) Method and apparatus for word quality mining and evaluating
EP3086239A1 (en) Scenario generation device and computer program therefor
US20170293607A1 (en) Natural language text classification based on semantic features
CN107463548B (en) Phrase mining method and device
US9235573B2 (en) Universal difference measure
EP3086237A1 (en) Phrase pair gathering device and computer program therefor
CN114065758B (en) Document keyword extraction method based on hypergraph random walk
EP3086240A1 (en) Complex predicate template gathering device, and computer program therefor
CN113988053A (en) Hot word extraction method and device
CN111310467B (en) Topic extraction method and system combining semantic inference in long text
Kotenko et al. Evaluation of text classification techniques for inappropriate web content blocking
KR102376489B1 (en) Text document cluster and topic generation apparatus and method thereof
CN111930949B (en) Search string processing method and device, computer readable medium and electronic equipment
CN111899832B (en) Medical theme management system and method based on context semantic analysis
CN115455975A (en) Method and device for extracting topic keywords based on multi-model fusion decision
Mendels et al. Collecting code-switched data from social media
CN114548113A (en) Event-based reference resolution system, method, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant